The following is a list of publications that have resulted from REU projects (REU participants are marked with *):
Cabrera, Anthony M., Clayton J. Faber, Kyle Cepeda*, Robert Derber, Cooper Epstein, Jason Zheng, Ron K. Cytron, and Roger D. Chamberlain. (2018). DIBS: A Data Integration Benchmark Suite. Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering (ICPE 2018).
Abstract: As the generation of data becomes more prolific, the amount of time and resources necessary to perform analyses on these data increases. What is less understood, however, is the data preprocessing steps that must be applied before any meaningful analysis can begin. This problem of taking data in some initial form and transforming it into a desired one is known as data integration. Here, we introduce the Data Integration Benchmarking Suite (DIBS), a suite of applications that are representative of data integration workloads across many disciplines. We apply a comprehensive characterization to these applications to better understand the general behavior of data integration tasks. As a result of our benchmark suite and characterization methods, we offer insight regarding data integration tasks that will guide other researchers designing solutions in this area.
Jiang, Shali, Gustavo Malkomes, Geoff Converse*, Alyssa Shofner*, Benjamin Moseley, and Roman Garnett. (2017). Efficient Nonmyopic Active Search. Proceedings of the 34th International Conference on Machine Learning (ICML 2017).
Abstract: Active search is an active learning setting with the goal of identifying as many members of a given class as possible under a labeling budget. In this work, we first establish a theoretical hardness of active search, proving that no polynomial-time policy can achieve a constant factor approximation ratio with respect to the expected utility of the optimal policy. We also propose a novel, computationally efficient active search policy achieving exceptional performance on several real-world tasks. Our policy is nonmyopic, always considering the entire remaining search budget. It also automatically and dynamically balances exploration and exploitation consistent with the remaining budget, without relying on a parameter to control this tradeoff. We conduct experiments on diverse datasets from several domains: drug discovery, materials science, and a citation network. Our efficient nonmyopic policy recovers significantly more valuable points with the same budget than several alternatives from the literature, including myopic approximations to the optimal policy.
Jiang, Shali, Gustavo Malkomes, Matthew Abbott*, Benjamin Moseley, and Roman Garnett. (2018). Efficient Nonmyopic Batch Active Search. Advances in Neural Information Processing Systems 31 (NeurIPS 2018).
Abstract: Active search is a learning paradigm for actively identifying as many members of a given class as possible. A critical target scenario is high-throughput screening for scientific discovery, such as drug or materials discovery. In these settings, specialized instruments can often evaluate multiple points simultaneously; however, all existing work on active search focuses on sequential acquisition. We bridge this gap, addressing batch active search from both the theoretical and practical perspective. We first derive the Bayesian optimal policy for this problem, then prove a lower bound on the performance gap between sequential and batch optimal policies: the cost of parallelization. We also propose novel, efficient batch policies inspired by state-of-the-art sequential policies, and develop an aggressive pruning technique that can dramatically speed up computation. We conduct thorough experiments on data from three application domains: a citation network, material science, and drug discovery, testing all proposed policies (14 total) with a wide range of batch sizes. Our results demonstrate that the empirical performance gap matches our theoretical bound, that nonmyopic policies usually significantly outperform myopic alternatives, and that diversity is an important consideration for batch policy design.
Qi, Di*, Joshua Arfin*, Mengxue Zhang, Tushar Mathew, Robert Pless, and Brendan Juba. (2018). Anomaly Explanation Using Metadata. Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV 2018).
Abstract: Anomaly detection is the well-studied task of identifying when data is atypical in some way with respect to its source. In this work, by contrast, we are interested in finding possible descriptions of what may be causing anomalies. We propose a new task, attaching semantics drawn from metadata to a portion of the anomalous examples from some data source. Such a partial description of the anomalous data in terms of the metadata is useful both because it may help to explain what causes the identified anomalies, and also because it may help to identify the truly unusual examples that defy such simple categorization. This is especially significant when the data set is too large for a human analyst to inspect the anomalies manually. The challenge is that anomalies are, by definition, relatively rare, and so we are seeking to learn a precise characterization of a rare event. We examine algorithms for this task in a webcam domain, generating human-understandable explanations for a pixel-level characterization of anomalies. We find that using a recently proposed algorithm that prioritizes precision over recall, it is possible to attach good descriptions to a moderate fraction of the anomalies in webcam data so long as the data set is fairly large.
Kube, Amanda*, Sanmay Das, and Patrick J. Fowler. (2019). Allocating Interventions Based on Predicted Outcomes: A Case Study on Homelessness Services. Proceedings of the 2019 AAAI Conference on Artificial Intelligence (AAAI 2019).
Abstract: Modern statistical and machine learning methods are increasingly capable of modeling individual or personalized treatment effects. These predictions could be used to allocate different interventions across populations based on individual characteristics. In many domains, like social services, the availability of different possible interventions can be severely resource limited. This paper considers possible improvements to the allocation of such services in the context of homelessness service provision in a major metropolitan area. Using data from the homeless system, we use a counterfactual approach to show potential for substantial benefits in terms of reducing the number of families who experience repeat episodes of homelessness by choosing optimal allocations (based on predicted outcomes) to a fixed number of beds in different types of homelessness service facilities. Such changes in the allocation mechanism would not be without tradeoffs, however; a significant fraction of households are predicted to have a higher probability of re-entry in the optimal allocation than in the original one. We discuss the efficiency, equity and fairness issues that arise and consider potential implications for policy.
Li, Zhuoshu, Kelsey Lieberman*, William Macke*, Sofia Carrillo, Chien-Ju Ho, Jason Wellen, and Sanmay Das. (2019). Incorporating Compatible Pairs in Kidney Exchange: A Dynamic Weighted Matching Model. Proceedings of the 2019 ACM Conference on Economics and Computation (EC 2019).
Abstract: Kidney exchange has been studied extensively from the perspective of market design, and a significant focus has been on better algorithms for finding chains and cycles to increase the number of possible matches. A more dramatic benefit could come from incorporating compatible pairs into the mechanism, but this possibility has been relatively understudied. In order to incentivize a compatible pair to participate in exchange, they must be offered a higher quality match for the recipient that can be performed without adding extra waiting time. In this paper, we make two main contributions to the study of incorporating compatible pairs in exchanges. First, we leverage the recently proposed Living Kidney Donor Profile Index (LKDPI) to measure match quality, and develop a novel simulator (based on data from a major transplant center) for the joint distribution of compatibility and quality across pairs. This simulator allows us to study the benefits of including compatible pairs under different models and assumptions. Second, we introduce a hybrid online/batch matching model with impatient (compatible) and patient (incompatible) pairs to capture the need for immediacy. We introduce new algorithms for matching in this model, including one based on online primal-dual techniques. Overall, our results indicate great potential in terms of both increased numbers of transplants of incompatible pairs (almost doubling the number transplanted) as well as improved match quality for recipients in compatible pairs (increasing expected graft survival by between 1 and 2 years). The results are also promising for hard-to-match subpopulations, including blood group O recipients.
Wheelock, Jacob*, William Kanu, Marion Sudvarg, Zhili Xiao, Jeremy D. Buhler, Roger D. Chamberlain, and James H. Buckley. (2021). Supporting Multi-messenger Astrophysics with Fast Gamma-ray Burst Localization. Proceedings of the Third International Workshop on HPC for Urgent Decision Making (UrgentHPC 2021).
Abstract: Multi-messenger astrophysics is amongst the most promising approaches to astronomical observations. A significant challenge, however, is the fact that many instruments have a narrow field of view, so transient events are often missed by these instruments. The Advanced Particle-astrophysics Telescope, currently under development, promises to provide low-latency detection and localization for an important class of astronomical events, thereby enabling the full observational capabilities of narrow field-of-view instruments to be brought to bear. We examine the computational pipeline for detection and localization of Compton events utilizing computational accelerators, both FPGAs and GPUs.
Sudvarg, Marion, Jeremy Buhler, James Buckley, Wenlei Chen, Zachary Hughes, Emily Ramey, Michael Cherry, Samer Alnussirat, Ryan Larm, and Christofer Berruz Chungata*, on behalf of the ADAPT collaboration. (2021). A Fast GRB Source Localization Pipeline for the Advanced Particle-astrophysics Telescope. Proceedings of the 37th International Cosmic Ray Conference (ICRC 2021).
Abstract: We present a pipeline for fast GRB source localization for the Advanced Particle-astrophysics Telescope. APT records multiple Compton scatterings of incoming photons across 20 CsI detector layers, from which we infer the incident angle of each photon’s first scattering to localize its source direction to a circle centered on the vector formed by its first two scatterings. Circles from multiple photons are then intersected to identify their common source direction. Our pipeline, which runs in real time on low-power hardware, uses an efficient tree search to determine the most likely ordering of scatterings for each photon (which cannot be measured due to the coarse time-scale of detection), followed by likelihood-weighted averaging and iterative least-squares refinement to combine all circles into an estimated source direction. Uncertainties in the scattering locations and energy deposits require that our pipeline be robust to high levels of noise.
To test our methods, we reconstructed GRB events produced by a Geant4 simulation of APT’s detectors paired with a second simulator that models measurement noise induced by the detector hardware. Our methods proved robust against noise and the effects of pair production, producing sub-degree localization for GRBs with fluence 0.3 MeV/cm^2. GRBs with fluence 0.03 MeV/cm^2 provided fewer photons for analysis but could still be localized within 2.5 degrees 68% of the time. Localization time for a 1-second 1.0 MeV/cm^2 GRB, measured on a quad-core, 1.4 GHz ARMv8 processor (Raspberry Pi 3B+), was consistently under 0.2 seconds — fast enough to permit real-time redirection of other instruments for follow-up observations.
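Illustrative sketch (not the authors' pipeline): the abstract above describes constraining each photon's source direction to a circle around the axis formed by its first two scatterings, then combining circles from many photons. The Python below shows that geometric idea only, under simplifying assumptions: the scattering order and energies are taken as known and noise-free, and a single linear least-squares solve stands in for the likelihood-weighted averaging and iterative refinement that the paper uses. All function names are hypothetical.

```python
import math

ELECTRON_REST_ENERGY_MEV = 0.511  # approximate electron rest energy

def compton_cos_angle(e_incident, e_scattered):
    """Cosine of the Compton scattering angle from photon energies in MeV."""
    return 1.0 - ELECTRON_REST_ENERGY_MEV * (1.0 / e_scattered - 1.0 / e_incident)

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def circle_axis(first_hit, second_hit):
    """Unit vector from the second scattering back through the first one.

    With this sign convention, a source direction s (pointing from the
    detector toward the source) satisfies dot(axis, s) == cos(theta).
    """
    return unit(tuple(p - q for p, q in zip(first_hit, second_hit)))

def solve3(m, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (a[r][n] - tail) / a[r][r]
    return x

def localize(axes, cosines):
    """Least-squares direction s minimizing sum((dot(a_i, s) - cos_i)**2).

    Assumption: an unweighted linear solve followed by normalization,
    used here as a simplified stand-in for the refinement in the paper.
    Needs at least three linearly independent axes.
    """
    ata = [[sum(a[i] * a[j] for a in axes) for j in range(3)] for i in range(3)]
    atb = [sum(a[i] * c for a, c in zip(axes, cosines)) for i in range(3)]
    return unit(solve3(ata, atb))
```

For example, calling `localize` with several axes and the cosines each would observe for a common source returns that source direction. With noisy inputs the result is only a starting estimate.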