My statistical research is science-first and, thus, eclectic. I study, create, and apply a diverse array of statistical methods guided by the scientific problems I am working on, often in environmental health, global health, and public health policy. Across these areas, I develop statistical methods when existing tools are inadequate to address the scientific question, with an emphasis on methods that are both mathematically principled and practically useful. The different strands of research are unified by a common goal: to make statistical inference more reliable, scalable, and scientifically interpretable in complex public health applications. Details on some of my research topics are provided below.
Machine learning methods have been widely adopted for environmental and climate analysis. However, environmental and climate data are characterized by dependence across space and time, and off-the-shelf use of machine learning on these datasets ignores this correlation. I have studied the performance of machine learning for geospatial data and shown that ignoring spatial correlation can lead to biased and inefficient inference. We have developed new spatial machine learning methods that retain the inferential strengths of spatial statistics. These include methods that combine neural networks, random forests, graph neural networks, and Gaussian processes so that flexible prediction tools can account for spatial dependence rather than treating observations as independent.
Neural networks for geospatial data. JASA
Random forests for spatially dependent data. JASA
I have been working on understanding and formalizing when population-level exposure effects can be identified or consistently estimated in environmental epidemiology studies in the presence of unmeasured, spatially structured confounders. This strand of work clarifies the limits of common adjustment methods and develops new theory for causal inference with spatially dependent exposures, outcomes, and confounders.
When is it possible to account for spatial confounding between Gaussian random fields? Ann. Stat.
Consistency of GLS under spatial confounding. Biometrika
Nonparametric identification under spatial confounding. arXiv
Did COVID-19 lockdowns impact air quality? Am. J. Epi.
My global health research focuses on making valid inference from imperfect cause-of-death prediction systems. Working on child mortality estimation using verbal-autopsy-based cause-of-death data, I have developed general-purpose methods that correct for classifier misclassification bias under label shift and combine large amounts of noisy predicted cause-of-death data with smaller amounts of high-quality validation data. This work has contributed to United Nations child mortality estimation and related BMJ work on global cause-specific mortality.
Model based correction of verbal autopsy misclassification bias. Biostatistics
Bias correction of probabilistic classifiers and theory of classifier bias correction under label shift. JASA
Modeling heterogeneity in verbal autopsy misclassification matrices. Ann. Appl. Stat.
Country-specific estimates of verbal autopsy misclassification rates. BMJ Global Health
Bias-corrected child and neonatal mortality estimates for Mozambique. Am. J. Trop. Med.
United Nations 2025 Child Mortality Estimates. Paper to appear in BMJ.
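As a minimal illustration of why misclassification correction matters under label shift, the sketch below uses a simple moment-based inversion, not the Bayesian models developed in the papers above. It assumes a known misclassification matrix (in practice estimated from validation data) whose rows give the classifier's predicted-cause distribution for each true cause. All names and numbers are hypothetical.

```python
import numpy as np

def correct_prevalence(pred_prev, misclass):
    """Recover true class prevalences from predicted-class prevalences.

    misclass[i, j] = P(predicted cause j | true cause i), so the expected
    predicted prevalence is q = misclass.T @ p. We solve for p by least
    squares, then clip and renormalize to return a valid probability vector.
    """
    p, *_ = np.linalg.lstsq(misclass.T, pred_prev, rcond=None)
    p = np.clip(p, 0.0, None)
    return p / p.sum()

# Hypothetical target-population prevalences of three causes of death.
true_prev = np.array([0.5, 0.3, 0.2])

# Hypothetical misclassification matrix (each row sums to one).
misclass = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
])

# Prevalences obtained by naively tabulating the classifier's predictions.
naive_prev = misclass.T @ true_prev

corrected_prev = correct_prevalence(naive_prev, misclass)
print("naive:    ", naive_prev)
print("corrected:", corrected_prev)
```

In this noiseless toy setting the naive tabulation differs from the true prevalences while the corrected estimate recovers them. With finite samples and an estimated misclassification matrix, a direct inversion of this kind can be unstable, which is one motivation for model-based approaches.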
I have recently started using variational inference methods for Bayesian models where standard MCMC methods are computationally impractical or where modular uncertainty propagation is required. A recent focus is using universal distributional approximants such as normalizing flows in variational inference for cut Bayes, which propagates upstream uncertainty into downstream analyses without requiring access to the upstream data or allowing feedback from the downstream model to distort the upstream inference. This work has also quantified the approximation error of variational normalizing flow families.
Method and theory of variational normalizing flows for approximating cut Bayes posteriors. arXiv
Fast variational Bayes for large spatial data. JCGS, to appear. arXiv
In environmental source apportionment, the scientific goal is to identify and quantify the sources of pollution in an area from multi-pollutant concentration measurements collected over time. This is a classical unsupervised nonnegative matrix factorization problem where individual factors are generally not identifiable. I have recently started leveraging geometric ideas for this problem by viewing the observed data cloud as lying in a conical hull generated by latent source profiles. This perspective makes it possible to identify scientifically meaningful source attribution targets, even when the full factorization itself is not unique. The resulting methods estimate source attribution percentages that are invariant to scale ambiguities and grounded in population-level quantities. These methods have been used in work on air pollution and coal dust exposure in Curtis Bay, Baltimore.
Identification in source apportionment using geometry. arXiv
Geometric source apportionment of air pollution burden in Curtis Bay. arXiv
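The scale ambiguity in nonnegative matrix factorization, and the invariance of percentage-type targets to it, can be seen in a small numerical sketch. The definition of attribution percentage used here (each source's share of the total measured mass) is an illustrative assumption, not necessarily the estimand defined in the papers above, and the code does not implement the geometric estimator. The names `W`, `H`, and `D` are hypothetical.

```python
import numpy as np

def attribution_percentages(W, H):
    """Percent of total measured mass attributed to each source.

    W: (n_times, K) nonnegative source contributions over time.
    H: (K, n_pollutants) nonnegative source profiles.
    The sum of all entries of the rank-one term W[:, k] H[k, :] equals
    W[:, k].sum() * H[k, :].sum().
    """
    mass = W.sum(axis=0) * H.sum(axis=1)
    return 100.0 * mass / mass.sum()

rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, size=(50, 3))
H = rng.uniform(0.1, 1.0, size=(3, 6))

# Rescale each source by an arbitrary positive constant: the data matrix
# W @ H is unchanged, so the individual factors are not identifiable.
D = np.diag([0.5, 2.0, 10.0])
W_scaled = W @ D
H_scaled = np.linalg.inv(D) @ H

pct = attribution_percentages(W, H)
pct_scaled = attribution_percentages(W_scaled, H_scaled)
print(pct)
print(pct_scaled)
```

Both factorizations reproduce the same data and give the same percentages, even though the factors themselves differ. This sketch addresses only rescaling; other sources of non-uniqueness in the factorization are not covered by it.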
I develop statistical methods for analyzing air pollution data from low-cost sensor networks, where measurements are spatially dense and temporally resolved but often noisy, biased, and heterogeneous across sensor platforms. This work focuses on calibration, data fusion, spatial mapping, and detection of pollution spikes by combining low-cost sensor data with regulatory monitor data and spatial statistical models. These methods are motivated by air quality monitoring in Baltimore and are designed to produce reliable, neighborhood-scale pollution estimates that can support environmental health research, community engagement, and public-facing data products.
A Gaussian process filter for mitigating underestimation bias in air sensor calibration. Ann. Appl. Stat.
Fusion and calibration of multiple air sensor networks using GP filter. Environmetrics
The Baltimore Air Quality Dashboard
I also work on spatial graphical models for areal, lattice, and multivariate spatial data. These methods are designed for spatial dependence structures arising in disease mapping, air pollution modeling, and public health surveillance, where interpretability, computational scalability, and valid uncertainty quantification are all essential.
Directed acyclic graph autoregression. Bayes An.
Visibility-graph-based covariance functions for nonconvex spatial domains. Biometrics
Graphical models with Gaussian processes as nodes. Biometrika
I have developed and applied statistical methods for infectious disease applications where public health decisions require combining incomplete surveillance data, spatial information, and multiple imperfect data sources. This work includes estimation of disease incidence and key population sizes relevant to HIV prevention, where direct enumeration is difficult because of stigma, criminalization, incomplete data, and geographic misalignment.
Pan Africa cholera incidence mapping. Nature Medicine
Incomplete and misaligned capture-recapture. Ann. Appl. Stat.
In my early research, I developed the nearest neighbor Gaussian process (NNGP) framework for scalable Gaussian process modeling. NNGP reduces the computational burden of traditional Gaussian process inference while preserving much of its modeling flexibility, and it has been widely used in environmental health, ecology, climate science, neuroimaging, forestry, and other fields.
Nearest neighbor Gaussian process. JASA
NNGP for spatio-temporal data. Ann. Appl. Stat.
Efficient algorithms for NNGP. JCGS
Multivariate NNGP. Stat. Sinica
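The computational idea behind nearest-neighbor approximations can be sketched in a few lines: the joint Gaussian likelihood is written as a product of conditional densities, and each conditioning set is restricted to at most m nearest neighbors among previously ordered locations. The code below is a simplified illustration applied to the noisy response, with an exponential covariance, a brute-force neighbor search, and hypothetical parameter values. It is not the implementation from the papers above.

```python
import numpy as np

def exp_cov(a, b, sigma2=1.0, phi=3.0):
    """Exponential covariance between two sets of 2-D locations."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

def full_gp_loglik(y, coords, tau2=0.1):
    """Exact zero-mean GP log-likelihood; requires an n x n factorization."""
    n = len(y)
    K = exp_cov(coords, coords) + tau2 * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(K, y))

def nn_loglik(y, coords, m, tau2=0.1):
    """Nearest-neighbor approximation: location i conditions only on its
    (at most) m nearest neighbors among locations 0, ..., i-1."""
    ll = 0.0
    for i in range(len(y)):
        var = exp_cov(coords[i:i + 1], coords[i:i + 1])[0, 0] + tau2
        mean = 0.0
        if i > 0:
            dist = np.linalg.norm(coords[:i] - coords[i], axis=1)
            nb = np.argsort(dist)[:m]  # brute-force neighbor search
            C_nn = exp_cov(coords[nb], coords[nb]) + tau2 * np.eye(len(nb))
            c_in = exp_cov(coords[i:i + 1], coords[nb])[0]
            w = np.linalg.solve(C_nn, c_in)  # kriging weights (small system)
            mean = w @ y[nb]
            var = var - w @ c_in
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return ll

rng = np.random.default_rng(1)
n = 60
coords = rng.uniform(size=(n, 2))
K = exp_cov(coords, coords) + 0.1 * np.eye(n)
y = np.linalg.cholesky(K) @ rng.standard_normal(n)

full = full_gp_loglik(y, coords)
approx = nn_loglik(y, coords, m=10)
exact_check = nn_loglik(y, coords, m=n - 1)
print(full, approx, exact_check)
```

When m equals n - 1 every location conditions on all earlier ones, so the product of conditionals reproduces the full likelihood exactly. For small m each step solves only an m by m system, so the linear algebra cost grows linearly in n rather than cubically. The result depends on the ordering of the locations, which this sketch takes as given.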
I have also worked on a variety of other statistical problems, often motivated by applications where standard modeling assumptions are violated. A selection is listed below.
Gaussian process regression for distribution-valued inputs. EJS
Linear regression for compositional data. Biometrics
High-dimensional error-in-variables regression. Ann. Stat.