-
Moving Towards Automated Interstellar Boundary Explorer Data Selection with LOTUS
Authors:
Madeline A. Stricklin,
Lauren J. Beesley,
Brian P. Weaver,
Kelly R. Moran,
Dave Osthus,
Paul H. Janzen,
Grant David Meadors,
Daniel B. Reisenfeld
Abstract:
The Interstellar Boundary Explorer (IBEX) satellite collects data on energetic neutral atoms (ENAs) that provide insight into the heliosphere, the region surrounding our solar system and separating it from interstellar space. IBEX collects information on these particles and on extraneous ``background'' particles. While IBEX records how and when the different particles are observed, it does not distinguish between heliospheric ENA particles and incidental background particles. To address this issue, all IBEX data has historically been manually labeled as ``good'' ENA data, or ``bad'' background data. This manual culling process is incredibly time-intensive and contingent on subjective, manually-induced decision thresholds. In this paper, we develop a three-stage automated culling process, called LOTUS, that uses random forests to expedite and standardize the labelling process. In Stage 1, LOTUS uses random forests to obtain probabilities of observing true ENA particles on a per-observation basis. In Stage 2, LOTUS aggregates these probabilities to obtain predictions within small windows of time. In Stage 3, LOTUS refines these predictions. We compare the labels generated by LOTUS to those manually generated by the subject matter expert. We use various metrics to demonstrate that LOTUS is a useful automated process for supplementing and standardizing the manual culling process.
Submitted 21 March, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
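
As an illustration of the Stage 2 idea in the LOTUS abstract above (aggregating per-observation probabilities into predictions for small time windows), here is a minimal sketch. The abstract does not say how the aggregation is done, so the windowed mean, the 0.5 threshold, and the function name `window_labels` are all assumptions made for this example, not the authors' procedure.

```python
def window_labels(times, probs, window, threshold=0.5):
    """Label each time window "good" or "bad" from per-observation probabilities.

    Assumption: a window is "good" when the mean probability of its
    observations reaches `threshold`. This rule is hypothetical.
    """
    sums = {}
    counts = {}
    for t, p in zip(times, probs):
        key = int(t // window)  # index of the window containing time t
        sums[key] = sums.get(key, 0.0) + p
        counts[key] = counts.get(key, 0) + 1
    return {
        key: ("good" if sums[key] / counts[key] >= threshold else "bad")
        for key in sorted(sums)
    }


# Hypothetical observation times and Stage 1 probabilities.
labels = window_labels(
    times=[0, 1, 2, 10, 11],
    probs=[0.9, 0.8, 0.7, 0.2, 0.1],
    window=5,
)
print(labels)
```

Windows that contain no observations are simply absent from the result; a real pipeline would need to decide how to treat them.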
-
Sensitivity Analysis in the Presence of Intrinsic Stochasticity for Discrete Fracture Network Simulations
Authors:
Alexander C. Murph,
Justin D. Strait,
Kelly R. Moran,
Jeffrey D. Hyman,
Hari S. Viswanathan,
Philip H. Stauffer
Abstract:
Large-scale discrete fracture network (DFN) simulators are standard fare for studies involving the sub-surface transport of particles since direct observation of real world underground fracture networks is generally infeasible. While these simulators have seen numerous successes over several engineering applications, estimations on quantities of interest (QoI) - such as breakthrough time of particles reaching the edge of the system - suffer from two distinct types of uncertainty. A run of a DFN simulator requires several parameter values to be set that dictate the placement and size of fractures, the density of fractures, and the overall permeability of the system; uncertainty on the proper parameter choices will lead to some amount of uncertainty in the QoI, called epistemic uncertainty. Furthermore, since DFN simulators rely on stochastic processes to place fractures and govern flow, understanding how this randomness affects the QoI requires several runs of the simulator at distinct random seeds. The uncertainty in the QoI attributed to different realizations (i.e. different seeds) of the same random process leads to a second type of uncertainty, called aleatoric uncertainty. In this paper, we perform a Sensitivity Analysis, which directly attributes the uncertainty observed in the QoI to the epistemic uncertainty from each input parameter and to the aleatoric uncertainty. We make several design choices to handle an observed heteroskedasticity in DFN simulators, where the aleatoric uncertainty changes for different inputs, since this quality makes several standard statistical methods inadmissible. Beyond the specific takeaways on which input variables affect uncertainty the most for DFN simulators, a major contribution of this paper is the introduction of a statistically rigorous workflow for characterizing the uncertainty in DFN flow simulations that exhibit heteroskedasticity.
Submitted 4 January, 2024; v1 submitted 7 December, 2023;
originally announced December 2023.
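
The split between epistemic and aleatoric uncertainty described in the abstract above can be illustrated with the law of total variance, Var(Y) = Var(E[Y|X]) + E[Var(Y|X)]. The sketch below is a generic textbook decomposition and not the paper's workflow; the function name `decompose_variance` and the toy data are assumptions. It assumes the same number of seeds per input setting, so that the two parts sum exactly to the total population variance.

```python
from statistics import mean, pvariance


def decompose_variance(runs):
    """Split QoI variance into between-input and within-input parts.

    `runs` maps each input setting to the QoI values obtained at
    different random seeds (equal counts per setting are assumed).
    """
    setting_means = [mean(values) for values in runs.values()]
    setting_vars = [pvariance(values) for values in runs.values()]
    epistemic = pvariance(setting_means)  # Var(E[Y | X])
    aleatoric = mean(setting_vars)        # E[Var(Y | X)]
    return epistemic, aleatoric


# Hypothetical QoI values: two input settings, two seeds each.
runs = {"setting_a": [1.0, 3.0], "setting_b": [5.0, 9.0]}
epistemic, aleatoric = decompose_variance(runs)
print(epistemic, aleatoric)
```

In this toy data the within-setting variance differs between the two settings (1.0 versus 4.0), which is the kind of heteroskedasticity the abstract refers to.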
-
Empirical Validation of a New Data Product from the Interstellar Boundary Explorer Satellite
Authors:
Kelly R. Moran,
Dave Osthus,
Brian P. Weaver,
Lauren J. Beesley,
Madeline A. Stricklin,
Paul H. Janzen,
Daniel B. Reisenfeld
Abstract:
Since 2008, the Interstellar Boundary Explorer (IBEX) satellite has been gathering data on heliospheric energetic neutral atoms (ENAs) while being exposed to various sources of background noise, such as cosmic rays and solar energetic particles. The IBEX mission initially released only a qualified triple-coincidence (qABC) data product, which was designed to provide observations of ENAs free of background contamination. Further measurements revealed that the qABC data was in fact susceptible to contamination, having relatively low ENA counts and high background rates. Recently, the mission team considered releasing a certain qualified double-coincidence (qBC) data product, which has roughly twice the detection rate of the qABC data product. This paper presents a simulation-based validation of the new qBC data product against the already-released qABC data product. The results show that the qBCs can plausibly be said to share the same signal rate as the qABCs up to an average absolute deviation of 3.6%. Visual diagnostics at an orbit, map, and full mission level provide additional confirmation of signal rate coherence across data products. These approaches are generalizable to other scenarios in which one wishes to test whether multiple observations could plausibly be generated by some underlying shared signal.
Submitted 28 November, 2023;
originally announced November 2023.
-
Cosmic-Enu: An emulator for the non-linear neutrino power spectrum
Authors:
Amol Upadhye,
Juliana Kwan,
Ian G. McCarthy,
Jaime Salcido,
Kelly R. Moran,
Earl Lawrence,
Yvonne Y. Y. Wong
Abstract:
Cosmology is poised to measure the neutrino mass sum $M_ν$ and has identified several smaller-scale observables sensitive to neutrinos, necessitating accurate predictions of neutrino clustering over a wide range of length scales. The FlowsForTheMasses non-linear perturbation theory for the massive neutrino power spectrum, $Δ^2_ν(k)$, agrees with its companion N-body simulation at the $10\%-15\%$ level for $k \leq 1~h/$Mpc. Building upon the Mira-Titan IV emulator for the cold matter, we use FlowsForTheMasses to construct an emulator for $Δ^2_ν(k)$ covering a large range of cosmological parameters and neutrino fractions $Ω_{ν,0} h^2 \leq 0.01$, which corresponds to $M_ν\leq 0.93$~eV. Consistent with FlowsForTheMasses at the $3.5\%$ level, it returns a power spectrum in milliseconds. Ranking the neutrinos by initial momenta, we also emulate the power spectra of momentum deciles, providing information about their perturbed distribution function. Comparing a $M_ν=0.15$~eV model to a wide range of N-body simulation methods, we find agreement to $3\%$ for $k \leq 3 k_\mathrm{FS} = 0.17~h/$Mpc and to $19\%$ for $k \leq 0.4~h/$Mpc. We find that the enhancement factor, the ratio of $Δ^2_ν(k)$ to its linear-response equivalent, is most strongly correlated with $Ω_{ν,0} h^2$, and also with the clustering amplitude $σ_8$. Furthermore, non-linearities enhance the free-streaming-limit scaling $\partial \log(Δ^2_ν/ Δ^2_{\rm m}) / \partial \log(M_ν)$ beyond its linear value of 4, increasing the $M_ν$-sensitivity of the small-scale neutrino density.
Submitted 19 November, 2023;
originally announced November 2023.
-
Statistical methods for partitioning ribbon and globally-distributed flux using data from the Interstellar Boundary Explorer
Authors:
Lauren J. Beesley,
Dave Osthus,
Kelly R. Moran,
Madeline A. Ausdemore,
Grant David Meadors,
Paul H. Janzen,
Eric J. Zirnstein,
Brian P. Weaver,
Daniel B. Reisenfeld
Abstract:
NASA's Interstellar Boundary Explorer (IBEX) satellite collects data on energetic neutral atoms (ENAs) that can provide insight into the heliosphere boundary between our solar system and interstellar space. Using these data, scientists can construct maps of the ENA intensities (often, expressed in terms of flux) observed in all directions. The ENA flux observed in these maps is believed to come from at least two distinct sources: one source which manifests as a ribbon of concentrated ENA flux and one source (or possibly several) that manifest as smoothly-varying globally-distributed flux. Each ENA source type and its corresponding ENA intensity map is of separate scientific interest. In this paper, we develop statistical methods for separating the total ENA intensity maps into two source-specific maps (ribbon and globally-distributed flux) and estimating corresponding uncertainty. Key advantages of the proposed method include enhanced model flexibility and improved propagation of estimation uncertainty. We evaluate the proposed methods on simulated data designed to mimic realistic data settings. We also propose new methods for estimating the center of the near-elliptical ribbon in the sky, which can be used in the future to study the location and variation of the local interstellar magnetic field.
Submitted 6 February, 2023;
originally announced February 2023.
-
Towards Improved Heliosphere Sky Map Estimation with Theseus
Authors:
Dave Osthus,
Brian P. Weaver,
Lauren J. Beesley,
Kelly R. Moran,
Madeline A. Ausdemore,
Eric J. Zirnstein,
Paul H. Janzen,
Daniel B. Reisenfeld
Abstract:
The Interstellar Boundary Explorer (IBEX) satellite has been in orbit since 2008 and detects energy-resolved energetic neutral atoms (ENAs) originating from the heliosphere. Different regions of the heliosphere generate ENAs at different rates. It is of scientific interest to take the data collected by IBEX and estimate spatial maps of heliospheric ENA rates (referred to as sky maps) at higher resolutions than before. These sky maps will subsequently be used to discern between competing theories of heliosphere properties that are not currently possible. The data IBEX collects present challenges to sky map estimation. The two primary challenges are noisy and irregularly spaced data collection and the IBEX instrumentation's point spread function. In essence, the data collected by IBEX are both noisy and biased for the underlying sky map of inferential interest. In this paper, we present a two-stage sky map estimation procedure called Theseus. In Stage 1, Theseus estimates a blurred sky map from the noisy and irregularly spaced data using an ensemble approach that leverages projection pursuit regression and generalized additive models. In Stage 2, Theseus deblurs the sky map by deconvolving the PSF with the blurred map using regularization. Unblurred sky map uncertainties are computed via bootstrapping. We compare Theseus to a method closely related to the one operationally used today by the IBEX Science Operation Center (ISOC) on both simulated and real data. Theseus outperforms ISOC in nearly every considered metric on simulated data, indicating that Theseus is an improvement over the current state of the art.
Submitted 20 October, 2022;
originally announced October 2022.
-
The Mira-Titan Universe IV. High Precision Power Spectrum Emulation
Authors:
Kelly R. Moran,
Katrin Heitmann,
Earl Lawrence,
Salman Habib,
Derek Bingham,
Amol Upadhye,
Juliana Kwan,
David Higdon,
Richard Payne
Abstract:
Modern cosmological surveys are delivering datasets characterized by unprecedented quality and statistical completeness; this trend is expected to continue into the future as new ground- and space-based surveys come online. In order to maximally extract cosmological information from these observations, matching theoretical predictions are needed. At low redshifts, the surveys probe the nonlinear regime of structure formation where cosmological simulations are the primary means of obtaining the required information. The computational cost of sufficiently resolved large-volume simulations makes it prohibitive to run very large ensembles. Nevertheless, precision emulators built on a tractable number of high-quality simulations can be used to build very fast prediction schemes to enable a variety of cosmological inference studies. We have recently introduced the Mira-Titan Universe simulation suite designed to construct emulators for a range of cosmological probes. The suite covers the standard six cosmological parameters $\{ω_m,ω_b, σ_8, h, n_s, w_0\}$ and, in addition, includes massive neutrinos and a dynamical dark energy equation of state, $\{ω_ν, w_a\}$. In this paper we present the final emulator for the matter power spectrum based on 111 cosmological simulations, each covering a (2.1Gpc)$^3$ volume and evolving 3200$^3$ particles. An additional set of 1776 lower-resolution simulations and TimeRG perturbation theory results for the power spectrum are used to cover scales straddling the linear to mildly nonlinear regimes. The emulator provides predictions at the two to three percent level of accuracy over a wide range of cosmological parameters and is publicly released as part of this paper.
Submitted 25 July, 2022;
originally announced July 2022.
-
Fast increased fidelity approximate Gibbs samplers for Bayesian Gaussian process regression
Authors:
Kelly R. Moran,
Matthew W. Wheeler
Abstract:
The use of Gaussian processes (GPs) is supported by efficient sampling algorithms, a rich methodological literature, and strong theoretical grounding. However, due to their prohibitive computation and storage demands, the use of exact GPs in Bayesian models is limited to problems containing at most several thousand observations. Sampling requires matrix operations that scale at $\mathcal{O}(n^3),$ where $n$ is the number of unique inputs. Storage of individual matrices scales at $\mathcal{O}(n^2),$ and can quickly overwhelm the resources of most modern computers. To overcome these bottlenecks, we develop a sampling algorithm using $\mathcal{H}$ matrix approximation of the matrices comprising the GP posterior covariance. These matrices can approximate the true conditional covariance matrix within machine precision and allow for sampling algorithms that scale at $\mathcal{O}(n \ \mbox{log}^2 n)$ time and storage demands scaling at $\mathcal{O}(n \ \mbox{log} \ n).$ We also describe how these algorithms can be used as building blocks to model higher dimensional surfaces at $\mathcal{O}(d \ n \ \mbox{log}^2 n)$, where $d$ is the dimension of the surface under consideration, using tensor products of one-dimensional GPs. Though various scalable processes have been proposed for approximating Bayesian GP inference when $n$ is large, to our knowledge, none of these methods show that the approximation's Kullback-Leibler divergence to the true posterior can be made arbitrarily small and may be no worse than the approximation provided by finite computer arithmetic. We describe $\mathcal{H}-$matrices, give an efficient Gibbs sampler using these matrices for one-dimensional GPs, offer a proposed extension to higher dimensional surfaces, and investigate the performance of this fast increased fidelity approximate GP, FIFA-GP, using both simulated and real data sets.
Submitted 11 June, 2020;
originally announced June 2020.
-
Bayesian joint modeling of chemical structure and dose response curves
Authors:
Kelly R. Moran,
David Dunson,
Matthew W. Wheeler,
Amy H. Herring
Abstract:
Today there are approximately 85,000 chemicals regulated under the Toxic Substances Control Act, with around 2,000 new chemicals introduced each year. It is impossible to screen all of these chemicals for potential toxic effects either via full organism in vivo studies or in vitro high-throughput screening (HTS) programs. Toxicologists face the challenge of choosing which chemicals to screen, and predicting the toxicity of as-yet-unscreened chemicals. Our goal is to describe how variation in chemical structure relates to variation in toxicological response to enable in silico toxicity characterization designed to meet both of these challenges. With our Bayesian partially Supervised Sparse and Smooth Factor Analysis ($\text{BS}^3\text{FA}$) model, we learn a distance between chemicals targeted to toxicity, rather than one based on molecular structure alone. Our model also enables the prediction of chemical dose-response profiles based on chemical structure (that is, without in vivo or in vitro testing) by taking advantage of a large database of chemicals that have already been tested for toxicity in HTS programs. We show superior simulation performance in distance learning and modest to large gains in predictive ability compared to existing methods. Results from the high-throughput screening data application elucidate the relationship between chemical structure and a toxicity-relevant high-throughput assay. An R package for $\text{BS}^3\text{FA}$ is available online at https://github.com/kelrenmor/bs3fa.
Submitted 18 October, 2020; v1 submitted 27 December, 2019;
originally announced December 2019.
-
Multiscale Influenza Forecasting
Authors:
Dave Osthus,
Kelly R Moran
Abstract:
Influenza forecasting in the United States (US) is complex and challenging for reasons including substantial spatial and temporal variability, nested geographic scales of forecast interest, and heterogeneous surveillance participation. Here we present a flexible influenza forecasting model called Dante, a multiscale flu forecasting model that learns rather than prescribes spatial, temporal, and surveillance data structure. Forecasts at the Health and Human Services (HHS) regional and national scales are generated as linear combinations of state forecasts with weights proportional to US Census population estimates, resulting in coherent forecasts across nested geographic scales. We retrospectively compare Dante's short-term and seasonal forecasts at the state, regional, and national scales for the 2012 through 2017 flu seasons in the US to the Dynamic Bayesian Model (DBM), a leading flu forecasting model. Dante outperformed DBM for nearly all spatial units, flu seasons, geographic scales, and forecasting targets. The improved performance is due to Dante making forecasts, especially short-term forecasts, more confidently and accurately than DBM, suggesting Dante's improved forecast scores will also translate to more useful forecasts for the public health sector. Dante participated in the prospective 2018/19 FluSight challenge hosted by the Centers for Disease Control and Prevention and placed 1st in both the national and regional competition and the state competition. The methodology underpinning Dante can be used in other disease forecasting contexts where nested geographic scales of interest exist.
Submitted 30 September, 2019;
originally announced September 2019.
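
The population-weighted aggregation described in the Dante abstract above can be sketched as follows. This only illustrates why such weights give coherent forecasts across nested scales; it is not Dante's implementation, and the function name `aggregate`, the state labels, forecasts, and populations are all hypothetical.

```python
def aggregate(forecasts, populations):
    """Combine unit forecasts with weights proportional to population."""
    total = sum(populations[unit] for unit in forecasts)
    return sum(forecasts[unit] * populations[unit] / total for unit in forecasts)


# Hypothetical state-level forecasts and population counts.
state_forecast = {"A": 2.0, "B": 4.0, "C": 1.0}
state_pop = {"A": 1, "B": 3, "C": 4}

# Hypothetical regions: R1 contains states A and B, R2 contains state C.
regions = {"R1": ["A", "B"], "R2": ["C"]}
region_forecast = {
    name: aggregate({s: state_forecast[s] for s in members}, state_pop)
    for name, members in regions.items()
}
region_pop = {
    name: sum(state_pop[s] for s in members) for name, members in regions.items()
}

national_from_states = aggregate(state_forecast, state_pop)
national_from_regions = aggregate(region_forecast, region_pop)
print(national_from_states, national_from_regions)
```

Because every weight is a population share, aggregating states directly to the national level gives the same value as aggregating states to regions and then regions to the nation.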
-
Bayesian Hierarchical Factor Regression Models to Infer Cause of Death From Verbal Autopsy Data
Authors:
Kelly R. Moran,
Elizabeth L. Turner,
David Dunson,
Amy H. Herring
Abstract:
In low-resource settings where vital registration of death is not routine it is often of critical interest to determine and study the cause of death (COD) for individuals and the cause-specific mortality fraction (CSMF) for populations. Post-mortem autopsies, considered the gold standard for COD assignment, are often difficult or impossible to implement due to deaths occurring outside the hospital, expense, and/or cultural norms. For this reason, Verbal Autopsies (VAs) are commonly conducted, consisting of a questionnaire administered to next of kin recording demographic information, known medical conditions, symptoms, and other factors for the decedent. This article proposes a novel class of hierarchical factor regression models that avoid restrictive assumptions of standard methods, allow both the mean and covariance to vary with COD category, and can include covariate information on the decedent, region, or events surrounding death. Taking a Bayesian approach to inference, this work develops an MCMC algorithm and validates the FActor Regression for Verbal Autopsy (FARVA) model in simulation experiments. An application of FARVA to real VA data shows improved goodness-of-fit and better predictive performance in inferring COD and CSMF over competing methods. Code and a user manual are made available at https://github.com/kelrenmor/farva.
Submitted 18 October, 2020; v1 submitted 20 August, 2019;
originally announced August 2019.
-
Deceptiveness of internet data for disease surveillance
Authors:
Reid Priedhorsky,
Dave Osthus,
Ashlynn R. Daughton,
Kelly R. Moran,
Aron Culotta
Abstract:
Quantifying how many people are or will be sick, and where, is a critical ingredient in reducing the burden of disease because it helps the public health system plan and implement effective outbreak response. This process of disease surveillance is currently based on data gathering using clinical and laboratory methods; this distributed human contact and resulting bureaucratic data aggregation yield expensive procedures that lag real time by weeks or months. The promise of new surveillance approaches using internet data, such as web event logs or social media messages, is to achieve the same goal but faster and cheaper. However, prior work in this area lacks a rigorous model of information flow, making it difficult to assess the reliability of both specific approaches and the body of work as a whole.
We model disease surveillance as a Shannon communication. This new framework lets any two disease surveillance approaches be compared using a unified vocabulary and conceptual model. Using it, we describe and compare the deficiencies suffered by traditional and internet-based surveillance, introduce a new risk metric called deceptiveness, and offer mitigations for some of these deficiencies. This framework also makes the rich tools of information theory applicable to disease surveillance. This better understanding will improve the decision-making of public health practitioners by helping to leverage internet-based surveillance in a way complementary to the strengths of traditional surveillance.
Submitted 31 July, 2018; v1 submitted 16 November, 2017;
originally announced November 2017.