-
Designing a Data Science simulation with MERITS: A Primer
Authors:
Corrine F Elliott,
James Duncan,
Tiffany M Tang,
Merle Behr,
Karl Kumbier,
Bin Yu
Abstract:
Simulations play a crucial role in the modern scientific process. Yet despite (or due to) their ubiquity, the Data Science community shares neither a comprehensive definition for a "high-quality" study nor a consolidated guide to designing one. Inspired by the Predictability-Computability-Stability (PCS) framework for 'veridical' Data Science, we propose six MERITS that a Data Science simulation should satisfy. Modularity and Efficiency support the Computability of a study, encouraging clean and flexible implementation. Realism and Stability address the conceptualization of the research problem: How well does a study Predict reality, such that its conclusions generalize to new data/contexts? Finally, Intuitiveness and Transparency encourage good communication and trustworthiness of study design and results. Drawing an analogy between simulation and cooking, we moreover offer (a) a conceptual framework for thinking about the anatomy of a simulation 'recipe'; (b) a baker's dozen in guidelines to aid the Data Science practitioner in designing one; and (c) a case study deconstructing a simulation through the lens of our framework to demonstrate its practical utility. By contributing this "PCS primer" for high-quality Data Science simulation, we seek to distill and enrich the best practices of simulation across disciplines into a cohesive recipe for trustworthy, veridical Data Science.
Submitted 13 March, 2024;
originally announced March 2024.
-
Multiscale quantile segmentation
Authors:
Laura Jula Vanegas,
Merle Behr,
Axel Munk
Abstract:
We introduce a new methodology for analyzing serial data by quantile regression, assuming that the underlying quantile function consists of constant segments. The procedure does not rely on any distributional assumption besides serial independence. It is based on a multiscale statistic, which allows one to control the (finite sample) probability of selecting the correct number of segments S at a given error level, which serves as a tuning parameter. For a proper choice of this parameter, this tends exponentially fast to the true S as the sample size increases. We further show that the location and size of segments are estimated at the minimax optimal rate (compared to a Gaussian setting) up to a log-factor. Thereby, our approach leads to (asymptotically) uniform confidence bands for the entire quantile regression function in a fully nonparametric setup. The procedure is efficiently implemented using dynamic programming techniques with double heap structures, and software is provided. Simulations and data examples from genetic sequencing and ion channel recordings confirm the robustness of the proposed procedure, which at the same time reliably detects changes in quantiles from arbitrary distributions with precise statistical guarantees.
Submitted 7 September, 2020; v1 submitted 25 February, 2019;
originally announced February 2019.
-
Multiscale Blind Source Separation
Authors:
Merle Behr,
Chris Holmes,
Axel Munk
Abstract:
We provide a new methodology for statistical recovery of single linear mixtures of piecewise constant signals (sources) with unknown mixing weights and change points in a multiscale fashion. We show exact recovery within an $ε$-neighborhood of the mixture when the sources take only values in a known finite alphabet. Based on this we provide the SLAM (Separates Linear Alphabet Mixtures) estimators for the mixing weights and sources. For Gaussian error, we obtain uniform confidence sets and optimal rates (up to log-factors) for all quantities. SLAM is efficiently computed as a nonconvex optimization problem by a dynamic program tailored to the finite alphabet assumption. Its performance is investigated in a simulation study. Finally, it is applied to assign copy-number aberrations from genetic sequencing data to different clones and to estimate their proportions.
Submitted 30 August, 2017; v1 submitted 25 August, 2016;
originally announced August 2016.
-
Identifiability for Blind Source Separation of Multiple Finite Alphabet Linear Mixtures
Authors:
Merle Behr,
Axel Munk
Abstract:
Under weak assumptions, we give a complete combinatorial characterization of identifiability for linear mixtures of finite alphabet sources with unknown mixing weights and unknown source signals but a known alphabet. This is based on a detailed treatment of the case of a single linear mixture. Notably, our identifiability analysis also applies when the number of sources is unknown. We provide necessary and sufficient conditions for identifiability and give a simple sufficient criterion, together with an explicit construction to determine the weights and the source signals for deterministic data by taking advantage of the hierarchical structure within the possible mixture values. We show that the probability of identifiability is related to the distribution of a hitting time and converges exponentially fast to one when the underlying sources come from a discrete Markov process. Finally, we explore our theoretical results in a simulation study. Our work extends and clarifies the scope of scenarios for which blind source separation becomes meaningful.
Submitted 30 August, 2017; v1 submitted 20 May, 2015;
originally announced May 2015.