-
Asymptotic Theory for Estimation of the Husler-Reiss Distribution via Block Maxima Method
Authors:
Hank Flury,
Jan Hannig,
Richard Smith
Abstract:
The Hüsler-Reiss distribution describes the limit of the pointwise maxima of a bivariate normal distribution. This distribution is defined by a single parameter, $λ$. We provide asymptotic theory for maximum likelihood estimation of $λ$ under a block maxima approach. Our work assumes independent and identically distributed bivariate normal random variables, grouped into blocks where the block size and number of blocks increase simultaneously. With these assumptions our results provide conditions for the asymptotic normality of the Maximum Likelihood Estimator (MLE). We characterize the bias of the MLE, provide conditions under which this bias is asymptotically negligible, and discuss how to choose the block size to minimize a bias-variance trade-off. The proofs are an extension of previous results for choosing the block size in the estimation of univariate extreme value distributions (Dombry and Ferreira 2019), providing a potential basis for extensions to multivariate cases where both the marginal and dependence parameters are unknown. The proofs rely on the Argmax Theorem applied to a localized log-likelihood function, combined with a Lindeberg-Feller Central Limit Theorem argument to establish asymptotic normality. Possible applications of the method include composite likelihood estimation in Brown-Resnick processes, where it is known that the bivariate distributions are of Hüsler-Reiss form.
Submitted 24 May, 2024;
originally announced May 2024.
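The block maxima setup described in the abstract above can be illustrated with a minimal sketch. It only forms componentwise block maxima from simulated bivariate normal pairs; it does not estimate $λ$ or implement the paper's MLE. The function names, the correlation value 0.5, and the block size 50 are assumptions chosen for illustration.

```python
import math
import random


def bivariate_normal_sample(n, rho, rng):
    """Draw n pairs from a standard bivariate normal with correlation rho."""
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Mixing two independent normals gives correlation rho with z1.
        pairs.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return pairs


def block_maxima(pairs, block_size):
    """Split pairs into consecutive blocks and take the pointwise maximum
    of each coordinate within every complete block."""
    n_blocks = len(pairs) // block_size
    maxima = []
    for b in range(n_blocks):
        block = pairs[b * block_size:(b + 1) * block_size]
        maxima.append((max(x for x, _ in block), max(y for _, y in block)))
    return maxima


rng = random.Random(0)  # fixed seed, illustrative only
sample = bivariate_normal_sample(1000, 0.5, rng)
maxima = block_maxima(sample, 50)  # 20 blocks of 50 observations
```

The two coordinates of a block maximum generally come from different observations in the block, which is what "pointwise maxima" means here.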
-
Semiparametric fiducial inference
Authors:
Yifan Cui,
Jan Hannig,
Paul Edlefsen
Abstract:
R. A. Fisher introduced the concept of fiducial as a potential replacement for the Bayesian posterior distribution in the 1930s. During the past century, fiducial approaches have been explored in various parametric and nonparametric settings. However, to the best of our knowledge, no fiducial inference has been developed in the realm of semiparametric statistics. In this paper, we propose a novel fiducial approach for semiparametric models. To streamline our presentation, we use the Cox proportional hazards model, which is the most popular model for the analysis of survival data, as a running example. Other models and extensions are also discussed. In our experiments, we find our method to perform well especially in situations when the maximum likelihood estimator fails.
Submitted 29 April, 2024;
originally announced April 2024.
-
AutoGFI: Streamlined Generalized Fiducial Inference for Modern Inference Problems
Authors:
Wei Du,
Jan Hannig,
Thomas C. M. Lee,
Yi Su,
Chunzhe Zhang
Abstract:
The origins of fiducial inference trace back to the 1930s when R. A. Fisher first introduced the concept as a response to what he perceived as a limitation of Bayesian inference - the requirement for a subjective prior distribution on model parameters in cases where no prior information was available. However, Fisher's initial fiducial approach fell out of favor as complications arose, particularly in multi-parameter problems. In the wake of 2000, amidst a renewed interest in contemporary adaptations of fiducial inference, generalized fiducial inference (GFI) emerged to extend Fisher's fiducial argument, providing a promising avenue for addressing numerous crucial and practical inference challenges. Nevertheless, the adoption of GFI has been limited due to its often demanding mathematical derivations and the necessity for implementing complex Markov Chain Monte Carlo algorithms. This complexity has impeded its widespread utilization and practical applicability. This paper presents a significant advancement by introducing an innovative variant of GFI designed to alleviate these challenges. Specifically, this paper proposes AutoGFI, an easily implementable algorithm that streamlines the application of GFI to a broad spectrum of inference problems involving additive noise. AutoGFI can be readily implemented as long as a fitting routine is available, making it accessible to a broader audience of researchers and practitioners. To demonstrate its effectiveness, AutoGFI is applied to three contemporary and challenging problems: tensor regression, matrix completion, and regression with network cohesion. These case studies highlight the immense potential of GFI and illustrate AutoGFI's promising performance when compared to specialized solutions for these problems. Overall, this research paves the way for a more accessible and powerful application of GFI in a range of practical domains.
Submitted 11 April, 2024;
originally announced April 2024.
-
Dempster-Shafer P-values: Thoughts on an Alternative Approach for Multinomial Inference
Authors:
Kentaro Hoffman,
Kai Zhang,
Tyler McCormick,
Jan Hannig
Abstract:
In this paper, we demonstrate that a new measure of evidence we developed, called the Dempster-Shafer p-value, allows for insights and interpretations that retain most of the structure of the p-value while addressing some of the disadvantages that traditional p-values face. Moreover, we show through classical large-sample bounds and simulations that there exists a close connection between our form of DS hypothesis testing and the classical frequentist testing paradigm. We also demonstrate how our approach gives unique insights into the dimensionality of a hypothesis test and how it models the effects of adversarial attacks on multinomial data. Finally, we demonstrate how these insights can be used to analyze text data for public health through an analysis of the Population Health Metrics Research Consortium dataset for verbal autopsies.
Submitted 26 February, 2024;
originally announced February 2024.
-
A Bernstein-von Mises Theorem for Generalized Fiducial Distributions
Authors:
J. E. Borgert,
Jan Hannig
Abstract:
We prove a Bernstein-von Mises result for generalized fiducial distributions following the approach based on quadratic mean differentiability in Le Cam (1986); van der Vaart (1998). Building on their approach, we introduce only two additional conditions for the generalized fiducial setting. While asymptotic normality of generalized fiducial distributions has been studied before, particularly in Hannig (2009) and Sonderegger and Hannig (2014), this work significantly extends the usefulness of such a result by the much more general condition of quadratic mean differentiability. We demonstrate the applicability of our result with two examples that necessitate these more general assumptions: the triangular distributions and free-knot spline models.
Submitted 24 April, 2024; v1 submitted 31 January, 2024;
originally announced January 2024.
-
Bayes Watch: Bayesian Change-point Detection for Process Monitoring with Fault Detection
Authors:
Alexander C. Murph,
Curtis B. Storlie,
Patrick M. Wilson,
Jonathan P. Williams,
Jan Hannig
Abstract:
When a predictive model is in production, it must be monitored in real-time to ensure that its performance does not suffer due to drift or abrupt changes to data. Ideally, this is done long before learning that the performance of the model itself has dropped by monitoring outcome data. In this paper we consider the problem of monitoring a predictive model that identifies the need for palliative care currently in production at the Mayo Clinic in Rochester, MN. We introduce a framework, called Bayes Watch, for detecting change-points in high-dimensional longitudinal data with mixed variable types and missing values and for determining in which variables the change-point occurred. Bayes Watch fits an array of Gaussian Graphical Mixture Models to groupings of homogeneous data in time, called regimes, which are modeled as the observed states of a Markov process with unknown transition probabilities. In doing so, Bayes Watch defines a posterior distribution on a vector of regime assignments, which gives meaningful expressions on the probability of every possible change-point. Bayes Watch also allows for an effective and efficient fault detection system that assesses which features in the data were most responsible for a given change-point.
Submitted 4 October, 2023;
originally announced October 2023.
-
Introduction to Generalized Fiducial Inference
Authors:
Alexander C. Murph,
Jan Hannig,
Jonathan P. Williams
Abstract:
Fiducial inference was introduced in the first half of the 20th century by Fisher (1935) as a means to get a posterior-like distribution for a parameter without having to arbitrarily define a prior. While the method originally fell out of favor due to non-exactness issues in multivariate cases, the method has garnered renewed interest in the last decade. This is partly due to the development of generalized fiducial inference, which is a fiducial perspective on generalized confidence intervals: a method used to find approximate confidence distributions. In this chapter, we illuminate the usefulness of the fiducial philosophy, introduce the definition of a generalized fiducial distribution, and apply it to interesting, non-trivial inferential examples.
Submitted 28 February, 2023;
originally announced February 2023.
-
Data Integration Via Analysis of Subspaces (DIVAS)
Authors:
Jack B. Prothero,
Meilei Jiang,
Jan Hannig,
Quoc Tran-Dinh,
Andrew Ackerman,
J. S. Marron
Abstract:
Modern data collection in many data paradigms, including bioinformatics, often incorporates multiple traits derived from different data types (i.e. platforms). We call this data multi-block, multi-view, or multi-omics data. The emergent field of data integration develops and applies new methods for studying multi-block data and identifying how different data types relate and differ. One major frontier in contemporary data integration research is methodology that can identify partially-shared structure between sub-collections of data types. This work presents a new approach: Data Integration Via Analysis of Subspaces (DIVAS). DIVAS combines new insights in angular subspace perturbation theory with recent developments in matrix signal processing and convex-concave optimization into one algorithm for exploring partially-shared structure. Based on principal angles between subspaces, DIVAS provides built-in inference on the results of the analysis, and is effective even in high-dimension-low-sample-size (HDLSS) situations.
Submitted 17 January, 2024; v1 submitted 1 December, 2022;
originally announced December 2022.
-
Algorithm for detection of illegal discounting in North Carolina Education Lottery
Authors:
Jiayi Fu,
Jack B Prothero,
Jan Hannig
Abstract:
The lottery is a very lucrative industry. Popular fascination often focuses on the largest prizes. However, less attention has been paid to detecting unusual lottery buying behaviors at lower stakes. Our paper introduces a new model to detect illegal discounting in the North Carolina Education Lottery using statistical analysis of net gains and ticket buying habits. Nine outlying players are flagged and are further examined using a proposed stochastic model to calculate the range of their possible losses in the lottery. The unusual buying patterns of the players flagged as outliers are further confirmed using a K-means clustering analysis of lottery store visiting behaviors.
Submitted 6 November, 2023; v1 submitted 22 November, 2022;
originally announced November 2022.
-
A Geometric Perspective on Bayesian and Generalized Fiducial Inference
Authors:
Yang Liu,
Jan Hannig,
Alexander C Murph
Abstract:
Post-data statistical inference concerns making probability statements about model parameters conditional on observed data. When a priori knowledge about parameters is available, post-data inference can be conveniently made from Bayesian posteriors. In the absence of prior information, we may still rely on objective Bayes or generalized fiducial inference (GFI). Inspired by approximate Bayesian computation, we propose a novel characterization of post-data inference with the aid of differential geometry. Under suitable smoothness conditions, we establish that Bayesian posteriors and generalized fiducial distributions (GFDs) can be respectively characterized by absolutely continuous distributions supported on the same differentiable manifold: The manifold is uniquely determined by the observed data and the data generating equation of the fitted model. Our geometric analysis not only sheds light on the connection and distinction between Bayesian inference and GFI, but also allows us to sample from posteriors and GFDs using manifold Markov chain Monte Carlo algorithms. A repeated-measures analysis of variance example is presented to illustrate the sampling procedure.
Submitted 30 September, 2023; v1 submitted 11 October, 2022;
originally announced October 2022.
-
Generalized Fiducial Inference on Differentiable Manifolds
Authors:
Alexander C Murph,
Jan Hannig,
Jonathan P Williams
Abstract:
We introduce a novel approach to inference on parameters that take values in a Riemannian manifold embedded in a Euclidean space. Parameter spaces of this form are ubiquitous across many fields, including chemistry, physics, computer graphics, and geology. This new approach uses generalized fiducial inference to obtain a posterior-like distribution on the manifold, without needing to know a parameterization that maps the constrained space to an unconstrained Euclidean space. The proposed methodology, called the constrained generalized fiducial distribution (CGFD), is obtained by using mathematical tools from Riemannian geometry. A Bernstein-von Mises-type result for the CGFD, which provides intuition for how the desirable asymptotic qualities of the unconstrained generalized fiducial distribution are inherited by the CGFD, is provided. To demonstrate the practical use of the CGFD, we provide three proof-of-concept examples: inference for data from a multivariate normal density with the mean parameters on a sphere, a linear logspline density estimation problem, and a reimagined approach to the AR(1) model, all of which exhibit desirable coverages via simulation. We discuss two Markov chain Monte Carlo algorithms for the exploration of these constrained parameter spaces and adapt them for the CGFD.
Submitted 8 December, 2022; v1 submitted 30 September, 2022;
originally announced September 2022.
-
Demystifying Inferential Models: A Fiducial Perspective
Authors:
Yifan Cui,
Jan Hannig
Abstract:
Inferential models have recently gained in popularity for valid uncertainty quantification. In this paper, we investigate inferential models by exploring relationships between inferential models, fiducial inference, and confidence curves. In short, we argue that from a certain point of view, inferential models can be viewed as fiducial distribution based confidence curves. We show that all probabilistic uncertainty quantification of inferential models is based on a collection of sets we name principle sets and principle assertions.
Submitted 12 May, 2022; v1 submitted 11 May, 2022;
originally announced May 2022.
-
A New String Edit Distance and Applications
Authors:
Taylor Petty,
Jan Hannig,
Tunde I Huszar,
Hari Iyer
Abstract:
String edit distances have been used for decades in applications ranging from spelling correction and web search suggestions to DNA analysis. Most string edit distances are variations of the Levenshtein distance and consider only single-character edits. In forensic applications polymorphic genetic markers such as short tandem repeats (STRs) are used. At these repetitive motifs the DNA copying errors consist of more than just single base differences. More often the phenomenon of ``stutter'' is observed, where the number of repeated units differs (by whole units) from the template. To adapt the Levenshtein distance to be suitable for forensic applications where DNA sequence similarity is of interest, a generalized string edit distance is defined that accommodates the addition or deletion of whole motifs in addition to single-nucleotide edits. A dynamic programming implementation is developed for computing this distance between sequences. The novelty of this algorithm is in handling the complex interactions that arise between multiple- and single-character edits. Forensic examples illustrate the purpose and use of the Restricted Forensic Levenshtein (RFL) distance measure, but applications extend to sequence alignment and string similarity in other biological areas, as well as dynamic programming algorithms more broadly.
Submitted 11 May, 2022; v1 submitted 11 March, 2022;
originally announced March 2022.
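For context on the abstract above, the classic single-character Levenshtein distance that the paper generalizes can be sketched with the standard dynamic programming recurrence. This is the textbook baseline only, not the Restricted Forensic Levenshtein (RFL) distance, whose motif-level edits and costs are not specified here. The function name and the example sequences are assumptions for illustration.

```python
def levenshtein(a, b):
    """Classic Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            substitution_cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                      # delete ca
                curr[j - 1] + 1,                  # insert cb
                prev[j - 1] + substitution_cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


# Losing one whole 4-base repeat unit counts as four separate edits here.
stutter_distance = levenshtein("AGATAGAT", "AGAT")
```

Under this baseline, the loss of a single repeat unit is charged once per base, which is the kind of mismatch with stutter that motivates allowing whole-motif edits.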
-
A unified nonparametric fiducial approach to interval-censored data
Authors:
Yifan Cui,
Jan Hannig,
Michael Kosorok
Abstract:
Censored data, where the event time is partially observed, are challenging for survival probability estimation. In this paper, we introduce a novel nonparametric fiducial approach to interval-censored data, including right-censored, current status, case II censored, and mixed case censored data. The proposed approach leveraging a simple Gibbs sampler has a useful property of being "one size fits all", i.e., the proposed approach automatically adapts to all types of non-informative censoring mechanisms. As shown in the extensive simulations, the proposed fiducial confidence intervals significantly outperform existing methods in terms of both coverage and length. In addition, the proposed fiducial point estimator has much smaller estimation errors than the nonparametric maximum likelihood estimator. Furthermore, we apply the proposed method to Austrian rubella data and a study of hemophiliacs infected with the human immunodeficiency virus. The strength of the proposed fiducial approach is not only estimation and uncertainty quantification but also its automatic adaptation to a variety of censoring mechanisms.
Submitted 28 November, 2021;
originally announced November 2021.
-
Jackstraw Inference for AJIVE Data Integration
Authors:
Xi Yang,
Katherine A. Hoadley,
Jan Hannig,
J. S. Marron
Abstract:
In the age of big data, data integration is a critical step especially in the understanding of how diverse data types work together and work separately. Among data integration methods, the Angle-Based Joint and Individual Variation Explained (AJIVE) approach is particularly attractive because it not only studies joint behavior but also individual behavior. Typically AJIVE scores indicate important relationships between data objects, such as clusters. An important challenge is understanding which features, i.e. variables, are associated with those relationships. This challenge is addressed by the proposal of a hypothesis test for assessing statistical significance of features. The new test is inspired by the related jackstraw method developed for Principal Component Analysis. We use a high-dimensional multi-genomic cancer data set as our strong motivation and deep illustration of the methodology.
Submitted 5 November, 2022; v1 submitted 25 September, 2021;
originally announced September 2021.
-
Comments on: A Gibbs sampler for a class of random convex polytopes
Authors:
Kentaro Hoffman,
Jan Hannig,
Kai Zhang
Abstract:
In this comment we discuss relative strengths and weaknesses of simplex and Dirichlet Dempster-Shafer inference as applied to multi-resolution tests of independence.
Submitted 15 April, 2021;
originally announced May 2021.
-
A Conceptual Framework for Establishing Trust in Real World Intelligent Systems
Authors:
Michael Guckert,
Nils Gumpfer,
Jennifer Hannig,
Till Keller,
Neil Urquhart
Abstract:
Intelligent information systems that contain emergent elements often encounter trust problems because results do not get sufficiently explained and the procedure itself can not be fully retraced. This is caused by a control flow depending either on stochastic elements or on the structure and relevance of the input data. Trust in such algorithms can be established by letting users interact with the system so that they can explore results and find patterns that can be compared with their expected solution. Reflecting features and patterns of human understanding of a domain against algorithmic results can create awareness of such patterns and may increase the trust that a user has in the solution. If expectations are not met, close inspection can be used to decide whether a solution conforms to the expectations or whether it goes beyond the expected. By either accepting or rejecting a solution, the user's set of expectations evolves and a learning process for the users is established. In this paper we present a conceptual framework that reflects and supports this process. The framework is the result of an analysis of two exemplary case studies from two different disciplines with information systems that assist experts in their complex tasks.
Submitted 12 April, 2021;
originally announced April 2021.
-
New Perspectives on Centering
Authors:
Jack B. Prothero,
Jan Hannig,
J. S. Marron
Abstract:
Data matrix centering is an ever-present yet under-examined aspect of data analysis. Functional data analysis (FDA) often operates with a default of centering such that the vectors in one dimension have mean zero. We find that centering along the other dimension identifies a novel useful mode of variation beyond those familiar in FDA. We explore ambiguities in both matrix orientation and nomenclature. Differences between centerings and their potential interaction can be easily misunderstood. We propose a unified framework and new terminology for centering operations. We clearly demonstrate the intuition behind and consequences of each centering choice with informative graphics. We also propose a new direction energy hypothesis test as part of a series of diagnostics for determining which choice of centering is best for a data set. We explore the application of these diagnostics in several FDA settings.
Submitted 22 March, 2021;
originally announced March 2021.
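The two centering directions contrasted in the abstract above can be made concrete with a small sketch. It only shows column centering versus row centering of a data matrix stored as a list of rows; it does not reproduce the paper's terminology, framework, or diagnostics. The function names and the example matrix are assumptions for illustration.

```python
def center_columns(matrix):
    """Subtract each column's mean so that every column has mean zero."""
    n_rows = len(matrix)
    col_means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, col_means)] for row in matrix]


def center_rows(matrix):
    """Subtract each row's mean so that every row has mean zero."""
    return [[x - sum(row) / len(row) for x in row] for row in matrix]


data = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]

by_column = center_columns(data)  # removes the mean across rows
by_row = center_rows(data)        # removes the mean within each row
```

On this toy matrix the two operations leave visibly different residual structure, which is the kind of difference the choice of centering introduces.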
-
Measure of Strength of Evidence for Visually Observed Differences between Subpopulations
Authors:
Xi Yang,
Jan Hannig,
Katherine A. Hoadley,
Iain Carmichael,
J. S. Marron
Abstract:
The Population Difference Criterion is proposed to assess the statistical significance of visually observed subpopulation differences. It addresses the following challenges: in high-dimensional contexts, distributional models can be dubious; in high-signal contexts, conventional permutation tests give poor pairwise comparisons. We also make two other contributions. First, based on a careful analysis, we find that a balanced permutation approach is more powerful in high-signal contexts than conventional permutations. Second, we quantify the uncertainty due to permutation variation via a bootstrap confidence interval. The practical usefulness of these ideas is illustrated in the comparison of subpopulations of modern cancer data.
Submitted 19 September, 2023; v1 submitted 1 January, 2021;
originally announced January 2021.
-
Generalized fiducial factor: an alternative to the Bayes factor for forensic identification of source problems
Authors:
Jonathan P Williams,
Danica M Ommen,
Jan Hannig
Abstract:
One formulation of forensic identification of source problems is to determine the source of trace evidence, for instance, glass fragments found on a suspect for a crime. The current state of the science is to compute a Bayes factor (BF) comparing the marginal distribution of measurements of trace evidence under two competing propositions for whether or not the unknown source evidence originated from a specific source. The obvious problem with such an approach is the ability to tailor the prior distributions (placed on the features/parameters of the statistical model for the measurements of trace evidence) in favor of the defense or prosecution, which is further complicated by the fact that the number of measurements of trace evidence is typically small enough that prior choice/specification has a strong influence on the value of the BF. To remedy this problem of prior specification and choice, we develop an alternative to the BF, within the framework of generalized fiducial inference (GFI), that we term a generalized fiducial factor (GFF). Furthermore, we demonstrate empirically, on the synthetic and real Netherlands Forensic Institute (NFI) casework data, deficiencies in the BF and classical/frequentist likelihood ratio (LR) approaches.
Submitted 10 December, 2020;
originally announced December 2020.
-
Deep Fiducial Inference
Authors:
Gang Li,
Jan Hannig
Abstract:
Since the mid-2000s, there has been a resurrection of interest in modern modifications of fiducial inference. To date, the main computational tool to extract a generalized fiducial distribution is Markov chain Monte Carlo (MCMC). We propose an alternative way of computing a generalized fiducial distribution that could be used in complex situations. In particular, to overcome the difficulty when the unnormalized fiducial density (needed for MCMC) is unavailable, we design a fiducial autoencoder (FAE). The fitted autoencoder is used to generate generalized fiducial samples of the unknown parameters. To increase accuracy, we then apply an approximate fiducial computation (AFC) algorithm, rejecting samples that, when plugged into a decoder, do not replicate the observed data well enough. Our numerical experiments show the effectiveness of our FAE-based inverse solution and the excellent coverage performance of the AFC-corrected FAE solution.
Submitted 8 July, 2020;
originally announced July 2020.
-
A fiducial approach to nonparametric deconvolution problem: discrete case
Authors:
Yifan Cui,
Jan Hannig
Abstract:
Fiducial inference, as generalized by Hannig et al. (2016), is applied to nonparametric g-modeling (Efron, 2016) in the discrete case. We propose a computationally efficient algorithm to sample from the fiducial distribution, and use the generated samples to construct point estimates and confidence intervals. We study the theoretical properties of the fiducial distribution and perform extensive simulations in various scenarios. The proposed approach yields good statistical performance in terms of the mean squared error of point estimators and the coverage of confidence intervals. Furthermore, we apply the proposed fiducial method to estimate the probability of each satellite site being malignant using gastric adenocarcinoma data with 844 patients (Efron, 2016).
Submitted 20 December, 2022; v1 submitted 9 June, 2020;
originally announced June 2020.
-
Joint and individual analysis of breast cancer histologic images and genomic covariates
Authors:
Iain Carmichael,
Benjamin C. Calhoun,
Katherine A. Hoadley,
Melissa A. Troester,
Joseph Geradts,
Heather D. Couture,
Linnea Olsson,
Charles M. Perou,
Marc Niethammer,
Jan Hannig,
J. S. Marron
Abstract:
A key challenge in modern data analysis is understanding connections between complex and differing modalities of data. For example, two of the main approaches to the study of breast cancer are histopathology (analyzing visual characteristics of tumors) and genetics. While histopathology is the gold standard for diagnostics and there have been many recent breakthroughs in genetics, there is little overlap between these two fields. We aim to bridge this gap by developing methods based on Angle-based Joint and Individual Variation Explained (AJIVE) to directly explore similarities and differences between these two modalities. Our approach exploits Convolutional Neural Networks (CNNs) as a powerful, automatic method for image feature extraction to address some of the challenges presented by statistical analysis of histopathology image data. CNNs raise issues of interpretability that we address by developing novel methods to explore visual modes of variation captured by statistical algorithms (e.g. PCA or AJIVE) applied to CNN features. Our results provide many interpretable connections and contrasts between histopathology and genetics.
Submitted 13 April, 2020; v1 submitted 1 December, 2019;
originally announced December 2019.
-
Uncertainty Quantification in Ensembles of Honest Regression Trees using Generalized Fiducial Inference
Authors:
Suofei Wu,
Jan Hannig,
Thomas C. M. Lee
Abstract:
Due to their accuracies, methods based on ensembles of regression trees are a popular approach for making predictions. Some common examples include Bayesian additive regression trees, boosting and random forests. This paper focuses on honest random forests, which add honesty to the original form of random forests and are proved to have better statistical properties. The main contribution is a new method that quantifies the uncertainties of the estimates and predictions produced by honest random forests. The proposed method is based on the generalized fiducial methodology, and provides a fiducial density function that measures how likely each single honest tree is the true model. With such a density function, estimates and predictions, as well as their confidence/prediction intervals, can be obtained. The promising empirical properties of the proposed method are demonstrated by numerical comparisons with several state-of-the-art methods, and by applications to a few real data sets. Lastly, the proposed method is theoretically backed up by a strong asymptotic guarantee.
Submitted 14 November, 2019;
originally announced November 2019.
-
A Note on Optimal Sampling Strategy for Structural Variant Detection Using Optical Mapping
Authors:
Weiwei Li,
Jan Hannig,
Corbin Jones
Abstract:
Structural variants compose the majority of human genetic variation, but are difficult to assess using current genomic sequencing technologies. Optical mapping technologies, which measure the size of chromosomal fragments between labeled markers, offer an alternative approach. As these technologies mature towards becoming clinical tools, there is a need to develop an approach for determining the optimal strategy for sampling biological material in order to detect a variant at some threshold. Here we develop an optimization approach using a simple, yet realistic, model of the genomic mapping process using a hyper-geometric distribution and probabilistic concentration inequalities. Our approach is both computationally and analytically tractable and includes a novel approach to getting tail bounds of hyper-geometric distribution. We show that if a genomic mapping technology can sample most of the chromosomal fragments within a sample, comparatively little biological material is needed to detect a variant at high confidence.
Submitted 4 October, 2019;
originally announced October 2019.
-
The EAS approach for graphical selection consistency in vector autoregression models
Authors:
Jonathan P Williams,
Yuying Xie,
Jan Hannig
Abstract:
As evidenced by various recent and significant papers within the frequentist literature, along with numerous applications in macroeconomics, genomics, and neuroscience, there continues to be substantial interest to understand the theoretical estimation properties of high-dimensional vector autoregression (VAR) models. To date, however, while Bayesian VAR (BVAR) models have been developed and studied empirically (primarily in the econometrics literature) there exist very few theoretical investigations of the repeated sampling properties for BVAR models in the literature. In this direction, we construct methodology via the $\varepsilon$-$admissible$ subsets (EAS) approach for posterior-like inference based on a generalized fiducial distribution of relative model probabilities over all sets of active/inactive components (graphs) of the VAR transition matrix. We provide a mathematical proof of $pairwise$ and $strong$ graphical selection consistency for the EAS approach for stable VAR(1) models which is robust to model misspecification, and demonstrate numerically that it is an effective strategy in high-dimensional settings.
Submitted 11 June, 2019;
originally announced June 2019.
-
Subspace Clustering through Sub-Clusters
Authors:
Weiwei Li,
Jan Hannig,
Sayan Mukherjee
Abstract:
The problem of dimension reduction is of increasing importance in modern data analysis. In this paper, we consider modeling the collection of points in a high dimensional space as a union of low dimensional subspaces. In particular we propose a highly scalable sampling based algorithm that clusters the entire data via first spectral clustering of a small random sample followed by classifying or labeling the remaining out of sample points. The key idea is that this random subset borrows information across the entire data set and that the problem of clustering points can be replaced with the more efficient and robust problem of "clustering sub-clusters". We provide theoretical guarantees for our procedure. The numerical results indicate we outperform other state-of-the-art subspace clustering algorithms with respect to accuracy and speed.
Submitted 11 June, 2020; v1 submitted 15 November, 2018;
originally announced November 2018.
-
Method G: Uncertainty Quantification for Distributed Data Problems using Generalized Fiducial Inference
Authors:
Randy C. S. Lai,
J. Hannig,
Thomas C. M. Lee
Abstract:
It is not unusual for a data analyst to encounter data sets distributed across several computers. This can happen for reasons such as privacy concerns, efficiency of likelihood evaluations, or just the sheer size of the whole data set. This presents new challenges to statisticians as even computing simple summary statistics such as the median becomes computationally challenging. Furthermore, if other advanced statistical methods are desired, novel computational strategies are needed. In this paper we propose a new approach for distributed analysis of massive data that is suitable for generalized fiducial inference and is based on a careful implementation of a "divide and conquer" strategy combined with importance sampling. The proposed approach requires only a small amount of communication between nodes, and is shown to be asymptotically equivalent to using the whole data set. Unlike most existing methods, the proposed approach produces uncertainty measures (such as confidence intervals) in addition to point estimates for parameters of interest. The proposed approach is also applied to the analysis of a large set of solar images.
Submitted 18 May, 2018;
originally announced May 2018.
-
A Bayesian Approach to Multi-State Hidden Markov Models: Application to Dementia Progression
Authors:
Jonathan P Williams,
Curtis B Storlie,
Terry M Therneau,
Clifford R Jack Jr,
Jan Hannig
Abstract:
People are living longer than ever before, and with this arise new complications and challenges for humanity. Among the most pressing of these challenges is understanding the role of aging in the development of dementia. This paper is motivated by the Mayo Clinic Study of Aging data for 4742 subjects since 2004, and how it can be used to draw inference on the role of aging in the development of dementia. We construct a hidden Markov model (HMM) to represent progression of dementia from states associated with the buildup of amyloid plaque in the brain, and the loss of cortical thickness. A hierarchical Bayesian approach is taken to estimate the parameters of the HMM with a truly time-inhomogeneous infinitesimal generator matrix, and response functions of the continuous-valued biomarker measurements are cut-point agnostic. A Bayesian approach with these features could be useful in many disease progression models. Additionally, an approach is illustrated for correcting a common bias in delayed enrollment studies, in which some or all subjects are not observed at baseline. Standard software is incapable of accounting for this critical feature, so code to perform the estimation of the model described below is made available online.
Submitted 6 August, 2018; v1 submitted 7 February, 2018;
originally announced February 2018.
-
Covariance Estimation via Fiducial Inference
Authors:
W. Jenny Shi,
Jan Hannig,
Randy C. S. Lai,
Thomas C. M. Lee
Abstract:
As a classical problem, covariance estimation has drawn much attention from the statistical community for decades. Much work has been done under the frequentist and the Bayesian frameworks. Aiming to quantify the uncertainty of the estimators without having to choose a prior, we have developed a fiducial approach to the estimation of a covariance matrix. Built upon the Fiducial Bernstein-von Mises Theorem (Sonderegger and Hannig 2014), we show that the fiducial distribution of the covariance matrix is consistent under our framework. Consequently, the samples generated from this fiducial distribution are good estimators of the true covariance matrix, which enables us to define a meaningful confidence region for the covariance matrix. Lastly, we also show that the fiducial approach can be a powerful tool for identifying clique structures in covariance matrices.
Submitted 16 August, 2017;
originally announced August 2017.
-
Nonparametric generalized fiducial inference for survival functions under censoring
Authors:
Yifan Cui,
Jan Hannig
Abstract:
Fiducial Inference, introduced by Fisher in the 1930s, has a long history, which at times aroused passionate disagreements. However, its application has been largely confined to relatively simple parametric problems. In this paper, we present what might be the first time fiducial inference, as generalized by Hannig et al. (2016), is systematically applied to estimation of a nonparametric survival function under right censoring. We find that the resulting fiducial distribution gives rise to surprisingly good statistical procedures applicable to both one sample and two sample problems. In particular, we use the fiducial distribution of a survival function to construct pointwise and curvewise confidence intervals for the survival function, and propose tests based on the curvewise confidence interval. We establish a functional Bernstein-von Mises theorem, and perform thorough simulation studies in scenarios with different levels of censoring. The proposed fiducial based confidence intervals maintain coverage in situations where asymptotic methods often have substantial coverage problems. Furthermore, the average length of the proposed confidence intervals is often shorter than the length of competing methods that maintain coverage. Finally, the proposed fiducial test is more powerful than various types of log-rank tests and sup log-rank tests in some scenarios. We illustrate the proposed fiducial test comparing chemotherapy against chemotherapy combined with radiotherapy using data from the treatment of locally unresectable gastric cancer.
Submitted 24 March, 2018; v1 submitted 17 July, 2017;
originally announced July 2017.
-
Angle-Based Joint and Individual Variation Explained
Authors:
Qing Feng,
Meilei Jiang,
Jan Hannig,
J. S. Marron
Abstract:
Integrative analysis of disparate data blocks measured on a common set of experimental subjects is a major challenge in modern data analysis. This data structure naturally motivates the simultaneous exploration of the joint and individual variation within each data block resulting in new insights. For instance, there is a strong desire to integrate the multiple genomic data sets in The Cancer Genome Atlas to characterize the common and also the unique aspects of cancer genetics and cell biology for each source. In this paper we introduce Angle-Based Joint and Individual Variation Explained capturing both joint and individual variation within each data block. This is a major improvement over earlier approaches to this challenge in terms of a new conceptual understanding, much better adaption to data heterogeneity and a fast linear algebra computation. Important mathematical contributions are the use of score subspaces as the principal descriptors of variation structure and the use of perturbation theory as the guide for variation segmentation. This leads to an exploratory data analysis method which is insensitive to the heterogeneity among data blocks and does not require separate normalization. An application to cancer data reveals different behaviors of each type of signal in characterizing tumor subtypes. An application to a mortality data set reveals interesting historical lessons. Software and data are available at GitHub <https://github.com/MeileiJiang/AJIVE_Project>.
Submitted 18 March, 2018; v1 submitted 6 April, 2017;
originally announced April 2017.
-
Non-penalized variable selection in high-dimensional linear model settings via generalized fiducial inference
Authors:
Jonathan P Williams,
Jan Hannig
Abstract:
Standard penalized methods of variable selection and parameter estimation rely on the magnitude of coefficient estimates to decide which variables to include in the final model. However, coefficient estimates are unreliable when the design matrix is collinear. To overcome this challenge an entirely new perspective on variable selection is presented within a generalized fiducial inference framework. This new procedure is able to effectively account for linear dependencies among subsets of covariates in a high-dimensional setting where $p$ can grow almost exponentially in $n$, as well as in the classical setting where $p \le n$. It is shown that the procedure very naturally assigns small probabilities to subsets of covariates which include redundancies by way of explicit $L_{0}$ minimization. Furthermore, with a typical sparsity assumption, it is shown that the proposed method is consistent in the sense that the probability of the true sparse subset of covariates converges in probability to 1 as $n \to \infty$, or as $n \to \infty$ and $p \to \infty$. Very reasonable conditions are needed, and little restriction is placed on the class of possible subsets of covariates to achieve this consistency result.
Submitted 9 February, 2018; v1 submitted 23 February, 2017;
originally announced February 2017.
-
Higher order asymptotics of Generalized Fiducial Distribution
Authors:
Abhishek Pal Majumder,
Jan Hannig
Abstract:
Generalized Fiducial Inference (GFI) is motivated by R.A. Fisher's approach of obtaining posterior-like distributions when there is no prior information available for the unknown parameter. Without the use of Bayes' theorem GFI proposes a distribution on the parameter space using a technique called increasing precision asymptotics \cite{hannig2013generalized}. In this article we analyzed the regularity conditions under which the Generalized Fiducial Distribution (GFD) will be first and second order exact in a frequentist sense. We used a modification of an ingenious technique named "Shrinkage method" \cite{bickel1990decomposition}, which has been extensively used in the probability matching prior contexts, to find the higher order expansion of the frequentist coverage of Fiducial quantile. We identified when the higher order terms of one-sided coverage of Fiducial quantile will vanish and derived a workable recipe for obtaining such GFDs. These ideas are demonstrated on several examples.
Submitted 25 August, 2016;
originally announced August 2016.
-
A Note on Automatic Data Transformation
Authors:
Qing Feng,
Jan Hannig,
J. S. Marron
Abstract:
Modern data analysis frequently involves variables with highly non-Gaussian marginal distributions. However, commonly used analysis methods are most effective with roughly Gaussian data. This paper introduces an automatic transformation that improves the closeness of distributions to normality. For each variable, a new family of parametrizations of the shifted logarithm transformation is proposed, which is unique in treating the data as real-valued, and in allowing transformation for both left and right skewness within the single family. This also allows an automatic selection of the parameter value (which is crucial for high dimensional data with many variables to transform) by minimizing the Anderson-Darling test statistic of the transformed data. An application to image features extracted from melanoma microscopy slides demonstrates the utility of the proposed transformation in addressing data with excessive skewness, heteroscedasticity and influential observations.
Submitted 8 January, 2016;
originally announced January 2016.
-
Non-iterative Joint and Individual Variation Explained
Authors:
Qing Feng,
Jan Hannig,
J. S. Marron
Abstract:
Integrative analysis of disparate data blocks measured on a common set of experimental subjects is one major challenge in modern data analysis. This data structure naturally motivates the simultaneous exploration of the joint and individual variation within each data block resulting in new insights. For instance, there is a strong desire to integrate the multiple genomic data sets in The Cancer Genome Atlas (TCGA) to characterize the common and also the unique aspects of cancer genetics and cell biology for each source. In this paper we introduce Non-iterative Joint and Individual Variation Explained (Non-iterative JIVE), capturing both joint and individual variation within each data block. This is a major improvement over earlier approaches to this challenge in terms of a new conceptual understanding, much better adaption to data heterogeneity and a fast linear algebra computation. Important mathematical contributions are the use of score subspaces as the principal descriptors of variation structure and the use of perturbation theory as the guide for variation segmentation. This leads to a method which is robust against the heterogeneity among data blocks without a need for normalization. An application to TCGA data reveals different behaviors of each type of signal in characterizing tumor subtypes. An application to a mortality data set reveals interesting historical lessons.
Submitted 25 April, 2016; v1 submitted 13 December, 2015;
originally announced December 2015.
-
Source detection algorithms for dynamic contaminants based on the analysis of a hydrodynamic limit
Authors:
Sergio A. Almada Monter,
Amarjit Budhiraja,
Jan Hannig
Abstract:
In this work we propose and numerically analyze an algorithm for detection of a contaminant source using a dynamic sensor network. The algorithm is motivated using a global probabilistic optimization problem and is based on the analysis of the hydrodynamic limit of a discrete time evolution equation on the lattice under a suitable scaling of time and space. Numerical results illustrating the effectiveness of the algorithm are presented.
Submitted 19 October, 2015; v1 submitted 17 October, 2015;
originally announced October 2015.
-
Discussion of "On the Birnbaum Argument for the Strong Likelihood Principle"
Authors:
Jan Hannig
Abstract:
In this discussion we demonstrate that fiducial distributions provide a natural example of an inference paradigm that does not obey Strong Likelihood Principle while still satisfying the Weak Conditionality Principle. [arXiv:1302.7021]
Submitted 4 November, 2014;
originally announced November 2014.
-
Generalized Fiducial Inference for Ultrahigh Dimensional Regression
Authors:
Randy C. S. Lai,
Jan Hannig,
Thomas C. M. Lee
Abstract:
In recent years the ultrahigh dimensional linear regression problem has attracted enormous attention from the research community. Under the sparsity assumption most of the published work is devoted to the selection and estimation of the significant predictor variables. This paper studies a different but fundamentally important aspect of this problem: uncertainty quantification for parameter estimates and model choices. To be more specific, this paper proposes methods for deriving a probability density function on the set of all possible models, and also for constructing confidence intervals for the corresponding parameters. These proposed methods are developed using the generalized fiducial methodology, which is a variant of Fisher's controversial fiducial idea. Theoretical properties of the proposed methods are studied, and in particular it is shown that statistical inference based on the proposed methods will have exact asymptotic frequentist property. In terms of empirical performance, the proposed methods are tested by simulation experiments and an application to a real data set. Lastly this work can also be seen as an interesting and successful application of Fisher's fiducial idea to an important and contemporary problem. To the best of the authors' knowledge, this is the first time that the fiducial idea is being applied to a so-called "large p small n" problem.
Submitted 29 April, 2013;
originally announced April 2013.
-
The importance sampling technique for understanding rare events in Erdős-Rényi random graphs
Authors:
Shankar Bhamidi,
Jan Hannig,
Chia Ying Lee,
James Nolen
Abstract:
In dense Erdős-Rényi random graphs, we are interested in the events where large numbers of a given subgraph occur. The mean behavior of subgraph counts is known, and only recently were the related large deviations results discovered. Consequently, it is natural to ask: can one develop efficient numerical schemes to estimate the probability of an Erdős-Rényi graph containing an excessively large number of a fixed given subgraph? Using the large deviation principle we study an importance sampling scheme as a method to numerically compute the small probabilities of large triangle counts occurring within Erdős-Rényi graphs. We show that the exponential tilt suggested directly by the large deviation principle does not always yield an optimal scheme. The exponential tilt used in the importance sampling scheme comes from a generalized class of exponential random graphs. Asymptotic optimality, a measure of the efficiency of the importance sampling scheme, is achieved by a special choice of the parameters in the exponential random graph that makes it indistinguishable from an Erdős-Rényi graph conditioned to have many triangles in the large network limit. We show how this choice can be made for the conditioned Erdős-Rényi graphs both in the replica symmetric phase and in parts of the replica breaking phase to yield asymptotically optimal numerical schemes to estimate this rare event probability.
Submitted 2 April, 2014; v1 submitted 26 February, 2013;
originally announced February 2013.
-
Generalized fiducial inference for normal linear mixed models
Authors:
Jessi Cisewski,
Jan Hannig
Abstract:
While linear mixed modeling methods are foundational concepts introduced in any statistical education, adequate general methods for interval estimation involving models with more than a few variance components are lacking, especially in the unbalanced setting. Generalized fiducial inference provides a possible framework that accommodates this absence of methodology. Under the fabric of generalized fiducial inference along with sequential Monte Carlo methods, we present an approach for interval estimation for both balanced and unbalanced Gaussian linear mixed models. We compare the proposed method to classical and Bayesian results in the literature in a simulation study of two-fold nested models and two-factor crossed designs with an interaction term. The proposed method is found to be competitive or better when evaluated based on frequentist criteria of empirical coverage and average length of confidence intervals for small sample sizes. A MATLAB implementation of the proposed algorithm is available from the authors.
Submitted 6 November, 2012;
originally announced November 2012.
-
Continuum Limits of Markov Chains with Application to Network Modeling
Authors:
Yang Zhang,
Edwin K. P. Chong,
Jan Hannig,
Donald Estep
Abstract:
In this paper we investigate the continuum limits of a class of Markov chains. The investigation of such limits is motivated by the desire to model very large networks. We show that under some conditions, a sequence of Markov chains converges in some sense to the solution of a partial differential equation. Based on such convergence we approximate Markov chains modeling networks with a large number of components by partial differential equations. While traditional Monte Carlo simulation for very large networks is practically infeasible, partial differential equations can be solved with reasonable computational overhead using well-established mathematical tools.
Submitted 21 June, 2011;
originally announced June 2011.