-
Smoothness Adaptive Hypothesis Transfer Learning
Authors:
Haotian Lin,
Matthew Reimherr
Abstract:
Many existing two-phase kernel-based hypothesis transfer learning algorithms employ the same kernel regularization across phases and rely on the known smoothness of functions to obtain optimality. Therefore, they fail to adapt to the varying and unknown smoothness between the target/source and their offset in practice. In this paper, we address these problems by proposing Smoothness Adaptive Transfer Learning (SATL), a two-phase kernel ridge regression (KRR)-based algorithm. We first prove that employing the misspecified fixed bandwidth Gaussian kernel in target-only KRR learning can achieve minimax optimality and derive an adaptive procedure to the unknown Sobolev smoothness. Leveraging these results, SATL employs Gaussian kernels in both phases so that the estimators can adapt to the unknown smoothness of the target/source and their offset function. We derive the minimax lower bound of the learning problem in excess risk and show that SATL enjoys a matching upper bound up to a logarithmic factor. The minimax convergence rate sheds light on the factors influencing transfer dynamics and demonstrates the superiority of SATL compared to non-transfer learning settings. While our main objective is a theoretical analysis, we also conduct several experiments to confirm our results.
Submitted 22 February, 2024;
originally announced February 2024.
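Illustration (not from the paper): the abstract above describes a two-phase, offset-based use of kernel ridge regression with Gaussian kernels. The sketch below shows only the generic two-phase offset idea, assuming one-dimensional inputs, a single fixed bandwidth, and hand-picked regularization values. The function names (`gaussian_kernel`, `krr_fit`, `offset_transfer`) and all parameter values are hypothetical, and the paper's adaptive selection procedure is not reproduced.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    # Gaussian kernel matrix between 1-D input arrays a (n,) and b (m,).
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def krr_fit(x, y, bandwidth, lam):
    # Kernel ridge regression: solve (K + n * lam * I) alpha = y.
    n = len(x)
    K = gaussian_kernel(x, x, bandwidth)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    # Return a predictor that evaluates the fitted function at new inputs.
    return lambda x_new: gaussian_kernel(x_new, x, bandwidth) @ alpha

def offset_transfer(xs, ys, xt, yt, bandwidth=0.2, lam_source=1e-3, lam_offset=1e-2):
    # Phase 1: fit the source function on the (typically larger) source sample.
    f_source = krr_fit(xs, ys, bandwidth, lam_source)
    # Phase 2: fit the offset on target residuals (assumed values for lam_*).
    residual = yt - f_source(xt)
    f_offset = krr_fit(xt, residual, bandwidth, lam_offset)
    # Final estimator: source fit plus estimated offset.
    return lambda x_new: f_source(x_new) + f_offset(x_new)
```

Here the second phase regularizes only the offset, so a smooth or small offset can in principle be learned from few target samples; how the regularization and smoothness are chosen adaptively is the subject of the paper and is not shown.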
-
Differentially Private Synthetic Heavy-tailed Data
Authors:
Tran Tran,
Matthew Reimherr,
Aleksandra Slavković
Abstract:
The U.S. Census Longitudinal Business Database (LBD) product contains employment and payroll information of all U.S. establishments and firms dating back to 1976 and is an invaluable resource for economic research. However, the sensitive information in LBD requires confidentiality measures that the U.S. Census in part addressed by releasing a synthetic version (SynLBD) of the data to protect firms' privacy while ensuring its usability for research activities, but without provable privacy guarantees. In this paper, we propose using the framework of differential privacy (DP) that offers strong provable privacy protection against arbitrary adversaries to generate synthetic heavy-tailed data with a formal privacy guarantee while preserving high levels of utility. We propose using the K-Norm Gradient Mechanism (KNG) with quantile regression for DP synthetic data generation. The proposed methodology offers the flexibility of the well-known exponential mechanism while adding less noise. We propose implementing KNG in a stepwise and sandwich order, such that new quantile estimation relies on previously sampled quantiles, to more efficiently use the privacy-loss budget. Generating synthetic heavy-tailed data with a formal privacy guarantee while preserving high levels of utility is a challenging problem for data curators and researchers. However, we show that the proposed methods can achieve better data utility relative to the original KNG at the same privacy-loss budget through a simulation study and an application to the Synthetic Longitudinal Business Database.
Submitted 14 October, 2023; v1 submitted 5 September, 2023;
originally announced September 2023.
-
Pure Differential Privacy for Functional Summaries via a Laplace-like Process
Authors:
Haotian Lin,
Matthew Reimherr
Abstract:
Many existing mechanisms to achieve differential privacy (DP) on infinite-dimensional functional summaries often involve embedding these summaries into finite-dimensional subspaces and applying traditional DP techniques. Such mechanisms generally treat each dimension uniformly and struggle with complex, structured summaries. This work introduces a novel mechanism for DP functional summary release: the Independent Component Laplace Process (ICLP) mechanism. This mechanism treats the summaries of interest as truly infinite-dimensional objects, thereby addressing several limitations of existing mechanisms. We establish the feasibility of the proposed mechanism in multiple function spaces. Several statistical estimation problems are considered, and we demonstrate one can enhance the utility of sanitized summaries by oversmoothing their non-private counterpart. Numerical experiments on synthetic and real datasets demonstrate the efficacy of the proposed mechanism.
Submitted 3 March, 2024; v1 submitted 31 August, 2023;
originally announced September 2023.
-
FAStEN: an efficient adaptive method for feature selection and estimation in high-dimensional functional regressions
Authors:
Tobia Boschi,
Lorenzo Testa,
Francesca Chiaromonte,
Matthew Reimherr
Abstract:
Functional regression analysis is an established tool for many contemporary scientific applications. Regression problems involving large and complex data sets are ubiquitous, and feature selection is crucial for avoiding overfitting and achieving accurate predictions. We propose a new, flexible and ultra-efficient approach to perform feature selection in a sparse high dimensional function-on-function regression problem, and we show how to extend it to the scalar-on-function framework. Our method, called FAStEN, combines functional data, optimization, and machine learning techniques to perform feature selection and parameter estimation simultaneously. We exploit the properties of Functional Principal Components and the sparsity inherent to the Dual Augmented Lagrangian problem to significantly reduce computational cost, and we introduce an adaptive scheme to improve selection accuracy. In addition, we derive asymptotic oracle properties, which guarantee estimation and selection consistency for the proposed FAStEN estimator. Through an extensive simulation study, we benchmark our approach to the best existing competitors and demonstrate a massive gain in terms of CPU time and selection performance, without sacrificing the quality of the coefficients' estimation. The theoretical derivations and the simulation study provide a strong motivation for our approach. Finally, we present an application to brain fMRI data from the AOMIC PIOP1 study.
Submitted 4 September, 2023; v1 submitted 26 March, 2023;
originally announced March 2023.
-
Shape And Structure Preserving Differential Privacy
Authors:
Carlos Soto,
Karthik Bharath,
Matthew Reimherr,
Aleksandra Slavkovic
Abstract:
It is common for data structures such as images and shapes of 2D objects to be represented as points on a manifold. The utility of a mechanism to produce sanitized differentially private estimates from such data is intimately linked to how compatible it is with the underlying structure and geometry of the space. In particular, as recently shown, utility of the Laplace mechanism on a positively curved manifold, such as Kendall's 2D shape space, is significantly influenced by the curvature. Focusing on the problem of sanitizing the Fréchet mean of a sample of points on a manifold, we exploit the characterisation of the mean as the minimizer of an objective function comprised of the sum of squared distances and develop a K-norm gradient mechanism on Riemannian manifolds that favors values that produce gradients close to the zero of the objective function. For the case of positively curved manifolds, we describe how using the gradient of the squared distance function offers better control over sensitivity than the Laplace mechanism, and demonstrate this numerically on a dataset of shapes of corpus callosa. Further illustrations of the mechanism's utility on a sphere and the manifold of symmetric positive definite matrices are also presented.
Submitted 21 September, 2022;
originally announced September 2022.
-
On Hypothesis Transfer Learning of Functional Linear Models
Authors:
Haotian Lin,
Matthew Reimherr
Abstract:
We study transfer learning (TL) for functional linear regression (FLR) under the Reproducing Kernel Hilbert Space (RKHS) framework, observing that the TL techniques in existing high-dimensional linear regression are not compatible with the truncation-based FLR methods, as functional data are intrinsically infinite-dimensional and generated by smooth underlying processes. We measure the similarity across tasks using RKHS distance, allowing the type of information being transferred to be tied to the properties of the imposed RKHS. Building on the hypothesis offset transfer learning paradigm, two algorithms are proposed: one conducts the transfer when positive sources are known, while the other leverages aggregation techniques to achieve robust transfer without prior information about the sources. We establish lower bounds for this learning problem and show the proposed algorithms enjoy a matching asymptotic upper bound. These analyses provide statistical insights into factors that contribute to the dynamics of the transfer. We also extend the results to functional generalized linear models. The effectiveness of the proposed algorithms is demonstrated on extensive synthetic data as well as a financial data application.
Submitted 22 February, 2024; v1 submitted 9 June, 2022;
originally announced June 2022.
-
Exact Privacy Guarantees for Markov Chain Implementations of the Exponential Mechanism with Artificial Atoms
Authors:
Jeremy Seeman,
Matthew Reimherr,
Aleksandra Slavkovic
Abstract:
Implementations of the exponential mechanism in differential privacy often require sampling from intractable distributions. When approximate procedures like Markov chain Monte Carlo (MCMC) are used, the end result incurs costs to both privacy and accuracy. Existing work has examined these effects asymptotically, but implementable finite sample results are needed in practice so that users can specify privacy budgets in advance and implement samplers with exact privacy guarantees. In this paper, we use tools from ergodic theory and perfect simulation to design exact finite runtime sampling algorithms for the exponential mechanism by introducing an intermediate modified target distribution using artificial atoms. We propose an additional modification of this sampling algorithm that maintains its $ε$-DP guarantee and has improved runtime at the cost of some utility. We then compare these methods in scenarios where we can explicitly calculate a $δ$ cost (as in $(ε, δ)$-DP) incurred when using standard MCMC techniques. Much as there is a well known trade-off between privacy and utility, we demonstrate that there is also a trade-off between privacy guarantees and runtime.
Submitted 3 April, 2022;
originally announced April 2022.
-
Formal Privacy for Partially Private Data
Authors:
Jeremy Seeman,
Matthew Reimherr,
Aleksandra Slavkovic
Abstract:
Differential privacy (DP) quantifies privacy loss by analyzing noise injected into output statistics. For non-trivial statistics, this noise is necessary to ensure finite privacy loss. However, data curators frequently release collections of statistics where some use DP mechanisms and others are released as-is, i.e., without additional randomized noise. Consequently, DP alone cannot characterize the privacy loss attributable to the entire collection of releases. In this paper, we present a privacy formalism, $(ε, \{ Θ_z\}_{z \in \mathcal{Z}})$-Pufferfish ($ε$-TP for short when $\{ Θ_z\}_{z \in \mathcal{Z}}$ is implied), a collection of Pufferfish mechanisms indexed by realizations of a random variable $Z$ representing public information not protected with DP noise. First, we prove that this definition has similar properties to DP. Next, we introduce mechanisms for releasing partially private data (PPD) satisfying $ε$-TP and prove their desirable properties. We provide algorithms for sampling from the posterior of a parameter given PPD. We then compare this inference approach to the alternative where noisy statistics are deterministically combined with Z. We derive mild conditions under which using our algorithms offers both theoretical and computational improvements over this more common approach. Finally, we demonstrate all the effects above on a case study on COVID-19 data.
Submitted 14 December, 2022; v1 submitted 3 April, 2022;
originally announced April 2022.
-
Differential Privacy Over Riemannian Manifolds
Authors:
Matthew Reimherr,
Karthik Bharath,
Carlos Soto
Abstract:
In this work we consider the problem of releasing a differentially private statistical summary that resides on a Riemannian manifold. We present an extension of the Laplace or K-norm mechanism that utilizes intrinsic distances and volumes on the manifold. We also consider in detail the specific case where the summary is the Fréchet mean of data residing on a manifold. We demonstrate that our mechanism is rate optimal and depends only on the dimension of the manifold, not on the dimension of any ambient space, while also showing how ignoring the manifold structure can decrease the utility of the sanitized summary. We illustrate our framework in two examples of particular interest in statistics: the space of symmetric positive definite matrices, which is used for covariance matrices, and the sphere, which can be used as a space for modeling discrete distributions.
Submitted 3 November, 2021;
originally announced November 2021.
-
Modern Non-Linear Function-on-Function Regression
Authors:
Aniruddha Rajendra Rao,
Matthew Reimherr
Abstract:
We introduce a new class of non-linear function-on-function regression models for functional data using neural networks. We propose a framework using a hidden layer consisting of continuous neurons, called a continuous hidden layer, for functional response modeling and give two model fitting strategies, Functional Direct Neural Network (FDNN) and Functional Basis Neural Network (FBNN). Both are designed explicitly to exploit the structure inherent in functional data and capture the complex relations existing between the functional predictors and the functional response. We fit these models by deriving functional gradients and implement regularization techniques for more parsimonious results. We demonstrate the power and flexibility of our proposed method in handling complex functional models through extensive simulation studies as well as real data examples.
Submitted 7 October, 2023; v1 submitted 29 July, 2021;
originally announced July 2021.
-
Non-linear Functional Modeling using Neural Networks
Authors:
Aniruddha Rajendra Rao,
Matthew Reimherr
Abstract:
We introduce a new class of non-linear models for functional data based on neural networks. Deep learning has been very successful in non-linear modeling, but there has been little work done in the functional data setting. We propose two variations of our framework: a functional neural network with continuous hidden layers, called the Functional Direct Neural Network (FDNN), and a second version that utilizes basis expansions and continuous hidden layers, called the Functional Basis Neural Network (FBNN). Both are designed explicitly to exploit the structure inherent in functional data. To fit these models we derive a functional gradient based optimization algorithm. The effectiveness of the proposed methods in handling complex functional models is demonstrated by comprehensive simulation studies and real data examples.
Submitted 3 May, 2023; v1 submitted 19 April, 2021;
originally announced April 2021.
-
Modern Multiple Imputation with Functional Data
Authors:
Aniruddha Rajendra Rao,
Matthew Reimherr
Abstract:
This work considers the problem of fitting functional models with sparsely and irregularly sampled functional data. It overcomes the limitations of the state-of-the-art methods, which face major challenges in the fitting of more complex non-linear models. Currently, many of these models cannot be consistently estimated unless the number of observed points per curve grows sufficiently quickly with the sample size, whereas we show numerically that a modified approach with more modern multiple imputation methods can produce better estimates in general. We also propose a new imputation approach that combines the ideas of MissForest with Local Linear Forest and compare their performance with PACE and several other multivariate multiple imputation methods. This work is motivated by a longitudinal study on smoking cessation, in which the Electronic Health Records (EHR) from Penn State PaTH to Health allow for the collection of a great deal of data, with highly variable sampling. To illustrate our approach, we explore the relation between relapse and diastolic blood pressure. We also consider a variety of simulation schemes with varying levels of sparsity to validate our methods.
Submitted 24 November, 2020;
originally announced November 2020.
-
An Efficient Semi-smooth Newton Augmented Lagrangian Method for Elastic Net
Authors:
Tobia Boschi,
Matthew Reimherr,
Francesca Chiaromonte
Abstract:
Feature selection is an important and active research area in statistics and machine learning. The Elastic Net is often used to perform selection when the features present non-negligible collinearity or practitioners wish to incorporate additional known structure. In this article, we propose a new Semi-smooth Newton Augmented Lagrangian Method to efficiently solve the Elastic Net in ultra-high dimensional settings. Our new algorithm exploits both the sparsity induced by the Elastic Net penalty and the sparsity due to the second order information of the augmented Lagrangian. This greatly reduces the computational cost of the problem. Using simulations on both synthetic and real datasets, we demonstrate that our approach outperforms its best competitors by at least an order of magnitude in terms of CPU time. We also apply our approach to a Genome Wide Association Study on childhood obesity.
Submitted 6 June, 2020;
originally announced June 2020.
-
Fast and Fair Simultaneous Confidence Bands for Functional Parameters
Authors:
Dominik Liebl,
Matthew Reimherr
Abstract:
Quantifying uncertainty using confidence regions is a central goal of statistical inference. Despite this, methodologies for confidence bands in Functional Data Analysis are still underdeveloped compared to estimation and hypothesis testing. In this work, we present a new methodology for constructing simultaneous confidence bands for functional parameter estimates. Our bands possess a number of positive qualities: (1) they are not based on resampling and thus are fast to compute, (2) they are constructed under the fairness constraint of balanced false positive rates across partitions of the bands' domain which facilitates the typical global, but also novel local interpretations, and (3) they do not require an estimate of the full covariance function and thus can be used in the case of fragmentary functional data. Simulations show the excellent finite-sample behavior of our bands in comparison to existing alternatives. The practical use of our bands is demonstrated in two case studies on sports biomechanics and fragmentary growth curves.
Submitted 11 November, 2022; v1 submitted 30 September, 2019;
originally announced October 2019.
-
Adaptive Function-on-Scalar Regression with a Smoothing Elastic Net
Authors:
Ardalan Mirshani,
Matthew Reimherr
Abstract:
This paper presents a new methodology, called AFSSEN, to simultaneously select significant predictors and produce smooth estimates in a high-dimensional function-on-scalar linear model with sub-Gaussian errors. Outcomes are assumed to lie in a general real separable Hilbert space, H, while parameters lie in a subspace known as a Cameron-Martin space, K, which is closely related to Reproducing Kernel Hilbert Spaces, so that parameter estimates inherit particular properties, such as smoothness or periodicity, without enforcing such properties on the data. We propose a regularization method in the style of an adaptive Elastic Net penalty that involves mixing two types of functional norms, providing fine-tuned control of both the smoothing and variable selection in the estimated model. Asymptotic theory is provided in the form of a functional oracle property, and the paper concludes with a simulation study demonstrating the advantage of using AFSSEN over existing methods in terms of prediction error and variable selection.
Submitted 23 May, 2019;
originally announced May 2019.
-
KNG: The K-Norm Gradient Mechanism
Authors:
Matthew Reimherr,
Jordan Awan
Abstract:
This paper presents a new mechanism for producing sanitized statistical summaries that achieve differential privacy, called the K-Norm Gradient Mechanism, or KNG. This new approach maintains the strong flexibility of the exponential mechanism, while achieving the powerful utility performance of objective perturbation. KNG starts with an inherent objective function (often an empirical risk), and promotes summaries that are close to minimizing the objective by weighting according to how far the gradient of the objective function is from zero. Working with the gradient instead of the original objective function allows for additional flexibility as one can penalize using different norms. We show that, unlike the exponential mechanism, the noise added by KNG is asymptotically negligible compared to the statistical error for many problems. In addition to theoretical guarantees on privacy and utility, we confirm the utility of KNG empirically in the settings of linear and quantile regression through simulations.
Submitted 2 August, 2021; v1 submitted 22 May, 2019;
originally announced May 2019.
-
Benefits and Pitfalls of the Exponential Mechanism with Applications to Hilbert Spaces and Functional PCA
Authors:
Jordan Awan,
Ana Kenney,
Matthew Reimherr,
Aleksandra Slavković
Abstract:
The exponential mechanism is a fundamental tool of Differential Privacy (DP) due to its strong privacy guarantees and flexibility. We study its extension to settings with summaries based on infinite dimensional outputs such as with functional data analysis, shape analysis, and nonparametric statistics. We show that one can design the mechanism with respect to a specific base measure over the output space, such as a Gaussian process. We provide a positive result that establishes a Central Limit Theorem for the exponential mechanism quite broadly. We also provide an apparent negative result, showing that the magnitude of the noise introduced for privacy is asymptotically non-negligible relative to the statistical estimation error. We develop an $ε$-DP mechanism for functional principal component analysis, applicable in separable Hilbert spaces. We demonstrate its performance via simulations and applications to two datasets.
Submitted 30 January, 2019;
originally announced January 2019.
-
Highly Irregular Functional Generalized Linear Regression with Electronic Health Records
Authors:
Justin Petrovich,
Matthew Reimherr,
Carrie Daymont
Abstract:
This work presents a new approach, called MISFIT, for fitting generalized functional linear regression models with sparsely and irregularly sampled data. Current methods do not allow for consistent estimation unless one assumes that the number of observed points per curve grows sufficiently quickly with the sample size. In contrast, MISFIT is based on a multiple imputation framework, which has the potential to produce consistent estimates without such an assumption. Just as importantly, it propagates the uncertainty of not having completely observed curves, allowing for a more accurate assessment of the uncertainty of parameter estimates, something that most methods currently cannot accomplish. This work is motivated by a longitudinal study on macrocephaly, or atypically large head size, in which electronic medical records allow for the collection of a great deal of data. However, the sampling is highly variable from child to child. Using MISFIT we are able to clearly demonstrate that the development of pathologic conditions related to macrocephaly is associated with both the overall head circumference of the children as well as the velocity of their head growth.
Submitted 4 October, 2019; v1 submitted 22 May, 2018;
originally announced May 2018.
-
Manifold Data Analysis with Applications to High-Frequency 3D Imaging
Authors:
Hyun Bin Kang,
Matthew Reimherr,
Mark Shriver,
Peter Claes
Abstract:
Many scientific areas are faced with the challenge of extracting information from large, complex, and highly structured data sets. A great deal of modern statistical work focuses on developing tools for handling such data. This paper presents a new subfield of functional data analysis, FDA, which we call Manifold Data Analysis, or MDA. MDA is concerned with the statistical analysis of samples where one or more variables measured on each unit is a manifold, thus resulting in as many manifolds as we have units. We propose a framework that converts manifolds into functional objects, an efficient 2-step functional principal component method, and a manifold-on-scalar regression model. This work is motivated by an anthropological application involving 3D facial imaging data, which is discussed extensively throughout the paper. The proposed framework is used to understand how individual characteristics, such as age and genetic ancestry, influence the shape of the human face.
Submitted 4 October, 2017;
originally announced October 2017.
-
A Geometric Approach to Confidence Regions and Bands for Functional Parameters
Authors:
Hyunphil Choi,
Matthew Reimherr
Abstract:
Functional data analysis, FDA, is now a well-established discipline of statistics, with its core concepts and perspectives in place. Despite this, there are still fundamental statistical questions which have received relatively little attention. One of these is the systematic construction of confidence regions for functional parameters. This work is concerned with developing, understanding, and visualizing such regions. We provide a general strategy for constructing confidence regions in a real separable Hilbert space using hyper-ellipsoids and hyper-rectangles. We then propose specific implementations which work especially well in practice. They provide powerful hypothesis tests and useful visualization tools without using any simulation. We also demonstrate the negative result that nearly all regions, including our own, have zero coverage when working with empirical covariances. To overcome this challenge we propose a new paradigm for evaluating confidence regions by showing that the distance between an estimated region and the desired region (with proper coverage) tends to zero faster than the regions shrink to a point. We call this phenomenon ghosting and refer to the empirical regions as ghost regions. We illustrate the proposed methods in a simulation study and an application to fractional anisotropy tract profile data.
Submitted 10 August, 2016; v1 submitted 26 July, 2016;
originally announced July 2016.
-
A randomness test for functional panels
Authors:
Piotr Kokoszka,
Matthew Reimherr,
Nikolas Wölfing
Abstract:
Functional panels are collections of functional time series, and arise often in the study of high-frequency multivariate data. We develop a portmanteau-style test to determine if the cross-sections of such a panel are independent and identically distributed. Our framework allows the number of functional projections and/or the number of time series to grow with the sample size. A large sample justification is based on a new central limit theorem for random vectors of increasing dimension. With a proper normalization, the limit is standard normal, potentially making this result easily applicable in other FDA contexts in which projections on a subspace of increasing dimension are used. The test is shown to have correct size and excellent power using simulated panels whose random structure mimics the realistic dependence encountered in real panel data. It is expected to find application in climatology, finance, ecology, economics, and geophysics. We apply it to Southern Pacific sea surface temperature data, precipitation patterns in the South-West United States, and temperature curves in Germany.
Submitted 10 July, 2016; v1 submitted 9 October, 2015;
originally announced October 2015.
-
Testing separability of space–time functional processes
Authors:
Panayiotis Constantinou,
Piotr Kokoszka,
Matthew Reimherr
Abstract:
We present a new methodology and accompanying theory to test for separability of spatio-temporal functional data. In spatio-temporal statistics, separability is a common simplifying assumption concerning the covariance structure which, if true, can greatly increase estimation accuracy and inferential power. While our focus is on testing for the separation of space and time in spatio-temporal data, our methods can be applied to any area where separability is useful, including biomedical imaging. We present three tests, one being a functional extension of the Monte Carlo likelihood method of Mitchell et al. (2005), while the other two are based on quadratic forms. Our tests are based on asymptotic distributions of maximum likelihood estimators, and do not require Monte Carlo or bootstrap replications. The specification of the joint asymptotic distribution of these estimators is the main theoretical contribution of this paper. It can be used to derive many other tests. The main methodological finding is that one of the quadratic form methods, which we call a norm approach, emerges as a clear winner in terms of finite sample performance in nearly every setting we considered. The norm approach focuses directly on the Frobenius distance between the spatio-temporal covariance function and its separable approximation. We demonstrate the efficacy of our methods via simulations and an application to Irish wind data.
Submitted 23 September, 2015;
originally announced September 2015.
-
Prior sample size extensions for assessing prior impact and prior–likelihood discordance
Authors:
Matthew Reimherr,
Xiao-Li Meng,
Dan L. Nicolae
Abstract:
This paper outlines a framework for quantifying the prior's contribution to posterior inference in the presence of prior-likelihood discordance, a broader concept than the usual notion of prior-likelihood conflict. We achieve this dual purpose by extending the classic notion of \textit{prior sample size}, $M$, in three directions: (I) estimating $M$ beyond conjugate families; (II) formulating $M$ as a relative notion, i.e., as a function of the likelihood sample size $k, M(k),$ which also leads naturally to a graphical diagnosis; and (III) permitting negative $M$, as a measure of prior-likelihood conflict, i.e., harmful discordance. Our asymptotic regime permits the prior sample size to grow with the likelihood data size, hence making asymptotic arguments meaningful for investigating the impact of the prior relative to that of likelihood. It leads to a simple asymptotic formula for quantifying the impact of a proper prior that only involves computing a centrality and a spread measure of the prior and the posterior. We use simulated and real data to illustrate the potential of the proposed framework, including quantifying how weak is a "weakly informative" prior adopted in a study of lupus nephritis. Whereas we take a pragmatic perspective in assessing the impact of a prior on a given inference problem under a specific evaluative metric, we also touch upon conceptual and theoretical issues such as using improper priors and permitting priors with asymptotically non-vanishing influence.
Submitted 7 January, 2021; v1 submitted 23 June, 2014;
originally announced June 2014.
-
A functional data analysis approach for genetic association studies
Authors:
Matthew Reimherr,
Dan Nicolae
Abstract:
We present a new method based on Functional Data Analysis (FDA) for detecting associations between one or more scalar covariates and a longitudinal response, while correcting for other variables. Our methods exploit the temporal structure of longitudinal data in ways that are otherwise difficult with a multivariate approach. Our procedure, from an FDA perspective, is a departure from more established methods in two key aspects. First, the raw longitudinal phenotypes are assembled into functional trajectories prior to analysis. Second, we explore an association test that is not directly based on principal components. We instead focus on quantifying the reduction in $L^2$ variability as a means of detecting associations. Our procedure is motivated by longitudinal genome-wide association studies and, in particular, the Childhood Asthma Management Program (CAMP), which explores the long-term effects of daily asthma treatments. We conduct a simulation study to better understand the advantages (and/or disadvantages) of an FDA approach compared to a traditional multivariate one. We then apply our methodology to data coming from CAMP. We find a potentially new association with a SNP negatively affecting lung function. Furthermore, this SNP seems to have an interaction effect with one of the treatments.
Submitted 29 April, 2014;
originally announced April 2014.
-
On Quantifying Dependence: A Framework for Developing Interpretable Measures
Authors:
Matthew Reimherr,
Dan L. Nicolae
Abstract:
We present a framework for selecting and developing measures of dependence when the goal is the quantification of a relationship between two variables, not simply the establishment of its existence. Much of the literature on dependence measures is focused, at least implicitly, on detection or revolves around the inclusion/exclusion of particular axioms and discussing which measures satisfy said axioms. In contrast, we start with only a few nonrestrictive guidelines focused on existence, range and interpretability, which provide a very open and flexible framework. For quantification, the most crucial is the notion of interpretability, whose foundation can be found in the work of Goodman and Kruskal [Measures of Association for Cross Classifications (1979) Springer], and whose importance can be seen in the popularity of tools such as the $R^2$ in linear regression. While Goodman and Kruskal focused on probabilistic interpretations for their measures, we demonstrate how more general measures of information can be used to achieve the same goal. To that end, we present a strategy for building dependence measures that is designed to allow practitioners to tailor measures to their needs. We demonstrate how many well-known measures fit in with our framework and conclude the paper by presenting two real data examples. Our first example explores U.S. income and education where we demonstrate how this methodology can help guide the selection and development of a dependence measure. Our second example examines measures of dependence for functional data, and illustrates them using data on geomagnetic storms.
Submitted 21 February, 2013;
originally announced February 2013.