
Showing 1–19 of 19 results for author: Slawski, M

  1. arXiv:2405.20149  [pdf, other]

    stat.ME

    Accounting for Mismatch Error in Small Area Estimation with Linked Data

    Authors: Enrico Fabrizi, Nicola Salvati, Martin Slawski

    Abstract: In small area estimation different data sources are integrated in order to produce reliable estimates of target parameters (e.g., a mean or a proportion) for a collection of small subsets (areas) of a finite population. Regression models such as the linear mixed effects model or M-quantile regression are often used to improve the precision of survey sample estimates by leveraging auxiliary informa…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: total: 46 pages, main: 33 pages, references: 4 pages, technical appendix: 9 pages

  2. arXiv:2306.00909  [pdf, other]

    stat.ME

    A General Framework for Regression with Mismatched Data Based on Mixture Modeling

    Authors: Martin Slawski, Brady T. West, Priyanjali Bukke, Guoqing Diao, Zhenbang Wang, Emanuel Ben-David

    Abstract: Data sets obtained from linking multiple files are frequently affected by mismatch error, as a result of non-unique or noisy identifiers used during record linkage. Accounting for such mismatch error in downstream analysis performed on the linked file is critical to ensure valid statistical inference. In this paper, we present a general framework to enable valid post-linkage inference in the chall…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: 34 pages not counting references and appendix

  3. arXiv:2203.04689  [pdf, other]

    stat.ME stat.AP stat.CO

    Tensor Completion for Causal Inference with Multivariate Longitudinal Data: A Reevaluation of COVID-19 Mandates

    Authors: Jonathan Auerbach, Martin Slawski, Shixue Zhang

    Abstract: We propose a new method that uses tensor completion to estimate causal effects with multivariate longitudinal data, data in which multiple outcomes are observed for each unit and time period. Our motivation is to estimate the number of COVID-19 fatalities prevented by government mandates such as travel restrictions, mask-wearing directives, and vaccination requirements. In addition to COVID-19 fat…

    Submitted 19 November, 2023; v1 submitted 9 March, 2022; originally announced March 2022.

  4. arXiv:2201.03528  [pdf, other]

    math.ST cs.LG stat.ML

    Permuted and Unlinked Monotone Regression in $\mathbb{R}^d$: an approach based on mixture modeling and optimal transport

    Authors: Martin Slawski, Bodhisattva Sen

    Abstract: Suppose that we have a regression problem with response variable Y in $\mathbb{R}^d$ and predictor X in $\mathbb{R}^d$, for $d \geq 1$. In permuted or unlinked regression we have access to separate unordered data on X and Y, as opposed to data on (X,Y)-pairs in usual regression. So far in the literature the case $d=1$ has received attention, see e.g., the recent papers by Rigollet and Weed [Inform…

    Submitted 10 January, 2022; originally announced January 2022.

    Comments: 38 pages, 6 figures

  5. arXiv:2111.01767  [pdf, other]

    stat.ML cs.LG stat.ME

    Regularization for Shuffled Data Problems via Exponential Family Priors on the Permutation Group

    Authors: Zhenbang Wang, Emanuel Ben-David, Martin Slawski

    Abstract: In the analysis of data sets consisting of (X, Y)-pairs, a tacit assumption is that each pair corresponds to the same observation unit. If, however, such pairs are obtained via record linkage of two files, this assumption can be violated as a result of mismatch error rooted, for example, in the lack of reliable identifiers in the two files. Recently, there has been a surge of interest in this set…

    Submitted 2 November, 2021; originally announced November 2021.

    Comments: 25 pages, 5 figures

  6. arXiv:2010.00181  [pdf, other]

    stat.ME

    Estimation in exponential family Regression based on linked data contaminated by mismatch error

    Authors: Zhenbang Wang, Emanuel Ben-David, Martin Slawski

    Abstract: Identification of matching records in multiple files can be a challenging and error-prone task. Linkage error can considerably affect subsequent statistical analysis based on the resulting linked file. Several recent papers have studied post-linkage linear regression analysis with the response variable in one file and the covariates in a second file from the perspective of the "Broken Sample Probl…

    Submitted 26 October, 2020; v1 submitted 30 September, 2020; originally announced October 2020.

    Comments: 51 pages, 7 figures

  7. arXiv:1910.01623  [pdf, other]

    stat.ME cs.LG stat.ML

    A Pseudo-Likelihood Approach to Linear Regression with Partially Shuffled Data

    Authors: Martin Slawski, Guoqing Diao, Emanuel Ben-David

    Abstract: Recently, there has been significant interest in linear regression in the situation where predictors and responses are not observed in matching pairs corresponding to the same statistical unit as a consequence of separate data collection and uncertainty in data integration. Mismatched pairs can considerably impact the model fit and disrupt the estimation of regression parameters. In this paper, we…

    Submitted 3 October, 2019; originally announced October 2019.

    Comments: 31 pages

  8. arXiv:1909.02496  [pdf, ps, other]

    cs.IT cs.LG stat.ML

    The Benefits of Diversity: Permutation Recovery in Unlabeled Sensing from Multiple Measurement Vectors

    Authors: Hang Zhang, Martin Slawski, Ping Li

    Abstract: In "Unlabeled Sensing", one observes a set of linear measurements of an underlying signal with incomplete or missing information about their ordering, which can be modeled in terms of an unknown permutation. Previous work on the case of a single noisy measurement vector has exposed two main challenges: 1) a high requirement concerning the signal-to-noise ratio (SNR), i.e., approximately…

    Submitted 11 July, 2020; v1 submitted 5 September, 2019; originally announced September 2019.

  9. arXiv:1907.07148  [pdf, other]

    stat.ML cs.IT cs.LG stat.ME

    A Two-Stage Approach to Multivariate Linear Regression with Sparsely Mismatched Data

    Authors: Martin Slawski, Emanuel Ben-David, Ping Li

    Abstract: A tacit assumption in linear regression is that (response, predictor)-pairs correspond to identical observational units. A series of recent works have studied scenarios in which this assumption is violated under terms such as "Unlabeled Sensing" and "Regression with Unknown Permutation". In this paper, we study the setup of multiple response variables and a notion of mismatches that generalizes…

    Submitted 28 June, 2020; v1 submitted 16 July, 2019; originally announced July 2019.

  10. arXiv:1805.06915  [pdf, other]

    stat.CO stat.ME stat.ML

    A Note on Coding and Standardization of Categorical Variables in (Sparse) Group Lasso Regression

    Authors: Felicitas J. Detmer, Martin Slawski

    Abstract: Categorical regressor variables are usually handled by introducing a set of indicator variables, and imposing a linear constraint to ensure identifiability in the presence of an intercept, or equivalently, using one of various coding schemes. As proposed in Yuan and Lin [J. R. Statist. Soc. B, 68 (2006), 49-67], the group lasso is a natural and computationally convenient approach to perform variab…

    Submitted 17 May, 2018; originally announced May 2018.

  11. arXiv:1710.06030  [pdf, other]

    math.ST stat.ME stat.ML

    Linear Regression with Sparsely Permuted Data

    Authors: Martin Slawski, Emanuel Ben-David

    Abstract: In regression analysis of multivariate data, it is tacitly assumed that response and predictor variables in each observed response-predictor pair correspond to the same entity or unit. In this paper, we consider the situation of "permuted data" in which this basic correspondence has been lost. Several recent papers have considered this situation without further assumptions on the underlying permut…

    Submitted 15 November, 2017; v1 submitted 16 October, 2017; originally announced October 2017.

  12. arXiv:1709.08104  [pdf, other]

    math.ST stat.ML

    On Principal Components Regression, Random Projections, and Column Subsampling

    Authors: Martin Slawski

    Abstract: Principal Components Regression (PCR) is a traditional tool for dimension reduction in linear regression that has been both criticized and defended. One concern about PCR is that obtaining the leading principal components tends to be computationally demanding for large data sets. While random projections do not possess the optimality properties of the leading principal subspace, they are computati…

    Submitted 7 October, 2017; v1 submitted 23 September, 2017; originally announced September 2017.

  13. arXiv:1607.02649  [pdf, ps, other]

    cs.IT stat.ME

    Linear signal recovery from $b$-bit-quantized linear measurements: precise analysis of the trade-off between bit depth and number of measurements

    Authors: Martin Slawski, Ping Li

    Abstract: We consider the problem of recovering a high-dimensional structured signal from independent Gaussian linear measurements each of which is quantized to $b$ bits. Our interest is in linear approaches to signal recovery, where "linear" means that non-linearity resulting from quantization is ignored and the observations are treated as if they arose from a linear measurement model. Specifically, the fo…

    Submitted 9 July, 2016; originally announced July 2016.

  14. arXiv:1605.00507  [pdf, ps, other]

    stat.ME cs.LG

    Methods for Sparse and Low-Rank Recovery under Simplex Constraints

    Authors: Ping Li, Syama Sundar Rangapuram, Martin Slawski

    Abstract: The de-facto standard approach of promoting sparsity by means of $\ell_1$-regularization becomes ineffective in the presence of simplex constraints, i.e., the target is known to have non-negative entries summing up to a given constant. The situation is analogous for the use of nuclear norm regularization for low-rank recovery of Hermitian positive semidefinite matrices with given trace. In the pre…

    Submitted 2 May, 2016; originally announced May 2016.

  15. arXiv:1504.06305  [pdf, ps, other]

    stat.ML cs.LG stat.ME

    Regularization-free estimation in trace regression with symmetric positive semidefinite matrices

    Authors: Martin Slawski, Ping Li, Matthias Hein

    Abstract: Over the past few years, trace regression models have received considerable attention in the context of matrix completion, quantum state tomography, and compressed sensing. Estimation of the underlying matrix from regularization-based approaches promoting low-rankedness, notably nuclear norm regularization, has enjoyed great popularity. In the present paper, we argue that such regularization may…

    Submitted 23 April, 2015; originally announced April 2015.

  16. arXiv:1404.6640  [pdf, ps, other]

    math.ST stat.ML

    Estimation of positive definite M-matrices and structure learning for attractive Gaussian Markov Random fields

    Authors: Martin Slawski, Matthias Hein

    Abstract: Consider a random vector with finite second moments. If its precision matrix is an M-matrix, then all partial correlations are non-negative. If that random vector is additionally Gaussian, the corresponding Markov random field (GMRF) is called attractive. We study estimation of M-matrices taking the role of inverse second moment or precision matrices using sign-constrained log-determinant divergen…

    Submitted 26 April, 2014; originally announced April 2014.

    Comments: long version of a manuscript accepted for publication in Linear Algebra and its Applications

  17. arXiv:1401.6024  [pdf, ps, other]

    stat.ML cs.LG

    Matrix factorization with Binary Components

    Authors: Martin Slawski, Matthias Hein, Pavlo Lutsik

    Abstract: Motivated by an application in computational biology, we consider low-rank matrix factorization with $\{0,1\}$-constraints on one of the factors and optionally convex constraints on the second one. In addition to the non-convexity shared with other matrix factorization schemes, our problem is further complicated by a combinatorial constraint set of size $2^{m \cdot r}$, where $m$ is the dimension…

    Submitted 23 January, 2014; originally announced January 2014.

    Comments: appeared in NIPS 2013

  18. arXiv:1205.0953  [pdf, ps, other]

    math.ST stat.ML

    Non-negative least squares for high-dimensional linear models: consistency and sparse recovery without regularization

    Authors: Martin Slawski, Matthias Hein

    Abstract: Least squares fitting is in general not useful for high-dimensional linear models, in which the number of predictors is of the same or even larger order of magnitude than the number of samples. Theory developed in recent years has coined a paradigm according to which sparsity-promoting regularization is regarded as a necessity in such setting. Deviating from this paradigm, we show that non-negativ…

    Submitted 12 February, 2014; v1 submitted 4 May, 2012; originally announced May 2012.

    Comments: major revision

    Journal ref: Electronic Journal of Statistics, 7(0):3004-3056, 2013

  19. Feature selection guided by structural information

    Authors: Martin Slawski, Wolfgang zu Castell, Gerhard Tutz

    Abstract: In generalized linear regression problems with an abundant number of features, lasso-type regularization which imposes an $\ell^1$-constraint on the regression coefficients has become a widely established technique. Deficiencies of the lasso in certain scenarios, notably strongly correlated design, were unmasked when Zou and Hastie [J. Roy. Statist. Soc. Ser. B 67 (2005) 301--320] introduced the e…

    Submitted 10 November, 2010; originally announced November 2010.

    Comments: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org), available at http://dx.doi.org/10.1214/09-AOAS302

    Report number: IMS-AOAS-AOAS302

    Journal ref: Annals of Applied Statistics 2010, Vol. 4, No. 2, 1056-1080