
An adaptive transfer learning perspective on classification in non-stationary environments

Abstract: We consider a semi-supervised classification problem with non-stationary label-shift in which we observe a labelled data set followed by a sequence of unlabelled covariate vectors in which the marginal probabilities of the class labels may change over time. Our objective is to predict the corresponding class-label for each covariate vector, without ever observing the ground-truth labels, beyond the initial labelled data set. Previous work has demonstrated the potential of sophisticated variants of online gradient descent to perform competitively with the optimal dynamic strategy (Bai et al. 2022). In this work we explore an alternative approach grounded in statistical methods for adaptive transfer learning. We demonstrate the merits of this alternative methodology by establishing a high-probability regret bound on the test error at any given individual test-time, which adapts automatically to the unknown dynamics of the marginal label probabilities. Furthermore, we give bounds on the average dynamic regret which match the average guarantees of the online learning perspective for any given time interval.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2403.16651

A short proof of the Dvoretzky--Kiefer--Wolfowitz--Massart inequality

Authors: Henry W J Reeve

Abstract: The Dvoretzky--Kiefer--Wolfowitz--Massart inequality gives a sub-Gaussian tail bound on the supremum norm distance between the empirical distribution function of a random sample and its population counterpart. We provide a short proof of a result that improves the existing bound in two respects. First, our one-sided bound holds without any restrictions on the failure probability, thereby verifying a conjecture of Birnbaum and McCarty (1958). Second, it is local in the sense that it holds uniformly over sub-intervals of the real line with an error rate that adapts to the behaviour of the population distribution function on the interval.

Submitted 25 March, 2024; originally announced March 2024.

MSC Class: 62G30
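
Note: for orientation, the classical two-sided form of the inequality states that $\Pr(\sup_x |F_n(x) - F(x)| > \varepsilon) \le 2\exp(-2n\varepsilon^2)$ for every $\varepsilon > 0$. The sketch below is not taken from the paper and does not implement its improved one-sided or local bound; it only illustrates the classical statement by computing the band half-width for a chosen failure probability and the sup-norm distance for a Uniform(0,1) sample. The function names `dkw_epsilon` and `sup_distance_to_uniform` are hypothetical, chosen for this illustration.

```python
import math
import random


def dkw_epsilon(n: int, delta: float) -> float:
    """Half-width eps solving 2 * exp(-2 * n * eps**2) = delta.

    This is the classical two-sided bound, not the refined bound of the paper.
    """
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def sup_distance_to_uniform(sample):
    """Sup-norm distance between the empirical CDF and F(x) = x on [0, 1].

    The supremum is attained at a sample point, either just before or
    at the jump of the empirical CDF, so checking both sides suffices.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, assumed only for reproducibility
    n, delta = 1000, 0.05
    sample = [rng.random() for _ in range(n)]
    print("observed sup distance:", sup_distance_to_uniform(sample))
    print("DKW half-width at delta = 0.05:", dkw_epsilon(n, delta))
```

With probability at least $1 - \delta$ over the sample, the observed distance should fall below the printed half-width.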

arXiv:2305.04852

Isotonic subgroup selection

Authors: Manuel M. Müller, Henry W. J. Reeve, Timothy I. Cannings, Richard J. Samworth

Abstract: Given a sample of covariate-response pairs, we consider the subgroup selection problem of identifying a subset of the covariate domain where the regression function exceeds a pre-determined threshold. We introduce a computationally-feasible approach for subgroup selection in the context of multivariate isotonic regression based on martingale tests and multiple testing procedures for logically-structured hypotheses. Our proposed procedure satisfies a non-asymptotic, uniform Type I error rate guarantee with power that attains the minimax optimal rate up to poly-logarithmic factors. Extensions cover classification, isotonic quantile regression and heterogeneous treatment effect settings. Numerical studies on both simulated and real data confirm the practical effectiveness of our proposal, which is implemented in the R package ISS.

Submitted 28 June, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

Comments: 69 pages, 20 figures

MSC Class: 62G08; 62H15

arXiv:2109.01077

Optimal subgroup selection

Authors: Henry W. J. Reeve, Timothy I. Cannings, Richard J. Samworth

Abstract: In clinical trials and other applications, we often see regions of the feature space that appear to exhibit interesting behaviour, but it is unclear whether these observed phenomena are reflected at the population level. Focusing on a regression setting, we consider the subgroup selection challenge of identifying a region of the feature space on which the regression function exceeds a pre-determined threshold. We formulate the problem as one of constrained optimisation, where we seek a low-complexity, data-dependent selection set on which, with a guaranteed probability, the regression function is uniformly at least as large as the threshold; subject to this constraint, we would like the region to contain as much mass under the marginal feature distribution as possible. This leads to a natural notion of regret, and our main contribution is to determine the minimax optimal rate for this regret in both the sample size and the Type I error probability. The rate involves a delicate interplay between parameters that control the smoothness of the regression function, as well as exponents that quantify the extent to which the optimal selection set at the population level can be approximated by families of well-behaved subsets. Finally, we expand the scope of our previous results by illustrating how they may be generalised to a treatment and control setting, where interest lies in the heterogeneous treatment effect.

Submitted 20 September, 2023; v1 submitted 2 September, 2021; originally announced September 2021.

Comments: 65 pages, 2 figures, to appear in the Annals of Statistics

MSC Class: 62-XX; 62G08; 62Gxx; 62C20

arXiv:2106.04455

Adaptive transfer learning

Authors: Henry W. J. Reeve, Timothy I. Cannings, Richard J. Samworth

Abstract: In transfer learning, we wish to make inference about a target population when we have access to data both from the distribution itself, and from a different but related source distribution. We introduce a flexible framework for transfer learning in the context of binary classification, allowing for covariate-dependent relationships between the source and target distributions that are not required to preserve the Bayes decision boundary. Our main contributions are to derive the minimax optimal rates of convergence (up to poly-logarithmic factors) in this problem, and show that the optimal rate can be achieved by an algorithm that adapts to key aspects of the unknown transfer relationship, as well as the smoothness and tail parameters of our distributional classes. This optimal rate turns out to have several regimes, depending on the interplay between the relative sample sizes and the strength of the transfer relationship, and our algorithm achieves optimality by careful, decision tree-based calibration of local nearest-neighbour procedures.

Submitted 8 June, 2021; originally announced June 2021.

MSC Class: 62G05

arXiv:2106.01092

Statistical optimality conditions for compressive ensembles

Authors: Henry W. J. Reeve, Ata Kaban

Abstract: We present a framework for the theoretical analysis of ensembles of low-complexity empirical risk minimisers trained on independent random compressions of high-dimensional data. First we introduce a general distribution-dependent upper-bound on the excess risk, framed in terms of a natural notion of compressibility. This bound is independent of the dimension of the original data representation, and explains the in-built regularisation effect of the compressive approach. We then instantiate this general bound to classification and regression tasks, considering Johnson-Lindenstrauss mappings as the compression scheme. For each of these tasks, our strategy is to develop a tight upper bound on the compressibility function, and by doing so we discover distributional conditions of geometric nature under which the compressive algorithm attains minimax-optimal rates up to at most poly-logarithmic factors. In the case of compressive classification, this is achieved with a mild geometric margin condition along with a flexible moment condition that is significantly more general than the assumption of bounded domain. In the case of regression with strongly convex smooth loss functions we find that compressive regression is capable of exploiting spectral decay with near-optimal guarantees. In addition, a key ingredient for our central upper bound is a high probability uniform upper bound on the integrated deviation of dependent empirical processes, which may be of independent interest.

Submitted 2 June, 2021; originally announced June 2021.

MSC Class: 62-08

arXiv:1302.0954

doi 10.1017/S0013091514000066

A Frostman type lemma for sets with large intersections, and an application to Diophantine approximation

Authors: Tomas Persson, Henry W. J. Reeve

Abstract: We consider classes $\mathscr{G}^s ([0,1])$ of subsets of $[0,1]$, originally introduced by Falconer, that are closed under countable intersections, and such that every set in the class has Hausdorff dimension at least $s$. We provide a Frostman type lemma to determine if a limsup-set is in such a class. Suppose $E = \limsup E_n \subset [0,1]$, and that $μ_n$ are probability measures with support in $E_n$. If there is a constant $C$ such that \[\iint|x-y|^{-s}\, \mathrm{d}μ_n(x)\mathrm{d}μ_n(y)<C\] for all $n$, then under suitable conditions on the limit measure of the sequence $(μ_n)$, we prove that the set $E$ is in the class $\mathscr{G}^s ([0,1])$. As an application we prove that for $α> 1$ and almost all $λ\in (\frac{1}{2},1)$ the set \[ E_λ(α) = \{\,x\in[0,1] : |x - s_n| < 2^{-αn} \text{infinitely often}\ \}\] where $s_n \in \{\,(1-λ)\sum_{k=0}^na_kλ^k$ and $a_k\in\{0,1\}\,\}$, belongs to the class $\mathscr{G}^s$ for $s \leq \frac{1}α$. This improves one of our previous results.

Submitted 11 September, 2017; v1 submitted 5 February, 2013; originally announced February 2013.

Comments: 1+20 pages; Erratum added

MSC Class: 11J83; 11K55; 28A78; 28A80

Journal ref: Proceedings of the Edinburgh Mathematical Society, Volume 58, Issue 02, June 2015, 521--542

arXiv:1202.4904

doi 10.1112/S0025579312001076

On the Diophantine properties of lambda-expansions

Authors: Tomas Persson, Henry W. J. Reeve

Abstract: For $λ\in (1/2, 1)$ and $α$, we consider sets of numbers $x$ such that for infinitely many $n$, $x$ is $2^{-αn}$-close to some $\sum_{i=1}^n ω_i λ^i$, where $ω_i \in \{0,1\}$. These sets are in Falconer's intersection classes for Hausdorff dimension $s$ for some $s$ such that $- \frac{1}α \frac{\log λ}{\log 2} \leq s \leq \frac{1}α$. We show that for almost all $λ\in (1/2, 2/3)$, the upper bound of $s$ is optimal, but for a countable infinity of values of $λ$ the lower bound is the best possible result.

Submitted 22 February, 2012; originally announced February 2012.

Comments: 21 pages

MSC Class: 11J83; 28A78

Showing 1–8 of 8 results for author: Reeve, H W J