Search | arXiv e-print repository

arXiv:2404.05155 [pdf, other]

On the price of exact truthfulness in incentive-compatible online learning with bandit feedback: A regret lower bound for WSU-UX

Authors: Ali Mortazavi, Junhao Lin, Nishant A. Mehta

Abstract: In one view of the classical game of prediction with expert advice with binary outcomes, in each round, each expert maintains an adversarially chosen belief and honestly reports this belief. We consider a recently introduced, strategic variant of this problem with selfish (reputation-seeking) experts, where each expert strategically reports in order to maximize their expected future reputation based on their belief. In this work, our goal is to design an algorithm for the selfish experts problem that is incentive-compatible (IC, or \emph{truthful}), meaning each expert's best strategy is to report truthfully, while also ensuring the algorithm enjoys sublinear regret with respect to the expert with the best belief. Freeman et al. (2020) recently studied this problem in the full information and bandit settings and obtained truthful, no-regret algorithms by leveraging prior work on wagering mechanisms. While their results under full information match the minimax rate for the classical ("honest experts") problem, the best-known regret for their bandit algorithm WSU-UX is $O(T^{2/3})$, which does not match the minimax rate for the classical ("honest bandits") setting. It was unclear whether the higher regret was an artifact of their analysis or a limitation of WSU-UX. We show, via explicit construction of loss sequences, that the algorithm suffers a worst-case $Ω(T^{2/3})$ lower bound. Left open is the possibility that a different IC algorithm obtains $O(\sqrt{T})$ regret. Yet, WSU-UX was a natural choice for such an algorithm owing to the limited design room for IC algorithms in this setting.

Submitted 7 April, 2024; originally announced April 2024.

Comments: Accepted to AISTATS 2024

arXiv:2403.01315 [pdf, ps, other]

Near-optimal Per-Action Regret Bounds for Sleeping Bandits

Authors: Quan Nguyen, Nishant A. Mehta

Abstract: We derive near-optimal per-action regret bounds for sleeping bandits, in which both the sets of available arms and their losses in every round are chosen by an adversary. In a setting with $K$ total arms and at most $A$ available arms in each round over $T$ rounds, the best known upper bound is $O(K\sqrt{TA\ln{K}})$, obtained indirectly via minimizing internal sleeping regrets. Compared to the minimax $Ω(\sqrt{TA})$ lower bound, this upper bound contains an extra multiplicative factor of $K\ln{K}$. We address this gap by directly minimizing the per-action regret using generalized versions of EXP3, EXP3-IX and FTRL with Tsallis entropy, thereby obtaining near-optimal bounds of order $O(\sqrt{TA\ln{K}})$ and $O(\sqrt{T\sqrt{AK}})$. We extend our results to the setting of bandits with advice from sleeping experts, generalizing EXP4 along the way. This leads to new proofs for a number of existing adaptive and tracking regret bounds for standard non-sleeping bandits. Extending our results to the bandit version of experts that report their confidences leads to new bounds for the confidence regret that depends primarily on the sum of experts' confidences. We prove a lower bound, showing that for any minimax optimal algorithm, there exists an action whose regret is sublinear in $T$ but linear in the number of its active rounds.

Submitted 29 May, 2024; v1 submitted 2 March, 2024; originally announced March 2024.

Comments: V2: corrected Theorem 8 (FTARL's high probability bound) from log(1/delta) to log(K/delta)

arXiv:2305.04093 [pdf, other]

An improved regret analysis for UCB-N and TS-N

Authors: Nishant A. Mehta

Abstract: In the setting of stochastic online learning with undirected feedback graphs, Lykouris et al. (2020) previously analyzed the pseudo-regret of the upper confidence bound-based algorithm UCB-N and the Thompson Sampling-based algorithm TS-N. In this note, we show how to improve their pseudo-regret analysis. Our improvement involves refining a key lemma of the previous analysis, allowing a $\log(T)$ factor to be replaced by a factor $\log_2(α) + 3$ for $α$ the independence number of the feedback graph.

Submitted 6 May, 2023; originally announced May 2023.

Comments: 5 pages

arXiv:2301.04268 [pdf, other]

Adversarial Online Multi-Task Reinforcement Learning

Authors: Quan Nguyen, Nishant A. Mehta

Abstract: We consider the adversarial online multi-task reinforcement learning setting, where in each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models. The learner's objective is to minimize its regret with respect to the optimal policy for each task. We assume the MDPs in $\mathcal{M}$ are well-separated under a notion of $λ$-separability, and show that this notion generalizes many task-separability notions from previous works. We prove a minimax lower bound of $Ω(K\sqrt{DSAH})$ on the regret of any learning algorithm and an instance-specific lower bound of $Ω(\frac{K}{λ^2})$ in sample complexity for a class of uniformly-good cluster-then-learn algorithms. We use a novel construction called 2-JAO MDP for proving the instance-specific lower bound. The lower bounds are complemented with a polynomial time algorithm that obtains $\tilde{O}(\frac{K}{λ^2})$ sample complexity guarantee for the clustering phase and $\tilde{O}(\sqrt{MK})$ regret guarantee for the learning phase, indicating that the dependency on $K$ and $\frac{1}{λ^2}$ is tight.

Submitted 10 January, 2023; originally announced January 2023.

Comments: To appear at the 34th International Conference on Algorithmic Learning Theory (ALT 2023)

arXiv:2106.12688 [pdf, other]

Best-Case Lower Bounds in Online Learning

Authors: Cristóbal Guzmán, Nishant A. Mehta, Ali Mortazavi

Abstract: Much of the work in online learning focuses on the study of sublinear upper bounds on the regret. In this work, we initiate the study of best-case lower bounds in online convex optimization, wherein we bound the largest improvement an algorithm can obtain relative to the single best action in hindsight. This problem is motivated by the goal of better understanding the adaptivity of a learning algorithm. Another motivation comes from fairness: it is known that best-case lower bounds are instrumental in obtaining algorithms for decision-theoretic online learning (DTOL) that satisfy a notion of group fairness. Our contributions are a general method to provide best-case lower bounds in Follow The Regularized Leader (FTRL) algorithms with time-varying regularizers, which we use to show that best-case lower bounds are of the same order as existing upper regret bounds: this includes situations with a fixed learning rate, decreasing learning rates, timeless methods, and adaptive gradient methods. In stark contrast, we show that the linearized version of FTRL can attain negative linear regret. Finally, in DTOL with two experts and binary predictions, we fully characterize the best-case sequences, which provides a finer understanding of the best-case lower bounds.

Submitted 23 June, 2021; originally announced June 2021.

Comments: 28 pages

arXiv:2102.07929 [pdf, other]

Near-Optimal Algorithms for Differentially Private Online Learning in a Stochastic Environment

Authors: Bingshan Hu, Zhiming Huang, Nishant A. Mehta, Nidhi Hegde

Abstract: In this paper, we study differentially private online learning problems in a stochastic environment under both bandit and full information feedback. For differentially private stochastic bandits, we propose both UCB and Thompson Sampling-based algorithms that are anytime and achieve the optimal $O \left(\sum_{j: Δ_j>0} \frac{\ln(T)}{\min \left\{Δ_j, ε\right\}} \right)$ instance-dependent regret bound, where $T$ is the finite learning horizon, $Δ_j$ denotes the suboptimality gap between the optimal arm and a suboptimal arm $j$, and $ε$ is the required privacy parameter. For the differentially private full information setting with stochastic rewards, we show an $Ω\left(\frac{\ln(K)}{\min \left\{Δ_{\min}, ε\right\}} \right)$ instance-dependent regret lower bound and an $Ω\left(\sqrt{T\ln(K)} + \frac{\ln(K)}ε\right)$ minimax lower bound, where $K$ is the total number of actions and $Δ_{\min}$ denotes the minimum suboptimality gap among all the suboptimal actions. For the same differentially private full information setting, we also present an $ε$-differentially private algorithm whose instance-dependent regret and worst-case regret match our respective lower bounds up to an extra $\log(T)$ factor.

Submitted 30 May, 2024; v1 submitted 15 February, 2021; originally announced February 2021.

Comments: 40 pages. New in v3: (i) Removed Hybrid-UCB (although its analysis is correct to our knowledge); (ii) Added Lazy-DP-TS from UAI 2022 paper of Hu and Hegde (2022)

arXiv:2003.03456 [pdf, other]

A Farewell to Arms: Sequential Reward Maximization on a Budget with a Giving Up Option

Authors: P Sharoff, Nishant A. Mehta, Ravi Ganti

Abstract: We consider a sequential decision-making problem where an agent can take one action at a time and each action has a stochastic temporal extent, i.e., a new action cannot be taken until the previous one is finished. Upon completion, the chosen action yields a stochastic reward. The agent seeks to maximize its cumulative reward over a finite time budget, with the option of "giving up" on a current action -- hence forfeiting any reward -- in order to choose another action. We cast this problem as a variant of the stochastic multi-armed bandits problem with stochastic consumption of resource. For this problem, we first establish that the optimal arm is the one that maximizes the ratio of the expected reward of the arm to the expected waiting time before the agent sees the reward due to pulling that arm. Using a novel upper confidence bound on this ratio, we then introduce an upper-confidence-based algorithm, WAIT-UCB, for which we establish a logarithmic, problem-dependent regret bound which has an improved dependence on problem parameters compared to previous works. Simulations on various problem configurations comparing WAIT-UCB against the state-of-the-art algorithms are also presented.

Submitted 6 March, 2020; originally announced March 2020.

Comments: 16 pages, AISTATS 2020

arXiv:1910.13521 [pdf, other]

Dying Experts: Efficient Algorithms with Optimal Regret Bounds

Authors: Hamid Shayestehmanesh, Sajjad Azami, Nishant A. Mehta

Abstract: We study a variant of decision-theoretic online learning in which the set of experts that are available to Learner can shrink over time. This is a restricted version of the well-studied sleeping experts problem, itself a generalization of the fundamental game of prediction with expert advice. Similar to many works in this direction, our benchmark is the ranking regret. Various results suggest that achieving optimal regret in the fully adversarial sleeping experts problem is computationally hard. This motivates our relaxation where any expert that goes to sleep will never again wake up. We call this setting "dying experts" and study it in two different cases: the case where the learner knows the order in which the experts will die and the case where the learner does not. In both cases, we provide matching upper and lower bounds on the ranking regret in the fully adversarial setting. Furthermore, we present new, computationally efficient algorithms that obtain our optimal upper bounds.

Submitted 29 October, 2019; originally announced October 2019.

Comments: 18 Pages, NeurIPS 2019

arXiv:1802.09680 [pdf, other]

Multi-Observation Regression

Authors: Rafael Frongillo, Nishant A. Mehta, Tom Morgan, Bo Waggoner

Abstract: Recent work introduced loss functions which measure the error of a prediction based on multiple simultaneous observations or outcomes. In this paper, we explore the theoretical and practical questions that arise when using such multi-observation losses for regression on data sets of $(x,y)$ pairs. When a loss depends on only one observation, the average empirical loss decomposes by applying the loss to each pair, but for the multi-observation case, empirical loss is not even well-defined, and the possibility of statistical guarantees is unclear without several $(x,y)$ pairs with exactly the same $x$ value. We propose four algorithms formalizing the concept of empirical risk minimization for this problem, two of which have statistical guarantees in settings allowing both slow and fast convergence rates, but which are out-performed empirically by the other two. Empirical results demonstrate practicality of these algorithms in low-dimensional settings, while lower bounds demonstrate intrinsic difficulty in higher dimensions. Finally, we demonstrate the potential benefit of the algorithms over natural baselines that use traditional single-observation losses via both lower bounds and simulations.

Submitted 26 February, 2018; originally announced February 2018.

Comments: 28 pages

arXiv:1710.07732 [pdf, other]

A Tight Excess Risk Bound via a Unified PAC-Bayesian-Rademacher-Shtarkov-MDL Complexity

Authors: Peter D. Grünwald, Nishant A. Mehta

Abstract: We present a novel notion of complexity that interpolates between and generalizes some classic existing complexity notions in learning theory: for estimators like empirical risk minimization (ERM) with arbitrary bounded losses, it is upper bounded in terms of data-independent Rademacher complexity; for generalized Bayesian estimators, it is upper bounded by the data-dependent information complexity (also known as stochastic or PAC-Bayesian, $\mathrm{KL}(\text{posterior} \operatorname{\|} \text{prior})$ complexity). For (penalized) ERM, the new complexity reduces to (generalized) normalized maximum likelihood (NML) complexity, i.e. a minimax log-loss individual-sequence regret. Our first main result bounds excess risk in terms of the new complexity. Our second main result links the new complexity via Rademacher complexity to $L_2(P)$ entropy, thereby generalizing earlier results of Opper, Haussler, Lugosi, and Cesa-Bianchi who did the log-loss case with $L_\infty$. Together, these results recover optimal bounds for VC- and large (polynomial entropy) classes, replacing localized Rademacher complexity by a simpler analysis which almost completely separates the two aspects that determine the achievable rates: 'easiness' (Bernstein) conditions and model complexity.

Submitted 20 October, 2017; originally announced October 2017.

Comments: 38 pages

arXiv:1609.03319 [pdf, other]

CompAdaGrad: A Compressed, Complementary, Computationally-Efficient Adaptive Gradient Method

Authors: Nishant A. Mehta, Alistair Rendell, Anish Varghese, Christfried Webers

Abstract: The adaptive gradient online learning method known as AdaGrad has seen widespread use in the machine learning community in stochastic and adversarial online learning problems and more recently in deep learning methods. The method's full-matrix incarnation offers much better theoretical guarantees and potentially better empirical performance than its diagonal version; however, this version is computationally prohibitive and so the simpler diagonal version often is used in practice. We introduce a new method, CompAdaGrad, that navigates the space between these two schemes and show that this method can yield results much better than diagonal AdaGrad while avoiding the (effectively intractable) $O(n^3)$ computational complexity of full-matrix AdaGrad for dimension $n$. CompAdaGrad essentially performs full-matrix regularization in a low-dimensional subspace while performing diagonal regularization in the complementary subspace. We derive CompAdaGrad's updates for composite mirror descent in case of the squared $\ell_2$ norm and the $\ell_1$ norm, demonstrate that its complexity per iteration is linear in the dimension, and establish guarantees for the method independent of the choice of composite regularizer. Finally, we show preliminary results on several datasets.

Submitted 4 October, 2016; v1 submitted 12 September, 2016; originally announced September 2016.

Comments: only updated acknowledgements

arXiv:1605.01288 [pdf, other]

Fast rates with high probability in exp-concave statistical learning

Authors: Nishant A. Mehta

Abstract: We present an algorithm for the statistical learning setting with a bounded exp-concave loss in $d$ dimensions that obtains excess risk $O(d \log(1/δ)/n)$ with probability at least $1 - δ$. The core technique is to boost the confidence of recent in-expectation $O(d/n)$ excess risk bounds for empirical risk minimization (ERM), without sacrificing the rate, by leveraging a Bernstein condition which holds due to exp-concavity. We also show that with probability $1 - δ$ the standard ERM method obtains excess risk $O(d (\log(n) + \log(1/δ))/n)$. We further show that a regret bound for any online learner in this setting translates to a high probability excess risk bound for the corresponding online-to-batch conversion of the online learner. Lastly, we present two high probability bounds for the exp-concave model selection aggregation problem that are quantile-adaptive in a certain sense. The first bound is a purely exponential weights type algorithm, obtains a nearly optimal rate, and has no explicit dependence on the Lipschitz continuity of the loss. The second bound requires Lipschitz continuity but obtains the optimal rate.

Submitted 14 October, 2016; v1 submitted 4 May, 2016; originally announced May 2016.

Comments: added results on model selection aggregation (Section 7)

arXiv:1605.00252 [pdf, other]

Fast Rates for General Unbounded Loss Functions: from ERM to Generalized Bayes

Authors: Peter D. Grünwald, Nishant A. Mehta

Abstract: We present new excess risk bounds for general unbounded loss functions including log loss and squared loss, where the distribution of the losses may be heavy-tailed. The bounds hold for general estimators, but they are optimized when applied to $η$-generalized Bayesian, MDL, and empirical risk minimization estimators. In the case of log loss, the bounds imply convergence rates for generalized Bayesian inference under misspecification in terms of a generalization of the Hellinger metric as long as the learning rate $η$ is set correctly. For general loss functions, our bounds rely on two separate conditions: the $v$-GRIP (generalized reversed information projection) conditions, which control the lower tail of the excess loss; and the newly introduced witness condition, which controls the upper tail. The parameter $v$ in the $v$-GRIP conditions determines the achievable rate and is akin to the exponent in the Tsybakov margin condition and the Bernstein condition for bounded losses, which the $v$-GRIP conditions generalize; favorable $v$ in combination with small model complexity leads to $\tilde{O}(1/n)$ rates. The witness condition allows us to connect the excess risk to an "annealed" version thereof, by which we generalize several previous results connecting Hellinger and Rényi divergence to KL divergence.

Submitted 5 November, 2019; v1 submitted 1 May, 2016; originally announced May 2016.

Comments: accepted to JMLR pending minor final modifications

arXiv:1507.02592 [pdf, other]

Fast rates in statistical and online learning

Authors: Tim van Erven, Peter D. Grünwald, Nishant A. Mehta, Mark D. Reid, Robert C. Williamson

Abstract: The speed with which a learning algorithm converges as it is presented with more data is a central problem in machine learning --- a fast rate of convergence means less data is needed for the same level of performance. The pursuit of fast rates in online and statistical learning has led to the discovery of many conditions in learning theory under which fast learning is possible. We show that most of these conditions are special cases of a single, unifying condition, that comes in two forms: the central condition for 'proper' learning algorithms that always output a hypothesis in the given model, and stochastic mixability for online algorithms that may make predictions outside of the model. We show that under surprisingly weak assumptions both conditions are, in a certain sense, equivalent. The central condition has a re-interpretation in terms of convexity of a set of pseudoprobabilities, linking it to density estimation under misspecification. For bounded losses, we show how the central condition enables a direct proof of fast rates and we prove its equivalence to the Bernstein condition, itself a generalization of the Tsybakov margin condition, both of which have played a central role in obtaining fast rates in statistical learning. Yet, while the Bernstein condition is two-sided, the central condition is one-sided, making it more suitable to deal with unbounded losses. In its stochastic mixability form, our condition generalizes both a stochastic exp-concavity condition identified by Juditsky, Rigollet and Tsybakov and Vovk's notion of mixability. Our unifying conditions thus provide a substantial step towards a characterization of fast rates in statistical learning, similar to how classical mixability characterizes constant regret in the sequential prediction with expert advice setting.

Submitted 1 September, 2015; v1 submitted 9 July, 2015; originally announced July 2015.

Comments: 69 pages, 3 figures

Journal ref: Journal of Machine Learning Research 16(54):1793-1861, 2015

arXiv:1406.3781 [pdf, other]

From Stochastic Mixability to Fast Rates

Authors: Nishant A. Mehta, Robert C. Williamson

Abstract: Empirical risk minimization (ERM) is a fundamental learning rule for statistical learning problems where the data is generated according to some unknown distribution $\mathsf{P}$ and returns a hypothesis $f$ chosen from a fixed class $\mathcal{F}$ with small loss $\ell$. In the parametric setting, depending upon $(\ell, \mathcal{F},\mathsf{P})$ ERM can have slow $(1/\sqrt{n})$ or fast $(1/n)$ rates of convergence of the excess risk as a function of the sample size $n$. There exist several results that give sufficient conditions for fast rates in terms of joint properties of $\ell$, $\mathcal{F}$, and $\mathsf{P}$, such as the margin condition and the Bernstein condition. In the non-statistical prediction with expert advice setting, there is an analogous slow and fast rate phenomenon, and it is entirely characterized in terms of the mixability of the loss $\ell$ (there being no role there for $\mathcal{F}$ or $\mathsf{P}$). The notion of stochastic mixability builds a bridge between these two models of learning, reducing to classical mixability in a special case. The present paper presents a direct proof of fast rates for ERM in terms of stochastic mixability of $(\ell,\mathcal{F}, \mathsf{P})$, and in so doing provides new insight into the fast-rates phenomenon. The proof exploits an old result of Kemperman on the solution to the general moment problem. We also show a partial converse that suggests a characterization of fast rates for ERM in terms of stochastic mixability is possible.

Submitted 22 November, 2014; v1 submitted 14 June, 2014; originally announced June 2014.

Comments: 21 pages, accepted to NIPS 2014

arXiv:1210.6293 [pdf, ps, other]

MLPACK: A Scalable C++ Machine Learning Library

Authors: Ryan R. Curtin, James R. Cline, N. P. Slagle, William B. March, Parikshit Ram, Nishant A. Mehta, Alexander G. Gray

Abstract: MLPACK is a state-of-the-art, scalable, multi-platform C++ machine learning library released in late 2011 offering both a simple, consistent API accessible to novice users and high performance and flexibility to expert users by leveraging modern features of C++. MLPACK provides cutting-edge algorithms whose benchmarks exhibit far better performance than other leading machine learning libraries. MLPACK version 1.0.3, licensed under the LGPL, is available at http://www.mlpack.org.

Submitted 23 October, 2012; originally announced October 2012.

Comments: Submitted to JMLR MLOSS (http://jmlr.csail.mit.edu/mloss/)

Journal ref: Journal of Machine Learning Research 14 (2013) 801-805

arXiv:1209.2784 [pdf, other]

Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

Authors: Nishant A. Mehta, Dongryeol Lee, Alexander G. Gray

Abstract: Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks' empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging.

Submitted 13 September, 2012; originally announced September 2012.

Comments: appearing at NIPS 2012

arXiv:1202.4050 [pdf, other]

On the Sample Complexity of Predictive Sparse Coding

Authors: Nishant A. Mehta, Alexander G. Gray

Abstract: The goal of predictive sparse coding is to learn a representation of examples as sparse linear combinations of elements from a dictionary, such that a learned hypothesis linear in the new representation performs well on a predictive task. Predictive sparse coding algorithms recently have demonstrated impressive performance on a variety of supervised tasks, but their generalization properties have not been studied. We establish the first generalization error bounds for predictive sparse coding, covering two settings: 1) the overcomplete setting, where the number of features k exceeds the original dimensionality d; and 2) the high or infinite-dimensional setting, where only dimension-free bounds are useful. Both learning bounds intimately depend on stability properties of the learned sparse encoder, as measured on the training sample. Consequently, we first present a fundamental stability result for the LASSO, a result characterizing the stability of the sparse codes with respect to perturbations to the dictionary. In the overcomplete setting, we present an estimation error bound that decays as $\tilde{O}(\sqrt{d k/m})$ with respect to d and k. In the high or infinite-dimensional setting, we show a dimension-free bound that is $\tilde{O}(\sqrt{k^2 s/m})$ with respect to k and s, where s is an upper bound on the number of non-zeros in the sparse code for any training data point.

Submitted 7 October, 2012; v1 submitted 17 February, 2012; originally announced February 2012.

Comments: Sparse Coding Stability Theorem from version 1 has been relaxed considerably using a new notion of coding margin. Old Sparse Coding Stability Theorem still in new version, now as Theorem 2. Presentation of all proofs simplified/improved considerably. Paper reorganized. Empirical analysis showing new coding margin is non-trivial on real datasets.

arXiv:1005.0188 [pdf, other]

Generative and Latent Mean Map Kernels

Authors: Nishant A. Mehta, Alexander G. Gray

Abstract: We introduce two kernels that extend the mean map, which embeds probability measures in Hilbert spaces. The generative mean map kernel (GMMK) is a smooth similarity measure between probabilistic models. The latent mean map kernel (LMMK) generalizes the non-iid formulation of Hilbert space embeddings of empirical distributions in order to incorporate latent variable models. When comparing certain classes of distributions, the GMMK exhibits beneficial regularization and generalization properties not shown for previous generative kernels. We present experiments comparing support vector machine performance using the GMMK and LMMK between hidden Markov models to the performance of other methods on discrete and continuous observation sequence data. The results suggest that, in many cases, the GMMK has generalization error competitive with or better than other methods.

Submitted 3 May, 2010; originally announced May 2010.

Comments: 16 pages, 1 figure, 1 table

Showing 1–19 of 19 results for author: Mehta, N A