Search | arXiv e-print repository

To Switch or Not to Switch? Balanced Policy Switching in Offline Reinforcement Learning

Authors: Tao Ma, Xuzhi Yang, Zoltan Szabo

Abstract: Reinforcement learning (RL) -- finding the optimal behaviour (also referred to as policy) maximizing the collected long-term cumulative reward -- is among the most influential approaches in machine learning with a large number of successful applications. In several decision problems, however, one faces the possibility of policy switching -- changing from the current policy to a new one -- which in… ▽ More Reinforcement learning (RL) -- finding the optimal behaviour (also referred to as policy) maximizing the collected long-term cumulative reward -- is among the most influential approaches in machine learning with a large number of successful applications. In several decision problems, however, one faces the possibility of policy switching -- changing from the current policy to a new one -- which incurs a non-negligible cost (examples include the shifting of the currently applied educational technology, modernization of a computing cluster, and the introduction of a new webpage design), and in the decision one is limited to using historical data without the availability for further online interaction. Despite the inevitable importance of this offline learning scenario, to our best knowledge, very little effort has been made to tackle the key problem of balancing between the gain and the cost of switching in a flexible and principled way. Leveraging ideas from the area of optimal transport, we initialize the systematic study of policy switching in offline RL. We establish fundamental properties and design a Net Actor-Critic algorithm for the proposed novel switching formulation. Numerical experiments demonstrate the efficiency of our approach on multiple benchmarks of the Gymnasium. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.13447 [pdf, other]

High-probability minimax lower bounds

Authors: Tianyi Ma, Kabir A. Verchand, Richard J. Samworth

Abstract: The minimax risk is often considered as a gold standard against which we can compare specific statistical procedures. Nevertheless, as has been observed recently in robust and heavy-tailed estimation problems, the inherent reduction of the (random) loss to its expectation may entail a significant loss of information regarding its tail behaviour. In an attempt to avoid such a loss, we introduce the… ▽ More The minimax risk is often considered as a gold standard against which we can compare specific statistical procedures. Nevertheless, as has been observed recently in robust and heavy-tailed estimation problems, the inherent reduction of the (random) loss to its expectation may entail a significant loss of information regarding its tail behaviour. In an attempt to avoid such a loss, we introduce the notion of a minimax quantile, and seek to articulate its dependence on the quantile level. To this end, we develop high-probability variants of the classical Le Cam and Fano methods, as well as a technique to convert local minimax risk lower bounds to lower bounds on minimax quantiles. To illustrate the power of our framework, we deploy our techniques on several examples, recovering recent results in robust mean estimation and stochastic convex optimisation, as well as obtaining several new results in covariance matrix estimation, sparse linear regression, nonparametric density estimation and isotonic regression. Our overall goal is to argue that minimax quantiles can provide a finer-grained understanding of the difficulty of statistical problems, and that, in wide generality, lower bounds on these quantities can be obtained via user-friendly tools. △ Less

Submitted 19 June, 2024; originally announced June 2024.

Comments: 37 pages, 3 figures

MSC Class: 62C20; 62B10

arXiv:2406.06802 [pdf, other]

Satisficing Exploration in Bandit Optimization

Authors: Qing Feng, Tianyi Ma, Ruihao Zhu

Abstract: Motivated by the concept of satisficing in decision-making, we consider the problem of satisficing exploration in bandit optimization. In this setting, the learner aims at selecting satisficing arms (arms with mean reward exceeding a certain threshold value) as frequently as possible. The performance is measured by satisficing regret, which is the cumulative deficit of the chosen arm's mean reward… ▽ More Motivated by the concept of satisficing in decision-making, we consider the problem of satisficing exploration in bandit optimization. In this setting, the learner aims at selecting satisficing arms (arms with mean reward exceeding a certain threshold value) as frequently as possible. The performance is measured by satisficing regret, which is the cumulative deficit of the chosen arm's mean reward compared to the threshold. We propose SELECT, a general algorithmic template for Satisficing Exploration via LowEr Confidence bound Testing, that attains constant satisficing regret for a wide variety of bandit optimization problems in the realizable case (i.e., a satisficing arm exists). Specifically, given a class of bandit optimization problems and a corresponding learning oracle with sub-linear (standard) regret upper bound, SELECT iteratively makes use of the oracle to identify a potential satisficing arm with low regret. Then, it collects data samples from this arm, and continuously compares the LCB of the identified arm's mean reward against the threshold value to determine if it is a satisficing arm. As a complement, SELECT also enjoys the same (standard) regret guarantee as the oracle in the non-realizable case. Finally, we conduct numerical experiments to validate the performance of SELECT for several popular bandit optimization settings. △ Less

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2404.00474 [pdf, other]

Linguistic Calibration of Long-Form Generations

Authors: Neil Band, Xuechen Li, Tengyu Ma, Tatsunori Hashimoto

Abstract: Language models (LMs) may lead their users to make suboptimal downstream decisions when they confidently hallucinate. This issue can be mitigated by having the LM verbally convey the probability that its claims are correct, but existing models cannot produce long-form text with calibrated confidence statements. Through the lens of decision-making, we define linguistic calibration for long-form gen… ▽ More Language models (LMs) may lead their users to make suboptimal downstream decisions when they confidently hallucinate. This issue can be mitigated by having the LM verbally convey the probability that its claims are correct, but existing models cannot produce long-form text with calibrated confidence statements. Through the lens of decision-making, we define linguistic calibration for long-form generations: an LM is linguistically calibrated if its generations enable its users to make calibrated probabilistic predictions. This definition enables a training framework where a supervised finetuning step bootstraps an LM to emit long-form generations with confidence statements such as "I estimate a 30% chance of..." or "I am certain that...", followed by a reinforcement learning step which rewards generations that enable a user to provide calibrated answers to related questions. We linguistically calibrate Llama 2 7B and find in automated and human evaluations of long-form generations that it is significantly more calibrated than strong finetuned factuality baselines with comparable accuracy. These findings generalize under significant domain shifts to scientific and biomedical questions and to an entirely held-out person biography generation task. Our results demonstrate that long-form generations may be calibrated end-to-end by constructing an objective in the space of the predictions that users make in downstream decision-making. △ Less

Submitted 4 June, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

Comments: ICML 2024. Code available at https://github.com/tatsu-lab/linguistic_calibration

arXiv:2402.12875 [pdf, other]

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

Authors: Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma

Abstract: Instructing the model to generate a sequence of intermediate steps, a.k.a., a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of… ▽ More Instructing the model to generate a sequence of intermediate steps, a.k.a., a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $ \mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by boolean circuits of size $T$. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers. △ Less

Submitted 23 May, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

Comments: 38 pages, 10 figures. Accepted by ICLR 2024

arXiv:2401.07111 [pdf, other]

Bayesian Signal Matching for Transfer Learning in ERP-Based Brain Computer Interface

Authors: Tianwen Ma, Jane E. Huggins, Jian Kang

Abstract: An Event-Related Potential (ERP)-based Brain-Computer Interface (BCI) Speller System assists people with disabilities communicate by decoding electroencephalogram (EEG) signals. A P300-ERP embedded in EEG signals arises in response to a rare, but relevant event (target) among a series of irrelevant events (non-target). Different machine learning methods have constructed binary classifiers to detec… ▽ More An Event-Related Potential (ERP)-based Brain-Computer Interface (BCI) Speller System assists people with disabilities communicate by decoding electroencephalogram (EEG) signals. A P300-ERP embedded in EEG signals arises in response to a rare, but relevant event (target) among a series of irrelevant events (non-target). Different machine learning methods have constructed binary classifiers to detect target events, known as calibration. Existing calibration strategy only uses data from participants themselves with lengthy training time, causing biased P300 estimation and decreasing prediction accuracy. To resolve this issue, we propose a Bayesian signal matching (BSM) framework for calibrating the EEG signals from a new participant using data from source participants. BSM specifies the joint distribution of stimulus-specific EEG signals among source participants via a Bayesian hierarchical mixture model. We apply the inference strategy: if source and new participants are similar, they share the same set of model parameters, otherwise, they keep their own sets of model parameters; we predict on the testing data using parameters of the baseline cluster directly. Our hierarchical framework can be generalized to other base classifiers with clear likelihood specifications. We demonstrate the advantages of BSM using simulations and focus on the real data analysis among participants with neuro-degenerative diseases. △ Less

Submitted 13 January, 2024; originally announced January 2024.

Comments: 34 pages, 6 figures, 4 tables

arXiv:2401.00624 [pdf, other]

Semi-Confirmatory Factor Analysis for High-Dimensional Data with Interconnected Community Structures

Authors: Yifan Yang, Tianzhou Ma, Chuan Bi, Shuo Chen

Abstract: Confirmatory factor analysis (CFA) is a statistical method for identifying and confirming the presence of latent factors among observed variables through the analysis of their covariance structure. Compared to alternative factor models, CFA offers interpretable common factors with enhanced specificity and a more adaptable approach to modeling covariance structures. However, the application of CFA… ▽ More Confirmatory factor analysis (CFA) is a statistical method for identifying and confirming the presence of latent factors among observed variables through the analysis of their covariance structure. Compared to alternative factor models, CFA offers interpretable common factors with enhanced specificity and a more adaptable approach to modeling covariance structures. However, the application of CFA has been limited by the requirement for prior knowledge about "non-zero loadings" and by the lack of computational scalability (e.g., it can be computationally intractable for hundreds of observed variables). We propose a data-driven semi-confirmatory factor analysis (SCFA) model that attempts to alleviate these limitations. SCFA automatically specifies "non-zero loadings" by learning the network structure of the large covariance matrix of observed variables, and then offers closed-form estimators for factor loadings, factor scores, covariances between common factors, and variances between errors using the likelihood method. Therefore, SCFA is applicable to high-throughput datasets (e.g., hundreds of thousands of observed variables) without requiring prior knowledge about "non-zero loadings". Through an extensive simulation analysis benchmarking against standard packages, SCFA exhibits superior performance in estimating model parameters with a much reduced computational time. We illustrate its practical application through factor analysis on a high-dimensional RNA-seq gene expression dataset. △ Less

Submitted 27 March, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

arXiv:2307.12461 [pdf, ps, other]

Rates of Approximation by ReLU Shallow Neural Networks

Authors: Tong Mao, Ding-Xuan Zhou

Abstract: Neural networks activated by the rectified linear unit (ReLU) play a central role in the recent development of deep learning. The topic of approximating functions from Hölder spaces by these networks is crucial for understanding the efficiency of the induced learning algorithms. Although the topic has been well investigated in the setting of deep neural networks with many layers of hidden neurons,… ▽ More Neural networks activated by the rectified linear unit (ReLU) play a central role in the recent development of deep learning. The topic of approximating functions from Hölder spaces by these networks is crucial for understanding the efficiency of the induced learning algorithms. Although the topic has been well investigated in the setting of deep neural networks with many layers of hidden neurons, it is still open for shallow networks having only one hidden layer. In this paper, we provide rates of uniform approximation by these networks. We show that ReLU shallow neural networks with $m$ hidden neurons can uniformly approximate functions from the Hölder space $W_\infty^r([-1, 1]^d)$ with rates $O((\log m)^{\frac{1}{2} +d}m^{-\frac{r}{d}\frac{d+2}{d+4}})$ when $r<d/2 +2$. Such rates are very close to the optimal one $O(m^{-\frac{r}{d}})$ in the sense that $\frac{d+2}{d+4}$ is close to $1$, when the dimension $d$ is large. △ Less

Submitted 23 July, 2023; originally announced July 2023.

arXiv:2307.11007 [pdf, other]

Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization

Authors: Kaiyue Wen, Zhiyuan Li, Tengyu Ma

Abstract: Despite extensive studies, the underlying reason as to why overparameterized neural networks can generalize remains elusive. Existing theory shows that common stochastic optimizers prefer flatter minimizers of the training loss, and thus a natural potential explanation is that flatness implies generalization. This work critically examines this explanation. Through theoretical and empirical investi… ▽ More Despite extensive studies, the underlying reason as to why overparameterized neural networks can generalize remains elusive. Existing theory shows that common stochastic optimizers prefer flatter minimizers of the training loss, and thus a natural potential explanation is that flatness implies generalization. This work critically examines this explanation. Through theoretical and empirical investigation, we identify the following three scenarios for two-layer ReLU networks: (1) flatness provably implies generalization; (2) there exist non-generalizing flattest models and sharpness minimization algorithms fail to generalize, and (3) perhaps most surprisingly, there exist non-generalizing flattest models, but sharpness minimization algorithms still generalize. Our results suggest that the relationship between sharpness and generalization subtly depends on the data distributions and the model architectures and sharpness minimization algorithms do not only minimize sharpness to achieve better generalization. This calls for the search for other explanations for the generalization of over-parameterized neural networks. △ Less

Submitted 22 July, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: 34 pages,11 figures

arXiv:2305.17126 [pdf, other]

Large Language Models as Tool Makers

Authors: Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, Denny Zhou

Abstract: Recent research has highlighted the potential of large language models (LLMs) to improve their problem-solving capabilities with the aid of suitable external tools. In our work, we further advance this concept by introducing a closed-loop framework, referred to as LLMs A s Tool Makers (LATM), where LLMs create their own reusable tools for problem-solving. Our approach consists of two phases: 1) to… ▽ More Recent research has highlighted the potential of large language models (LLMs) to improve their problem-solving capabilities with the aid of suitable external tools. In our work, we further advance this concept by introducing a closed-loop framework, referred to as LLMs A s Tool Makers (LATM), where LLMs create their own reusable tools for problem-solving. Our approach consists of two phases: 1) tool making: an LLM acts as the tool maker that crafts tools for a set of tasks. 2) tool using: another LLM acts as the tool user, which applies the tool built by the tool maker for problem-solving. On the problem-solving server side, tool-making enables continual tool generation and caching as new requests emerge. This framework enables subsequent requests to access cached tools via their corresponding APIs, enhancing the efficiency of task resolution. Recognizing that tool-making requires more sophisticated capabilities, we assign this task to a powerful, albeit resource-intensive, model. Conversely, the simpler tool-using phase is delegated to a lightweight model. This strategic division of labor allows the once-off cost of tool-making to be spread over multiple instances of tool-using, significantly reducing average costs while maintaining strong performance. Furthermore, our method offers a functional cache through the caching and reuse of tools, which stores the functionality of a class of requests instead of the natural language responses from LLMs, thus extending the applicability of the conventional cache mechanism. We evaluate our approach across various complex reasoning tasks, including Big-Bench tasks. With GPT-4 as the tool maker and GPT-3.5 as the tool user, LATM demonstrates performance equivalent to using GPT-4 for both roles, but with a significantly reduced inference cost. △ Less

Submitted 10 March, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

Comments: Code available at https://github.com/ctlllll/LLM-ToolMaker

arXiv:2305.05722 [pdf]

Enhancing Clinical Predictive Modeling through Model Complexity-Driven Class Proportion Tuning for Class Imbalanced Data: An Empirical Study on Opioid Overdose Prediction

Authors: Yinan Liu, Xinyu Dong, Weimin Lyu, Richard N. Rosenthal, Rachel Wong, Tengfei Ma, Fusheng Wang

Abstract: Class imbalance problems widely exist in the medical field and heavily deteriorates performance of clinical predictive models. Most techniques to alleviate the problem rebalance class proportions and they predominantly assume the rebalanced proportions should be a function of the original data and oblivious to the model one uses. This work challenges this prevailing assumption and proposes that li… ▽ More Class imbalance problems widely exist in the medical field and heavily deteriorates performance of clinical predictive models. Most techniques to alleviate the problem rebalance class proportions and they predominantly assume the rebalanced proportions should be a function of the original data and oblivious to the model one uses. This work challenges this prevailing assumption and proposes that links the optimal class proportions to the model complexity, thereby tuning the class proportions per model. Our experiments on the opioid overdose prediction problem highlight the performance gain of tuning class proportions. Rigorous regression analysis also confirms the advantages of the theoretical framework proposed and the statistically significant correlation between the hyperparameters controlling the model complexity and the optimal class proportions. △ Less

Submitted 9 May, 2023; originally announced May 2023.

arXiv:2303.14281 [pdf, other]

Sequential Knockoffs for Variable Selection in Reinforcement Learning

Authors: Tao Ma, Hengrui Cai, Zhengling Qi, Chengchun Shi, Eric B. Laber

Abstract: In real-world applications of reinforcement learning, it is often challenging to obtain a state representation that is parsimonious and satisfies the Markov property without prior knowledge. Consequently, it is common practice to construct a state which is larger than necessary, e.g., by concatenating measurements over contiguous time points. However, needlessly increasing the dimension of the sta… ▽ More In real-world applications of reinforcement learning, it is often challenging to obtain a state representation that is parsimonious and satisfies the Markov property without prior knowledge. Consequently, it is common practice to construct a state which is larger than necessary, e.g., by concatenating measurements over contiguous time points. However, needlessly increasing the dimension of the state can slow learning and obfuscate the learned policy. We introduce the notion of a minimal sufficient state in a Markov decision process (MDP) as the smallest subvector of the original state under which the process remains an MDP and shares the same optimal policy as the original process. We propose a novel sequential knockoffs (SEEK) algorithm that estimates the minimal sufficient state in a system with high-dimensional complex nonlinear dynamics. In large samples, the proposed method controls the false discovery rate, and selects all sufficient variables with probability approaching one. As the method is agnostic to the reinforcement learning algorithm being applied, it benefits downstream tasks such as policy optimization. Empirical experiments verify theoretical results and show the proposed approach outperforms several competing methods in terms of variable selection accuracy and regret. △ Less

Submitted 24 March, 2023; originally announced March 2023.

arXiv:2303.03520 [pdf, other]

The Effect of Alcohol Consumption on Brain Ageing: A New Causal Inference Framework for Incomplete and Massive Phenomic Data

Authors: Chixiang Chen, Shuo Chen, Zhenyao Ye, Xu Shi, Tianzhou Ma

Abstract: Although substance use, such as alcohol consumption, is known to be associated with cognitive decline during ageing, its direct influence on the central nervous system remains unclear. In this study, we aim to investigate the potential influence of alcohol intake frequency on accelerated brain ageing by estimating the mean potential brain-age gap (BAG) index, the difference between brain age and a… ▽ More Although substance use, such as alcohol consumption, is known to be associated with cognitive decline during ageing, its direct influence on the central nervous system remains unclear. In this study, we aim to investigate the potential influence of alcohol intake frequency on accelerated brain ageing by estimating the mean potential brain-age gap (BAG) index, the difference between brain age and actual age, under different alcohol intake frequencies in a large UK Biobank (UKB) cohort with extensive phenomic data reflecting a comprehensive life-style profile. We face two major challenges: (1) a large number of phenomic variables as potential confounders and (2) a small proportion of participants with complete phenomic data. To address these challenges, we first develop a new ensemble learning framework to establish robust estimation of mean potential outcome in the presence of many confounders. We then construct a data integration step to borrow information from UKB participants with incomplete phenomic data to improve efficiency. Our analysis results reveal that daily intake or even a few times a week may have significant effects on accelerating brain ageing. Moreover, extensive numerical studies demonstrate the superiority of our method over competing methods, in terms of smaller estimation bias and variability. △ Less

Submitted 4 March, 2024; v1 submitted 6 March, 2023; originally announced March 2023.

Comments: Contact: [email protected]

arXiv:2301.00363 [pdf]

doi 10.1016/j.rse.2023.113695

Map** smallholder cashew plantations to inform sustainable tree crop expansion in Benin

Authors: Leikun Yin, Rahul Ghosh, Chenxi Lin, David Hale, Christoph Weigl, James Obarowski, Junxiong Zhou, Jessica Till, Xiaowei Jia, Troy Mao, Vipin Kumar, Zhenong **

Abstract: Cashews are grown by over 3 million smallholders in more than 40 countries worldwide as a principal source of income. As the third largest cashew producer in Africa, Benin has nearly 200,000 smallholder cashew growers contributing 15% of the country's national export earnings. However, a lack of information on where and how cashew trees grow across the country hinders decision-making that could su… ▽ More Cashews are grown by over 3 million smallholders in more than 40 countries worldwide as a principal source of income. As the third largest cashew producer in Africa, Benin has nearly 200,000 smallholder cashew growers contributing 15% of the country's national export earnings. However, a lack of information on where and how cashew trees grow across the country hinders decision-making that could support increased cashew production and poverty alleviation. By leveraging 2.4-m Planet Basemaps and 0.5-m aerial imagery, newly developed deep learning algorithms, and large-scale ground truth datasets, we successfully produced the first national map of cashew in Benin and characterized the expansion of cashew plantations between 2015 and 2021. In particular, we developed a SpatioTemporal Classification with Attention (STCA) model to map the distribution of cashew plantations, which can fully capture texture information from discriminative time steps during a growing season. We further developed a Clustering Augmented Self-supervised Temporal Classification (CASTC) model to distinguish high-density versus low-density cashew plantations by automatic feature extraction and optimized clustering. Results show that the STCA model has an overall accuracy over 85% and the CASTC model achieved an overall accuracy of 76%. We found that the cashew area in Benin almost doubled from 2015 to 2021 with 60% of new plantation development coming from cropland or fallow land, while encroachment of cashew plantations into protected areas has increased by 55%. Only half of cashew plantations were high-density in 2021, suggesting high potential for intensification. Our study illustrates the power of combining high-resolution remote sensing imagery and state-of-the-art deep learning algorithms to better understand tree crops in the heterogeneous smallholder landscape. △ Less

Submitted 15 January, 2023; v1 submitted 1 January, 2023; originally announced January 2023.

Journal ref: Remote Sensing of Environment, 295, p.113695 (2023)

arXiv:2212.13294 [pdf, other]

Multivariate Bayesian variable selection with application to multi-trait genetic fine map**

Authors: Travis Canida, Hongjie Ke, Shuo Chen, Zhenayo Ye, Tianzhou Ma

Abstract: Variable selection has played a critical role in modern statistical learning and scientific discoveries. Numerous regularization and Bayesian variable selection methods have been developed in the past two decades for variable selection, but most of these methods consider selecting variables for only one response. As more data is being collected nowadays, it is common to analyze multiple related re… ▽ More Variable selection has played a critical role in modern statistical learning and scientific discoveries. Numerous regularization and Bayesian variable selection methods have been developed in the past two decades for variable selection, but most of these methods consider selecting variables for only one response. As more data is being collected nowadays, it is common to analyze multiple related responses from the same study. Existing multivariate variable selection methods select variables for all responses without considering the possible heterogeneity across different responses, i.e. some features may only predict a subset of responses but not the rest. Motivated by the multi-trait fine map** problem in genetics to identify the causal variants for multiple related traits, we developed a novel multivariate Bayesian variable selection method to select critical predictors from a large number of grouped predictors that target at multiple correlated and possibly heterogeneous responses. Our new method is featured by its selection at multiple levels, its incorporation of prior biological knowledge to guide selection and identification of best subset of responses predictors target at. We showed the advantage of our method via extensive simulations and a real fine map** example to identify causal variants associated with different subsets of addictive behaviors. △ Less

Submitted 1 March, 2024; v1 submitted 26 December, 2022; originally announced December 2022.

Comments: 46 pages, 4 figures

arXiv:2212.13292 [pdf, other]

Robust distance correlation for variable screening

Authors: Tianzhou Ma, Hongjie Ke, Zhao Ren

Abstract: High-dimensional data are commonly seen in modern statistical applications, variable selection methods play indispensable roles in identifying the critical features for scientific discoveries. Traditional best subset selection methods are computationally intractable with a large number of features, while regularization methods such as Lasso, SCAD and their variants perform poorly in ultrahigh-dime… ▽ More High-dimensional data are commonly seen in modern statistical applications, variable selection methods play indispensable roles in identifying the critical features for scientific discoveries. Traditional best subset selection methods are computationally intractable with a large number of features, while regularization methods such as Lasso, SCAD and their variants perform poorly in ultrahigh-dimensional data due to low computational efficiency and unstable algorithm. Sure screening methods have become popular alternatives by first rapidly reducing the dimension using simple measures such as marginal correlation then applying any regularization methods. A number of screening methods for different models or problems have been developed, however, none of the methods have targeted at data with heavy tailedness, which is another important characteristics of modern big data. In this paper, we propose a robust distance correlation (``RDC'') based sure screening method to perform screening in ultrahigh-dimensional regression with heavy-tailed data. The proposed method shares the same good properties as the original model-free distance correlation based screening while has additional merit of robustly estimating the distance correlation when data is heavy-tailed and improves the model selection performance in screening. We conducted extensive simulations under different scenarios of heavy tailedness to demonstrate the advantage of our proposed procedure as compared to other existing model-based or model-free screening procedures with improved feature selection and prediction performance. We also applied the method to high-dimensional heavy-tailed RNA sequencing (RNA-seq) data of The Cancer Genome Atlas (TCGA) pancreatic cancer cohort and RDC was shown to outperform the other methods in prioritizing the most essential and biologically meaningful genes. △ Less

Submitted 26 December, 2022; originally announced December 2022.

arXiv:2211.14699 [pdf, other]

A Theoretical Study of Inductive Biases in Contrastive Learning

Authors: Jeff Z. HaoChen, Tengyu Ma

Abstract: Understanding self-supervised learning is important but challenging. Previous theoretical works study the role of pretraining losses, and view neural networks as general black boxes. However, the recent work of Saunshi et al. argues that the model architecture -- a component largely ignored by previous works -- also has significant influences on the downstream performance of self-supervised learni… ▽ More Understanding self-supervised learning is important but challenging. Previous theoretical works study the role of pretraining losses, and view neural networks as general black boxes. However, the recent work of Saunshi et al. argues that the model architecture -- a component largely ignored by previous works -- also has significant influences on the downstream performance of self-supervised learning. In this work, we provide the first theoretical analysis of self-supervised learning that incorporates the effect of inductive biases originating from the model class. In particular, we focus on contrastive learning -- a popular self-supervised learning method that is widely used in the vision domain. We show that when the model has limited capacity, contrastive representations would recover certain special clustering structures that are compatible with the model architecture, but ignore many other clustering structures in the data distribution. As a result, our theory can capture the more realistic setting where contrastive representations have much lower dimensionality than the number of clusters in the data distribution. We instantiate our theory on several synthetic data distributions, and provide empirical evidence to support the theory. △ Less

Submitted 8 April, 2023; v1 submitted 26 November, 2022; originally announced November 2022.

Comments: ICLR 2023

arXiv:2211.11719 [pdf, other]

First Steps Toward Understanding the Extrapolation of Nonlinear Models to Unseen Domains

Authors: Kefan Dong, Tengyu Ma

Abstract: Real-world machine learning applications often involve deploying neural networks to domains that are not seen in the training time. Hence, we need to understand the extrapolation of nonlinear models -- under what conditions on the distributions and function class, models can be guaranteed to extrapolate to new test distributions. The question is very challenging because even two-layer neural netwo… ▽ More Real-world machine learning applications often involve deploying neural networks to domains that are not seen in the training time. Hence, we need to understand the extrapolation of nonlinear models -- under what conditions on the distributions and function class, models can be guaranteed to extrapolate to new test distributions. The question is very challenging because even two-layer neural networks cannot be guaranteed to extrapolate outside the support of the training distribution without further assumptions on the domain shift. This paper makes some initial steps toward analyzing the extrapolation of nonlinear models for structured domain shift. We primarily consider settings where the marginal distribution of each coordinate of the data (or subset of coordinates) does not shift significantly across the training and test distributions, but the joint distribution may have a much bigger shift. We prove that the family of nonlinear models of the form $f(x)=\sum f_i(x_i)$, where $f_i$ is an arbitrary function on the subset of features $x_i$, can extrapolate to unseen distributions, if the covariance of the features is well-conditioned. To the best of our knowledge, this is the first result that goes beyond linear models and the bounded density ratio assumption, even though the assumptions on the distribution shift and function class are stylized. △ Less

Submitted 1 December, 2022; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: added citations and fixed typos

arXiv:2211.05729 [pdf, other]

How Does Sharpness-Aware Minimization Minimize Sharpness?

Authors: Kaiyue Wen, Tengyu Ma, Zhiyuan Li

Abstract: Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for improving the generalization of deep neural networks for various settings. However, the underlying working of SAM remains elusive because of various intriguing approximations in the theoretical characterizations. SAM intends to penalize a notion of sharpness of the model but implements a computationally efficient… ▽ More Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for improving the generalization of deep neural networks for various settings. However, the underlying working of SAM remains elusive because of various intriguing approximations in the theoretical characterizations. SAM intends to penalize a notion of sharpness of the model but implements a computationally efficient variant; moreover, a third notion of sharpness was used for proving generalization guarantees. The subtle differences in these notions of sharpness can indeed lead to significantly different empirical results. This paper rigorously nails down the exact sharpness notion that SAM regularizes and clarifies the underlying mechanism. We also show that the two steps of approximations in the original motivation of SAM individually lead to inaccurate local conclusions, but their combination accidentally reveals the correct effect, when full-batch gradients are applied. Furthermore, we also prove that the stochastic version of SAM in fact regularizes the third notion of sharpness mentioned above, which is most likely to be the preferred notion for practical performance. The key mechanism behind this intriguing phenomenon is the alignment between the gradient and the top eigenvector of Hessian when SAM is applied. △ Less

Submitted 5 January, 2023; v1 submitted 10 November, 2022; originally announced November 2022.

Comments: 94 pages, 1 figure

arXiv:2208.08472 [pdf, other]

Bayesian response adaptive randomization design with a composite endpoint of mortality and morbidity

Authors: Zhongying Xu, Andriy I. Bandos, Tianzhou Ma, Lu Tang, Victor B. Talisa, Chung-Chou H. Chang

Abstract: Allocating patients to treatment arms during a trial based on the observed responses accumulated prior to the decision point, and sequential adaptation of this allocation,, could minimize the expected number of failures or maximize total benefit to patients. In this study, we developed a Bayesian response adaptive randomization (RAR) design targeting the endpoint of organ support-free days (OSFD)… ▽ More Allocating patients to treatment arms during a trial based on the observed responses accumulated prior to the decision point, and sequential adaptation of this allocation,, could minimize the expected number of failures or maximize total benefit to patients. In this study, we developed a Bayesian response adaptive randomization (RAR) design targeting the endpoint of organ support-free days (OSFD) for patients admitted to the intensive care units (ICU). The OSFD is a mixture of mortality and morbidity assessed by the number of days of free of organ support within a predetermined time-window post-randomization. In the past, researchers treated OSFD as an ordinal outcome variable where the lowest category is death. We propose a novel RAR design for a composite endpoint of mortality and morbidity, e.g., OSFD, by using a Bayesian mixture model with a Markov chain Monte Carlo sampling to estimate the posterior probability of OSFD and determine treatment allocation ratios at each interim. Simulations were conducted to compare the performance of our proposed design under various randomization rules and different alpha spending functions. The results show that our RAR design using Bayesian inference allocated more patients to the better performing arm(s) compared to other existing adaptive rules while assuring adequate power and type I error rate control for the across a range of plausible clinical scenarios. △ Less

Submitted 31 August, 2022; v1 submitted 17 August, 2022; originally announced August 2022.

arXiv:2207.08977 [pdf, other]

Calibrated ensembles can mitigate accuracy tradeoffs under distribution shift

Authors: Ananya Kumar, Tengyu Ma, Percy Liang, Aditi Raghunathan

Abstract: We often see undesirable tradeoffs in robust machine learning where out-of-distribution (OOD) accuracy is at odds with in-distribution (ID) accuracy: a robust classifier obtained via specialized techniques such as removing spurious features often has better OOD but worse ID accuracy compared to a standard classifier trained via ERM. In this paper, we find that ID-calibrated ensembles -- where we s… ▽ More We often see undesirable tradeoffs in robust machine learning where out-of-distribution (OOD) accuracy is at odds with in-distribution (ID) accuracy: a robust classifier obtained via specialized techniques such as removing spurious features often has better OOD but worse ID accuracy compared to a standard classifier trained via ERM. In this paper, we find that ID-calibrated ensembles -- where we simply ensemble the standard and robust models after calibrating on only ID data -- outperforms prior state-of-the-art (based on self-training) on both ID and OOD accuracy. On eleven natural distribution shift datasets, ID-calibrated ensembles obtain the best of both worlds: strong ID accuracy and OOD accuracy. We analyze this method in stylized settings, and identify two important conditions for ensembles to perform well both ID and OOD: (1) we need to calibrate the standard and robust models (on ID data, because OOD data is unavailable), (2) OOD has no anticorrelated spurious features. △ Less

Submitted 18 July, 2022; originally announced July 2022.

Comments: Accepted to UAI 2022

arXiv:2207.04369 [pdf, other]

Balancing Producer Fairness and Efficiency via Prior-Weighted Rating System Design

Authors: Thomas Ma, Michael S. Bernstein, Ramesh Johari, Nikhil Garg

Abstract: Online marketplaces use rating systems to promote the discovery of high-quality products. However, these systems also lead to high variance in producers' economic outcomes: a new producer who sells high-quality items, may unluckily receive one low rating early on, negatively impacting their future popularity. We investigate the design of rating systems that balance the goals of identifying high-qu… ▽ More Online marketplaces use rating systems to promote the discovery of high-quality products. However, these systems also lead to high variance in producers' economic outcomes: a new producer who sells high-quality items, may unluckily receive one low rating early on, negatively impacting their future popularity. We investigate the design of rating systems that balance the goals of identifying high-quality products (efficiency) and minimizing the variance in economic outcomes of producers of similar quality (individual producer fairness). We show that there is a trade-off between these two goals: rating systems that promote efficiency are necessarily less individually fair to producers. We introduce prior-weighted rating systems as an approach to managing this trade-off. Informally, the system we propose sets a system-wide prior for the quality of an incoming product; subsequently, the system updates that prior to a posterior for each producer's quality based on user-generated ratings over time. We show theoretically that in markets where products accrue reviews at an equal rate, the strength of the rating system's prior determines the operating point on the identified trade-off: the stronger the prior, the more the marketplace discounts early ratings data (increasing individual fairness), but the slower the platform is in learning about true item quality (so efficiency suffers). We further analyze this trade-off in a responsive market where customers make decisions based on historical ratings. Through calibrated simulations, we show that the choice of prior strength mediates the same efficiency-consistency trade-off in this setting. Overall, we demonstrate that by tuning the prior as a design choice in a prior-weighted rating system, platforms can be intentional about the balance between efficiency and producer fairness. △ Less

Submitted 25 November, 2023; v1 submitted 9 July, 2022; originally announced July 2022.

Comments: 12 pages, 8 figures, submitted to TheWebConf 2024

arXiv:2111.15086 [pdf, other]

Scalable Semiparametric Spatio-temporal Regression for Large Data Analysis

Authors: Ting Fung Ma, Fangfang Wang, Jun Zhu, Anthony R. Ives, Katarzyna E. Lewińska

Abstract: With the rapid advances of data acquisition techniques, spatio-temporal data are becoming increasingly abundant in a diverse array of disciplines. Here we develop spatio-temporal regression methodology for analyzing large amounts of spatially referenced data collected over time, motivated by environmental studies utilizing remotely sensed satellite data. In particular, we specify a semiparametric… ▽ More With the rapid advances of data acquisition techniques, spatio-temporal data are becoming increasingly abundant in a diverse array of disciplines. Here we develop spatio-temporal regression methodology for analyzing large amounts of spatially referenced data collected over time, motivated by environmental studies utilizing remotely sensed satellite data. In particular, we specify a semiparametric autoregressive model without the usual Gaussian assumption and devise a computationally scalable procedure that enables the regression analysis of large datasets. We estimate the model parameters by quasi maximum likelihood and show that the computational complexity can be reduced from cubic to linear of the sample size. Asymptotic properties under suitable regularity conditions are further established that inform the computational procedure to be efficient and scalable. A simulation study is conducted to evaluate the finite-sample properties of the parameter estimation and statistical inference. We illustrate our methodology by a dataset with over 2.96 million observations of annual land surface temperature and the comparison with an existing state-of-the-art approach highlights the advantages of our method. △ Less

Submitted 29 November, 2021; originally announced November 2021.

arXiv:2111.03741 [pdf, other]

Sharp Bounds for Federated Averaging (Local SGD) and Continuous Perspective

Authors: Margalit Glasgow, Honglin Yuan, Tengyu Ma

Abstract: Federated Averaging (FedAvg), also known as Local SGD, is one of the most popular algorithms in Federated Learning (FL). Despite its simplicity and popularity, the convergence rate of FedAvg has thus far been undetermined. Even under the simplest assumptions (convex, smooth, homogeneous, and bounded covariance), the best-known upper and lower bounds do not match, and it is not clear whether the ex… ▽ More Federated Averaging (FedAvg), also known as Local SGD, is one of the most popular algorithms in Federated Learning (FL). Despite its simplicity and popularity, the convergence rate of FedAvg has thus far been undetermined. Even under the simplest assumptions (convex, smooth, homogeneous, and bounded covariance), the best-known upper and lower bounds do not match, and it is not clear whether the existing analysis captures the capacity of the algorithm. In this work, we first resolve this question by providing a lower bound for FedAvg that matches the existing upper bound, which shows the existing FedAvg upper bound analysis is not improvable. Additionally, we establish a lower bound in a heterogeneous setting that nearly matches the existing upper bound. While our lower bounds show the limitations of FedAvg, under an additional assumption of third-order smoothness, we prove more optimistic state-of-the-art convergence results in both convex and non-convex settings. Our analysis stems from a notion we call iterate bias, which is defined by the deviation of the expectation of the SGD trajectory from the noiseless gradient descent trajectory with the same initialization. We prove novel sharp bounds on this quantity, and show intuitively how to analyze this quantity from a Stochastic Differential Equation (SDE) perspective. △ Less

Submitted 11 February, 2022; v1 submitted 5 November, 2021; originally announced November 2021.

Comments: Accepted to AISTATS 2022. The first two authors contributed equally

arXiv:2110.05025 [pdf, other]

Self-supervised Learning is More Robust to Dataset Imbalance

Authors: Hong Liu, Jeff Z. HaoChen, Adrien Gaidon, Tengyu Ma

Abstract: Self-supervised learning (SSL) is a scalable way to learn general visual representations since it learns without labels. However, large-scale unlabeled datasets in the wild often have long-tailed label distributions, where we know little about the behavior of SSL. In this work, we systematically investigate self-supervised learning under dataset imbalance. First, we find out via extensive experime… ▽ More Self-supervised learning (SSL) is a scalable way to learn general visual representations since it learns without labels. However, large-scale unlabeled datasets in the wild often have long-tailed label distributions, where we know little about the behavior of SSL. In this work, we systematically investigate self-supervised learning under dataset imbalance. First, we find out via extensive experiments that off-the-shelf self-supervised representations are already more robust to class imbalance than supervised representations. The performance gap between balanced and imbalanced pre-training with SSL is significantly smaller than the gap with supervised learning, across sample sizes, for both in-domain and, especially, out-of-domain evaluation. Second, towards understanding the robustness of SSL, we hypothesize that SSL learns richer features from frequent data: it may learn label-irrelevant-but-transferable features that help classify the rare classes and downstream tasks. In contrast, supervised learning has no incentive to learn features irrelevant to the labels from frequent examples. We validate this hypothesis with semi-synthetic experiments and theoretical analyses on a simplified setting. Third, inspired by the theoretical insights, we devise a re-weighted regularization technique that consistently improves the SSL representation quality on imbalanced datasets with several evaluation criteria, closing the small gap between balanced and imbalanced datasets with the same number of examples. △ Less

Submitted 22 May, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

arXiv:2107.13163 [pdf, ps, other]

Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers

Authors: Colin Wei, Yining Chen, Tengyu Ma

Abstract: A common lens to theoretically study neural net architectures is to analyze the functions they can approximate. However, constructions from approximation theory may be unrealistic and therefore less meaningful. For example, a common unrealistic trick is to encode target function values using infinite precision. To address these issues, this work proposes a formal definition of statistically meanin… ▽ More A common lens to theoretically study neural net architectures is to analyze the functions they can approximate. However, constructions from approximation theory may be unrealistic and therefore less meaningful. For example, a common unrealistic trick is to encode target function values using infinite precision. To address these issues, this work proposes a formal definition of statistically meaningful (SM) approximation which requires the approximating network to exhibit good statistical learnability. We study SM approximation for two function classes: boolean circuits and Turing machines. We show that overparameterized feedforward neural nets can SM approximate boolean circuits with sample complexity depending only polynomially on the circuit size, not the size of the network. In addition, we show that transformers can SM approximate Turing machines with computation time bounded by $T$ with sample complexity polynomial in the alphabet size, state space size, and $\log (T)$. We also introduce new tools for analyzing generalization which provide much tighter sample complexities than the typical VC-dimension or norm-based bounds, which may be of independent interest. △ Less

Submitted 30 March, 2023; v1 submitted 28 July, 2021; originally announced July 2021.

arXiv:2107.06650 [pdf, other]

doi 10.1145/3447548.3467167

An Efficient Deep Distribution Network for Bid Shading in First-Price Auctions

Authors: Tian Zhou, Hao He, Shengjun Pan, Niklas Karlsson, Bharatbhushan Shetty, Brendan Kitts, Djordje Gligorijevic, San Gultekin, Tingyu Mao, Junwei Pan, Jianlong Zhang, Aaron Flores

Abstract: Since 2019, most ad exchanges and sell-side platforms (SSPs), in the online advertising industry, shifted from second to first price auctions. Due to the fundamental difference between these auctions, demand-side platforms (DSPs) have had to update their bidding strategies to avoid bidding unnecessarily high and hence overpaying. Bid shading was proposed to adjust the bid price intended for second… ▽ More Since 2019, most ad exchanges and sell-side platforms (SSPs), in the online advertising industry, shifted from second to first price auctions. Due to the fundamental difference between these auctions, demand-side platforms (DSPs) have had to update their bidding strategies to avoid bidding unnecessarily high and hence overpaying. Bid shading was proposed to adjust the bid price intended for second-price auctions, in order to balance cost and winning probability in a first-price auction setup. In this study, we introduce a novel deep distribution network for optimal bidding in both open (non-censored) and closed (censored) online first-price auctions. Offline and online A/B testing results show that our algorithm outperforms previous state-of-art algorithms in terms of both surplus and effective cost per action (eCPX) metrics. Furthermore, the algorithm is optimized in run-time and has been deployed into VerizonMedia DSP as production algorithm, serving hundreds of billions of bid requests per day. Online A/B test shows that advertiser's ROI are improved by +2.4%, +2.4%, and +8.6% for impression based (CPM), click based (CPC), and conversion based (CPA) campaigns respectively. △ Less

Submitted 15 July, 2021; v1 submitted 12 July, 2021; originally announced July 2021.

Comments: In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'21), August 14-18, 2021, Singapore

arXiv:2107.05719 [pdf, other]

Calibrating Predictions to Decisions: A Novel Approach to Multi-Class Calibration

Authors: Shengjia Zhao, Michael P. Kim, Roshni Sahoo, Tengyu Ma, Stefano Ermon

Abstract: When facing uncertainty, decision-makers want predictions they can trust. A machine learning provider can convey confidence to decision-makers by guaranteeing their predictions are distribution calibrated -- amongst the inputs that receive a predicted class probabilities vector $q$, the actual distribution over classes is $q$. For multi-class prediction problems, however, achieving distribution ca… ▽ More When facing uncertainty, decision-makers want predictions they can trust. A machine learning provider can convey confidence to decision-makers by guaranteeing their predictions are distribution calibrated -- amongst the inputs that receive a predicted class probabilities vector $q$, the actual distribution over classes is $q$. For multi-class prediction problems, however, achieving distribution calibration tends to be infeasible, requiring sample complexity exponential in the number of classes $C$. In this work, we introduce a new notion -- \emph{decision calibration} -- that requires the predicted distribution and true distribution to be ``indistinguishable'' to a set of downstream decision-makers. When all possible decision makers are under consideration, decision calibration is the same as distribution calibration. However, when we only consider decision makers choosing between a bounded number of actions (e.g. polynomial in $C$), our main result shows that decisions calibration becomes feasible -- we design a recalibration algorithm that requires sample complexity polynomial in the number of actions and the number of classes. We validate our recalibration algorithm empirically: compared to existing methods, decision calibration improves decision-making on skin lesion and ImageNet classification with modern neural network predictors. △ Less

Submitted 12 July, 2021; originally announced July 2021.

arXiv:2106.09913 [pdf, other]

Iterative Feature Matching: Toward Provable Domain Generalization with Logarithmic Environments

Authors: Yining Chen, Elan Rosenfeld, Mark Sellke, Tengyu Ma, Andrej Risteski

Abstract: Domain generalization aims at performing well on unseen test environments with data from a limited number of training environments. Despite a proliferation of proposal algorithms for this task, assessing their performance both theoretically and empirically is still very challenging. Distributional matching algorithms such as (Conditional) Domain Adversarial Networks [Ganin et al., 2016, Long et al… ▽ More Domain generalization aims at performing well on unseen test environments with data from a limited number of training environments. Despite a proliferation of proposal algorithms for this task, assessing their performance both theoretically and empirically is still very challenging. Distributional matching algorithms such as (Conditional) Domain Adversarial Networks [Ganin et al., 2016, Long et al., 2018] are popular and enjoy empirical success, but they lack formal guarantees. Other approaches such as Invariant Risk Minimization (IRM) require a prohibitively large number of training environments -- linear in the dimension of the spurious feature space $d_s$ -- even on simple data models like the one proposed by [Rosenfeld et al., 2021]. Under a variant of this model, we show that both ERM and IRM cannot generalize with $o(d_s)$ environments. We then present an iterative feature matching algorithm that is guaranteed with high probability to yield a predictor that generalizes after seeing only $O(\log d_s)$ environments. Our results provide the first theoretical justification for a family of distribution-matching algorithms widely used in practice under a concrete nontrivial data model. △ Less

Submitted 22 November, 2021; v1 submitted 18 June, 2021; originally announced June 2021.

Comments: We acknowledge that the previous version of this paper (v1) contained an error - Theorem 3.2 was incorrect. We removed this theorem and updated the rest of the paper in v2

arXiv:2106.09226 [pdf, other]

Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning

Authors: Colin Wei, Sang Michael Xie, Tengyu Ma

Abstract: Pretrained language models have achieved state-of-the-art performance when adapted to a downstream NLP task. However, theoretical analysis of these models is scarce and challenging since the pretraining and downstream tasks can be very different. We propose an analysis framework that links the pretraining and downstream tasks with an underlying latent variable generative model of text -- the downs… ▽ More Pretrained language models have achieved state-of-the-art performance when adapted to a downstream NLP task. However, theoretical analysis of these models is scarce and challenging since the pretraining and downstream tasks can be very different. We propose an analysis framework that links the pretraining and downstream tasks with an underlying latent variable generative model of text -- the downstream classifier must recover a function of the posterior distribution over the latent variables. We analyze head tuning (learning a classifier on top of the frozen pretrained model) and prompt tuning in this setting. The generative model in our analysis is either a Hidden Markov Model (HMM) or an HMM augmented with a latent memory component, motivated by long-term dependencies in natural language. We show that 1) under certain non-degeneracy conditions on the HMM, simple classification heads can solve the downstream task, 2) prompt tuning obtains downstream guarantees with weaker non-degeneracy conditions, and 3) our recovery guarantees for the memory-augmented HMM are stronger than for the vanilla HMM because task-relevant information is easier to recover from the long-term memory. Experiments on synthetically generated data from HMMs back our theoretical findings. △ Less

Submitted 20 April, 2022; v1 submitted 16 June, 2021; originally announced June 2021.

arXiv:2106.06530 [pdf, other]

Label Noise SGD Provably Prefers Flat Global Minimizers

Authors: Alex Damian, Tengyu Ma, Jason D. Lee

Abstract: In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to. Motivated by empirical studies that demonstrate that training with noisy labels improves generalization, we study the implicit regularization effect of SGD with label noise. We show that SGD with label noise converges to… ▽ More In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to. Motivated by empirical studies that demonstrate that training with noisy labels improves generalization, we study the implicit regularization effect of SGD with label noise. We show that SGD with label noise converges to a stationary point of a regularized loss $L(θ) +λR(θ)$, where $L(θ)$ is the training loss, $λ$ is an effective regularization parameter depending on the step size, strength of the label noise, and the batch size, and $R(θ)$ is an explicit regularizer that penalizes sharp minimizers. Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones. We also prove extensions to classification with general loss functions, SGD with momentum, and SGD with general noise covariance, significantly strengthening the prior work of Blanc et al. to global convergence and large learning rates and of HaoChen et al. to general models. △ Less

Submitted 4 December, 2021; v1 submitted 11 June, 2021; originally announced June 2021.

Comments: 57 pages, 5 figures, NeurIPS 2021

arXiv:2106.04156 [pdf, other]

Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss

Authors: Jeff Z. HaoChen, Colin Wei, Adrien Gaidon, Tengyu Ma

Abstract: Recent works in self-supervised learning have advanced the state-of-the-art by relying on the contrastive learning paradigm, which learns representations by pushing positive pairs, or similar examples from the same class, closer together while kee** negative pairs far apart. Despite the empirical successes, theoretical foundations are limited -- prior analyses assume conditional independence of… ▽ More Recent works in self-supervised learning have advanced the state-of-the-art by relying on the contrastive learning paradigm, which learns representations by pushing positive pairs, or similar examples from the same class, closer together while kee** negative pairs far apart. Despite the empirical successes, theoretical foundations are limited -- prior analyses assume conditional independence of the positive pairs given the same class label, but recent empirical applications use heavily correlated positive pairs (i.e., data augmentations of the same image). Our work analyzes contrastive learning without assuming conditional independence of positive pairs using a novel concept of the augmentation graph on data. Edges in this graph connect augmentations of the same data, and ground-truth classes naturally form connected sub-graphs. We propose a loss that performs spectral decomposition on the population augmentation graph and can be succinctly written as a contrastive learning objective on neural net representations. Minimizing this objective leads to features with provable accuracy guarantees under linear probe evaluation. By standard generalization bounds, these accuracy guarantees also hold when minimizing the training contrastive loss. Empirically, the features learned by our objective can match or outperform several strong baselines on benchmark vision datasets. In all, this work provides the first provable analysis for contrastive learning where guarantees for linear probe evaluation can apply to realistic empirical settings. △ Less

Submitted 23 June, 2022; v1 submitted 8 June, 2021; originally announced June 2021.

Comments: Accepted as an oral to NeurIPS 2021

arXiv:2105.13975 [pdf, other]

Relation Matters in Sampling: A Scalable Multi-Relational Graph Neural Network for Drug-Drug Interaction Prediction

Authors: Arthur Feeney, Rishabh Gupta, Veronika Thost, Rico Angell, Gayathri Chandu, Yash Adhikari, Tengfei Ma

Abstract: Sampling is an established technique to scale graph neural networks to large graphs. Current approaches however assume the graphs to be homogeneous in terms of relations and ignore relation types, critically important in biomedical graphs. Multi-relational graphs contain various types of relations that usually come with variable frequency and have different importance for the problem at hand. We p… ▽ More Sampling is an established technique to scale graph neural networks to large graphs. Current approaches however assume the graphs to be homogeneous in terms of relations and ignore relation types, critically important in biomedical graphs. Multi-relational graphs contain various types of relations that usually come with variable frequency and have different importance for the problem at hand. We propose an approach to modeling the importance of relation types for neighborhood sampling in graph neural networks and show that we can learn the right balance: relation-type probabilities that reflect both frequency and importance. Our experiments on drug-drug interaction prediction show that state-of-the-art graph neural networks profit from relation-dependent sampling in terms of both accuracy and efficiency. △ Less

Submitted 28 May, 2021; originally announced May 2021.

arXiv:2103.13462 [pdf, other]

Why Do Local Methods Solve Nonconvex Problems?

Authors: Tengyu Ma

Abstract: Non-convex optimization is ubiquitous in modern machine learning. Researchers devise non-convex objective functions and optimize them using off-the-shelf optimizers such as stochastic gradient descent and its variants, which leverage the local geometry and update iteratively. Even though solving non-convex functions is NP-hard in the worst case, the optimization quality in practice is often not an… ▽ More Non-convex optimization is ubiquitous in modern machine learning. Researchers devise non-convex objective functions and optimize them using off-the-shelf optimizers such as stochastic gradient descent and its variants, which leverage the local geometry and update iteratively. Even though solving non-convex functions is NP-hard in the worst case, the optimization quality in practice is often not an issue -- optimizers are largely believed to find approximate global minima. Researchers hypothesize a unified explanation for this intriguing phenomenon: most of the local minima of the practically-used objectives are approximately global minima. We rigorously formalize it for concrete instances of machine learning problems. △ Less

Submitted 24 March, 2021; originally announced March 2021.

Comments: This is the Chapter 21 of the book "Beyond the Worst-Case Analysis of Algorithms"

arXiv:2102.11297 [pdf, other]

You Only Compress Once: Optimal Data Compression for Estimating Linear Models

Authors: Jeffrey Wong, Eskil Forsell, Randall Lewis, Tobias Mao, Matthew Wardrop

Abstract: Linear models are used in online decision making, such as in machine learning, policy algorithms, and experimentation platforms. Many engineering systems that use linear models achieve computational efficiency through distributed systems and expert configuration. While there are strengths to this approach, it is still difficult to have an environment that enables researchers to interactively itera… ▽ More Linear models are used in online decision making, such as in machine learning, policy algorithms, and experimentation platforms. Many engineering systems that use linear models achieve computational efficiency through distributed systems and expert configuration. While there are strengths to this approach, it is still difficult to have an environment that enables researchers to interactively iterate and explore data and models, as well as leverage analytics solutions from the open source community. Consequently, innovation can be blocked. Conditionally sufficient statistics is a unified data compression and estimation strategy that is useful for the model development process, as well as the engineering deployment process. The strategy estimates linear models from compressed data without loss on the estimated parameters and their covariances, even when errors are autocorrelated within clusters of observations. Additionally, the compression preserves almost all interactions with the the original data, unlocking better productivity for both researchers and engineering systems. △ Less

Submitted 3 March, 2021; v1 submitted 22 February, 2021; originally announced February 2021.

Comments: v2: Further reduce matrix algebra and fix typo in Section 5.3.3. Improve the relationships across Section 5.3.1, 5.3.2, and 5.3.3. v3: Change citation styles and update Section 5.3.2

arXiv:2102.03450 [pdf, other]

Wasserstein Graph Neural Networks for Graphs with Missing Attributes

Authors: Zhixian Chen, Tengfei Ma, Yangqiu Song, Yang Wang

Abstract: Missing node attributes is a common problem in real-world graphs. Graph neural networks have been demonstrated power in graph representation learning while their performance is affected by the completeness of graph information. Most of them are not specified for missing-attribute graphs and fail to leverage incomplete attribute information effectively. In this paper, we propose an innovative node… ▽ More Missing node attributes is a common problem in real-world graphs. Graph neural networks have been demonstrated power in graph representation learning while their performance is affected by the completeness of graph information. Most of them are not specified for missing-attribute graphs and fail to leverage incomplete attribute information effectively. In this paper, we propose an innovative node representation learning framework, Wasserstein Graph Neural Network (WGNN), to mitigate the problem. To make the most of limited observed attribute information and capture the uncertainty caused by missing values, we express nodes as low-dimensional distributions derived from the decomposition of the attribute matrix. Furthermore, we strengthen the expressiveness of representations by develo** a novel message passing schema that aggregates distributional information from neighbors in the Wasserstein space. We test WGNN in node classification tasks under two missing-attribute cases on both synthetic and real-world datasets. In addition, we find WGNN suitable to recover missing values and adapt them to tackle matrix completion problems with graphs of users and items. Experimental results on both tasks demonstrate the superiority of our method. △ Less

Submitted 16 February, 2022; v1 submitted 5 February, 2021; originally announced February 2021.

arXiv:2012.04550 [pdf, other]

In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness

Authors: Sang Michael Xie, Ananya Kumar, Robbie Jones, Fereshte Khani, Tengyu Ma, Percy Liang

Abstract: Consider a prediction setting with few in-distribution labeled examples and many unlabeled examples both in- and out-of-distribution (OOD). The goal is to learn a model which performs well both in-distribution and OOD. In these settings, auxiliary information is often cheaply available for every input. How should we best leverage this auxiliary information for the prediction task? Empirically acro… ▽ More Consider a prediction setting with few in-distribution labeled examples and many unlabeled examples both in- and out-of-distribution (OOD). The goal is to learn a model which performs well both in-distribution and OOD. In these settings, auxiliary information is often cheaply available for every input. How should we best leverage this auxiliary information for the prediction task? Empirically across three image and time-series datasets, and theoretically in a multi-task linear regression setting, we show that (i) using auxiliary information as input features improves in-distribution error but can hurt OOD error; but (ii) using auxiliary information as outputs of auxiliary pre-training tasks improves OOD error. To get the best of both worlds, we introduce In-N-Out, which first trains a model with auxiliary inputs and uses it to pseudolabel all the in-distribution inputs, then pre-trains a model on OOD auxiliary outputs and fine-tunes this model with the pseudolabels (self-training). We show both theoretically and empirically that In-N-Out outperforms auxiliary inputs or outputs alone on both in-distribution and OOD error. △ Less

Submitted 7 April, 2021; v1 submitted 8 December, 2020; originally announced December 2020.

Comments: ICLR 2021

arXiv:2010.11356 [pdf, ps, other]

Beyond Lazy Training for Over-parameterized Tensor Decomposition

Authors: Xiang Wang, Chenwei Wu, Jason D. Lee, Tengyu Ma, Rong Ge

Abstract: Over-parametrization is an important technique in training neural networks. In both theory and practice, training a larger network allows the optimization algorithm to avoid bad local optimal solutions. In this paper we study a closely related tensor decomposition problem: given an $l$-th order tensor in $(R^d)^{\otimes l}$ of rank $r$ (where $r\ll d$), can variants of gradient descent find a rank… ▽ More Over-parametrization is an important technique in training neural networks. In both theory and practice, training a larger network allows the optimization algorithm to avoid bad local optimal solutions. In this paper we study a closely related tensor decomposition problem: given an $l$-th order tensor in $(R^d)^{\otimes l}$ of rank $r$ (where $r\ll d$), can variants of gradient descent find a rank $m$ decomposition where $m > r$? We show that in a lazy training regime (similar to the NTK regime for neural networks) one needs at least $m = Ω(d^{l-1})$, while a variant of gradient descent can find an approximate tensor when $m = O^*(r^{2.5l}\log d)$. Our results show that gradient descent on over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data. △ Less

Submitted 21 October, 2020; originally announced October 2020.

Comments: NeurIPS 2020; the first two authors contribute equally

arXiv:2010.04591 [pdf, other]

Physics-Informed Gaussian Process Regression for Probabilistic States Estimation and Forecasting in Power Grids

Authors: Tong Ma, David Alonso Barajas-Solano, Ramakrishna Tipireddy, Alexandre M. Tartakovsky

Abstract: Real-time state estimation and forecasting is critical for efficient operation of power grids. In this paper, a physics-informed Gaussian process regression (PhI-GPR) method is presented and used for probabilistic forecasting and estimating the phase angle, angular speed, and wind mechanical power of a three-generator power grid system using sparse measurements. In standard data-driven Gaussian pr… ▽ More Real-time state estimation and forecasting is critical for efficient operation of power grids. In this paper, a physics-informed Gaussian process regression (PhI-GPR) method is presented and used for probabilistic forecasting and estimating the phase angle, angular speed, and wind mechanical power of a three-generator power grid system using sparse measurements. In standard data-driven Gaussian process regression (GPR), parameterized models for the prior statistics are fit by maximizing the marginal likelihood of observed data, whereas in PhI-GPR, we compute the prior statistics by solving stochastic equations governing power grid dynamics. The short-term forecast of a power grid system dominated by wind generation is complicated by the stochastic nature of the wind and the resulting uncertain mechanical wind power. Here, we assume that the power-grid dynamic is governed by the swing equations, and we treat the unknown terms in the swing equations (specifically, the mechanical wind power) as random processes, which turns these equations into stochastic differential equations. We solve these equations for the mean and variance of the power grid system using the Monte Carlo simulations method. We demonstrate that the proposed PhI-GPR method can accurately forecast and estimate both observed and unobserved states, including the mean behavior and associated uncertainty. For observed states, we show that PhI-GPR provides a forecast comparable to the standard data-driven GPR, with both forecasts being significantly more accurate than the autoregressive integrated moving average (ARIMA) forecast. We also show that the ARIMA forecast is much more sensitive to observation frequency and measurement errors than the PhI-GPR forecast. △ Less

Submitted 9 October, 2020; originally announced October 2020.

MSC Class: 62G99

arXiv:2010.03622 [pdf, other]

Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data

Authors: Colin Wei, Kendrick Shen, Yining Chen, Tengyu Ma

Abstract: Self-training algorithms, which train a model to fit pseudolabels predicted by another previously-learned model, have been very successful for learning with unlabeled data using neural networks. However, the current theoretical understanding of self-training only applies to linear models. This work provides a unified theoretical analysis of self-training with deep networks for semi-supervised lear… ▽ More Self-training algorithms, which train a model to fit pseudolabels predicted by another previously-learned model, have been very successful for learning with unlabeled data using neural networks. However, the current theoretical understanding of self-training only applies to linear models. This work provides a unified theoretical analysis of self-training with deep networks for semi-supervised learning, unsupervised domain adaptation, and unsupervised learning. At the core of our analysis is a simple but realistic "expansion" assumption, which states that a low probability subset of the data must expand to a neighborhood with large probability relative to the subset. We also assume that neighborhoods of examples in different classes have minimal overlap. We prove that under these assumptions, the minimizers of population objectives based on self-training and input-consistency regularization will achieve high accuracy with respect to ground-truth labels. By using off-the-shelf generalization bounds, we immediately convert this result to sample complexity guarantees for neural nets that are polynomial in the margin and Lipschitzness. Our results help explain the empirical successes of recently proposed self-training algorithms which use input consistency regularization. △ Less

Submitted 20 April, 2022; v1 submitted 7 October, 2020; originally announced October 2020.

Comments: Published at ICLR 2021

arXiv:2009.09259 [pdf, other]

Bid Shading by Win-Rate Estimation and Surplus Maximization

Authors: Shengjun Pan, Brendan Kitts, Tian Zhou, Hao He, Bharatbhushan Shetty, Aaron Flores, Djordje Gligorijevic, Junwei Pan, Tingyu Mao, San Gultekin, Jianlong Zhang

Abstract: This paper describes a new win-rate based bid shading algorithm (WR) that does not rely on the minimum-bid-to-win feedback from a Sell-Side Platform (SSP). The method uses a modified logistic regression to predict the profit from each possible shaded bid price. The function form allows fast maximization at run-time, a key requirement for Real-Time Bidding (RTB) systems. We report production result… ▽ More This paper describes a new win-rate based bid shading algorithm (WR) that does not rely on the minimum-bid-to-win feedback from a Sell-Side Platform (SSP). The method uses a modified logistic regression to predict the profit from each possible shaded bid price. The function form allows fast maximization at run-time, a key requirement for Real-Time Bidding (RTB) systems. We report production results from this method along with several other algorithms. We found that bid shading, in general, can deliver significant value to advertisers, reducing price per impression to about 55% of the unshaded cost. Further, the particular approach described in this paper captures 7% more profit for advertisers, than do benchmark methods of just bidding the most probable winning price. We also report 4.3% higher surplus than an industry Sell-Side Platform shading service. Furthermore, we observed 3% - 7% lower eCPM, eCPC and eCPA when the algorithm was integrated with budget controllers. We attribute the gains above as being mainly due to the explicit maximization of the surplus function, and note that other algorithms can take advantage of this same approach. △ Less

Submitted 19 September, 2020; originally announced September 2020.

Comments: AdKDD 2020

arXiv:2007.04596 [pdf, ps, other]

Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK

Authors: Yuanzhi Li, Tengyu Ma, Hongyang R. Zhang

Abstract: We consider the dynamic of gradient descent for learning a two-layer neural network. We assume the input $x\in\mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a\in\mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in\mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an over-parametrized two-layer neural… ▽ More We consider the dynamic of gradient descent for learning a two-layer neural network. We assume the input $x\in\mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a\in\mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in\mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an over-parametrized two-layer neural network with ReLU activation, trained by gradient descent from random initialization, can provably learn the ground truth network with population loss at most $o(1/d)$ in polynomial time with polynomial samples. On the other hand, we prove that any kernel method, including Neural Tangent Kernel, with a polynomial number of samples in $d$, has population loss at least $Ω(1 / d)$. △ Less

Submitted 9 July, 2020; originally announced July 2020.

Comments: Conference on Learning Theory (COLT) 2020

arXiv:2007.04395 [pdf, other]

doi 10.1109/TNNLS.2021.3102234

Multilevel Graph Matching Networks for Deep Graph Similarity Learning

Authors: Xiang Ling, Lingfei Wu, Saizhuo Wang, Tengfei Ma, Fangli Xu, Alex X. Liu, Chunming Wu, Shouling Ji

Abstract: While the celebrated graph neural networks yield effective representations for individual nodes of a graph, there has been relatively less success in extending to the task of graph similarity learning. Recent work on graph similarity learning has considered either global-level graph-graph interactions or low-level node-node interactions, however ignoring the rich cross-level interactions (e.g., be… ▽ More While the celebrated graph neural networks yield effective representations for individual nodes of a graph, there has been relatively less success in extending to the task of graph similarity learning. Recent work on graph similarity learning has considered either global-level graph-graph interactions or low-level node-node interactions, however ignoring the rich cross-level interactions (e.g., between each node of one graph and the other whole graph). In this paper, we propose a multi-level graph matching network (MGMN) framework for computing the graph similarity between any pair of graph-structured objects in an end-to-end fashion. In particular, the proposed MGMN consists of a node-graph matching network for effectively learning cross-level interactions between each node of one graph and the other whole graph, and a siamese graph neural network to learn global-level interactions between two input graphs. Furthermore, to compensate for the lack of standard benchmark datasets, we have created and collected a set of datasets for both the graph-graph classification and graph-graph regression tasks with different sizes in order to evaluate the effectiveness and robustness of our models. Comprehensive experiments demonstrate that MGMN consistently outperforms state-of-the-art baseline models on both the graph-graph classification and graph-graph regression tasks. Compared with previous work, MGMN also exhibits stronger robustness as the sizes of the two input graphs increase. △ Less

Submitted 7 August, 2021; v1 submitted 8 July, 2020; originally announced July 2020.

Comments: Accepted by IEEE Transactions on Neural Networks and Learning Systems (IEEE TNNLS)

arXiv:2006.16205 [pdf, other]

Composed Fine-Tuning: Freezing Pre-Trained Denoising Autoencoders for Improved Generalization

Authors: Sang Michael Xie, Tengyu Ma, Percy Liang

Abstract: We focus on prediction problems with structured outputs that are subject to output validity constraints, e.g. pseudocode-to-code translation where the code must compile. While labeled input-output pairs are expensive to obtain, "unlabeled" outputs, i.e. outputs without corresponding inputs, are freely available (e.g. code on GitHub) and provide information about output validity. We can capture the… ▽ More We focus on prediction problems with structured outputs that are subject to output validity constraints, e.g. pseudocode-to-code translation where the code must compile. While labeled input-output pairs are expensive to obtain, "unlabeled" outputs, i.e. outputs without corresponding inputs, are freely available (e.g. code on GitHub) and provide information about output validity. We can capture the output structure by pre-training a denoiser to denoise corrupted versions of unlabeled outputs. We first show that standard fine-tuning after pre-training destroys some of this structure. We then propose composed fine-tuning, which fine-tunes a predictor composed with the pre-trained denoiser, which is frozen to preserve output structure. For two-layer ReLU networks, we prove that composed fine-tuning significantly reduces the complexity of the predictor, thus improving generalization. Empirically, we show that composed fine-tuning improves over standard fine-tuning on two pseudocode-to-code translation datasets (3% and 6% relative). The improvement from composed fine-tuning is magnified on out-of-distribution (OOD) examples (4% and 25% relative). △ Less

Submitted 24 October, 2023; v1 submitted 29 June, 2020; originally announced June 2020.

Comments: ICML 2021 Long talk

arXiv:2006.15766 [pdf, other]

Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization

Authors: Kaidi Cao, Yining Chen, Junwei Lu, Nikos Arechiga, Adrien Gaidon, Tengyu Ma

Abstract: Real-world large-scale datasets are heteroskedastic and imbalanced -- labels have varying levels of uncertainty and label distributions are long-tailed. Heteroskedasticity and imbalance challenge deep learning algorithms due to the difficulty of distinguishing among mislabeled, ambiguous, and rare examples. Addressing heteroskedasticity and imbalance simultaneously is under-explored. We propose a… ▽ More Real-world large-scale datasets are heteroskedastic and imbalanced -- labels have varying levels of uncertainty and label distributions are long-tailed. Heteroskedasticity and imbalance challenge deep learning algorithms due to the difficulty of distinguishing among mislabeled, ambiguous, and rare examples. Addressing heteroskedasticity and imbalance simultaneously is under-explored. We propose a data-dependent regularization technique for heteroskedastic datasets that regularizes different regions of the input space differently. Inspired by the theoretical derivation of the optimal regularization strength in a one-dimensional nonparametric classification setting, our approach adaptively regularizes the data points in higher-uncertainty, lower-density regions more heavily. We test our method on several benchmark tasks, including a real-world heteroskedastic and imbalanced dataset, WebVision. Our experiments corroborate our theory and demonstrate a significant improvement over other methods in noise-robust deep learning. △ Less

Submitted 18 March, 2021; v1 submitted 28 June, 2020; originally announced June 2020.

Comments: to appear in ICLR 2021

arXiv:2006.14481 [pdf, other]

Active Online Learning with Hidden Shifting Domains

Authors: Yining Chen, Haipeng Luo, Tengyu Ma, Chicheng Zhang

Abstract: Online machine learning systems need to adapt to domain shifts. Meanwhile, acquiring label at every timestep is expensive. We propose a surprisingly simple algorithm that adaptively balances its regret and its number of label queries in settings where the data streams are from a mixture of hidden domains. For online linear regression with oblivious adversaries, we provide a tight tradeoff that dep… ▽ More Online machine learning systems need to adapt to domain shifts. Meanwhile, acquiring label at every timestep is expensive. We propose a surprisingly simple algorithm that adaptively balances its regret and its number of label queries in settings where the data streams are from a mixture of hidden domains. For online linear regression with oblivious adversaries, we provide a tight tradeoff that depends on the durations and dimensionalities of the hidden domains. Our algorithm can adaptively deal with interleaving spans of inputs from different domains. We also generalize our results to non-linear regression for hypothesis classes with bounded eluder dimension and adaptive adversaries. Experiments on synthetic and realistic datasets demonstrate that our algorithm achieves lower regret than uniform queries and greedy queries with equal labeling budget. △ Less

Submitted 25 February, 2021; v1 submitted 25 June, 2020; originally announced June 2020.

arXiv:2006.10288 [pdf, other]

Individual Calibration with Randomized Forecasting

Authors: Shengjia Zhao, Tengyu Ma, Stefano Ermon

Abstract: Machine learning applications often require calibrated predictions, e.g. a 90\% credible interval should contain the true outcome 90\% of the times. However, typical definitions of calibration only require this to hold on average, and offer no guarantees on predictions made on individual samples. Thus, predictions can be systematically over or under confident on certain subgroups, leading to issue… ▽ More Machine learning applications often require calibrated predictions, e.g. a 90\% credible interval should contain the true outcome 90\% of the times. However, typical definitions of calibration only require this to hold on average, and offer no guarantees on predictions made on individual samples. Thus, predictions can be systematically over or under confident on certain subgroups, leading to issues of fairness and potential vulnerabilities. We show that calibration for individual samples is possible in the regression setup if the predictions are randomized, i.e. outputting randomized credible intervals. Randomization removes systematic bias by trading off bias with variance. We design a training objective to enforce individual calibration and use it to train randomized regression functions. The resulting models are more calibrated for arbitrarily chosen subgroups of the data, and can achieve higher utility in decision making against adversaries that exploit miscalibrated predictions. △ Less

Submitted 9 September, 2020; v1 submitted 18 June, 2020; originally announced June 2020.

arXiv:2006.10032 [pdf, other]

Self-training Avoids Using Spurious Features Under Domain Shift

Authors: Yining Chen, Colin Wei, Ananya Kumar, Tengyu Ma

Abstract: In unsupervised domain adaptation, existing theory focuses on situations where the source and target domains are close. In practice, conditional entropy minimization and pseudo-labeling work even when the domain shifts are much larger than those analyzed by existing theory. We identify and analyze one particular setting where the domain shift can be large, but these algorithms provably work: certa… ▽ More In unsupervised domain adaptation, existing theory focuses on situations where the source and target domains are close. In practice, conditional entropy minimization and pseudo-labeling work even when the domain shifts are much larger than those analyzed by existing theory. We identify and analyze one particular setting where the domain shift can be large, but these algorithms provably work: certain spurious features correlate with the label in the source domain but are independent of the label in the target. Our analysis considers linear classification where the spurious features are Gaussian and the non-spurious features are a mixture of log-concave distributions. For this setting, we prove that entropy minimization on unlabeled target data will avoid using the spurious feature if initialized with a decently accurate source classifier, even though the objective is non-convex and contains multiple bad local minima using the spurious features. We verify our theory for spurious domain shift tasks on semi-synthetic Celeb-A and MNIST datasets. Our results suggest that practitioners collect and self-train on large, diverse datasets to reduce biases in classifiers even if labeling is impractical. △ Less

Submitted 7 December, 2020; v1 submitted 17 June, 2020; originally announced June 2020.

arXiv:2006.08950 [pdf, other]

Federated Accelerated Stochastic Gradient Descent

Authors: Honglin Yuan, Tengyu Ma

Abstract: We propose Federated Accelerated Stochastic Gradient Descent (FedAc), a principled acceleration of Federated Averaging (FedAvg, also known as Local SGD) for distributed optimization. FedAc is the first provable acceleration of FedAvg that improves convergence speed and communication efficiency on various types of convex functions. For example, for strongly convex and smooth functions, when using… ▽ More We propose Federated Accelerated Stochastic Gradient Descent (FedAc), a principled acceleration of Federated Averaging (FedAvg, also known as Local SGD) for distributed optimization. FedAc is the first provable acceleration of FedAvg that improves convergence speed and communication efficiency on various types of convex functions. For example, for strongly convex and smooth functions, when using $M$ workers, the previous state-of-the-art FedAvg analysis can achieve a linear speedup in $M$ if given $M$ rounds of synchronization, whereas FedAc only requires $M^{\frac{1}{3}}$ rounds. Moreover, we prove stronger guarantees for FedAc when the objectives are third-order smooth. Our technique is based on a potential-based perturbed iterate analysis, a novel stability analysis of generalized accelerated SGD, and a strategic tradeoff between acceleration and stability. △ Less

Submitted 5 June, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

Comments: Accepted to NeurIPS 2020. Best paper in International Workshop on Federated Learning for User Privacy and Data Confidentiality in Conjunction with ICML 2020 (FL-ICML'20). Code repository see https://github.com/hongliny/FedAc-NeurIPS20

arXiv:2006.08875 [pdf, other]

Model-based Adversarial Meta-Reinforcement Learning

Authors: Zichuan Lin, Garrett Thomas, Guangwen Yang, Tengyu Ma

Abstract: Meta-reinforcement learning (meta-RL) aims to learn from multiple training tasks the ability to adapt efficiently to unseen test tasks. Despite the success, existing meta-RL algorithms are known to be sensitive to the task distribution shift. When the test task distribution is different from the training task distribution, the performance may degrade significantly. To address this issue, this pape… ▽ More Meta-reinforcement learning (meta-RL) aims to learn from multiple training tasks the ability to adapt efficiently to unseen test tasks. Despite the success, existing meta-RL algorithms are known to be sensitive to the task distribution shift. When the test task distribution is different from the training task distribution, the performance may degrade significantly. To address this issue, this paper proposes Model-based Adversarial Meta-Reinforcement Learning (AdMRL), where we aim to minimize the worst-case sub-optimality gap -- the difference between the optimal return and the return that the algorithm achieves after adaptation -- across all tasks in a family of tasks, with a model-based approach. We propose a minimax objective and optimize it by alternating between learning the dynamics model on a fixed task and finding the adversarial task for the current model -- the task for which the policy induced by the model is maximally suboptimal. Assuming the family of tasks is parameterized, we derive a formula for the gradient of the suboptimality with respect to the task parameters via the implicit function theorem, and show how the gradient estimator can be efficiently implemented by the conjugate gradient method and a novel use of the REINFORCE estimator. We evaluate our approach on several continuous control benchmarks and demonstrate its efficacy in the worst-case performance over all tasks, the generalization power to out-of-distribution tasks, and in training and test time sample efficiency, over existing state-of-the-art meta-RL algorithms. △ Less

Submitted 27 February, 2021; v1 submitted 15 June, 2020; originally announced June 2020.

Comments: Accepted by NeurIPS 2020. Code at https://github.com/LinZichuan/AdMRL

Showing 1–50 of 112 results for author: Ma, T