Search | arXiv e-print repository

Introducing v0.5 of the AI Safety Benchmark from MLCommons

Authors: Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Max Bartolo, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bomassani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller , et al. (75 additional authors not shown)

Abstract: This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-pu… ▽ More This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark. △ Less

Submitted 13 May, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

arXiv:2404.05870 [pdf, other]

CoBT: Collaborative Programming of Behaviour Trees from One Demonstration for Robot Manipulation

Authors: Aayush Jain, Philip Long, Valeria Villani, John D. Kelleher, Maria Chiara Leva

Abstract: Mass customization and shorter manufacturing cycles are becoming more important among small and medium-sized companies. However, classical industrial robots struggle to cope with product variation and dynamic environments. In this paper, we present CoBT, a collaborative programming by demonstration framework for generating reactive and modular behavior trees. CoBT relies on a single demonstration… ▽ More Mass customization and shorter manufacturing cycles are becoming more important among small and medium-sized companies. However, classical industrial robots struggle to cope with product variation and dynamic environments. In this paper, we present CoBT, a collaborative programming by demonstration framework for generating reactive and modular behavior trees. CoBT relies on a single demonstration and a combination of data-driven machine learning methods with logic-based declarative learning to learn a task, thus eliminating the need for programming expertise or long development times. The proposed framework is experimentally validated on 7 manipulation tasks and we show that CoBT achieves approx. 93% success rate overall with an average of 7.5s programming time. We conduct a pilot study with non-expert users to provide feedback regarding the usability of CoBT. △ Less

Submitted 10 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

Comments: Accepted for presentation at IEEE ICRA 2024

arXiv:2401.15687 [pdf, other]

Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance

Authors: Qingcheng Zhao, Pengyu Long, Qixuan Zhang, Dafei Qin, Han Liang, Longwen Zhang, Yingliang Zhang, **gyi Yu, Lan Xu

Abstract: The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and well-annotated abundant multi-modality labels, previous methods often suffer from limited realism and a lack of lexible conditioning. We address this challenge through a trilogy. We first introduce Generalized Neural Parametric Facial Asset (GNPFA), an effic… ▽ More The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and well-annotated abundant multi-modality labels, previous methods often suffer from limited realism and a lack of lexible conditioning. We address this challenge through a trilogy. We first introduce Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder map** facial geometry and images to a highly generalized expression latent space, decoupling expressions and identities. Then, we utilize GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This presents the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial animation dataset with well-annotated emotional and style labels. Finally, we propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation, accepting rich multi-modality guidances from audio, text, and image. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation. △ Less

Submitted 30 January, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

Comments: Project Page: https://sites.google.com/view/media2face

arXiv:2312.01661 [pdf, other]

ChatGPT as a Math Questioner? Evaluating ChatGPT on Generating Pre-university Math Questions

Authors: Phuoc Pham Van Long, Duc Anh Vu, Nhat M. Hoang, Xuan Long Do, Anh Tuan Luu

Abstract: Mathematical questioning is crucial for assessing students problem-solving skills. Since manually creating such questions requires substantial effort, automatic methods have been explored. Existing state-of-the-art models rely on fine-tuning strategies and struggle to generate questions that heavily involve multiple steps of logical and arithmetic reasoning. Meanwhile, large language models(LLMs)… ▽ More Mathematical questioning is crucial for assessing students problem-solving skills. Since manually creating such questions requires substantial effort, automatic methods have been explored. Existing state-of-the-art models rely on fine-tuning strategies and struggle to generate questions that heavily involve multiple steps of logical and arithmetic reasoning. Meanwhile, large language models(LLMs) such as ChatGPT have excelled in many NLP tasks involving logical and arithmetic reasoning. Nonetheless, their applications in generating educational questions are underutilized, especially in the field of mathematics. To bridge this gap, we take the first step to conduct an in-depth analysis of ChatGPT in generating pre-university math questions. Our analysis is categorized into two main settings: context-aware and context-unaware. In the context-aware setting, we evaluate ChatGPT on existing math question-answering benchmarks covering elementary, secondary, and ternary classes. In the context-unaware setting, we evaluate ChatGPT in generating math questions for each lesson from pre-university math curriculums that we crawl. Our crawling results in TopicMath, a comprehensive and novel collection of pre-university math curriculums collected from 121 math topics and 428 lessons from elementary, secondary, and tertiary classes. Through this analysis, we aim to provide insight into the potential of ChatGPT as a math questioner. △ Less

Submitted 27 February, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

Comments: Accepted at the 39th ACM/SIGAPP Symposium On Applied Computing (SAC 2024), Main Conference

arXiv:2309.12488 [pdf, other]

Sharpness-Aware Minimization and the Edge of Stability

Authors: Philip M. Long, Peter L. Bartlett

Abstract: Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $η$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/η$, after which it fluctuates around this value. The quantity $2/η$ has been called the "edge of stability" based on consideration of a local quadratic approximation of the loss. We perform a… ▽ More Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $η$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/η$, after which it fluctuates around this value. The quantity $2/η$ has been called the "edge of stability" based on consideration of a local quadratic approximation of the loss. We perform a similar calculation to arrive at an "edge of stability" for Sharpness-Aware Minimization (SAM), a variant of GD which has been shown to improve its generalization. Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis. △ Less

Submitted 5 June, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

arXiv:2307.01481 [pdf, other]

Equivalence, Identity, and Unitarity Checking in Black-Box Testing of Quantum Programs

Authors: Peixun Long, Jianjun Zhao

Abstract: Quantum programs exhibit inherent non-deterministic behavior, which poses more significant challenges for error discovery compared to classical programs. While several testing methods have been proposed for quantum programs, they often overlook fundamental questions in black-box testing. In this paper, we bridge this gap by presenting three novel algorithms specifically designed to address the cha… ▽ More Quantum programs exhibit inherent non-deterministic behavior, which poses more significant challenges for error discovery compared to classical programs. While several testing methods have been proposed for quantum programs, they often overlook fundamental questions in black-box testing. In this paper, we bridge this gap by presenting three novel algorithms specifically designed to address the challenges of equivalence, identity, and unitarity checking in black-box testing of quantum programs. We also explore optimization techniques for these algorithms, including specialized versions for equivalence and unitarity checking, and provide valuable insights into parameter selection to maximize performance and effectiveness. To evaluate the effectiveness of our proposed methods, we conducted comprehensive experimental evaluations, which demonstrate that our methods can rigorously perform equivalence, identity, and unitarity checking, offering robust support for black-box testing of quantum programs. △ Less

Submitted 24 May, 2024; v1 submitted 4 July, 2023; originally announced July 2023.

Comments: 23 pages

arXiv:2306.17407 [pdf, other]

Testing Multi-Subroutine Quantum Programs: From Unit Testing to Integration Testing

Authors: Peixun Long, Jianjun Zhao

Abstract: Quantum computing has emerged as a promising field with the potential to revolutionize various domains by harnessing the principles of quantum mechanics. As quantum hardware and algorithms continue to advance, develo** high-quality quantum software has become crucial. However, testing quantum programs poses unique challenges due to the distinctive characteristics of quantum systems and the compl… ▽ More Quantum computing has emerged as a promising field with the potential to revolutionize various domains by harnessing the principles of quantum mechanics. As quantum hardware and algorithms continue to advance, develo** high-quality quantum software has become crucial. However, testing quantum programs poses unique challenges due to the distinctive characteristics of quantum systems and the complexity of multi-subroutine programs. This paper addresses the specific testing requirements of multi-subroutine quantum programs. We begin by investigating critical properties by surveying existing quantum libraries and providing insights into the challenges of testing these programs. Building upon this understanding, we focus on testing criteria and techniques based on the whole testing process perspective, spanning from unit testing to integration testing. We delve into various aspects, including IO analysis, quantum relation checking, structural testing, behavior testing, integration of subroutine pairs, and test case generation. We also introduce novel testing principles and criteria to guide the testing process. We conduct comprehensive testing on typical quantum subroutines, including diverse mutants and randomized inputs, to evaluate our proposed approach. The analysis of failures provides valuable insights into the effectiveness of our testing methodology. Additionally, we present case studies on representative multi-subroutine quantum programs, demonstrating the practical application and effectiveness of our proposed testing principles and criteria. △ Less

Submitted 24 May, 2024; v1 submitted 30 June, 2023; originally announced June 2023.

Comments: 62 pages

arXiv:2304.11059 [pdf, ps, other]

doi 10.1006/jcss.1997.1557

Prediction, Learning, Uniform Convergence, and Scale-sensitive Dimensions

Authors: Peter L. Bartlett, Philip M. Long

Abstract: We present a new general-purpose algorithm for learning classes of $[0,1]$-valued functions in a generalization of the prediction model, and prove a general upper bound on the expected absolute error of this algorithm in terms of a scale-sensitive generalization of the Vapnik dimension proposed by Alon, Ben-David, Cesa-Bianchi and Haussler. We give lower bounds implying that our upper bounds canno… ▽ More We present a new general-purpose algorithm for learning classes of $[0,1]$-valued functions in a generalization of the prediction model, and prove a general upper bound on the expected absolute error of this algorithm in terms of a scale-sensitive generalization of the Vapnik dimension proposed by Alon, Ben-David, Cesa-Bianchi and Haussler. We give lower bounds implying that our upper bounds cannot be improved by more than a constant factor in general. We apply this result, together with techniques due to Haussler and to Benedek and Itai, to obtain new upper bounds on packing numbers in terms of this scale-sensitive notion of dimension. Using a different technique, we obtain new bounds on packing numbers in terms of Kearns and Schapire's fat-shattering function. We show how to apply both packing bounds to obtain improved general bounds on the sample complexity of agnostic learning. For each $ε> 0$, we establish weaker sufficient and stronger necessary conditions for a class of $[0,1]$-valued functions to be agnostically learnable to within $ε$, and to be an $ε$-uniform Glivenko-Cantelli class. This is a manuscript that was accepted by JCSS, together with a correction. △ Less

Submitted 24 April, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

Comments: One header page, a three page correction, then a 28 page original paper

arXiv:2210.09540 [pdf, other]

Contact-Implicit Planning and Control for Non-Prehensile Manipulation Using State-Triggered Constraints

Authors: Maozhen Wang, Aykut Ozgun Onol, Philip Long, Taskin Padir

Abstract: We present a contact-implicit planning approach that can generate contact-interaction trajectories for non-prehensile manipulation problems without tuning or a tailored initial guess and with high success rates. This is achieved by leveraging the concept of state-triggered constraints (STCs) to capture the hybrid dynamics induced by discrete contact modes without explicitly reasoning about the com… ▽ More We present a contact-implicit planning approach that can generate contact-interaction trajectories for non-prehensile manipulation problems without tuning or a tailored initial guess and with high success rates. This is achieved by leveraging the concept of state-triggered constraints (STCs) to capture the hybrid dynamics induced by discrete contact modes without explicitly reasoning about the combinatorics. STCs enable triggering arbitrary constraints by a strict inequality condition in a continuous way. We first use STCs to develop an automatic contact constraint activation method to minimize the effective constraint space based on the utility of contact candidates for a given task. Then, we introduce a re-formulation of the Coulomb friction model based on STCs that is more efficient for the discovery of tangential forces than the well-studied complementarity constraints-based approach. Last, we include the proposed friction model in the planning and control of quasi-static planar pushing. The performance of the STC-based contact activation and friction methods is evaluated by extensive simulation experiments in a dynamic pushing scenario. The results demonstrate that our methods outperform the baselines based on complementarity constraints with a significant decrease in the planning time and a higher success rate. We then compare the proposed quasi-static pushing controller against a mixed-integer programming-based approach in simulation and find that our method is computationally more efficient and provides a better tracking accuracy, with the added benefit of not requiring an initial control trajectory. Finally, we present hardware experiments demonstrating the usability of our framework in executing complex trajectories in real-time even with a low-accuracy tracking system. △ Less

Submitted 17 October, 2022; originally announced October 2022.

Comments: 16 pages, The International Symposium on Robotics Research 2022

arXiv:2210.01513 [pdf, other]

The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima

Authors: Peter L. Bartlett, Philip M. Long, Olivier Bousquet

Abstract: We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems. We show that when SAM is applied with a convex quadratic objective, for most random initializations it converges to a cycle that oscillates between either side of the minimum in the direction with the largest c… ▽ More We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems. We show that when SAM is applied with a convex quadratic objective, for most random initializations it converges to a cycle that oscillates between either side of the minimum in the direction with the largest curvature, and we provide bounds on the rate of convergence. In the non-quadratic case, we show that such oscillations effectively perform gradient descent, with a smaller step-size, on the spectral norm of the Hessian. In such cases, SAM's update may be regarded as a third derivative -- the derivative of the Hessian in the leading eigenvector direction -- that encourages drift toward wider minima. △ Less

Submitted 11 April, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

arXiv:2209.09315 [pdf, other]

Deep Linear Networks can Benignly Overfit when Shallow Ones Do

Authors: Niladri S. Chatterji, Philip M. Long

Abstract: We bound the excess risk of interpolating deep linear networks trained using gradient flow. In a setting previously used to establish risk bounds for the minimum $\ell_2$-norm interpolant, we show that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum $\ell_2$-norm interpolant. Our analysis also reveals that interpolating deep linear model… ▽ More We bound the excess risk of interpolating deep linear networks trained using gradient flow. In a setting previously used to establish risk bounds for the minimum $\ell_2$-norm interpolant, we show that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum $\ell_2$-norm interpolant. Our analysis also reveals that interpolating deep linear models have exactly the same conditional variance as the minimum $\ell_2$-norm solution. Since the noise affects the excess risk only through the conditional variance, this implies that depth does not improve the algorithm's ability to "hide the noise". Our simulations verify that aspects of our bounds reflect typical behavior for simple data distributions. We also find that similar phenomena are seen in simulations with ReLU networks, although the situation there is more nuanced. △ Less

Submitted 6 February, 2023; v1 submitted 19 September, 2022; originally announced September 2022.

arXiv:2208.09206 [pdf, other]

Testing Quantum Programs with Multiple Subroutines

Authors: Peixun Long, Jianjun Zhao

Abstract: Errors in quantum programs are challenging to track down due to the uncertainty of quantum programs. Testing is, therefore, an indispensable method for assuring the quality of quantum software. Existing testing methods focus only on testing quantum programs with quantum circuits or single subroutines and, therefore, cannot effectively test quantum programs with multi-subroutines. In this paper, we… ▽ More Errors in quantum programs are challenging to track down due to the uncertainty of quantum programs. Testing is, therefore, an indispensable method for assuring the quality of quantum software. Existing testing methods focus only on testing quantum programs with quantum circuits or single subroutines and, therefore, cannot effectively test quantum programs with multi-subroutines. In this paper, we first discuss several critical issues that must be considered when testing multi-subroutine quantum programs and point out the limitations and problems with existing testing methods. We then present a novel framework for testing multi-subroutine quantum programs that allow for both unit and integration testing. Our framework includes two novel test coverage criteria for the equivalent class partition of quantum variables to guide our testing tasks and techniques to test quantum programs with several common patterns. We also discuss how to generate test cases based on our framework. To evaluate the effectiveness of our testing framework, we implemented a tool called QSharpTester for testing Q\# programs with multiple subroutines. We used it to conduct experiments on hundreds of mutation programs deriving from seven original Q\# programs. The experimental results show that our testing methods can deal with broader types of quantum programs than existing ones and perform well on almost all faulty mutation programs. △ Less

Submitted 26 February, 2023; v1 submitted 19 August, 2022; originally announced August 2022.

Comments: 14 pages, 9 figures

arXiv:2203.03365 [pdf]

Machine learning using longitudinal prescription and medical claims for the detection of nonalcoholic steatohepatitis (NASH)

Authors: Ozge Yasar, Patrick Long, Brett Harder, Hanna Marshall, Sanjay Bhasin, Suyin Lee, Mark Delegge, Stephanie Roy, Orla Doyle, Nadea Leavitt, John Rigg

Abstract: Objectives To develop and evaluate machine learning models to detect suspected undiagnosed nonalcoholic steatohepatitis (NASH) patients for diagnostic screening and clinical management. Methods In this retrospective observational noninterventional study using administrative medical claims data from 1,463,089 patients, gradient-boosted decision trees were trained to detect likely NASH patients fr… ▽ More Objectives To develop and evaluate machine learning models to detect suspected undiagnosed nonalcoholic steatohepatitis (NASH) patients for diagnostic screening and clinical management. Methods In this retrospective observational noninterventional study using administrative medical claims data from 1,463,089 patients, gradient-boosted decision trees were trained to detect likely NASH patients from an at-risk patient population with a history of obesity, type 2 diabetes mellitus (T2DM), metabolic disorder, or nonalcoholic fatty liver (NAFL). Models were trained to detect likely NASH in all at-risk patients or in the subset without a prior NAFL diagnosis (non-NAFL at-risk patients). Models were trained and validated using retrospective medical claims data and assessed using area under precision recall and receiver operating characteristic curves (AUPRCs, AUROCs). Results The 6-month incidence of NASH in claims data was 1 per 1,437 at-risk patients and 1 per 2,127 non-NAFL at-risk patients. The model trained to detect NASH in all at-risk patients had an AUPRC of 0.0107 (95% CI 0.0104 - 0.011) and an AUROC of 0.84. At 10% recall, model precision was 4.3%, which is 60x above NASH incidence. The model trained to detect NASH in non-NAFL patients had an AUPRC of 0.003 (95% CI 0.0029 - 0.0031) and an AUROC of 0.78. At 10% recall, model precision was 1%, which is 20x above NASH incidence. Conclusion The low incidence of NASH in medical claims data corroborates the pattern of NASH underdiagnosis in clinical practice. Claims-based machine learning could facilitate the detection of probable NASH patients for diagnostic testing and disease management. △ Less

Submitted 7 March, 2022; originally announced March 2022.

Comments: 22 pages, 4 figures

arXiv:2112.04590 [pdf, ps, other]

The perils of being unhinged: On the accuracy of classifiers minimizing a noise-robust convex loss

Authors: Philip M. Long, Rocco A. Servedio

Abstract: Van Rooyen et al. introduced a notion of convex loss functions being robust to random classification noise, and established that the "unhinged" loss function is robust in this sense. In this note we study the accuracy of binary classifiers obtained by minimizing the unhinged loss, and observe that even for simple linearly separable data distributions, minimizing the unhinged loss may only yield a… ▽ More Van Rooyen et al. introduced a notion of convex loss functions being robust to random classification noise, and established that the "unhinged" loss function is robust in this sense. In this note we study the accuracy of binary classifiers obtained by minimizing the unhinged loss, and observe that even for simple linearly separable data distributions, minimizing the unhinged loss may only yield a binary classifier with accuracy no better than random guessing. △ Less

Submitted 4 March, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

arXiv:2110.02914 [pdf, ps, other]

Foolish Crowds Support Benign Overfitting

Authors: Niladri S. Chatterji, Philip M. Long

Abstract: We prove a lower bound on the excess risk of sparse interpolating procedures for linear regression with Gaussian data in the overparameterized regime. We apply this result to obtain a lower bound for basis pursuit (the minimum $\ell_1$-norm interpolant) that implies that its excess risk can converge at an exponentially slower rate than OLS (the minimum $\ell_2$-norm interpolant), even when the gro… ▽ More We prove a lower bound on the excess risk of sparse interpolating procedures for linear regression with Gaussian data in the overparameterized regime. We apply this result to obtain a lower bound for basis pursuit (the minimum $\ell_1$-norm interpolant) that implies that its excess risk can converge at an exponentially slower rate than OLS (the minimum $\ell_2$-norm interpolant), even when the ground truth is sparse. Our analysis exposes the benefit of an effect analogous to the "wisdom of the crowd", except here the harm arising from fitting the $\textit{noise}$ is ameliorated by spreading it among many directions -- the variance reduction arises from a $\textit{foolish}$ crowd. △ Less

Submitted 17 March, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

arXiv:2108.11489 [pdf, ps, other]

The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks

Authors: Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett

Abstract: The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of $\textit{benign overfitting}$ has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained wit… ▽ More The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of $\textit{benign overfitting}$ has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub-Gaussianity and anti-concentration properties, and the noise is independent and sub-Gaussian. By leveraging recent results that characterize the implicit bias of this estimator, our bounds emphasize the role of both the quality of the initialization as well as the properties of the data covariance matrix in achieving low excess risk. △ Less

Submitted 9 September, 2022; v1 submitted 25 August, 2021; originally announced August 2021.

Comments: Accepted for publication at JMLR

arXiv:2106.01921 [pdf, ps, other]

doi 10.1002/sam.11559

Sample Selection Bias in Evaluation of Prediction Performance of Causal Models

Authors: James P. Long, Min ** Ha

Abstract: Causal models are notoriously difficult to validate because they make untestable assumptions regarding confounding. New scientific experiments offer the possibility of evaluating causal models using prediction performance. Prediction performance measures are typically robust to violations in causal assumptions. However, prediction performance does depend on the selection of training and test sets.… ▽ More Causal models are notoriously difficult to validate because they make untestable assumptions regarding confounding. New scientific experiments offer the possibility of evaluating causal models using prediction performance. Prediction performance measures are typically robust to violations in causal assumptions. However, prediction performance does depend on the selection of training and test sets. Biased training sets can lead to optimistic assessments of model performance. In this work, we revisit the prediction performance of several recently proposed causal models tested on a genetic perturbation data set of Kemmeren. We find that sample selection bias is likely a key driver of model performance. We propose using a less-biased evaluation set for assessing prediction performance and compare models on this new set. In this setting, the causal models have similar or worse performance compared to standard association-based estimators such as Lasso. Finally, we compare the performance of causal estimators in simulation studies that reproduce the Kemmeren structure of genetic knockout experiments but without any sample selection bias. These results provide an improved understanding of the performance of several causal models and offer guidance on how future studies should use Kemmeren. △ Less

Submitted 26 October, 2021; v1 submitted 3 June, 2021; originally announced June 2021.

Comments: 12 pages, 4 figures, 2 tables

arXiv:2105.10585 [pdf, ps, other]

Properties of the After Kernel

Authors: Philip M. Long

Abstract: The Neural Tangent Kernel (NTK) is the wide-network limit of a kernel defined using neural networks at initialization, whose embedding is the gradient of the output of the network with respect to its parameters. We study the "after kernel", which is defined using the same embedding, except after training, for neural networks with standard architectures, on binary classification problems extracted… ▽ More The Neural Tangent Kernel (NTK) is the wide-network limit of a kernel defined using neural networks at initialization, whose embedding is the gradient of the output of the network with respect to its parameters. We study the "after kernel", which is defined using the same embedding, except after training, for neural networks with standard architectures, on binary classification problems extracted from MNIST and CIFAR-10, trained using SGD in a standard way. For some dataset-architecture pairs, after a few epochs of neural network training, a hard-margin SVM using the network's after kernel is much more accurate than when the network's initial kernel is used. For networks with an architecture similar to VGG, the after kernel is more "global", in the sense that it is less invariant to transformations of input images that disrupt the global structure of the image while leaving the local statistics largely intact. For fully connected networks, the after kernel is less global in this sense. The after kernel tends to be more invariant to small shifts, rotations and zooms; data augmentation does not improve these invariances. The (finite approximation to the) conjugate kernel, obtained using the last layer of hidden nodes, sometimes, but not always, provides a good approximation to the NTK and the after kernel. Training a network with a larger learning rate (while holding the training error constant) produces a better kernel, as measured by the test error of a hard-margin SVM. The after kernels of networks trained with larger learning rates tend to be more global, and more invariant to small shifts, rotations and zooms. △ Less

Submitted 13 December, 2021; v1 submitted 21 May, 2021; originally announced May 2021.

arXiv:2102.04998 [pdf, ps, other]

When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?

Authors: Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett

Abstract: We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence. Our analysis applies for smoothed approximations to the ReLU, such as Swish and the Huberized ReLU, proposed in previous applied work. We provide two sufficient conditions for convergence. The first is simply a bound on the loss at… ▽ More We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence. Our analysis applies for smoothed approximations to the ReLU, such as Swish and the Huberized ReLU, proposed in previous applied work. We provide two sufficient conditions for convergence. The first is simply a bound on the loss at initialization. The second is a data separation condition used in prior analyses. △ Less

Submitted 1 July, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

arXiv:2012.02409 [pdf, other]

When does gradient descent with logistic loss find interpolating two-layer networks?

Authors: Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett

Abstract: We study the training of finite-width two-layer smoothed ReLU networks for binary classification using the logistic loss. We show that gradient descent drives the training loss to zero if the initial loss is small enough. When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the… ▽ More We study the training of finite-width two-layer smoothed ReLU networks for binary classification using the logistic loss. We show that gradient descent drives the training loss to zero if the initial loss is small enough. When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies. △ Less

Submitted 1 July, 2021; v1 submitted 4 December, 2020; originally announced December 2020.

arXiv:2011.04848 [pdf, other]

AES: Autonomous Excavator System for Real-World and Hazardous Environments

Authors: **xin Zhao, Pinxin Long, Liyang Wang, Lingfeng Qian, Feixiang Lu, Xibin Song, Dinesh Manocha, Liangjun Zhang

Abstract: Excavators are widely used for material-handling applications in unstructured environments, including mining and construction. The size of the global market of excavators is 44.12 Billion USD in 2018 and is predicted to grow to 63.14 Billion USD by 2026. Operating excavators in a real-world environment can be challenging due to extreme conditions and rock sliding, ground collapse, or exceeding dus… ▽ More Excavators are widely used for material-handling applications in unstructured environments, including mining and construction. The size of the global market of excavators is 44.12 Billion USD in 2018 and is predicted to grow to 63.14 Billion USD by 2026. Operating excavators in a real-world environment can be challenging due to extreme conditions and rock sliding, ground collapse, or exceeding dust. Multiple fatalities and injuries occur each year during excavations. An autonomous excavator that can substitute human operators in these hazardous environments would substantially lower the number of injuries and can improve the overall productivity. △ Less

Submitted 9 November, 2020; originally announced November 2020.

arXiv:2010.14038 [pdf, other]

Optimization-Based Framework for Excavation Trajectory Generation

Authors: Yajue Yang, Pinxin Long, Jia Pan, Xinbin Song, Liangjun Zhang

Abstract: In this paper, we present a novel optimization-based framework for autonomous excavator trajectory generation under various objectives, including minimum joint displacement and minimum time. Traditional methods on excavation trajectory generation usually separate the excavation motion into a sequence of fixed phases, resulting in limited trajectory searching space. Our framework explores the space… ▽ More In this paper, we present a novel optimization-based framework for autonomous excavator trajectory generation under various objectives, including minimum joint displacement and minimum time. Traditional methods on excavation trajectory generation usually separate the excavation motion into a sequence of fixed phases, resulting in limited trajectory searching space. Our framework explores the space of all possible excavation trajectories represented with waypoints interpolated by a polynomial spline, thereby enabling optimization over a larger searching space. We formulate a generic task specification for excavation by constraining the instantaneous motion of the bucket and further add a target-oriented constraint, i.e. swept volume that indicates the estimated amount of excavated materials. To formulate time related objectives and constraints, we introduce time intervals between waypoints as variables into the optimization framework. We implement the proposed framework and evaluate its performance on a UR5 robotic arm. The experimental results demonstrate that the generated trajectories are able to excavate sufficient mass of soil for different terrain shapes and have 60% shorter minimal length than traditional excavation methods. We further compare our one-stage time optimal trajectory generation with the two-stage method. The result shows that trajectories generated by our one-stage method cost 18% less time on average. △ Less

Submitted 26 October, 2020; originally announced October 2020.

arXiv:2010.08479 [pdf, ps, other]

Failures of model-dependent generalization bounds for least-norm interpolation

Authors: Peter L. Bartlett, Philip M. Long

Abstract: We consider bounds on the generalization performance of the least-norm linear regressor, in the over-parameterized regime where it can interpolate the data. We describe a sense in which any generalization bound of a type that is commonly proved in statistical learning theory must sometimes be very loose when applied to analyze the least-norm interpolant. In particular, for a variety of natural joi… ▽ More We consider bounds on the generalization performance of the least-norm linear regressor, in the over-parameterized regime where it can interpolate the data. We describe a sense in which any generalization bound of a type that is commonly proved in statistical learning theory must sometimes be very loose when applied to analyze the least-norm interpolant. In particular, for a variety of natural joint distributions on training examples, any valid generalization bound that depends only on the output of the learning algorithm, the number of training examples, and the confidence parameter, and that satisfies a mild condition (substantially weaker than monotonicity in sample size), must sometimes be very loose -- it can be bounded below by a constant when the true excess risk goes to zero. △ Less

Submitted 20 January, 2021; v1 submitted 16 October, 2020; originally announced October 2020.

Journal ref: JMLR, 22(204):1-15, 2021

arXiv:2006.06176 [pdf, other]

Tuning-Free Contact-Implicit Trajectory Optimization

Authors: Aykut Ozgun Onol, Radu Corcodel, Philip Long, Taskin Padir

Abstract: We present a contact-implicit trajectory optimization framework that can plan contact-interaction trajectories for different robot architectures and tasks using a trivial initial guess and without requiring any parameter tuning. This is achieved by using a relaxed contact model along with an automatic penalty adjustment loop for suppressing the relaxation. Moreover, the structure of the problem en… ▽ More We present a contact-implicit trajectory optimization framework that can plan contact-interaction trajectories for different robot architectures and tasks using a trivial initial guess and without requiring any parameter tuning. This is achieved by using a relaxed contact model along with an automatic penalty adjustment loop for suppressing the relaxation. Moreover, the structure of the problem enables us to exploit the contact information implied by the use of relaxation in the previous iteration, such that the solution is explicitly improved with little computational overhead. We test the proposed approach in simulation experiments for non-prehensile manipulation using a 7-DOF arm and a mobile robot and for planar locomotion using a humanoid-like robot in zero gravity. The results demonstrate that our method provides an out-of-the-box solution with good performance for a wide range of applications. △ Less

Submitted 11 June, 2020; originally announced June 2020.

Comments: IEEE International Conference on Robotics and Automation (ICRA) 2020

arXiv:2006.00811 [pdf, other]

Time Variable Minimum Torque Trajectory Optimization for Autonomous Excavator

Authors: Yajue Yang, Jia Pan, Pinxin Long, Xibin Song, Liangjun Zhang

Abstract: In this paper, we present a minimal torque and time variable trajectory optimization method for autonomous excavator considering the soil-tool interaction. The method formulates the excavation motion generation as a trajectory optimization problem and takes into account geometric, kinematic and dynamics constraints. To generate time-efficient trajectory and improve the overall optimization efficie… ▽ More In this paper, we present a minimal torque and time variable trajectory optimization method for autonomous excavator considering the soil-tool interaction. The method formulates the excavation motion generation as a trajectory optimization problem and takes into account geometric, kinematic and dynamics constraints. To generate time-efficient trajectory and improve the overall optimization efficiency, we propose a time variable trajectory optimization mechanism so that the time intervals between the keypoints along the trajectory subject to the optimization. As a result, the method uses few keypoints and reduces the total number of optimization variables. We further introduce a soil-tool interaction force model, which considers the geometric shape of the bucket and the physical properties of the soil. The experimental result on a high fidelity dynamic simulator shows our method can generate feasible trajectories, which satisfy excavation task constraints and are adaptive to different soil conditions. △ Less

Submitted 1 June, 2020; originally announced June 2020.

arXiv:2004.12019 [pdf, other]

Finite-sample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime

Authors: Niladri S. Chatterji, Philip M. Long

Abstract: We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification. For linearly separable training data, the maximum margin algorithm has been shown in previous work to be equivalent to a limit of training with logistic loss using gradient descent, as the training error is driven to zero. We analyze this algorithm applied to random data including misclassif… ▽ More We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification. For linearly separable training data, the maximum margin algorithm has been shown in previous work to be equivalent to a limit of training with logistic loss using gradient descent, as the training error is driven to zero. We analyze this algorithm applied to random data including misclassification noise. Our assumptions on the clean data include the case in which the class-conditional distributions are standard normal distributions. The misclassification noise may be chosen by an adversary, subject to a limit on the fraction of corrupted labels. Our bounds show that, with sufficient over-parameterization, the maximum margin algorithm trained on noisy data can achieve nearly optimal population risk. △ Less

Submitted 1 June, 2021; v1 submitted 24 April, 2020; originally announced April 2020.

Comments: Corrected typographical errors from the previous version of this paper

arXiv:2003.01094 [pdf, other]

On the Global Convergence of Training Deep Linear ResNets

Authors: Difan Zou, Philip M. Long, Quanquan Gu

Abstract: We study the convergence of gradient descent (GD) and stochastic gradient descent (SGD) for training $L$-hidden-layer linear residual networks (ResNets). We prove that for training deep residual networks with certain linear transformations at input and output layers, which are fixed throughout training, both GD and SGD with zero initialization on all hidden weights can converge to the global minim… ▽ More We study the convergence of gradient descent (GD) and stochastic gradient descent (SGD) for training $L$-hidden-layer linear residual networks (ResNets). We prove that for training deep residual networks with certain linear transformations at input and output layers, which are fixed throughout training, both GD and SGD with zero initialization on all hidden weights can converge to the global minimum of the training loss. Moreover, when specializing to appropriate Gaussian random linear transformations, GD and SGD provably optimize wide enough deep linear ResNets. Compared with the global convergence result of GD for training standard deep linear networks (Du & Hu 2019), our condition on the neural network width is sharper by a factor of $O(κL)$, where $κ$ denotes the condition number of the covariance matrix of the training data. We further propose a modified identity input and output transformations, and show that a $(d+k)$-wide neural network is sufficient to guarantee the global convergence of GD/SGD, where $d,k$ are the input and output dimensions respectively. △ Less

Submitted 2 March, 2020; originally announced March 2020.

Comments: 26 pages, 1 figure. In ICLR 2020

arXiv:2002.00291 [pdf, ps, other]

Oracle Lower Bounds for Stochastic Gradient Sampling Algorithms

Authors: Niladri S. Chatterji, Peter L. Bartlett, Philip M. Long

Abstract: We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$, and prove an information theoretic lower bound on the number of stochastic gradient queries of the log density needed. Several popular sampling algorithms (including many Markov chain Monte Carlo methods) operate by using stochastic gradients of the log density to generate a sample; our results establish an… ▽ More We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$, and prove an information theoretic lower bound on the number of stochastic gradient queries of the log density needed. Several popular sampling algorithms (including many Markov chain Monte Carlo methods) operate by using stochastic gradients of the log density to generate a sample; our results establish an information theoretic limit for all these algorithms. We show that for every algorithm, there exists a well-conditioned strongly log-concave target density for which the distribution of points generated by the algorithm would be at least $\varepsilon$ away from the target in total variation distance if the number of gradient queries is less than $Ω(σ^2 d/\varepsilon^2)$, where $σ^2 d$ is the variance of the stochastic gradient. Our lower bound follows by combining the ideas of Le Cam deficiency routinely used in the comparison of statistical experiments along with standard information theoretic tools used in lower bounding Bayes risk functions. To the best of our knowledge our results provide the first nontrivial dimension-dependent lower bound for this problem. △ Less

Submitted 3 July, 2021; v1 submitted 1 February, 2020; originally announced February 2020.

Comments: 21 pages; accepted for publication at Bernoulli

arXiv:1910.09998 [pdf, other]

Learning Resilient Behaviors for Navigation Under Uncertainty

Authors: Tingxiang Fan, Pinxin Long, Wenxi Liu, Jia Pan, Ruigang Yang, Dinesh Manocha

Abstract: Deep reinforcement learning has great potential to acquire complex, adaptive behaviors for autonomous agents automatically. However, the underlying neural network polices have not been widely deployed in real-world applications, especially in these safety-critical tasks (e.g., autonomous driving). One of the reasons is that the learned policy cannot perform flexible and resilient behaviors as trad… ▽ More Deep reinforcement learning has great potential to acquire complex, adaptive behaviors for autonomous agents automatically. However, the underlying neural network polices have not been widely deployed in real-world applications, especially in these safety-critical tasks (e.g., autonomous driving). One of the reasons is that the learned policy cannot perform flexible and resilient behaviors as traditional methods to adapt to diverse environments. In this paper, we consider the problem that a mobile robot learns adaptive and resilient behaviors for navigating in unseen uncertain environments while avoiding collisions. We present a novel approach for uncertainty-aware navigation by introducing an uncertainty-aware predictor to model the environmental uncertainty, and we propose a novel uncertainty-aware navigation network to learn resilient behaviors in the prior unknown environments. To train the proposed uncertainty-aware network more stably and efficiently, we present the temperature decay training paradigm, which balances exploration and exploitation during the training process. Our experimental evaluation demonstrates that our approach can learn resilient behaviors in diverse environments and generate adaptive trajectories according to environmental uncertainties. △ Less

Submitted 3 June, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

Comments: accepted to ICRA 2020

arXiv:1906.11300 [pdf, other]

doi 10.1073/pnas.1907378117

Benign Overfitting in Linear Regression

Authors: Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler

Abstract: The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for whic… ▽ More The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. By studying examples of data covariance properties that this characterization shows are required for benign overfitting, we find an important role for finite-dimensional data: the accuracy of the minimum norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of properties of the data distribution when the data lies in an infinite dimensional space versus when the data lies in a finite dimensional space whose dimension grows faster than the sample size. △ Less

Submitted 29 January, 2020; v1 submitted 26 June, 2019; originally announced June 2019.

arXiv:1905.12600 [pdf, other]

Generalization bounds for deep convolutional neural networks

Authors: Philip M. Long, Hanie Sedghi

Abstract: We prove bounds on the generalization error of convolutional networks. The bounds are in terms of the training loss, the number of parameters, the Lipschitz constant of the loss and the distance from the weights to the initial weights. They are independent of the number of pixels in the input, and the height and width of hidden feature maps. We present experiments using CIFAR-10 with varying hyper… ▽ More We prove bounds on the generalization error of convolutional networks. The bounds are in terms of the training loss, the number of parameters, the Lipschitz constant of the loss and the distance from the weights to the initial weights. They are independent of the number of pixels in the input, and the height and width of hidden feature maps. We present experiments using CIFAR-10 with varying hyperparameters of a deep convolutional network, comparing our bounds with practical generalization gaps. △ Less

Submitted 8 April, 2020; v1 submitted 29 May, 2019; originally announced May 2019.

Comments: Published as a conference paper at ICLR 2020

arXiv:1901.02104 [pdf, other]

On the effect of the activation function on the distribution of hidden nodes in a deep network

Authors: Philip M. Long, Hanie Sedghi

Abstract: We analyze the joint probability distribution on the lengths of the vectors of hidden variables in different layers of a fully connected deep network, when the weights and biases are chosen randomly according to Gaussian distributions, and the input is in $\{ -1, 1\}^N$. We show that, if the activation function $φ$ satisfies a minimal set of assumptions, satisfied by all activation functions that… ▽ More We analyze the joint probability distribution on the lengths of the vectors of hidden variables in different layers of a fully connected deep network, when the weights and biases are chosen randomly according to Gaussian distributions, and the input is in $\{ -1, 1\}^N$. We show that, if the activation function $φ$ satisfies a minimal set of assumptions, satisfied by all activation functions that we know that are used in practice, then, as the width of the network gets large, the `length process' converges in probability to a length map that is determined as a simple function of the variances of the random weights and biases, and the activation function $φ$. We also show that this convergence may fail for $φ$ that violate our assumptions. △ Less

Submitted 7 January, 2019; originally announced January 2019.

arXiv:1811.03744 [pdf, ps, other]

Density estimation for shift-invariant multidimensional distributions

Authors: Anindya De, Philip M. Long, Rocco A. Servedio

Abstract: We study density estimation for classes of shift-invariant distributions over $\mathbb{R}^d$. A multidimensional distribution is "shift-invariant" if, roughly speaking, it is close in total variation distance to a small shift of it in any direction. Shift-invariance relaxes smoothness assumptions commonly used in non-parametric density estimation to allow jump discontinuities. The different classe… ▽ More We study density estimation for classes of shift-invariant distributions over $\mathbb{R}^d$. A multidimensional distribution is "shift-invariant" if, roughly speaking, it is close in total variation distance to a small shift of it in any direction. Shift-invariance relaxes smoothness assumptions commonly used in non-parametric density estimation to allow jump discontinuities. The different classes of distributions that we consider correspond to different rates of tail decay. For each such class we give an efficient algorithm that learns any distribution in the class from independent samples with respect to total variation distance. As a special case of our general result, we show that $d$-dimensional shift-invariant distributions which satisfy an exponential tail bound can be learned to total variation distance error $ε$ using $\tilde{O}_d(1/ ε^{d+2})$ examples and $\tilde{O}_d(1/ ε^{2d+2})$ time. This implies that, for constant $d$, multivariate log-concave distributions can be learned in $\tilde{O}_d(1/ε^{2d+2})$ time using $\tilde{O}_d(1/ε^{d+2})$ samples, answering a question of [Diakonikolas, Kane and Stewart, 2016] All of our results extend to a model of noise-tolerant density estimation using Huber's contamination model, in which the target distribution to be learned is a $(1-ε,ε)$ mixture of some unknown distribution in the class with some other arbitrary and unknown distribution, and the learning algorithm must output a hypothesis distribution with total variation distance error $O(ε)$ from the target distribution. We show that our general results are close to best possible by proving a simple $Ω\left(1/ε^d\right)$ information-theoretic lower bound on sample complexity even for learning bounded distributions that are shift-invariant. △ Less

Submitted 8 November, 2018; originally announced November 2018.

Comments: Appears in the Proceedings of ITCS 2019

arXiv:1810.10462 [pdf, other]

doi 10.1109/ICRA.2019.8794250

Contact-Implicit Trajectory Optimization Based on a Variable Smooth Contact Model and Successive Convexification

Authors: Aykut Ozgun Onol, Philip Long, Taskin Padir

Abstract: In this paper, we propose a contact-implicit trajectory optimization (CITO) method based on a variable smooth contact model (VSCM) and successive convexification (SCvx). The VSCM facilitates the convergence of gradient-based optimization without compromising physical fidelity. On the other hand, the proposed SCvx-based approach combines the advantages of direct and shooting methods for CITO. For e… ▽ More In this paper, we propose a contact-implicit trajectory optimization (CITO) method based on a variable smooth contact model (VSCM) and successive convexification (SCvx). The VSCM facilitates the convergence of gradient-based optimization without compromising physical fidelity. On the other hand, the proposed SCvx-based approach combines the advantages of direct and shooting methods for CITO. For evaluations, we consider non-prehensile manipulation tasks. The proposed method is compared to a version based on iterative linear quadratic regulator (iLQR) on a planar example. The results demonstrate that both methods can find physically-consistent motions that complete the tasks without a meaningful initial guess owing to the VSCM. The proposed SCvx-based method outperforms the iLQR-based method in terms of convergence, computation time, and the quality of motions found. Finally, the proposed SCvx-based method is tested on a standard robot platform and shown to perform efficiently for a real-world application. △ Less

Submitted 4 March, 2019; v1 submitted 24 October, 2018; originally announced October 2018.

Comments: Accepted for publication in ICRA 2019

arXiv:1810.00352 [pdf, other]

Getting Robots Unfrozen and Unlost in Dense Pedestrian Crowds

Authors: Tingxiang Fan, Xin**g Cheng, Jia Pan, Pinxin Long, Wenxi Liu, Ruigang Yang, Dinesh Manocha

Abstract: We aim to enable a mobile robot to navigate through environments with dense crowds, e.g., shop** malls, canteens, train stations, or airport terminals. In these challenging environments, existing approaches suffer from two common problems: the robot may get frozen and cannot make any progress toward its goal, or it may get lost due to severe occlusions inside a crowd. Here we propose a navigatio… ▽ More We aim to enable a mobile robot to navigate through environments with dense crowds, e.g., shop** malls, canteens, train stations, or airport terminals. In these challenging environments, existing approaches suffer from two common problems: the robot may get frozen and cannot make any progress toward its goal, or it may get lost due to severe occlusions inside a crowd. Here we propose a navigation framework that handles the robot freezing and the navigation lost problems simultaneously. First, we enhance the robot's mobility and unfreeze the robot in the crowd using a reinforcement learning based local navigation policy developed in our previous work~\cite{long2017towards}, which naturally takes into account the coordination between the robot and the human. Secondly, the robot takes advantage of its excellent local mobility to recover from its localization failure. In particular, it dynamically chooses to approach a set of recovery positions with rich features. To the best of our knowledge, our method is the first approach that simultaneously solves the freezing problem and the navigation lost problem in dense crowds. We evaluate our method in both simulated and real-world environments and demonstrate that it outperforms the state-of-the-art approaches. Videos are available at https://sites.google.com/view/rlslam. △ Less

Submitted 30 September, 2018; originally announced October 2018.

arXiv:1808.03841 [pdf, other]

Fully Distributed Multi-Robot Collision Avoidance via Deep Reinforcement Learning for Safe and Efficient Navigation in Complex Scenarios

Authors: Tingxiang Fan, Pinxin Long, Wenxi Liu, Jia Pan

Abstract: In this paper, we present a decentralized sensor-level collision avoidance policy for multi-robot systems, which shows promising results in practical applications. In particular, our policy directly maps raw sensor measurements to an agent's steering commands in terms of the movement velocity. As a first step toward reducing the performance gap between decentralized and centralized methods, we pre… ▽ More In this paper, we present a decentralized sensor-level collision avoidance policy for multi-robot systems, which shows promising results in practical applications. In particular, our policy directly maps raw sensor measurements to an agent's steering commands in terms of the movement velocity. As a first step toward reducing the performance gap between decentralized and centralized methods, we present a multi-scenario multi-stage training framework to learn an optimal policy. The policy is trained over a large number of robots in rich, complex environments simultaneously using a policy gradient based reinforcement learning algorithm. The learning algorithm is also integrated into a hybrid control framework to further improve the policy's robustness and effectiveness. We validate the learned sensor-level collision avoidance policy in a variety of simulated and real-world scenarios with thorough performance evaluations for large-scale multi-robot systems. The generalization of the learned policy is verified in a set of unseen scenarios including the navigation of a group of heterogeneous robots and a large-scale scenario with 100 robots. Although the policy is trained using simulation data only, we have successfully deployed it on physical robots with shapes and dynamics characteristics that are different from the simulated agents, in order to demonstrate the controller's robustness against the sim-to-real modeling error. Finally, we show that the collision-avoidance policy learned from multi-robot navigation tasks provides an excellent solution to the safe and effective autonomous navigation for a single robot working in a dense real human crowd. Our learned policy enables a robot to make effective progress in a crowd without getting stuck. Videos are available at https://sites.google.com/view/hybridmrca △ Less

Submitted 11 August, 2018; originally announced August 2018.

arXiv:1807.07013 [pdf, ps, other]

Learning Sums of Independent Random Variables with Sparse Collective Support

Authors: Anindya De, Philip M. Long, Rocco A. Servedio

Abstract: We study the learnability of sums of independent integer random variables given a bound on the size of the union of their supports. For $\mathcal{A} \subset \mathbf{Z}_{+}$, a sum of independent random variables with collective support $\mathcal{A}$} (called an $\mathcal{A}$-sum in this paper) is a distribution $\mathbf{S} = \mathbf{X}_1 + \cdots + \mathbf{X}_N$ where the $\mathbf{X}_i$'s are mutu… ▽ More We study the learnability of sums of independent integer random variables given a bound on the size of the union of their supports. For $\mathcal{A} \subset \mathbf{Z}_{+}$, a sum of independent random variables with collective support $\mathcal{A}$} (called an $\mathcal{A}$-sum in this paper) is a distribution $\mathbf{S} = \mathbf{X}_1 + \cdots + \mathbf{X}_N$ where the $\mathbf{X}_i$'s are mutually independent (but not necessarily identically distributed) integer random variables with $\cup_i \mathsf{supp}(\mathbf{X}_i) \subseteq \mathcal{A}.$ We give two main algorithmic results for learning such distributions: 1. For the case $| \mathcal{A} | = 3$, we give an algorithm for learning $\mathcal{A}$-sums to accuracy $ε$ that uses $\mathsf{poly}(1/ε)$ samples and runs in time $\mathsf{poly}(1/ε)$, independent of $N$ and of the elements of $\mathcal{A}$. 2. For an arbitrary constant $k \geq 4$, if $\mathcal{A} = \{ a_1,...,a_k\}$ with $0 \leq a_1 < ... < a_k$, we give an algorithm that uses $\mathsf{poly}(1/ε) \cdot \log \log a_k$ samples (independent of $N$) and runs in time $\mathsf{poly}(1/ε, \log a_k).$ We prove an essentially matching lower bound: if $|\mathcal{A}| = 4$, then any algorithm must use $Ω(\log \log a_4) $ samples even for learning to constant accuracy. We also give similar-in-spirit (but quantitatively very different) algorithmic results, and essentially matching lower bounds, for the case in which $\mathcal{A}$ is not known to the learner. △ Less

Submitted 12 November, 2020; v1 submitted 18 July, 2018; originally announced July 2018.

Comments: Conference version in FOCS'18; Journal version to appear in JMLR

arXiv:1807.04814 [pdf]

Integrating Risk in Humanoid Robot Control for Applications in the Nuclear Industry

Authors: Xianchao Long, Philip Long, Aykut Onol, Taskin Padir

Abstract: This paper discuss the integration of risk into a robot control framework for decommissioning applications in the nuclear industry. Our overall objective is to allow the robot to evaluate a risk associated with several methods of completing the same task by combining a set of action sequences. If the environment is known and in the absence of sensing errors each set of actions would successfully c… ▽ More This paper discuss the integration of risk into a robot control framework for decommissioning applications in the nuclear industry. Our overall objective is to allow the robot to evaluate a risk associated with several methods of completing the same task by combining a set of action sequences. If the environment is known and in the absence of sensing errors each set of actions would successfully complete the task. In this paper, instead of attempting to model the errors associated with each sensing system in order to compute an exact solution, a set of solutions are obtained along with a prescribed risk index. The risk associated with each set of actions can then be compared to possible payoffs or rewards, for instance task completion time or power consumption. This information is then sent to a high level decision planner, for instance a human teleoperator, who can then make a more informed decision regarding the robots actions. In order to illustrate the concept, we introduce three specific risk measures, namely, the collision risk and the risk of toppling and failure risk associated with gras** an object. We demonstrate the results from this foundational study of risk-aware compositional robot autonomy in simulation using NASA's Valkyrie humanoid robot, and the gras** simulator HAPTIX. △ Less

Submitted 12 July, 2018; originally announced July 2018.

Comments: 9 pages, 6 figues

Journal ref: WM2018 Conference, March 18-22, 2018, Phoenix, Arizona, USA

arXiv:1807.04198 [pdf]

Using Contact to Increase Robot Performance for Glovebox D&D Tasks

Authors: Aykut Onol, Philip Long, Taskin Padir

Abstract: Glovebox decommissioning tasks usually require manipulating relatively heavy objects in a highly constrained environment. Thus, contact with the surroundings becomes inevitable. In order to allow the robot to interact with the environment in a natural way, we present a contact-implicit motion planning framework. This framework enables the system, without the specification in advance of a contact p… ▽ More Glovebox decommissioning tasks usually require manipulating relatively heavy objects in a highly constrained environment. Thus, contact with the surroundings becomes inevitable. In order to allow the robot to interact with the environment in a natural way, we present a contact-implicit motion planning framework. This framework enables the system, without the specification in advance of a contact plan, to make and break contacts to maintain stability while performing a manipulation task. In this method, we use linear complementarity constraints to model rigid body contacts and find a locally optimal solution for joint displacements and magnitudes of support forces. Then, joint torques are calculated such that the support forces have the highest priority. We evaluate our framework in a 2.5D, quasi-static simulation in which a humanoid robot with planar arms manipulates a heavy object. Our results suggest that the proposed method provides the robot with the ability to balance itself by generating support forces on the environment while simultaneously performing the manipulation task. △ Less

Submitted 11 July, 2018; originally announced July 2018.

Comments: 11 pages, 5 figures; Accepted for publication in Waste Management Symposia 2018

arXiv:1806.01425 [pdf, other]

doi 10.1109/IROS.2018.8594284

A Comparative Analysis of Contact Models in Trajectory Optimization for Manipulation

Authors: Aykut Ozgun Onol, Philip Long, Taskin Padir

Abstract: In this paper, we analyze the effects of contact models on contact-implicit trajectory optimization for manipulation. We consider three different approaches: (1) a contact model that is based on complementarity constraints, (2) a smooth contact model, and our proposed method (3) a variable smooth contact model. We compare these models in simulation in terms of physical accuracy, quality of motions… ▽ More In this paper, we analyze the effects of contact models on contact-implicit trajectory optimization for manipulation. We consider three different approaches: (1) a contact model that is based on complementarity constraints, (2) a smooth contact model, and our proposed method (3) a variable smooth contact model. We compare these models in simulation in terms of physical accuracy, quality of motions, and computation time. In each case, the optimization process is initialized by setting all torque variables to zero, namely, without a meaningful initial guess. For simulations, we consider a pushing task with varying complexity for a 7 degrees-of-freedom robot arm. Our results demonstrate that the optimization based on the proposed variable smooth contact model provides a good trade-off between the physical fidelity and quality of motions at the cost of increased computation time. △ Less

Submitted 30 July, 2018; v1 submitted 4 June, 2018; originally announced June 2018.

Comments: 6 pages, 7 figures, 4 tables, IROS 2018 camera-ready version

arXiv:1805.10408 [pdf, other]

The Singular Values of Convolutional Layers

Authors: Hanie Sedghi, Vineet Gupta, Philip M. Long

Abstract: We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation. This characterization also leads to an algorithm for projecting a convolutional layer onto an operator-norm ball. We show that this is an effective regularizer; for example, it improves the test error of a deep residual network usin… ▽ More We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation. This characterization also leads to an algorithm for projecting a convolutional layer onto an operator-norm ball. We show that this is an effective regularizer; for example, it improves the test error of a deep residual network using batch normalization on CIFAR-10 from 6.2\% to 5.3\%. △ Less

Submitted 5 March, 2019; v1 submitted 25 May, 2018; originally announced May 2018.

Comments: Published as a conference paper at ICLR 2019

arXiv:1804.05012 [pdf, ps, other]

Representing smooth functions as compositions of near-identity functions with implications for deep network optimization

Authors: Peter L. Bartlett, Steven N. Evans, Philip M. Long

Abstract: We show that any smooth bi-Lipschitz $h$ can be represented exactly as a composition $h_m \circ ... \circ h_1$ of functions $h_1,...,h_m$ that are close to the identity in the sense that each $\left(h_i-\mathrm{Id}\right)$ is Lipschitz, and the Lipschitz constant decreases inversely with the number $m$ of functions composed. This implies that $h$ can be represented to any accuracy by a deep residu… ▽ More We show that any smooth bi-Lipschitz $h$ can be represented exactly as a composition $h_m \circ ... \circ h_1$ of functions $h_1,...,h_m$ that are close to the identity in the sense that each $\left(h_i-\mathrm{Id}\right)$ is Lipschitz, and the Lipschitz constant decreases inversely with the number $m$ of functions composed. This implies that $h$ can be represented to any accuracy by a deep residual network whose nonlinear layers compute functions with a small Lipschitz constant. Next, we consider nonlinear regression with a composition of near-identity nonlinear maps. We show that, regarding Fréchet derivatives with respect to the $h_1,...,h_m$, any critical point of a quadratic criterion in this near-identity region must be a global minimizer. In contrast, if we consider derivatives with respect to parameters of a fixed-size residual network with sigmoid activation functions, we show that there are near-identity critical points that are suboptimal, even in the realizable case. Informally, this means that functional gradient methods for residual networks cannot get stuck at suboptimal critical points corresponding to near-identity layers, whereas parametric gradient methods for sigmoidal residual networks suffer from suboptimal critical points in the near-identity region. △ Less

Submitted 16 April, 2018; v1 submitted 13 April, 2018; originally announced April 2018.

arXiv:1802.06093 [pdf, ps, other]

Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks

Authors: Peter L. Bartlett, David P. Helmbold, Philip M. Long

Abstract: We analyze algorithms for approximating a function $f(x) = Φx$ map** $\Re^d$ to $\Re^d$ using deep linear neural networks, i.e. that learn a function $h$ parameterized by matrices $Θ_1,...,Θ_L$ and defined by $h(x) = Θ_L Θ_{L-1} ... Θ_1 x$. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic.… ▽ More We analyze algorithms for approximating a function $f(x) = Φx$ map** $\Re^d$ to $\Re^d$ using deep linear neural networks, i.e. that learn a function $h$ parameterized by matrices $Θ_1,...,Θ_L$ and defined by $h(x) = Θ_L Θ_{L-1} ... Θ_1 x$. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix $Φ$, in the case where the initial hypothesis $Θ_1 = ... = Θ_L = I$ has excess loss bounded by a small enough constant. On the other hand, we show that gradient descent fails to converge for $Φ$ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If $Φ$ is symmetric positive definite, we show that an algorithm that initializes $Θ_i = I$ learns an $ε$-approximation of $f$ using a number of updates polynomial in $L$, the condition number of $Φ$, and $\log(d/ε)$. In contrast, we show that if the least squares matrix $Φ$ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that $Φ$ satisfies $u^{\top} Φu > 0$ for all $u$, but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant $u^{\top} Θ_L Θ_{L-1} ... Θ_1 u > 0$ for all $u$, and another that "balances" $Θ_1, ..., Θ_L$ so that they have the same singular values. △ Less

Submitted 18 June, 2018; v1 submitted 16 February, 2018; originally announced February 2018.

arXiv:1709.10082 [pdf, other]

Towards Optimally Decentralized Multi-Robot Collision Avoidance via Deep Reinforcement Learning

Authors: Pinxin Long, Tingxiang Fan, Xinyi Liao, Wenxi Liu, Hao Zhang, Jia Pan

Abstract: Develo** a safe and efficient collision avoidance policy for multiple robots is challenging in the decentralized scenarios where each robot generate its paths without observing other robots' states and intents. While other distributed multi-robot collision avoidance systems exist, they often require extracting agent-level features to plan a local collision-free action, which can be computational… ▽ More Develo** a safe and efficient collision avoidance policy for multiple robots is challenging in the decentralized scenarios where each robot generate its paths without observing other robots' states and intents. While other distributed multi-robot collision avoidance systems exist, they often require extracting agent-level features to plan a local collision-free action, which can be computationally prohibitive and not robust. More importantly, in practice the performance of these methods are much lower than their centralized counterparts. We present a decentralized sensor-level collision avoidance policy for multi-robot systems, which directly maps raw sensor measurements to an agent's steering commands in terms of movement velocity. As a first step toward reducing the performance gap between decentralized and centralized methods, we present a multi-scenario multi-stage training framework to find an optimal policy which is trained over a large number of robots on rich, complex environments simultaneously using a policy gradient based reinforcement learning algorithm. We validate the learned sensor-level collision avoidance policy in a variety of simulated scenarios with thorough performance evaluations and show that the final learned policy is able to find time efficient, collision-free paths for a large-scale robot system. We also demonstrate that the learned policy can be well generalized to new scenarios that do not appear in the entire training period, including navigating a heterogeneous group of robots and a large-scale scenario with 100 robots. Videos are available at https://sites.google.com/view/drlmaca △ Less

Submitted 20 May, 2018; v1 submitted 28 September, 2017; originally announced September 2017.

arXiv:1709.09574 [pdf, ps, other]

Fillable arrays with constant time operations and a single bit of redundancy

Authors: Jacob Teo Por Loong, Jelani Nelson, Huacheng Yu

Abstract: In the fillable array problem one must maintain an array A[1..n] of $w$-bit entries subject to random access reads and writes, and also a $\texttt{fill}(Δ)$ operation which sets every entry of to some $Δ\in\{0,\ldots,2^w-1\}$. We show that with just one bit of redundancy, i.e. a data structure using $nw+1$ bits of memory, $\texttt{read}/\texttt{fill}$ can be implemented in worst case constant time… ▽ More In the fillable array problem one must maintain an array A[1..n] of $w$-bit entries subject to random access reads and writes, and also a $\texttt{fill}(Δ)$ operation which sets every entry of to some $Δ\in\{0,\ldots,2^w-1\}$. We show that with just one bit of redundancy, i.e. a data structure using $nw+1$ bits of memory, $\texttt{read}/\texttt{fill}$ can be implemented in worst case constant time, and $\texttt{write}$ can be implemented in either amortized constant time (deterministically) or worst case expected constant (randomized). In the latter case, we need to store an additional $O(\log n)$ random bits to specify a permutation drawn from an $1/n^2$-almost pairwise independent family. △ Less

Submitted 27 September, 2017; originally announced September 2017.

arXiv:1609.06838 [pdf, other]

Deep-Learned Collision Avoidance Policy for Distributed Multi-Agent Navigation

Authors: Pinxin Long, Wenxi Liu, Jia Pan

Abstract: High-speed, low-latency obstacle avoidance that is insensitive to sensor noise is essential for enabling multiple decentralized robots to function reliably in cluttered and dynamic environments. While other distributed multi-agent collision avoidance systems exist, these systems require online geometric optimization where tedious parameter tuning and perfect sensing are necessary. We present a n… ▽ More High-speed, low-latency obstacle avoidance that is insensitive to sensor noise is essential for enabling multiple decentralized robots to function reliably in cluttered and dynamic environments. While other distributed multi-agent collision avoidance systems exist, these systems require online geometric optimization where tedious parameter tuning and perfect sensing are necessary. We present a novel end-to-end framework to generate reactive collision avoidance policy for efficient distributed multi-agent navigation. Our method formulates an agent's navigation strategy as a deep neural network map** from the observed noisy sensor measurements to the agent's steering commands in terms of movement velocity. We train the network on a large number of frames of collision avoidance data collected by repeatedly running a multi-agent simulator with different parameter settings. We validate the learned deep neural network policy in a set of simulated and real scenarios with noisy measurements and demonstrate that our method is able to generate a robust navigation strategy that is insensitive to imperfect sensing and works reliably in all situations. We also show that our method can be well generalized to scenarios that do not appear in our training data, including scenes with static obstacles and agents with different sizes. Videos are available at https://sites.google.com/view/deepmaca. △ Less

Submitted 6 July, 2017; v1 submitted 22 September, 2016; originally announced September 2016.

Journal ref: IEEE Robotics and Automation Letters 2(2): 656-663 (2017)

arXiv:1603.06317 [pdf, other]

DoraPicker: An Autonomous Picking System for General Objects

Authors: Hao Zhang, Pinxin Long, Dandan Zhou, Zhongfeng Qian, Zheng Wang, Weiwei Wan, Dinesh Manocha, Chonhyon Park, Tommy Hu, Chao Cao, Yibo Chen, Marco Chow, Jia Pan

Abstract: Robots that autonomously manipulate objects within warehouses have the potential to shorten the package delivery time and improve the efficiency of the e-commerce industry. In this paper, we present a robotic system that is capable of both picking and placing general objects in warehouse scenarios. Given a target object, the robot autonomously detects it from a shelf or a table and estimates its f… ▽ More Robots that autonomously manipulate objects within warehouses have the potential to shorten the package delivery time and improve the efficiency of the e-commerce industry. In this paper, we present a robotic system that is capable of both picking and placing general objects in warehouse scenarios. Given a target object, the robot autonomously detects it from a shelf or a table and estimates its full 6D pose. With this pose information, the robot picks the object using its gripper, and then places it into a container or at a specified location. We describe our pick-and-place system in detail while highlighting our design principles for the warehouse settings, including the perception method that leverages knowledge about its workspace, three grippers designed to handle a large variety of different objects in terms of shape, weight and material, and grasp planning in cluttered scenarios. We also present extensive experiments to evaluate the performance of our picking system and demonstrate that the robot is competent to accomplish various tasks in warehouse settings, such as picking a target item from a tight space, gras** different objects from the shelf, and performing pick-and-place tasks on the table. △ Less

Submitted 20 March, 2016; originally announced March 2016.

Comments: 10 pages, 10 figures

arXiv:1602.04484 [pdf, ps, other]

Surprising properties of dropout in deep networks

Authors: David P. Helmbold, Philip M. Long

Abstract: We analyze dropout in deep networks with rectified linear units and the quadratic loss. Our results expose surprising differences between the behavior of dropout and more traditional regularizers like weight decay. For example, on some simple data sets dropout training produces negative weights even though the output is the sum of the inputs. This provides a counterpoint to the suggestion that dro… ▽ More We analyze dropout in deep networks with rectified linear units and the quadratic loss. Our results expose surprising differences between the behavior of dropout and more traditional regularizers like weight decay. For example, on some simple data sets dropout training produces negative weights even though the output is the sum of the inputs. This provides a counterpoint to the suggestion that dropout discourages co-adaptation of weights. We also show that the dropout penalty can grow exponentially in the depth of the network while the weight-decay penalty remains essentially linear, and that dropout is insensitive to various re-scalings of the input features, outputs, and network weights. This last insensitivity implies that there are no isolated local minima of the dropout training criterion. Our work uncovers new properties of dropout, extends our understanding of why dropout succeeds, and lays the foundation for further progress. △ Less

Submitted 19 April, 2017; v1 submitted 14 February, 2016; originally announced February 2016.

arXiv:1412.4736 [pdf, other]

On the Inductive Bias of Dropout

Authors: David P. Helmbold, Philip M. Long

Abstract: Dropout is a simple but effective technique for learning in neural networks and other settings. A sound theoretical understanding of dropout is needed to determine when dropout should be applied and how to use it most effectively. In this paper we continue the exploration of dropout as a regularizer pioneered by Wager, et.al. We focus on linear classification where a convex proxy to the misclassif… ▽ More Dropout is a simple but effective technique for learning in neural networks and other settings. A sound theoretical understanding of dropout is needed to determine when dropout should be applied and how to use it most effectively. In this paper we continue the exploration of dropout as a regularizer pioneered by Wager, et.al. We focus on linear classification where a convex proxy to the misclassification loss (i.e. the logistic loss used in logistic regression) is minimized. We show: (a) when the dropout-regularized criterion has a unique minimizer, (b) when the dropout-regularization penalty goes to infinity with the weights, and when it remains bounded, (c) that the dropout regularization can be non-monotonic as individual weights increase from 0, and (d) that the dropout regularization penalty may not be convex. This last point is particularly surprising because the combination of dropout regularization with any convex loss proxy is always a convex function. In order to contrast dropout regularization with $L_2$ regularization, we formalize the notion of when different sources are more compatible with different regularizers. We then exhibit distributions that are provably more compatible with dropout regularization than $L_2$ regularization, and vice versa. These sources provide additional insight into how the inductive biases of dropout and $L_2$ regularization differ. We provide some similar results for $L_1$ regularization. △ Less

Submitted 17 February, 2015; v1 submitted 15 December, 2014; originally announced December 2014.

Journal ref: Journal of Machine Learning Research, 16, 3403-3454 (2015). (See http://jmlr.org/papers/volume16/helmbold15a/helmbold15a.pdf.)

arXiv:1307.8371 [pdf, ps, other]

The Power of Localization for Efficiently Learning Linear Separators with Noise

Authors: Pranjal Awasthi, Maria Florina Balcan, Philip M. Long

Abstract: We introduce a new approach for designing computationally efficient learning algorithms that are tolerant to noise, and demonstrate its effectiveness by designing algorithms with improved noise tolerance guarantees for learning linear separators. We consider both the malicious noise model and the adversarial label noise model. For malicious noise, where the adversary can corrupt both the label a… ▽ More We introduce a new approach for designing computationally efficient learning algorithms that are tolerant to noise, and demonstrate its effectiveness by designing algorithms with improved noise tolerance guarantees for learning linear separators. We consider both the malicious noise model and the adversarial label noise model. For malicious noise, where the adversary can corrupt both the label and the features, we provide a polynomial-time algorithm for learning linear separators in $\Re^d$ under isotropic log-concave distributions that can tolerate a nearly information-theoretically optimal noise rate of $η= Ω(ε)$. For the adversarial label noise model, where the distribution over the feature vectors is unchanged, and the overall probability of a noisy label is constrained to be at most $η$, we also give a polynomial-time algorithm for learning linear separators in $\Re^d$ under isotropic log-concave distributions that can handle a noise rate of $η= Ω\left(ε\right)$. We show that, in the active learning model, our algorithms achieve a label complexity whose dependence on the error parameter $ε$ is polylogarithmic. This provides the first polynomial-time active learning algorithm for learning linear separators in the presence of malicious noise or adversarial label noise. △ Less

Submitted 3 June, 2018; v1 submitted 31 July, 2013; originally announced July 2013.

Comments: Contains improved label complexity analysis communicated to us by Steve Hanneke

ACM Class: F.2

Showing 1–50 of 52 results for author: Long, P