-
Textbooks Are All You Need
Authors:
Suriya Gunasekar,
Yi Zhang,
Jyoti Aneja,
Caio César Teodoro Mendes,
Allie Del Giorno,
Sivakanth Gopi,
Mojan Javaheripi,
Piero Kauffmann,
Gustavo de Rosa,
Olli Saarikivi,
Adil Salim,
Shital Shah,
Harkirat Singh Behl,
Xin Wang,
Sébastien Bubeck,
Ronen Eldan,
Adam Tauman Kalai,
Yin Tat Lee,
Yuanzhi Li
Abstract:
We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.
Submitted 2 October, 2023; v1 submitted 20 June, 2023;
originally announced June 2023.
-
Scaling the Convex Barrier with Sparse Dual Algorithms
Authors:
Alessandro De Palma,
Harkirat Singh Behl,
Rudy Bunel,
Philip H. S. Torr,
M. Pawan Kumar
Abstract:
Tight and efficient neural network bounding is crucial to the scaling of neural network verification systems. Many efficient bounding algorithms have been presented recently, but they are often too loose to verify more challenging properties. This is due to the weakness of the employed relaxation, which is usually a linear program of size linear in the number of neurons. While a tighter linear relaxation for piecewise-linear activations exists, it comes at the cost of exponentially many constraints and currently lacks an efficient customized solver. We alleviate this deficiency by presenting two novel dual algorithms: one operates a subgradient method on a small active set of dual variables, the other exploits the sparsity of Frank-Wolfe type optimizers to incur only a linear memory cost. Both methods recover the strengths of the new relaxation: tightness and a linear separation oracle. At the same time, they share the benefits of previous dual approaches for weaker relaxations: massive parallelism, GPU implementation, low cost per iteration and valid bounds at any time. As a consequence, we can obtain better bounds than off-the-shelf solvers in only a fraction of their running time, attaining significant formal verification speed-ups.
Submitted 26 February, 2024; v1 submitted 14 January, 2021;
originally announced January 2021.
-
AutoSimulate: (Quickly) Learning Synthetic Data Generation
Authors:
Harkirat Singh Behl,
Atılım Güneş Baydin,
Ran Gal,
Philip H. S. Torr,
Vibhav Vineet
Abstract:
Simulation is increasingly being used for generating large labelled datasets in many machine learning problems. Recent methods have focused on adjusting simulator parameters with the goal of maximising accuracy on a validation task, usually relying on REINFORCE-like gradient estimators. However, these approaches are very expensive as they treat the entire data generation, model training, and validation pipeline as a black-box and require multiple costly objective evaluations at each iteration. We propose an efficient alternative for optimal synthetic data generation, based on a novel differentiable approximation of the objective. This allows us to optimize the simulator, which may be non-differentiable, requiring only one objective evaluation at each iteration with little overhead. We demonstrate on a state-of-the-art photorealistic renderer that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
Submitted 16 August, 2020;
originally announced August 2020.
-
STEER: Simple Temporal Regularization For Neural ODEs
Authors:
Arnab Ghosh,
Harkirat Singh Behl,
Emilien Dupont,
Philip H. S. Torr,
Vinay Namboodiri
Abstract:
Training Neural Ordinary Differential Equations (ODEs) is often computationally expensive. Indeed, computing the forward pass of such models involves solving an ODE which can become arbitrarily complex during training. Recent works have shown that regularizing the dynamics of the ODE can partially alleviate this. In this paper we propose a new regularization technique: randomly sampling the end time of the ODE during training. The proposed regularization is simple to implement, has negligible overhead and is effective across a wide variety of tasks. Further, the technique is orthogonal to several other methods proposed to regularize the dynamics of ODEs and as such can be used in conjunction with them. We show through experiments on normalizing flows, time series models and image recognition that the proposed regularization can significantly decrease training time and even improve performance over baseline models.
Submitted 2 November, 2020; v1 submitted 18 June, 2020;
originally announced June 2020.
-
Progressive Skeletonization: Trimming more fat from a network at initialization
Authors:
Pau de Jorge,
Amartya Sanyal,
Harkirat S. Behl,
Philip H. S. Torr,
Gregory Rogez,
Puneet K. Dokania
Abstract:
Recent studies have shown that skeletonization (pruning parameters) of networks \textit{at initialization} provides all the practical benefits of sparsity both at inference and training time, while only marginally degrading their performance. However, we observe that beyond a certain level of sparsity (approx $95\%$), these approaches fail to preserve the network performance, and to our surprise, in many cases perform even worse than trivial random pruning. To this end, we propose an objective to find a skeletonized network with maximum {\em foresight connection sensitivity} (FORCE) whereby the trainability, in terms of connection sensitivity, of a pruned network is taken into consideration. We then propose two approximate procedures to maximize our objective (1) Iterative SNIP: allows parameters that were unimportant at earlier stages of skeletonization to become important at later stages; and (2) FORCE: iterative process that allows exploration by allowing already pruned parameters to resurrect at later stages of skeletonization. Empirical analyses on a large suite of experiments show that our approach, while providing at least as good a performance as other recent approaches on moderate pruning levels, provides remarkably improved performance on higher pruning levels (could remove up to $99.5\%$ parameters while keeping the networks trainable). Code can be found in https://github.com/naver/force.
Submitted 19 March, 2021; v1 submitted 16 June, 2020;
originally announced June 2020.
-
Alpha MAML: Adaptive Model-Agnostic Meta-Learning
Authors:
Harkirat Singh Behl,
Atılım Güneş Baydin,
Philip H. S. Torr
Abstract:
Model-agnostic meta-learning (MAML) is a meta-learning technique to train a model on a multitude of learning tasks in a way that primes the model for few-shot learning of new tasks. The MAML algorithm performs well on few-shot learning problems in classification, regression, and fine-tuning of policy gradients in reinforcement learning, but comes with the need for costly hyperparameter tuning for training stability. We address this shortcoming by introducing an extension to MAML, called Alpha MAML, to incorporate an online hyperparameter adaptation scheme that eliminates the need to tune meta-learning and learning rates. Our results with the Omniglot database demonstrate a substantial reduction in the need to tune MAML training hyperparameters and improvement to training stability with less sensitivity to hyperparameter choice.
Submitted 17 May, 2019;
originally announced May 2019.
-
Meta Learning Deep Visual Words for Fast Video Object Segmentation
Authors:
Harkirat Singh Behl,
Mohammad Najafi,
Anurag Arnab,
Philip H. S. Torr
Abstract:
Personal robots and driverless cars need to be able to operate in novel environments and thus quickly and efficiently learn to recognise new object classes. We address this problem by considering the task of video object segmentation. Previous accurate methods for this task finetune a model using the first annotated frame, and/or use additional inputs such as optical flow and complex post-processing. In contrast, we develop a fast, causal algorithm that requires no finetuning, auxiliary inputs or post-processing, and segments a variable number of objects in a single forward-pass. We represent an object with clusters, or "visual words", in the embedding space, which correspond to object parts in the image space. This allows us to robustly match to the reference objects throughout the video, because although the global appearance of an object changes as it undergoes occlusions and deformations, the appearance of more local parts may stay consistent. We learn these visual words in an unsupervised manner, using meta-learning to ensure that our training objective matches our inference procedure. We achieve comparable accuracy to finetuning based methods (whilst being 1 to 2 orders of magnitude faster), and state-of-the-art in terms of speed/accuracy trade-offs on four video segmentation datasets. Code is available at https://github.com/harkiratbehl/MetaVOS.
Submitted 16 August, 2020; v1 submitted 4 December, 2018;
originally announced December 2018.
-
Incremental Tube Construction for Human Action Detection
Authors:
Harkirat Singh Behl,
Michael Sapienza,
Gurkirt Singh,
Suman Saha,
Fabio Cuzzolin,
Philip H. S. Torr
Abstract:
Current state-of-the-art action detection systems are tailored for offline batch-processing applications. However, for online applications like human-robot interaction, current systems fall short, either because they only detect one action per video, or because they assume that the entire video is available ahead of time. In this work, we introduce a real-time and online joint-labelling and association algorithm for action detection that can incrementally construct space-time action tubes on the most challenging action videos in which different action categories occur concurrently. In contrast to previous methods, we solve the detection-window association and action labelling problems jointly in a single pass. We demonstrate superior online association accuracy and speed (2.2ms per frame) as compared to the current state-of-the-art offline systems. We further demonstrate that the entire action detection pipeline can easily be made to work effectively in real-time using our action tube construction algorithm.
Submitted 23 July, 2018; v1 submitted 5 April, 2017;
originally announced April 2017.
-
Ultrafast Dynamics of Vibrational Symmetry Breaking in a Charge-ordered Nickelate
Authors:
Giacomo Coslovich,
Alexander F. Kemper,
Sascha Behl,
Bernhard Huber,
Hans A. Bechtel,
Takao Sasagawa,
Michael C. Martin,
Alessandra Lanzara,
Robert A. Kaindl
Abstract:
The ability to probe symmetry breaking transitions on their natural time scales is one of the key challenges in nonequilibrium physics. Stripe ordering represents an intriguing type of broken symmetry, where complex interactions result in atomic-scale lines of charge and spin density. Although phonon anomalies and periodic distortions attest the importance of electron-phonon coupling in the formation of stripe phases, a direct time-domain view of vibrational symmetry breaking is lacking. We report experiments that track the transient multi-THz response of the model stripe compound La$_{1.75}$Sr$_{0.25}$NiO$_{4}$, yielding novel insight into its electronic and structural dynamics following an ultrafast optical quench. We find that although electronic carriers are immediately delocalized, the crystal symmetry remains initially frozen - as witnessed by time-delayed suppression of zone-folded Ni-O bending modes acting as a fingerprint of lattice symmetry. Longitudinal and transverse vibrations react with different speeds, indicating a strong directionality and an important role of polar interactions. The hidden complexity of electronic and structural coupling during stripe melting and formation, captured here within a single terahertz spectrum, opens new paths to understanding symmetry breaking dynamics in solids.
Submitted 30 November, 2017; v1 submitted 25 March, 2016;
originally announced March 2016.