Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2402.11742 [pdf, other]

Balanced Data, Imbalanced Spectra: Unveiling Class Disparities with Spectral Imbalance

Authors: Chiraag Kaushik, Ran Liu, Chi-Heng Lin, Amrit Khera, Matthew Y Jin, Wenrui Ma, Vidya Muthukumar, Eva L Dyer

Abstract: Classification models are expected to perform equally well for different classes, yet in practice, there are often large gaps in their performance. This issue of class bias is widely studied in cases of datasets with sample imbalance, but is relatively overlooked in balanced datasets. In this work, we introduce the concept of spectral imbalance in features as a potential source for class disparities and study the connections between spectral imbalance and class bias in both theory and practice. To build the connection between spectral imbalance and class gap, we develop a theoretical framework for studying class disparities and derive exact expressions for the per-class error in a high-dimensional mixture model setting. We then study this phenomenon in 11 different state-of-the-art pretrained encoders and show how our proposed framework can be used to compare the quality of encoders, as well as evaluate and combine data augmentation strategies to mitigate the issue. Our work sheds light on the class-dependent effects of learning, and provides new insights into how state-of-the-art pretrained features may have unknown biases that can be diagnosed through their spectra.

Submitted 3 June, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

Comments: 25 pages, 9 figures

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.06585 [pdf, other]

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

Authors: Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri Hron, et al. (16 additional authors not shown)

Abstract: Fine-tuning language models (LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investigate a simple self-training method based on expectation-maximization, which we call ReST$^{EM}$, where we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times. Testing on advanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data. Overall, our findings suggest self-training with feedback can substantially reduce dependence on human-generated data.

Submitted 17 April, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

Comments: Accepted to TMLR. Camera-ready version. First three authors contributed equally

arXiv:2310.16046 [pdf, other]

A Unified, Scalable Framework for Neural Population Decoding

Authors: Mehdi Azabou, Vinam Arora, Venkataramana Ganesh, Ximeng Mao, Santosh Nachimuthu, Michael J. Mendelson, Blake Richards, Matthew G. Perich, Guillaume Lajoie, Eva L. Dyer

Abstract: Our ability to use deep learning approaches to decipher neural activity would likely benefit from greater scale, in terms of both model size and datasets. However, the integration of many neural recordings into one unified model is challenging, as each recording contains the activity of different neurons from different individual animals. In this paper, we introduce a training framework and architecture designed to model the population dynamics of neural activity across diverse, large-scale neural recordings. Our method first tokenizes individual spikes within the dataset to build an efficient representation of neural events that captures the fine temporal structure of neural activity. We then employ cross-attention and a PerceiverIO backbone to further construct a latent tokenization of neural population activities. Utilizing this architecture and training framework, we construct a large-scale multi-session model trained on large datasets from seven nonhuman primates, spanning over 158 different sessions of recording from over 27,373 neural units and over 100 hours of recordings. In a number of different tasks, we demonstrate that our pretrained model can be rapidly adapted to new, unseen sessions with unspecified neuron correspondence, enabling few-shot performance with minimal labels. This work presents a powerful new approach for building deep learning tools to analyze neural data and stakes out a clear path to training at scale.

Submitted 24 October, 2023; originally announced October 2023.

Comments: Accepted at NeurIPS 2023

arXiv:2308.14596 [pdf, other]

LatentDR: Improving Model Generalization Through Sample-Aware Latent Degradation and Restoration

Authors: Ran Liu, Sahil Khose, Jingyun Xiao, Lakshmi Sathidevi, Keerthan Ramnath, Zsolt Kira, Eva L. Dyer

Abstract: Despite significant advances in deep learning, models often struggle to generalize well to new, unseen domains, especially when training data is limited. To address this challenge, we propose a novel approach for distribution-aware latent augmentation that leverages the relationships across samples to guide the augmentation procedure. Our approach first degrades the samples stochastically in the latent space, mapping them to augmented labels, and then restores the samples from their corrupted versions during training. This process confuses the classifier in the degradation step and restores the overall class distribution of the original samples, promoting diverse intra-class/cross-domain variability. We extensively evaluate our approach on a diverse set of datasets and tasks, including domain generalization benchmarks and medical imaging datasets with strong domain shift, where we show our approach achieves significant improvements over existing methods for latent space augmentation. We further show that our method can be flexibly adapted to long-tail recognition tasks, demonstrating its versatility in building more generalizable models. Code is available at https://github.com/nerdslab/LatentDR.

Submitted 28 August, 2023; originally announced August 2023.

arXiv:2308.09198 [pdf, other]

Half-Hop: A graph upsampling approach for slowing down message passing

Authors: Mehdi Azabou, Venkataramana Ganesh, Shantanu Thakoor, Chi-Heng Lin, Lakshmi Sathidevi, Ran Liu, Michal Valko, Petar Veličković, Eva L. Dyer

Abstract: Message passing neural networks have shown a lot of success on graph-structured data. However, there are many instances where message passing can lead to over-smoothing or fail when neighboring nodes belong to different classes. In this work, we introduce a simple yet general framework for improving learning in message passing neural networks. Our approach essentially upsamples edges in the original graph by adding "slow nodes" at each edge that can mediate communication between a source and a target node. Our method only modifies the input graph, making it plug-and-play and easy to use with existing models. To understand the benefits of slowing down message passing, we provide theoretical and empirical analyses. We report results on several supervised and self-supervised benchmarks, and show improvements across the board, notably in heterophilic conditions where adjacent nodes are more likely to have different labels. Finally, we show how our approach can be used to generate augmentations for self-supervised learning, where slow nodes are randomly introduced into different edges in the graph to generate multi-scale views with variable path lengths.

Submitted 17 August, 2023; originally announced August 2023.

Comments: Published as a conference paper at ICML 2023

arXiv:2307.11600 [pdf, other]

doi 10.1364/JOSAB.501086

Optical pumping enhancement of a free-induction-decay magnetometer

Authors: Dominic Hunter, Marcin S. Mrozowski, Allan McWilliam, Stuart J. Ingleby, Terry E. Dyer, Paul F. Griffin, Erling Riis

Abstract: Spin preparation prior to a free-induction-decay (FID) measurement can be adversely affected by transverse bias fields, particularly in the geophysical field range. A strategy that enhances the spin polarization accumulated before readout is demonstrated, by synchronizing optical pumping with a magnetic field pulse that supersedes any transverse fields by over two orders of magnitude. The pulsed magnetic field is generated along the optical pumping axis using a compact electromagnetic coil pair encompassing a micro-electromechanical systems (MEMS) vapor cell. The coils also resistively heat the cesium (Cs) vapor to the optimal atomic density without spurious magnetic field contributions as they are rapidly demagnetized to approximately zero field during spin readout. The demagnetization process is analyzed electronically, and directly with a FID measurement, to confirm that the residual magnetic field is minimal during detection. The sensitivity performance of this technique is compared to existing optical pumping modalities across a wide magnetic field range. A noise floor sensitivity of $238\,\mathrm{fT/\surd{Hz}}$ was achieved in a field of approximately $50\,\mathrm{\mu T}$, in close agreement with the Cramér-Rao lower bound (CRLB) predicted noise density of $258\,\mathrm{fT/\surd{Hz}}$.

Submitted 29 September, 2023; v1 submitted 21 July, 2023; originally announced July 2023.

Comments: 10 pages, 7 figures

Journal ref: Journal of the Optical Society of America B, vol 40, issue 10, pp. 2489-2683 (2023)

arXiv:2305.10403 [pdf, other]

PaLM 2 Technical Report

Authors: Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, et al. (103 additional authors not shown)

Abstract: We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities. When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.

Submitted 13 September, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

arXiv:2304.04142 [pdf]

Slideflow: Deep Learning for Digital Histopathology with Real-Time Whole-Slide Visualization

Authors: James M. Dolezal, Sara Kochanny, Emma Dyer, Andrew Srisuwananukorn, Matteo Sacco, Frederick M. Howard, Anran Li, Prajval Mohan, Alexander T. Pearson

Abstract: Deep learning methods have emerged as powerful tools for analyzing histopathological images, but current methods are often specialized for specific domains and software environments, and few open-source options exist for deploying models in an interactive interface. Experimenting with different deep learning approaches typically requires switching software libraries and reprocessing data, reducing the feasibility and practicality of experimenting with new architectures. We developed a flexible deep learning library for histopathology called Slideflow, a package which supports a broad array of deep learning methods for digital pathology and includes a fast whole-slide interface for deploying trained models. Slideflow includes unique tools for whole-slide image data processing, efficient stain normalization and augmentation, weakly-supervised whole-slide classification, uncertainty quantification, feature generation, feature space analysis, and explainability. Whole-slide image processing is highly optimized, enabling whole-slide tile extraction at 40X magnification in 2.5 seconds per slide. The framework-agnostic data processing pipeline enables rapid experimentation with new methods built with either Tensorflow or PyTorch, and the graphical user interface supports real-time visualization of slides, predictions, heatmaps, and feature space characteristics on a variety of hardware devices, including ARM-based devices such as the Raspberry Pi.

Submitted 8 April, 2023; originally announced April 2023.

arXiv:2303.08811 [pdf, other]

Relax, it doesn't matter how you get there: A new self-supervised approach for multi-timescale behavior analysis

Authors: Mehdi Azabou, Michael Mendelson, Nauman Ahad, Maks Sorokin, Shantanu Thakoor, Carolina Urzay, Eva L. Dyer

Abstract: Natural behavior consists of dynamics that are complex and unpredictable, especially when trying to predict many steps into the future. While some success has been found in building representations of behavior under constrained or simplified task-based conditions, many of these models cannot be applied to free and naturalistic settings where behavior becomes increasingly hard to model. In this work, we develop a multi-task representation learning model for behavior that combines two novel components: (i) An action prediction objective that aims to predict the distribution of actions over future timesteps, and (ii) A multi-scale architecture that builds separate latent spaces to accommodate short- and long-term dynamics. After demonstrating the ability of the method to build representations of both local and global dynamics in realistic robots in varying environments and terrains, we apply our method to the MABe 2022 Multi-agent behavior challenge, where our model ranks 1st overall and on all global tasks, and 1st or 2nd on 7 out of 9 frame-level tasks. In all of these cases, we show that our model can build representations that capture the many different factors that drive behavior and solve a wide range of downstream tasks.

Submitted 15 March, 2023; originally announced March 2023.

Comments: arXiv admin note: text overlap with arXiv:2206.07041

arXiv:2302.11023 [pdf, other]

Learning signatures of decision making from many individuals playing the same game

Authors: Michael J Mendelson, Mehdi Azabou, Suma Jacob, Nicola Grissom, David Darrow, Becket Ebitz, Alexander Herman, Eva L. Dyer

Abstract: Human behavior is incredibly complex and the factors that drive decision making--from instinct, to strategy, to biases between individuals--often vary over multiple timescales. In this paper, we design a predictive framework that learns representations to encode an individual's 'behavioral style', i.e. long-term behavioral trends, while simultaneously predicting future actions and choices. The model explicitly separates representations into three latent spaces: the recent past space, the short-term space, and the long-term space where we hope to capture individual differences. To simultaneously extract both global and local variables from complex human behavior, our method combines a multi-scale temporal convolutional network with latent prediction tasks, where we encourage embeddings across the entire sequence, as well as subsets of the sequence, to be mapped to similar points in the latent space. We develop and apply our method to a large-scale behavioral dataset from 1,000 humans playing a 3-armed bandit task, and analyze what our model's resulting embeddings reveal about the human decision making process. In addition to predicting future choices, we show that our model can learn rich representations of human behavior over multiple timescales and provide signatures of differences in individuals.

Submitted 21 February, 2023; originally announced February 2023.

Comments: 4 pages, 2 figures. To be published in IEEE NER

arXiv:2301.00345 [pdf, other]

MTNeuro: A Benchmark for Evaluating Representations of Brain Structure Across Multiple Levels of Abstraction

Authors: Jorge Quesada, Lakshmi Sathidevi, Ran Liu, Nauman Ahad, Joy M. Jackson, Mehdi Azabou, Jingyun Xiao, Christopher Liding, Matthew Jin, Carolina Urzay, William Gray-Roncal, Erik C. Johnson, Eva L. Dyer

Abstract: There are multiple scales of abstraction from which we can describe the same image, depending on whether we are focusing on fine-grained details or a more global attribute of the image. In brain mapping, learning to automatically parse images to build representations of both small-scale features (e.g., the presence of cells or blood vessels) and global properties of an image (e.g., which brain region the image comes from) is a crucial and open challenge. However, most existing datasets and benchmarks for neuroanatomy consider only a single downstream task at a time. To bridge this gap, we introduce a new dataset, annotations, and multiple downstream tasks that provide diverse ways to read out information about brain structure and architecture from the same image. Our multi-task neuroimaging benchmark (MTNeuro) is built on volumetric, micrometer-resolution X-ray microtomography images spanning a large thalamocortical section of mouse brain, encompassing multiple cortical and subcortical regions. We generated a number of different prediction challenges and evaluated several supervised and self-supervised models for brain-region prediction and pixel-level semantic segmentation of microstructures. Our experiments not only highlight the rich heterogeneity of this dataset, but also provide insights into how self-supervised approaches can be used to learn representations that capture multiple attributes of a single image and perform well on a variety of downstream tasks. Datasets, code, and pre-trained baseline models are provided at: https://mtneuro.github.io/.

Submitted 31 December, 2022; originally announced January 2023.

Comments: 10 pages, 4 figures, Accepted at NeurIPS 2022

arXiv:2210.05021 [pdf, other]

The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective

Authors: Chi-Heng Lin, Chiraag Kaushik, Eva L. Dyer, Vidya Muthukumar

Abstract: Data augmentation (DA) is a powerful workhorse for bolstering performance in modern machine learning. Specific augmentations like translations and scaling in computer vision are traditionally believed to improve generalization by generating new (artificial) data from the same distribution. However, this traditional viewpoint does not explain the success of prevalent augmentations in modern machine learning (e.g. randomized masking, cutout, mixup), that greatly alter the training data distribution. In this work, we develop a new theoretical framework to characterize the impact of a general class of DA on underparameterized and overparameterized linear model generalization. Our framework reveals that DA induces implicit spectral regularization through a combination of two distinct effects: a) manipulating the relative proportion of eigenvalues of the data covariance matrix in a training-data-dependent manner, and b) uniformly boosting the entire spectrum of the data covariance matrix through ridge regression. These effects, when applied to popular augmentations, give rise to a wide variety of phenomena, including discrepancies in generalization between over-parameterized and under-parameterized regimes and differences between regression and classification tasks. Our framework highlights the nuanced and sometimes surprising impacts of DA on generalization, and serves as a testbed for novel augmentation design.

Submitted 27 February, 2024; v1 submitted 10 October, 2022; originally announced October 2022.

Comments: 72 pages, 8 figures

arXiv:2207.04901 [pdf, other]

Exploring Length Generalization in Large Language Models

Authors: Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, Behnam Neyshabur

Abstract: The ability to extrapolate from short problem instances to longer ones is an important form of out-of-distribution generalization in reasoning tasks, and is crucial when learning from datasets where longer problem instances are rare. These include theorem proving, solving quantitative mathematics problems, and reading/summarizing novels. In this paper, we run careful empirical studies exploring the length generalization capabilities of transformer-based language models. We first establish that naively finetuning transformers on length generalization tasks shows significant generalization deficiencies independent of model scale. We then show that combining pretrained large language models' in-context learning abilities with scratchpad prompting (asking the model to output solution steps before producing an answer) results in a dramatic improvement in length generalization. We run careful failure analyses on each of the learning modalities and identify common sources of mistakes that highlight opportunities in equipping language models with the ability to generalize to longer problems.

Submitted 14 November, 2022; v1 submitted 11 July, 2022; originally announced July 2022.

arXiv:2206.14858 [pdf, other]

Solving Quantitative Reasoning Problems with Language Models

Authors: Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra

Abstract: Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems at the college level. To help close this gap, we introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-of-the-art performance on technical benchmarks without the use of external tools. We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences that require quantitative reasoning, and find that the model can correctly answer nearly a third of them.

Submitted 30 June, 2022; v1 submitted 29 June, 2022; originally announced June 2022.

Comments: 12 pages, 5 figures + references and appendices

arXiv:2206.07041 [pdf, other]

Learning Behavior Representations Through Multi-Timescale Bootstrapping

Authors: Mehdi Azabou, Michael Mendelson, Maks Sorokin, Shantanu Thakoor, Nauman Ahad, Carolina Urzay, Eva L. Dyer

Abstract: Natural behavior consists of dynamics that are both unpredictable, can switch suddenly, and unfold over many different timescales. While some success has been found in building representations of behavior under constrained or simplified task-based conditions, many of these models cannot be applied to free and naturalistic settings due to the fact that they assume a single scale of temporal dynamics. In this work, we introduce Bootstrap Across Multiple Scales (BAMS), a multi-scale representation learning model for behavior: we combine a pooling module that aggregates features extracted over encoders with different temporal receptive fields, and design a set of latent objectives to bootstrap the representations in each respective space to encourage disentanglement across different timescales. We first apply our method on a dataset of quadrupeds navigating in different terrain types, and show that our model captures the temporal complexity of behavior. We then apply our method to the MABe 2022 Multi-agent behavior challenge, where our model ranks 3rd overall and 1st on two subtasks, and show the importance of incorporating multi-timescales when analyzing behavior.

Submitted 14 June, 2022; originally announced June 2022.

arXiv:2206.06131 [pdf, other]

Seeing the forest and the tree: Building representations of both individual and collective dynamics with transformers

Authors: Ran Liu, Mehdi Azabou, Max Dabagia, Jingyun Xiao, Eva L. Dyer

Abstract: Complex time-varying systems are often studied by abstracting away from the dynamics of individual components to build a model of the population-level dynamics from the start. However, when building a population-level description, it can be easy to lose sight of each individual and how they contribute to the larger picture. In this paper, we present a novel transformer architecture for learning from time-varying data that builds descriptions of both the individual as well as the collective population dynamics. Rather than combining all of our data into our model at the onset, we develop a separable architecture that operates on individual time-series first before passing them forward; this induces a permutation-invariance property and can be used to transfer across systems of different size and order. After demonstrating that our model can be applied to successfully recover complex interactions and dynamics in many-body systems, we apply our approach to populations of neurons in the nervous system. On neural activity datasets, we show that our model not only yields robust decoding performance, but also provides impressive performance in transfer across recordings of different animals without any neuron-level correspondence. By enabling flexible pre-training that can be transferred to neural recordings of different size and order, our work provides a first step towards creating a foundation model for neural decoding.

Submitted 20 October, 2022; v1 submitted 10 June, 2022; originally announced June 2022.

Comments: accepted by NeurIPS 2022

arXiv:2206.04615 [pdf, other]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza , et al. (426 additional authors not shown)

Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline.
Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

arXiv:2205.08413 [pdf, other]

Comparing high-dimensional neural recordings by aligning their low-dimensional latent representations

Authors: Max Dabagia, Konrad P Kording, Eva L Dyer

Abstract: Many questions in neuroscience involve understanding of the responses of large populations of neurons. However, when dealing with large-scale neural activity, interpretation becomes difficult, and comparisons between two animals, or across different time points becomes challenging. One major challenge that we face in modern neuroscience is that of correspondence, e.g. we do not record the exact same neurons at the exact same times. Without some way to link two or more datasets, comparing different collections of neural activity patterns becomes impossible. Here, we describe approaches for leveraging shared latent structure across neural recordings to tackle this correspondence challenge. We review algorithms that map two datasets into a shared space where they can be directly compared, and argue that alignment is key for comparing high-dimensional neural activities across times, subsets of neurons, and individuals.

Submitted 17 May, 2022; originally announced May 2022.

arXiv:2203.07852 [pdf, other]

Block-Recurrent Transformers

Authors: DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, Behnam Neyshabur

Abstract: We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence, and has linear complexity with respect to sequence length. Our recurrent cell operates on blocks of tokens rather than single tokens during training, and leverages parallel computation within a block in order to make efficient use of accelerator hardware. The cell itself is strikingly simple. It is merely a transformer layer: it uses self-attention and cross-attention to efficiently compute a recurrent function over a large set of state vectors and tokens. Our design was inspired in part by LSTM cells, and it uses LSTM-style gates, but it scales the typical LSTM cell up by several orders of magnitude. Our implementation of recurrence has the same cost in both computation time and parameter count as a conventional transformer layer, but offers dramatically improved perplexity in language modeling tasks over very long sequences. Our model out-performs a long-range Transformer XL baseline by a wide margin, while running twice as fast. We demonstrate its effectiveness on PG19 (books), arXiv papers, and GitHub source code. Our code has been released as open source.

Submitted 1 November, 2022; v1 submitted 11 March, 2022; originally announced March 2022.

Comments: Update to NeurIPS camera-ready version

arXiv:2202.04000 [pdf, other]

Learning Sinkhorn divergences for supervised change point detection

Authors: Nauman Ahad, Eva L. Dyer, Keith B. Hengen, Yao Xie, Mark A. Davenport

Abstract: Many modern applications require detecting change points in complex sequential data. Most existing methods for change point detection are unsupervised and, as a consequence, lack any information regarding what kind of changes we want to detect or if some kinds of changes are safe to ignore. This often results in poor change detection performance. We present a novel change point detection framework that uses true change point instances as supervision for learning a ground metric such that Sinkhorn divergences can be then used in two-sample tests on sliding windows to detect change points in an online manner. Our method can be used to learn a sparse metric which can be useful for both feature selection and interpretation in high-dimensional change point detection settings. Experiments on simulated as well as real world sequences show that our proposed method can substantially improve change point detection performance over existing unsupervised change point detection methods using only few labeled change point instances.

Submitted 10 February, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

Comments: 19 pages, 13 figures. Reorganized figures and text for improved readability

arXiv:2111.02338 [pdf, other]

Drop, Swap, and Generate: A Self-Supervised Approach for Generating Neural Activity

Authors: Ran Liu, Mehdi Azabou, Max Dabagia, Chi-Heng Lin, Mohammad Gheshlaghi Azar, Keith B. Hengen, Michal Valko, Eva L. Dyer

Abstract: Meaningful and simplified representations of neural activity can yield insights into how and what information is being processed within a neural circuit. However, without labels, finding representations that reveal the link between the brain and behavior can be challenging. Here, we introduce a novel unsupervised approach for learning disentangled representations of neural activity called Swap-VAE. Our approach combines a generative modeling framework with an instance-specific alignment loss that tries to maximize the representational similarity between transformed views of the input (brain state). These transformed (or augmented) views are created by dropping out neurons and jittering samples in time, which intuitively should lead the network to a representation that maintains both temporal consistency and invariance to the specific neurons used to represent the neural state. Through evaluations on both synthetic data and neural recordings from hundreds of neurons in different primate brains, we show that it is possible to build representations that disentangle neural datasets along relevant latent dimensions linked to behavior.

Submitted 3 November, 2021; originally announced November 2021.

Comments: To be published in Neurips 2021

arXiv:2109.04463 [pdf, other]

Neural Latents Benchmark '21: Evaluating latent variable models of neural population activity

Authors: Felix Pei, Joel Ye, David Zoltowski, Anqi Wu, Raeed H. Chowdhury, Hansem Sohn, Joseph E. O'Doherty, Krishna V. Shenoy, Matthew T. Kaufman, Mark Churchland, Mehrdad Jazayeri, Lee E. Miller, Jonathan Pillow, Il Memming Park, Eva L. Dyer, Chethan Pandarinath

Abstract: Advances in neural recording present increasing opportunities to study neural activity in unprecedented detail. Latent variable models (LVMs) are promising tools for analyzing this rich activity across diverse neural systems and behaviors, as LVMs do not depend on known relationships between the activity and external experimental variables. However, progress with LVMs for neuronal population activity is currently impeded by a lack of standardization, resulting in methods being developed and compared in an ad hoc manner. To coordinate these modeling efforts, we introduce a benchmark suite for latent variable modeling of neural population activity. We curate four datasets of neural spiking activity from cognitive, sensory, and motor areas to promote models that apply to the wide variety of activity seen across these areas. We identify unsupervised evaluation as a common framework for evaluating models across datasets, and apply several baselines that demonstrate benchmark diversity. We release this benchmark through EvalAI. http://neurallatents.github.io

Submitted 17 January, 2022; v1 submitted 9 September, 2021; originally announced September 2021.

arXiv:2102.10106 [pdf, other]

Mine Your Own vieW: Self-Supervised Learning Through Across-Sample Prediction

Authors: Mehdi Azabou, Mohammad Gheshlaghi Azar, Ran Liu, Chi-Heng Lin, Erik C. Johnson, Kiran Bhaskaran-Nair, Max Dabagia, Bernardo Avila-Pires, Lindsey Kitchell, Keith B. Hengen, William Gray-Roncal, Michal Valko, Eva L. Dyer

Abstract: State-of-the-art methods for self-supervised learning (SSL) build representations by maximizing the similarity between different transformed "views" of a sample. Without sufficient diversity in the transformations used to create views, however, it can be difficult to overcome nuisance variables in the data and build rich representations. This motivates the use of the dataset itself to find similar, yet distinct, samples to serve as views for one another. In this paper, we introduce Mine Your Own vieW (MYOW), a new approach for self-supervised learning that looks within the dataset to define diverse targets for prediction. The idea behind our approach is to actively mine views, finding samples that are neighbors in the representation space of the network, and then predict, from one sample's latent representation, the representation of a nearby sample. After showing the promise of MYOW on benchmarks used in computer vision, we highlight the power of this idea in a novel application in neuroscience where SSL has yet to be applied. When tested on multi-unit neural recordings, we find that MYOW outperforms other self-supervised approaches in all examples (in some cases by more than 10%), and often surpasses the supervised baseline. With MYOW, we show that it is possible to harness the diversity of the data to build rich views and leverage self-supervision in new domains where augmentations are limited or unknown.

Submitted 13 December, 2021; v1 submitted 19 February, 2021; originally announced February 2021.

arXiv:2102.06701 [pdf, other]

doi 10.1073/pnas.2311878121

Explaining Neural Scaling Laws

Authors: Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, Utkarsh Sharma

Abstract: The population loss of trained deep neural networks often follows precise power-law scaling relations with either the size of the training dataset or the number of parameters in the network. We propose a theory that explains the origins of and connects these scaling laws. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold. In the large width limit, this can be equivalently obtained from the spectrum of certain kernels, and we present evidence that large width and large dataset resolution-limited scaling exponents are related by a duality. We exhibit all four scaling regimes in the controlled setting of large random feature and pretrained models and test the predictions empirically on a range of standard architectures and datasets. We also observe several empirical relationships between datasets and scaling exponents under modifications of task and architecture aspect ratio. Our work provides a taxonomy for classifying different scaling regimes, underscores that there can be different mechanisms driving improvements in loss, and lends insight into the microscopic origins of and relationships between scaling exponents.

Submitted 28 April, 2024; v1 submitted 12 February, 2021; originally announced February 2021.

Comments: 11 pages, 3 figures + Supplement (expanded). This version to appear in PNAS

Journal ref: PNAS 121 (27) e2311878121 (2024)

arXiv:2102.06514 [pdf, other]

Large-Scale Representation Learning on Graphs via Bootstrapping

Authors: Shantanu Thakoor, Corentin Tallec, Mohammad Gheshlaghi Azar, Mehdi Azabou, Eva L. Dyer, Rémi Munos, Petar Veličković, Michal Valko

Abstract: Self-supervised learning provides a promising path towards eliminating the need for costly label information in representation learning on graphs. However, to achieve state-of-the-art performance, methods often need large numbers of negative examples and rely on complex augmentations. This can be prohibitively expensive, especially for large graphs. To address these challenges, we introduce Bootstrapped Graph Latents (BGRL) - a graph representation learning method that learns by predicting alternative augmentations of the input. BGRL uses only simple augmentations and alleviates the need for contrasting with negative examples, and is thus scalable by design. BGRL outperforms or matches prior methods on several established benchmarks, while achieving a 2-10x reduction in memory costs. Furthermore, we show that BGRL can be scaled up to extremely large graphs with hundreds of millions of nodes in the semi-supervised regime - achieving state-of-the-art performance and improving over supervised baselines where representations are shaped only through label information. In particular, our solution centered on BGRL constituted one of the winning entries to the Open Graph Benchmark - Large Scale Challenge at KDD Cup 2021, on a graph orders of magnitudes larger than all previously available benchmarks, thus demonstrating the scalability and effectiveness of our approach.

Submitted 20 February, 2023; v1 submitted 12 February, 2021; originally announced February 2021.

Comments: Published as a conference paper at ICLR 2022

arXiv:2012.11589 [pdf, other]

Making transport more robust and interpretable by moving data through a small number of anchor points

Authors: Chi-Heng Lin, Mehdi Azabou, Eva L. Dyer

Abstract: Optimal transport (OT) is a widely used technique for distribution alignment, with applications throughout the machine learning, graphics, and vision communities. Without any additional structural assumptions on transport, however, OT can be fragile to outliers or noise, especially in high dimensions. Here, we introduce a new form of structured OT that simultaneously learns low-dimensional structure in data while leveraging this structure to solve the alignment task. Compared with OT, the resulting transport plan has better structural interpretability, highlighting the connections between individual data points and local geometry, and is more robust to noise and sampling. We apply the method to synthetic as well as real datasets, where we show that our method can facilitate alignment in noisy settings and can be used to both correct and interpret domain shift.

Submitted 17 July, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

Journal ref: International Conference on Machine Learning (ICML) 2021

arXiv:2012.03107 [pdf, other]

When Do Curricula Work?

Authors: Xiaoxia Wu, Ethan Dyer, Behnam Neyshabur

Abstract: Inspired by human learning, researchers have proposed ordering examples during training based on their difficulty. Both curriculum learning, exposing a network to easier examples early in training, and anti-curriculum learning, showing the most difficult examples first, have been suggested as improvements to the standard i.i.d. training. In this work, we set out to investigate the relative benefits of ordered learning. We first investigate the \emph{implicit curricula} resulting from architectural and optimization bias and find that samples are learned in a highly consistent order. Next, to quantify the benefit of \emph{explicit curricula}, we conduct extensive experiments over thousands of orderings spanning three kinds of learning: curriculum, anti-curriculum, and random-curriculum -- in which the size of the training dataset is dynamically increased over time, but the examples are randomly ordered. We find that for standard benchmark datasets, curricula have only marginal benefits, and that randomly ordered samples perform as well or better than curricula and anti-curricula, suggesting that any benefit is entirely due to the dynamic training set size. Inspired by common use cases of curriculum learning in practice, we investigate the role of limited training time budget and noisy data in the success of curriculum learning. Our experiments demonstrate that curriculum, but not anti-curriculum can indeed improve the performance either with limited training time budget or in existence of noisy data.

Submitted 9 February, 2021; v1 submitted 5 December, 2020; originally announced December 2020.

Comments: ICLR 2021

arXiv:2008.08675 [pdf, other]

Asymptotics of Wide Convolutional Neural Networks

Authors: Anders Andreassen, Ethan Dyer

Abstract: Wide neural networks have proven to be a rich class of architectures for both theory and practice. Motivated by the observation that finite width convolutional networks appear to outperform infinite width networks, we study scaling laws for wide CNNs and networks with skip connections. Following the approach of (Dyer & Gur-Ari, 2019), we present a simple diagrammatic recipe to derive the asymptotic width dependence for many quantities of interest. These scaling relationships provide a solvable description for the training dynamics of wide convolutional networks. We test these relations across a broad range of architectures. In particular, we find that the difference in performance between finite and infinite width models vanishes at a definite rate with respect to model width. Nonetheless, this relation is consistent with finite width models generalizing either better or worse than their infinite width counterparts, and we provide examples where the relative performance depends on the optimization details.

Submitted 19 August, 2020; originally announced August 2020.

Comments: 23 pages, 12 figures

arXiv:2008.07545 [pdf, other]

Whitening and second order optimization both make information in the dataset unusable during training, and can reduce or prevent generalization

Authors: Neha S. Wadia, Daniel Duckworth, Samuel S. Schoenholz, Ethan Dyer, Jascha Sohl-Dickstein

Abstract: Machine learning is predicated on the concept of generalization: a model achieving low error on a sufficiently large training set should also perform well on novel samples from the same distribution. We show that both data whitening and second order optimization can harm or entirely prevent generalization. In general, model training harnesses information contained in the sample-sample second moment matrix of a dataset. For a general class of models, namely models with a fully connected first layer, we prove that the information contained in this matrix is the only information which can be used to generalize. Models trained using whitened data, or with certain second order optimization schemes, have less access to this information, resulting in reduced or nonexistent generalization ability. We experimentally verify these predictions for several architectures, and further demonstrate that generalization continues to be harmed even when theoretical requirements are relaxed. However, we also show experimentally that regularized second order optimization can provide a practical tradeoff, where training is accelerated but less information is lost, and generalization can in some circumstances even improve.

Submitted 19 July, 2021; v1 submitted 17 August, 2020; originally announced August 2020.

Comments: 13+10 pages, 10 figures; minor textual changes and some reorganization, one new figure and a new proof of main theorem added

arXiv:2007.07400 [pdf, other]

Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics

Authors: Vinay V. Ramasesh, Ethan Dyer, Maithra Raghu

Abstract: A central challenge in developing versatile machine learning systems is catastrophic forgetting: a model trained on tasks in sequence will suffer significant performance drops on earlier tasks. Despite the ubiquity of catastrophic forgetting, there is limited understanding of the underlying process and its causes. In this paper, we address this important knowledge gap, investigating how forgetting affects representations in neural network models. Through representational analysis techniques, we find that deeper layers are disproportionately the source of forgetting. Supporting this, a study of methods to mitigate forgetting illustrates that they act to stabilize deeper layers. These insights enable the development of an analytic argument and empirical picture relating the degree of forgetting to representational similarity between tasks. Consistent with this picture, we observe maximal forgetting occurs for task sequences with intermediate similarity. We perform empirical studies on the standard split CIFAR-10 setup and also introduce a novel CIFAR-100 based task approximating realistic input distribution shift.

Submitted 14 July, 2020; originally announced July 2020.

arXiv:2006.02624 [pdf, other]

Bayesian optimization for modular black-box systems with switching costs

Authors: Chi-Heng Lin, Joseph D. Miano, Eva L. Dyer

Abstract: Most existing black-box optimization methods assume that all variables in the system being optimized have equal cost and can change freely at each iteration. However, in many real world systems, inputs are passed through a sequence of different operations or modules, making variables in earlier stages of processing more costly to update. Such structure imposes a cost on switching variables in early parts of a data processing pipeline. In this work, we propose a new algorithm for switch cost-aware optimization called Lazy Modular Bayesian Optimization (LaMBO). This method efficiently identifies the global optimum while minimizing cost through a passive change of variables in early modules. The method is theoretically grounded and achieves vanishing regret when augmented with switching cost. We apply LaMBO to multiple synthetic functions and a three-stage image segmentation pipeline used in a neuroscience application, where we obtain promising improvements over prevailing cost-aware Bayesian optimization algorithms. Our results demonstrate that LaMBO is an effective strategy for black-box optimization that is capable of minimizing switching costs in modular systems.

Submitted 11 October, 2021; v1 submitted 3 June, 2020; originally announced June 2020.

arXiv:2003.03267 [pdf, other]

Resonant Very Low- and Ultra Low Frequency Digital Signal Reception Using a Portable Atomic Magnetometer

Authors: Stuart J. Ingleby, Iain C. Chalmers, Terry E. Dyer, Paul F. Griffin, Erling Riis

Abstract: Radio communication through attenuating media necessitates the use of very-low frequency (VLF) and ultra-low frequency (ULF) carrier bands, which are frequently used in underwater and under-ground communication applications. Quantum sensing techniques can be used to circumvent hard constraints on the size, weight and noise floor of classical signal transducers. In this low-frequency range, an optically pumped atomic sample can be used to detect carrier wave modulation resonant with ground-state Zeeman splitting of alkali atoms. Using a compact, self-calibrating system we demonstrate a resonant atomic transducer for digital data encoded using binary phase- and frequency-keying of resonant carrier waves in the 200 Hz to 200 kHz range. We present field trial data showing sensor noise floor, decoded data and received bit error rate, and calculate the projected range of sub-sea communication using this device.

Submitted 6 March, 2020; originally announced March 2020.

Comments: 8 pages, 9 figures

arXiv:2003.02218 [pdf, other]

The large learning rate phase of deep learning: the catapult mechanism

Authors: Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, Guy Gur-Ari

Abstract: The choice of initial learning rate can have a profound effect on the performance of deep networks. We present a class of neural networks with solvable training dynamics, and confirm their predictions empirically in practical deep learning settings. The networks exhibit sharply distinct behaviors at small and large learning rates. The two regimes are separated by a phase transition. In the small learning rate phase, training can be understood using the existing theory of infinitely wide neural networks. At large learning rates the model captures qualitatively distinct phenomena, including the convergence of gradient descent dynamics to flatter minima. One key prediction of our model is a narrow range of large, stable learning rates. We find good agreement between our model's predictions and training dynamics in realistic deep learning settings. Furthermore, we find that the optimal performance in such settings is often found in the large learning rate phase. We believe our results shed light on characteristics of models trained at different learning rates. In particular, they fill a gap between existing wide neural network theory, and the nonlinear, large learning rate, training dynamics relevant to practice.

Submitted 4 March, 2020; originally announced March 2020.

Comments: 25 pages, 19 figures

arXiv:2002.08973 [pdf, other]

Affinity and Diversity: Quantifying Mechanisms of Data Augmentation

Authors: Raphael Gontijo-Lopes, Sylvia J. Smullin, Ekin D. Cubuk, Ethan Dyer

Abstract: Though data augmentation has become a standard component of deep neural network training, the underlying mechanism behind the effectiveness of these techniques remains poorly understood. In practice, augmentation policies are often chosen using heuristics of either distribution shift or augmentation diversity. Inspired by these, we seek to quantify how data augmentation improves model generalization. To this end, we introduce interpretable and easy-to-compute measures: Affinity and Diversity. We find that augmentation performance is predicted not by either of these alone but by jointly optimizing the two.

Submitted 4 June, 2020; v1 submitted 20 February, 2020; originally announced February 2020.

Comments: 10 pages, 7 figures

arXiv:1909.11304 [pdf, other]

Asymptotics of Wide Networks from Feynman Diagrams

Authors: Ethan Dyer, Guy Gur-Ari

Abstract: Understanding the asymptotic behavior of wide networks is of considerable interest. In this work, we present a general method for analyzing this large width behavior. The method is an adaptation of Feynman diagrams, a standard tool for computing multivariate Gaussian integrals. We apply our method to study training dynamics, improving existing bounds and deriving new results on wide network evolution during stochastic gradient descent. Going beyond the strict large width limit, we present closed-form expressions for higher-order terms governing wide network training, and test these predictions empirically.

Submitted 25 September, 2019; originally announced September 2019.

Comments: 10 pages, 3 figures, 1 Table + Appendices

arXiv:1906.11768 [pdf, other]

Hierarchical Optimal Transport for Multimodal Distribution Alignment

Authors: John Lee, Max Dabagia, Eva L. Dyer, Christopher J. Rozell

Abstract: In many machine learning applications, it is necessary to meaningfully aggregate, through alignment, different but related datasets. Optimal transport (OT)-based approaches pose alignment as a divergence minimization problem: the aim is to transform a source dataset to match a target dataset using the Wasserstein distance as a divergence measure. We introduce a hierarchical formulation of OT which leverages clustered structure in data to improve alignment in noisy, ambiguous, or multimodal settings. To solve this numerically, we propose a distributed ADMM algorithm that also exploits the Sinkhorn distance, thus it has an efficient computational complexity that scales quadratically with the size of the largest cluster. When the transformation between two datasets is unitary, we provide performance guarantees that describe when and how well aligned cluster correspondences can be recovered with our formulation, as well as provide worst-case dataset geometry for such a strategy. We apply this method to synthetic datasets that model data as mixtures of low-rank Gaussians and study the impact that different geometric properties of the data have on alignment. Next, we applied our approach to a neural decoding application where the goal is to predict movement directions and instantaneous velocities from populations of neurons in the macaque primary motor cortex. Our results demonstrate that when clustered structure exists in datasets, and is consistent across trials or time points, a hierarchical alignment strategy that leverages such structure can provide significant improvements in cross-domain alignment.

Submitted 3 November, 2019; v1 submitted 27 June, 2019; originally announced June 2019.

arXiv:1812.07579 [pdf, other]

doi 10.1007/JHEP04(2019)025

The Most Irrational Rational Theories

Authors: Nathan Benjamin, Ethan Dyer, A. Liam Fitzpatrick, Yuan Xin

Abstract: We propose a two-parameter family of modular invariant partition functions of two-dimensional conformal field theories (CFTs) holographically dual to pure three-dimensional gravity in anti de Sitter space. Our two parameters control the central charge, and the representation of $SL(2,\mathbb{Z})$. At large central charge, the partition function has a gap to the first nontrivial primary state of $\frac{c}{24}$. As the $SL(2,\mathbb{Z})$ representation dimension gets large, the partition function exhibits some of the qualitative features of an irrational CFT. This, for instance, is captured in the behavior of the spectral form factor. As part of these analyses, we find similar behavior in the minimal model spectral form factor as $c$ approaches $1$.

Submitted 18 December, 2018; originally announced December 2018.

Comments: 25 pages plus appendices, 11 figures

arXiv:1812.04754 [pdf, other]

Gradient Descent Happens in a Tiny Subspace

Authors: Guy Gur-Ari, Daniel A. Roberts, Ethan Dyer

Abstract: We show that in a variety of large-scale deep learning scenarios the gradient dynamically converges to a very small subspace after a short period of training. The subspace is spanned by a few top eigenvectors of the Hessian (equal to the number of classes in the dataset), and is mostly preserved over long periods of training. A simple argument then suggests that gradient descent may happen mostly in this subspace. We give an example of this effect in a solvable model of classification, and we comment on possible implications for optimization and learning.

Submitted 11 December, 2018; originally announced December 2018.

Comments: 9 pages + appendices, 12 figures

arXiv:1709.01533 [pdf, other]

doi 10.1007/JHEP02(2018)148

Constraints on Flavored 2d CFT Partition Functions

Authors: Ethan Dyer, A. Liam Fitzpatrick, Yuan Xin

Abstract: We study the implications of modular invariance on 2d CFT partition functions with abelian or non-abelian currents when chemical potentials for the charges are turned on, i.e. when the partition functions are "flavored". We begin with a new proof of the transformation law for the modular transformation of such partition functions. Then we proceed to apply modular bootstrap techniques to constrain the spectrum of charged states in the theory. We improve previous upper bounds on the state with the greatest "mass-to-charge" ratio in such theories, as well as upper bounds on the weight of the lightest charged state and the charge of the weakest charged state in the theory. We apply the extremal functional method to theories that saturate such bounds, and in several cases we find the resulting predictions for the occupation numbers are precisely integers. Because such theories sometimes do not saturate a bound on the full space of states but do saturate a bound in the neutral sector of states, we find that adding flavor allows the extremal functional method to solve for some partition functions that would not be accessible to it otherwise.

Submitted 4 May, 2018; v1 submitted 5 September, 2017; originally announced September 2017.

Comments: 45 pages, 16 Figures v3: typos corrected, expanded appendix on numeric implementation

arXiv:1707.02467 [pdf, ps, other]

Random Walks on Small World Networks

Authors: Martin E. Dyer, Andreas Galanis, Leslie Ann Goldberg, Mark Jerrum, Eric Vigoda

Abstract: We study the mixing time of random walks on small-world networks modelled as follows: starting with the 2-dimensional periodic grid, each pair of vertices $\{u,v\}$ with distance $d>1$ is added as a "long-range" edge with probability proportional to $d^{-r}$, where $r\geq 0$ is a parameter of the model. Kleinberg studied a close variant of this network model and proved that the (decentralised) routing time is $O((\log n)^2)$ when $r=2$ and $n^{Ω(1)}$ when $r\neq 2$. Here, we prove that the random walk also undergoes a phase transition at $r=2$, but in this case the phase transition is of a different form. We establish that the mixing time is $Θ(\log n)$ for $r<2$, $O((\log n)^4)$ for $r=2$ and $n^{Ω(1)}$ for $r>2$.

Submitted 26 February, 2020; v1 submitted 8 July, 2017; originally announced July 2017.

Comments: To appear in Transactions of Algorithms (TALG)

arXiv:1702.06139 [pdf, other]

doi 10.1007/JHEP11(2017)060

Spinning Geodesic Witten Diagrams

Authors: Ethan Dyer, Daniel Z. Freedman, James Sully

Abstract: We present an expression for the four-point conformal blocks of symmetric traceless operators of arbitrary spin as an integral over a pair of geodesics in Anti-de Sitter space, generalizing the geodesic Witten diagram formalism of Hijano et al [arXiv:1508.00501] to arbitrary spin. As an intermediate step in the derivation, we identify a convenient basis of bulk three-point interaction vertices which give rise to all possible boundary three point structures. We highlight a direct connection between the representation of the conformal block as a geodesic Witten diagram and the shadow operator formalism.

Submitted 20 February, 2017; originally announced February 2017.

Comments: 28+6 pages, 8 figures

arXiv:1611.04592 [pdf, other]

doi 10.1007/JHEP08(2017)075

2D CFT Partition Functions at Late Times

Authors: Ethan Dyer, Guy Gur-Ari

Abstract: We consider the late time behavior of the analytically continued partition function $Z(β+ it) Z(β- it)$ in holographic $2d$ CFTs. This is a probe of information loss in such theories and in their holographic duals. We show that each Virasoro character decays in time, and so information is not restored at the level of individual characters. We identify a universal decaying contribution at late times, and conjecture that it describes the behavior of generic chaotic $2d$ CFTs out to times that are exponentially large in the central charge. It was recently suggested that at sufficiently late times one expects a crossover to random matrix behavior. We estimate an upper bound on the crossover time, which suggests that the decay is followed by a parametrically long period of late time growth. Finally, we discuss integrable theories and show how information is restored at late times by a series of characters. This hints at a possible bulk mechanism, where information is restored by an infinite sum over non-perturbative saddles.

Submitted 14 November, 2016; originally announced November 2016.

Comments: 36 pages, 7 figures

arXiv:1604.03629 [pdf, other]

Quantifying mesoscale neuroanatomy using X-ray microtomography

Authors: Eva L. Dyer, William Gray Roncal, Hugo L. Fernandes, Doga Gürsoy, Vincent De Andrade, Rafael Vescovi, Kamel Fezzaa, Xianghui Xiao, Joshua T. Vogelstein, Chris Jacobsen, Konrad P. Körding, Narayanan Kasthuri

Abstract: Methods for resolving the 3D microstructure of the brain typically start by thinly slicing and staining the brain, and then imaging each individual section with visible light photons or electrons. In contrast, X-rays can be used to image thick samples, providing a rapid approach for producing large 3D brain maps without sectioning. Here we demonstrate the use of synchrotron X-ray microtomography ($μ$CT) for producing mesoscale $(1~μm^3)$ resolution brain maps from millimeter-scale volumes of mouse brain. We introduce a pipeline for $μ$CT-based brain mapping that combines methods for sample preparation, imaging, automated segmentation of image volumes into cells and blood vessels, and statistical analysis of the resulting brain structures. Our results demonstrate that X-ray tomography promises rapid quantification of large brain volumes, complementing other brain mapping and connectomics efforts.

Submitted 26 July, 2016; v1 submitted 12 April, 2016; originally announced April 2016.

Comments: 28 pages, 9 figures

arXiv:1604.03199 [pdf, other]

From sample to knowledge: Towards an integrated approach for neuroscience discovery

Authors: William Gray Roncal, Eva L Dyer, Doga Gürsoy, Konrad Kording, Narayanan Kasthuri

Abstract: Imaging methods used in modern neuroscience experiments are quickly producing large amounts of data capable of providing increasing amounts of knowledge about neuroanatomy and function. A great deal of information in these datasets is relatively unexplored and untapped. One of the bottlenecks in knowledge extraction is that often there is no feedback loop between the knowledge produced (e.g., graph, density estimate, or other statistic) and the earlier stages of the pipeline (e.g., acquisition). We thus advocate for the development of sample-to-knowledge discovery pipelines that one can use to optimize acquisition and processing steps with a particular end goal (i.e., piece of knowledge) in mind. We therefore propose that optimization takes place not just within each processing stage but also between adjacent (and non-adjacent) steps of the pipeline. Furthermore, we explore the existing categories of knowledge representation and models to motivate the types of experiments and analysis needed to achieve the ultimate goal. To illustrate this approach, we provide an experimental paradigm to answer questions about large-scale synaptic distributions through a multimodal approach combining X-ray microtomography and electron microscopy.

Submitted 23 January, 2017; v1 submitted 11 April, 2016; originally announced April 2016.

Comments: first two authors contributed equally. 8 pages, 2 figures. v2: added acknowledgments

arXiv:1603.09745 [pdf, other]

doi 10.1007/JHEP08(2016)041

Universal Bounds on Charged States in 2d CFT and 3d Gravity

Authors: Nathan Benjamin, Ethan Dyer, A. Liam Fitzpatrick, Shamit Kachru

Abstract: We derive an explicit bound on the dimension of the lightest charged state in two dimensional conformal field theories with a global abelian symmetry. We find that the bound scales with $c$ and provide examples that parametrically saturate this bound. We also prove that any such theory must contain a state with charge-to-mass ratio above a minimal lower bound. We comment on the implications for charged states in three dimensional theories of gravity.

Submitted 18 July, 2016; v1 submitted 31 March, 2016; originally announced March 2016.

Comments: 33 pages, 1 figure; v2: additional refs and comments added

arXiv:1603.08524 [pdf, other]

doi 10.1007/JHEP08(2016)023

Small Black Holes and Near-Extremal CFTs

Authors: Nathan Benjamin, Ethan Dyer, A. Liam Fitzpatrick, Alexander Maloney, Eric Perlmutter

Abstract: Pure theories of AdS$_3$ quantum gravity are conjectured to be dual to CFTs with sparse spectra of light primary operators. The sparsest possible spectrum consistent with modular invariance includes only black hole states above the vacuum. Witten conjectured the existence of a family of extremal CFTs, which realize this spectrum for all admissible values of the central charge. We consider the quantum corrections to the classical spectrum, and propose a specific modification of Witten's conjecture which takes into account the existence of "small" black hole states. These have zero classical horizon area, with a calculable entropy attributed solely to loop effects. Our conjecture passes various consistency checks, especially when generalized to include theories with supersymmetry. In theories with $\mathcal{N}=2$ supersymmetry, this "near-extremal CFT" proposal precisely evades the no-go results of Gaberdiel et al.

Submitted 28 March, 2016; originally announced March 2016.

Comments: 41 pages + appendices, 6 figures

arXiv:1602.02191 [pdf, other]

Convex Relaxation Regression: Black-Box Optimization of Smooth Functions by Learning Their Convex Envelopes

Authors: Mohammad Gheshlaghi Azar, Eva Dyer, Konrad Kording

Abstract: Finding efficient and provable methods to solve non-convex optimization problems is an outstanding challenge in machine learning and optimization theory. A popular approach used to tackle non-convex problems is to use convex relaxation techniques to find a convex surrogate for the problem. Unfortunately, convex relaxations typically must be found on a problem-by-problem basis. Thus, providing a general-purpose strategy to estimate a convex relaxation would have a wide reaching impact. Here, we introduce Convex Relaxation Regression (CoRR), an approach for learning convex relaxations for a class of smooth functions. The main idea behind our approach is to estimate the convex envelope of a function $f$ by evaluating $f$ at a set of $T$ random points and then fitting a convex function to these function evaluations. We prove that with probability greater than $1-δ$, the solution of our algorithm converges to the global optimizer of $f$ with error $\mathcal{O} \Big( \big(\frac{\log(1/δ) }{T} \big)^α \Big)$ for some $α> 0$. Our approach enables the use of convex optimization tools to solve a class of non-convex optimization problems.

Submitted 3 March, 2016; v1 submitted 5 February, 2016; originally announced February 2016.

Journal ref: Proc. of the Conference on Uncertainty in Artificial Intelligence, pg. 22-31, 2016

arXiv:1507.00004 [pdf, ps, other]

doi 10.1088/1751-8113/48/49/495401

An Extremal N=2 Superconformal Field Theory

Authors: Nathan Benjamin, Ethan Dyer, A. Liam Fitzpatrick, Shamit Kachru

Abstract: We provide an example of an extremal chiral ${\cal N}=2$ superconformal field theory at $c=24$. The construction is based on a ${\mathbb Z}_2$ orbifold of the theory associated to the $A_{1}^{24}$ Niemeier lattice. The statespace is governed by representations of the sporadic group $M_{23}$.

Submitted 30 June, 2015; originally announced July 2015.

Comments: 20 pages

Showing 1–50 of 63 results for author: Dyer, E