-
Effective Long-Context Scaling of Foundation Models
Authors:
Wenhan Xiong,
**gyu Liu,
Igor Molybog,
Hejia Zhang,
Prajjwal Bhargava,
Rui Hou,
Louis Martin,
Rashi Rungta,
Karthik Abinav Sankararaman,
Barlas Oguz,
Madian Khabsa,
Han Fang,
Yashar Mehdad,
Sharan Narang,
Kshitiz Malik,
Angela Fan,
Shruti Bhosale,
Sergey Edunov,
Mike Lewis,
Sinong Wang,
Hao Ma
Abstract:
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our model series are built through continual pretraining from Llama 2 with longer training sequences and on a dataset where long texts are upsampled. We perform extensive evaluation on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchm…
▽ More
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our model series are built through continual pretraining from Llama 2 with longer training sequences and on a dataset where long texts are upsampled. We perform extensive evaluation on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchmarks, our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2. Notably, with a cost-effective instruction tuning procedure that does not require human-annotated long instruction data, the 70B variant can already surpass gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks. Alongside these results, we provide an in-depth analysis on the individual components of our method. We delve into Llama's position encodings and discuss its limitation in modeling long dependencies. We also examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths -- our ablation experiments suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.
△ Less
Submitted 13 November, 2023; v1 submitted 27 September, 2023;
originally announced September 2023.
-
Llama 2: Open Foundation and Fine-Tuned Chat Models
Authors:
Hugo Touvron,
Louis Martin,
Kevin Stone,
Peter Albert,
Amjad Almahairi,
Yasmine Babaei,
Nikolay Bashlykov,
Soumya Batra,
Prajjwal Bhargava,
Shruti Bhosale,
Dan Bikel,
Lukas Blecher,
Cristian Canton Ferrer,
Moya Chen,
Guillem Cucurull,
David Esiobu,
Jude Fernandes,
Jeremy Fu,
Wenyin Fu,
Brian Fuller,
Cynthia Gao,
Vedanuj Goswami,
Naman Goyal,
Anthony Hartshorn,
Saghar Hosseini
, et al. (43 additional authors not shown)
Abstract:
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be…
▽ More
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
△ Less
Submitted 19 July, 2023; v1 submitted 18 July, 2023;
originally announced July 2023.
-
HaSa: Hardness and Structure-Aware Contrastive Knowledge Graph Embedding
Authors:
Honggen Zhang,
June Zhang,
Igor Molybog
Abstract:
We consider a contrastive learning approach to knowledge graph embedding (KGE) via InfoNCE. For KGE, efficient learning relies on augmenting the training data with negative triples. However, most KGE works overlook the bias from generating the negative triples-false negative triples (factual triples missing from the knowledge graph). We argue that the generation of high-quality (i.e., hard) negati…
▽ More
We consider a contrastive learning approach to knowledge graph embedding (KGE) via InfoNCE. For KGE, efficient learning relies on augmenting the training data with negative triples. However, most KGE works overlook the bias from generating the negative triples-false negative triples (factual triples missing from the knowledge graph). We argue that the generation of high-quality (i.e., hard) negative triples might lead to an increase in false negative triples. To mitigate the impact of false negative triples during the generation of hard negative triples, we propose the Hardness and Structure-aware (\textbf{HaSa}) contrastive KGE method, which alleviates the effect of false negative triples while generating the hard negative triples. Experiments show that HaSa improves the performance of InfoNCE-based KGE approaches and achieves state-of-the-art results in several metrics for WN18RR datasets and competitive results for FB15k-237 datasets compared to both classic and pre-trained LM-based KGE methods.
△ Less
Submitted 14 October, 2023; v1 submitted 17 May, 2023;
originally announced May 2023.
-
A Theory on Adam Instability in Large-Scale Machine Learning
Authors:
Igor Molybog,
Peter Albert,
Moya Chen,
Zachary DeVito,
David Esiobu,
Naman Goyal,
Punit Singh Koura,
Sharan Narang,
Andrew Poulton,
Ruan Silva,
Binh Tang,
Diana Liskovich,
Puxin Xu,
Yuchen Zhang,
Melanie Kambadur,
Stephen Roller,
Susan Zhang
Abstract:
We present a theory for the previously unexplained divergent behavior noticed in the training of large language models. We argue that the phenomenon is an artifact of the dominant optimization algorithm used for training, called Adam. We observe that Adam can enter a state in which the parameter update vector has a relatively large norm and is essentially uncorrelated with the direction of descent…
▽ More
We present a theory for the previously unexplained divergent behavior noticed in the training of large language models. We argue that the phenomenon is an artifact of the dominant optimization algorithm used for training, called Adam. We observe that Adam can enter a state in which the parameter update vector has a relatively large norm and is essentially uncorrelated with the direction of descent on the training loss landscape, leading to divergence. This artifact is more likely to be observed in the training of a deep model with a large batch size, which is the typical setting of large-scale language model training. To argue the theory, we present observations from the training runs of the language models of different scales: 7 billion, 30 billion, 65 billion, and 546 billion parameters.
△ Less
Submitted 25 April, 2023; v1 submitted 19 April, 2023;
originally announced April 2023.
-
Over-parametrization via Lifting for Low-rank Matrix Sensing: Conversion of Spurious Solutions to Strict Saddle Points
Authors:
Ziye Ma,
Igor Molybog,
Javad Lavaei,
Somayeh Sojoudi
Abstract:
This paper studies the role of over-parametrization in solving non-convex optimization problems. The focus is on the important class of low-rank matrix sensing, where we propose an infinite hierarchy of non-convex problems via the lifting technique and the Burer-Monteiro factorization. This contrasts with the existing over-parametrization technique where the search rank is limited by the dimension…
▽ More
This paper studies the role of over-parametrization in solving non-convex optimization problems. The focus is on the important class of low-rank matrix sensing, where we propose an infinite hierarchy of non-convex problems via the lifting technique and the Burer-Monteiro factorization. This contrasts with the existing over-parametrization technique where the search rank is limited by the dimension of the matrix and it does not allow a rich over-parametrization of an arbitrary degree. We show that although the spurious solutions of the problem remain stationary points through the hierarchy, they will be transformed into strict saddle points (under some technical conditions) and can be escaped via local search methods. This is the first result in the literature showing that over-parametrization creates a negative curvature for esca** spurious solutions. We also derive a bound on how much over-parametrization is requited to enable the elimination of spurious solutions.
△ Less
Submitted 15 February, 2023;
originally announced February 2023.
-
When Does MAML Objective Have Benign Landscape?
Authors:
Igor Molybog,
Javad Lavaei
Abstract:
The paper studies the complexity of the optimization problem behind the Model-Agnostic Meta-Learning (MAML) algorithm. The goal of the study is to determine the global convergence of MAML on sequential decision-making tasks possessing a common structure. We are curious to know when, if at all, the benign landscape of the underlying tasks results in a benign landscape of the corresponding MAML obje…
▽ More
The paper studies the complexity of the optimization problem behind the Model-Agnostic Meta-Learning (MAML) algorithm. The goal of the study is to determine the global convergence of MAML on sequential decision-making tasks possessing a common structure. We are curious to know when, if at all, the benign landscape of the underlying tasks results in a benign landscape of the corresponding MAML objective. For illustration, we analyze the landscape of the MAML objective on LQR tasks to determine what types of similarities in their structures enable the algorithm to converge to the globally optimal solution.
△ Less
Submitted 10 December, 2020; v1 submitted 31 May, 2020;
originally announced June 2020.