Quantifying Variance in Evaluation Benchmarks

Authors: Lovish Madaan, Aaditya K. Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, Dieuwke Hupkes

Abstract: Evaluation benchmarks are the cornerstone of measuring capabilities of large language models (LLMs), as well as driving progress in said capabilities. Originally designed to make claims about capabilities (or lack thereof) in fully pretrained models, evaluation benchmarks are now also extensively used to decide between various training choices. Despite this widespread usage, we rarely quantify the variance in our evaluation benchmarks, which dictates whether differences in performance are meaningful. Here, we define and measure a range of metrics geared towards measuring variance in evaluation benchmarks, including seed variance across initialisations, and monotonicity during training. By studying a large number of models -- both openly available and pretrained from scratch -- we provide empirical estimates for a variety of variance metrics, with considerations and recommendations for practitioners. We also evaluate the utility and tradeoffs of continuous versus discrete performance measures and explore options for better understanding and reducing this variance. We find that simple changes, such as framing choice tasks (like MMLU) as completion tasks, can often reduce variance for smaller scale ($\sim$7B) models, while more involved methods inspired from human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance. Overall, our work provides insights into variance in evaluation benchmarks, suggests LM-specific techniques to reduce variance, and more generally encourages practitioners to carefully factor in variance when comparing models.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2307.09288 [pdf, other]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Authors: Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, et al. (43 additional authors not shown)

Abstract: In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.

Submitted 19 July, 2023; v1 submitted 18 July, 2023; originally announced July 2023.

arXiv:2304.09871 [pdf, other]

A Theory on Adam Instability in Large-Scale Machine Learning

Authors: Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, Binh Tang, Diana Liskovich, Puxin Xu, Yuchen Zhang, Melanie Kambadur, Stephen Roller, Susan Zhang

Abstract: We present a theory for the previously unexplained divergent behavior noticed in the training of large language models. We argue that the phenomenon is an artifact of the dominant optimization algorithm used for training, called Adam. We observe that Adam can enter a state in which the parameter update vector has a relatively large norm and is essentially uncorrelated with the direction of descent on the training loss landscape, leading to divergence. This artifact is more likely to be observed in the training of a deep model with a large batch size, which is the typical setting of large-scale language model training. To argue the theory, we present observations from the training runs of the language models of different scales: 7 billion, 30 billion, 65 billion, and 546 billion parameters.

Submitted 25 April, 2023; v1 submitted 19 April, 2023; originally announced April 2023.

arXiv:2211.09085 [pdf, other]

Galactica: A Large Language Model for Science

Authors: Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, Robert Stojnic

Abstract: Information overload is a major obstacle to scientific progress. The explosive growth in scientific literature and data has made it ever harder to discover useful insights in a large mass of information. Today scientific knowledge is accessed through search engines, but they are unable to organize scientific knowledge alone. In this paper we introduce Galactica: a large language model that can store, combine and reason about scientific knowledge. We train on a large scientific corpus of papers, reference material, knowledge bases and many other sources. We outperform existing models on a range of scientific tasks. On technical knowledge probes such as LaTeX equations, Galactica outperforms the latest GPT-3 by 68.2% versus 49.0%. Galactica also performs well on reasoning, outperforming Chinchilla on mathematical MMLU by 41.3% to 35.7%, and PaLM 540B on MATH with a score of 20.4% versus 8.8%. It also sets a new state-of-the-art on downstream tasks such as PubMedQA and MedMCQA dev of 77.6% and 52.9%. And despite not being trained on a general corpus, Galactica outperforms BLOOM and OPT-175B on BIG-bench. We believe these results demonstrate the potential for language models as a new interface for science. We open source the model for the benefit of the scientific community.

Submitted 16 November, 2022; originally announced November 2022.

arXiv:1312.4475 [pdf, ps, other]

Almost split sequences for Knorr lattices

Authors: Andrew Poulton

Abstract: Let $O$ be a complete d.v.r. and $G$ a finite group. We give two applications of an adjunction in the stable category of $OG$. The first application gives necessary and sufficient conditions for the middle term of an almost split sequence terminating in a Knorr lattice to be indecomposable. The second characterises the stable endomorphism rings of Heller lattices of kG-modules.

Submitted 15 March, 2014; v1 submitted 16 December, 2013; originally announced December 2013.

arXiv:1204.4459 [pdf]

An Interference-Aware Virtual Clustering Paradigm for Resource Management in Cognitive Femtocell Networks

Authors: Faisal Tariq, Laurence S. Dooley, Adrian S. Poulton

Abstract: Femtocells represent a promising alternative solution for high quality wireless access in indoor scenarios where conventional cellular system coverage can be poor. Femtocell access points (FAP) are normally randomly deployed by the end user, so only post deployment network planning is possible. Furthermore, this uncoordinated deployment creates the potential for severe interference to co-located femtocells, especially in dense deployments. This paper presents a new femtocell network architecture using a generalized virtual cluster femtocell (GVCF) paradigm, which groups together FAP, which are allocated to the same femtocell gateway (FGW), into logical clusters. This guarantees severely interfering and overlapping femtocells are assigned to different clusters, and since each cluster operates on a different band of frequencies, the corresponding virtual cluster controller only has to manage its own FAP members, so the overall system complexity is low. The performance of the GVCF algorithm is analysed from both a resource availability and cluster number perspective, and a novel strategy is proposed for dynamically adapting these to network environment changes, while upholding quality-of-service requirements. Simulation results conclusively corroborate the superior performance of the GVCF model in interference mitigation, particularly in high density FAP scenarios.

Submitted 19 April, 2012; originally announced April 2012.

arXiv:0911.2672 [pdf, ps, other]

Maps admitting trialities but not dualities

Authors: Gareth A. Jones, Andrew Poulton

Abstract: We use group theory to construct infinite families of maps on surfaces which are invariant under Wilson's map operations of order 3 but not under the operations of order 2, such as duality and Petrie duality.

Submitted 13 November, 2009; originally announced November 2009.

Comments: 19 pages, 1 figure

MSC Class: 05C25; 05C10; 20B25

Showing 1–7 of 7 results for author: Poulton, A