-
Domain Adaptation of Llama3-70B-Instruct through Continual Pre-Training and Model Merging: A Comprehensive Evaluation
Authors:
Shamane Siriwardhana,
Mark McQuade,
Thomas Gauthier,
Lucas Atkins,
Fernando Fernandes Neto,
Luke Meyers,
Anneketh Vij,
Tyler Odenthal,
Charles Goddard,
Mary MacCarthy,
Jacob Solawetz
Abstract:
We conducted extensive experiments on domain adaptation of the Meta-Llama-3-70B-Instruct model on SEC data, exploring its performance on both general and domain-specific benchmarks. Our focus included continual pre-training (CPT) and model merging, aiming to enhance the model's domain-specific capabilities while mitigating catastrophic forgetting. Through this study, we evaluated the impact of integrating financial regulatory data into a robust language model and examined the effectiveness of our model merging techniques in preserving and improving the model's instructive abilities. The model is available on Hugging Face at https://huggingface.co/arcee-ai/Llama-3-SEC-Base (arcee-ai/Llama-3-SEC-Base). This is an intermediate checkpoint of our final model, which has seen 20B tokens so far. The full model is still in the process of training. This is a preprint technical report with thorough evaluations to understand the entire process.
Submitted 21 June, 2024;
originally announced June 2024.
-
Arcee's MergeKit: A Toolkit for Merging Large Language Models
Authors:
Charles Goddard,
Shamane Siriwardhana,
Malikeh Ehghaghi,
Luke Meyers,
Vlad Karpukhin,
Brian Benedict,
Mark McQuade,
Jacob Solawetz
Abstract:
The rapid expansion of the open-source language model landscape presents an opportunity to merge the competencies of these model checkpoints by combining their parameters. Advances in transfer learning, the process of fine-tuning pretrained models for specific tasks, have resulted in the development of a vast number of task-specific models, typically specialized in individual tasks and unable to utilize each other's strengths. Model merging facilitates the creation of multitask models without the need for additional training, offering a promising avenue for enhancing model performance and versatility. By preserving the intrinsic capabilities of the original models, model merging addresses complex challenges in AI - including the difficulties of catastrophic forgetting and multitask learning. To support this expanding area of research, we introduce MergeKit, a comprehensive, open-source library designed to facilitate the application of model merging strategies. MergeKit offers an extensible framework to efficiently merge models on any hardware, providing utility to researchers and practitioners. To date, thousands of models have been merged by the open-source community, leading to the creation of some of the world's most powerful open-source model checkpoints, as assessed by the Open LLM Leaderboard. The library is accessible at https://github.com/arcee-ai/MergeKit.
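As a rough illustration of what "combining their parameters" can mean, the sketch below averages matching parameters from several checkpoints with user-chosen weights. It is a minimal, dependency-free example of linear weight averaging and does not use or reproduce MergeKit's API; the function name `linear_merge` and the plain-list representation of tensors are assumptions made for this example.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def linear_merge(state_dicts: Sequence[StateDict], weights: Sequence[float]) -> StateDict:
    """Return the weighted average of matching parameters across checkpoints."""
    if not state_dicts or len(state_dicts) != len(weights):
        raise ValueError("need one weight per checkpoint")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    names = set(state_dicts[0])
    if any(set(sd) != names for sd in state_dicts):
        raise ValueError("checkpoints must share the same parameter names")

    merged: StateDict = {}
    for name in state_dicts[0]:
        size = len(state_dicts[0][name])
        if any(len(sd[name]) != size for sd in state_dicts):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise weighted sum, normalised so the weights act as proportions.
        merged[name] = [
            sum(w * sd[name][i] for sd, w in zip(state_dicts, weights)) / total
            for i in range(size)
        ]
    return merged
```

For example, merging `{"layer.weight": [1.0, 2.0]}` and `{"layer.weight": [3.0, 6.0]}` with equal weights gives the element-wise mean of the two parameter vectors. No training step is involved, which is the property the abstract highlights.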
Submitted 20 March, 2024; v1 submitted 19 March, 2024;
originally announced March 2024.
-
Towards Better Inclusivity: A Diverse Tweet Corpus of English Varieties
Authors:
Nhi Pham,
Lachlan Pham,
Adam L. Meyers
Abstract:
The prevalence of social media presents a growing opportunity to collect and analyse examples of English varieties. Whilst these varieties were - and, in many cases, still are - used only in spoken contexts or hard-to-access private messages, social media sites like Twitter provide a platform for users to communicate informally in a scrapeable format. Notably, Indian English (Hinglish), Singaporean English (Singlish), and African-American English (AAE) can be commonly found online. These varieties pose a challenge to existing natural language processing (NLP) tools as they often differ orthographically and syntactically from standard English, for which the majority of these tools are built. NLP models trained on standard English texts produced biased outcomes for users of underrepresented varieties. Some research has aimed to overcome the inherent biases caused by unrepresentative data through techniques like data augmentation or adjusting training models.
We aim to address the issue of bias at its root - the data itself. We curate a dataset of tweets from countries with high proportions of underserved English variety speakers, and propose an annotation framework of six categorical classifications along a pseudo-spectrum that measures the degree of standard English and that thereby indirectly aims to surface the manifestations of English varieties in these tweets. Following best annotation practices, our growing corpus features 170,800 tweets taken from 7 countries, labeled by annotators who are from those countries and can communicate in regionally-dominant varieties of English. Our corpus highlights the accuracy discrepancies in pre-trained language identifiers between western English and non-western (i.e., less standard) English varieties. We hope to contribute to the growing literature identifying and reducing the implicit demographic discrepancies in NLP.
Submitted 21 January, 2024;
originally announced January 2024.
-
Infectious disease surveillance needs for the United States: lessons from COVID-19
Authors:
Marc Lipsitch,
Mary T. Bassett,
John S. Brownstein,
Paul Elliott,
David Eyre,
M. Kate Grabowski,
James A. Hay,
Michael Johansson,
Stephen M. Kissler,
Daniel B. Larremore,
Jennifer Layden,
Justin Lessler,
Ruth Lynfield,
Duncan MacCannell,
Lawrence C. Madoff,
C. Jessica E. Metcalf,
Lauren A. Meyers,
Sylvia K. Ofori,
Celia Quinn,
Ana I. Ramos Bento,
Nick Reich,
Steven Riley,
Roni Rosenfeld,
Matthew H. Samore,
Rangarajan Sampath
, et al. (5 additional authors not shown)
Abstract:
The COVID-19 pandemic has highlighted the need to upgrade systems for infectious disease surveillance and forecasting and modeling of the spread of infection, both of which inform evidence-based public health guidance and policies. Here, we discuss requirements for an effective surveillance system to support decision making during a pandemic, drawing on the lessons of COVID-19 in the U.S., while looking to jurisdictions in the U.S. and beyond to learn lessons about the value of specific data types. In this report, we define the range of decisions for which surveillance data are required, the data elements needed to inform these decisions and to calibrate inputs and outputs of transmission-dynamic models, and the types of data needed to inform decisions by state, territorial, local, and tribal health authorities. We define actions needed to ensure that such data will be available and consider the contribution of such efforts to improving health equity.
Submitted 22 November, 2023;
originally announced November 2023.
-
Towards Automatic Honey Bee Flower-Patch Assays with Paint Marking Re-Identification
Authors:
Luke Meyers,
Josué Rodríguez Cordero,
Carlos Corrada Bravo,
Fanfan Noel,
José Agosto-Rivera,
Tugrul Giray,
Rémi Mégret
Abstract:
In this paper, we show that paint markings are a feasible approach to automate the analysis of behavioral assays involving honey bees in the field, where marking has to be as lightweight as possible. We contribute a novel dataset for bee re-identification with paint markings, comprising 4392 images and 27 identities. Contrastive learning with a ResNet backbone and triplet loss led to identity representation features with almost perfect recognition in a closed setting where identities are known in advance. Diverse experiments evaluate the capability to generalize to separate IDs, and show the impact of using different body parts for identification, such as using the unmarked abdomen only. In addition, we show the potential to fully automate the visit detection and provide preliminary results of compute time for future real-time deployment in the field on an edge device.
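For readers unfamiliar with the triplet loss mentioned above, the following sketch shows the standard margin-based form on plain embedding vectors. It is not taken from the paper: the function name `triplet_loss`, the Euclidean distance, and the default margin of 1.0 are assumptions chosen for illustration, and the ResNet backbone that would produce the embeddings is omitted.

```python
import math
from typing import Sequence


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value, not from the paper
) -> float:
    """Standard triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_pos = math.dist(anchor, positive)  # distance to an image of the same identity
    d_neg = math.dist(anchor, negative)  # distance to an image of a different identity
    # The loss is zero once the negative is at least `margin` farther away than the positive.
    return max(0.0, d_pos - d_neg + margin)
```

Minimising this quantity over many (anchor, positive, negative) triplets pulls embeddings of the same individual together and pushes different individuals apart, which is what makes nearest-neighbour identification possible afterwards.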
Submitted 13 November, 2023;
originally announced November 2023.
-
Reinforcement Learning for Optimization of COVID-19 Mitigation policies
Authors:
Varun Kompella,
Roberto Capobianco,
Stacy Jong,
Jonathan Browne,
Spencer Fox,
Lauren Meyers,
Peter Wurman,
Peter Stone
Abstract:
The year 2020 has seen the COVID-19 virus lead to one of the worst global pandemics in history. As a result, governments around the world are faced with the challenge of protecting public health, while keeping the economy running to the greatest extent possible. Epidemiological models provide insight into the spread of these types of diseases and predict the effects of possible intervention policies. However, to date, even the most data-driven intervention policies rely on heuristics. In this paper, we study how reinforcement learning (RL) can be used to optimize mitigation policies that minimize the economic impact without overwhelming the hospital capacity. Our main contributions are (1) a novel agent-based pandemic simulator which, unlike traditional models, is able to model fine-grained interactions among people at specific locations in a community; and (2) an RL-based methodology for optimizing fine-grained mitigation policies within this simulator. Our results validate both the overall simulator behavior and the learned policies under realistic conditions.
Submitted 20 October, 2020;
originally announced October 2020.
-
Periodicity in Movement Patterns Shapes Epidemic Risk in Urban Environments
Authors:
Zhanwei Du,
Spencer J Fox,
Petter Holme,
Jiming Liu,
Alison P. Galvani,
Lauren Ancel Meyers
Abstract:
Daily variation in human mobility modulates the speed and severity of emerging outbreaks, yet most epidemiological studies assume static contact patterns. With a highly mobile population exceeding 24 million people, Shanghai, China is a transportation hub at high risk for the importation and subsequent global propagation of infectious diseases. Here, we use a dynamic metapopulation model informed by hourly transit data for Shanghai to estimate epidemic risks across thousands of outbreak scenarios. We find that the rate of initial epidemic growth varies by more than twenty-fold, depending on the hour and neighborhood of disease introduction. The riskiest introductions are those occurring close to the city center and on Fridays--which bridge weekday and weekend transit patterns and thereby connect otherwise disconnected portions of the population. The identification of these spatio-temporal hotspots can inform more efficient targets for sentinel surveillance and strategies for mitigating transmission.
Submitted 13 September, 2018;
originally announced September 2018.
-
Multiscale Network Generation
Authors:
Alexander Gutfraind,
Lauren Ancel Meyers,
Ilya Safro
Abstract:
Networks are widely used in science and technology to represent relationships between entities, such as social or ecological links between organisms, enzymatic interactions in metabolic systems, or computer infrastructure. Statistical analyses of networks can provide critical insights into the structure, function, dynamics, and evolution of those systems. However, the structures of real-world networks are often not known completely, and they may exhibit considerable variation so that no single network is sufficiently representative of a system. In such situations, researchers may turn to proxy data from related systems, sophisticated methods for network inference, or synthetic networks. Here, we introduce a flexible method for synthesizing realistic ensembles of networks starting from a known network, through a series of mappings that coarsen and later refine the network structure by randomized editing. The method, MUSKETEER, preserves structural properties with minimal bias, including unknown or unspecified features, while introducing realistic variability at multiple scales. Using examples from several domains, we show that MUSKETEER produces the intended stochasticity while achieving greater fidelity across a suite of network properties than do other commonly used network generation algorithms.
Submitted 18 July, 2012;
originally announced July 2012.
-
Evolving Clustered Random Networks
Authors:
Shweta Bansal,
Shashank Khandelwal,
Lauren Ancel Meyers
Abstract:
We propose a Markov chain simulation method to generate simple connected random graphs with a specified degree sequence and level of clustering. The networks generated by our algorithm are random in all other respects and can thus serve as generic models for studying the impacts of degree distributions and clustering on dynamical processes as well as null models for detecting other structural properties in empirical networks.
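The abstract does not spell out the chain's moves, so the sketch below is only a generic illustration of the idea, not the authors' algorithm. It uses the well-known degree-preserving double edge swap as the proposal and accepts a swap only when the triangle count does not decrease, which nudges clustering upward while keeping the graph simple. The function names, the acceptance rule, and the omission of a connectivity check are all assumptions made for this example.

```python
import random
from itertools import combinations
from typing import Dict, Hashable, List, Set, Tuple

Adjacency = Dict[Hashable, Set[Hashable]]


def count_triangles(adj: Adjacency) -> int:
    """Count triangles; each one is seen once from each of its three corners."""
    corners = sum(1 for u in adj for v, w in combinations(adj[u], 2) if v in adj[w])
    return corners // 3


def rewire_preserving_degrees(
    edges: List[Tuple[Hashable, Hashable]], steps: int, seed: int = 0
) -> Adjacency:
    """Degree-preserving edge swaps, kept only if triangles do not decrease.

    Assumption: connectivity is NOT checked here, unlike the method in the abstract.
    """
    rng = random.Random(seed)
    adj: Adjacency = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    edge_list = list(edges)
    triangles = count_triangles(adj)

    for _ in range(steps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:  # pick one of the two possible re-pairings
            c, d = d, c
        # Reject proposals that would create a self-loop or a duplicate edge.
        if len({a, b, c, d}) < 4 or d in adj[a] or b in adj[c]:
            continue
        # Apply the swap (a-b, c-d) -> (a-d, c-b); every node keeps its degree.
        for x, y in ((a, b), (c, d)):
            adj[x].remove(y)
            adj[y].remove(x)
        for x, y in ((a, d), (c, b)):
            adj[x].add(y)
            adj[y].add(x)
        new_triangles = count_triangles(adj)
        if new_triangles >= triangles:
            triangles = new_triangles
            edge_list[i], edge_list[j] = (a, d), (c, b)
        else:
            # Revert the swap.
            for x, y in ((a, d), (c, b)):
                adj[x].remove(y)
                adj[y].remove(x)
            for x, y in ((a, b), (c, d)):
                adj[x].add(y)
                adj[y].add(x)
    return adj
```

Recounting all triangles after every proposal keeps the sketch short but is slow on large graphs; a practical implementation would update the count locally around the four affected nodes and would target a specified clustering level rather than simply never decreasing it.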
Submitted 4 August, 2008;
originally announced August 2008.