-
Domain Adaptation of Llama3-70B-Instruct through Continual Pre-Training and Model Merging: A Comprehensive Evaluation
Authors:
Shamane Siriwardhana,
Mark McQuade,
Thomas Gauthier,
Lucas Atkins,
Fernando Fernandes Neto,
Luke Meyers,
Anneketh Vij,
Tyler Odenthal,
Charles Goddard,
Mary MacCarthy,
Jacob Solawetz
Abstract:
We conducted extensive experiments on domain adaptation of the Meta-Llama-3-70B-Instruct model on SEC data, exploring its performance on both general and domain-specific benchmarks. Our focus included continual pre-training (CPT) and model merging, aiming to enhance the model's domain-specific capabilities while mitigating catastrophic forgetting. Through this study, we evaluated the impact of integrating financial regulatory data into a robust language model and examined the effectiveness of our model merging techniques in preserving and improving the model's instructive abilities. The model is accessible on Hugging Face at https://huggingface.co/arcee-ai/Llama-3-SEC-Base (arcee-ai/Llama-3-SEC-Base). This is an intermediate checkpoint of our final model, which has seen 20B tokens so far. The full model is still in the process of training. This is a preprint technical report with thorough evaluations to understand the entire process.
Submitted 21 June, 2024;
originally announced June 2024.
-
Arcee's MergeKit: A Toolkit for Merging Large Language Models
Authors:
Charles Goddard,
Shamane Siriwardhana,
Malikeh Ehghaghi,
Luke Meyers,
Vlad Karpukhin,
Brian Benedict,
Mark McQuade,
Jacob Solawetz
Abstract:
The rapid expansion of the open-source language model landscape presents an opportunity to merge the competencies of these model checkpoints by combining their parameters. Advances in transfer learning, the process of fine-tuning pretrained models for specific tasks, have resulted in the development of a vast number of task-specific models, typically specialized in individual tasks and unable to utilize each other's strengths. Model merging facilitates the creation of multitask models without the need for additional training, offering a promising avenue for enhancing model performance and versatility. By preserving the intrinsic capabilities of the original models, model merging addresses complex challenges in AI - including the difficulties of catastrophic forgetting and multitask learning. To support this expanding area of research, we introduce MergeKit, a comprehensive, open-source library designed to facilitate the application of model merging strategies. MergeKit offers an extensible framework to efficiently merge models on any hardware, providing utility to researchers and practitioners. To date, thousands of models have been merged by the open-source community, leading to the creation of some of the world's most powerful open-source model checkpoints, as assessed by the Open LLM Leaderboard. The library is accessible at https://github.com/arcee-ai/MergeKit.
Submitted 20 March, 2024; v1 submitted 19 March, 2024;
originally announced March 2024.
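The idea of merging checkpoints "by combining their parameters", as described in the MergeKit abstract above, can be illustrated with the simplest possible strategy: a weighted average of matching parameters. The sketch below is not MergeKit's API and makes no claim about how the library is implemented; the function name `linear_merge`, the `StateDict` alias, and the use of plain Python lists in place of tensors are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def linear_merge(models: Sequence[StateDict], weights: Sequence[float]) -> StateDict:
    """Weighted average of parameters that share a name across all models."""
    if not models or len(models) != len(weights):
        raise ValueError("need one weight per model and at least one model")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    normalised = [w / total for w in weights]  # make the weights sum to 1

    # All checkpoints must expose exactly the same parameter names.
    names = set(models[0])
    if any(set(m) != names for m in models[1:]):
        raise ValueError("models do not share the same parameter names")

    merged: StateDict = {}
    for name in models[0]:
        tensors = [m[name] for m in models]
        size = len(tensors[0])
        if any(len(t) != size for t in tensors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise weighted sum across the checkpoints.
        merged[name] = [
            sum(w * t[i] for w, t in zip(normalised, tensors)) for i in range(size)
        ]
    return merged


model_a = {"layer.weight": [1.0, 2.0]}
model_b = {"layer.weight": [3.0, 4.0]}
merged = linear_merge([model_a, model_b], weights=[1.0, 1.0])
```

Equal weights give a plain average of the two toy checkpoints. Real merges operate on tensors with identical architectures, and averaging is only one of several possible strategies.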
-
Towards Better Inclusivity: A Diverse Tweet Corpus of English Varieties
Authors:
Nhi Pham,
Lachlan Pham,
Adam L. Meyers
Abstract:
The prevalence of social media presents a growing opportunity to collect and analyse examples of English varieties. Whilst these varieties were - and, in many cases, still are - used only in spoken contexts or hard-to-access private messages, social media sites like Twitter provide a platform for users to communicate informally in a scrapeable format. Notably, Indian English (Hinglish), Singaporean English (Singlish), and African-American English (AAE) can be commonly found online. These varieties pose a challenge to existing natural language processing (NLP) tools as they often differ orthographically and syntactically from standard English, for which the majority of these tools are built. NLP models trained on standard English texts produce biased outcomes for users of underrepresented varieties. Some research has aimed to overcome the inherent biases caused by unrepresentative data through techniques like data augmentation or adjusting training models.
We aim to address the issue of bias at its root - the data itself. We curate a dataset of tweets from countries with high proportions of underserved English variety speakers, and propose an annotation framework of six categorical classifications along a pseudo-spectrum that measures the degree of standard English and that thereby indirectly aims to surface the manifestations of English varieties in these tweets. Following best annotation practices, our growing corpus features 170,800 tweets taken from 7 countries, labeled by annotators who are from those countries and can communicate in regionally-dominant varieties of English. Our corpus highlights the accuracy discrepancies in pre-trained language identifiers between western English and non-western (i.e., less standard) English varieties. We hope to contribute to the growing literature identifying and reducing the implicit demographic discrepancies in NLP.
Submitted 21 January, 2024;
originally announced January 2024.
-
Infectious disease surveillance needs for the United States: lessons from COVID-19
Authors:
Marc Lipsitch,
Mary T. Bassett,
John S. Brownstein,
Paul Elliott,
David Eyre,
M. Kate Grabowski,
James A. Hay,
Michael Johansson,
Stephen M. Kissler,
Daniel B. Larremore,
Jennifer Layden,
Justin Lessler,
Ruth Lynfield,
Duncan MacCannell,
Lawrence C. Madoff,
C. Jessica E. Metcalf,
Lauren A. Meyers,
Sylvia K. Ofori,
Celia Quinn,
Ana I. Ramos Bento,
Nick Reich,
Steven Riley,
Roni Rosenfeld,
Matthew H. Samore,
Rangarajan Sampath
, et al. (5 additional authors not shown)
Abstract:
The COVID-19 pandemic has highlighted the need to upgrade systems for infectious disease surveillance and forecasting and modeling of the spread of infection, both of which inform evidence-based public health guidance and policies. Here, we discuss requirements for an effective surveillance system to support decision making during a pandemic, drawing on the lessons of COVID-19 in the U.S., while looking to jurisdictions in the U.S. and beyond to learn lessons about the value of specific data types. In this report, we define the range of decisions for which surveillance data are required, the data elements needed to inform these decisions and to calibrate inputs and outputs of transmission-dynamic models, and the types of data needed to inform decisions by state, territorial, local, and tribal health authorities. We define actions needed to ensure that such data will be available and consider the contribution of such efforts to improving health equity.
Submitted 22 November, 2023;
originally announced November 2023.
-
Towards Automatic Honey Bee Flower-Patch Assays with Paint Marking Re-Identification
Authors:
Luke Meyers,
Josué Rodríguez Cordero,
Carlos Corrada Bravo,
Fanfan Noel,
José Agosto-Rivera,
Tugrul Giray,
Rémi Mégret
Abstract:
In this paper, we show that paint markings are a feasible approach to automate the analysis of behavioral assays involving honey bees in the field, where marking has to be as lightweight as possible. We contribute a novel dataset for bee re-identification with paint markings, with 4392 images and 27 identities. Contrastive learning with a ResNet backbone and triplet loss led to identity representation features with almost perfect recognition in a closed setting where identities are known in advance. Diverse experiments evaluate the capability to generalize to separate IDs, and show the impact of using different body parts for identification, such as using the unmarked abdomen only. In addition, we show the potential to fully automate the visit detection and provide preliminary results of compute time for future real-time deployment in the field on an edge device.
Submitted 13 November, 2023;
originally announced November 2023.
-
Surveillance Testing for Rapid Detection of Outbreaks in Facilities
Authors:
Yanyue Ding,
Sudesh K. Agrawal,
**cheng Cao,
Lauren Meyers,
John J. Hasenbein
Abstract:
This paper develops an agent-based disease spread model on a contact network to guide surveillance testing in small to moderate facilities such as nursing homes and meat-packing plants. The model employs Monte Carlo simulations of viral spread sample paths in the contact network. The original motivation was to detect COVID-19 outbreaks quickly in such facilities, but the model can be applied to any communicable disease. In particular, the model provides guidance on how many tests to administer each day and on the importance of the testing order among staff or workers.
Submitted 30 September, 2021;
originally announced October 2021.
-
Reinforcement Learning for Optimization of COVID-19 Mitigation policies
Authors:
Varun Kompella,
Roberto Capobianco,
Stacy Jong,
Jonathan Browne,
Spencer Fox,
Lauren Meyers,
Peter Wurman,
Peter Stone
Abstract:
The year 2020 has seen the COVID-19 virus lead to one of the worst global pandemics in history. As a result, governments around the world are faced with the challenge of protecting public health, while keeping the economy running to the greatest extent possible. Epidemiological models provide insight into the spread of these types of diseases and predict the effects of possible intervention policies. However, to date, even the most data-driven intervention policies rely on heuristics. In this paper, we study how reinforcement learning (RL) can be used to optimize mitigation policies that minimize the economic impact without overwhelming the hospital capacity. Our main contributions are (1) a novel agent-based pandemic simulator which, unlike traditional models, is able to model fine-grained interactions among people at specific locations in a community; and (2) an RL-based methodology for optimizing fine-grained mitigation policies within this simulator. Our results validate both the overall simulator behavior and the learned policies under realistic conditions.
Submitted 20 October, 2020;
originally announced October 2020.
-
Early Detection of Influenza outbreaks in the United States
Authors:
Kai Liu,
Ravi Srinivasan,
Lauren Ancel Meyers
Abstract:
Public health surveillance systems often fail to detect emerging infectious diseases, particularly in resource-limited settings. By integrating relevant clinical and internet-source data, we can close critical gaps in coverage and accelerate outbreak detection. Here, we present a multivariate algorithm that uses freely available online data to provide early warning of emerging influenza epidemics in the US. We evaluated 240 candidate predictors and found that the most predictive combination does not include surveillance or electronic health records data, but instead consists of eight Google search and Wikipedia pageview time series reflecting changing levels of interest in influenza-related topics. In cross validation on 2010-2016 data, this algorithm sounds alarms an average of 16.4 weeks prior to influenza activity reaching the Centers for Disease Control and Prevention (CDC) threshold for declaring the start of the season. In an out-of-sample test on data from the rapidly-emerging fall wave of the 2009 H1N1 pandemic, it recognized the threat five weeks in advance of this surveillance threshold. Simpler algorithms, including fixed week-of-the-year triggers, lag the optimized alarms by only a few weeks when detecting seasonal influenza, but fail to provide early warning in the 2009 pandemic scenario. This demonstrates a robust method for designing next generation outbreak detection algorithms. By combining scan statistics with machine learning, it identifies tractable combinations of data sources (from among thousands of candidates) that can provide early warning of emerging infectious disease threats worldwide.
Submitted 3 March, 2019;
originally announced March 2019.
-
Periodicity in Movement Patterns Shapes Epidemic Risk in Urban Environments
Authors:
Zhanwei Du,
Spencer J Fox,
Petter Holme,
Jiming Liu,
Alison P. Galvani,
Lauren Ancel Meyers
Abstract:
Daily variation in human mobility modulates the speed and severity of emerging outbreaks, yet most epidemiological studies assume static contact patterns. With a highly mobile population exceeding 24 million people, Shanghai, China is a transportation hub at high risk for the importation and subsequent global propagation of infectious diseases. Here, we use a dynamic metapopulation model informed by hourly transit data for Shanghai to estimate epidemic risks across thousands of outbreak scenarios. We find that the rate of initial epidemic growth varies by more than twenty-fold, depending on the hour and neighborhood of disease introduction. The riskiest introductions are those occurring close to the city center and on Fridays--which bridge weekday and weekend transit patterns and thereby connect otherwise disconnected portions of the population. The identification of these spatio-temporal hotspots can inform more efficient targets for sentinel surveillance and strategies for mitigating transmission.
Submitted 13 September, 2018;
originally announced September 2018.
-
Local risk perception enhances epidemic control
Authors:
José L. Herrera,
Lauren Ancel Meyers
Abstract:
As infectious disease outbreaks emerge, public health agencies often enact vaccination and social distancing measures to slow transmission. Their success depends on not only strategies and resources, but also public adherence. Individual willingness to take precautions may be influenced by global factors, such as news media, or local factors, such as infected family members or friends. Here, we compare three modes of epidemiological decision-making in the midst of a growing outbreak. Individuals decide whether to adopt a recommended intervention based on overall disease prevalence, the proportion of social contacts infected, or the number of social contacts infected. While all strategies can substantially mitigate transmission, vaccinating (or self isolating) based on the number of infected acquaintances is expected to achieve the greatest herd immunity and number of infections averted, while requiring the fewest intervention resources.
Submitted 17 August, 2018;
originally announced August 2018.
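The three decision modes compared in the abstract above (overall disease prevalence, proportion of social contacts infected, number of social contacts infected) can be made concrete with a small sketch. This is an illustration under stated assumptions rather than the authors' model: the function names `risk_signals` and `adopts`, the dictionary-of-sets contact network, and the idea of comparing each signal to a fixed threshold are all hypothetical.

```python
from typing import Dict, Set


def risk_signals(
    node: int, contacts: Dict[int, Set[int]], infected: Set[int]
) -> Dict[str, float]:
    """Compute the three risk signals one individual could base a decision on."""
    population = len(contacts)
    neighbours = contacts[node]
    infected_neighbours = len(neighbours & infected)
    return {
        # Global signal: share of the whole population currently infected.
        "prevalence": len(infected) / population if population else 0.0,
        # Local signal, relative: share of this individual's contacts infected.
        "fraction_contacts_infected": (
            infected_neighbours / len(neighbours) if neighbours else 0.0
        ),
        # Local signal, absolute: count of this individual's contacts infected.
        "number_contacts_infected": float(infected_neighbours),
    }


def adopts(signals: Dict[str, float], mode: str, threshold: float) -> bool:
    """Hypothetical rule: adopt the intervention once the chosen signal reaches a threshold."""
    return signals[mode] >= threshold


# Toy star network: individual 0 is connected to 1, 2 and 3; two contacts are infected.
contacts = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
infected = {1, 2}
signals = risk_signals(0, contacts, infected)
```

In this toy network the hub sees a prevalence of one half, two thirds of its contacts infected, and two infected contacts, so the same epidemic state can trigger different decisions depending on which signal and threshold are used.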
-
Socioeconomic bias in influenza surveillance
Authors:
Samuel V. Scarpino,
James G. Scott,
Rosalind M. Eggo,
Bruce Clements,
Nedialko B. Dimitrov,
Lauren Ancel Meyers
Abstract:
Individuals in low socioeconomic brackets are considered at-risk for developing influenza-related complications and often exhibit higher than average influenza-related hospitalization rates. This disparity has been attributed to various factors, including restricted access to preventative and therapeutic health care, limited sick leave, and household structure. Adequate influenza surveillance in these at-risk populations is a critical precursor to accurate risk assessments and effective intervention. However, the United States of America's primary national influenza surveillance system (ILINet) monitors outpatient healthcare providers, which may be largely inaccessible to lower socioeconomic populations. Recent initiatives to incorporate internet-source and hospital electronic medical records data into surveillance systems seek to improve the timeliness, coverage, and accuracy of outbreak detection and situational awareness. Here, we use a flexible statistical framework for integrating multiple surveillance data sources to evaluate the adequacy of traditional (ILINet) and next generation (BioSense 2.0 and Google Flu Trends) data for situational awareness of influenza across poverty levels. We find that zip codes in the highest poverty quartile are a critical blind-spot for ILINet that the integration of next generation data fails to ameliorate.
Submitted 1 April, 2018;
originally announced April 2018.
-
Multiscale Network Generation
Authors:
Alexander Gutfraind,
Lauren Ancel Meyers,
Ilya Safro
Abstract:
Networks are widely used in science and technology to represent relationships between entities, such as social or ecological links between organisms, enzymatic interactions in metabolic systems, or computer infrastructure. Statistical analyses of networks can provide critical insights into the structure, function, dynamics, and evolution of those systems. However, the structures of real-world networks are often not known completely, and they may exhibit considerable variation so that no single network is sufficiently representative of a system. In such situations, researchers may turn to proxy data from related systems, sophisticated methods for network inference, or synthetic networks. Here, we introduce a flexible method for synthesizing realistic ensembles of networks starting from a known network, through a series of mappings that coarsen and later refine the network structure by randomized editing. The method, MUSKETEER, preserves structural properties with minimal bias, including unknown or unspecified features, while introducing realistic variability at multiple scales. Using examples from several domains, we show that MUSKETEER produces the intended stochasticity while achieving greater fidelity across a suite of network properties than do other commonly used network generation algorithms.
Submitted 18 July, 2012;
originally announced July 2012.
-
The Impact of Past Epidemics on Future Disease Dynamics
Authors:
Shweta Bansal,
Lauren Ancel Meyers
Abstract:
Many pathogens spread primarily via direct contact between infected and susceptible hosts. Thus, the patterns of contacts or contact network of a population fundamentally shapes the course of epidemics. While there is a robust and growing theory for the dynamics of single epidemics in networks, we know little about the impacts of network structure on long term epidemic or endemic transmission. For seasonal diseases like influenza, pathogens repeatedly return to populations with complex and changing patterns of susceptibility and immunity acquired through prior infection. Here, we develop two mathematical approaches for modeling consecutive seasonal outbreaks of a partially-immunizing infection in a population with contact heterogeneity. Using methods from percolation theory we consider both leaky immunity, where all previously infected individuals gain partial immunity, and perfect immunity, where a fraction of previously infected individuals are fully immune. By restructuring the epidemiologically active portion of their host population, such diseases limit the potential of future outbreaks. We speculate that these dynamics can result in evolutionary pressure to increase infectiousness.
Submitted 12 October, 2009;
originally announced October 2009.
-
Early Real-time Estimation of Infectious Disease Reproduction Number
Authors:
Bahman Davoudi,
Babak Pourbohloul,
Joel Miller,
Rafael Meza,
Lauren Ancel Meyers,
David J. D. Earn
Abstract:
When an infectious disease strikes a population, the number of newly reported cases is often the only available information that one can obtain during early stages of the outbreak. An important goal of early outbreak analysis is to obtain a reliable estimate for the basic reproduction number, $R_{0}$, from the limited information available. We present a novel method that enables us to make a reliable real-time estimate of the reproduction number at a much earlier stage compared to other available methods. Our method takes into account the possibility that a disease has a wide distribution of infectious period and that the degree distribution of the contact network is heterogeneous. We validate our analytical framework with numerical simulations.
Submitted 7 February, 2012; v1 submitted 5 May, 2009;
originally announced May 2009.
-
Evolving Clustered Random Networks
Authors:
Shweta Bansal,
Shashank Khandelwal,
Lauren Ancel Meyers
Abstract:
We propose a Markov chain simulation method to generate simple connected random graphs with a specified degree sequence and level of clustering. The networks generated by our algorithm are random in all other respects and can thus serve as generic models for studying the impacts of degree distributions and clustering on dynamical processes as well as null models for detecting other structural properties in empirical networks.
Submitted 4 August, 2008;
originally announced August 2008.
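One standard building block for Markov chain graph samplers that hold a degree sequence fixed is the double edge swap. The sketch below shows only that building block and is not the algorithm from the abstract above: it does not enforce connectivity or steer the clustering level, both of which the abstract names as goals. The function name `double_edge_swaps`, the edge-list representation, and the fixed random seed are assumptions for illustration.

```python
import random
from typing import List, Tuple

Edge = Tuple[int, int]


def _ordered(u: int, v: int) -> Edge:
    """Store undirected edges with the smaller endpoint first."""
    return (u, v) if u < v else (v, u)


def double_edge_swaps(edges: List[Edge], n_steps: int, seed: int = 0) -> List[Edge]:
    """Randomise a simple graph while preserving every node's degree.

    Each step picks two edges (a, b) and (c, d) and proposes replacing them
    with (a, d) and (c, b). Proposals that would create a self-loop or a
    duplicate edge are rejected, so the graph stays simple.
    """
    rng = random.Random(seed)
    current = [_ordered(u, v) for u, v in edges]
    if len(current) < 2:
        return current
    present = set(current)
    for _ in range(n_steps):
        i, j = rng.sample(range(len(current)), 2)
        (a, b), (c, d) = current[i], current[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both ways of pairing the endpoints
        if a == d or c == b:
            continue  # would create a self-loop
        new_1, new_2 = _ordered(a, d), _ordered(c, b)
        if new_1 in present or new_2 in present:
            continue  # would create a multi-edge
        present.discard(current[i])
        present.discard(current[j])
        present.update((new_1, new_2))
        current[i], current[j] = new_1, new_2
    return current


# Toy input: a ring of six nodes, so every node has degree 2.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)]
shuffled = double_edge_swaps(ring, n_steps=200, seed=1)
```

Every accepted swap removes one edge end and adds one edge end at each of the four nodes involved, which is why the degree sequence is unchanged. A fuller sampler would add an acceptance rule for clustering and a connectivity check on top of this step.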
-
SIR epidemics in dynamic contact networks
Authors:
Erik Volz,
Lauren Ancel Meyers
Abstract:
Contact patterns in populations fundamentally influence the spread of infectious diseases. Current mathematical methods for epidemiological forecasting on networks largely assume that contacts between individuals are fixed, at least for the duration of an outbreak. In reality, contact patterns may be quite fluid, with individuals frequently making and breaking social or sexual relationships. Here we develop a mathematical approach to predicting disease transmission on dynamic networks in which each individual has a characteristic behavior (typical contact number), but the identities of their contacts change in time. We show that dynamic contact patterns shape epidemiological dynamics in ways that cannot be adequately captured in static network models or mass-action models. Our new model interpolates smoothly between static network models and mass-action models using a mixing parameter, thereby providing a bridge between disparate classes of epidemiological models. Using epidemiological and sexual contact data from an Atlanta high school, we then demonstrate the utility of this method for forecasting and controlling sexually transmitted disease outbreaks.
Submitted 15 May, 2007;
originally announced May 2007.
-
A Comparative Analysis of Influenza Vaccination Programs
Authors:
Shweta Bansal,
Babak Pourbohloul,
Lauren Ancel Meyers
Abstract:
The threat of avian influenza and the 2004-2005 influenza vaccine supply shortage in the United States have sparked a debate about optimal vaccination strategies to reduce the burden of morbidity and mortality caused by the influenza virus. We present a comparative analysis of two classes of suggested vaccination strategies: mortality-based strategies that target high-risk populations and morbidity-based strategies that target high-prevalence populations. Applying the methods of contact network epidemiology to a model of disease transmission in a large urban population, we evaluate the efficacy of these strategies across a wide range of viral transmission rates and for two different age-specific mortality distributions. We find that the optimal strategy depends critically on the viral transmission level (reproductive rate) of the virus: morbidity-based strategies outperform mortality-based strategies for moderately transmissible strains, while the reverse is true for highly transmissible strains. These results hold for a range of mortality rates reported for prior influenza epidemics and pandemics. Furthermore, we show that vaccination delays and multiple introductions of disease into the community have a more detrimental impact on morbidity-based strategies than mortality-based strategies. If public health officials have reasonable estimates of the viral transmission rate and the frequency of new introductions into the community prior to an outbreak, then these methods can guide the design of optimal vaccination priorities. When such information is unreliable or not available, as is often the case, this study recommends mortality-based vaccination priorities.
Submitted 3 October, 2006; v1 submitted 22 January, 2006;
originally announced January 2006.