-
Using digital traces to build prospective and real-time county-level early warning systems to anticipate COVID-19 outbreaks in the United States
Authors:
Lucas M. Stolerman,
Leonardo Clemente,
Canelle Poirier,
Kris V. Parag,
Atreyee Majumder,
Serge Masyn,
Bernd Resch,
Mauricio Santillana
Abstract:
The ongoing COVID-19 pandemic continues to affect communities around the world. To date, almost 6 million people have died as a consequence of COVID-19, and more than one-quarter of a billion people are estimated to have been infected worldwide. The design of appropriate and timely mitigation strategies to curb the effects of this and future disease outbreaks requires close monitoring of their spatio-temporal trajectories. We present machine learning methods to anticipate sharp increases in COVID-19 activity in US counties in real-time. Our methods leverage Internet-based digital traces -- e.g., disease-related Internet search activity from the general population and clinicians, disease-relevant Twitter micro-blogs, and outbreak trajectories from neighboring locations -- to monitor potential changes in population-level health trends. Motivated by the need for finer spatial-resolution epidemiological insights to improve local decision-making, we build upon previous retrospective research efforts originally conceived at the state level and in the early months of the pandemic. Our methods -- tested in real-time and in an out-of-sample manner on a subset of 97 counties distributed across the US -- frequently anticipated sharp increases in COVID-19 activity 1-6 weeks before the onset of local outbreaks (defined as the time when the effective reproduction number $R_t$ becomes larger than 1 consistently). Given the continued emergence of COVID-19 variants of concern -- such as the most recent one, Omicron -- and the fact that multiple countries have not had full access to vaccines, the framework we present, while conceived for the county-level in the US, could be helpful in countries where similar data sources are available.
Submitted 12 March, 2022;
originally announced March 2022.
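The outbreak-onset definition in the abstract above (the time when $R_t$ becomes larger than 1 consistently) can be illustrated with a short sketch. The abstract does not say how "consistently" is quantified, so the `min_consecutive` parameter and the function name `outbreak_onset` are assumptions made only for illustration; this is not the authors' procedure.

```python
def outbreak_onset(rt_series, min_consecutive=3):
    """Return the index where R_t first exceeds 1 for `min_consecutive`
    steps in a row, or None if that never happens.

    `min_consecutive` is a hypothetical stand-in for "consistently";
    the abstract does not specify a value.
    """
    run_start = None
    run_length = 0
    for i, rt in enumerate(rt_series):
        if rt > 1.0:
            if run_length == 0:
                run_start = i  # a new run above 1 begins here
            run_length += 1
            if run_length >= min_consecutive:
                return run_start
        else:
            run_length = 0  # run broken; start over
    return None


# Illustrative (made-up) weekly R_t estimates for one county.
weekly_rt = [0.8, 1.1, 0.9, 1.05, 1.2, 1.3, 0.95]
onset_week = outbreak_onset(weekly_rt)
```

A single excursion above 1 (index 1 in the example) is ignored; only a sustained run counts as an onset.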
-
High-resolution Spatio-temporal Model for County-level COVID-19 Activity in the U.S
Authors:
Shixiang Zhu,
Alexander Bukharin,
Liyan Xie,
Mauricio Santillana,
Shihao Yang,
Yao Xie
Abstract:
We present an interpretable high-resolution spatio-temporal model to estimate COVID-19 deaths together with confirmed cases one week ahead of the current time, at the county-level and weekly aggregated, in the United States. A notable feature of our spatio-temporal model is that it considers the (a) temporal auto- and pairwise correlation of the two local time series (confirmed cases and deaths of COVID-19), (b) dynamics between locations (propagation between counties), and (c) covariates such as local within-community mobility and social demographic factors. The within-community mobility and demographic factors, such as total population and the proportion of the elderly, are included as important predictors since they are hypothesized to be important in determining the dynamics of COVID-19. To reduce the model's high-dimensionality, we impose sparsity structures as constraints and emphasize the impact of the top ten metropolitan areas in the nation, which we refer to (and treat within our models) as hubs in spreading the disease. Our retrospective out-of-sample county-level predictions were able to forecast the subsequently observed COVID-19 activity accurately. The proposed multi-variate predictive models were designed to be highly interpretable, with clear identification and quantification of the most important factors that determine the dynamics of COVID-19. Ongoing work involves incorporating more covariates, such as education and income, to improve prediction accuracy and model interpretability.
Submitted 20 August, 2021; v1 submitted 15 September, 2020;
originally announced September 2020.
-
An Early Warning Approach to Monitor COVID-19 Activity with Multiple Digital Traces in Near Real-Time
Authors:
Nicole E. Kogan,
Leonardo Clemente,
Parker Liautaud,
Justin Kaashoek,
Nicholas B. Link,
Andre T. Nguyen,
Fred S. Lu,
Peter Huybers,
Bernd Resch,
Clemens Havas,
Andreas Petutschnig,
Jessica Davis,
Matteo Chinazzi,
Backtosch Mustafa,
William P. Hanage,
Alessandro Vespignani,
Mauricio Santillana
Abstract:
Non-pharmaceutical interventions (NPIs) have been crucial in curbing COVID-19 in the United States (US). Consequently, relaxing NPIs through a phased re-opening of the US amid still-high levels of COVID-19 susceptibility could lead to new epidemic waves. This calls for a COVID-19 early warning system. Here we evaluate multiple digital data streams as early warning indicators of increasing or decreasing state-level US COVID-19 activity between January and June 2020. We estimate the timing of sharp changes in each data stream using a simple Bayesian model that calculates in near real-time the probability of exponential growth or decay. Analysis of COVID-19-related activity on social network microblogs, Internet searches, point-of-care medical software, and a metapopulation mechanistic model, as well as fever anomalies captured by smart thermometer networks, shows exponential growth roughly 2-3 weeks prior to comparable growth in confirmed COVID-19 cases and 3-4 weeks prior to comparable growth in COVID-19 deaths across the US over the last 6 months. We further observe exponential decay in confirmed cases and deaths 5-6 weeks after implementation of NPIs, as measured by anonymized and aggregated human mobility data from mobile phones. Finally, we propose a combined indicator for exponential growth in multiple data streams that may aid in developing an early warning system for future COVID-19 outbreaks. These efforts represent an initial exploratory framework, and both continued study of the predictive power of digital indicators as well as further development of the statistical approach are needed.
Submitted 3 July, 2020; v1 submitted 1 July, 2020;
originally announced July 2020.
-
Fever and mobility data indicate social distancing has reduced incidence of communicable disease in the United States
Authors:
Parker Liautaud,
Peter Huybers,
Mauricio Santillana
Abstract:
In March of 2020, many U.S. state governments encouraged or mandated restrictions on social interactions to slow the spread of COVID-19, the disease caused by the novel coronavirus SARS-CoV-2 that has spread to nearly 180 countries. Estimating the effectiveness of these social-distancing strategies is challenging because surveillance of COVID-19 has been limited, with tests generally being prioritized for high-risk or hospitalized cases according to temporally and regionally varying criteria. Here we show that reductions in mobility across U.S. counties with at least 100 confirmed cases of COVID-19 led to reductions in fever incidences, as captured by smart thermometers, after a mean lag of 6.5 days ($90\%$ within 3--10 days) that is consistent with the incubation period of COVID-19. Furthermore, counties with larger decreases in mobility subsequently achieved greater reductions in fevers ($p<0.01$), with the notable exception of New York City and its immediate vicinity. These results indicate that social distancing has reduced the transmission of influenza-like illnesses, including COVID-19, and support social distancing as an effective strategy for slowing the spread of COVID-19.
Submitted 21 April, 2020;
originally announced April 2020.
-
A machine learning methodology for real-time forecasting of the 2019-2020 COVID-19 outbreak using Internet searches, news alerts, and estimates from mechanistic models
Authors:
Dianbo Liu,
Leonardo Clemente,
Canelle Poirier,
Xiyu Ding,
Matteo Chinazzi,
Jessica T Davis,
Alessandro Vespignani,
Mauricio Santillana
Abstract:
We present a timely and novel methodology that combines disease estimates from mechanistic models with digital traces, via interpretable machine-learning methodologies, to reliably forecast COVID-19 activity in Chinese provinces in real-time. Specifically, our method is able to produce stable and accurate forecasts 2 days ahead of current time, and uses as inputs (a) official health reports from the Chinese Center for Disease Control and Prevention (China CDC), (b) COVID-19-related internet search activity from Baidu, (c) news media activity reported by Media Cloud, and (d) daily forecasts of COVID-19 activity from GLEAM, an agent-based mechanistic model. Our machine-learning methodology uses a clustering technique that enables the exploitation of geo-spatial synchronicities of COVID-19 activity across Chinese provinces, and a data augmentation technique to deal with the small number of historical disease activity observations, characteristic of emerging outbreaks. Our model's predictive power outperforms a collection of baseline models in 27 out of the 32 Chinese provinces, and could be easily extended to other geographies currently affected by the COVID-19 outbreak to help decision makers.
Submitted 8 April, 2020;
originally announced April 2020.
-
Towards the Use of Neural Networks for Influenza Prediction at Multiple Spatial Resolutions
Authors:
Emily L. Aiken,
Andre T. Nguyen,
Mauricio Santillana
Abstract:
We introduce the use of a Gated Recurrent Unit (GRU) for influenza prediction at the state- and city-level in the US, and experiment with the inclusion of real-time flu-related Internet search data. We find that a GRU has lower prediction error than current state-of-the-art methods for data-driven influenza prediction at time horizons of over two weeks. In contrast with other machine learning approaches, the inclusion of real-time Internet search data does not improve GRU predictions.
Submitted 13 November, 2019; v1 submitted 6 November, 2019;
originally announced November 2019.
-
Advances in using Internet searches to track dengue
Authors:
Shihao Yang,
S. C. Kou,
Fred Lu,
John S. Brownstein,
Nicholas Brooke,
Mauricio Santillana
Abstract:
Dengue is a mosquito-borne disease that threatens more than half of the world's population. Despite being endemic to over 100 countries, government-led efforts and mechanisms to identify and track the emergence of new infections in a timely manner are still lacking in many affected areas. Multiple methodologies that leverage the use of Internet-based data sources have been proposed as a way to complement dengue surveillance efforts. Among these, the trends in dengue-related Google searches have been shown to correlate with dengue activity. We extend a methodological framework, initially proposed and validated for flu surveillance, to produce near real-time estimates of dengue cases in five countries/regions: Mexico, Brazil, Thailand, Singapore and Taiwan. Our results show that our modeling framework can be used to improve the tracking of dengue activity in multiple locations around the world.
Submitted 8 December, 2016;
originally announced December 2016.
-
Relatedness of the Incidence Decay with Exponential Adjustment (IDEA) Model, "Farr's Law" and Compartmental Difference Equation SIR Models
Authors:
Mauricio Santillana,
Ashleigh Tuite,
Tahmina Nasserie,
Paul Fine,
David Champredon,
Leonid Chindelevitch,
Jonathan Dushoff,
David Fisman
Abstract:
Mathematical models are often regarded as recent innovations in the description and analysis of infectious disease outbreaks and epidemics, but simple models have been in use for projection of epidemic trajectories for more than a century. We recently described a single equation model (the incidence decay with exponential adjustment, or IDEA, model) that can be used for short term forecasting. In the mid-19th century, Dr. William Farr developed a single equation approach (Farr's law) for epidemic forecasting. We show here that the two models are in fact identical, and can be expressed in terms of one another, and also in terms of a susceptible-infectious-removed (SIR) compartmental model with improving control. This demonstrates that the concept of the reproduction number, R0, is implicit to Farr's (pre-microbial era) work, and also suggests that control of epidemics, whether via behavior change or intervention, is as integral to the natural history of epidemics as is the dynamics of disease transmission.
Submitted 3 March, 2016;
originally announced March 2016.
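The single-equation IDEA model mentioned in the abstract above can be sketched in a few lines. The abstract does not state the equation; the form used here, $I(t) = \left(R_0 / (1+d)^t\right)^t$ with $t$ counted in generations and $d$ a control (discount) parameter, is the form commonly given in the IDEA literature and should be read as an assumption of this sketch. The function name `idea_incidence` and the parameter values are illustrative only.

```python
def idea_incidence(r0, d, t):
    """Incident cases in generation t under the assumed IDEA form
    I(t) = (R0 / (1 + d)**t)**t.

    r0: basic reproduction number
    d:  control/discount parameter (d = 0 means no control)
    t:  time in units of disease generations
    """
    return (r0 / (1.0 + d) ** t) ** t


# Made-up parameters: projected incidence over the first ten generations.
projection = [idea_incidence(r0=2.0, d=0.1, t=t) for t in range(10)]
```

With `d = 0` the expression reduces to pure exponential growth, `r0 ** t`; any `d > 0` eventually bends the curve downward, which is the "improving control" behaviour the abstract refers to.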
-
Cloud-based Electronic Health Records for Real-time, Region-specific Influenza Surveillance
Authors:
Mauricio Santillana,
Andre Nguyen,
Tamara Louie,
Anna Zink,
Josh Gray,
Iyue Sung,
John S. Brownstein
Abstract:
Accurate real-time monitoring systems of influenza outbreaks help public health officials make informed decisions that may help save lives. We show that information extracted from cloud-based electronic health records databases, in combination with machine learning techniques and historical epidemiological information, has the potential to accurately and reliably provide near real-time regional predictions of flu outbreaks in the United States.
Submitted 12 December, 2015;
originally announced December 2015.
-
Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance
Authors:
Mauricio Santillana,
Andre T. Nguyen,
Mark Dredze,
Michael J. Paul,
John S. Brownstein
Abstract:
We present a machine learning-based methodology capable of providing real-time ("nowcast") and forecast estimates of influenza activity in the US by leveraging data from multiple data sources including: Google searches, Twitter microblogs, nearly real-time hospital visit records, and data from a participatory surveillance system. Our main contribution consists of combining multiple influenza-like illness (ILI) activity estimates, generated independently with each data source, into a single prediction of ILI utilizing machine learning ensemble approaches. Our methodology exploits the information in each data source and produces accurate weekly ILI predictions for up to four weeks ahead of the release of CDC's ILI reports. We evaluate the predictive ability of our ensemble approach during the 2013-2014 (retrospective) and 2014-2015 (live) flu seasons for each of the four weekly time horizons. Our ensemble approach demonstrates several advantages: (1) our ensemble method's predictions outperform every prediction using each data source independently, (2) our methodology can produce predictions one week ahead of GFT's real-time estimates with comparable accuracy, and (3) our two and three week forecast estimates have comparable accuracy to real-time predictions using an autoregressive model. Moreover, our results show that considerable insight is gained from incorporating disparate data streams, in the form of social media and crowd sourced data, into influenza predictions in all time horizons.
Submitted 27 August, 2015;
originally announced August 2015.
-
Estimating numerical errors due to operator splitting in global atmospheric chemistry models: Transport and chemistry
Authors:
Mauricio Santillana,
Ling Zhang,
Robert Yantosca
Abstract:
We present upper bounds for the numerical errors introduced when using operator splitting methods to integrate transport and non-linear chemistry processes in global chemical transport models (CTM). We show that (a) operator splitting strategies that evaluate the stiff non-linear chemistry operator at the end of the time step are more accurate, and (b) the results of numerical simulations that use different operator splitting strategies differ by at most 10 percent, in a prototype one-dimensional non-linear chemistry-transport model. We find similar upper bounds in operator splitting numerical errors in global CTM simulations.
Submitted 6 November, 2015; v1 submitted 11 May, 2015;
originally announced May 2015.
-
Accurate estimation of influenza epidemics using Google search data via ARGO
Authors:
Shihao Yang,
Mauricio Santillana,
S. C. Kou
Abstract:
Accurate real-time tracking of influenza outbreaks helps public health officials make timely and meaningful decisions that could save lives. We propose an influenza tracking model, ARGO (AutoRegression with GOogle search data), that uses publicly available online search data. In addition to having a rigorous statistical foundation, ARGO outperforms all previously available Google-search-based tracking models, including the latest version of Google Flu Trends, even though it uses only low-quality search data as input from publicly available Google Trends and Google Correlate websites. ARGO not only incorporates the seasonality in influenza epidemics but also captures changes in people's online search behavior over time. ARGO is also flexible, self-correcting, robust, and scalable, making it a potentially powerful tool that can be used for real-time tracking of other social events at multiple temporal and spatial resolutions.
Submitted 16 November, 2015; v1 submitted 4 May, 2015;
originally announced May 2015.
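The general idea behind the abstract above, an autoregression on past influenza activity combined with search data, can be illustrated with a deliberately small sketch. It uses one autoregressive lag, one search series, and plain ordinary least squares; all three are simplifying assumptions made here, since the abstract does not specify the lag structure, the predictors, or the fitting procedure. The names `fit_ar_with_search` and `solve_linear_system` are hypothetical, and this is not the authors' implementation.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ar_with_search(ili, search):
    """Fit ili[t] ~ b0 + b1 * ili[t-1] + b2 * search[t] by least squares.

    Returns [b0, b1, b2]. Both inputs are equal-length weekly series.
    """
    rows = [[1.0, ili[t - 1], search[t]] for t in range(1, len(ili))]
    targets = [ili[t] for t in range(1, len(ili))]
    k = 3
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic (made-up) data generated from known coefficients.
search = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 1.0, 0.0, 6.0]
ili = [1.0]
for t in range(1, len(search)):
    ili.append(0.5 + 0.6 * ili[t - 1] + 0.3 * search[t])
coefficients = fit_ar_with_search(ili, search)
```

Because the synthetic series is noise-free, the fit recovers the generating coefficients up to floating-point error; real surveillance data would call for many more lags and search terms, and likely some form of regularization.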
-
Quantifying the loss of information in source attribution problems using the adjoint method in global models of atmospheric chemical transport
Authors:
Mauricio Santillana
Abstract:
It is of crucial importance to be able to identify the location of atmospheric pollution sources on our planet. Global models of atmospheric transport in combination with diverse Earth observing systems are a natural choice to achieve this goal. It is shown that the ability to successfully reconstruct the location and magnitude of an instantaneous source in global chemical transport models (CTMs) decreases rapidly as a function of the time interval between the pollution release and the observation time. A simple way to quantitatively characterize this phenomenon is proposed based on the effective (undesired) numerical diffusion present in current Eulerian CTMs and verified using idealized numerical experiments. The approach presented consists of using the adjoint-based optimization method in a state-of-the-art CTM, GEOS-Chem, to reconstruct the location and magnitude of a realistic pollution plume for multiple time scales. The findings obtained from these numerical experiments suggest a time scale of 2 days after which the accuracy of the adjoint-based optimization methodology is compromised considerably in current global CTMs. In conjunction with the mean atmospheric velocity, the aforementioned time scale leads to an estimate of a length scale of about 1700 km, downwind from the source, beyond which measurements, in conjunction with current global CTMs, may not be successfully utilized to reconstruct continuous-in-time sources. The approach presented here can be utilized to characterize the capabilities and limitations of adjoint-based optimization inversions in other regional and global Eulerian CTMs.
Submitted 25 November, 2013;
originally announced November 2013.