Search | arXiv e-print repository

arXiv:2203.06505 [pdf, other]

Using digital traces to build prospective and real-time county-level early warning systems to anticipate COVID-19 outbreaks in the United States

Authors: Lucas M. Stolerman, Leonardo Clemente, Canelle Poirier, Kris V. Parag, Atreyee Majumder, Serge Masyn, Bernd Resch, Mauricio Santillana

Abstract: The ongoing COVID-19 pandemic continues to affect communities around the world. To date, almost 6 million people have died as a consequence of COVID-19, and more than one-quarter of a billion people are estimated to have been infected worldwide. The design of appropriate and timely mitigation strategies to curb the effects of this and future disease outbreaks requires close monitoring of their spatio-temporal trajectories. We present machine learning methods to anticipate sharp increases in COVID-19 activity in US counties in real-time. Our methods leverage Internet-based digital traces -- e.g., disease-related Internet search activity from the general population and clinicians, disease-relevant Twitter micro-blogs, and outbreak trajectories from neighboring locations -- to monitor potential changes in population-level health trends. Motivated by the need for finer spatial-resolution epidemiological insights to improve local decision-making, we build upon previous retrospective research efforts originally conceived at the state level and in the early months of the pandemic. Our methods -- tested in real-time and in an out-of-sample manner on a subset of 97 counties distributed across the US -- frequently anticipated sharp increases in COVID-19 activity 1-6 weeks before the onset of local outbreaks (defined as the time when the effective reproduction number $R_t$ becomes larger than 1 consistently). Given the continued emergence of COVID-19 variants of concern -- such as the most recent one, Omicron -- and the fact that multiple countries have not had full access to vaccines, the framework we present, while conceived for the county-level in the US, could be helpful in countries where similar data sources are available.

Submitted 12 March, 2022; originally announced March 2022.

arXiv:2009.07356 [pdf, other]

High-resolution Spatio-temporal Model for County-level COVID-19 Activity in the U.S

Authors: Shixiang Zhu, Alexander Bukharin, Liyan Xie, Mauricio Santillana, Shihao Yang, Yao Xie

Abstract: We present an interpretable high-resolution spatio-temporal model to estimate COVID-19 deaths together with confirmed cases one week ahead of the current time, at the county level and weekly aggregated, in the United States. A notable feature of our spatio-temporal model is that it considers (a) the temporal auto- and pairwise correlation of the two local time series (confirmed cases and deaths from COVID-19), (b) dynamics between locations (propagation between counties), and (c) covariates such as local within-community mobility and social demographic factors. The within-community mobility and demographic factors, such as total population and the proportion of the elderly, are included as important predictors since they are hypothesized to be important in determining the dynamics of COVID-19. To reduce the model's high dimensionality, we impose sparsity structures as constraints and emphasize the impact of the top ten metropolitan areas in the nation, which we refer to (and treat within our models) as hubs in spreading the disease. Our retrospective out-of-sample county-level predictions were able to forecast the subsequently observed COVID-19 activity accurately. The proposed multi-variate predictive models were designed to be highly interpretable, with clear identification and quantification of the most important factors that determine the dynamics of COVID-19. Ongoing work involves incorporating more covariates, such as education and income, to improve prediction accuracy and model interpretability.

Submitted 20 August, 2021; v1 submitted 15 September, 2020; originally announced September 2020.

arXiv:2007.00756 [pdf, other]

An Early Warning Approach to Monitor COVID-19 Activity with Multiple Digital Traces in Near Real-Time

Authors: Nicole E. Kogan, Leonardo Clemente, Parker Liautaud, Justin Kaashoek, Nicholas B. Link, Andre T. Nguyen, Fred S. Lu, Peter Huybers, Bernd Resch, Clemens Havas, Andreas Petutschnig, Jessica Davis, Matteo Chinazzi, Backtosch Mustafa, William P. Hanage, Alessandro Vespignani, Mauricio Santillana

Abstract: Non-pharmaceutical interventions (NPIs) have been crucial in curbing COVID-19 in the United States (US). Consequently, relaxing NPIs through a phased re-opening of the US amid still-high levels of COVID-19 susceptibility could lead to new epidemic waves. This calls for a COVID-19 early warning system. Here we evaluate multiple digital data streams as early warning indicators of increasing or decreasing state-level US COVID-19 activity between January and June 2020. We estimate the timing of sharp changes in each data stream using a simple Bayesian model that calculates in near real-time the probability of exponential growth or decay. Analysis of COVID-19-related activity on social network microblogs, Internet searches, point-of-care medical software, and a metapopulation mechanistic model, as well as fever anomalies captured by smart thermometer networks, shows exponential growth roughly 2-3 weeks prior to comparable growth in confirmed COVID-19 cases and 3-4 weeks prior to comparable growth in COVID-19 deaths across the US over the last 6 months. We further observe exponential decay in confirmed cases and deaths 5-6 weeks after implementation of NPIs, as measured by anonymized and aggregated human mobility data from mobile phones. Finally, we propose a combined indicator for exponential growth in multiple data streams that may aid in developing an early warning system for future COVID-19 outbreaks. These efforts represent an initial exploratory framework, and both continued study of the predictive power of digital indicators as well as further development of the statistical approach are needed.

Submitted 3 July, 2020; v1 submitted 1 July, 2020; originally announced July 2020.

arXiv:2004.09911 [pdf, other]

Fever and mobility data indicate social distancing has reduced incidence of communicable disease in the United States

Authors: Parker Liautaud, Peter Huybers, Mauricio Santillana

Abstract: In March of 2020, many U.S. state governments encouraged or mandated restrictions on social interactions to slow the spread of COVID-19, the disease caused by the novel coronavirus SARS-CoV-2 that has spread to nearly 180 countries. Estimating the effectiveness of these social-distancing strategies is challenging because surveillance of COVID-19 has been limited, with tests generally being prioritized for high-risk or hospitalized cases according to temporally and regionally varying criteria. Here we show that reductions in mobility across U.S. counties with at least 100 confirmed cases of COVID-19 led to reductions in fever incidences, as captured by smart thermometers, after a mean lag of 6.5 days ($90\%$ within 3--10 days) that is consistent with the incubation period of COVID-19. Furthermore, counties with larger decreases in mobility subsequently achieved greater reductions in fevers ($p<0.01$), with the notable exception of New York City and its immediate vicinity. These results indicate that social distancing has reduced the transmission of influenza-like illnesses, including COVID-19, and support social distancing as an effective strategy for slowing the spread of COVID-19.

Submitted 21 April, 2020; originally announced April 2020.

Comments: 11 pages, 3 figures

arXiv:2004.04019 [pdf, other]

A machine learning methodology for real-time forecasting of the 2019-2020 COVID-19 outbreak using Internet searches, news alerts, and estimates from mechanistic models

Authors: Dianbo Liu, Leonardo Clemente, Canelle Poirier, Xiyu Ding, Matteo Chinazzi, Jessica T Davis, Alessandro Vespignani, Mauricio Santillana

Abstract: We present a timely and novel methodology that combines disease estimates from mechanistic models with digital traces, via interpretable machine-learning methodologies, to reliably forecast COVID-19 activity in Chinese provinces in real-time. Specifically, our method is able to produce stable and accurate forecasts 2 days ahead of current time, and uses as inputs (a) official health reports from the Chinese Center for Disease Control and Prevention (China CDC), (b) COVID-19-related internet search activity from Baidu, (c) news media activity reported by Media Cloud, and (d) daily forecasts of COVID-19 activity from GLEAM, an agent-based mechanistic model. Our machine-learning methodology uses a clustering technique that enables the exploitation of geo-spatial synchronicities of COVID-19 activity across Chinese provinces, and a data augmentation technique to deal with the small number of historical disease activity observations, characteristic of emerging outbreaks. Our model's predictive power outperforms a collection of baseline models in 27 out of the 32 Chinese provinces, and could be easily extended to other geographies currently affected by the COVID-19 outbreak to help decision makers.

Submitted 8 April, 2020; originally announced April 2020.

arXiv:1911.02673 [pdf, other]

Towards the Use of Neural Networks for Influenza Prediction at Multiple Spatial Resolutions

Authors: Emily L. Aiken, Andre T. Nguyen, Mauricio Santillana

Abstract: We introduce the use of a Gated Recurrent Unit (GRU) for influenza prediction at the state- and city-level in the US, and experiment with the inclusion of real-time flu-related Internet search data. We find that a GRU has lower prediction error than current state-of-the-art methods for data-driven influenza prediction at time horizons of over two weeks. In contrast with other machine learning approaches, the inclusion of real-time Internet search data does not improve GRU predictions.

Submitted 13 November, 2019; v1 submitted 6 November, 2019; originally announced November 2019.

Comments: Machine Learning for Health (ML4H) at NeurIPS 2019 - Extended Abstract; Added Footer

arXiv:1612.02812 [pdf, other]

Advances in using Internet searches to track dengue

Authors: Shihao Yang, S. C. Kou, Fred Lu, John S. Brownstein, Nicholas Brooke, Mauricio Santillana

Abstract: Dengue is a mosquito-borne disease that threatens more than half of the world's population. Despite being endemic to over 100 countries, government-led efforts and mechanisms to timely identify and track the emergence of new infections are still lacking in many affected areas. Multiple methodologies that leverage the use of Internet-based data sources have been proposed as a way to complement dengue surveillance efforts. Among these, the trends in dengue-related Google searches have been shown to correlate with dengue activity. We extend a methodological framework, initially proposed and validated for flu surveillance, to produce near real-time estimates of dengue cases in five countries/regions: Mexico, Brazil, Thailand, Singapore and Taiwan. Our result shows that our modeling framework can be used to improve the tracking of dengue activity in multiple locations around the world.

Submitted 8 December, 2016; originally announced December 2016.

arXiv:1603.01134 [pdf, other]

Relatedness of the Incidence Decay with Exponential Adjustment (IDEA) Model, "Farr's Law" and Compartmental Difference Equation SIR Models

Authors: Mauricio Santillana, Ashleigh Tuite, Tahmina Nasserie, Paul Fine, David Champredon, Leonid Chindelevitch, Jonathan Dushoff, David Fisman

Abstract: Mathematical models are often regarded as recent innovations in the description and analysis of infectious disease outbreaks and epidemics, but simple models have been in use for projection of epidemic trajectories for more than a century. We recently described a single equation model (the incidence decay with exponential adjustment, or IDEA, model) that can be used for short term forecasting. In the mid-19th century, Dr. William Farr developed a single equation approach (Farr's law) for epidemic forecasting. We show here that the two models are in fact identical, and can be expressed in terms of one another, and also in terms of a susceptible-infectious-removed (SIR) compartmental model with improving control. This demonstrates that the concept of the reproduction number, R0, is implicit to Farr's (pre-microbial era) work, and also suggests that control of epidemics, whether via behavior change or intervention, is as integral to the natural history of epidemics as is the dynamics of disease transmission.

Submitted 3 March, 2016; originally announced March 2016.

arXiv:1512.03990 [pdf]

doi 10.1038/srep25732

Cloud-based Electronic Health Records for Real-time, Region-specific Influenza Surveillance

Authors: Mauricio Santillana, Andre Nguyen, Tamara Louie, Anna Zink, Josh Gray, Iyue Sung, John S. Brownstein

Abstract: Accurate real-time monitoring systems of influenza outbreaks help public health officials make informed decisions that may help save lives. We show that information extracted from cloud-based electronic health records databases, in combination with machine learning techniques and historical epidemiological information, has the potential to accurately and reliably provide near real-time regional predictions of flu outbreaks in the United States.

Submitted 12 December, 2015; originally announced December 2015.

arXiv:1508.06941 [pdf]

doi 10.1371/journal.pcbi.1004513

Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance

Authors: Mauricio Santillana, Andre T. Nguyen, Mark Dredze, Michael J. Paul, John S. Brownstein

Abstract: We present a machine learning-based methodology capable of providing real-time ("nowcast") and forecast estimates of influenza activity in the US by leveraging data from multiple data sources including: Google searches, Twitter microblogs, nearly real-time hospital visit records, and data from a participatory surveillance system. Our main contribution consists of combining multiple influenza-like illnesses (ILI) activity estimates, generated independently with each data source, into a single prediction of ILI utilizing machine learning ensemble approaches. Our methodology exploits the information in each data source and produces accurate weekly ILI predictions for up to four weeks ahead of the release of CDC's ILI reports. We evaluate the predictive ability of our ensemble approach during the 2013-2014 (retrospective) and 2014-2015 (live) flu seasons for each of the four weekly time horizons. Our ensemble approach demonstrates several advantages: (1) our ensemble method's predictions outperform every prediction using each data source independently, (2) our methodology can produce predictions one week ahead of GFT's real-time estimates with comparable accuracy, and (3) our two and three week forecast estimates have comparable accuracy to real-time predictions using an autoregressive model. Moreover, our results show that considerable insight is gained from incorporating disparate data streams, in the form of social media and crowd sourced data, into influenza predictions in all time horizons.

Submitted 27 August, 2015; originally announced August 2015.

arXiv:1505.02835 [pdf, other]

doi 10.1016/j.jcp.2015.10.052

Estimating numerical errors due to operator splitting in global atmospheric chemistry models: Transport and chemistry

Authors: Mauricio Santillana, Ling Zhang, Robert Yantosca

Abstract: We present upper bounds for the numerical errors introduced when using operator splitting methods to integrate transport and non-linear chemistry processes in global chemical transport models (CTM). We show that (a) operator splitting strategies that evaluate the stiff non-linear chemistry operator at the end of the time step are more accurate, and (b) the results of numerical simulations that use different operator splitting strategies differ by at most 10 percent, in a prototype one-dimensional non-linear chemistry-transport model. We find similar upper bounds in operator splitting numerical errors in global CTM simulations.

Submitted 6 November, 2015; v1 submitted 11 May, 2015; originally announced May 2015.

Journal ref: Journal of Computational Physics, 2015

arXiv:1505.00864 [pdf, other]

doi 10.1073/pnas.1515373112

Accurate estimation of influenza epidemics using Google search data via ARGO

Authors: Shihao Yang, Mauricio Santillana, S. C. Kou

Abstract: Accurate real-time tracking of influenza outbreaks helps public health officials make timely and meaningful decisions that could save lives. We propose an influenza tracking model, ARGO (AutoRegression with GOogle search data), that uses publicly available online search data. In addition to having a rigorous statistical foundation, ARGO outperforms all previously available Google-search-based tracking models, including the latest version of Google Flu Trends, even though it uses only low-quality search data as input from publicly available Google Trends and Google Correlate websites. ARGO not only incorporates the seasonality in influenza epidemics but also captures changes in people's online search behavior over time. ARGO is also flexible, self-correcting, robust, and scalable, making it a potentially powerful tool that can be used for real-time tracking of other social events at multiple temporal and spatial resolutions.

Submitted 16 November, 2015; v1 submitted 4 May, 2015; originally announced May 2015.

Comments: 23 pages, 2 figures, Proceedings of the National Academy of Sciences (2015)

arXiv:1311.6315 [pdf, other]

Quantifying the loss of information in source attribution problems using the adjoint method in global models of atmospheric chemical transport

Authors: Mauricio Santillana

Abstract: It is of crucial importance to be able to identify the location of atmospheric pollution sources on our planet. Global models of atmospheric transport in combination with diverse Earth observing systems are a natural choice to achieve this goal. It is shown that the ability to successfully reconstruct the location and magnitude of an instantaneous source in global chemical transport models (CTMs) decreases rapidly as a function of the time interval between the pollution release and the observation time. A simple way to quantitatively characterize this phenomenon is proposed based on the effective -undesired- numerical diffusion present in current Eulerian CTMs and verified using idealized numerical experiments. The approach presented consists of using the adjoint-based optimization method in a state-of-the-art CTM, GEOS-Chem, to reconstruct the location and magnitude of a realistic pollution plume for multiple time scales. The findings obtained from these numerical experiments suggest a time scale of 2 days after which the accuracy of the adjoint-based optimization methodology is compromised considerably in current global CTMs. In conjunction with the mean atmospheric velocity, the aforementioned time scale leads to an estimate of a length scale of about 1700 km, downwind from the source, beyond which measurements, in conjunction with current global CTMs, may not be successfully utilized to reconstruct continuous-in-time sources. The approach presented here can be utilized to characterize the capabilities and limitations of adjoint-based optimization inversions in other regional and global Eulerian CTMs.

Submitted 25 November, 2013; originally announced November 2013.

Showing 1–13 of 13 results for author: Santillana, M