Epidemic-induced local awareness behavior inferred from surveys
and genetic sequence data

Gergely Ódor
Department of Network and Data Science, Central European University, Vienna, Austria

Márton Karsai
Department of Network and Data Science, Central European University, Vienna, Austria
National Laboratory of Health Security, HUN-REN Alfréd Rényi Institute of Mathematics, Budapest, Hungary

Abstract

Behavior-disease models suggest that if individuals are aware and take preventive actions when the prevalence of the disease increases among their close contacts, then the pandemic can be contained in a cost-effective way. To measure the true impact of local awareness behavior on epidemic spreading, we propose an efficient approach to identify superspreading events and assign corresponding Event Containment Scores (ECSs) in clinical genetic sequence data.

We validate ECS as a measure of local awareness in simulation experiments, and we find that ECS was correlated positively with policy stringency during the COVID-19 pandemic. Finally, we observe a temporary drop in ECS during the Omicron wave in most European countries, matching a survey experiment we carried out at the same time. Our findings bring important insight into the field of awareness modeling through the analysis of large-scale genetic sequence data, one of the most promising data sources in epidemics research.

1 Introduction

The COVID-19 pandemic has highlighted several pivotal shortcomings that demand comprehensive examination within our society [1]. One of the most important lessons was the need for more effective social interventions, which can ensure the adherence to the necessary containment measures during future pandemics [1, 2]. Manifesting as a social dilemma, restrictive measures generate a conflict between long-term collective interest and short-term self-interest [3], and it can be difficult to convince individuals to cooperate, especially if the cooperative behavior needs to be sustained for longer time periods [4, 5, 6]. Among interventions that raise awareness and promote cooperative behavior, a combination of community engagement, accurate monitoring, and transparent reporting of the impact of restrictions has been found the most consistently effective approach [7, 8].

Recognizing the importance of the problem, the research community responded to the emergence of the COVID-19 pandemic by closely monitoring and actively reporting the changes in epidemic awareness [9, 10]. However, most of these studies focused on global awareness (i.e., adherence to governmental restrictions), while only a few studies exist on the impact of local awareness (i.e., behavioral changes adaptive to the local prevalence of the disease), even though there is substantial model-based evidence that local awareness can be more effective in reducing the pandemic threshold and reducing the size of the epidemic compared to its global counterpart [11, 12, 13, 14]. The bias towards global awareness can be partially explained by the limited data availability on the local scale, due to privacy concerns [15, 16].

To fill the gap in monitoring local awareness behavior, we conducted a representative telephone survey of 9000 participants over 9 months during the Delta and the Omicron waves in Hungary as part of the MASZK national survey [17]. The responders were asked to rate their willingness to undertake stricter preventive measures (such as increased mask wearing or social distancing) if the prevalence of the disease increased among their close contacts. The survey results show an unexpected pattern (Figure 1 (a)). While local awareness scores stayed relatively constant throughout the collection period, including the Delta wave of the pandemic, we observed a drop in awareness during the Omicron wave, which rebounded promptly after the wave ended.

Figure 1: (a) The MASZK Hungarian telephone survey, with 1000 participants in each of the 9 months, shows that the mean awareness score remains relatively constant throughout the recording period, except during the Omicron wave, when the awareness scores drop. The government-imposed preventive measures (mask wearing, social distancing) show a different temporal pattern. (b) Our proposed pipeline to process synthetic (blue) and real genetic sequence data (grey) to compute Event Containment Scores (ECS), a proxy for local awareness behavior.

The measured awareness scores show a distinctive temporal pattern compared to the standard protective measures, which we also assessed in the same survey. Figure 1 (a) shows that mask wearing stayed constant throughout both the Delta and the Omicron waves, while social distancing dropped during the Omicron wave, but did not rebound after the wave ended. These additional survey results also rule out the hypotheses that the drop in awareness scores can be explained exclusively by the responders' inability to perform stricter measures during the Omicron wave, or by the relatively lower risk of hospitalization and death posed by the Omicron variant.

According to our interpretation, the observed drop in awareness scores can be attributed to a form of pandemic fatigue [4, 5], that is, the demotivation to engage in preventive behavior due to the complex interplay of various psychological factors. However, since the general adherence to regulations showed a very different pattern compared to the awareness behavior in Figure 1 (a), the observed “awareness fatigue” is likely to have a very different psychological explanation, which our survey was not designed to reveal. Instead of speculating about the mechanisms of the observed phenomenon, we focus on two important questions about the impact of our finding: (i) do other countries show similar changes in awareness behavior? (ii) does the observed drop in self-reported awareness have a measurable impact on the spread of the epidemic? To answer these questions we turn to the analysis of large-scale genetic sequence data, which contains hidden, but accessible information about the local spread of the epidemic.

1.1 Inferring Local Awareness from Genetic Sequence Data

While genetic data raises relatively minor privacy concerns [18], and it is unparalleled in terms of availability, extracting behavioral information from genetic sequences is a challenging task. In phylodynamics [19, 20], human behavior is typically inferred based on the phylogenetic tree reconstructed from the observed sequences [21]. However, current tree reconstruction methods have a number of limitations. First, traditional methods are computationally intensive and it is difficult to scale them to datasets with more than a few thousand sequences [22, 23]. Since the COVID-19 pandemic, there has been significant progress in developing more scalable methods [24], and releasing publicly available trees for further analysis [25, 26]. However, processing millions of SARS-CoV-2 genetic sequences remains a challenge [27], and the publicly shared pre-computed trees do not have the same coverage as the Global Initiative on Sharing All Influenza Data (GISAID) dataset, which contains over 16 million SARS-CoV-2 genetic sequences, with a 5-15% sampling rate in several countries [28]. Second, working with general-purpose methods or highly pre-processed datasets can significantly lower the statistical power of our results, especially since previous methods were not optimized to measure local-awareness behavior. Instead, we process this new dataset of unprecedented size by focusing on a simple and tractable statistic that does not require the reconstruction of the phylogenetic tree: the size distribution of the clusters of identical genetic sequences over time. Similar tree-free methods with different applications have recently been proposed [29, 30, 31]. In essence, we break up the global epidemic into thousands of sub-epidemics with identical genetic code to infer patterns of local awareness.
Since each sub-epidemic contains only very noisy information about the general local awareness patterns in the population, we focus on one of the most robust features of the dataset: Superspreading Events (SSEs).

The role of SSEs as the driving force of the COVID-19 pandemic was well-established in early 2020 [32]. Since then, there has been a remarkable research effort to understand the potential of targeted interventions to prevent or contain SSEs [33, 34, 35], and to document the effect of these interventions in case studies based on contact tracing [36, 37]. It has also been shown that the downstream infection patterns of SSEs can be observed from phylogenetic trees [38], which can be used to infer signs of awareness behavior. However, since phylogenetic tree reconstruction does not scale to datasets of this size, a new methodology to quantify the impact of awareness behavior from genetic data is needed.

Inspired by [31, 38], we develop a pipeline to detect SSEs based on the size distribution of clusters of identical genetic sequences, and to measure the resulting secondary infections by assigning each SSE an Event Containment Score (ECS, see Figure 1 (b)). Intuitively, ECS is a proxy for the level of adaptive local-awareness behavior, which we confirm via extensive simulation results on synthetic epidemic models with local awareness. To validate the ECS in real data, we compute the ECS of European countries based on a dataset of over 5 million genetic sequences collected over 145 weeks. We demonstrate that the ECS correlates positively with the Oxford Containment Health Indices [39] in the selected countries, but not with some of the potential confounders, such as the sampling rate, the attack rate, or the sizes of the SSEs. Finally, by comparing ECS values during an epidemic wave and between waves in each country in our dataset, we observe that local awareness dropped during the Omicron wave in multiple countries during the COVID-19 pandemic, and that it had a measurable impact on the spread of the disease. In addition to providing evidence for the impact of local awareness in multiple countries, our methods pave the way for future interdisciplinary studies that monitor behavioral patterns using large-scale genetic sequence data.

2 Results

2.1 Event Containment Scores on the COVID-19 Genetic Dataset

Our analysis is based on the detection of SSEs and the assignment of ECSs to each SSE by quantifying secondary infections (Figure 1 (b)). As the first step of the pipeline, we downloaded the entire GISAID EpiCoV database between March 2020 and March 2023 [28]. Although the database contains sequences from over 200 countries worldwide, throughout the paper we focused on European countries, since this region had the highest sampling rate, with suitably different but comparable countries from a behavioral perspective. Besides the raw nucleotide sequences, the dataset also contains various metadata, such as the date and the location of the sample (usually at the country or county level). Moreover, the database contains the amino-acid-level substitutions of each sequence compared to the WIV04 reference sequence collected in late 2019 in Wuhan. Although the amino-acid-level substitution data is more aggregated than the raw genetic data (three nucleotides encode one amino acid, with multiple triplets having the same encoding), it still contains highly detailed information about the genetic code of the samples, and it is computationally more tractable to process, since the alignment of the raw genetic codes can be omitted. We preprocess the dataset by partitioning the genetic sequences with identical amino acid substitutions into subsets, which we call Amino Acid Collision Clusters (AACCs). We group together AACCs that were collected in the same country and that belong to the same variant, as it is often assumed that SARS-CoV-2 viruses with identical Greek letters (e.g., Alpha, Delta, Omicron, see Figure 2 (a)) have similar fitness profiles [40]: there is no selection between them, and the infection probability and recovery time of the patient are similar.
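The preprocessing step described above can be illustrated with a minimal sketch. It assumes that each sequence record is a dictionary with the hypothetical keys `country`, `variant`, `week` and `substitutions`, and that AACCs are tracked separately within each country-variant pair; the actual GISAID metadata fields and the authors' implementation may differ.

```python
from collections import defaultdict


def build_aaccs(records):
    """Partition records into Amino Acid Collision Clusters (AACCs).

    Returns {(country, variant): {substitution_set: {week: count}}}.
    The record keys used here are hypothetical, chosen for illustration.
    """
    groups = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for rec in records:
        group_key = (rec["country"], rec["variant"])
        # Two sequences collide if they carry exactly the same set of
        # amino acid substitutions, regardless of the order they are listed in.
        aacc_key = frozenset(rec["substitutions"])
        groups[group_key][aacc_key][rec["week"]] += 1
    # Convert nested defaultdicts to plain dicts for safer downstream use.
    return {
        g: {a: dict(weeks) for a, weeks in aaccs.items()}
        for g, aaccs in groups.items()
    }


# Toy input; the substitution labels are illustrative only.
records = [
    {"country": "AT", "variant": "Delta", "week": 1, "substitutions": ["S_D614G", "N_R203M"]},
    {"country": "AT", "variant": "Delta", "week": 2, "substitutions": ["N_R203M", "S_D614G"]},
    {"country": "AT", "variant": "Delta", "week": 2, "substitutions": ["S_D614G"]},
    {"country": "DE", "variant": "Delta", "week": 1, "substitutions": ["S_D614G"]},
]
aaccs = build_aaccs(records)
```

The weekly counts per AACC are the size time series on which SSE detection would then operate.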

Figure 2: (a) Bar plot showing the number of SARS-CoV-2 genetic sequences collected in Austria and uploaded to the GISAID platform over time, for each of the major variants, and the number of reported cases (red). (b) Visualization of the size of 6 AACCs in Austria over time. Within these 6 AACCs, our proposed thresholding approach detected 5 Superspreading Events shown with square markers (often at the beginning of an AACC). The color of the squares marks the sign of the Event Containment Scores.

We detect SSEs in each AACC by tracking unexpectedly large increases in their size after proper normalization (Methods 4.1). Our SSE detection method is closely related to previous thresholding approaches [31, 38], requires only minor preprocessing, and the detected SSEs agree with our intuition after visual inspection (Figure 2 (b)). Thereafter, we assign Event Containment Scores (ECSs) to each SSE by comparing the size of the AACCs after SSEs and after appropriately selected baseline events (Methods 4.2). Finally, to acquire aggregate descriptions of event containment, we compute the median of the $\mathrm{ECS}$ values in each country-variant pair $c$, denoted by $\mathrm{ECS}_c$ (the output of the pipeline in Figure 1 (b)). Intuitively, a positive $\mathrm{ECS}_c$ means that SSEs typically led to smaller AACC sizes, and therefore fewer secondary infections than the baselines, i.e., the SSEs were well-contained (Figure 2 (b), red squares). Similarly, a negative $\mathrm{ECS}$ would suggest SSEs that were not contained as well as the baselines (Figure 2 (b), blue squares).

Both the SSE detection and the ECS assignment algorithms are efficient but imperfect methods, potentially introducing significant amounts of noise in our results. However, we expect that if enough SSEs are detected in a country-variant pair, the median of the ECS values will still contain information about event containment, and subsequently, awareness behavior. We confirm this hypothesis by the analysis of COVID-19 genetic sequences in this section, and by simulation results in Section 2.2.

We compute the $\mathrm{ECS}_c$ values for all country-variant pairs with at least 20 detected SSEs, and we analyse how these values are related to behavioral metrics and potential confounding factors. Concentrating on the Delta and the Omicron variants, as these two variants had the highest sampling rate and the highest number of countries with at least 20 SSEs, Figure 3 (a) and (d) show a large variability between the computed $\mathrm{ECS}_c$ values, suggesting a large variability in the efficacy of SSE containment in these countries. For some countries, such as Austria and Germany, the $\mathrm{ECS}_c$ values are positive in both waves, suggesting efficient SSE containment. For other countries, such as Denmark, Switzerland and Sweden, the $\mathrm{ECS}_c$ values are negative, suggesting inefficient SSE containment compared to the baseline. There are also countries, such as Ireland and Slovenia, where the sign of the $\mathrm{ECS}_c$ value changes between the two waves.
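The aggregation step can be sketched as follows, assuming the per-SSE ECS values have already been computed (the scoring itself is defined in Methods 4.2 and is not reproduced here). The function name and input layout are illustrative choices, not the authors' code.

```python
from collections import defaultdict
from statistics import median


def aggregate_ecs(sse_scores, min_events=20):
    """Median ECS per country-variant pair, keeping only well-sampled pairs.

    sse_scores: iterable of (pair, ecs) tuples, one per detected SSE,
                where pair is e.g. ("AT", "Delta").
    min_events: minimum number of detected SSEs required (20 in the text).
    """
    by_pair = defaultdict(list)
    for pair, ecs in sse_scores:
        by_pair[pair].append(ecs)
    return {
        pair: median(values)
        for pair, values in by_pair.items()
        if len(values) >= min_events
    }


# Toy input: 21 made-up scores for one pair, only 3 for another.
toy_scores = [(("AT", "Delta"), float(x)) for x in range(-5, 16)]
toy_scores += [(("SI", "Delta"), 1.0)] * 3
ecs_c = aggregate_ecs(toy_scores)
```

Using the median rather than the mean limits the influence of individual noisy ECS values, which matches the motivation given in the text for aggregating over many SSEs.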

Figure 3: Event Containment Scores (green) and various exogenous variables (blue) in European countries with at least 20 detected SSEs in the Delta (subplots (a)-(c)) and the Omicron (subplots (d)-(f)) waves. The exogenous variables are: (a),(d) sequencing rate; (b),(e) attack rate; (c),(f) CHI (Methods 4.2). All bar plots show median values and corresponding confidence intervals (2.5th and 97.5th percentiles), with a maximum threshold of 3 on the confidence intervals of $\mathrm{ECS}_c$ values. Grey background signifies a statistically significant correlation between $\mathrm{ECS}_c$ and the exogenous variable (Table 1).
| | Sequencing rate, Delta (a) | Sequencing rate, Omicron (d) | Attack rate, Delta (b) | Attack rate, Omicron (e) | CHI, Delta (c) | CHI, Omicron (f) |
|---|---|---|---|---|---|---|
| Spearman-r statistic | -0.833 | -0.283 | -0.033 | 0.050 | 0.733 | 0.850 |
| Spearman-r p-value | 0.005 | 0.460 | 0.932 | 0.898 | 0.025 | 0.004 |

Table 1: Spearman-r statistics and p-values computed between the $\mathrm{ECS}_c$ values and the exogenous variables for each of the six plots in Figure 3 (a)-(f). Cells with a significant p-value are colored grey.

To understand the factors that could explain the variability in the $\mathrm{ECS}_c$ values, we compute the sampling rate, the attack rate and the Containment Health Index (CHI) in each country-wave pair (Methods 4.2). CHI is a composite epidemic response measure based on thirteen policy indicators maintained by the Oxford Coronavirus Government Response Tracker (OxCGRT) project, similarly to the stringency index [39]. We plot these exogenous variables in Figure 3, and we compute the Spearman-r statistic between them and the $\mathrm{ECS}_c$ values (Table 1). We find a statistically significant correlation between the $\mathrm{ECS}_c$ values and the CHI in both waves (Figure 3 (c) and (f)), and between the $\mathrm{ECS}_c$ values and the sampling rate in the Delta wave (Figure 3 (a)). Since the sampling rate is a potential confounding factor, the latter result could suggest that our results are artefacts of the sampling procedure. However, in the Delta wave, sampling rate and CHI happened to be highly and negatively correlated, potentially because certain countries aimed to lift the economic burden of strict containment policies through a higher-quality sequencing and monitoring project. In the other waves the correlation is only significant between the $\mathrm{ECS}_c$ and the CHI (i.e., in the Omicron wave, and in the Alpha wave with a different SSE detection threshold, see Supplementary Material B.1).
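For readers unfamiliar with the statistic, the following is a minimal pure-Python sketch of the Spearman rank correlation: values are replaced by their ranks (ties receive the average rank) and the Pearson correlation of the ranks is returned. It does not compute p-values, and the input numbers are made up for illustration; they are not the values behind Table 1.

```python
def average_ranks(values):
    """Return 1-based ranks of values, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up example: five hypothetical countries with an ECS_c and a CHI value each.
toy_ecs = [-0.4, -0.1, 0.0, 0.3, 0.5]
toy_chi = [42.0, 38.0, 47.0, 55.0, 61.0]
rho = spearman_r(toy_ecs, toy_chi)
```

Because the statistic depends only on ranks, it is insensitive to the scale of $\mathrm{ECS}_c$ and to monotone transformations of the exogenous variable, which is convenient when comparing quantities with very different units.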

The correlation between the Containment Health Index and the $\mathrm{ECS}_c$ values in Figure 3 is an indication that $\mathrm{ECS}_c$ measures a behavioral signal instead of noise or confounding effects. To test whether this behavior is indeed local awareness behavior, we conduct a large-scale simulation experiment in Section 2.2, and subsequently we return to the empirical analysis in Section 2.3.

2.2 Event Containment Scores on Synthetic Genetic Sequence Data

We set up a synthetic pipeline to generate genetic sequence datasets similar to the GISAID EpiCoV dataset, which we can analyse with our SSE detection and ECS assignment pipeline (Figure 1 (b)). First, we simulate Susceptible-Infected-Recovered (SIR) epidemics on various synthetic and real networks (Methods 4.3-4.4), then we apply the Jukes-Cantor (JC) [41] genetic substitution model on the resulting infection tree to produce genetic sequence data (Methods 4.5), and finally we compute the $\mathrm{ECS}_c$ values as before (except that now $c$ denotes the model parameters instead of the country-variant pair, see also Methods 4.2).

For the underlying network, we select four real social networks and three types of synthetic random networks. Two company friendship networks [42], which encode personal connections (recorded by Facebook), are of medium size (around 5000 nodes) and have characteristics similar to those of the contact networks on which a viral disease (such as SARS-CoV-2) can spread. Two online social networks, the Google+ friendship network [43] and the Twitter mutual mention network [44], are large (over 200,000 nodes), and they model the underlying network of online contagion processes (e.g., rumor, misinformation). All four networks have a heterogeneous degree distribution and a relatively high clustering coefficient (Supplementary Figure 20). To model these characteristics separately, we select three synthetic network models: the Configuration Model has a heterogeneous degree distribution but no clustering, the Stochastic Block Model (SBM) has high clustering but a homogeneous degree distribution, and the Geometric Inhomogeneous Random Graph (GIRG) model [45] has both a heterogeneous degree distribution and high clustering (Methods 4.3). On all network models, due to the heterogeneous degree distribution (or the community structure in case of the SBM), we expect large infection events that can be detected via the SSE detection pipeline outlined in Section 2.1.

We model local and global awareness in our simulations as a modification of the SIR model with adaptively changing infection probabilities (Methods 4.4). Inspired by [46], for local awareness we set the infection probability of an infectious node $u$ at time $t$ to be

$$\beta_{u,t} = \beta_0 e^{-\alpha_l I_{u,t}}, \qquad (1)$$

where $\beta_0 \in [0,1]$ is the basic infection probability, $\alpha_l$ sets the strength of the local awareness behavior, and $I_{u,t}$ is the number of infectious neighbors of node $u$ at time $t$. In case of the global awareness, all infectious nodes $u$ have the same infection probability at time $t$:

$$\beta_{u,t} = \beta_0 e^{-\alpha_g I_t / N}, \qquad (2)$$

where $I_t$ is the total number of infectious nodes in the network, $\alpha_g$ sets the strength of the global awareness behavior, and $N$ is the size of the network. The exponential function in equation (1) (resp., (2)) aims to model a scenario where each neighbor (resp., node) may alert node $u$ about their infectious status, and each of these independent alerts causes a multiplicative reduction in the infection probability (similarly to alternative approaches where awareness is modeled as another contagion process, and the probability of staying unaware decays exponentially in the number of aware neighbors [11, 12, 13]). As a robustness check, we also implement linearly decaying awareness functions, since it has been reported that they may be more cost-effective from an epi-economic point of view [47] (Supplementary Figure 19).
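Equations (1) and (2) translate directly into code. The sketch below is illustrative only: the function and argument names are our own, and it covers just the awareness-adjusted infection probabilities, not the full SIR simulation of Methods 4.4.

```python
import math


def beta_local(beta0, alpha_l, infectious_neighbors):
    """Equation (1): infection probability of node u under local awareness.

    infectious_neighbors is I_{u,t}, the number of infectious neighbors of u.
    """
    return beta0 * math.exp(-alpha_l * infectious_neighbors)


def beta_global(beta0, alpha_g, infectious_total, n_nodes):
    """Equation (2): infection probability under global awareness.

    infectious_total is I_t and n_nodes is N, so the exponent depends only
    on the global prevalence I_t / N and is identical for every node.
    """
    return beta0 * math.exp(-alpha_g * infectious_total / n_nodes)


# With alpha = 0 both functions reduce to the basic infection probability,
# which is the default setting (beta0 = 0.3) reported in the caption of Figure 4.
no_awareness = beta_local(0.3, 0.0, 5)

# Each additional infectious neighbor multiplies the probability by exp(-alpha_l).
ratio = beta_local(0.3, 0.5, 3) / beta_local(0.3, 0.5, 2)
```

The constant ratio between consecutive values of `infectious_neighbors` is the multiplicative reduction per alert that the text describes.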

Figure 4: Event Containment Scores (ECS) computed on genetic sequence data generated from simulated epidemics on synthetic and real networks as a function of (a) the local, (b) the global awareness function parameter, (c) the infection probability and (d) the subsampling probability. When not stated otherwise, all parameters are set to their default values $\alpha_l = 0$, $\alpha_g = 0$, $\beta_0 = 0.3$ and $p = 0$. We observe positive $\mathrm{ECS}_c$ values only in case of local awareness.

In Figure 4, we plot the dependence of $\mathrm{ECS}_c$ on the awareness-strength parameters $\alpha_l$ and $\alpha_g$ and two potential confounding factors: the basic infection probability $\beta_0$, and the sampling probability $p$. The results indicate that $\mathrm{ECS}_c$ primarily depends on the parameter $\alpha_l$ (Figure 4 (a)). Importantly, we were only able to generate positive $\mathrm{ECS}_c$ values with the local awareness model, which is a strong indication that the positive $\mathrm{ECS}_c$ values observed in the empirical dataset (Figure 3) are signs of local awareness behavior.

The observation that only local awareness can produce positive $\mathrm{ECS}_c$ values has an intuitive explanation. When an SSE occurs, there is usually a common trait between the individuals that become infected at the same time: they all tend to belong to the same community as the initial infector. It is also likely that there exist many additional individuals that belong to the same community, but do not become immediately infected. Indeed, reports of early SSEs during COVID-19 do not report all individuals becoming infected in the communities at the same time [48, 49], and the same is true in simulations, unless the infection probability inside the community is close to 1. If the structure of the contact network remains unchanged after the SSE, then these additional community members become infected in the next timestep (week), which causes the number of sequences in the AACC to grow, and therefore produces a negative $\mathrm{ECS}_c$ value. Note that there are extreme examples of static networks and epidemic parameters that produce a positive $\mathrm{ECS}_c$ value. For instance, in a star network with infection probability close to 1, an epidemic from the center node produces a single SSE, and then dies out in the next step, resulting in $\mathrm{ECS}_c > 0$. However, we conclude that besides a few extreme cases, positive $\mathrm{ECS}_c$ values, such as the ones observed in the empirical dataset in Figure 3, are signs of local awareness behavior.

2.3 Local Awareness During And Between Waves in Genetic Sequence Data

Having validated ECSs in the GISAID EpiCoV and in synthetic datasets, we return to the main question posed in the Introduction: whether drops in local awareness behavior can be observed in the genetic sequence dataset during the Omicron wave of the COVID-19 pandemic. So far, we computed the $\mathrm{ECS}_c$ value as the median of all ECS values for a country-variant pair. However, since in certain countries, most notably in the United Kingdom, we detect thousands of SSEs in multiple variants, we can obtain a signal at a higher temporal resolution by computing the median of ECS values on a monthly basis. The resulting signal (Figure 5 (a)), obtained purely based on genetic sequence data, shares a remarkable similarity with the Hungarian awareness survey results in Figure 1 (a). Both curves show a relatively stable signal between October 2021 and July 2022, with a significant drop during the (largest) peak(s) of the Omicron variant, suggesting that the two methods may measure similar behavioral patterns.

Figure 5: (a) $\mathrm{ECS}_c$ values computed on a monthly basis in the UK in the Alpha, Delta and the Omicron variants. We observe two drops in $\mathrm{ECS}_c$: a milder one in July 2021 during the Delta wave, and a stronger one from December 2021 to February 2022 during the Omicron wave. The last datapoint of the Delta variant and the first datapoint of the Omicron variant are within 10 days of each other, as in the UK the Omicron variant took over from Delta within that time interval. (b) $\mathrm{ECS}$ during waves vs outside of waves in various country-variant pairs. Higher quality datapoints computed using more superspreading events are darker (see colorbar). Datapoints below the dashed line hint at drops in local awareness, typically during the Omicron variant.

Given the different nature of the two datasets, obtained with different methodologies in different countries, it is important to interpret the comparison of Figures 1 (a) and 5 (a) carefully. A significant difference between the two countries is that both the Delta and the Omicron waves arrived a few months earlier in the UK compared to Hungary, and during the UK Delta wave, the reported case counts showed a plateau instead of a clear epidemic wave. The Omicron wave arrived in the UK during this plateau, whereas in Hungary the case counts dropped between the two waves. Moreover, the UK had more stringent government restrictions implemented at the end of 2021 than Hungary [39], which could have an influence on local behavioral patterns as well.

Since other countries in the genetic dataset have too few samples to perform a monthly aggregation as we did in the UK, for each country-variant pair we subdivide the dataset into two groups: “during wave” and “outside of wave”. More precisely, in each country-variant pair, we rank the SSEs based on the total number of reported cases in the country at the time, and we classify the top 20% of the SSEs as “during wave” and the remaining 80% as “outside of wave” (Supplementary Material C). Thereafter, we compute the median of the ECS values in each country-variant-during/outside triplet where at least 20 SSEs were detected. The results (Figure 5 (b)) are consistent with our observation in the Hungarian survey dataset: in European countries with large genetic sequence datasets, local awareness (ECS) was lower during the wave compared to outside of the wave during the Omicron variant but not during the Alpha and the Delta variants.
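The wave split described above can be sketched as follows for a single country-variant pair. The rounding of the 20% cut-off, the handling of ties in reported case counts, and the function name are assumptions made for illustration; Supplementary Material C may specify these details differently.

```python
from statistics import median


def wave_split_medians(sses, top_fraction=0.2, min_events=20):
    """Median ECS during and outside of waves for one country-variant pair.

    sses: list of (reported_cases, ecs) tuples, one per detected SSE, where
          reported_cases is the national case count at the time of the SSE.
    Returns None for a group with fewer than min_events SSEs.
    """
    # Rank SSEs from highest to lowest reported case count.
    ranked = sorted(sses, key=lambda s: s[0], reverse=True)
    n_during = round(top_fraction * len(ranked))  # assumed rounding rule
    groups = {
        "during wave": [ecs for _, ecs in ranked[:n_during]],
        "outside of wave": [ecs for _, ecs in ranked[n_during:]],
    }
    return {
        label: median(values) if len(values) >= min_events else None
        for label, values in groups.items()
    }


# Made-up example: 100 SSEs, the 20 with the highest case counts score -1.
toy_sses = [(1000 + i, -1.0) for i in range(20)] + [(i, 1.0) for i in range(80)]
split = wave_split_medians(toy_sses)
```

In this toy input the "during wave" median falls below the "outside of wave" median, which corresponds to a datapoint below the dashed line in Figure 5 (b).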

3 Discussion

In epidemic surveillance, there is usually a trade-off between the breadth and the depth of the data we can access. On one end, we have aggregate case counts, which give a macroscopic view of the epidemic; on the other end, we have a handful of case studies, which tell us about the local spread. Survey results provide a representative depiction of self-reported human behavior; however, they lack sufficient information on disease spread to support conclusions beyond forming hypotheses.

In this paper, we observe local awareness behavior in two complementary datasets: a Hungarian survey dataset and the dataset of clinical genetic sequences collected during the COVID-19 pandemic. We first show that the survey results indicate a drop in local awareness behavior during the Omicron wave of the COVID-19 pandemic. Based on the survey results, we ask whether this drop also occurred, and caused noticeable changes in the spread of the disease, in other countries. To address this question, we introduce a methodology that utilizes genetic sequence data, striking a new balance between micro and macroscopic epidemic surveillance.

As with any trade-off, our proposed analysis comes with a number of limitations. We identify SSEs based on simple thresholding of sequence counts, which is less accurate than manual contact tracing, where more metadata and more context about infection events can be taken into account. Consequently, we only compute highly aggregated statistics on the detected events. One ECS gives only very noisy information about the outcome of each SSE, and only the median of all ECSs, the $\mathrm{ECS}_{c}$ value, has the statistical power to say anything about local awareness in region $c$. Since the number of genetic sequences available since COVID-19 is unprecedented, and the new tools to analyse them are just being developed [22], our results too have to be interpreted very carefully, and should be confirmed by further research.

Despite these limitations, the new methodology we propose brings immediate and exciting contributions into epidemic surveillance and modeling. While local awareness has been thoroughly studied in the modeling literature [11, 12, 13, 14], there has been little empirical evidence about its impact in real epidemics. We provide such evidence by showing that positive $\mathrm{ECS}_{c}$ corresponds to local awareness behavior in simulations, and that $\mathrm{ECS}_{c}$ was positive in several European countries during the COVID-19 pandemic.

On a more operational side, by studying $\mathrm{ECS}_{c}$ values, we are able to measure how effectively different countries managed to contain SSEs in different waves. We observe that this effectiveness is highly correlated with the containment policies implemented in each country, which is a reassuring finding. We envision that similar analyses will be used to evaluate the effectiveness of the implemented policies in future pandemics, potentially generating a positive feedback loop between cooperative preventive behavior and epidemic containment. Unfortunately, even with the rapid advancement of genetic sequencing technologies, the financial burden of achieving the sampling rate necessary for our proposed analysis is quite high, and we cannot expect that we will have the same coverage in every pandemic. Deciding how much sequencing is actually needed for epidemic surveillance is currently an active research topic, as the cost-benefit tradeoffs are still being debated [50]. Our analysis adds to this discussion by bringing a new potential benefit of dense genetic sequencing.

Finally, we highlight the importance of continuing this research towards more specific questions, such as understanding the socioeconomic factors that determine whether SSEs are effectively contained or not, and whether the measured local awareness behavior is more centrally regulated (e.g. by a public health organization), or decentralized and self-motivated as it was asked in the questionnaire in Figure 1 (a). Large-scale genetic data analysis provides a new opportunity to answer these questions, and to further our understanding about the underlying mechanisms of behavior-disease models.

4 Methods

An overview of the various steps of the simulation pipeline is illustrated in Figure 1. The details on the preprocessing of AACCs are included in the main text. Below, we give the details on the detection of SSEs (Section 4.1), the computation of the ECSs (Section 4.2), the generation of synthetic networks (Section 4.3), the SIR model with local and global awareness (Section 4.4), and the generation of synthetic genetic sequences (Section 4.5).

4.1 SSE detection

See Supplementary Material A for a detailed explanation of these methodological choices. We index the size of AACCs by the time $t$ (integer value measured in weeks since the first sequence), their country-variant pair denoted by $c$, and their cluster index $i$ (Figure 2 (b)). We track the normalized changes in AACC sizes defined as

$$\mathrm{NormChange}_{c,i}(t)=\frac{\mathrm{AACC}_{c,i}(t+1)-\mathrm{AACC}_{c,i}(t)}{\mathrm{max}(1,\sqrt{\mathrm{AACC}_{c,i}(t)})}, \tag{3}$$

where $\mathrm{AACC}_{c,i}(t)$ denotes the size of the AACC indexed by $(c,i,t)$. We say that an SSE happens at time $t$ in AACC $(c,i)$ if $\mathrm{NormChange}_{c,i}(t)$ is larger than a threshold, which is set to 9 by default; we give a robustness analysis in Supplementary Material B.1.
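The detection rule of equation (3) can be sketched as follows. This is a minimal illustration that assumes the weekly sizes of one AACC are already available as a Python list; the function names `norm_change` and `detect_sses` are hypothetical and not taken from the authors' code.

```python
import math


def norm_change(sizes, t):
    """Normalized change of one AACC between week t and week t + 1 (equation (3)).

    sizes: list of weekly AACC sizes for a single (country-variant, cluster) pair.
    """
    return (sizes[t + 1] - sizes[t]) / max(1.0, math.sqrt(sizes[t]))


def detect_sses(sizes, threshold=9.0):
    """Return the weeks t at which NormChange exceeds the threshold."""
    return [t for t in range(len(sizes) - 1) if norm_change(sizes, t) > threshold]


# Illustrative weekly sizes (made-up numbers, not data from the paper).
example_sizes = [0, 12, 14, 60, 50]
sse_weeks = detect_sses(example_sizes)
```

The `max(1, ...)` term in the denominator keeps the score well defined for empty clusters, so a cluster that jumps from 0 to 12 sequences in one week receives a score of 12 rather than causing a division by zero.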

4.2 ECS assignment

See Supplementary Material A for a detailed explanation of these methodological choices. In each country-variant pair with at least 20 detected SSEs, we match each SSE $(c,i,t)$ with at least $2m$ baseline events (not SSEs) based on AACC sizes (see Supplementary Material B.2 for a robustness analysis on the value of $m$). We outline a procedure that ensures that, compared to $(c,i,t)$, at least $m$ larger and $m$ smaller AACCs are always selected as baselines; however, if there are a large number of AACCs with the same size as $(c,i,t)$, then we select all of them to avoid arbitrary selections and to make use of the available data.

Formally, let us denote the cluster indices (resp., time indices) of the matched AACCs by $I(c,i,t)$ (resp., $T(c,i,t)$). First, we sort all AACCs by size to create an order $\mathcal{O}$. We construct $I(c,i,t)$ (resp., $T(c,i,t)$) by taking the union of the cluster (resp., time) indices of all AACCs with the same size as $(c,i,t)$, as well as the $m$ closest smaller and the $m$ closest larger AACCs to $(c,i,t)$ in $\mathcal{O}$. Then, the median baseline NormChange values at time $t$ are defined as

$$\mathrm{Baseline}_{c,i}(t)=\mathop{\text{median}}_{j}\left(\mathrm{NormChange}_{c,I(c,i,t)_{j}}(T(c,i,t)_{j})\right), \tag{4}$$

where the NormChange function is defined in equation (3). Thereafter, $\mathrm{ECS}_{c,i}(t)$ is computed as

$$\mathrm{ECS}_{c,i}(t)=\mathrm{Baseline}_{c,i}(t+1)-\mathrm{NormChange}_{c,i}(t+1), \tag{5}$$

and $\mathrm{ECS}_{c}$ is defined as the median of the $\mathrm{ECS}_{c,i}(t)$ values for all SSEs $(c,i,t)$.
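The matching and scoring steps of equations (4) and (5) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: each baseline event is represented as a hypothetical `(size, next_change)` tuple, where `next_change` is the NormChange of that AACC one week after the matched week (our reading of $\mathrm{Baseline}_{c,i}(t+1)$); ties among the $m$ closest smaller or larger events are broken by sort order; and $m \geq 1$ is required.

```python
from statistics import median


def match_baselines(sse_size, baselines, m):
    """Select baseline events for an SSE of the given AACC size.

    baselines: list of (size, next_change) tuples for non-SSE events.
    Returns all events with the same size, plus the m closest smaller
    and the m closest larger events in the size order (m must be >= 1).
    """
    ordered = sorted(baselines, key=lambda event: event[0])
    smaller = [event for event in ordered if event[0] < sse_size][-m:]
    equal = [event for event in ordered if event[0] == sse_size]
    larger = [event for event in ordered if event[0] > sse_size][:m]
    return smaller + equal + larger


def ecs(sse_size, sse_next_change, baselines, m=2):
    """Event Containment Score: median baseline change minus the SSE's own change."""
    matched = match_baselines(sse_size, baselines, m)
    baseline = median([next_change for _, next_change in matched])
    return baseline - sse_next_change


def ecs_c(ecs_values):
    """Aggregate score of a country-variant pair: the median over its SSEs."""
    return median(ecs_values)
```

A positive score means that the cluster grew less in the week after the SSE than comparable clusters of similar size did, which is the signature of containment that the score is designed to capture.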

In Figure 3, $\mathrm{ECS}_{c}$ values are compared with various exogenous variables (sampling rate, attack rate, CHI). These exogenous variables are computed for each country on a weekly basis based on publicly available datasets on the case counts [51], population counts [52], and the Oxford Containment Health Index [39]. Then, each SSE in the dataset is matched with the exogenous variables based on the time and country information. Finally, the plotted values are computed as the median of the exogenous variables of the SSEs corresponding to index $c$ (which are also used to compute $\mathrm{ECS}_{c}$).

4.3 Generating synthetic networks

Geometric Inhomogeneous Random Graphs (GIRGs) were generated by sampling the spatial coordinates and the expected degrees of the nodes, and then connecting them by edges with a probability given by a kernel function, which is inversely proportional to the spatial distance and ensures the desired node degrees [53]. We used the Python implementation [54] for the sampling procedure with degree exponent $\tau=3.5$ and parameters $\alpha=2.3$, $C_{1}=0.8$. We tuned $C_{2}$ numerically to achieve the desired average degree (by default 3). Configuration models were generated by degree-preserving shuffling of the edges of the generated GIRG networks. SBMs were generated with blocks of size 50. The connection probabilities inside and between the blocks were tuned so that, for each node, half of its average degree was inside the block and half was outside the block. All synthetic networks had $10^{4}$ nodes, and we took the largest connected component if the network was not connected. We include a visualization of the size, degree distribution and average clustering coefficient of the generated networks in Supplementary Figure 20.
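As a rough guide to the SBM tuning described above, the connection probabilities can be approximated in closed form by matching expected degrees. The sketch below is an assumption-laden illustration (the text does not state how the tuning was carried out), and the function name `sbm_probabilities` is hypothetical.

```python
def sbm_probabilities(n_nodes=10**4, block_size=50, avg_degree=3.0):
    """Within-block and between-block edge probabilities for an SBM.

    Chosen so that each node has, in expectation, half of its average
    degree inside its own block and half outside of it.
    """
    half_degree = avg_degree / 2.0
    # A node has block_size - 1 potential neighbors inside its block ...
    p_in = half_degree / (block_size - 1)
    # ... and n_nodes - block_size potential neighbors outside of it.
    p_out = half_degree / (n_nodes - block_size)
    return p_in, p_out


p_in, p_out = sbm_probabilities()
```

With the default parameters of this section, the within-block probability is roughly 200 times larger than the between-block probability, which produces the strong community structure that distinguishes the SBM from the configuration model.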

4.4 SIR model extended with local and global awareness

On both synthetic and real networks, we used our own implementation of the SIR model. We model local and global awareness by setting the infection probability of an infectious node $u$ to any susceptible node $v$ at time $t$ to a function $\beta_{u,t}$. In the case of local awareness, $\beta_{u,t}$ depends on $I_{u,t}$, the number of infected neighbors of $u$ at time $t$, and in the case of global awareness, $\beta_{u,t}$ depends on $I_{t}$, the total number of infected nodes at time $t$. The specific awareness functions we implemented are shown in Table 2. The default values for the basic infection probability $\beta_{0}$ and the recovery probability $\gamma$ were both $0.3$.

No awareness: $\beta_{u,t}=\beta_{0}$
Exponential local awareness (1): $\beta_{u,t}=\beta_{0}\cdot\mathrm{exp}(-\alpha_{l}I_{u,t})$
Exponential global awareness (2): $\beta_{u,t}=\beta_{0}\cdot\mathrm{exp}(-\alpha_{g}I_{t}/N)$
Linear local awareness: $\beta_{u,t}=\beta_{0}\cdot 1/(1+\alpha_{l}I_{u,t})$
Linear global awareness: $\beta_{u,t}=\beta_{0}\cdot 1/(1+\alpha_{g}I_{t}/N)$
Table 2: The specific awareness functions implemented in our synthetic models.
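The awareness functions of Table 2 translate directly into code. The following is a minimal sketch with hypothetical function names; it covers only the infection probabilities, not the full SIR simulation, and takes the number of infected neighbors, the total number of infected nodes and the network size as plain arguments.

```python
import math


def beta_no_awareness(beta0):
    """Constant infection probability."""
    return beta0


def beta_exp_local(beta0, alpha_l, infected_neighbors):
    """Exponential local awareness: decays with the infected neighbors of u."""
    return beta0 * math.exp(-alpha_l * infected_neighbors)


def beta_exp_global(beta0, alpha_g, infected_total, n_nodes):
    """Exponential global awareness: decays with the global prevalence I_t / N."""
    return beta0 * math.exp(-alpha_g * infected_total / n_nodes)


def beta_linear_local(beta0, alpha_l, infected_neighbors):
    """Linear local awareness: beta0 / (1 + alpha_l * I_{u,t})."""
    return beta0 / (1.0 + alpha_l * infected_neighbors)


def beta_linear_global(beta0, alpha_g, infected_total, n_nodes):
    """Linear global awareness: beta0 / (1 + alpha_g * I_t / N)."""
    return beta0 / (1.0 + alpha_g * infected_total / n_nodes)
```

All four awareness variants reduce to the no-awareness case when the awareness parameter is zero or when no node is infected, so $\beta_{0}$ is an upper bound on the infection probability in every model.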

4.5 Generating synthetic genetic sequences

Once the epidemic process has been simulated, we assign synthetic genetic sequences to each node of the infection tree using the JC genetic substitution model [41], which is the simplest genetic substitution model we could select for our application. More concretely, we assign strings of length 10 consisting of the digits $\{0,1,2,3\}$ to each infected node using the following procedure. First, we assign a uniformly randomly chosen string to the root of the infection tree. Thereafter, for each edge of the infection tree, we select each digit of the parent node's string independently with probability 1/20, change each selected digit to a uniformly random new digit (among the other three digits), and assign the resulting string to the child node. These parameters ensure that we have on average one mutation every 2 timesteps (weeks), in agreement with estimates from the literature [55]. Our synthetic genetic sequences are much shorter than the COVID-19 genetic sequences for the sake of computational efficiency.
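The sequence assignment procedure can be sketched as follows. This is an illustrative implementation under stated assumptions: the infection tree is given as a hypothetical dictionary `children` mapping each node to the list of nodes it infected, and the function names and the seeding scheme are chosen for the example.

```python
import random

ALPHABET = "0123"


def mutate(parent_seq, rng, p=1 / 20):
    """Copy the parent's string, changing each digit independently with probability p."""
    child = []
    for digit in parent_seq:
        if rng.random() < p:
            # A mutated digit is replaced by one of the other three digits.
            child.append(rng.choice([d for d in ALPHABET if d != digit]))
        else:
            child.append(digit)
    return "".join(child)


def assign_sequences(root, children, length=10, seed=0):
    """Assign a synthetic sequence to every node of an infection tree.

    children: dict mapping a node to the list of nodes it infected.
    """
    rng = random.Random(seed)
    sequences = {root: "".join(rng.choice(ALPHABET) for _ in range(length))}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            sequences[child] = mutate(sequences[node], rng)
            stack.append(child)
    return sequences
```

With strings of length 10 and a per-digit probability of 1/20, each transmission carries 0.5 mutations in expectation, which matches the rate of one mutation every 2 timesteps quoted above if each transmission spans one timestep.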

5 Data and Code Availability

All genome sequences and associated metadata are published in GISAID’s EpiCoV database. To view the contributors of each individual sequence with details such as accession number, Virus name, Collection date, Originating Lab and Submitting Lab and the list of Authors, visit 10.55876/gis8.240404rn. The MASZK survey data is available upon request from the authors. The code for the genetic data generation and analysis pipeline shown in Figure 1 (b) will be made available at https://github.com/odorgergo/sse-awareness upon publication.

6 Acknowledgements

We thank Eszter Ari, Andreas Bergthaler, and Tamás Stirling for their insightful comments and remarks. GÓ was supported by the Swiss National Science Foundation, under grant number P500PT-211129. MK was supported by the CHIST-ERA project SAI: FWF I 5205-N; the SoBigData++ H2020-871042; SoBigData-PPP HORIZON-INFRA-2021-DEV-02 program under grant agreement No 101079043, and the National Laboratory for Health Security, Alfréd Rényi Institute, RRF-2.3.1-21-2022-00006.

7 Author contributions

GÓ and MK conceptualized the research design. GÓ conducted the data analysis, performed the synthetic simulations, created the visualizations and wrote the first draft of the manuscript. MK acquired the survey data and supervised the research. GÓ and MK edited the final version of the manuscript.

References

  • [1] J. D. Sachs, S. S. A. Karim, L. Aknin, J. Allen, K. Brosbøl, F. Colombo, G. C. Barron, M. F. Espinosa, V. Gaspar, A. Gaviria, et al., “The lancet commission on lessons for the future from the covid-19 pandemic,” The Lancet, vol. 400, no. 10359, pp. 1224–1280, 2022.
  • [2] R. K. Webster, S. K. Brooks, L. E. Smith, L. Woodland, S. Wessely, and G. J. Rubin, “How to improve adherence with quarantine: rapid review of the evidence,” Public health, vol. 182, pp. 163–169, 2020.
  • [3] J. J. V. Bavel, K. Baicker, P. S. Boggio, V. Capraro, A. Cichocka, M. Cikara, M. J. Crockett, A. J. Crum, K. M. Douglas, J. N. Druckman, et al., “Using social and behavioural science to support covid-19 pandemic response,” Nature human behaviour, vol. 4, no. 5, pp. 460–471, 2020.
  • [4] World Health Organization, “Pandemic fatigue–reinvigorating the public to prevent covid-19: policy framework for supporting pandemic prevention and management,” tech. rep., World Health Organization. Regional Office for Europe, 2020.
  • [5] A. Haktanir, N. Can, T. Seki, M. F. Kurnaz, and B. Dilmaç, “Do we experience pandemic fatigue? current state, predictors, and prevention,” Current Psychology, vol. 41, no. 10, pp. 7314–7325, 2022.
  • [6] F. Jørgensen, A. Bor, M. S. Rasmussen, M. F. Lindholt, and M. B. Petersen, “Pandemic fatigue fueled political discontent during the covid-19 pandemic,” Proceedings of the National Academy of Sciences, vol. 119, no. 48, p. e2201266119, 2022.
  • [7] C. Stevenson, J. R. Wakefield, I. Felsner, J. Drury, and S. Costa, “Collectively coping with coronavirus: Local community identification predicts giving support and lockdown adherence during the covid-19 pandemic,” British Journal of Social Psychology, vol. 60, no. 4, pp. 1403–1418, 2021.
  • [8] G. Kraft-Todd, E. Yoeli, S. Bhanot, and D. Rand, “Promoting cooperation in the field,” Current Opinion in Behavioral Sciences, vol. 3, pp. 96–101, 2015.
  • [9] M. S. Wolf, M. Serper, L. Opsasnick, R. M. O’Conor, L. Curtis, J. Y. Benavente, G. Wismer, S. Batio, M. Eifler, P. Zheng, et al., “Awareness, attitudes, and actions related to covid-19 among adults with chronic conditions at the onset of the us outbreak: a cross-sectional survey,” Annals of internal medicine, vol. 173, no. 2, pp. 100–109, 2020.
  • [10] R. M. Jaber, B. Mafrachi, A. Al-Ani, and M. Shkara, “Awareness and perception of covid-19 among the general population: A middle eastern survey,” PloS one, vol. 16, no. 4, p. e0250461, 2021.
  • [11] S. Funk, E. Gilad, C. Watkins, and V. A. Jansen, “The spread of awareness and its impact on epidemic outbreaks,” Proceedings of the National Academy of Sciences, vol. 106, no. 16, pp. 6872–6877, 2009.
  • [12] N. Perra, D. Balcan, B. Gonçalves, and A. Vespignani, “Towards a characterization of behavior-disease models,” PloS one, vol. 6, no. 8, p. e23084, 2011.
  • [13] I. Z. Kiss, J. Cassell, M. Recker, and P. L. Simon, “The impact of information transmission on epidemic outbreaks,” Mathematical biosciences, vol. 225, no. 1, pp. 1–10, 2010.
  • [14] A. Teslya, T. M. Pham, N. G. Godijk, M. E. Kretzschmar, M. C. Bootsma, and G. Rozhnova, “Impact of self-imposed prevention measures and short-term government-imposed social distancing on mitigating and delaying a covid-19 epidemic: A modelling study,” PLoS medicine, vol. 17, no. 7, p. e1003166, 2020.
  • [15] S. Funk, S. Bansal, C. T. Bauch, K. T. Eames, W. J. Edmunds, A. P. Galvani, and P. Klepac, “Nine challenges in incorporating the dynamics of behaviour in infectious diseases models,” Epidemics, vol. 10, pp. 21–25, 2015.
  • [16] F. Verelst, L. Willem, and P. Beutels, “Behavioural change models for infectious disease transmission: a systematic review (2010–2015),” Journal of The Royal Society Interface, vol. 13, no. 125, p. 20160820, 2016.
  • [17] M. Karsai, J. Koltai, O. Vásárhelyi, and G. Röst, “Hungary in mask/maszk in hungary,” Corvinus Journal of Sociology and Social Policy, no. 2, 2020.
  • [18] L. Song, H. Liu, F. S. L. Brinkman, E. Gill, E. J. Griffiths, W. W. L. Hsiao, S. Savić-Kallesøe, S. Moreira, G. Van Domselaar, M. H. Zawati, et al., “Addressing privacy concerns in sharing viral sequences and minimum contextual data in a public repository during the covid-19 pandemic,” Frontiers in genetics, vol. 12, p. 716541, 2022.
  • [19] E. M. Volz, S. L. Kosakovsky Pond, M. J. Ward, A. J. Leigh Brown, and S. D. Frost, “Phylodynamics of infectious disease epidemics,” Genetics, vol. 183, no. 4, pp. 1421–1430, 2009.
  • [20] G. Baele, S. Dellicour, M. A. Suchard, P. Lemey, and B. Vrancken, “Recent advances in computational phylodynamics,” Current opinion in virology, vol. 31, pp. 24–32, 2018.
  • [21] E. M. Volz, K. Koelle, and T. Bedford, “Viral phylodynamics,” PLoS computational biology, vol. 9, no. 3, p. e1002947, 2013.
  • [22] E. B. Hodcroft, N. De Maio, R. Lanfear, D. R. MacCannell, B. Q. Minh, H. A. Schmidt, A. Stamatakis, N. Goldman, and C. Dessimoz, “Want to track pandemic variants faster? fix the bioinformatics bottleneck,” Nature, vol. 591, no. 7848, pp. 30–33, 2021.
  • [23] L. Cappello, J. Kim, S. Liu, and J. A. Palacios, “Statistical challenges in tracking the evolution of sars-cov-2,” Statistical science: a review journal of the Institute of Mathematical Statistics, vol. 37, no. 2, p. 162, 2022.
  • [24] Y. Turakhia, B. Thornlow, A. S. Hinrichs, N. De Maio, L. Gozashti, R. Lanfear, D. Haussler, and R. Corbett-Detig, “Ultrafast sample placement on existing trees (usher) enables real-time phylogenetics for the sars-cov-2 pandemic,” Nature genetics, vol. 53, no. 6, pp. 809–816, 2021.
  • [25] J. McBroome, B. Thornlow, A. S. Hinrichs, A. Kramer, N. De Maio, N. Goldman, D. Haussler, R. Corbett-Detig, and Y. Turakhia, “A daily-updated database and tools for comprehensive sars-cov-2 mutation-annotated trees,” Molecular biology and evolution, vol. 38, no. 12, pp. 5819–5824, 2021.
  • [26] M. Hunt, A. S. Hinrichs, D. Anderson, L. Karim, B. L. Dearlove, J. Knaggs, B. Constantinides, P. W. Fowler, G. Rodger, T. L. Street, et al., “Addressing pandemic-wide systematic errors in the sars-cov-2 phylogeny,” bioRxiv, pp. 2024–04, 2024.
  • [27] C. Ye, B. Thornlow, A. S. Hinrichs, D. Torvi, R. Lanfear, R. Corbett-Detig, and Y. Turakhia, “matoptimize: A parallel tree optimization method enables online phylogenetics for sars-cov-2 (preprint),” 2022.
  • [28] S. Elbe and G. Buckland-Merrett, “Data, disease and diplomacy: Gisaid’s innovative contribution to global health,” Global challenges, vol. 1, no. 1, pp. 33–46, 2017.
  • [29] A. Bernasconi, L. Mari, R. Casagrandi, and S. Ceri, “Data-driven analysis of amino acid change dynamics timely reveals sars-cov-2 variant emergence,” Scientific Reports, vol. 11, no. 1, p. 21068, 2021.
  • [30] C. Tran-Kiem and T. Bedford, “Estimating the reproduction number and transmission heterogeneity from the size distribution of clusters of identical pathogen sequences,” Proceedings of the National Academy of Sciences, vol. 121, no. 15, p. e2305299121, 2024.
  • [31] X. Bello, J. Pardo-Seco, A. Gómez-Carballa, H. Weissensteiner, F. Martinón-Torres, and A. Salas, “Covidphy: A tool for phylogeographic analysis of sars-cov-2 variation,” Environmental Research, vol. 204, p. 111909, 2022.
  • [32] D. Lewis, “Superspreading drives the covid pandemic–and could help to tame it.,” Nature, vol. 590, no. 7847, pp. 544–547, 2021.
  • [33] B. M. Althouse, E. A. Wenger, J. C. Miller, S. V. Scarpino, A. Allard, L. Hébert-Dufresne, and H. Hu, “Superspreading events in the transmission dynamics of sars-cov-2: Opportunities for interventions and control,” PLoS biology, vol. 18, no. 11, p. e3000897, 2020.
  • [34] T. R. Frieden and C. T. Lee, “Identifying and interrupting superspreading events—implications for control of severe acute respiratory syndrome coronavirus 2,” Emerging infectious diseases, vol. 26, no. 6, p. 1059, 2020.
  • [35] M. P. Kain, M. L. Childs, A. D. Becker, and E. A. Mordecai, “Chopping the tail: How preventing superspreading can help to maintain covid-19 control,” Epidemics, vol. 34, p. 100430, 2021.
  • [36] H. Streeck, B. Schulte, B. M. Kümmerer, E. Richter, T. Höller, C. Fuhrmann, E. Bartok, R. Dolscheid-Pommerich, M. Berger, L. Wessendorf, et al., “Infection fatality rate of sars-cov2 in a super-spreading event in germany,” Nature communications, vol. 11, no. 1, p. 5829, 2020.
  • [37] H. Y. Lam, T. S. Lam, C. H. Wong, W. H. Lam, E. L. C. Mei, Y. L. C. Kuen, W. L. T. Wai, B. H. C. Hin, K. H. Wong, and S. K. Chuang, “A superspreading event involving a cluster of 14 coronavirus disease 2019 (covid-19) infections from a family gathering in hong kong special administrative region sar (china),” Western Pacific Surveillance and Response Journal: WPSAR, vol. 11, no. 4, p. 36, 2020.
  • [38] J. E. Lemieux, K. J. Siddle, B. M. Shaw, C. Loreth, S. F. Schaffner, A. Gladden-Young, G. Adams, T. Fink, C. H. Tomkins-Tinch, L. A. Krasilnikova, et al., “Phylogenetic analysis of sars-cov-2 in boston highlights the impact of superspreading events,” Science, vol. 371, no. 6529, p. eabe3261, 2021.
  • [39] T. Hale, N. Angrist, R. Goldszmidt, B. Kira, A. Petherick, T. Phillips, S. Webster, E. Cameron-Blake, L. Hallas, S. Majumdar, et al., “A global panel database of pandemic policies (oxford covid-19 government response tracker),” Nature human behaviour, vol. 5, no. 4, pp. 529–538, 2021.
  • [40] Q. Yu, J. A. Ascensao, T. Okada, COVID-19 Genomics UK (COG-UK) consortium, O. Boyd, E. Volz, and O. Hallatschek, “Lineage frequency time series reveal elevated levels of genetic drift in sars-cov-2 transmission in england,” bioRxiv, pp. 2022–11, 2022.
  • [41] T. H. Jukes and C. R. Cantor, “Evolution of protein molecules,” in Mammalian Protein Metabolism (H. N. Munro, ed.), pp. 21–132, New York: Academic Press, 1969.
  • [42] M. Fire and R. Puzis, “Organization mining using online social networks,” Networks and Spatial Economics, vol. 16, pp. 545–578, 2016.
  • [43] M. Fire, L. Tenenboim-Chekina, R. Puzis, O. Lesser, L. Rokach, and Y. Elovici, “Computationally efficient link prediction in a variety of social networks,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 5, no. 1, pp. 1–25, 2014.
  • [44] S. Unicomb, G. Iñiguez, J. Kertész, and M. Karsai, “Reentrant phase transitions in threshold driven contagion on multiplex networks,” Physical Review E, vol. 100, no. 4, p. 040301, 2019.
  • [45] K. Bringmann, R. Keusch, and J. Lengler, “Geometric inhomogeneous random graphs,” Theoretical Computer Science, vol. 760, pp. 35–54, 2019.
  • [46] Q. Wu, X. Fu, M. Small, and X.-J. Xu, “The impact of awareness on epidemic spreading in networks,” Chaos: an interdisciplinary journal of nonlinear science, vol. 22, no. 1, 2012.
  • [47] L. A. N. Fard, M. Starnini, and M. Tizzoni, “Modeling adaptive forward-looking behavior in epidemics on networks,” arXiv preprint arXiv:2301.04947, 2023.
  • [48] T. Sekizuka, K. Itokawa, T. Kageyama, S. Saito, I. Takayama, H. Asanuma, N. Nao, R. Tanaka, M. Hashino, T. Takahashi, et al., “Haplotype networks of sars-cov-2 infections in the diamond princess cruise ship outbreak,” Proceedings of the National Academy of Sciences, vol. 117, no. 33, pp. 20198–20201, 2020.
  • [49] Y. Zhang, Y. Li, L. Wang, M. Li, and X. Zhou, “Evaluating transmission heterogeneity and super-spreading event of covid-19 in a metropolis of china,” International journal of environmental research and public health, vol. 17, no. 10, p. 3705, 2020.
  • [50] F. Wegner, B. Cabrera Gil, T. Araud, C. Beckmann, N. Beerenwinkel, C. Bertelli, M. Carrara, L. Cerutti, C. Chen, S. Cordey, et al., “How much should we sequence? an analysis of the swiss sars-cov-2 surveillance effort,” medRxiv, pp. 2023–08, 2023.
  • [51] E. Dong, H. Du, and L. Gardner, “An interactive web-based dashboard to track covid-19 in real time,” The Lancet infectious diseases, vol. 20, no. 5, pp. 533–534, 2020.
  • [52] United Nations, “Department of economic and social affairs,” Population Division, 2020.
  • [53] K. Bringmann, R. Keusch, and J. Lengler, “Geometric inhomogeneous random graphs,” Theoretical Computer Science, vol. 760, pp. 35–54, 2019.
  • [54] “Implementation of geometric inhomogeneous random graphs,” 2020. https://github.com/joostjor/random-graphs/blob/master/girg.py.
  • [55] A. Gómez-Carballa, J. Pardo-Seco, X. Bello, F. Martinón-Torres, and A. Salas, “Superspreading in the emergence of covid-19 variants,” Trends in Genetics, vol. 37, no. 12, pp. 1069–1080, 2021.
  • [56] T. R. Mercer and M. Salit, “Testing at scale during the covid-19 pandemic,” Nature Reviews Genetics, vol. 22, no. 7, pp. 415–426, 2021.
  • [57] J. O. Lloyd-Smith, S. J. Schreiber, P. E. Kopp, and W. M. Getz, “Superspreading and the effect of individual variation on disease emergence,” Nature, vol. 438, no. 7066, pp. 355–359, 2005.

Appendix A Detailed explanation of SSE detection and ECS assignment

Figure 6: (a) Bar plot showing the number of SARS-CoV-2 genetic sequences collected in Austria and uploaded to the GISAID platform over time, for each of the major variants. (b) Visualization of the sizes of 16 AACCs in Austria over time. Each individual plot shows the number of sequences in the AACC at a given date (denoted by $\mathrm{AACC}_{c,i}(t)$). The red squares (located often towards the beginning of an AACC) mark the SSEs detected using our proposed method. (c) Histogram of the $\mathrm{NormChange}_{c,i}(t)$ and the $\mathrm{Baseline}_{c,i}(t)$ values in Austria during the Delta wave. By definition, these values are larger than 9 for SSEs, and at most 9 for baselines. (d) Histogram of the $\mathrm{NormChange}_{c,i}(t+1)$, the $\mathrm{Baseline}_{c,i}(t+1)$ values, and the resulting ECS values in Austria during the Delta wave. The outcome of the pipeline, the $\mathrm{ECS}_{\mathrm{Austria,Delta}}$ value, is the median of the plotted ECS values (in this case 0.9).

We index AACCs only by the time $t$ (integer value measured in weeks since the first sequence), their country-variant pair denoted by $c$, and their cluster index $i$ (Figure 6 (b)). In order to track changes in AACC sizes, we are interested in the Normalized Change values defined as

$$\mathrm{NormChange}_{c,i}(t)=\frac{\mathrm{AACC}_{c,i}(t+1)-\mathrm{AACC}_{c,i}(t)}{\max\left(1,\sqrt{\mathrm{AACC}_{c,i}(t)}\right)}, \tag{6}$$

where $\mathrm{AACC}_{c,i}(t)$ denotes the size of the AACC indexed by $(c,i,t)$. The normalization with the square root of the AACC size accounts for the natural fluctuation of the cluster sizes. Indeed, assuming that the patients in the AACCs at time $t$ independently infect an identically distributed random number of new patients with the same amino acid signature at time $t+1$, by the Central Limit Theorem, we expect the fluctuations of $\mathrm{AACC}_{c,i}(t+1)$ to be proportional to the square root of $\mathrm{AACC}_{c,i}(t)$. Due to this normalization, NormChange values tend to be close to zero; in most countries 95% of the values fall between -3 and 5. We consider exceptionally large NormChange values as a sign of an SSE. Inspired by [38], we choose the threshold for the NormChange value of an SSE to be 9, and we provide a robustness analysis on this threshold parameter in Section B.1. The proposed SSE detection method is efficient, requires only minor preprocessing, and the detected SSEs agree with our intuition after visual inspection (Figure 6 (b)).
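
The detection rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the list-of-weekly-sizes input format, and the example numbers are assumptions made for this illustration. Only equation (6) and the default threshold of 9 come from the text.

```python
import math

SSE_THRESHOLD = 9  # default threshold stated in the text


def norm_change(size_t, size_next):
    """Equation (6): size change normalized by max(1, sqrt(size at t))."""
    return (size_next - size_t) / max(1.0, math.sqrt(size_t))


def detect_sses(sizes, threshold=SSE_THRESHOLD):
    """Return the weeks t at which NormChange(t) exceeds the threshold.

    `sizes` is a hypothetical list of weekly AACC sizes for one cluster,
    where sizes[t] plays the role of AACC_{c,i}(t).
    """
    return [
        t
        for t in range(len(sizes) - 1)
        if norm_change(sizes[t], sizes[t + 1]) > threshold
    ]


# Made-up weekly sizes for a single cluster (not real data).
example_sizes = [0, 1, 4, 40, 35, 20, 8, 2]
sse_weeks = detect_sses(example_sizes)
print(sse_weeks)
```

In this toy series, only the jump from 4 to 40 sequences is flagged, since (40 - 4) / sqrt(4) = 18 exceeds the threshold, while all other weekly changes stay well below it.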

Similarly to previous SSE detection methods based on thresholding genetic sequence counts [31, 38], our proposed method is imperfect, leading to both false positives and false negatives. However, since we only apply aggregate statistics on the identified SSEs, even such imperfect methods can provide important results, especially if the confounding factors can be ruled out. The main confounding factor in this case is sampling bias, as we know that different countries collected and sequenced samples with different strategies and at different rates [56]. To control for country-specific biases, we match each SSE $(c,i,t)$ with multiple baseline AACC timesteps with the same country-variant index $c$ and with similar size. We denote the median of the $\mathrm{NormChange}(t)$ values of the baselines as $\mathrm{Baseline}_{c,i}(t)$ (Methods 4.2). As shown in Figure 6 (c), the NormChange values at $t$ of SSEs are all larger than a threshold and follow a broad distribution, whereas the distribution of the $\mathrm{Baseline}_{c,i}(t)$ values is concentrated below the threshold (in Austria during the Delta wave; the results are similar for other country-variant pairs, shown in Supplementary Material C). Once the baselines are matched, we define our main notion of interest, the ECS, as the difference between the baseline value and the SSE NormChange value at time $t+1$:

$$\mathrm{ECS}_{c,i}(t)=\mathrm{Baseline}_{c,i}(t+1)-\mathrm{NormChange}_{c,i}(t+1). \tag{7}$$

We present the distribution of $\mathrm{NormChange}_{c,i}(t+1)$ values for SSEs, the $\mathrm{Baseline}_{c,i}(t+1)$ values for baseline events, along with the resulting ECS values in Austria, in Figure 6 (d). Since all country-variant pairs $c$ in our dataset had similarly broad, but unimodal ECS distributions as Figure 6 (d), we focused on their median values denoted by $\mathrm{ECS}_{c}$ (the output of the pipeline in Figure 1 (b)). As the mutation rate of SARS-CoV-2 (about one mutation every 2 weeks [55]) was higher than the effective reproduction rate (most of the time below 2), and AACCs can be thought of as sub-critical spreading processes, it is no surprise that the median values of the $\mathrm{NormChange}_{c,i}(t+1)$ values are negative for both the SSEs and the baselines. However, the sign of $\mathrm{ECS}_{c}$ adds non-trivial information. A positive $\mathrm{ECS}_{c}$ means that the normalized change of the number of genetic sequences in SSEs was smaller than in the baseline, which suggests that the SSEs led to fewer secondary infections than similarly sized non-SSE clusters of infectious individuals, i.e. the SSEs were well-contained. Similarly, a negative $\mathrm{ECS}_{c}$ would suggest SSEs that were not contained as well as the baselines in the same country during the same variant.
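
To make the sign convention of equation (7) concrete, the following minimal sketch computes per-SSE ECS values and their median for one country-variant pair. It assumes that the baseline events have already been matched to each SSE (the matching procedure of Methods 4.2 is not reproduced here); the function names and all numbers are hypothetical.

```python
from statistics import median


def ecs(sse_norm_change_next, baseline_norm_changes_next):
    """Equation (7) for one SSE: Baseline(t+1) - NormChange(t+1).

    `baseline_norm_changes_next` holds the NormChange(t+1) values of the
    baseline events assumed to be already matched to this SSE.
    """
    baseline_next = median(baseline_norm_changes_next)
    return baseline_next - sse_norm_change_next


def country_variant_ecs(events):
    """Median of the per-SSE ECS values of one country-variant pair."""
    return median(ecs(sse_next, base_next) for sse_next, base_next in events)


# Made-up inputs: (SSE NormChange at t+1, matched baseline NormChanges at t+1).
events = [
    (-2.0, [-1.5, -1.0, -0.5]),  # ECS = -1.0 - (-2.0) = 1.0
    (-0.5, [-1.2, -0.8, -1.0]),  # ECS = -1.0 - (-0.5) = -0.5
    (-3.0, [-2.0, -1.0, -1.5]),  # ECS = -1.5 - (-3.0) = 1.5
]
ecs_c = country_variant_ecs(events)
print(ecs_c)
```

A positive result here means that, in the median, the SSE clusters shrank faster in the following week than their matched baselines, which is the "well-contained" case described above.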

Appendix B Robustness analyses

B.1 Threshold for SSE Detection

We detect Superspreading Events (SSEs) by applying a threshold on the $\mathrm{NormChange}_{c,i}(t)$ values defined in equation (6). By default, this threshold is set to be 9 following [38], who chose this value based on the theoretical justification of [57]. A notable difference between our approach and the referenced papers is that they assume the SSEs to start from a single source, which can be identified in the dataset (e.g. via contact tracing), and they apply the threshold on the number of secondary cases of the source. In our approach, we do not assume that we can identify the source of the SSE; we are only interested in detecting the occurrence of SSEs based on AACC sizes. For instance, if $\mathrm{AACC}_{c,i}(t)=10$ and $\mathrm{AACC}_{c,i}(t+1)=100$, then we suspect that this unexpected increase is due to an SSE that occurred at $t$, but we do not know which patient caused the SSE. In principle, it is possible that not one but multiple patients with the same amino acid signature caused independent and simultaneous SSEs. However, since this is an unlikely event, we can safely ignore it without significantly impacting our aggregate statistics. In our approach, it is important to also account for the fact that $\mathrm{AACC}_{c,i}$ changing from 5 to 50 is not the same as a change from 500 to 545, as larger AACC sizes also have larger natural fluctuations. Assuming that (due to the Central Limit Theorem), if no SSE occurs, AACC sizes behave similarly to Gaussian random variables with their mean and variance proportional to $\mathrm{AACC}_{c,i}(t)$, we normalize the AACC size changes by the square root of $\mathrm{AACC}_{c,i}(t)$ in the definition of the NormChange function. When $\mathrm{AACC}_{c,i}(t)=1$, we get back the setup of [38], which motivated us to choose the same threshold for SSE detection as they did.
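
As a rough numerical illustration of the size changes mentioned above, plugging them into equation (6) gives (values rounded to one decimal place):

$$
\frac{50-5}{\max(1,\sqrt{5})}\approx 20.1,\qquad \frac{545-500}{\max(1,\sqrt{500})}\approx 2.0,\qquad \frac{100-10}{\max(1,\sqrt{10})}\approx 28.5 .
$$

With the default threshold of 9, the first and third changes would be flagged as SSEs, while the second would not, even though the first two correspond to the same absolute increase of 45 sequences.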

Figure 7: The number of SSEs detected at different thresholds stratified by country. Only datapoints with y-value above 20 are shown. As the threshold increases, fewer and fewer events are classified as SSEs.

To further strengthen the validity of our results, we present a robustness analysis on the threshold parameter. First, in Figure 7 we show the number of detected SSEs in various European countries as a function of the threshold parameter, if at least 20 SSEs were detected (and therefore qualified for our analysis). As expected, the number of SSEs is a monotone decreasing function of the threshold. Moreover, on the log-scale plot the number of detected SSEs appears to decrease exponentially with the threshold, indicating that it is sufficient to perform the robustness analysis in a relatively narrow parameter range. We selected the interval [7,11] because a threshold of 11 only detects a minimal number of SSEs in many countries, making them ineligible for our analysis, while a threshold of 7 results in a high number of SSEs, potentially leading to an excessive number of false positives.
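
The threshold sweep behind this kind of analysis can be sketched as follows. This is a toy illustration under stated assumptions rather than the actual analysis code: the helper names, the input format (one list of weekly sizes per cluster), and the example clusters are hypothetical, while the normalization of equation (6), the integer thresholds in [7,11], and the eligibility cutoff of 20 SSEs are taken from the text.

```python
import math

MIN_SSES = 20  # eligibility cutoff stated in the text


def norm_change(size_t, size_next):
    """Equation (6): size change normalized by max(1, sqrt(size at t))."""
    return (size_next - size_t) / max(1.0, math.sqrt(size_t))


def count_sses(clusters, threshold):
    """Count the (cluster, week) pairs whose NormChange exceeds the threshold."""
    return sum(
        1
        for sizes in clusters
        for t in range(len(sizes) - 1)
        if norm_change(sizes[t], sizes[t + 1]) > threshold
    )


def eligible_thresholds(counts, min_sses=MIN_SSES):
    """Thresholds at which enough SSEs were detected to enter the analysis."""
    return [thr for thr, n in sorted(counts.items()) if n >= min_sses]


# Made-up weekly sizes of four clusters in one hypothetical country.
clusters = [[1, 9, 5, 2], [4, 25, 10, 3], [1, 12, 6, 1], [9, 40, 30, 5]]
counts = {thr: count_sses(clusters, thr) for thr in range(7, 12)}
print(counts)
# The toy dataset is tiny, so a smaller cutoff is used for the demonstration.
print(eligible_thresholds(counts, min_sses=3))
```

Because an event that exceeds a higher threshold also exceeds every lower one, the counts can never increase as the threshold grows, which is the monotonicity noted above.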

In Figures 8-12, we recreated Figure 3 for each integer SSE detection threshold in the range [7,11], for all of the major SARS-CoV-2 variants. There is some variability in the results for different thresholds (mostly due to new countries entering the dataset as the threshold decreases). Nevertheless, besides the correlation between $\mathrm{ECS}_{c}$ and the sampling date in the Delta wave, which is also mentioned in the main text, the most significant correlations remain between $\mathrm{ECS}_{c}$ and the CHI in the Delta and the Omicron waves. Moreover, for the threshold value 7 the correlation between $\mathrm{ECS}_{c}$ and the CHI becomes significant even in the Alpha wave, which is most likely explained by the fact that lower thresholds add more countries to the dataset and increase the statistical power of the results. These additional results further strengthen the conclusion made in the main text that, among the available exogenous variables, which include potential confounding factors (sampling rate, attack rate), $\mathrm{ECS}_{c}$ is most correlated with the CHI (the most direct measure of human behavior).

B.2 Threshold for the Number of Baseline Events

In Methods 4.2, we defined a parameter $m$, which sets the minimum number of baseline events that are matched with each detected SSE in the dataset. We expect that if we chose only one baseline event, the results could look very noisy; therefore, we set $m=10$ by default to ensure at least $2m=20$ baseline events. In Figures 13-15 we recreate Figure 3 with $m\in\{5,20,40\}$ to show that the precise value of $m$ is not important, as long as $m$ is sufficiently high.

Appendix C NormChange and ECS values in each country-variant pair

In Figure 2 (c) and (d) we plotted the histograms of the $\mathrm{NormChange}_{c,i}(t)$, the $\mathrm{Baseline}_{c,i}(t)$, the $\mathrm{NormChange}_{c,i}(t+1)$, the $\mathrm{Baseline}_{c,i}(t+1)$, and the resulting $\mathrm{ECS}_{c,i}(t)$ values of the detected SSEs for Austria during the Delta wave as an example. For completeness, in the left and the middle columns of Figures 16-18 we include the same plots for all country-variant pairs that are included in Figure 5 (b).

Furthermore, in the right column of Figures 16-18 we include a plot of the $\mathrm{ECS}_{c,i}(t)$ values and the number of reported cases against the time variable $t$. In this plot we also show the automatically detected “during wave” periods that are used to create Figure 5 (b).
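
The caption of Figure 16 describes the “during wave” detection as ranking the detected SSEs by the number of cases reported in the same week and taking the time interval spanned by the top 20% of them. A minimal sketch of that rule is given below; the function name, the input format, the rounding of the top 20% (here rounded up, with at least one SSE kept), and the example data are all assumptions made for illustration.

```python
import math


def during_wave_interval(sse_weeks, weekly_cases, top_fraction=0.2):
    """Return (first_week, last_week) spanned by the top-ranked SSEs.

    `sse_weeks` lists the weeks of the detected SSEs, and `weekly_cases[t]`
    is the number of cases reported in week t (both hypothetical inputs).
    """
    # Rank SSEs by the case count reported in the same week, highest first.
    ranked = sorted(sse_weeks, key=lambda t: weekly_cases[t], reverse=True)
    # Assumption: round the top fraction up and keep at least one SSE.
    n_top = max(1, math.ceil(top_fraction * len(ranked)))
    top = ranked[:n_top]
    return min(top), max(top)


# Made-up data: weekly reported cases and the weeks of ten detected SSEs.
weekly_cases = [10, 20, 50, 80, 200, 900, 1500, 1200,
                700, 300, 150, 90, 60, 40, 30, 20]
sse_weeks = [2, 3, 5, 6, 7, 8, 10, 12, 13, 15]
wave = during_wave_interval(sse_weeks, weekly_cases)
print(wave)
```

In this toy example the two SSEs with the highest same-week case counts occur in weeks 6 and 7, so the detected “during wave” period spans those two weeks.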

Figure 8: The figure shows what Figure 3 would look like if a threshold of 7 were chosen instead of the default value (9), for the pre-Alpha, Alpha, Delta, and Omicron waves. Grey background signifies a statistically significant correlation between $\mathrm{ECS}_{c}$ and the exogenous variable.
Figure 9: The figure shows what Figure 3 would look like if a threshold of 8 were chosen instead of the default value (9), for the pre-Alpha, Alpha, Delta, and Omicron waves. Grey background signifies a statistically significant correlation between $\mathrm{ECS}_{c}$ and the exogenous variable.
Figure 10: The figure shows Figure 3 for the pre-Alpha, Alpha, Delta, and Omicron waves with the default threshold (9). Grey background signifies a statistically significant correlation between $\mathrm{ECS}_{c}$ and the exogenous variable.
Figure 11: The figure shows what Figure 3 would look like if a threshold of 10 were chosen instead of the default value (9), for the pre-Alpha, Alpha, Delta, and Omicron waves. Grey background signifies a statistically significant correlation between $\mathrm{ECS}_{c}$ and the exogenous variable.
Figure 12: The figure shows what Figure 3 would look like if a threshold of 11 were chosen instead of the default value (9), for the pre-Alpha, Alpha, Delta, and Omicron waves. Grey background signifies a statistically significant correlation between $\mathrm{ECS}_{c}$ and the exogenous variable.
Figure 13: The figure shows what Figure 3 would look like if $m=5$ were chosen instead of the default value ($m=10$) when matching baseline events to SSEs, for the pre-Alpha, Alpha, Delta, and Omicron waves. Grey background signifies a statistically significant correlation between $\mathrm{ECS}_{c}$ and the exogenous variable.
Figure 14: The figure shows what Figure 3 would look like if $m=20$ were chosen instead of the default value ($m=10$) when matching baseline events to SSEs, for the pre-Alpha, Alpha, Delta, and Omicron waves. Grey background signifies a statistically significant correlation between $\mathrm{ECS}_{c}$ and the exogenous variable.
Figure 15: The figure shows what Figure 3 would look like if $m=40$ were chosen instead of the default value ($m=10$) when matching baseline events to SSEs, for the pre-Alpha, Alpha, Delta, and Omicron waves. Grey background signifies a statistically significant correlation between $\mathrm{ECS}_{c}$ and the exogenous variable.
Figure 16: Detailed visualization of the NormChange and the ECS values in each country-variant pair that was included in Figure 5 (b). Left column: Histogram of the $\mathrm{NormChange}_{c,i}(t)$ and the $\mathrm{Baseline}_{c,i}(t)$ values. Middle column: Histogram of the $\mathrm{NormChange}_{c,i}(t+1)$ values, the $\mathrm{Baseline}_{c,i}(t+1)$ values, and the resulting ECS values. Right column: Scatter plot of the $\mathrm{ECS}_{c,i}(t)$ values vs $t$ and the plot of the number of reported cases vs $t$. We automatically detected “during wave” periods (marked with the red area) by ranking the detected SSEs by the number of cases reported during the same week, and taking the time interval spanned by the top 20% of the SSEs.
Figure 17: Figure 16 continued.
Figure 18: Figure 17 continued.
Figure 19: ECS values computed on synthetically generated genetic sequence data similarly to Figure 4, except with linear local and global awareness functions (see Methods 4.4 for the precise function definition).
Figure 20: Size, degree distribution and average clustering coefficient of the selected real and synthetic networks