Epidemic-induced local awareness behavior inferred from surveys
and genetic sequence data

Gergely Ódor
Department of Network and Data Science, Central European University, Vienna, Austria

Márton Karsai
Department of Network and Data Science, Central European University, Vienna, Austria
National Laboratory of Health Security, HUN-REN Alfréd Rényi Institute of Mathematics, Budapest, Hungary

Abstract

Behavior-disease models suggest that if individuals are aware and take preventive actions when the prevalence of the disease increases among their close contacts, then the pandemic can be contained in a cost-effective way. To measure the true impact of local awareness behavior on epidemic spreading, we propose an efficient approach to identify superspreading events and assign corresponding Event Containment Scores (ECSs) in clinical genetic sequence data.

We validate ECS as a measure of local awareness in simulation experiments, and we find that ECS was correlated positively with policy stringency during the COVID-19 pandemic. Finally, we observe a temporary drop in ECS during the Omicron wave in most European countries, matching a survey experiment we carried out at the same time. Our findings bring important insight into the field of awareness modeling through the analysis of large-scale genetic sequence data, one of the most promising data sources in epidemics research.

1 Introduction

The COVID-19 pandemic has highlighted several pivotal shortcomings that demand comprehensive examination within our society [1]. One of the most important lessons was the need for more effective social interventions, which can ensure the adherence to the necessary containment measures during future pandemics [1, 2]. Manifesting as a social dilemma, restrictive measures generate a conflict between long-term collective interest and short-term self-interest [3], and it can be difficult to convince individuals to cooperate, especially if the cooperative behavior needs to be sustained for longer time periods [4, 5, 6]. Among interventions that raise awareness and promote cooperative behavior, a combination of community engagement, accurate monitoring, and transparent reporting of the impact of restrictions has been found the most consistently effective approach [7, 8].

Recognizing the importance of the problem, the research community responded to the emergence of the COVID-19 pandemic by closely monitoring and actively reporting the changes in epidemic awareness [9, 10]. However, most of these studies focused on global awareness (i.e., adherence to governmental restrictions), while only a few studies exist on the impact of local awareness (i.e., behavioral changes adaptive to the local prevalence of the disease), even though there is substantial model-based evidence that local awareness can be more effective in reducing the pandemic threshold and reducing the size of the epidemic compared to its global counterpart [11, 12, 13, 14]. The bias towards global awareness can be partially explained by the limited data availability on the local scale, due to privacy concerns [15, 16].

To fill the gap in monitoring local awareness behavior, we conducted a representative telephone survey of 9000 participants over 9 months during the Delta and the Omicron waves in Hungary as part of the MASZK national survey [17]. The responders were asked to rate their willingness to undertake stricter preventive measures (such as increased mask wearing or social distancing) if the prevalence of the disease increased among their close contacts. The survey results show an unexpected pattern (Figure 1 (a)). While local awareness scores stayed relatively constant throughout the collection period, including the Delta wave of the pandemic, we observed a drop in awareness during the Omicron wave, which rebounded promptly after the wave had ended.

Figure 1: (a) The MASZK Hungarian telephone survey, with 1000 participants in each of the 9 months, shows that the mean awareness score remains relatively constant throughout the recording period, except during the Omicron wave, when the awareness scores drop. The government-imposed preventive measures (mask wearing, social distancing) show a different temporal pattern. (b) Our proposed pipeline to process synthetic (blue) and real genetic sequence data (grey) to compute Event Containment Scores (ECS), a proxy for local awareness behavior.

The measured awareness scores show a distinctive temporal pattern compared to the standard protective measures, which we also assessed in the same survey. Figure 1 (a) shows that mask wearing stayed constant throughout both the Delta and the Omicron waves, while social distancing dropped during the Omicron wave, but did not rebound after the wave had ended. These additional survey results also rule out the hypotheses that the drop in awareness scores can be explained exclusively by the responders' inability to perform stricter measures during the Omicron wave, or by the relatively lower risk of hospitalization and death posed by the Omicron variant.

According to our interpretation, the observed drop in awareness scores can be attributed to a form of pandemic fatigue [4, 5]: the demotivation to engage in preventive behavior due to the complex interplay of various psychological factors. However, since the general adherence to regulations showed a very different pattern compared to the awareness behavior in Figure 1 (a), the observed “awareness fatigue” is likely to have a very different psychological explanation, which our survey was not designed to reveal. Instead of speculating about the mechanisms of the observed phenomenon, we focus on two important questions about the impact of our finding: (i) do other countries show similar changes in awareness behavior? (ii) does the observed drop in self-reported awareness have a measurable impact on the spread of the epidemic? To answer these questions, we turn to the analysis of large-scale genetic sequence data, which contains hidden but accessible information about the local spread of the epidemic.

1.1 Inferring Local Awareness from Genetic Sequence Data

While genetic data raises relatively minor privacy concerns [18], and it is unparalleled in terms of availability, extracting behavioral information from genetic sequences is a challenging task. In phylodynamics [19, 20], human behavior is typically inferred based on the phylogenetic tree reconstructed from the observed sequences [21]. However, current tree reconstruction methods have a number of limitations. First, traditional methods are computationally intensive and it is difficult to scale them to datasets with more than a few thousand sequences [22, 23]. Since the COVID-19 pandemic, there has been significant progress in developing more scalable methods [24], and releasing publicly available trees for further analysis [25, 26]. However, processing millions of SARS-CoV-2 genetic sequences remains a challenge [27], and the publicly shared pre-computed trees do not have the same coverage as the Global Initiative on Sharing All Influenza Data (GISAID) dataset, which contains over 16 million SARS-CoV-2 genetic sequences, with a 5-15% sampling rate in several countries [28]. Second, working with general-purpose methods or highly pre-processed datasets can significantly lower the statistical power of our results, especially since previous methods were not optimized to measure local-awareness behavior. Instead, we process this new dataset of unprecedented size by focusing on a simple and tractable statistic that does not require the reconstruction of the phylogenetic tree: the size distribution of the clusters of identical genetic sequences over time. Similar tree-free methods with different applications have been recently proposed by [29, 30, 31]. In essence, we break up the global epidemic into thousands of sub-epidemics with identical genetic code to infer patterns of local awareness. Since each sub-epidemic contains only very noisy information about the general local awareness patterns in the population, we focus on one of the most robust features of the dataset: Superspreading Events (SSEs).

The role of SSEs as the driving force of the COVID-19 pandemic was well-established in early 2020 [32]. Since then, there has been a remarkable research effort to understand the potential of targeted interventions to prevent or contain SSEs [33, 34, 35], and to document the effect of these interventions in case studies based on contact tracing [36, 37]. It has also been shown that the downstream infection patterns of SSEs can be observed from phylogenetic trees [38], which can be used to infer signs of awareness behavior. However, since phylogenetic tree reconstruction is not applicable to large-scale datasets, a new methodology to quantify the impact of awareness behavior from genetic data is needed.

Inspired by [31, 38], we develop a pipeline to detect SSEs based on the size distribution of clusters of identical genetic sequences, and to measure the resulting secondary infections by assigning each SSE an Event Containment Score (ECS, see Figure 1 (b)). Intuitively, ECS is a proxy for the level of adaptive local-awareness behavior, which we confirm via extensive simulation results on synthetic epidemic models with local awareness. To validate the ECS in real data, we compute the ECS of European countries based on a dataset of over 5 million genetic sequences collected through 145 weeks. We demonstrate that the ECS correlates positively with the Oxford Containment Health Indices [39] in the selected countries, but not with some of the potential confounders, such as the sampling rate, the attack rate or the sizes of the SSEs. Finally, by comparing ECSs during an epidemic wave and between waves in each country in our dataset, we observe that local awareness dropped during the Omicron wave in multiple countries during the COVID-19 pandemic, and that it had a measurable impact on the spread of the disease. In addition to providing evidence for the impact of local awareness in multiple countries, our methods pave the way for future interdisciplinary studies that monitor behavioral patterns using large-scale genetic sequence data.

2 Results

2.1 Event Containment Scores on the COVID-19 Genetic Dataset

Our analysis is based on the detection of SSEs and the assignment of ECSs to each SSE by quantifying secondary infections (Figure 1 (b)). As the first step of the pipeline, we downloaded the entire GISAID EpiCoV database between March 2020 and March 2023 [28]. Although the database contains sequences from over 200 countries worldwide, throughout the paper we focused on European countries, since this region had the highest sampling rate, with suitably different but comparable countries from a behavioral perspective. Besides the raw nucleotide sequences, the dataset also contains various metadata, such as the date and the location of the sample (usually at the country or county level). Moreover, the database contains the amino-acid-level substitutions of each sequence compared to the WIV04 reference sequence collected in late 2019 in Wuhan. Although the amino-acid-level substitution data is more aggregated than the raw genetic data (three nucleotides encode one amino acid, with multiple triplets having the same encoding), it still contains highly detailed information about the genetic code of the samples, and it is computationally more tractable to process, since the alignment of the raw genetic codes can be omitted. We preprocess the dataset by partitioning the genetic sequences with identical amino acid substitutions into subsets, which we call Amino Acid Collision Clusters (AACCs). We group together AACCs that were collected in the same country and that belong to the same variant, as it is often assumed that SARS-CoV-2 viruses with identical Greek letters (e.g., Alpha, Delta, Omicron, see Figure 2 (a)) have similar fitness profiles [40]; that is, there is no selection between them, and the infection probability and recovery time of the patient are similar.
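The partitioning step described above can be illustrated with a minimal sketch. The record layout, the substitution labels and the function name `build_aacc_sizes` are hypothetical, and counting samples per week is an assumption about how AACC sizes are tracked; the sketch only shows the idea of grouping sequences whose amino-acid substitution sets collide within a country-variant pair.

```python
from collections import Counter, defaultdict

def build_aacc_sizes(records):
    """Group sequences into AACCs and count samples per week.

    `records` is an iterable of (country, variant, substitutions, week)
    tuples (a hypothetical layout), where `substitutions` is any iterable
    of amino-acid substitution labels.
    Returns {(country, variant): {cluster_key: Counter({week: n_samples})}}.
    """
    sizes = defaultdict(lambda: defaultdict(Counter))
    for country, variant, substitutions, week in records:
        # Two sequences collide when their substitution sets are identical,
        # regardless of the order in which the substitutions are listed.
        cluster_key = frozenset(substitutions)
        sizes[(country, variant)][cluster_key][week] += 1
    return sizes

# Toy records with made-up labels, for illustration only.
records = [
    ("HU", "Delta", ["Spike_L452R", "Spike_T478K"], 0),
    ("HU", "Delta", ["Spike_T478K", "Spike_L452R"], 0),
    ("HU", "Delta", ["Spike_L452R", "Spike_T478K"], 1),
    ("HU", "Delta", ["Spike_L452R"], 1),
    ("AT", "Delta", ["Spike_L452R", "Spike_T478K"], 0),
]
sizes = build_aacc_sizes(records)
```

In this toy input, the first three Hungarian records fall into one AACC (two samples in week 0, one in week 1), while the Austrian record with the same substitutions forms a separate AACC because clusters are kept per country-variant pair.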

We detect SSEs in each AACC by tracking unexpectedly large increases in their size after proper normalization (Methods 4.1). Our SSE detection method is closely related to previous thresholding approaches [31, 38], requires only minor preprocessing, and the detected SSEs agree with our intuition after visual inspection (Figure 2 (b)). Thereafter, we assign Event Containment Scores (ECSs) to each SSE by comparing the size of the AACCs after SSEs and after appropriately selected baseline events (Methods 4.2). Finally, to acquire aggregate descriptions of event containment, we compute the median of $\mathrm{ECS}$ values in each country-variant pair $c$ , denoted by $\mathrm{ECS}_{c}$ (the output of the pipeline in Figure 1 (b)). Intuitively, a positive $\mathrm{ECS}_{c}$ means that SSEs typically led to smaller AACC sizes, and therefore fewer secondary infections than the baselines, i.e. the SSEs were well-contained (Figure 2 (b), red squares). Similarly, a negative $\mathrm{ECS}$ would suggest SSEs that were not contained as well as the baselines (Figure 2 (b), blue squares).

Both the SSE detection and the ECS assignment algorithms are efficient but imperfect methods, potentially introducing significant amounts of noise in our results. However, we expect that if enough SSEs are detected in a country-variant pair, the median of the ECS values will still contain information about event containment, and subsequently, awareness behavior. We confirm this hypothesis by the analysis of COVID-19 genetic sequences in this section, and by simulation results in section 2.2.

We compute the $\mathrm{ECS}_{c}$ values for all country-variant pairs with at least 20 detected SSEs, and we analyse how these values are related to behavioral metrics and potential confounding factors. Concentrating on the Delta and the Omicron variants, as these two variants had the highest sampling rate and the highest number of countries with at least 20 SSEs, Figure 3 (a) and (d) show a large variability between the computed $\mathrm{ECS}_{c}$ values, suggesting a large variability in the efficacy of SSE containment in these countries. For some countries, such as Austria and Germany, the $\mathrm{ECS}_{c}$ values are positive in both waves, suggesting efficient SSE containment. For other countries, such as Denmark, Switzerland and Sweden, the $\mathrm{ECS}_{c}$ values are negative, suggesting inefficient SSE containment compared to the baseline. There are also countries, such as Ireland and Slovenia, where the sign of the $\mathrm{ECS}_{c}$ value changes between the two waves.

	Sequencing rate		Attack rate		CHI
	Delta (a)	Omicron (d)	Delta (b)	Omicron (e)	Delta (c)	Omicron (f)
Spearman-r statistic	-0.833	-0.283	-0.033	0.050	0.733	0.850
Spearman-r p-value	0.005	0.460	0.932	0.898	0.025	0.004

Table 1: Spearman-r statistics and p-values computed between the $\mathrm{ECS}_{c}$ values and the exogenous variables for each of the six plots in Figure 3 (a)-(f). Cells with a significant p-value are colored grey.

To understand the factors that could explain the variability in the $\mathrm{ECS}_{c}$ values, we compute the sampling rate, the attack rate and the Containment Health Index (CHI) in each (country-wave) pair (Methods 4.2). CHI is a composite epidemic response measure based on thirteen policy indicators maintained by the Oxford Coronavirus Government Response Tracker (OxCGRT) project, similarly to the stringency index [39]. We plot these exogenous variables in Figure 3, and we compute the Spearman-r statistic between them and the $\mathrm{ECS}_{c}$ values (Table 1). We find statistically significant correlation between the $\mathrm{ECS}_{c}$ values and the CHI in both waves (Figure 3 (c) and (f)), and between the $\mathrm{ECS}_{c}$ values and the sampling rate in the Delta wave (Figure 3 (a)). Since the sampling rate is a potential confounding factor, the latter result could suggest that our results are artefacts of the sampling procedure. However, in the Delta wave, sampling rate and CHI happened to be highly and negatively correlated, potentially because certain countries aimed to lift the economic burden of strict containment policies by a higher quality sequencing and monitoring project. In the other waves the correlation is only significant between the $\mathrm{ECS}_{c}$ and the CHI (i.e., in the Omicron wave, and in the Alpha wave with a different SSE detection threshold, see Supplementary Material B.1).
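For readers less familiar with the Spearman-r statistic used in Table 1, the following is a minimal pure-Python sketch: it ranks both variables (ties receive average ranks) and computes the Pearson correlation of the ranks. The function names and the toy numbers are illustrative assumptions, not values from our dataset, and the sketch does not compute p-values; in practice a library routine such as `scipy.stats.spearmanr` returns both the statistic and the p-value.

```python
import math

def average_ranks(values):
    """Return 1-based ranks of `values`, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend `end` over the run of tied values starting at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Toy example: five hypothetical countries (made-up numbers).
ecs_values = [0.4, -0.2, 0.1, -0.5, 0.3]
chi_values = [60, 45, 52, 40, 50]
r = spearman_r(ecs_values, chi_values)  # 0.9 for this toy input
```

Because the statistic depends only on ranks, it is insensitive to the scale of the exogenous variable, which is convenient when comparing quantities as different as a sampling rate and a policy index.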

The correlation between the Containment Health Index and the $\mathrm{ECS}_{c}$ values in Figure 3 is an indication that $\mathrm{ECS}_{c}$ measures a behavioral signal instead of noise or confounding effects. To test whether this behavior is indeed local awareness behavior, we conduct a large-scale simulation experiment in Section 2.2, and subsequently we return to the empirical analysis in Section 2.3.

2.2 Event Containment Scores on Synthetic Genetic Sequence Data

We set up a synthetic pipeline to generate genetic sequence datasets similar to the GISAID EpiCoV dataset, which we can analyse with our SSE detection and ECS assignment pipeline (Figure 1 (b)). First, we simulate Susceptible-Infected-Recovered (SIR) epidemics on various synthetic and real networks (Methods 4.3-4.4), then we apply the Jukes-Cantor (JC) [41] genetic substitution model on the resulting infection tree to produce genetic sequence data (Methods 4.5), and finally we compute the $\mathrm{ECS}_{c}$ values as before (except that now $c$ denotes the model parameters instead of the country-variant pair, see also Methods 4.2).
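To make the sequence-generation step more concrete, the sketch below propagates a root sequence down an infection tree with a discrete, per-transmission simplification of the Jukes-Cantor model. This is an assumption made for illustration only: the function names, the tree representation (`parent_of`) and the per-site substitution probability `mu` are hypothetical, and the actual procedure is the one described in Methods 4.5.

```python
import random

NUCLEOTIDES = "ACGT"

def jc_mutate(sequence, mu, rng):
    """Copy `sequence`, substituting each site independently with probability mu.

    Under Jukes-Cantor all substitutions are equally likely, so a mutating
    site moves to one of the three other nucleotides uniformly at random.
    """
    out = []
    for base in sequence:
        if rng.random() < mu:
            base = rng.choice([b for b in NUCLEOTIDES if b != base])
        out.append(base)
    return "".join(out)

def sequences_on_tree(parent_of, root_sequence, mu, seed=0):
    """Assign a sequence to every node of an infection tree.

    `parent_of` maps each infected node to its infector (the root maps to
    None). Each transmission applies one round of jc_mutate, which is a
    simplifying assumption of this sketch.
    """
    rng = random.Random(seed)
    sequences = {}

    def resolve(node):
        if node not in sequences:
            parent = parent_of[node]
            if parent is None:
                sequences[node] = root_sequence
            else:
                sequences[node] = jc_mutate(resolve(parent), mu, rng)
        return sequences[node]

    for node in parent_of:
        resolve(node)
    return sequences

# Toy infection tree: a infects b and c, b infects d.
tree = {"a": None, "b": "a", "c": "a", "d": "b"}
seqs = sequences_on_tree(tree, "ACGTACGTACGT", mu=0.05)
```

With a small `mu`, many infected nodes inherit an unchanged sequence, which is what produces clusters of identical sequences in the synthetic data.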

For the underlying network, we select four real social networks and three types of synthetic random networks. Two company friendship networks [42], which encode personal connections (recorded by Facebook), have medium size (around 5000 nodes) and characteristics similar to those of the contact networks on which a viral disease (such as SARS-CoV-2) can spread. Two online social networks, the Google+ friendship network [43] and the Twitter mutual mention network [44], are large (over 200,000 nodes), and they model the underlying network of online contagion processes (e.g., rumor, misinformation). All four networks have a heterogeneous degree distribution and a relatively high clustering coefficient (Supplementary Figure 20). To model these characteristics separately, we select three synthetic network models: the Configuration Model has a heterogeneous degree distribution but no clustering, the Stochastic Block Model (SBM) has high clustering but a homogeneous degree distribution, and the Geometric Inhomogeneous Random Graph (GIRG) model [45] has both a heterogeneous degree distribution and high clustering (Methods 4.3). On all network models, due to the heterogeneous degree distribution (or the community structure in case of the SBM), we expect large infection events that can be detected via the SSE detection pipeline outlined in Section 2.1.

We model local and global awareness in our simulations as a modification of the SIR model with adaptively changing infection probabilities (Methods 4.4). Inspired by [46], for local awareness we set the infection probability of an infectious node $u$ at time $t$ to be

\beta_{u,t}=\beta_{0}e^{-\alpha_{l}I_{u,t}},

(1)

where $\beta_{0}\in[0,1]$ is the basic infection probability, $\alpha_{l}$ sets the strength of the local awareness behavior, and $I_{u,t}$ is the number of infectious neighbors of node $u$ at time $t$ . In case of the global awareness, all infectious nodes $u$ have the same infection probability at time $t$ :

\beta_{u,t}=\beta_{0}e^{-\alpha_{g}I_{t}/N},

(2)

where $I_{t}$ is the total number of infectious nodes in the network, $\alpha_{g}$ sets the strength of the global awareness behavior, and $N$ is the size of the network. The exponential function in equation (1) (resp., (2)) aims to model a scenario where each neighbor (resp., node) may alert node $u$ about their infectious status, and each of these independent alerts causes a multiplicative reduction in the infection probability (similarly to alternative approaches where awareness is modeled as another contagion process, and the probability of staying unaware decays exponentially in the number of aware neighbors [11, 12, 13]). As a robustness check, we also implement linearly decaying awareness functions, since it has been reported that they may be more cost-effective from an epi-economic point of view [47] (Supplementary Figure 19).
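Equations (1) and (2) can be written down directly. The sketch below is a plain transcription with hypothetical function and argument names, and the parameter values in the example are arbitrary.

```python
import math

def beta_local(beta0, alpha_l, infectious_neighbors):
    """Equation (1): infection probability under local awareness."""
    return beta0 * math.exp(-alpha_l * infectious_neighbors)

def beta_global(beta0, alpha_g, total_infectious, n_nodes):
    """Equation (2): infection probability under global awareness."""
    return beta0 * math.exp(-alpha_g * total_infectious / n_nodes)

# Arbitrary illustrative parameters.
beta0, alpha_l, alpha_g = 0.2, 0.5, 5.0

no_alert = beta_local(beta0, alpha_l, 0)      # equals beta0
two_alerts = beta_local(beta0, alpha_l, 2)    # beta0 * exp(-1)
ten_percent = beta_global(beta0, alpha_g, 100, 1000)  # beta0 * exp(-0.5)
```

Each additional infectious neighbor multiplies the local infection probability by the same factor $e^{-\alpha_{l}}$, which is the multiplicative reduction per alert described above.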

In Figure 4, we plot the dependence of $\mathrm{ECS}_{c}$ on the awareness-strength parameters $\alpha_{l}$ and $\alpha_{g}$ and two potential confounding factors: the basic infection probability $\beta_{0}$ , and the sampling probability $p$ . The results indicate that $\mathrm{ECS}_{c}$ primarily depends on the parameter $\alpha_{l}$ (Figure 4 (a)). Importantly, we were only able to generate positive $\mathrm{ECS}_{c}$ values with the local awareness model, which is a strong indication that the positive $\mathrm{ECS}_{c}$ values observed in the empirical dataset (Figure 3) are signs of local awareness behavior.

The observation that only local awareness can produce positive $\mathrm{ECS}_{c}$ values has an intuitive explanation. When an SSE occurs, there is usually a common trait between the individuals that become infected at the same time: they all tend to belong to the same community as the initial infector. It is also likely that there exist many additional individuals that belong to the same community, but do not become immediately infected. Indeed, reports of early SSEs during COVID-19 do not report all individuals becoming infected in the communities at the same time [48, 49], and the same is true in simulations, unless the infection probability inside the community is close to 1. If the structure of the contact network remains unchanged after the SSE, then these additional community members become infected in the next timestep (week), which causes the number of sequences in the AACC to grow, and therefore produces a negative $\mathrm{ECS}_{c}$ value. Note that there are extreme examples of static networks and epidemic parameters that produce a positive $\mathrm{ECS}_{c}$ value. For instance, in a star network with infection probability close to 1, an epidemic from the center node produces a single SSE, and then dies out in the next step, resulting in $\mathrm{ECS}_{c}>0$ . However, we conclude that besides a few extreme cases, positive $\mathrm{ECS}_{c}$ values, such as the ones observed in the empirical dataset in Figure 3, are signs of local awareness behavior.

2.3 Local Awareness During And Between Waves in Genetic Sequence Data

Having validated ECSs in the GISAID EpiCoV and in synthetic datasets, we return to the main question posed in the Introduction: whether drops in local awareness behavior can be observed in the genetic sequence dataset during the Omicron wave of the COVID-19 pandemic. So far, we computed the $\mathrm{ECS}_{c}$ value as the median of all ECS values for a country-variant pair. However, since in certain countries, most notably in the United Kingdom, we detect thousands of SSEs in multiple variants, we can obtain a signal at a higher temporal resolution by computing the median of ECS values on a monthly basis. The resulting signal (Figure 5 (a)), obtained purely based on genetic sequence data, shares a remarkable similarity with the Hungarian awareness survey results in Figure 1 (a). Both curves show a relatively stable signal between October 2021 and July 2022, with a significant drop during the (largest) peak(s) of the Omicron variant, suggesting that the two methods may measure similar behavioral patterns.

Given the different nature of the two datasets, obtained with different methodologies from different countries, it is important to interpret the comparison of Figures 1 (a) and 5 (a) carefully. A significant difference between the two countries is that both the Delta and the Omicron waves arrived a few months earlier in the UK compared to Hungary, and during the UK Delta wave, the reported case counts showed a plateau instead of a clear epidemic wave. The Omicron wave arrived in the UK during this plateau, whereas in Hungary the case counts dropped between the two waves. Moreover, the UK had more stringent government restrictions implemented at the end of 2021 than Hungary [39], which could have an influence on local behavioral patterns as well.

Since other countries in the genetic dataset have too few samples to perform a monthly aggregation as we did in the UK, for each country-variant pair we subdivide the dataset into two groups: “during wave” and “outside of wave”. More precisely, in each country-variant pair, we rank the SSEs based on the total number of reported cases in the country at the time, and we classify the top 20% of the SSEs as “during wave” and the remaining 80% as “outside of wave” (Supplementary Material C). Thereafter, we compute the median of the ECS values in each country-variant-during/outside triplet where at least 20 SSEs were detected. The results (Figure 5 (b)) are consistent with our observation in the Hungarian survey dataset: in European countries with large genetic sequence datasets, local awareness (ECS) was lower during the wave compared to outside of the wave for the Omicron variant, but not for the Alpha and the Delta variants.
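The during/outside split can be sketched as follows. The data layout (pairs of an ECS value and the national case count at the time of the SSE), the function names and the rounding of the 20% cut-off are assumptions of this illustration; the precise procedure is given in Supplementary Material C.

```python
from statistics import median

def split_by_wave(sses, top_fraction=0.2):
    """Split SSEs into (during_wave, outside_wave).

    `sses` is a list of (ecs, national_case_count) pairs for one
    country-variant pair (hypothetical layout). The SSEs with the highest
    case counts form the "during wave" group.
    """
    ranked = sorted(sses, key=lambda s: s[1], reverse=True)
    n_during = round(top_fraction * len(ranked))
    return ranked[:n_during], ranked[n_during:]

def wave_medians(sses, min_sses=20, top_fraction=0.2):
    """Median ECS during and outside the wave; None if a group is too small."""
    during, outside = split_by_wave(sses, top_fraction)

    def group_median(group):
        if len(group) < min_sses:
            return None
        return median(ecs for ecs, _ in group)

    return group_median(during), group_median(outside)

# Toy input: 100 SSEs, where the 20 with the highest case counts
# happen to have a negative ECS (made-up values).
toy = [(-1.0 if cases > 80 else 1.0, cases) for cases in range(1, 101)]
during_median, outside_median = wave_medians(toy)
```

In the toy input the "during wave" median is negative and the "outside of wave" median is positive, which mimics the pattern that would indicate weaker containment at the peak of a wave.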

3 Discussion

In epidemic surveillance, there is usually a trade-off between the breadth and the depth of the data we can access. On one end, we have aggregate case counts, which give a macroscopic view of the epidemic; on the other end, we have a handful of case studies, which describe the local spread. Survey results provide a representative depiction of self-reported human behavior; however, they lack sufficient information on disease spread to support conclusions beyond forming hypotheses.

In this paper, we observe local awareness behavior in two complementary datasets: a Hungarian survey dataset and the dataset of clinical genetic sequences collected during the COVID-19 pandemic. We first show that the survey results indicate a drop in local awareness behavior during the Omicron wave of the COVID-19 pandemic. Based on the survey results, we formulate a question, whether this drop occurred and caused noticeable changes in the spread of the disease in other countries as well. To address this question, we introduce a methodology that utilizes genetic sequence data, striking a new balance between micro and macroscopic epidemic surveillance.

As with any trade-off, our proposed analysis comes with a number of limitations. We identify SSEs based on simple thresholding of sequence counts, which is less accurate than manual contact tracing, where more metadata and more context about infection events can be taken into account. Consequently, we only compute highly aggregated statistics on the detected events. One ECS gives only very noisy information about the outcome of each SSE, and only the median of all ECSs, the $\mathrm{ECS}_{c}$ value, has the statistical power to say anything about local awareness in region $c$ . Since the number of genetic sequences available since COVID-19 is unprecedented, and the new tools to analyse them are just being developed [22], our results too have to be interpreted very carefully, and should be confirmed by further research.

Despite these limitations, the new methodology we propose brings immediate and exciting contributions into epidemic surveillance and modeling. While local awareness has been thoroughly studied in the modeling literature [11, 12, 13, 14], there has been little empirical evidence about its impact in real epidemics. We provide such evidence by showing that positive $\mathrm{ECS}_{c}$ corresponds to local awareness behavior in simulations, and that $\mathrm{ECS}_{c}$ was positive in several European countries during the COVID-19 pandemic.

On a more operational side, by studying $\mathrm{ECS}_{c}$ values, we are able to measure how effectively different countries managed to contain SSEs in different waves. We observe that this effectiveness is highly correlated with the containment policies implemented in each country, which is a reassuring finding. We envision that similar analyses will be used to evaluate the effectiveness of the implemented policies in future pandemics, potentially generating a positive feedback loop between cooperative preventive behavior and epidemic containment. Unfortunately, even with the rapid advancement of genetic sequencing technologies, the financial burden of achieving the sampling rate necessary for our proposed analysis is quite high, and we cannot expect that we will have the same coverage in every pandemic. Deciding how much sequencing is actually needed for epidemic surveillance is currently an active research topic, as the cost-benefit tradeoffs are still being debated [50]. Our analysis adds to this discussion by bringing a new potential benefit of dense genetic sequencing.

Finally, we highlight the importance of continuing this research towards more specific questions, such as understanding the socioeconomic factors that determine whether SSEs are effectively contained or not, and whether the measured local awareness behavior is more centrally regulated (e.g. by a public health organization), or decentralized and self-motivated as it was asked in the questionnaire in Figure 1 (a). Large-scale genetic data analysis provides a new opportunity to answer these questions, and to further our understanding about the underlying mechanisms of behavior-disease models.

4 Methods

An overview of the steps of the pipeline is given in Figure 1 (b). The details on the preprocessing of AACCs are included in the main text. Below, we give the details on the detection of SSEs (Section 4.1), the computation of the ECSs (Section 4.2), the generation of synthetic networks (Section 4.3), the SIR model with local and global awareness (Section 4.4), and the generation of synthetic genetic sequences (Section 4.5).

4.1 SSE detection

See Supplementary Material A for a detailed explanation of these methodological choices. We index the size of AACCs by the time $t$ (integer value measured in weeks since the first sequence), their country-variant pair denoted by $c$ , and their cluster index $i$ (Figure 2 (b)). We track the normalized changes in AACC sizes defined as

\mathrm{NormChange}_{c,i}(t)=\frac{\mathrm{AACC}_{c,i}(t+1)-\mathrm{AACC}_{c,i}(t)}{\mathrm{max}(1,\sqrt{\mathrm{AACC}_{c,i}(t)})},

(3)

where $\mathrm{AACC}_{c,i}(t)$ denotes the size of the AACC indexed by $(c,i,t)$ . We say that an SSE happens at time $t$ in AACC $(c,i)$ if $\mathrm{NormChange}_{c,i}(t)$ is larger than a threshold, which is set to 9 by default; we give a robustness analysis in Supplementary Material B.1.
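As a worked illustration of equation (3) and the thresholding rule, the sketch below evaluates the normalized change for a single AACC given as a list of weekly sizes. The function names and the toy size series are hypothetical; only the formula and the default threshold of 9 come from the text.

```python
import math

def norm_change(sizes, t):
    """Equation (3) for one AACC; `sizes[t]` is the cluster size in week t."""
    return (sizes[t + 1] - sizes[t]) / max(1.0, math.sqrt(sizes[t]))

def detect_sses(sizes, threshold=9.0):
    """Return the weeks t whose normalized change exceeds the threshold."""
    return [t for t in range(len(sizes) - 1)
            if norm_change(sizes, t) > threshold]

# Toy weekly sizes for one AACC (made-up numbers).
sizes = [0, 4, 16, 60, 62, 30]

changes = [norm_change(sizes, t) for t in range(len(sizes) - 1)]
# Week 2: (60 - 16) / sqrt(16) = 11 > 9, so it is flagged as an SSE.
sse_weeks = detect_sses(sizes)
```

The square-root normalization means that a jump of 12 samples counts for more in a cluster of size 4 (normalized change 6) than it would in a cluster of size 60, so large clusters are not flagged merely for being large.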

4.2 ECS assignment

See Supplementary Material A for a detailed explanation of these methodological choices. In each country-variant pair, with at least 20 detected SSEs, we match each SSE $(c,i,t)$ with at least $2m$ baseline events (not SSEs) based on AACC sizes (see Supplementary Material B.2 for a robustness analysis on the value of $m$ ). We outline a procedure that ensures that compared to $(c,i,t)$ , at least $m$ larger and $m$ smaller AACCs are always selected as baselines, however, if there are a large number of AACCs with the same size as $(c,i,t)$ , then we select all of them to avoid arbitrary selections and to make use of the available data.

Formally, let us denote the cluster indices (resp., time indices) of the matched AACCs by $I(c,i,t)$ (resp., $T(c,i,t)$). First, we sort all AACCs by size to create an order $\mathcal{O}$ . We construct $I(c,i,t)$ (resp., $T(c,i,t)$) by taking the union of the cluster (resp., time) indices of all AACCs with the same size as $(c,i,t)$ , as well as the $m$ closest smaller and the $m$ closest larger AACCs to $(c,i,t)$ in $\mathcal{O}$ . Then, the median baseline NormChange values at time $t$ are defined as

$$\mathrm{Baseline}_{c,i}(t)=\mathop{\text{median}}_{j}\left(\mathrm{NormChange}_{c,I(c,i,t)_{j}}(T(c,i,t)_{j})\right), \tag{4}$$

where the NormChange function is defined in equation (3). Thereafter, $\mathrm{ECS}_{c,i}(t)$ is computed as

$$\mathrm{ECS}_{c,i}(t)=\mathrm{Baseline}_{c,i}(t+1)-\mathrm{NormChange}_{c,i}(t+1), \tag{5}$$

and $\mathrm{ECS}_{c}$ is defined as the median of the $\mathrm{ECS}_{c,i}(t)$ values for all SSEs $(c,i,t)$ .
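The matching and scoring steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not our released code: each event is reduced to a hypothetical pair `(size_at_t, norm_change_at_t_plus_1)`, the baselines matched to an SSE at week $t$ are evaluated one week later, and the requirement of at least 20 SSEs per country-variant pair is omitted.

```python
from bisect import bisect_left, bisect_right
from statistics import median

def match_baselines(sse_size, baseline_sizes_sorted, m=10):
    """Positions (in the size-sorted baseline list) matched to one SSE.

    All baselines with the same size are kept, plus the m closest smaller
    and the m closest larger ones. Fewer are returned if the list runs out.
    """
    lo = bisect_left(baseline_sizes_sorted, sse_size)
    hi = bisect_right(baseline_sizes_sorted, sse_size)
    return range(max(0, lo - m), min(len(baseline_sizes_sorted), hi + m))

def ecs_for_country_variant(sses, baselines, m=10):
    """Return (ECS_c, list of ECS values) for one country-variant pair.

    `sses` and `baselines` are hypothetical lists of
    (size_at_t, norm_change_at_t_plus_1) pairs.
    """
    baselines = sorted(baselines, key=lambda b: b[0])  # the order O, by size
    sizes = [size for size, _ in baselines]
    ecs_values = []
    for size, nc_next in sses:
        matched = match_baselines(size, sizes, m)
        baseline_next = median(baselines[j][1] for j in matched)
        ecs_values.append(baseline_next - nc_next)  # equation (5)
    return median(ecs_values), ecs_values

# Toy input (made-up numbers): 20 baseline events and two SSEs.
toy_baselines = [(s, -1.0) for s in range(1, 21)]
toy_sses = [(10, -3.0), (5, 0.0)]
print(ecs_for_country_variant(toy_sses, toy_baselines, m=2))
```

Using binary search on the sorted sizes keeps the equal-size block intact, so ties are never broken arbitrarily.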

In Figure 3, $\mathrm{ECS}_{c}$ values are compared with various exogenous variables (sampling rate, attack rate, CHI). These exogenous variables are computed for each country on a weekly basis based on publicly available datasets on the case counts [51], population counts [52], and the Oxford Containment Health Index [39]. Then, each SSE in the dataset is matched with the exogenous variables based on the time and country information. Finally, the plotted values are computed as the median of the exogenous variables of the SSEs corresponding to index $c$ (which are also used to compute $\mathrm{ECS}_{c}$ ).

4.3 Generating synthetic networks

Geometric Inhomogeneous Random Graphs (GIRGs) were generated by sampling the spatial coordinates and the expected degrees of the nodes, and then connecting them by edges with a probability given by a kernel function, which is inversely proportional to the spatial distance and ensures the desired node degrees [53]. We used the Python implementation [54] for the sampling procedure with degree exponent $\tau=3.5$ and parameters $\alpha=2.3$ , $C_{1}=0.8$ . We tuned $C_{2}$ numerically to achieve the desired average degree (3 by default). Configuration models were generated by degree-preserving shuffling of the edges of the generated GIRG networks. SBMs were generated with blocks of size 50. The connection probabilities inside and between the blocks were tuned so that, for each node, half of its average degree came from inside its block and half from outside. All synthetic networks had $10^{4}$ nodes, and we took the largest connected component if the network was not connected. We include a visualization of the size, degree distribution and average clustering coefficient of the generated networks in Supplementary Figure 20.
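One common way to realize degree-preserving edge shuffling is the double edge swap, sketched below in plain Python. The function name, the edge-list representation and the number of swaps are illustrative assumptions; the sketch is not necessarily the routine used to produce our configuration models.

```python
import random

def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Randomize an undirected simple graph by double edge swaps.

    Each swap picks two edges (a, b) and (c, d) and rewires them to
    (a, d) and (c, b), which leaves every node degree unchanged.
    Swaps that would create a self-loop or a duplicate edge are skipped.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # shared endpoint: the swap could create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # the swap would create a multi-edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

# Toy example: shuffle a ring of 20 nodes; every node keeps degree 2.
ring = [(k, (k + 1) % 20) for k in range(20)]
shuffled = degree_preserving_shuffle(ring, n_swaps=200, seed=1)
```

Because the degree sequence is fixed, any difference between GIRG and configuration-model results can be attributed to structure beyond the degrees, such as clustering and geometry.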

4.4 SIR model extended with local and global awareness

On both synthetic and real networks, we used our own implementation of the SIR model. We model local and global awareness by setting the infection probability from an infectious node $u$ to any susceptible node $v$ at time $t$ to a function $\beta_{u,t}$ . In the case of local awareness, $\beta_{u,t}$ depends on $I_{u,t}$ , the number of infected neighbors of $u$ at $t$ , and in the case of global awareness, $\beta_{u,t}$ depends on $I_{t}$ , the total number of infected nodes at time $t$ . The specific awareness functions we implemented are shown in Table 2. The default values for the basic infection probability $\beta_{0}$ and the recovery probability $\gamma$ were always $0.3$ .

No awareness:	$\beta_{u,t}=\beta_{0}$
Exponential local awareness (1):	$\beta_{u,t}=\beta_{0}\cdot\mathrm{exp}(-\alpha_{l}I_{u,t})$
Exponential global awareness (2):	$\beta_{u,t}=\beta_{0}\cdot\mathrm{exp}(-\alpha_{g}I_{t}/N)$
Linear local awareness:	$\beta_{u,t}=\beta_{0}\cdot 1/(1+\alpha_{l}I_{u,t})$
Linear global awareness:	$\beta_{u,t}=\beta_{0}\cdot 1/(1+\alpha_{g}I_{t}/N)$

Table 2: The specific awareness functions implemented in our synthetic models.
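The functions in Table 2 translate directly into code. The sketch below is illustrative only: the function name, the string labels and the default value of `alpha` are assumptions made for the example, while $\beta_{0}=0.3$ follows the default stated above.

```python
import math

def infection_probability(kind, beta0=0.3, alpha=1.0,
                          infected_neighbors=0, infected_total=0, n_nodes=1):
    """Infection probability beta_{u,t} for the awareness types of Table 2.

    infected_neighbors plays the role of I_{u,t}, infected_total of I_t,
    and n_nodes of N. `alpha` stands for alpha_l or alpha_g (default assumed).
    """
    if kind == "none":
        return beta0
    if kind == "exp_local":
        return beta0 * math.exp(-alpha * infected_neighbors)
    if kind == "exp_global":
        return beta0 * math.exp(-alpha * infected_total / n_nodes)
    if kind == "lin_local":
        return beta0 / (1.0 + alpha * infected_neighbors)
    if kind == "lin_global":
        return beta0 / (1.0 + alpha * infected_total / n_nodes)
    raise ValueError(f"unknown awareness type: {kind}")

# With two infected neighbors and alpha = 1, linear local awareness
# reduces the infection probability to one third of beta0.
print(infection_probability("lin_local", infected_neighbors=2))
```

All four awareness functions reduce to $\beta_{0}$ when no infections are observed, so the models differ only in how quickly transmission is suppressed as prevalence grows.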

4.5 Generating synthetic genetic sequences

Once the epidemic process has been simulated, we assign synthetic genetic sequences to each node of the infection tree using the JC genetic substitution model [41], which is the simplest genetic substitution model we could select for our application. More concretely, we assign strings of length 10 consisting of the digits $\{0,1,2,3\}$ to each infected node using the following procedure. First, we assign a uniformly randomly chosen string to the root of the infection tree. Thereafter, for each edge of the infection tree, we select each digit of the string of the parent node independently with probability 1/20, change each selected digit to a uniformly random new digit (among the other three digits), and assign the resulting string to the child node. These parameters ensure that we have on average one mutation in every 2 timesteps (weeks), in agreement with estimates from the literature [55]. Our synthetic genetic sequences are much shorter than SARS-CoV-2 genetic sequences for the sake of computational efficiency.
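The sequence assignment can be sketched as follows. The representation of the infection tree as a dictionary from infector to infected nodes, as well as all function names, are assumptions made for this illustration.

```python
import random

ALPHABET = "0123"

def mutate(parent_seq, rate=1 / 20, rng=random):
    """Copy a sequence, changing each digit with probability `rate`
    to one of the other three digits chosen uniformly at random."""
    child = []
    for digit in parent_seq:
        if rng.random() < rate:
            child.append(rng.choice([d for d in ALPHABET if d != digit]))
        else:
            child.append(digit)
    return "".join(child)

def assign_sequences(infection_tree, root, length=10, rate=1 / 20, seed=0):
    """Assign a sequence to every node of an infection tree.

    `infection_tree` is a hypothetical dict mapping each infector to the
    list of nodes it infected; `root` is the index case.
    """
    rng = random.Random(seed)
    sequences = {root: "".join(rng.choice(ALPHABET) for _ in range(length))}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in infection_tree.get(parent, []):
            sequences[child] = mutate(sequences[parent], rate, rng)
            stack.append(child)
    return sequences

# Toy infection tree: node 0 infects 1 and 2, node 1 infects 3.
print(assign_sequences({0: [1, 2], 1: [3]}, root=0))
```

With 10 digits and a per-digit probability of 1/20, each transmission carries 0.5 expected mutations, which corresponds to one mutation every 2 timesteps when one transmission step takes one timestep.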

5 Data and Code Availability

All genome sequences and associated metadata are published in GISAID’s EpiCoV database. To view the contributors of each individual sequence with details such as accession number, Virus name, Collection date, Originating Lab and Submitting Lab and the list of Authors, visit 10.55876/gis8.240404rn. The MASZK survey data is available upon request from the authors. The code for the genetic data generation and analysis pipeline shown in Figure 1 (b) will be made available at https://github.com/odorgergo/sse-awareness upon publication.

6 Acknowledgements

We thank Eszter Ari, Andreas Bergthaler, and Tamás Stirling for their insightful comments and remarks. GÓ was supported by the Swiss National Science Foundation, under grant number P500PT-211129. MK was supported by the CHIST-ERA project SAI: FWF I 5205-N; the SoBigData++ H2020-871042; SoBigData-PPP HORIZON-INFRA-2021-DEV-02 program under grant agreement No 101079043, and the National Laboratory for Health Security, Alfréd Rényi Institute, RRF-2.3.1-21-2022-00006.

7 Author contributions

GÓ and MK conceptualized the research design. GÓ conducted the data analysis, performed the synthetic simulations, created the visualizations and wrote the first draft of the manuscript. MK acquired the survey data and supervised the research. GÓ and MK edited the final version of the manuscript.

References

[1] J. D. Sachs, S. S. A. Karim, L. Aknin, J. Allen, K. Brosbøl, F. Colombo, G. C. Barron, M. F. Espinosa, V. Gaspar, A. Gaviria, et al., “The lancet commission on lessons for the future from the covid-19 pandemic,” The Lancet, vol. 400, no. 10359, pp. 1224–1280, 2022.
[2] R. K. Webster, S. K. Brooks, L. E. Smith, L. Woodland, S. Wessely, and G. J. Rubin, “How to improve adherence with quarantine: rapid review of the evidence,” Public health, vol. 182, pp. 163–169, 2020.
[3] J. J. V. Bavel, K. Baicker, P. S. Boggio, V. Capraro, A. Cichocka, M. Cikara, M. J. Crockett, A. J. Crum, K. M. Douglas, J. N. Druckman, et al., “Using social and behavioural science to support covid-19 pandemic response,” Nature human behaviour, vol. 4, no. 5, pp. 460–471, 2020.
[4] World Health Organization, “Pandemic fatigue–reinvigorating the public to prevent covid-19: policy framework for supporting pandemic prevention and management,” tech. rep., World Health Organization. Regional Office for Europe, 2020.
[5] A. Haktanir, N. Can, T. Seki, M. F. Kurnaz, and B. Dilmaç, “Do we experience pandemic fatigue? current state, predictors, and prevention,” Current Psychology, vol. 41, no. 10, pp. 7314–7325, 2022.
[6] F. Jørgensen, A. Bor, M. S. Rasmussen, M. F. Lindholt, and M. B. Petersen, “Pandemic fatigue fueled political discontent during the covid-19 pandemic,” Proceedings of the National Academy of Sciences, vol. 119, no. 48, p. e2201266119, 2022.
[7] C. Stevenson, J. R. Wakefield, I. Felsner, J. Drury, and S. Costa, “Collectively coping with coronavirus: Local community identification predicts giving support and lockdown adherence during the covid-19 pandemic,” British Journal of Social Psychology, vol. 60, no. 4, pp. 1403–1418, 2021.
[8] G. Kraft-Todd, E. Yoeli, S. Bhanot, and D. Rand, “Promoting cooperation in the field,” Current Opinion in Behavioral Sciences, vol. 3, pp. 96–101, 2015.
[9] M. S. Wolf, M. Serper, L. Opsasnick, R. M. O’Conor, L. Curtis, J. Y. Benavente, G. Wismer, S. Batio, M. Eifler, P. Zheng, et al., “Awareness, attitudes, and actions related to covid-19 among adults with chronic conditions at the onset of the us outbreak: a cross-sectional survey,” Annals of internal medicine, vol. 173, no. 2, pp. 100–109, 2020.
[10] R. M. Jaber, B. Mafrachi, A. Al-Ani, and M. Shkara, “Awareness and perception of covid-19 among the general population: A middle eastern survey,” PloS one, vol. 16, no. 4, p. e0250461, 2021.
[11] S. Funk, E. Gilad, C. Watkins, and V. A. Jansen, “The spread of awareness and its impact on epidemic outbreaks,” Proceedings of the National Academy of Sciences, vol. 106, no. 16, pp. 6872–6877, 2009.
[12] N. Perra, D. Balcan, B. Gonçalves, and A. Vespignani, “Towards a characterization of behavior-disease models,” PloS one, vol. 6, no. 8, p. e23084, 2011.
[13] I. Z. Kiss, J. Cassell, M. Recker, and P. L. Simon, “The impact of information transmission on epidemic outbreaks,” Mathematical biosciences, vol. 225, no. 1, pp. 1–10, 2010.
[14] A. Teslya, T. M. Pham, N. G. Godijk, M. E. Kretzschmar, M. C. Bootsma, and G. Rozhnova, “Impact of self-imposed prevention measures and short-term government-imposed social distancing on mitigating and delaying a covid-19 epidemic: A modelling study,” PLoS medicine, vol. 17, no. 7, p. e1003166, 2020.
[15] S. Funk, S. Bansal, C. T. Bauch, K. T. Eames, W. J. Edmunds, A. P. Galvani, and P. Klepac, “Nine challenges in incorporating the dynamics of behaviour in infectious diseases models,” Epidemics, vol. 10, pp. 21–25, 2015.
[16] F. Verelst, L. Willem, and P. Beutels, “Behavioural change models for infectious disease transmission: a systematic review (2010–2015),” Journal of The Royal Society Interface, vol. 13, no. 125, p. 20160820, 2016.
[17] M. Karsai, J. Koltai, O. Vásárhelyi, and G. Röst, “Hungary in mask/maszk in hungary,” Corvinus Journal of Sociology and Social Policy, no. 2, 2020.
[18] L. Song, H. Liu, F. S. L. Brinkman, E. Gill, E. J. Griffiths, W. W. L. Hsiao, S. Savić-Kallesøe, S. Moreira, G. Van Domselaar, M. H. Zawati, et al., “Addressing privacy concerns in sharing viral sequences and minimum contextual data in a public repository during the covid-19 pandemic,” Frontiers in genetics, vol. 12, p. 716541, 2022.
[19] E. M. Volz, S. L. Kosakovsky Pond, M. J. Ward, A. J. Leigh Brown, and S. D. Frost, “Phylodynamics of infectious disease epidemics,” Genetics, vol. 183, no. 4, pp. 1421–1430, 2009.
[20] G. Baele, S. Dellicour, M. A. Suchard, P. Lemey, and B. Vrancken, “Recent advances in computational phylodynamics,” Current opinion in virology, vol. 31, pp. 24–32, 2018.
[21] E. M. Volz, K. Koelle, and T. Bedford, “Viral phylodynamics,” PLoS computational biology, vol. 9, no. 3, p. e1002947, 2013.
[22] E. B. Hodcroft, N. De Maio, R. Lanfear, D. R. MacCannell, B. Q. Minh, H. A. Schmidt, A. Stamatakis, N. Goldman, and C. Dessimoz, “Want to track pandemic variants faster? fix the bioinformatics bottleneck,” Nature, vol. 591, no. 7848, pp. 30–33, 2021.
[23] L. Cappello, J. Kim, S. Liu, and J. A. Palacios, “Statistical challenges in tracking the evolution of sars-cov-2,” Statistical science: a review journal of the Institute of Mathematical Statistics, vol. 37, no. 2, p. 162, 2022.
[24] Y. Turakhia, B. Thornlow, A. S. Hinrichs, N. De Maio, L. Gozashti, R. Lanfear, D. Haussler, and R. Corbett-Detig, “Ultrafast sample placement on existing trees (usher) enables real-time phylogenetics for the sars-cov-2 pandemic,” Nature genetics, vol. 53, no. 6, pp. 809–816, 2021.
[25] J. McBroome, B. Thornlow, A. S. Hinrichs, A. Kramer, N. De Maio, N. Goldman, D. Haussler, R. Corbett-Detig, and Y. Turakhia, “A daily-updated database and tools for comprehensive sars-cov-2 mutation-annotated trees,” Molecular biology and evolution, vol. 38, no. 12, pp. 5819–5824, 2021.
[26] M. Hunt, A. S. Hinrichs, D. Anderson, L. Karim, B. L. Dearlove, J. Knaggs, B. Constantinides, P. W. Fowler, G. Rodger, T. L. Street, et al., “Addressing pandemic-wide systematic errors in the sars-cov-2 phylogeny,” bioRxiv, pp. 2024–04, 2024.
[27] C. Ye, B. Thornlow, A. S. Hinrichs, D. Torvi, R. Lanfear, R. Corbett-Detig, and Y. Turakhia, “matoptimize: A parallel tree optimization method enables online phylogenetics for sars-cov-2 (preprint),” 2022.
[28] S. Elbe and G. Buckland-Merrett, “Data, disease and diplomacy: Gisaid’s innovative contribution to global health,” Global challenges, vol. 1, no. 1, pp. 33–46, 2017.
[29] A. Bernasconi, L. Mari, R. Casagrandi, and S. Ceri, “Data-driven analysis of amino acid change dynamics timely reveals sars-cov-2 variant emergence,” Scientific Reports, vol. 11, no. 1, p. 21068, 2021.
[30] C. Tran-Kiem and T. Bedford, “Estimating the reproduction number and transmission heterogeneity from the size distribution of clusters of identical pathogen sequences,” Proceedings of the National Academy of Sciences, vol. 121, no. 15, p. e2305299121, 2024.
[31] X. Bello, J. Pardo-Seco, A. Gómez-Carballa, H. Weissensteiner, F. Martinón-Torres, and A. Salas, “Covidphy: A tool for phylogeographic analysis of sars-cov-2 variation,” Environmental Research, vol. 204, p. 111909, 2022.
[32] D. Lewis, “Superspreading drives the covid pandemic–and could help to tame it.,” Nature, vol. 590, no. 7847, pp. 544–547, 2021.
[33] B. M. Althouse, E. A. Wenger, J. C. Miller, S. V. Scarpino, A. Allard, L. Hébert-Dufresne, and H. Hu, “Superspreading events in the transmission dynamics of sars-cov-2: Opportunities for interventions and control,” PLoS biology, vol. 18, no. 11, p. e3000897, 2020.
[34] T. R. Frieden and C. T. Lee, “Identifying and interrupting superspreading events—implications for control of severe acute respiratory syndrome coronavirus 2,” Emerging infectious diseases, vol. 26, no. 6, p. 1059, 2020.
[35] M. P. Kain, M. L. Childs, A. D. Becker, and E. A. Mordecai, “Chopping the tail: How preventing superspreading can help to maintain covid-19 control,” Epidemics, vol. 34, p. 100430, 2021.
[36] H. Streeck, B. Schulte, B. M. Kümmerer, E. Richter, T. Höller, C. Fuhrmann, E. Bartok, R. Dolscheid-Pommerich, M. Berger, L. Wessendorf, et al., “Infection fatality rate of sars-cov2 in a super-spreading event in germany,” Nature communications, vol. 11, no. 1, p. 5829, 2020.
[37] H. Y. Lam, T. S. Lam, C. H. Wong, W. H. Lam, E. L. C. Mei, Y. L. C. Kuen, W. L. T. Wai, B. H. C. Hin, K. H. Wong, and S. K. Chuang, “A superspreading event involving a cluster of 14 coronavirus disease 2019 (covid-19) infections from a family gathering in hong kong special administrative region sar (china),” Western Pacific Surveillance and Response Journal: WPSAR, vol. 11, no. 4, p. 36, 2020.
[38] J. E. Lemieux, K. J. Siddle, B. M. Shaw, C. Loreth, S. F. Schaffner, A. Gladden-Young, G. Adams, T. Fink, C. H. Tomkins-Tinch, L. A. Krasilnikova, et al., “Phylogenetic analysis of sars-cov-2 in boston highlights the impact of superspreading events,” Science, vol. 371, no. 6529, p. eabe3261, 2021.
[39] T. Hale, N. Angrist, R. Goldszmidt, B. Kira, A. Petherick, T. Phillips, S. Webster, E. Cameron-Blake, L. Hallas, S. Majumdar, et al., “A global panel database of pandemic policies (oxford covid-19 government response tracker),” Nature human behaviour, vol. 5, no. 4, pp. 529–538, 2021.
[40] Q. Yu, J. A. Ascensao, T. Okada, The COVID-19 Genomics UK (COG-UK) consortium, O. Boyd, E. Volz, and O. Hallatschek, “Lineage frequency time series reveal elevated levels of genetic drift in sars-cov-2 transmission in england,” bioRxiv, pp. 2022–11, 2022.
[41] T. H. Jukes and C. R. Cantor, “Evolution of protein molecules,” in Mammalian Protein Metabolism (H. N. Munro, ed.), pp. 21–132, New York: Academic Press, 1969.
[42] M. Fire and R. Puzis, “Organization mining using online social networks,” Networks and Spatial Economics, vol. 16, pp. 545–578, 2016.
[43] M. Fire, L. Tenenboim-Chekina, R. Puzis, O. Lesser, L. Rokach, and Y. Elovici, “Computationally efficient link prediction in a variety of social networks,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 5, no. 1, pp. 1–25, 2014.
[44] S. Unicomb, G. Iñiguez, J. Kertész, and M. Karsai, “Reentrant phase transitions in threshold driven contagion on multiplex networks,” Physical Review E, vol. 100, no. 4, p. 040301, 2019.
[45] K. Bringmann, R. Keusch, and J. Lengler, “Geometric inhomogeneous random graphs,” Theoretical Computer Science, vol. 760, pp. 35–54, 2019.
[46] Q. Wu, X. Fu, M. Small, and X.-J. Xu, “The impact of awareness on epidemic spreading in networks,” Chaos: an interdisciplinary journal of nonlinear science, vol. 22, no. 1, 2012.
[47] L. A. N. Fard, M. Starnini, and M. Tizzoni, “Modeling adaptive forward-looking behavior in epidemics on networks,” arXiv preprint arXiv:2301.04947, 2023.
[48] T. Sekizuka, K. Itokawa, T. Kageyama, S. Saito, I. Takayama, H. Asanuma, N. Nao, R. Tanaka, M. Hashino, T. Takahashi, et al., “Haplotype networks of sars-cov-2 infections in the diamond princess cruise ship outbreak,” Proceedings of the National Academy of Sciences, vol. 117, no. 33, pp. 20198–20201, 2020.
[49] Y. Zhang, Y. Li, L. Wang, M. Li, and X. Zhou, “Evaluating transmission heterogeneity and super-spreading event of covid-19 in a metropolis of china,” International journal of environmental research and public health, vol. 17, no. 10, p. 3705, 2020.
[50] F. Wegner, B. Cabrera Gil, T. Araud, C. Beckmann, N. Beerenwinkel, C. Bertelli, M. Carrara, L. Cerutti, C. Chen, S. Cordey, et al., “How much should we sequence? an analysis of the swiss sars-cov-2 surveillance effort,” medRxiv, pp. 2023–08, 2023.
[51] E. Dong, H. Du, and L. Gardner, “An interactive web-based dashboard to track covid-19 in real time,” The Lancet infectious diseases, vol. 20, no. 5, pp. 533–534, 2020.
[52] United Nations, “Department of economic and social affairs,” Population Division, 2020.
[53] K. Bringmann, R. Keusch, and J. Lengler, “Geometric inhomogeneous random graphs,” Theoretical Computer Science, vol. 760, pp. 35–54, 2019.
[54] “Implementation of geometric inhomogeneous random graphs,” 2020. https://github.com/joostjor/random-graphs/blob/master/girg.py.
[55] A. Gómez-Carballa, J. Pardo-Seco, X. Bello, F. Martinón-Torres, and A. Salas, “Superspreading in the emergence of covid-19 variants,” Trends in Genetics, vol. 37, no. 12, pp. 1069–1080, 2021.
[56] T. R. Mercer and M. Salit, “Testing at scale during the covid-19 pandemic,” Nature Reviews Genetics, vol. 22, no. 7, pp. 415–426, 2021.
[57] J. O. Lloyd-Smith, S. J. Schreiber, P. E. Kopp, and W. M. Getz, “Superspreading and the effect of individual variation on disease emergence,” Nature, vol. 438, no. 7066, pp. 355–359, 2005.

Appendix A Detailed explanation of SSE detection and ECS assignment

We index AACCs only by the time $t$ (integer value measured in weeks since the first sequence), their country-variant pair denoted by $c$ , and their cluster index $i$ (Figure 6 (b)). In order to track changes in AACC sizes, we are interested in the Normalized Change values defined as

$$\mathrm{NormChange}_{c,i}(t)=\frac{\mathrm{AACC}_{c,i}(t+1)-\mathrm{AACC}_{c,i}(t)}{\mathrm{max}(1,\sqrt{\mathrm{AACC}_{c,i}(t)})}, \tag{6}$$

where $\mathrm{AACC}_{c,i}(t)$ denotes the size of the AACC indexed by $(c,i,t)$ . The normalization with the square root of the AACC size accounts for the natural fluctuation of the cluster sizes. Indeed, assuming that the patients in the AACCs at time $t$ independently infect an identically distributed random number of new patients with the same amino acid signature at time $t+1$ , by the Central Limit Theorem, we expect the fluctuations of $\mathrm{AACC}_{c,i}(t+1)$ to be proportional to the square root of $\mathrm{AACC}_{c,i}(t)$ . Due to this normalization, NormChange values tend to be close to zero; in most countries 95% of the values fall between -3 and 5. We consider exceptionally large NormChange values as a sign of an SSE. Inspired by [38], we choose the threshold for the NormChange value of an SSE to be 9, and we provide a robustness analysis on this threshold parameter in Section B.1. The proposed SSE detection method is efficient, requires only minor preprocessing, and the detected SSEs agree with our intuition after visual inspection (Figure 6 (b)).

Similarly to previous SSE detection methods based on thresholding genetic sequence counts [31, 38], our proposed method is imperfect, leading to both false positives and false negatives. However, since we only apply aggregate statistics on the identified SSEs, even such imperfect methods can provide important results, especially if the confounding factors can be ruled out. The main confounding factor in this case is sampling bias, as we know that different countries collected and sequenced samples with different strategies and at different rates [56]. To control for country-specific biases, we match each SSE $(c,i,t)$ with multiple baseline AACC timesteps with the same country-variant index $c$ and with similar size. We denote the median of the $\mathrm{NormChange}(t)$ values of the baselines as $\mathrm{Baseline}_{c,i}(t)$ (Methods 4.2). As shown in Figure 6 (c), the NormChange values at $t$ of SSEs are all larger than a threshold and follow a broad distribution, whereas the distribution of the $\mathrm{Baseline}_{c,i}(t)$ values is concentrated below the threshold (in Austria during the Delta wave; the results are similar for the other country-variant pairs shown in Supplementary Material C). Once the baselines are matched, we define our main notion of interest, the ECS, as the difference between the baseline value and the SSE NormChange value at time $t+1$ :

$$\mathrm{ECS}_{c,i}(t)=\mathrm{Baseline}_{c,i}(t+1)-\mathrm{NormChange}_{c,i}(t+1). \tag{7}$$

We present the distribution of $\mathrm{NormChange}_{c,i}(t+1)$ values for SSEs, the $\mathrm{Baseline}_{c,i}(t+1)$ values for baseline events, along with the resulting ECS values in Austria, in Figure 6 (d). Since all country-variant pairs $c$ in our dataset had ECS distributions that were similarly broad but unimodal, as in Figure 6 (d), we focused on their median values denoted by $\mathrm{ECS}_{c}$ (the output of the pipeline in Figure 1 (b)). As the mutation rate of SARS-CoV-2 (about one mutation every 2 weeks [55]) was higher than the effective reproduction number (most of the time below 2), and AACCs can be thought of as sub-critical spreading processes, it is no surprise that the median values of the $\mathrm{NormChange}_{c,i}(t+1)$ values are negative for both the SSEs and the baselines. However, the sign of $\mathrm{ECS}_{c}$ adds non-trivial information. A positive $\mathrm{ECS}_{c}$ means that the normalized change of the number of genetic sequences in SSEs was smaller than in the baseline, which suggests that the SSEs led to fewer secondary infections than similarly sized non-SSE clusters of infectious individuals, i.e. the SSEs were well-contained. Similarly, a negative $\mathrm{ECS}_{c}$ would suggest SSEs that were not contained as well as the baselines in the same country during the same variant.

Appendix B Robustness analyses

B.1 Threshold for SSE Detection

We detect Superspreading Events (SSEs) by applying a threshold on the $\mathrm{NormChange}_{c,i}(t)$ values defined in equation (6). By default, this threshold is set to 9 following [38], who chose this value based on the theoretical justification of [57]. A notable difference between our approach and the referenced papers is that they assume that each SSE starts from a single source, which can be identified in the dataset (e.g. via contact tracing), and they apply the threshold to the number of secondary cases of the source. In our approach, we do not assume that we can identify the source of the SSE; we are only interested in detecting the occurrence of SSEs based on AACC sizes. For instance, if $\mathrm{AACC}_{c,i}(t)=10$ and $\mathrm{AACC}_{c,i}(t+1)=100$ , then we suspect that this unexpected increase is due to an SSE that occurred at $t$ , but we do not know which patient caused the SSE. In principle, it is possible that not one but multiple patients with the same amino acid signature caused independent and simultaneous SSEs. However, since this is an unlikely event, we can ignore it without significantly impacting our aggregate statistics. In our approach, it is also important to account for the fact that $\mathrm{AACC}_{c,i}$ changing from 5 to 50 is not the same as a change from 500 to 545, as larger AACCs also have larger natural fluctuations. If no SSE occurs, we assume (by the Central Limit Theorem) that AACC sizes behave similarly to Gaussian random variables with mean and variance proportional to $\mathrm{AACC}_{c,i}(t)$ ; we therefore normalize the AACC size changes by the square root of $\mathrm{AACC}_{c,i}(t)$ in the definition of the NormChange function. When $\mathrm{AACC}_{c,i}(t)=1$ , we recover the setup of [38], which motivated us to choose the same threshold for SSE detection as they did.
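As a worked instance of this normalization, applying equation (6) to the two size changes mentioned above (both an increase of 45 sequences) gives

$$\frac{50-5}{\sqrt{5}}\approx 20.1>9,\qquad \frac{545-500}{\sqrt{500}}\approx 2.0<9,$$

so only the first change would be flagged as an SSE under the default threshold. Likewise, the increase from 10 to 100 gives $(100-10)/\sqrt{10}\approx 28.5$ , well above the threshold.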

To further strengthen the validity of our results, we present a robustness analysis on the threshold parameter. First, in Figure 7 we show the number of detected SSEs in various European countries as a function of the threshold parameter, for the cases in which at least 20 SSEs were detected (the minimum required to qualify for our analysis). As expected, the number of SSEs is a monotone decreasing function of the threshold. Moreover, on the logarithmic scale the number of detected SSEs appears to decrease exponentially with the threshold, indicating that it is sufficient to perform the robustness analysis in a relatively narrow parameter range. We selected the interval [7,11] because a threshold of 11 only detects a minimal number of SSEs in many countries, making them ineligible for our analysis, while a threshold of 7 results in a high number of SSEs, potentially leading to an excessive number of false positives.

In Figures 8-12, we recreated Figure 3 for each integer SSE detection threshold in the range [7,11], for all of the major SARS-CoV-2 variants. While there is some variability in the results for different thresholds (mostly due to new countries entering the dataset as the threshold decreases), besides the correlation between the $\mathrm{ECS}_{c}$ and the sampling date in the Delta wave also mentioned in the main text, the most significant correlations remain between the $\mathrm{ECS}_{c}$ and the CHI in the Delta and the Omicron waves. Moreover, for the threshold value 7 the correlation between the $\mathrm{ECS}_{c}$ and the CHI becomes significant even in the Alpha wave, which is most likely explained by the fact that lower thresholds add more countries to the dataset and increase the statistical power of the results. These additional results further strengthen the conclusion made in the main text that, among the available exogenous variables, which include potential confounding factors (sampling rate, attack rate), $\mathrm{ECS}_{c}$ is most correlated with the CHI (the most direct measure of human behavior).

B.2 Threshold for the Number of Baseline Events

In Methods 4.2, we defined a parameter $m$ , which sets the minimum number of baseline events that are matched with each detected SSE in the dataset. We expect that if we chose only one baseline event, then the results could look very noisy; therefore, we set $m=10$ to ensure at least $2m=20$ baseline events by default. In Figures 13-15 we recreate Figure 3 with $m\in\{5,20,40\}$ to show that the precise value of $m$ is not important, as long as $m$ is sufficiently high.

Appendix C NormChange and ECS values in each country-variant pair

In Figure 2 (c) and (d) we plotted the histograms of the $\mathrm{NormChange}_{c,i}(t)$ , the $\mathrm{Baseline}_{c,i}(t)$ , the $\mathrm{NormChange}_{c,i}(t+1)$ , the $\mathrm{Baseline}_{c,i}(t+1)$ , and the resulting $\mathrm{ECS}_{c,i}(t)$ values of the detected SSEs for Austria during the Delta wave as an example. For completeness, in the left and the middle columns of Figures 16-18 we include the same plots for all country-variant pairs that are included in Figure 5 (b).

Furthermore, in the right column of Figures 16-18 we include a plot of the $\mathrm{ECS}_{c,i}(t)$ values and the number of reported cases against the time variable $t$ . In this plot we also show the automatically detected “during wave” periods that are used to create Figure 5 (b).