Abstract

Large earthquakes can be destructive and quickly wreak havoc on a landscape. To mitigate immediate threats, early warning systems have been developed to alert residents, emergency responders, and critical infrastructure operators seconds to a minute before seismic waves arrive. These warnings provide time to take precautions and prevent damage. The success of these systems relies on fast, accurate predictions of ground motion intensities, which is challenging due to the complex physics of earthquakes, wave propagation, and their intricate spatial and temporal interactions. To improve early warning, we propose a novel AI-enabled framework, WaveCastNet, for forecasting ground motions from large earthquakes. WaveCastNet integrates a novel convolutional Long Expressive Memory (ConvLEM) model into a sequence to sequence (seq2seq) forecasting framework to model long-term dependencies and multi-scale patterns in both space and time. WaveCastNet, which shares weights across spatial and temporal dimensions, requires fewer parameters compared to more resource-intensive models like transformers and thus, in turn, reduces inference times. Importantly, WaveCastNet also generalizes better than transformer-based models to different seismic scenarios, including to more rare and critical situations with higher magnitude earthquakes. Our results using simulated data from the San Francisco Bay Area demonstrate the capability to rapidly predict the intensity and timing of destructive ground motions. Importantly, our proposed approach does not require estimating earthquake magnitudes and epicenters, which are prone to errors using conventional approaches; nor does it require empirical ground motion models, which fail to capture strongly heterogeneous wave propagation effects.

Acknowledgements

NBE would like to acknowledge NSF, under Grant No. 2319621, and the U.S. Department of Energy, under Contract Number DE-AC02-05CH11231 and DE-AC02-05CH11231, for providing partial support of this work. RN would like to acknowledge the Laboratory Directed Research and Development Program of Lawrence Berkeley National Laboratory under U.S. Department of Energy Contract No. DE-AC02-05CH11231. AP’s work was performed at the Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. This research used the computational cluster resources, including LLNL’s HPC Systems (funded by the LLNL Computing Grand Challenge), LBNL’s Lawrencium, and NERSC (under Contract No. DE-AC02-05CH11231).

WaveCastNet: An AI-enabled Wavefield Forecasting Framework for Earthquake Early Warning

Dongwei Lyu1,2, Rie Nakata2,3,4∗, Pu Ren3, Michael W. Mahoney2,3,5, Arben Pitarka6, Nori Nakata2,3,7, N. Benjamin Erichson2,3

Corresponding authors: N. Benjamin Erichson ([email protected]) and Rie Nakata ([email protected]).

1Department of Mathematics, UC Berkeley
2International Computer Science Institute
3Lawrence Berkeley National Laboratory
4Earthquake Research Institute, University of Tokyo
5Department of Statistics, UC Berkeley
6Lawrence Livermore National Laboratory
7Massachusetts Institute of Technology

1  Introduction

Figure 1: Illustration of the problem setup and our proposed WaveCastNet model. In (a), the simulation area of interest within the San Francisco Bay Area, highlighted by the black rectangular box, is shown. Point-source earthquakes are placed along the thick white line, and an M6 earthquake rupture plane is indicated by the blue line. The red lines indicate known faults, the black triangles show the actual sensor locations, and the red triangles highlight the two sensor stations used in the discussion below. In (b), an example snapshot of visco-elastic wave propagation from the point-source earthquake at T=21.79 seconds is shown. In (c), an illustration of using WaveCastNet to forecast the propagation of seismic waves is shown. The framework consists of encoder and decoder components, which in turn consist of stacked recurrent cells. In this work, we advocate a novel recurrent ConvLEM cell, which can model multiscale structures in space and time.

Large earthquakes can rapidly devastate landscapes, toppling buildings and rupturing infrastructure, posing a substantial risk in seismically active regions. These seismic events happen when a fault ruptures. The released seismic energy propagates through the Earth in the form of seismic waves, eventually reaching the Earth’s surface. To mitigate the immediate threats posed by large earthquakes, early warning systems have been developed and implemented [4, 33, 34]. These systems aim to detect fast-traveling P-waves using sensors located close to the earthquake epicenter. Once the P-waves are detected, a processing center estimates the earthquake location, magnitude (M), and fault geometry. Then, the system predicts ground motion intensity parameters (e.g., Modified Mercalli Intensity, Peak Ground Acceleration, and Peak Ground Velocity), which provide information regarding potential damage. Subsequently, warnings are issued, typically a few seconds to a minute before the arrival of the more destructive S-waves and surface waves. These warnings serve as an early alert that enables operators of critical infrastructure to initiate necessary precautions, such as stopping trains and shutting down gas pipelines, and allows people to take protective measures.

The performance of these systems relies on the detection and isolation of earthquake signals, as well as on the accuracy of the earthquake parameter estimation and seismic wave propagation modeling [3]. Inaccuracies in the parameter estimation, most commonly over- or under-prediction of earthquake magnitudes, lead to false alerts or missed warning opportunities [55, 42]. The conventional use of empirical ground motion models precludes high-fidelity representation of the complex source and path effects, and of the site-specific variability of ground motion intensities [26, 8, 9, 14, 6]. Alternative approaches forecast future ground motion intensity measures or waveforms up to the time when a sensor detects actual earthquake ground motions [31, 21]. These approaches combine physics-based simulation (e.g., radiative transfer theory or finite-difference wavefield simulations) and data assimilation (e.g., optimum interpolation techniques [32]) to remove the dependence on arrival detection and magnitude estimation, while handling the sparsity of the data and incorporating source and path effects. However, their prediction accuracy typically remains insufficient for deployment in real cases [3], and they require substantial computational resources [21].

Artificial Intelligence (AI) provides a promising alternative approach for modeling ground motion propagation, because deep neural networks are well suited to model the nontrivial spatiotemporal properties of ground motions [19, 61, 11, 60, 20, 22, 57]. Moreover, AI methods have the advantage of being computationally efficient at inference time, which is of great importance for early warning systems.

Figure 1(a-b) illustrates our problem setup alongside an example snapshot demonstrating visco-elastic wave propagation. Our objective is to predict future wave motions over a time horizon of up to 100 seconds. We approach this as a spatio-temporal sequence prediction task. Specifically, we are given a sequence of $J$ elements, $\mathcal{X}_{1},\mathcal{X}_{2},\dots,\mathcal{X}_{J}$, and our goal is to forecast the subsequent $K$ elements, $\mathcal{X}_{J+1},\mathcal{X}_{J+2},\dots,\mathcal{X}_{J+K}$. Each element $\mathcal{X}_{t}$ within the sequence belongs to $\mathbb{R}^{C\times H\times W}$, representing a 3-D seismic wavefield. Each wavefield provides spatial information, for an $\mathbb{R}^{H\times W}$ grid, about the particle velocity of the wave propagation across $C$ spatial directions (i.e., X, Y, Z directions). One of the main challenges in modeling ground motion data lies in the necessity to handle multi-scale structures that are complex to model. Thus, it is crucial for a forecasting model to effectively capture the joint correlations present across both spatial and temporal dimensions.

To address this challenge, we propose an AI-enabled framework for forecasting ground motions. Specifically, we develop a wavefield forecasting network (WaveCastNet), which is based on the sequence-to-sequence (seq2seq) framework introduced by [54]. Central to WaveCastNet are two components: an encoder and a decoder. The encoder processes a sequence of seismic wavefields with the aim of summarizing the input sequence into a single encoder state. The decoder, in turn, generates a target sequence of seismic wavefields that is conditioned on the encoder state. Figure 1(c) illustrates the architecture of WaveCastNet for predicting seismic waves. Within this architecture, both the encoder and decoder are composed of stacked recurrent units designed to model sequential data. There exist various formulations of modern recurrent cells, including unitary recurrent units [5], gated recurrent units (GRUs) [13], and long short-term memory (LSTM) units [30]. However, these recurrent cells, with their reliance on fully-connected layers, tend to destroy the intrinsic multi-scale spatial information present in 2-D or 3-D spatial data. The Convolutional Long Short-Term Memory (ConvLSTM) architecture [52] addresses this shortcoming by integrating convolution operations into the LSTM’s update and gating mechanisms, a modification that has proven particularly beneficial for modeling spatio-temporal sequences. ConvLSTM can effectively capture multiscale spatial patterns through convolutional filters; however, it falls short in modeling temporal multiscale structures. To address this shortcoming, we design a novel convolutional long expressive memory (ConvLEM) model, which extends the LEM model [50] by integrating convolutional layers into the LEM architecture. This ConvLEM model is used as the backbone for WaveCastNet’s encoder and decoder.

Our results demonstrate that WaveCastNet improves the predictive accuracy compared to seq2seq frameworks that leverage ConvLSTM, or gated variants. Our WaveCastNet even surpasses the capabilities of newly introduced transformer networks in the context of ground motion forecasting. Importantly, our approach shows robustness and enhanced generalization capabilities, especially in scenarios involving wavefields of greater magnitudes unseen during the training phase. The versatility of our framework is further evidenced by its flexibility, transitioning seamlessly from scenarios with dense, fully captured wavefields to those characterized by sparse, selectively sampled measurements. Expanding on these findings, we show that we can use an ensemble of WaveCastNets to produce uncertainty estimates. This is a critical component for demonstrating and verifying the reliability of our proposed framework. Our work not only showcases WaveCastNet’s improved forecasting accuracy, but it also shows its potential for improving warning times and thus advancing early warning systems.

2  Results

We evaluate our methodology by forecasting particle-velocity waveforms near the Hayward fault, simulating earthquake scenarios in the San Francisco Bay Area (SFBA), northern California, United States, as depicted in Figure 1(a-b). San Francisco, positioned approximately 20 km west of the Hayward fault, ranks among the most densely populated metropolitan regions in the United States. Given the heightened seismic risk associated with the Hayward fault (estimated by the United States Geological Survey (USGS) to exceed 30%), the enhancement of early warning systems is imperative for minimizing infrastructural damage and disruptions, as well as reducing human casualties. Our work focuses on the prediction of ground motion waveforms extending over 120 km along the X direction, parallel to, and 80 km along the Y direction, perpendicular to, the Hayward fault, as outlined by the black rectangle in Figure 1(a-b).

Metrics.

The performance of WaveCastNet is assessed by analyzing the intensity of the ground motions using peak ground velocity (PGV) values, which are defined as:

$$PGV(\mathcal{X})=\max_{t}\sqrt{\mathcal{X}_{t}^{2}[c_{X}]+\mathcal{X}_{t}^{2}[c_{Y}]}, \tag{1}$$

where $\mathcal{X}_{t}^{2}[c_{X}]$ and $\mathcal{X}_{t}^{2}[c_{Y}]$ represent the velocity data in the X and Y directions, respectively. Additionally, we examine the corresponding arrival time, $T_{pgv}$, determined by the equation:

$$T_{pgv}(\mathcal{X})=\operatorname*{arg\,max}_{t}\sqrt{\mathcal{X}_{t}^{2}[c_{X}]+\mathcal{X}_{t}^{2}[c_{Y}]}, \tag{2}$$

indicating the moment when the horizontal amplitude of the particle velocity reaches its peak.
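As a minimal sketch of Equations (1) and (2), the snippet below computes PGV and its arrival time for every grid point of a wavefield sequence. The array layout `(T, C, H, W)`, the channel order (X, Y, Z), the function name `pgv_and_time`, and the convention that the first frame corresponds to time zero are assumptions made for illustration; they are not specified in the text.

```python
import numpy as np

def pgv_and_time(wavefield, dt=0.26, cx=0, cy=1):
    """Return per-pixel PGV and its arrival time T_pgv.

    wavefield: array of shape (T, C, H, W) (assumed layout).
    cx, cy: assumed channel indices of the X and Y velocity components.
    dt: time step in seconds; frame 0 is taken as time zero (assumption).
    """
    # Horizontal amplitude sqrt(X_t^2[c_X] + X_t^2[c_Y]) at every time step.
    horiz = np.sqrt(wavefield[:, cx] ** 2 + wavefield[:, cy] ** 2)  # (T, H, W)
    pgv = horiz.max(axis=0)             # Eq. (1): maximum over time
    t_pgv = horiz.argmax(axis=0) * dt   # Eq. (2): arg max over time, in seconds
    return pgv, t_pgv

# Toy example: 4 time steps, 3 components, a single grid point.
x = np.zeros((4, 3, 1, 1))
x[2, 0, 0, 0] = 3.0   # X component peaks at frame 2
x[2, 1, 0, 0] = 4.0   # Y component peaks at frame 2
pgv, t_pgv = pgv_and_time(x, dt=0.5)
```

In this toy case the horizontal amplitude peaks at frame 2 with value 5, so the arrival time is 1.0 s. The vertical (Z) component is deliberately ignored, matching the horizontal-only definition above.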

Furthermore, to evaluate the accuracy of the predicted wavefield $\hat{\mathcal{X}}$ against the target ground truth $\mathcal{X}$, we use the accuracy (ACC) metric, expressed as:

$$ACC=\frac{\sum_{t,h,w}\hat{\mathcal{X}}_{t}[c,h,w]\cdot\mathcal{X}_{t}[c,h,w]}{\sqrt{\left(\sum_{t,h,w}\hat{\mathcal{X}}_{t}^{2}[c,h,w]\right)\cdot\left(\sum_{t,h,w}\mathcal{X}_{t}^{2}[c,h,w]\right)}},$$

and the relative Frobenius norm error (RFNE), defined as:

$$\text{RFNE}=\frac{\sqrt{\sum_{t,h,w}\left(\hat{\mathcal{X}}_{t}[c,h,w]-\mathcal{X}_{t}[c,h,w]\right)^{2}}}{\sqrt{\sum_{t,h,w}\mathcal{X}_{t}^{2}[c,h,w]}}.$$
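The two metrics can be sketched in a few lines of NumPy. This is an illustrative implementation under assumptions: arrays are laid out as `(T, C, H, W)`, the sums run over $t$, $h$, $w$ separately for each component $c$ (how the components are combined into a single reported number is not stated here), and the function name `acc_rfne` is hypothetical.

```python
import numpy as np

def acc_rfne(pred, target):
    """Per-component ACC and RFNE for arrays of shape (T, C, H, W)."""
    axes = (0, 2, 3)  # sum over t, h, w; keep the component axis c
    num = (pred * target).sum(axis=axes)
    den = np.sqrt((pred ** 2).sum(axis=axes) * (target ** 2).sum(axis=axes))
    acc = num / den
    err = np.sqrt(((pred - target) ** 2).sum(axis=axes))
    rfne = err / np.sqrt((target ** 2).sum(axis=axes))
    return acc, rfne

rng = np.random.default_rng(0)
target = rng.standard_normal((5, 3, 4, 4))
acc_same, rfne_same = acc_rfne(target, target)        # perfect forecast
acc_half, rfne_half = acc_rfne(0.5 * target, target)  # amplitudes halved
```

The second call illustrates why both metrics are useful: halving every amplitude leaves ACC at 1 (the waveform shape is unchanged), while RFNE rises to 0.5 because it penalizes amplitude errors.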

2.1  Point-source Small Earthquakes

Figure 2: Point-source earthquake prediction: (a) Point-source earthquake wavefield snapshots for the Y component data at T=17.4, 21.75, 26.1, and 30.45 s from (top) ground truth, (middle) densely and regularly sampled predicted data, and (bottom) sparsely and irregularly sampled predicted data. (b) Map of PGV values, and (c) corresponding prediction errors; (d) map of $T_{PGV}$, and (e) corresponding prediction errors. The errors are calculated by subtracting the ground truth values from the predicted values; thus, positive values indicate over-prediction and negative values indicate under-prediction. Circles in the bottom figures indicate the location of stations.

We first use WaveCastNet to predict ground motions from point-source earthquakes with magnitudes smaller than M4.5. The training dataset is generated using simulated waveforms at frequencies below 0.5 Hz, with a minimum S-wave velocity of 500 m/s. A total of 960 point sources are positioned at 1 km intervals along the white line shown in Figure 1a, with sources placed at depths between 2 and 15 km (note, the white line represents a rectangular plane extending 60 km horizontally and 13 km vertically). These simulations use a fourth-order finite-difference visco-elastic wave model provided by the open-source SW4 package [46, 47]. The subsurface elastic properties are derived from the San Francisco Bay region 3D seismic velocity model v21.1, developed by the USGS [1, 29]. The source wavelet, modeled as a delta function low-pass filtered at 0.5 Hz, assumes that the corner frequencies of small earthquakes exceed 0.5 Hz, maintaining relatively flat frequency spectra within our simulation bandwidth. A uniform double-couple source mechanism is used for all simulations. These simulated data are used as the ground truth throughout our study.

Our goal with WaveCastNet is to generate forecasts for future 100-second intervals based on data observed during the initial 15.6 seconds post-rupture. The input sequence, denoted as $\mathcal{X}_{1},\mathcal{X}_{2},\dots,\mathcal{X}_{J}$, comprises 60 elements, while the target sequence, $\mathcal{X}_{J+1},\mathcal{X}_{J+2},\dots,\mathcal{X}_{J+K}$, includes 388 elements, with each timestep $\Delta t=0.26$ seconds. The encoder component of WaveCastNet processes the input sequence to derive an encoding state, which subsequently is used by the decoder in generating the target sequence. Instead of forecasting the complete target sequence in one go, we consider iterative predictions of smaller subsequences, each spanning 60 elements. Specifically, we use the predicted subsequences as new inputs to forecast the next 60 elements, and so on. We obtain the entire target sequence in under one second.
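The iterative forecasting procedure can be sketched as a simple rollout loop. The sketch below is illustrative only: `model` is a hypothetical callable that maps 60 frames to the next 60 frames, the `(T, C, H, W)` layout is assumed, and trimming the final chunk to reach exactly 388 frames is an assumption about how the last partial subsequence is handled.

```python
import numpy as np

def rollout(model, inputs, horizon=388, chunk=60):
    """Forecast `horizon` frames by repeatedly feeding predictions back in.

    model: hypothetical callable mapping (chunk, C, H, W) -> (chunk, C, H, W).
    inputs: observed frames of shape (chunk, C, H, W).
    """
    window, outputs, produced = inputs, [], 0
    while produced < horizon:
        window = model(window)   # next `chunk` frames from the current window
        outputs.append(window)
        produced += chunk
    # Trim the final chunk so that exactly `horizon` frames are returned.
    return np.concatenate(outputs, axis=0)[:horizon]

# Stand-in "model" (adds 1 to every value), used only to exercise the loop.
def dummy_model(window):
    return window + 1.0

x = np.zeros((60, 3, 8, 8))
forecast = rollout(dummy_model, x)
n_calls = -(-388 // 60)   # ceil(388 / 60) model calls
```

With 60 input frames at $\Delta t=0.26$ s the observed window covers 15.6 s, and 388 output frames cover roughly 100.9 s, which requires seven calls to the model in this sketch. Because each call consumes the previous prediction, errors can accumulate along the rollout.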

In the following, we consider two scenarios: (i) input sequences consisting of densely sampled wavefields; and (ii) input sequences consisting of sparsely sampled wavefields.

Densely sampled input data.

Here, we use densely sampled wavefields as inputs, where each element of the input and target sequences is a 3-D tensor, $\mathcal{X}_{t}$, with dimensions of 3 (components) × 344 (X direction) × 224 (Y direction).

Figure 2a displays a series of ground truth wavefield snapshots in the top row, while the middle row visualizes the wavefields predicted by WaveCastNet. Our model demonstrates exceptional capability in capturing the patterns of P- and S-wavefronts, as well as the scattered coda waves. Additionally, we assess WaveCastNet’s performance by analyzing the intensity and timing of the ground motions, focusing particularly on the PGV values.

Spatial distributions of PGV and its timing ($T_{PGV}$) are shown in Figure 2b-e. The results show accurate reproduction of large PGVs and their arrival times, notably near the earthquake hypocenter at $X=40$, $Y=38$ km, within the Livermore basin at $X=60$ to $80$ km, $Y=40$ to $60$ km, and in the northeast corner of the model at $X=20$ to $40$ km. The deviations in PGV values are minimal, less than 5% from the ground truth. Errors in $T_{PGV}$ are generally negligible, although larger discrepancies are observed where $T_{PGV}$ exhibits discontinuities, likely influenced by the underlying geological structures.

These findings show that WaveCastNet can capture the complex kinematics and dynamics of wave propagation, including phenomena such as amplitude decay (stemming from both geometrical spreading and intrinsic attenuation) and the amplification effects associated with wave reverberation within geological basins.

Figure 3: Waveforms from (a) San Francisco (NC.J020) and (b) San Jose (NP.1788) for a point-source earthquake. The blue lines indicate the ground truth, the red lines show the mean of the predicted waveforms, and the red shaded areas represent three times the standard deviation from the mean waveforms.

Sparsely sampled input data.

Here, we simulate a scenario more representative of real-world conditions, where seismograph distributions are sparse and irregular, as depicted in Figure 1a. We derive sensor locations from waveforms recorded over the past decade, available through the Northern California Earthquake Data Center database [44].

| Input setting | ACC | RFNE |
|---|---|---|
| Dense and regular sampling | 0.98 | 0.20 |
| Sparse and irregular sampling | 0.96 | 0.27 |

Table 1: Performance metrics for the dense and sparse sampling scenarios. Providing the model with more information (i.e., dense inputs) helps to improve performance.

To obtain sparsely sampled data, we determine the row and column indices $[h,w]$ of each sensor on the wavefield snapshot $\mathcal{X}_{t}$. After eliminating sensors with overlapping indices, we retain data for 564 sensors, forming an input sequence where each element comprises a 2-D tensor with dimensions 3 (components) × 564 (sparse measurements). The corresponding target sequence consists of densely sampled wavefields, akin to the previous experimental setting. To handle the sparsely sampled input sequences, WaveCastNet incorporates a specialized embedding layer, while all other components of the model architecture remain unchanged.
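The extraction of sparse sensor traces from dense wavefields can be sketched as follows. This is a hedged illustration: the function name `sample_sparse`, the `(T, C, H, W)` layout, and the use of `np.unique` to drop sensors that share a grid cell are assumptions, and the embedding layer mentioned above is not reproduced here.

```python
import numpy as np

def sample_sparse(wavefields, rows, cols):
    """Extract sensor traces from dense wavefields.

    wavefields: (T, C, H, W); rows, cols: integer grid indices of the sensors.
    Returns traces of shape (T, C, N) and the N unique index pairs.
    """
    # Keep one sensor per grid cell (np.unique also sorts the index pairs).
    idx = np.unique(np.stack([rows, cols], axis=1), axis=0)
    return wavefields[:, :, idx[:, 0], idx[:, 1]], idx

x = np.arange(2 * 3 * 4 * 5, dtype=float).reshape(2, 3, 4, 5)
rows = np.array([0, 3, 0, 2])
cols = np.array([1, 4, 1, 2])   # the first and third sensors share cell (0, 1)
sparse, idx = sample_sparse(x, rows, cols)
```

Here four sensors collapse to three unique grid cells, so each time step becomes a 3 (components) × 3 (measurements) array, mirroring the 3 × 564 elements described above.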

The bottom row of Figure 2 demonstrates that WaveCastNet effectively predicts wave propagation, PGV, and $T_{PGV}$ across the entire area, even when provided solely with sparsely sampled data. Although the errors are larger compared to those from the dense sampling scenario, sparse measurements are sufficient to capture the dynamics of wave propagation. Table 1 quantitatively evaluates WaveCastNet’s performance across these two different scenarios, showcasing its adaptability to varied sampling conditions.

Figure 4: Uncertainty estimates for the dense point-source ground motion prediction. (a,d) Mean, (b,e) errors, and (c,f) standard deviation of (a-c) $\ln{PGV}$ and (d-f) $T_{PGV}$. A hole in (c), centered at X=40, Y=38 km, indicates the location where $T_{PGV}$ is within our initial time window.

Uncertainty estimation.

Quantifying uncertainty in ground motion forecasting is crucial. To address this, we employ an ensemble approach by training 50 instances of WaveCastNet with different seeds and bootstrapped training data. This approach allows us to calculate the mean and standard deviation of both time-series and their frequency-domain amplitude spectra for stations located in San Francisco and San Jose, as illustrated in Figure 3. Our ensemble successfully captures each waveform in detail, and the predicted amplitude spectra align closely with the ground truth.

Furthermore, WaveCastNet’s performance remains reliable even in scenarios where no seismic waves reach a station within the initial input sequence, such as at station NP.1788 in San Jose (see Figure 3b). The mean values of PGVs and their arrival times ($T_{PGV}$) exhibit excellent agreement with the ground truth, as demonstrated in Figures 4a, b, d, and e. The standard deviations for the logarithmic values of PGVs and $T_{PGV}$ are consistently less than 1% of their mean values, indicating that WaveCastNet provides reliable predictions. Notably, slightly higher deviations are observed within the Livermore basin, potentially reflecting WaveCastNet’s sensitivity to the complex interactions of multiple wavefronts arriving from different directions.
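The ensemble statistics described in this subsection can be sketched as below. The sketch assumes that the forecasts of all ensemble members are stacked along a leading axis; the helper names `ensemble_stats` and `bootstrap_indices` are hypothetical, and the synthetic "members" are random perturbations of a sine wave used only to exercise the code, not model outputs.

```python
import numpy as np

def ensemble_stats(predictions):
    """Mean and standard deviation across ensemble members.

    predictions: array of shape (M, ...) with one forecast per member.
    """
    return predictions.mean(axis=0), predictions.std(axis=0)

def bootstrap_indices(n_samples, seed):
    """Indices of one bootstrapped training set (sampling with replacement)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_samples, size=n_samples)

# Toy stand-in for 50 members forecasting a 388-step trace at one station.
rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0.0, 10.0, 388))
members = truth + 0.01 * rng.standard_normal((50, 388))
mean, std = ensemble_stats(members)
# Band of three standard deviations around the mean, as shaded in Figure 3.
band_lo, band_hi = mean - 3.0 * std, mean + 3.0 * std
idx = bootstrap_indices(960, seed=1)
```

Each ensemble member would be trained with its own seed and its own bootstrapped index set; the same `ensemble_stats` call applies unchanged to maps of $\ln PGV$ or $T_{PGV}$.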

2.2  Generalization to Finite-fault Large Earthquakes

Earthquake early warning systems are designed to mitigate the hazards posed by large magnitude earthquakes. Unlike small earthquakes, which can be modeled as point sources, large earthquakes necessitate representation as finite-size rupture planes. An earthquake rupture initiates at the epicenter and propagates along the fault, emitting seismic waves from each point. This allows for modeling the effects of a finite-size fault as an aggregation of point sources, each initiating seismic activity at a predetermined time. By using a Green’s function response for a point source along the entire fault, it is possible to compute the ground motion from a large magnitude earthquake by integrating the response of multiple point sources regularly distributed on the fault, following a physics-based kinematic rupture model [28, 25, 24, 48].
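The superposition idea can be illustrated with a deliberately simplified sketch: the ground motion at one receiver is formed by summing point-source responses, each weighted and delayed by its rupture initiation time. The function name `finite_fault_trace`, the scalar weights standing in for slip, and the rounding of delays to whole samples are assumptions; slip-rate functions and spatially varying source mechanisms used in kinematic rupture models are omitted.

```python
import numpy as np

def finite_fault_trace(green, slips, delays, dt):
    """Sum delayed, weighted point-source responses at one receiver.

    green:  (N, T) response of each of N subfaults to a unit point source.
    slips:  (N,) relative weight of each subfault (stand-in for slip).
    delays: (N,) rupture initiation time of each subfault in seconds.
    """
    n_t = green.shape[1]
    trace = np.zeros(n_t)
    for g, s, tau in zip(green, slips, delays):
        k = int(round(tau / dt))            # delay expressed in samples
        if k < n_t:
            trace[k:] += s * g[: n_t - k]   # shift the response by k samples
    return trace

dt = 0.26
green = np.zeros((2, 10))
green[:, 0] = 1.0   # unit impulse responses for two subfaults
trace = finite_fault_trace(
    green,
    slips=np.array([1.0, 2.0]),
    delays=np.array([0.0, 3 * dt]),
    dt=dt,
)
```

In this toy case the first subfault contributes an impulse at sample 0 and the second, rupturing three time steps later with twice the weight, contributes an impulse at sample 3.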

Inspired by this concept, we evaluate the capabilities of WaveCastNet to predict finite-fault earthquake waveforms using point-source simulations. For this, we employ kinematic rupture models, suitable for earthquakes ranging from M4.5 to M7, developed in accordance with [24]. These models allow us to generate synthetic waveforms using the same simulation method as that for the point sources. The rupture plane is designed as a vertical rectangle, strategically aligned with the locations of the point sources. The dimensions of the rupture planes are scaled in accordance with the earthquake magnitude to adequately release seismic energy (see Table 2), following the guidelines of [35]. Source parameters such as slip, slip rate, rupture initiation time, and local dip exhibit spatial variability and include stochastic fluctuations at minor scales, allowing the simulation to aggregate the linear responses of numerous point sources with varying parameters. WaveCastNet does not use any of these parameters, not even at inference time. Moreover, as the duration of energy release extends with increasing magnitude [56, 45] and the early waveforms remain similar across a range of magnitudes [43], predicting accurate ground motion waveforms presents a significant challenge.

Figure 5: Evolution of M6 ground motion prediction at the station NP.1788 located in San Jose by using the time window of (a) 15.68, (b) 18.29, and (c) 20.90 s.

| Mw | Fault size (km × km) | $T_{rup}$ (s) | ACC | RFNE |
|---|---|---|---|---|
| 4.5 | 1.8 × 1.8 | 3.5 | 0.95 | 0.35 |
| 5.0 | 3.4 × 3 | 3.7 | 0.95 | 0.37 |
| 5.5 | 8 × 4 | 6.0 | 0.95 | 0.42 |
| 6.0 | 12.5 × 8 | 9.6 | 0.88 | 0.52 |
| 6.5 | 26 × 12 | 13.2 | 0.66 | 0.84 |
| 7.0 | 66 × 15 | 26.6 | 0.53 | 0.86 |

Table 2: Fault size and performance metrics of finite-fault earthquake data predictions using 15.6-second input time window. $T_{rup}$ indicates the end time of the rupture. See Figure D.2-D.6 for the rupture models.

Figure 6: M6 earthquake ground motions for (top) ground truth and (bottom) prediction using 18.2-second input time window. (a) Snapshot waveforms for Y components, (b) PGV, (c) PGV error, (d) $T_{PGV}$ and (e) $T_{PGV}$ error.

Initially, we normalize the data for finite-fault earthquakes using the same pixel-wise mean and standard deviation tensors derived from the point-source training dataset. Subsequently, we scale the data by the standard deviation calculated from the initial 15.6 seconds of waveform data. WaveCastNet exhibits robust forecasting performance for earthquakes ranging from M4.5 to M5.5. However, as shown in Table 2, performance deteriorates for earthquakes of M6 and above. This degradation in performance correlates with the duration of rupture, which extends up to 13.2 seconds for M6.5 and 26.2 seconds for M7 earthquakes, reaching or exceeding the length of the input time window. Consequently, the input waveforms fail to encompass the full extent of the excited energy. This leads to underestimations of amplitude, although the kinematics are reasonably well reproduced, as illustrated in Figure 5a.
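The two-step normalization can be sketched as follows. This is an assumption-laden illustration: the text does not say whether the second-step standard deviation is a single scalar or is computed per component or per pixel, so the sketch uses one scalar over the 60-frame input window; the function name `normalize_finite_fault` and the small constant `eps` are also hypothetical.

```python
import numpy as np

def normalize_finite_fault(x, train_mean, train_std, n_input=60, eps=1e-8):
    """Two-step normalization of a finite-fault sequence x of shape (T, C, H, W).

    train_mean, train_std: pixel-wise statistics of shape (C, H, W) taken from
    the point-source training set. The scalar second-step scale is an assumption.
    """
    z = (x - train_mean) / (train_std + eps)   # step 1: training statistics
    scale = z[:n_input].std() + eps            # step 2: std of the input window
    return z / scale, scale

rng = np.random.default_rng(0)
train_mean = np.zeros((3, 4, 4))
train_std = np.ones((3, 4, 4))
# Synthetic sequence with amplitudes far above the training range.
x = 100.0 * rng.standard_normal((80, 3, 4, 4))
z, scale = normalize_finite_fault(x, train_mean, train_std)
```

Because the scale is computed from the observed input window only, it is available at inference time. Under this reading, predictions would be multiplied by `scale` and then de-normalized with the training statistics to recover physical amplitudes.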

To address this limitation, we extend the length of the input time window, a modification that is feasible in real-world applications. The extended results for the M6 earthquake are shown in Figures 5b-c and 6. As anticipated, waveform recovery improves notably as the time window grows, particularly for the low-frequency components. WaveCastNet forecasts the phases of the waveforms extremely well, but continues to slightly underestimate amplitudes, especially those of early arrivals. The errors in PGV remain within 1.5 log units, but the timing errors can be large: as shown in Figure 5, multiple wavelets have similar peak values, which makes the true peak difficult to identify, especially when reverberations occur. These results affirm WaveCastNet’s substantial potential to generalize effectively to finite-fault earthquake simulations.

3  Discussion and Conclusions

Our experiments confirm that WaveCastNet holds considerable promise for accurately forecasting wavefields derived from both point-source and finite-fault simulations of large magnitude earthquakes. For both dense and (irregular) sparse sensor configurations, WaveCastNet reliably predicts seismic wave propagation and captures PGV values. The excellent fit from the first arrival to the later coda waveforms is remarkable. One interpretation of this behavior is that WaveCastNet captures the Huygens principle, i.e., each spatial point acts as a new source point. Notably, WaveCastNet can process an entire 100-second sequence in just 0.56 seconds using a single NVIDIA A100 GPU. Moreover, we anticipate that inference time can be further improved by optimizing both the model architecture and the inference pipeline.

These findings are particularly significant as they demonstrate the practicality of integrating WaveCastNet into earthquake early warning systems. This integration would significantly advance the systems’ capabilities, facilitating a more rapid response during seismic events.

Generalization. WaveCastNet demonstrates robust generalization up to M5.5, and it generalizes effectively up to M6 when employing a sliding window approach. This modification accounts for energy released later in the earthquake rupture process and has also been used in data-assimilation-based earthquake early warning systems. It is easy to implement and requires neither additional training nor alterations to the existing framework. Importantly, our AI-based forecasting approach does not require prior knowledge of earthquake magnitudes or epicenters. This suggests that WaveCastNet can be effectively trained on a limited dataset, while generalizing to different seismic scenarios, including higher magnitude earthquakes.

It is important to stress that applying a model trained on point-source earthquakes to a larger magnitude earthquake is challenging. This is because the physical representation of the rupture process changes from a point source to a finite-size fault, which is represented by a complex kinematic model. The amplitude of waveforms varies substantially between M4.5 and M7, differing by a factor of more than 80. Additionally, the spatial amplitude decay rate of ground motion intensities varies with magnitude due to changes in the fault size. Empirically, we observe that this can complicate the data normalization process and lead to undesirable underprediction of amplitudes. Moreover, our results suggest that merely extending the input time window is insufficient. Thus, expanding the training set to include waveforms from finite-fault simulations is essential for overcoming these challenges.

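| Model                        | Parameters | Latent Space  | Patch Size | Embed Dimension | ACC  | RFNE |
|------------------------------|------------|---------------|------------|-----------------|------|------|
| Seq2seq using ConvGRU [7]    | 4.99M      | (144, 21, 14) | -          | -               | 0.94 | 0.34 |
| Seq2seq using ConvLSTM [52]  | 6.65M      | (144, 21, 14) | -          | -               | 0.95 | 0.32 |
| WaveCastNet (ours)           | 8.15M      | (144, 21, 14) | -          | -               | 0.96 | 0.27 |
| Swin Transformer [39]        | 13.72M     | -             | (3,4,4)    | 144             | 0.95 | 0.31 |
| Time-S-Former [23]           | 10.21M     | -             | (1,8,8)    | 192             | 0.95 | 0.31 |
| Swin Transformer*            | 24.27M     | -             | (4,4,4)    | 192             | 0.97 | 0.25 |
| Time-S-Former*               | 33.82M     | -             | (1,8,8)    | 192             | 0.98 | 0.20 |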
Table 3: Performance comparison between seq2seq frameworks using different recurrent cells, and state-of-the-art transformers for forecasting small point-source earthquakes. While larger vision transformers can perform better on this task, we show that these models fail to generalize to domain-shifted settings in Figure 7.

Comparative study. To demonstrate the advantages of our proposed approach, we show performance comparisons with baseline models. We evaluate WaveCastNet against seq2seq frameworks which use ConvLSTM [52] and ConvGRU [7] as backbones. Results, presented in Table 3, show better accuracy and lower relative Frobenius norm error for WaveCastNet in predicting point-source earthquakes. These experiments use data that are spatially downsampled by a factor of four to ensure model convergence with fewer computational resources. Nevertheless, the setup mirrors that discussed in Sec. 2.1, with each $\mathcal{X}_t$ within the sequence reduced to $\mathbb{R}^{C \times \frac{H}{4} \times \frac{W}{4}}$. The findings demonstrate WaveCastNet’s effectiveness, achieving better accuracy and lower reconstruction errors, while using the same latent space dimensions.
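For orientation, a relative Frobenius norm error can be computed as in the sketch below. The definition used here, the Frobenius norm of the residual divided by the Frobenius norm of the ground truth over flattened wavefields, is the common one and is an assumption rather than something stated in this section; the function name `rfne` is hypothetical.

```python
import math


def rfne(pred, true):
    """Relative Frobenius norm error between two flattened wavefields."""
    residual = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    reference = math.sqrt(sum(t ** 2 for t in true))
    return residual / reference
```

Under this definition a value of 0 means a perfect reconstruction, and a value of 1 means the error is as large as the signal itself, for example when the prediction is identically zero.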

Figure 7: Generalization performance as a function of the earthquake magnitude. All models are trained on point-source earthquakes only, and it can be seen that WaveCastNet generalizes best to domain-shifted settings.

Additionally, we compare WaveCastNet to state-of-the-art transformer architectures designed for spatio-temporal modeling, including the Swin transformer [38] and the Time-S-Former [23]. Despite the good performance on the task of predicting point-source earthquakes, these transformers struggled with generalization in forecasting higher magnitude earthquakes, as indicated by large relative errors across magnitudes in Figure 7. The comparative study reveals that our WaveCastNet offers beneficial trade-offs: it requires fewer parameters than transformers, facilitates faster inference times, and introduces a regularization effect through its information bottleneck, aiding generalization.

Future Directions. Our experiments used synthetic, noise-free data at frequencies below 0.5 Hz. Moving forward, we plan to apply WaveCastNet to actual earthquake observations — a process we are currently preparing to undertake. The strong generalization capabilities observed suggest that it is sufficient to train WaveCastNet on a large number of real, small-magnitude earthquake recordings. We also expect that the model can be trained on both synthetic and real ground motion data, which may help to reduce uncertainties in visco-elastic earth models and earthquake source parameters.

Future directions also include the exploration of data augmentation schemes to further improve robustness in domain-shifted settings [36, 37, 15], as well as recently proposed state-space models for modeling sequences [27, 59, 58].

4  Method

In this section, we outline the methodology behind WaveCastNet. We begin with an overview of the sequence-to-sequence (seq2seq) framework that serves as the basis for our forecasting model. We then explain the ConvLEM model, which is central to our approach. We then describe the data normalization and preprocessing strategies we employed, as well as the processes involved in generating the data used for our experiments.

4.1  Wavefield Forecasting Network

Our WaveCastNet is based on the sequence-to-sequence (seq2seq) framework, originally developed for natural language processing [54]. Similar to other seq2seq models, WaveCastNet comprises four primary components:

  • Embedding layer. This layer maps input wavefields into a latent space. We employ two types of embedding layers: (i) convolutional layers enhanced with batch normalization and LeakyReLU activation, optimized for embedding densely sampled wavefields into a latent space; and (ii) fully connected layers, followed by convolutional layers, optimized for embedding sparsely sampled wavefields into a latent space.

  • Encoder. The encoder processes the embedded sequence into a fixed-size encoder state that provides a compressed summary of the input sequence necessary for generating the target sequence.

  • Decoder. Operating sequentially, the decoder predicts each element of the target sequence one at a time. It uses the previously predicted output combined with the encoder state to forecast the next element.

  • Reconstruction layer. The reconstruction layer allows us to recover detailed spatial information from the predicted latent sequences by using transposed convolutional layers alongside pixel-shuffle techniques to reconstruct the high-resolution wavefield.

Both the encoder and decoder use our novel ConvLEM cell (see Section 4.2 for details), which is designed to capture complex multi-scale patterns in both spatial and temporal dimensions. Additional technical details of the embedding and reconstruction layers are discussed in the appendix.

The seq2seq framework seeks to find a target sequence $\boldsymbol{\mathcal{Y}} := \mathcal{X}_{J+1}, \dots, \mathcal{X}_{J+K}$ from a given input sequence $\boldsymbol{\mathcal{X}} := \mathcal{X}_{1}, \dots, \mathcal{X}_{J}$. The objective is to optimize the conditional probability:

$$\tilde{\boldsymbol{\mathcal{Y}}} = \operatorname*{arg\,max}_{\boldsymbol{\mathcal{Y}}} \, p(\boldsymbol{\mathcal{Y}} \mid \boldsymbol{\mathcal{X}}) \approx \mathcal{D}_{\text{decoder}}(\mathcal{E}_{\text{encoder}}(\boldsymbol{\mathcal{X}})). \tag{3}$$

While it is challenging to compute the conditional probability directly, an encoder-decoder framework can be used to generate an approximate target sequence [54]. In this process, an encoder, denoted as $\mathcal{E}_{\text{encoder}}$, compresses the embedded input sequence $\boldsymbol{\mathcal{X}}$ into a concise encoder state. This state is subsequently used by a decoder, $\mathcal{D}_{\text{decoder}}$, to generate the predicted latent sequence $\tilde{\boldsymbol{\mathcal{Y}}}$, which can then be mapped to the desired output space by a reconstruction layer. This approach effectively leverages the encoded information to produce a sequence that approximates the target sequence.
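The encode-then-decode control flow described above can be sketched with scalar stand-ins. Everything named here is hypothetical: `encode`, `decode`, `toy_cell`, and `toy_readout` are illustrative helpers, and the leaky average merely takes the place of a recurrent cell such as ConvLEM; the embedding and reconstruction layers are left out.

```python
from typing import Callable, List

State = float
Cell = Callable[[float, State], State]  # (input, state) -> new state


def encode(inputs: List[float], cell: Cell, state: State = 0.0) -> State:
    """Fold the (embedded) input sequence into a fixed-size encoder state."""
    for x in inputs:
        state = cell(x, state)
    return state


def decode(state: State, first_input: float, cell: Cell,
           readout: Callable[[State], float], steps: int) -> List[float]:
    """Predict `steps` elements one at a time, feeding each prediction back."""
    outputs = []
    x = first_input
    for _ in range(steps):
        state = cell(x, state)
        x = readout(state)  # the previous prediction becomes the next input
        outputs.append(x)
    return outputs


def toy_cell(x: float, h: State) -> State:
    """Toy recurrent cell (assumption): a leaky average of state and input."""
    return 0.5 * h + 0.5 * x


def toy_readout(h: State) -> float:
    """Toy readout (assumption): the state itself is the prediction."""
    return h


inputs = [1.0, 1.0, 1.0, 1.0]
h = encode(inputs, toy_cell)
forecast = decode(h, inputs[-1], toy_cell, toy_readout, steps=3)
```

In the actual model the state and inputs are tensors rather than floats, but the loop structure (one pass over the input sequence, then an autoregressive rollout of the target length) is the same idea.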

WaveCastNet tailors this seq2seq framework specifically for the task of forecasting ground motions, treating the prediction challenge as a regression problem. We aim to minimize the sum of all the squared differences between the predicted wavefields $\hat{\mathcal{X}}$ and the actual wavefields $\mathcal{X}$:

$$\mathcal{L}_{2} = \frac{1}{T}\sum_{t=1}^{T}\|\hat{\mathcal{X}}_{t} - \mathcal{X}_{t}\|_{F}^{2}, \tag{4}$$

where $\|\cdot\|_{F}$ denotes the Frobenius norm. Under the assumption that prediction errors follow a normal distribution, minimizing the $\mathcal{L}_{2}$ loss corresponds to maximizing the likelihood of the data given the model. This approach guides the learning of the model parameters through the minimization of the loss across all forecasted and actual sequences. During inference, the model uses these learned parameters to generate target sequences for new input sequences.
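To make the likelihood connection concrete, here is a short derivation under an added assumption: the errors on all entries are independent and Gaussian with zero mean and a common variance $\sigma_{\epsilon}^{2}$ (the symbol $\sigma_{\epsilon}$ is introduced here for illustration only). The negative log-likelihood of the observed wavefields is then

$$-\log p(\mathcal{X}_{1:T} \mid \hat{\mathcal{X}}_{1:T}) = \frac{1}{2\sigma_{\epsilon}^{2}}\sum_{t=1}^{T}\|\hat{\mathcal{X}}_{t} - \mathcal{X}_{t}\|_{F}^{2} + \text{const} = \frac{T}{2\sigma_{\epsilon}^{2}}\,\mathcal{L}_{2} + \text{const},$$

so for fixed $\sigma_{\epsilon}$ the parameters that minimize $\mathcal{L}_{2}$ are the same ones that maximize the likelihood.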

To further enhance the model’s performance, we adopt the Huber loss during training, defined as follows:

$$\mathcal{L}_{\text{Huber}} = \frac{\sum_{t,c,h,w} L_{\delta}\left(\hat{\mathcal{X}}_{t}[c,h,w], \mathcal{X}_{t}[c,h,w]\right)}{TCHW}, \tag{5}$$

with the loss function $L_{\delta}$ given by:

$$L_{\delta}\left(\hat{x}, x\right) = \begin{cases} \frac{1}{2}\left(\hat{x} - x\right)^{2} & \text{for } |\hat{x} - x| \leq \delta, \\ \delta \cdot \left(|\hat{x} - x| - \frac{1}{2}\delta\right) & \text{otherwise.} \end{cases} \tag{6}$$

The Huber loss effectively balances the $L1$ and $L2$ norms, which supports more robust fitting across various earthquake conditions and depths during training. Specifically, we find that the Huber loss improves WaveCastNet’s capability to better capture the challenging PGV patterns. Moreover, we observe that using this loss enables our model to better generalize across different earthquake magnitudes and conditions, while also ensuring faster convergence during training.
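The piecewise definition of the Huber loss translates directly into code. The sketch below operates on flattened wavefields; the function names and the default `delta=1.0` are assumptions, since the value of $\delta$ used in training is not given here.

```python
def huber(x_hat: float, x: float, delta: float = 1.0) -> float:
    """Element-wise Huber term: quadratic near zero, linear in the tails."""
    r = abs(x_hat - x)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)


def huber_loss(pred, true, delta: float = 1.0) -> float:
    """Mean of the element-wise Huber term over all flattened entries."""
    if len(pred) != len(true):
        raise ValueError("pred and true must have the same length")
    return sum(huber(p, t, delta) for p, t in zip(pred, true)) / len(pred)
```

For a residual of 0.5 the quadratic branch gives 0.125, while a residual of 3 falls in the linear branch and gives 2.5 instead of the 4.5 that a pure squared error would produce, which is how large outlying amplitudes are kept from dominating the gradient.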

4.2  Convolutional Long Expressive Memory

We propose Convolutional Long Expressive Memory (ConvLEM) to overcome the limitations of traditional recurrent units in modeling complex multi-scale structures across spatial and temporal dimensions. These limitations are highlighted when recurrent units are viewed as dynamical systems [12, 16], where the evolution over time is governed by a system of input-dependent ordinary differential equations:

$$\frac{d\mathbf{h}}{dt} = \tau \cdot f_{\theta}(\mathbf{h}(t), \mathbf{x}(t)), \tag{7}$$

where inputs $\mathbf{x}(t) \in \mathbb{R}^{d}$ and hidden states $\mathbf{h}(t) \in \mathbb{R}^{l}$ are modeled as continuous functions over time $t \in [0, T]$. However, this model is limited to modeling dynamics at a fixed temporal scale $\tau$. An intuitive approach to address this issue involves integrating a high-dimensional gating function to replace $\tau$, aiming to model dynamics occurring across various time scales [13, 17]. Nevertheless, employing a single gating mechanism often falls short of adequately capturing the complexities found in more challenging dynamical systems.

In this work, we enhance the modeling of multi-scale temporal structures by extending the recently introduced Long Expressive Memory (LEM) unit [50]. This approach is based on the following coupled differential equations:

$$\begin{aligned} \frac{d\mathbf{c}(t)}{dt} &= \mathbf{g}_{c} \odot \left[f^{c}_{\theta_{c}}(\mathbf{h}(t), \mathbf{x}(t)) - \mathbf{c}(t)\right], \\ \frac{d\mathbf{h}(t)}{dt} &= \mathbf{g}_{h} \odot \left[f^{h}_{\theta_{h}}(\mathbf{c}(t), \mathbf{x}(t)) - \mathbf{h}(t)\right], \end{aligned} \tag{8}$$

where $\mathbf{h}(t) \in \mathbb{R}^{l}$ and $\mathbf{c}(t) \in \mathbb{R}^{l}$ represent the slow and fast evolving hidden states, respectively. The gating functions $\mathbf{g}_{c}$ and $\mathbf{g}_{h}$, which are dependent on both the input and the states, introduce variability in temporal scales into the dynamics of the model. Here, $\odot$ signifies the Hadamard product, ensuring element-wise multiplication.

We advance the basic LEM unit by incorporating convolutional operations that facilitate modeling of both input-to-state and state-to-state transitions, akin to those used in the ConvLSTM model [52]. By representing the hidden states and inputs as tensors, we are better able to preserve and model critical multi-scale spatial patterns. The ConvLEM is thus formulated as:

$$\frac{d\mathcal{C}(t)}{dt} = \mathbf{g}_{c} \odot \left[f^{c}_{\theta_{c}}(\mathcal{H}(t), \mathcal{X}(t)) - \mathcal{C}(t)\right], \tag{9}$$
$$\frac{d\mathcal{H}(t)}{dt} = \mathbf{g}_{h} \odot \left[f^{h}_{\theta_{h}}(\mathcal{C}(t), \mathcal{X}(t)) - \mathcal{H}(t)\right]. \tag{10}$$

In these equations, $\mathcal{H}(t) \in \mathbb{R}^{r \times p \times q}$ and $\mathcal{C}(t) \in \mathbb{R}^{r \times p \times q}$ denote the slow and fast evolving hidden states, respectively. The input $\mathcal{X}(t) \in \mathbb{R}^{c \times h \times w}$ is a three-dimensional tensor.

To effectively train this model, an appropriate discretization scheme is essential, as it enables the learning of model weights through backpropagation through time. Following the methodology presented in [50], we consider a positive timestep $\Delta t$ and use the Implicit-Explicit (IMEX) time-stepping scheme. This yields the discretized version of the ConvLEM unit as follows:

$$\bm{\Delta}\mathbf{t}_{n} = \Delta t \, \mathbf{g}_{c} \tag{11}$$
$$\overline{\bm{\Delta}\mathbf{t}_{n}} = \Delta t \, \mathbf{g}_{h} \tag{12}$$
$$\mathcal{C}_{n} = \left(\mathbbm{1} - \bm{\Delta}\mathbf{t}_{n}\right) \odot \mathcal{C}_{n-1} + \bm{\Delta}\mathbf{t}_{n} \odot f^{c}_{\theta_{c}} \tag{13}$$
$$\mathcal{H}_{n} = \left(\mathbbm{1} - \overline{\bm{\Delta}\mathbf{t}_{n}}\right) \odot \mathcal{H}_{n-1} + \overline{\bm{\Delta}\mathbf{t}_{n}} \odot f^{h}_{\theta_{h}} \tag{14}$$

with update functions

$$f^{c}_{\theta_{c}} = \tanh\left(\mathbf{W}_{hc} * \mathcal{H}_{n-1} + \mathbf{W}_{xc} * \mathcal{X}_{n}\right), \tag{15}$$
$$f^{h}_{\theta_{h}} = \tanh\left(\mathbf{W}_{ch} * \mathcal{C}_{n} + \mathbf{W}_{xh} * \mathcal{X}_{n}\right), \tag{16}$$

and gating functions

$$\mathbf{g}_{h} = \sigma\left(\mathbf{W}_{x\overline{t}} * \mathcal{X}_{n} + \mathbf{W}_{h\overline{t}} * \mathcal{H}_{n-1}\right), \tag{17}$$
$$\mathbf{g}_{c} = \sigma\left(\mathbf{W}_{xt} * \mathcal{X}_{n} + \mathbf{W}_{ht} * \mathcal{H}_{n-1}\right). \tag{18}$$

In this notation, $\mathbf{W}_{\cdot,\cdot}$ denotes the weight tensors, $\odot$ represents the Hadamard product, and $*$ indicates the convolutional operator, with subscript $n$ marking a discrete time step ranging from $1$ to $N$. The matrix of ones, denoted as $\mathbbm{1}$, matches the shape of the hidden states. The sigmoid function $\sigma$, used in the gating functions, maps activations to a range between 0 and 1. Note that, for brevity, bias vectors are omitted from the update and gating functions.
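One discretized update, covering Eqs. (11)-(18) without biases, can be sketched in plain Python. This is an illustrative simplification rather than the authors' implementation: it assumes a single channel, input and hidden states of the same spatial size, zero-padded convolutions, and $\Delta t = 1$; the helper names and the dictionary keys for the kernels (for example `"xtbar"` for $\mathbf{W}_{x\overline{t}}$) are hypothetical.

```python
import math


def conv2d_same(x, k):
    """Single-channel 2D cross-correlation with zero padding ('same' size)."""
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    ii, jj = i + a - ph, j + b - pw
                    if 0 <= ii < H and 0 <= jj < W:
                        acc += k[a][b] * x[ii][jj]
            out[i][j] = acc
    return out


def combine(fn, *grids):
    """Apply fn entry-wise across equally shaped 2D grids."""
    return [[fn(*vals) for vals in zip(*rows)] for rows in zip(*grids)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def convlem_step(X, H_prev, C_prev, W, dt=1.0):
    """One ConvLEM update; W maps subscript names to 2D kernels."""
    def pre(k1, a, k2, b):  # W_k1 * a + W_k2 * b
        return combine(lambda u, v: u + v,
                       conv2d_same(a, W[k1]), conv2d_same(b, W[k2]))

    # Gates (Eqs. 17-18) scaled into per-entry time steps (Eqs. 11-12).
    dt_c = combine(lambda v: dt * sigmoid(v), pre("xt", X, "ht", H_prev))
    dt_h = combine(lambda v: dt * sigmoid(v), pre("xtbar", X, "htbar", H_prev))

    # Fast state: candidate from Eq. 15, then the update of Eq. 13.
    f_c = combine(math.tanh, pre("hc", H_prev, "xc", X))
    C = combine(lambda g, c, f: (1 - g) * c + g * f, dt_c, C_prev, f_c)

    # Slow state uses the freshly updated C_n: Eq. 16, then Eq. 14.
    f_h = combine(math.tanh, pre("ch", C, "xh", X))
    H = combine(lambda g, h, f: (1 - g) * h + g * f, dt_h, H_prev, f_h)
    return H, C
```

Because the gates lie in (0, 1), choosing $\Delta t \leq 1$ makes each update a convex combination of the previous state and a tanh-bounded candidate, so states that start in (-1, 1) stay there.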

Based on the model structures outlined above, we further introduce a reset gate $\mathbf{g}_{reset}$ to refine the modeling of the correlation between fast and slow hidden states:

$$\mathbf{g}_{reset} = \sigma\left(\mathbf{W}_{xr} * \mathcal{X}_{n} + \mathbf{W}_{hr} * \mathcal{H}_{n-1}\right). \tag{19}$$

The reset gate is integrated into the update function for the slow hidden states as follows:

$$f^{h}_{\theta_{h}} = \tanh\left(\mathbf{g}_{reset} \odot \left(\mathbf{W}_{ch} * \mathcal{C}_{n}\right) + \mathbf{W}_{xh} * \mathcal{X}_{n}\right). \tag{20}$$

Intuitively, this additional gate helps to improve the flow of relevant information from the updated fast hidden states into updating the slow hidden states.

Enhancing the gating functions proves beneficial for modeling complex spatio-temporal problems in practice. Leveraging the concept of “peephole connections” [53], we further enhance the gates by injecting information about the fast hidden states. We define these gates as follows:

$$\begin{aligned} \mathbf{g}_{h} &= \sigma\left(\mathbf{W}_{x\overline{t}} * \mathcal{X}_{n} + \mathbf{W}_{h\overline{t}} * \mathcal{H}_{n-1} + \mathbf{W}_{c\overline{t}} \odot \mathcal{C}_{n-1}\right), \\ \mathbf{g}_{c} &= \sigma\left(\mathbf{W}_{xt} * \mathcal{X}_{n} + \mathbf{W}_{ht} * \mathcal{H}_{n-1} + \mathbf{W}_{ct} \odot \mathcal{C}_{n}\right), \\ \mathbf{g}_{reset} &= \sigma\left(\mathbf{W}_{xr} * \mathcal{X}_{n} + \mathbf{W}_{hr} * \mathcal{H}_{n-1} + \mathbf{W}_{cr} \odot \mathcal{C}_{n}\right). \end{aligned}$$

These modified gates show an improved ability to process longer sequences more accurately. Intuitively, by incorporating additional contextual information, these gates are better suited to model complex multi-scale dynamics, which in turn improves the model’s expressiveness.

Figure 8 illustrates the discretized ConvLEM unit.

Figure 8: Schematic of the ConvLEM cell. Here, $\sigma_1$ and $\sigma_2$ represent $\mathbf{g}_c$ and $\mathbf{g}_h$, respectively. The update function $f$ is set to $\tanh$. Red links indicate peephole connections.

4.3  Normalization

Seismic waves have different residence times at different geographic locations, which leads to significantly greater ground-motion variance in certain regions. Normalization is therefore crucial for good forecasting performance. In this work, we use a particle velocity-wise normalization scheme for each snapshot.

Consider all $Q$ sequences in the training set, where each sequence is composed of $T$ snapshots $\{\mathcal{X}_t^q\}$. For each particle velocity $\mathcal{X}[c,h,w]$, we compute the mean and standard deviation across all snapshots in the training set:

\[
\{\mathcal{X}_t^q[c,h,w] \mid q = 0, 1, 2, \dots, Q-1;\ t = 0, 1, 2, \dots, T-1\}.
\]

The resulting mean and standard deviation tensors, denoted $\mathcal{X}_{\text{mean}}$ and $\mathcal{X}_{\text{std}}$, respectively, have the same shape as a snapshot $\mathcal{X}_t$. During data preprocessing, we apply particle velocity-wise normalization to each snapshot $\mathcal{X}_t$ as follows:

\[
\bar{\mathcal{X}}_t[c,h,w] = \frac{\mathcal{X}_t[c,h,w] - \mathcal{X}_{\text{mean}}[c,h,w]}{\mathcal{X}_{\text{std}}[c,h,w]}.
\]

Particle velocity-wise normalization also prevents potential spatial information leakage during the normalization process for our sparse sampling scenario.
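As a rough illustration of the scheme above, the following NumPy sketch computes the per-entry statistics over all training snapshots and applies them. The array layout `(Q, T, C, H, W)`, the function names, the synthetic data, and the small `eps` guard against zero variance are assumptions made for this example; they are not taken from the WaveCastNet implementation.

```python
import numpy as np

def fit_velocity_stats(train):
    """Per-entry mean and std over all snapshots of all training sequences.

    train: array of shape (Q, T, C, H, W), an assumed layout.
    Returns two arrays of shape (C, H, W).
    """
    mean = train.mean(axis=(0, 1))
    std = train.std(axis=(0, 1))
    return mean, std

def normalize(snapshot, mean, std, eps=1e-8):
    """Apply (x - mean) / std entry by entry; eps is an added safeguard."""
    return (snapshot - mean) / (std + eps)

rng = np.random.default_rng(0)
# Synthetic stand-in for simulated wavefields: Q=4 sequences, T=6 snapshots,
# C=3 velocity components on an 8 x 5 grid, with a location-dependent scale.
scale = rng.uniform(0.1, 10.0, size=(3, 8, 5))
train = rng.normal(size=(4, 6, 3, 8, 5)) * scale

mean, std = fit_velocity_stats(train)
normalized = normalize(train, mean, std)  # broadcasts over Q and T
```

Because the statistics are indexed by $[c,h,w]$, each location is scaled using only its own time series, so no value from a neighboring location enters the normalization.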

Normalization for domain-shifted settings.

The ground motion of earthquakes with higher magnitudes (e.g., M4.5–M7), once normalized, exhibits a considerably wider range than the normalized M4 data. Thus, we normalize the ground motions a second time, using only the information present in the input window, to bring them into a reasonable range. Given the input window from time step $t_1$ to $t_2$, we apply a channel-wise normalization to each input snapshot $\mathcal{X}_t$ based on the standard deviation computed over the following set:

\[
\{\bar{\mathcal{X}}_t[c,h,w] \mid t = t_1, t_1+1, \dots, t_2;\ h = 0, 1, 2, \dots, H-1;\ w = 0, 1, 2, \dots, W-1\}.
\]

The reasons for not using the particle velocity-wise normalization here are twofold. Firstly, the initial particle velocity-wise normalization has already introduced varying standard deviations for different spatial locations. Secondly, since ground motion in the early warning area is observed to be zero within the input window, the particle velocity-wise standard deviation tensor would consist mostly of zeros, making the normalization process infeasible.
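The window-based rescaling can be sketched as follows. This is a minimal example under stated assumptions: the input window is stored as an array of shape `(T_in, C, H, W)` that has already been normalized particle velocity-wise, the function name and `eps` guard are hypothetical, and only a division by the channel-wise standard deviation is shown, since the text does not specify any further centering.

```python
import numpy as np

def window_channel_scale(window, eps=1e-8):
    """Channel-wise std over time and space of the input window.

    window: array of shape (T_in, C, H, W) of already-normalized snapshots.
    Returns an array of shape (C, 1, 1) so it broadcasts over H and W.
    """
    std = window.std(axis=(0, 2, 3))
    return std[:, None, None] + eps

rng = np.random.default_rng(1)
# Stand-in for a larger event: channel amplitudes well above unit range.
amplitude = np.array([5.0, 20.0, 50.0])[:, None, None]
window = rng.normal(size=(10, 3, 16, 12)) * amplitude

scale = window_channel_scale(window)
rescaled = window / scale  # one scalar per channel, shared by all locations
```

Using one scalar per channel keeps the relative amplitudes between locations intact, and it remains well defined even when large parts of the window are still zero.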

4.4  Data Generation

We simulate point-source and finite-fault earthquake ground motions up to 0.5 Hz within a three-dimensional (3D) volume extending 120 km in the fault parallel (FP) direction (X direction), 80 km in the fault normal (FN) direction (Y direction), and 30 km in depth. These simulations are conducted using the USGS San Francisco Bay region 3D seismic velocity model (SFVM) v21.1 [29]. Material properties, including the Vp-Vs relationships, are defined for each geological unit based on laboratory and well-log measurements, which include parameters such as P- and S-wave velocities [29, 10, 2]. Simulations are initiated with a minimum S-wave velocity of 500 m/s. We generate visco-elastic wave fields using the open-source SW4 package, which computes the 4th order finite-difference solution of the visco-elastic wave equations [47]. This software package is well-established, with its accuracy validated through numerous ground motion simulations [40, 41, 49].

The surface of the Earth is modeled with a free-surface condition, while the outer boundaries use absorbing boundary conditions through a super-grid approach spanning 30 grid points. We consider a flat surface and, to avoid numerical dispersion, use a simulation grid with a spacing of 150 m at the surface, which ensures a minimum of six grid points per wavelength. To optimize computational resources, the grid spacing is doubled at depths of 2.2 km and 6.6 km. The largest grid spacing is 600 m, and the grid comprises approximately 9.59 million grid points in total. Attenuation and velocity dispersion are modeled using three standard linear solid models, assuming a constant Q over the simulated frequency range. Each simulation runs for 120 seconds with a time step of 0.0260134 seconds, resulting in 4,613 time steps. The three-component particle velocities are recorded every 10 steps (i.e., every 0.260134 s) on a 150 m × 150 m grid and are then downsampled to a 300 m × 300 m grid for training and testing WaveCastNet. These simulations are carried out on 12 nodes equipped with Intel Xeon Gold 5218/6230 CPUs within the Lawrencium cluster at Lawrence Berkeley National Laboratory.
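The simulation parameters quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only values stated in this section; the variable names are illustrative, and the shortest wavelength is estimated as the minimum S-wave velocity divided by the maximum frequency.

```python
duration_s = 120.0      # simulated time per run
dt_s = 0.0260134        # solver time step
record_every = 10       # output stride in solver steps
f_max_hz = 0.5          # highest simulated frequency
vs_min_m_s = 500.0      # minimum S-wave velocity
dx_surface_m = 150.0    # grid spacing at the surface

# Number of solver steps and the interval between recorded snapshots.
n_steps = round(duration_s / dt_s)
snapshot_dt_s = record_every * dt_s

# Shortest wavelength and how many surface grid points resolve it.
min_wavelength_m = vs_min_m_s / f_max_hz
points_per_wavelength = min_wavelength_m / dx_surface_m

# Grid spacing after each of the two doublings with depth.
spacings_m = [dx_surface_m * 2 ** k for k in range(3)]

print(n_steps, round(snapshot_dt_s, 6))
print(min_wavelength_m, round(points_per_wavelength, 2))
print(spacings_m)
```

With these inputs the run has 4,613 steps, snapshots are about 0.26 s apart, and the shortest wavelength of 1,000 m is sampled by roughly 6.7 surface grid points, consistent with the stated minimum of six.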

References

  • [1] B. T. Aagaard, T. M. Brocher, D. Dolenc, D. Dreger, R. W. Graves, S. Harmsen, S. Hartzell, S. Larsen, and M. L. Zoback. Ground-Motion Modeling of the 1906 San Francisco Earthquake, Part I: Validation Using the 1989 Loma Prieta Earthquake. Bulletin of the Seismological Society of America, 98(2):989–1011, 2008.
  • [2] B. T. Aagaard, R. W. Graves, A. Rodgers, T. M. Brocher, R. W. Simpson, D. Dreger, N. A. Petersson, S. C. Larsen, S. Ma, and R. C. Jachens. Ground-Motion Modeling of Hayward Fault Scenario Earthquakes, Part II: Simulation of Long-Period and Broadband Ground Motions. Bulletin of the Seismological Society of America, 100(6):2945–2977, 2010.
  • [3] R. M. Allen and D. Melgar. Earthquake Early Warning: Advances, Scientific Challenges, and Societal Needs. Annual Review of Earth and Planetary Sciences, 47(1):1–28, 2019.
  • [4] R. M. Allen and M. Stogaitis. Global Growth of Earthquake Early Warning. Science, 375(6582):717–718, 2022.
  • [5] M. Arjovsky, A. Shah, and Y. Bengio. Unitary Evolution Recurrent Neural Networks. In International Conference on Machine Learning, pages 1120–1128, 2016.
  • [6] G. M. Atkinson and D. M. Boore. Modifications to Existing Ground-Motion Prediction Equations in Light of New Data. Bulletin of the Seismological Society of America, 101(3):1121–1135, 2011.
  • [7] N. Ballas, L. Yao, C. Pal, and A. Courville. Delving Deeper Into Convolutional Networks for Learning Video Representations. International Conference on Learning Representations, 2016.
  • [8] J. Bayless and N. A. Abrahamson. Summary of the BA18 Ground‐Motion Model for Fourier Amplitude Spectra for Crustal Earthquakes in California. Bulletin of the Seismological Society of America, 109(5):2088–2105, 2019.
  • [9] Y. Bozorgnia, N. A. Abrahamson, L. A. Atik, T. D. Ancheta, G. M. Atkinson, J. W. Baker, A. Baltay, D. M. Boore, K. W. Campbell, B. S.-J. Chiou, R. Darragh, S. Day, J. Donahue, R. W. Graves, N. Gregor, T. Hanks, I. Idriss, R. Kamai, T. Kishida, A. Kottke, S. A. Mahin, S. Rezaeian, B. Rowshandel, E. Seyhan, S. Shahi, T. Shantz, W. Silva, P. Spudich, J. P. Stewart, J. Watson-Lamprey, K. Wooddell, and R. Youngs. NGA-West2 Research Project. Earthquake Spectra, 30(3):973–987, 2014.
  • [10] T. M. Brocher. Compressional and Shear-Wave Velocity versus Depth Relations for Common Rock Types in Northern California. Bulletin of the Seismological Society of America, 98(2):950–968, 2008.
  • [11] C. Chai, M. Maceira, H. J. Santos‐Villalobos, S. V. Venkatakrishnan, M. Schoenball, W. Zhu, G. C. Beroza, C. Thurber, and E. C. Team. Using a Deep Neural Network and Transfer Learning to Bridge Scales for Seismic Phase Picking. Geophysical Research Letters, 47(16), 2020.
  • [12] B. Chang, M. Chen, E. Haber, and E. H. Chi. AntisymmetricRNN: A dynamical system view on recurrent neural networks. In International Conference on Learning Representations, 2018.
  • [13] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. In Empirical Methods in Natural Language Processing, pages 1724–1734, 2014.
  • [14] J. Douglas, S. Akkar, G. Ameri, P.-Y. Bard, D. Bindi, J. J. Bommer, S. S. Bora, F. Cotton, B. Derras, M. Hermkes, N. M. Kuehn, L. Luzi, M. Massa, F. Pacor, C. Riggelsen, M. A. Sandıkkaya, F. Scherbaum, P. J. Stafford, and P. Traversa. Comparisons Among the Five Ground-Motion Models Developed Using RESORCE for the Prediction of Response Spectral Accelerations due to Earthquakes in Europe and the Middle East. Bulletin of Earthquake Engineering, 12(1):341–358, 2014.
  • [15] B. Erichson, S. H. Lim, W. Xu, F. Utrera, Z. Cao, and M. Mahoney. NoisyMix: boosting model robustness to common corruptions. In International Conference on Artificial Intelligence and Statistics, pages 4033–4041. PMLR, 2024.
  • [16] N. B. Erichson, O. Azencot, A. Queiruga, L. Hodgkinson, and M. W. Mahoney. Lipschitz recurrent neural networks. In International Conference on Learning Representations, 2021.
  • [17] N. B. Erichson, S. H. Lim, and M. W. Mahoney. Gated recurrent neural networks with weighted time-delay feedback. arXiv preprint arXiv:2212.00228, 2022.
  • [18] N. B. Erichson, L. Mathelin, Z. Yao, S. L. Brunton, M. W. Mahoney, and J. N. Kutz. Shallow Neural Networks for Fluid Flow Reconstruction With Limited Sensors. Proceedings of the Royal Society A, 476(2238):20200097, 2020.
  • [19] R. D. D. Esfahani, F. Cotton, M. Ohrnberger, and F. Scherbaum. TFCGAN: Nonstationary Ground-Motion Simulation in the Time–Frequency Domain Using Conditional Generative Adversarial Network (CGAN) and Phase Retrieval Methods. Bulletin of the Seismological Society of America, 113(1):453–467, 2022.
  • [20] M. A. Florez, M. Caporale, P. Buabthong, Z. E. Ross, D. Asimaki, and M.-A. Meier. Data-Driven Synthesis of Broadband Earthquake Ground Motions Using Artificial Intelligence. Bulletin of the Seismological Society of America, 112(4):1979–1996, 2022.
  • [21] T. Furumura, T. Maeda, and A. Oba. Early Forecast of Long‐Period Ground Motions via Data Assimilation of Observed Ground Motions and Wave Propagation Simulations. Geophysical Research Letters, 46(1):138–147, 2019.
  • [22] T. Furumura and Y. Oishi. An Early Forecast of Long‐Period Ground Motions of Large Earthquakes Based on Deep Learning. Geophysical Research Letters, 50(6), 2023.
  • [23] G. Bertasius, H. Wang, and L. Torresani. Is Space-Time Attention All You Need for Video Understanding? In International Conference on Machine Learning, 2021.
  • [24] R. Graves and A. Pitarka. Kinematic Ground‐Motion Simulations on Rough Faults Including Effects of 3D Stochastic Velocity Perturbations. Bulletin of the Seismological Society of America, 106(5):2136–2153, 2016.
  • [25] R. W. Graves and A. Pitarka. Broadband Ground-Motion Simulation Using a Hybrid Approach. Bulletin of the Seismological Society of America, 100(5A):2095–2123, 2010.
  • [26] N. Gregor, N. A. Abrahamson, G. M. Atkinson, D. M. Boore, Y. Bozorgnia, K. W. Campbell, B. S.-J. Chiou, I. Idriss, R. Kamai, E. Seyhan, W. Silva, J. P. Stewart, and R. Youngs. Comparison of NGA-West2 GMPEs. Earthquake Spectra, 30(3):1179–1197, 2014.
  • [27] A. Gu, K. Goel, and C. Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
  • [28] S. Hartzell, S. Harmsen, A. Frankel, and S. Larsen. Calculation of Broadband Time Histories of Ground Motion: Comparison of Methods and Validation Using Strong-Ground Motion From the 1994 Northridge Earthquake. Bulletin of the Seismological Society of America, 89(6):1484–1504, 1999.
  • [29] E. Hirakawa and B. Aagaard. Evaluation and Updates for the USGS San Francisco Bay Region 3D Seismic Velocity Model in the East and North Bay Portions. Bulletin of the Seismological Society of America, 2022.
  • [30] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
  • [31] M. Hoshiba and S. Aoki. Numerical Shake Prediction for Earthquake Early Warning: Data Assimilation, Real‐Time Shake Mapping, and Simulation of Wave Propagation. Bulletin of the Seismological Society of America, 105(3):1324–1338, 2015.
  • [32] E. Kalnay. Atmospheric Modeling, Data Assimilation, and Predictability. Cambridge University Press, 2003.
  • [33] M. D. Kohler, E. S. Cochran, D. Given, S. Guiwits, D. Neuhauser, I. Henson, R. Hartog, P. Bodin, V. Kress, S. Thompson, et al. Earthquake Early Warning ShakeAlert System: West Coast Wide Production Prototype. Seismological Research Letters, 89(1):99–107, 2018.
  • [34] M. D. Kohler, D. E. Smith, J. Andrews, A. I. Chung, R. Hartog, I. Henson, D. D. Given, R. de Groot, and S. Guiwits. Earthquake Early Warning ShakeAlert 2.0: Public Rollout. Seismological Research Letters, 91(3):1763–1775, 2020.
  • [35] M. Leonard. Earthquake Fault Scaling: Self-Consistent Relating of Rupture Length, Width, Average Displacement, and Moment Release. Bulletin of the Seismological Society of America, 100(5A):1971–1988, 2010.
  • [36] S. H. Lim, N. B. Erichson, L. Hodgkinson, and M. W. Mahoney. Noisy recurrent neural networks. Advances in Neural Information Processing Systems, 34:5124–5137, 2021.
  • [37] S. H. Lim, N. B. Erichson, F. Utrera, W. Xu, and M. W. Mahoney. Noisy feature mixup. In International Conference on Learning Representations, 2021.
  • [38] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
  • [39] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3202–3211, 2022.
  • [40] D. McCallen, A. Petersson, A. Rodgers, A. Pitarka, M. Miah, F. Petrone, B. Sjogreen, N. Abrahamson, and H. Tang. EQSIM—A Multidisciplinary Framework for Fault-to-Structure Earthquake Simulations on Exascale Computers Part I: Computational Models and Workflow. Earthquake Spectra, page 875529302097098, 2020.
  • [41] D. McCallen, F. Petrone, M. Miah, A. Pitarka, A. Rodgers, and N. Abrahamson. EQSIM—A Multidisciplinary Framework for Fault-to-Structure Earthquake Simulations on Exascale Computers, Part II: Regional Simulations of Building Response. Earthquake Spectra, page 875529302097098, 2020.
  • [42] M. Meier, T. Heaton, and J. Clinton. The Gutenberg Algorithm: Evolutionary Bayesian Magnitude Estimates for Earthquake Early Warning with a Filter Bank. Bulletin of the Seismological Society of America, 105(5):2774–2786, 2015.
  • [43] M.-A. Meier, J. P. Ampuero, and T. H. Heaton. The Hidden Simplicity of Subduction Megathrust Earthquakes. Science, 357(6357):1277–1281, 2017.
  • [44] NCEDC. Northern California Earthquake Data Center. UC Berkeley Seismological Laboratory.
  • [45] S. Noda and W. L. Ellsworth. Scaling Relation Between Earthquake Magnitude and the Departure Time from P Wave Similar Growth. Geophysical Research Letters, 43(17):9053–9060, 2016.
  • [46] A. Petersson, B. Sjogreen, and H. Tang. SW4: User’s Guide, version 3.0. Technical Report LLNL-SM-741439, Lawrence Livermore National Laboratory, 2023.
  • [47] N. A. Petersson and B. Sjögreen. Stable Grid Refinement and Singular Source Discretization for Seismic Wave Simulations. Comm. Comput. Phys., 8(5):1074–1110, 2010.
  • [48] A. Pitarka, R. Graves, K. Irikura, K. Miyakoshi, C. Wu, H. Kawase, A. Rodgers, and D. McCallen. Refinements to the Graves–Pitarka Kinematic Rupture Generator, Including a Dynamically Consistent Slip-Rate Function, Applied to the 2019 Mw 7.1 Ridgecrest Earthquake. Bulletin of the Seismological Society of America, 2021.
  • [49] A. J. Rodgers, A. Pitarka, N. A. Petersson, B. Sjögreen, and D. B. McCallen. Broadband (0–4 Hz) Ground Motions for a Magnitude 7.0 Hayward Fault Earthquake With Three‐Dimensional Structure and Topography. Geophysical Research Letters, 45(2):739–747, 2018.
  • [50] T. K. Rusch, S. Mishra, N. B. Erichson, and M. W. Mahoney. Long Expressive Memory for Sequence Modeling. In International Conference on Learning Representations, 2021.
  • [51] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
  • [52] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. Conference on Neural Information Processing Systems, 28, 2015.
  • [53] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised Learning of Video Representations Using LSTMs. In International Conference on Machine Learning, pages 843–852, 2015.
  • [54] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to Sequence Learning With Neural Networks. Conference on Neural Information Processing Systems, 27, 2014.
  • [55] D. T. Trugman, M. T. Page, S. E. Minson, and E. S. Cochran. Peak Ground Displacement Saturates Exactly When Expected: Implications for Earthquake Early Warning. Journal of Geophysical Research: Solid Earth, 124(5):4642–4653, 2019.
  • [56] T. Uchide and S. Ide. Scaling of Earthquake Rupture Growth in the Parkfield Area: Self‐Similar Growth and Suppression by the Finite Seismogenic Layer. Journal of Geophysical Research: Solid Earth, 115(B11), 2010.
  • [57] Y. Yang, A. F. Gao, K. Azizzadenesheli, R. W. Clayton, Z. E. Ross, and Y. Yang. Rapid Seismic Waveform Modeling and Inversion With Neural Operators. IEEE Transactions on Geoscience and Remote Sensing, 61:1–12, 2023.
  • [58] A. Yu, M. W. Mahoney, and N. B. Erichson. There is HOPE to avoid HiPPOs for long-memory state space models. arXiv preprint arXiv:2405.13975, 2024.
  • [59] A. Yu, A. Nigmetov, D. Morozov, M. W. Mahoney, and N. B. Erichson. Robustifying state-space models for long sequences via approximate diagonalization. In International Conference on Learning Representations, 2023.
  • [60] X. Zhang, M. Zhang, and X. Tian. Real‐Time Earthquake Early Warning With Deep Learning: Application to the 2016 M 6.0 Central Apennines, Italy Earthquake. Geophysical Research Letters, 48(5), 2021.
  • [61] W. Zhu and G. C. Beroza. PhaseNet: A Deep-Neural-Network-Based Seismic Arrival Time Picking Method. Geophysical Journal International, 2018.

Appendix A Notation

\begin{tabular}{ll}
Term & Definition \\
\hline
WaveCastNet & Wavefield forecasting network (WaveCastNet) based on a seq2seq model. \\
Seq2Seq & AI-enabled sequence to sequence (seq2seq) modelling framework. \\
ConvLEM & Convolutional long expressive memory (ConvLEM) recurrent unit. \\
time window & Sequence length in temporal dimension. \\
arrival time & Time step at which maximal waveform arrives. \\
particle velocity & Single pixel $\mathcal{X}_t[c,h,w]$ in the snapshot. \\
waveform & Time series recorded for a single particle velocity. \\
wavefield & Snapshot $\mathcal{X}_t$ at a certain time step. \\
$t$ & Temporal coordinate (time point). \\
$h, w$ & $XY$-index of each input snapshot. \\
$c$ & Channel index for velocity in a certain direction. \\
$\mathcal{X}_t$ & Snapshot of shape $C \times H \times W$ at time step $t$. \\
$\mathcal{C}_n$ & Fast hidden state in latent space. \\
$\mathcal{H}_n$ & Slow hidden state in latent space. \\
$X$ (NS) & North-South direction. \\
$Y$ (EW) & East-West direction. \\
$Z$ (UP) & Vertical direction, positive values signify upward movement. \\
\end{tabular}
Table A.1: Terms and Definitions

Appendix B Technical Details

B.1  Discretized ConvLEM

Here we derive the discretized formula of ConvLEM from the following time-dependent ODEs:

\begin{equation}
\begin{aligned}
\frac{d\mathcal{C}(t)}{dt} &= \psi_{\mathcal{C}}\left(\mathcal{C}(t), \mathcal{H}(t), \mathcal{X}(t)\right) = \mathbf{g}_c\left(\mathcal{H}(t), \mathcal{X}(t)\right) \odot \left[f^c_{\theta_c}(\mathcal{H}(t), \mathcal{X}(t)) - \mathcal{C}(t)\right], \\
\frac{d\mathcal{H}(t)}{dt} &= \psi_{\mathcal{H}}\left(\mathcal{C}(t), \mathcal{H}(t), \mathcal{X}(t)\right) = \mathbf{g}_h\left(\mathcal{H}(t), \mathcal{X}(t)\right) \odot \left[f^h_{\theta_h}(\mathcal{C}(t), \mathcal{X}(t)) - \mathcal{H}(t)\right].
\end{aligned}
\tag{21}
\end{equation}

The gating functions $\mathbf{g}_c$, $\mathbf{g}_h$ and the update functions $f^c_{\theta_c}$, $f^h_{\theta_h}$ are defined in terms of convolutional operations:

\begin{equation}
\begin{aligned}
\mathbf{g}_c\left(\mathcal{H}, \mathcal{X}\right) &= \sigma\left(\mathbf{W}_{xt} * \mathcal{X} + \mathbf{W}_{ht} * \mathcal{H}\right), \\
\mathbf{g}_h\left(\mathcal{H}, \mathcal{X}\right) &= \sigma\left(\mathbf{W}_{x\overline{t}} * \mathcal{X} + \mathbf{W}_{h\overline{t}} * \mathcal{H}\right), \\
f^c_{\theta_c}(\mathcal{H}, \mathcal{X}) &= \tanh\left(\mathbf{W}_{hc} * \mathcal{H} + \mathbf{W}_{xc} * \mathcal{X}\right), \\
f^h_{\theta_h}(\mathcal{C}, \mathcal{X}) &= \tanh\left(\mathbf{W}_{ch} * \mathcal{C} + \mathbf{W}_{xh} * \mathcal{X}\right).
\end{aligned}
\tag{22}
\end{equation}

In this notation, $\mathcal{H}(t)$ and $\mathcal{C}(t)$ denote the slow and fast evolving hidden states, respectively, in the latent space $\mathbb{R}^{r \times p \times q}$. The input $\mathcal{X}(t) \in \mathbb{R}^{c \times h \times w}$ is a three-dimensional tensor. $\mathbf{W}_{\cdot,\cdot}$ denotes the convolutional kernels, $\odot$ represents the Hadamard product, and $*$ indicates the convolution operator. For brevity, bias vectors are omitted in the gating and update functions defined in Eq. (22).

We use the implicit-explicit (IMEX) time-stepping scheme to discretize the ODEs in Eq. (21), with the subscript $n$ denoting the time-step index, ranging from $1$ to $N$. Given $\Delta t > 0$:

\begin{equation}
\begin{aligned}
\frac{\mathcal{C}_n - \mathcal{C}_{n-1}}{\Delta t} &= \psi_{\mathcal{C}}\left(\mathcal{C}_{n-1}, \mathcal{H}_{n-1}, \mathcal{X}_n\right) = \mathbf{g}_c\left(\mathcal{H}_{n-1}, \mathcal{X}_n\right) \odot \left[f^c_{\theta_c}(\mathcal{H}_{n-1}, \mathcal{X}_n) - \mathcal{C}_{n-1}\right], \\
\frac{\mathcal{H}_n - \mathcal{H}_{n-1}}{\Delta t} &= \psi_{\mathcal{H}}\left(\mathcal{C}_n, \mathcal{H}_{n-1}, \mathcal{X}_n\right) = \mathbf{g}_h\left(\mathcal{H}_{n-1}, \mathcal{X}_n\right) \odot \left[f^h_{\theta_h}(\mathcal{C}_n, \mathcal{X}_n) - \mathcal{H}_{n-1}\right].
\end{aligned}
\tag{23}
\end{equation}
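The order of updates in Eq. (23) can be illustrated with a toy, scalar version of one step. This is only a sketch under strong simplifying assumptions: the hidden states and the input are single numbers, every convolution is replaced by a scalar multiplication, biases are omitted as in Eq. (22), and the weight values and the name `lem_step` are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lem_step(c_prev, h_prev, x, w, dt=1.0):
    """One step of Eq. (23) with scalar states.

    Convolutions are replaced by scalar multiplications (a simplification).
    """
    # Both gates depend on H_{n-1} and X_n, as in Eq. (22).
    g_c = sigmoid(w["xt"] * x + w["ht"] * h_prev)
    g_h = sigmoid(w["xtbar"] * x + w["htbar"] * h_prev)
    # The fast state C_n is advanced first, from step n-1 quantities.
    f_c = math.tanh(w["hc"] * h_prev + w["xc"] * x)
    c_new = c_prev + dt * g_c * (f_c - c_prev)
    # The slow state H_n is then advanced using the updated C_n.
    f_h = math.tanh(w["ch"] * c_new + w["xh"] * x)
    h_new = h_prev + dt * g_h * (f_h - h_prev)
    return c_new, h_new

# Arbitrary illustrative weights, not trained values.
weights = {"xt": 0.5, "ht": -0.3, "xtbar": 0.2, "htbar": 0.4,
           "hc": 0.7, "xc": 1.0, "ch": 0.9, "xh": -0.6}

c, h = 0.0, 0.0
for x in [0.0, 1.0, 0.5, -0.5, 0.0]:
    c, h = lem_step(c, h, x, weights)
```

For $\Delta t \le 1$, each update is a convex combination of the previous state and a $\tanh$ output, so states that start in $(-1, 1)$ remain in that interval.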

For the discretized fast hidden state $\mathcal{C}_{n}$, we have:

\begin{align*}
\mathcal{C}_{n}-\mathcal{C}_{n-1} &= \Delta t\cdot\mathbf{g}_{c}\odot(f^{c}_{\theta_{c}}-\mathcal{C}_{n-1});\\
\mathcal{C}_{n} &= (\Delta t\cdot\mathbf{g}_{c})\odot f^{c}_{\theta_{c}}+\mathcal{C}_{n-1}-(\Delta t\cdot\mathbf{g}_{c})\odot\mathcal{C}_{n-1}\\
&= (\Delta t\cdot\mathbf{g}_{c})\odot f^{c}_{\theta_{c}}+\mathbbm{1}\odot\mathcal{C}_{n-1}-(\Delta t\cdot\mathbf{g}_{c})\odot\mathcal{C}_{n-1}\\
&= (\Delta t\cdot\mathbf{g}_{c})\odot f^{c}_{\theta_{c}}+(\mathbbm{1}-\Delta t\cdot\mathbf{g}_{c})\odot\mathcal{C}_{n-1},
\end{align*}

where $\mathbbm{1}$ is the matrix of ones that matches the shape of the hidden states $\mathcal{C}_{n}$ and $\mathcal{H}_{n}$.

Similarly, for $\mathcal{H}_{n}$ we have:

\[
\mathcal{H}_{n}=(\Delta t\cdot\mathbf{g}_{h})\odot f^{h}_{\theta_{h}}+(\mathbbm{1}-\Delta t\cdot\mathbf{g}_{h})\odot\mathcal{H}_{n-1}.
\]

Define $\bm{\Delta}\mathbf{t}_{n}=\Delta t\,\mathbf{g}_{c}$ and $\overline{\bm{\Delta}\mathbf{t}_{n}}=\Delta t\,\mathbf{g}_{h}$. Using Eq. (22) and the discretized updates above, we derive the discretized formula for ConvLEM:

\begin{align*}
\bm{\Delta}\mathbf{t}_{n} &= \Delta t\,\mathbf{g}_{c}(\mathcal{H}_{n-1},\mathcal{X}_{n}) \tag{24}\\
\overline{\bm{\Delta}\mathbf{t}_{n}} &= \Delta t\,\mathbf{g}_{h}(\mathcal{H}_{n-1},\mathcal{X}_{n})\\
\mathcal{C}_{n} &= \left(\mathbbm{1}-\bm{\Delta}\mathbf{t}_{n}\right)\odot\mathcal{C}_{n-1}+\bm{\Delta}\mathbf{t}_{n}\odot f^{c}_{\theta_{c}}(\mathcal{H}_{n-1},\mathcal{X}_{n})\\
\mathcal{H}_{n} &= \left(\mathbbm{1}-\overline{\bm{\Delta}\mathbf{t}_{n}}\right)\odot\mathcal{H}_{n-1}+\overline{\bm{\Delta}\mathbf{t}_{n}}\odot f^{h}_{\theta_{h}}(\mathcal{C}_{n},\mathcal{X}_{n}).\qed
\end{align*}
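As a rough illustration of the discretized update, the following sketch applies one step at a single scalar site. It is not the ConvLEM implementation: the learned convolutions are replaced by fixed pointwise weights, and the choices of a sigmoid for $\mathbf{g}_{c}$, $\mathbf{g}_{h}$ and a tanh for $f^{c}$, $f^{h}$ are assumptions made only for this example. The function name `lem_step` and all numeric weights are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lem_step(c_prev, h_prev, x, dt=1.0):
    """One discretized two-state update at a single scalar site.

    Assumptions (not from the paper): sigmoid gates, tanh candidate states,
    and fixed scalar weights standing in for learned convolutions.
    """
    dt_n = dt * sigmoid(0.5 * h_prev + 1.0 * x)       # Delta t_n = dt * g_c(H_{n-1}, X_n)
    dt_bar_n = dt * sigmoid(-0.3 * h_prev + 0.8 * x)  # bar Delta t_n = dt * g_h(H_{n-1}, X_n)
    f_c = math.tanh(0.7 * h_prev + 0.2 * x)           # f^c(H_{n-1}, X_n)
    c_n = (1.0 - dt_n) * c_prev + dt_n * f_c
    f_h = math.tanh(0.9 * c_n + 0.1 * x)              # f^h(C_n, X_n) uses the updated C_n
    h_n = (1.0 - dt_bar_n) * h_prev + dt_bar_n * f_h
    return c_n, h_n


# Roll the update over a short toy input sequence.
c, h = 0.0, 0.0
for x in [1.0, 0.5, -0.2, 0.0]:
    c, h = lem_step(c, h, x)
```

Under these assumptions and with $\Delta t = 1$, each new state is a convex combination of the previous state and a bounded candidate, so both states stay in $[-1, 1]$ when initialized there.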

B.2  Gating Function Distribution

Figure B.1 shows the distribution of $\mathbf{\Delta t}$ and $\overline{\mathbf{\Delta t}}$ for the encoder ConvLEM cells in WaveCastNet, evaluated on small point-source earthquakes. Here, we set the time step factor $\Delta t$ in Eq. (24) to 1, so $\mathbf{\Delta t}$ and $\overline{\mathbf{\Delta t}}$ are equal to the gating functions $\mathbf{g}_{c}$ and $\mathbf{g}_{h}$ for the fast and slow hidden states $\mathcal{C}(t)$ and $\mathcal{H}(t)$, respectively.

As shown in Figure B.1, the observed occurrences of $\mathbf{\Delta t}$ and $\overline{\mathbf{\Delta t}}$ at each scale decay as a power law with respect to scale amplitude [50].

Figure B.1: Histogram of $\mathbf{\Delta t}$ and $\overline{\mathbf{\Delta t}}$ for the encoder ConvLEM cells in WaveCastNet.

With all axes set to log scale, the histograms of $\mathbf{\Delta t}$ and $\overline{\mathbf{\Delta t}}$ show different linear slopes and amplitude ranges. $\overline{\mathbf{\Delta t}}$ exhibits a smaller linear slope and a longer trailing tail, with more of its distribution at amplitudes close to 0 than $\mathbf{\Delta t}$, enabling $\mathcal{H}(t)$ to better capture low-frequency features. In contrast, $\mathbf{\Delta t}$ is concentrated near 1, with a smaller amplitude range and a larger linear slope, reflecting the rapid change of the hidden state $\mathcal{C}(t)$. These observations indicate that the temporal multiscale structure of ConvLEM is important for modeling the fast-slow dynamical pattern in ground motion data.
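A linear trend on log-log axes corresponds to a power law, and its exponent can be estimated by a least-squares fit of log counts against log bin centers. The sketch below illustrates this on synthetic data; it is not the analysis used for Figure B.1, and the function name `loglog_slope` and the synthetic histogram are hypothetical.

```python
import math


def loglog_slope(centers, counts):
    """Least-squares slope of log(count) versus log(bin center).

    Bins with non-positive centers or counts are skipped, since their
    logarithm is undefined.
    """
    pts = [(math.log(x), math.log(c))
           for x, c in zip(centers, counts) if x > 0 and c > 0]
    n = len(pts)
    mean_x = sum(p[0] for p in pts) / n
    mean_y = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in pts)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in pts)
    return sxy / sxx


# Synthetic histogram following count = 1000 * x^(-1.5).
centers = [0.01 * 2 ** k for k in range(6)]
counts = [1000.0 * x ** -1.5 for x in centers]
slope = loglog_slope(centers, counts)  # recovers the exponent, about -1.5
```

Comparing the fitted slopes of two histograms gives a simple quantitative counterpart to the visual comparison of their linear trends.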

B.3  Structure of Embedding and Reconstruction Layers

Here we discuss the embedding and reconstruction layers.

The embedding layer for densely and regularly sampled inputs $x\in\mathbb{R}^{3\times 344\times 224}$ is composed of three cascaded encoder layers. A standard encoder layer comprises a convolutional layer with kernel size $(4,4)$, stride 2, and padding 1, followed by a LeakyReLU activation layer and a BatchNorm layer. Each encoder layer reduces each spatial dimension of its input by a factor of 2. After the input signal passes through the three encoder layers, its dimensions change from $3\times 344\times 224$ to a fixed-size latent space of $144\times 43\times 28$. The channel transformation process is illustrated in Figure B.2.
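The spatial dimensions can be checked with the standard convolution output-size formula, $\lfloor (n + 2p - k)/s \rfloor + 1$. The short sketch below applies it three times with the stated kernel size, stride, and padding; the helper names `conv_out` and `embed_shape` are hypothetical, and channel counts are not modeled.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def embed_shape(height, width, layers=3):
    """Spatial shape after `layers` stride-2 encoder layers."""
    for _ in range(layers):
        height, width = conv_out(height), conv_out(width)
    return height, width


print(embed_shape(344, 224))  # (43, 28)
```

Each layer halves both dimensions exactly (344 to 172 to 86 to 43, and 224 to 112 to 56 to 28), which matches the $43\times 28$ latent grid.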

The embedding layer for sparsely and irregularly sampled data maps the inputs $x\in\mathbb{R}^{3\times 564}$ to a latent space of dimension $144\times 43\times 28$. Specifically, this embedding layer uses a shallow multilayer feed-forward network [18], followed by two convolutional layers, as illustrated in Figure B.2.

The reconstruction layer recovers the predicted wavefield snapshot from the latent representation, shaped $144\times 43\times 28$. This process increases the spatial dimensions by a factor of 2 through a transposed convolution, followed by a PixelShuffle layer [51] that further upscales the output by a factor of 4. The dimensional transformation process is depicted in Figure B.2.
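The combined upscaling factor of $2\times 4 = 8$ maps the $43\times 28$ latent grid back to $344\times 224$. The sketch below traces the shapes under assumptions that are not stated in the text: the transposed convolution is taken to use kernel size 4, stride 2, and padding 1 (mirroring the encoder), and it is assumed to output 48 channels so that PixelShuffle with factor 4 yields 3 channels. The helper names are hypothetical.

```python
def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution (no output padding, no dilation)."""
    return (size - 1) * stride - 2 * padding + kernel


def pixel_shuffle_shape(channels, height, width, r):
    """Shape after PixelShuffle: (C * r^2, H, W) -> (C, H * r, W * r)."""
    assert channels % (r * r) == 0, "channels must be divisible by r^2"
    return channels // (r * r), height * r, width * r


# Assumed transposed-convolution settings and channel count (see lead-in).
h, w = conv_transpose_out(43), conv_transpose_out(28)  # (86, 56)
out_shape = pixel_shuffle_shape(48, h, w, 4)
print(out_shape)  # (3, 344, 224)
```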

Figure B.2: Detailed structure for the embedding layers and reconstruction layer in dense and sparse sampling scenarios.

B.4  Other Related Methods

We implemented ConvLSTM and ConvGRU with peephole connections as follows. For brevity, bias vectors are omitted from the activation and gating functions.

ConvLSTM

\begin{align*}
\mathbf{i}_{n} &= \sigma\left(\mathbf{W}_{xi}*\mathcal{X}_{n}+\mathbf{W}_{hi}*\mathcal{H}_{n-1}+\mathbf{W}_{ci}\odot\mathcal{C}_{n-1}\right),\\
\mathbf{f}_{n} &= \sigma\left(\mathbf{W}_{xf}*\mathcal{X}_{n}+\mathbf{W}_{hf}*\mathcal{H}_{n-1}+\mathbf{W}_{cf}\odot\mathcal{C}_{n-1}\right),\\
\mathcal{C}_{n} &= \mathbf{f}_{n}\odot\mathcal{C}_{n-1}+\mathbf{i}_{n}\odot f\left(\mathbf{W}_{xc}*\mathcal{X}_{n}+\mathbf{W}_{hc}*\mathcal{H}_{n-1}\right),\\
\mathbf{o}_{n} &= \sigma\left(\mathbf{W}_{xo}*\mathcal{X}_{n}+\mathbf{W}_{ho}*\mathcal{H}_{n-1}+\mathbf{W}_{co}\odot\mathcal{C}_{n}\right),\\
\mathcal{H}_{n} &= \mathbf{o}_{n}\odot f\left(\mathcal{C}_{n}\right).
\end{align*}
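To make the order of operations explicit, the following scalar sketch mirrors the ConvLSTM equations at a single site. It is an illustration rather than the implementation used in the experiments: convolutions are replaced by scalar multiplications, the activation $f$ is assumed to be tanh, and the weight dictionary `w` and function name `peephole_lstm_step` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def peephole_lstm_step(x, h_prev, c_prev, w):
    """One peephole-LSTM step at a single scalar site (biases omitted).

    `w` maps names such as "xi" or "ci" to scalar weights that stand in
    for convolution kernels and peephole weights (illustrative only).
    """
    # Input and forget gates peek at the previous cell state C_{n-1}.
    i = sigmoid(w["xi"] * x + w["hi"] * h_prev + w["ci"] * c_prev)
    f = sigmoid(w["xf"] * x + w["hf"] * h_prev + w["cf"] * c_prev)
    c = f * c_prev + i * math.tanh(w["xc"] * x + w["hc"] * h_prev)
    # The output gate peeks at the updated cell state C_n.
    o = sigmoid(w["xo"] * x + w["ho"] * h_prev + w["co"] * c)
    h = o * math.tanh(c)
    return h, c


names = ["xi", "hi", "ci", "xf", "hf", "cf", "xc", "hc", "xo", "ho", "co"]
zero_w = {name: 0.0 for name in names}
h, c = peephole_lstm_step(1.0, 0.3, 0.8, zero_w)  # all gates equal 0.5
```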

ConvGRU

\begin{align*}
\mathbf{Z}_{n} &= \sigma\left(\mathbf{W}_{xz}*\mathcal{X}_{n}+\mathbf{W}_{hz}*\mathcal{H}_{n-1}\right),\\
\mathbf{R}_{n} &= \sigma\left(\mathbf{W}_{xr}*\mathcal{X}_{n}+\mathbf{W}_{hr}*\mathcal{H}_{n-1}\right),\\
\mathbf{o}_{n} &= f\left(\mathbf{W}_{xo}*\mathcal{X}_{n}+\mathbf{R}_{n}\odot(\mathbf{W}_{ho}*\mathcal{H}_{n-1})\right),\\
\mathcal{H}_{n} &= \left(1-\mathbf{Z}_{n}\right)\odot\mathcal{H}_{n-1}+\mathbf{Z}_{n}\odot\mathbf{o}_{n}.
\end{align*}
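The ConvGRU update can be sketched in the same scalar form. As before, this is only an illustration under stated assumptions: scalar weights replace convolution kernels, $f$ is assumed to be tanh, and the names `gru_step` and `w` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, w):
    """One GRU step at a single scalar site (biases omitted)."""
    z = sigmoid(w["xz"] * x + w["hz"] * h_prev)            # update gate Z_n
    r = sigmoid(w["xr"] * x + w["hr"] * h_prev)            # reset gate R_n
    o = math.tanh(w["xo"] * x + r * (w["ho"] * h_prev))    # candidate state o_n
    return (1.0 - z) * h_prev + z * o


zero_w = {name: 0.0 for name in ["xz", "hz", "xr", "hr", "xo", "ho"]}
h = gru_step(1.0, 0.6, zero_w)  # z = 0.5 and o = 0, so h = 0.3
```

Unlike the two-state ConvLEM and ConvLSTM updates, this cell carries a single hidden state, and a single gate $\mathbf{Z}_{n}$ sets the mixing between the old state and the candidate.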

Appendix C Additional Results

C.1  Moving MNIST

Here, we show experiments on the MovingMNIST dataset to further demonstrate ConvLEM’s performance in spatio-temporal forecasting. The MovingMNIST dataset [53] is a well-established benchmark for video prediction and spatio-temporal modeling tasks. This dataset consists of a total of 10,000 videos, each comprising 20 fixed-size frames with dimensions of $1\times 64\times 64$ pixels. Each video sequence features two handwritten digits selected from the original MNIST dataset, which move within the frame at various speeds and directions. The digits exhibit diverse velocities and trajectories, including linear motion, bouncing off the frame edges, and occasional overlap, presenting a complex and challenging scenario for spatio-temporal forecasting models. The diversity in motion patterns makes the MovingMNIST dataset an ideal benchmark for evaluating the ability of models to capture and predict dynamic changes over time.

We use the first 10 frames as input to predict the subsequent 10 frames. The models consist of 3 stacked recurrent layers, with no embedding or reconstruction layers. Table C.1 shows the results for this task. With fewer parameters (2.31M versus 3.10M), ConvLEM achieves a lower BCE loss than ConvLSTM (166.75 versus 206.13). This further demonstrates ConvLEM’s potential for spatio-temporal forecasting tasks.
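The evaluation protocol can be sketched as follows: split each 20-frame video into 10 input frames and 10 target frames, then score predictions with binary cross-entropy. The reduction behind the values in Table C.1 is not stated, so summation over the pixels of a frame is assumed here purely for illustration. The toy frames have 4 pixels instead of $64\times 64$, and the helper names are hypothetical.

```python
import math


def split_sequence(frames, n_input=10):
    """Split a video into input frames and target frames."""
    return frames[:n_input], frames[n_input:]


def bce_sum(pred, target, eps=1e-7):
    """Binary cross-entropy summed over the pixels of one flattened frame."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total


# Toy video: 20 frames of 4 binary pixels each.
video = [[1.0, 0.0, 1.0, 0.0] for _ in range(20)]
inputs, targets = split_sequence(video)
loss = bce_sum([0.5, 0.5, 0.5, 0.5], targets[0])  # equals 4 * log(2)
```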

Figure C.1: An example from the MovingMNIST dataset.

| Model | Parameters | Latent Space | Layers | BCELoss $\downarrow$ |
|---|---|---|---|---|
| Stacked ConvLSTM | 3.10M | (64, 64, 64) | 3 | 206.13 |
| Stacked ConvLEM | 2.31M | (64, 64, 64) | 3 | 166.75 |

Table C.1: Results for MovingMNIST. ConvLEM demonstrates improved forecasting capabilities while requiring fewer parameters than ConvLSTM.

Appendix D Supplementary figures

  • Figure D.1: Kinematic rupture model of the M4.5 earthquake

  • Figure D.2: Kinematic rupture model of the M5 earthquake

  • Figure D.3: Kinematic rupture model of the M5.5 earthquake

  • Figure D.4: Kinematic rupture model of the M6 earthquake

  • Figure D.5: Kinematic rupture model of the M6.5 earthquake

  • Figure D.6: Kinematic rupture model of the M7 earthquake

Figure D.1: Kinematic rupture models for the M4.5 earthquake. (Top) slip, (middle) rise time, and (bottom) 5 Hz slip rate distributions.
Figure D.2: Kinematic rupture models for the M5.0 earthquake. (Top) slip, (middle) rise time, and (bottom) 5 Hz slip rate distributions.
Figure D.3: Kinematic rupture models for the M5.5 earthquake. (Top) slip, (middle) rise time, and (bottom) 5 Hz slip rate distributions.
Figure D.4: Kinematic rupture models for the M6 earthquake. (Top) slip, (middle) rise time, and (bottom) 5 Hz slip rate distributions.
Figure D.5: Kinematic rupture models for the M6.5 earthquake. (Top) slip, (middle) rise time, and (bottom) 5 Hz slip rate distributions.
Figure D.6: Kinematic rupture models for the M7 earthquake. (Top) slip, (middle) rise time, and (bottom) 5 Hz slip rate distributions.