Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy

Elena Tomasi, Gabriele Franch, Marco Cristoforetti
Fondazione Bruno Kessler
Trento, Italy
{Elena Tomasi} [email protected]

Abstract

Downscaling techniques are one of the most prominent applications of Deep Learning (DL) in Earth System Modeling. A robust DL downscaling model can generate high-resolution fields from coarse-scale numerical model simulations, avoiding the time- and resource-intensive application of regional/local models. Additionally, generative DL models have the potential to provide uncertainty information by generating ensemble-like scenario pools, a task that is computationally prohibitive for traditional numerical simulations. In this study, we apply a Latent Diffusion Model (LDM) to downscale ERA5 data [1] over Italy to a resolution of 2 km. The high-resolution target data consist of results from a dynamical downscaling performed with COSMO-CLM [2, 3]. Our goal is to demonstrate that recent advancements in generative modeling enable DL-based models to deliver results comparable to those of numerical dynamical downscaling models, given the same input data (i.e., ERA5 data), preserving the realism of fine-scale features and flow characteristics. The training and testing database consists of hourly data from 2000 to 2020 ($\sim$ 184000 timestamps). The target variables of this study are 2-m temperature and 10-m horizontal wind components. A selection of predictors from ERA5 is used as input to the LDM (e.g., 850 hPa temperature, specific humidity, and wind), and a residual approach against a reference UNET is leveraged in applying the LDM. The performance of the generative LDM is compared with reference baselines of increasing complexity: quadratic interpolation of ERA5, a UNET, and a Generative Adversarial Network (GAN) built on the same reference UNET. Results highlight the improvements introduced by the LDM architecture and the residual approach over these baselines. The models are evaluated on a yearly test dataset, assessing their performance through deterministic metrics (e.g. RMSE, BIAS, R2, …), spatial distribution of errors, and reconstruction of frequency and power spectra distributions.

Keywords Downscaling $\cdot$ Deep learning $\cdot$ Dynamical reanalyses $\cdot$ Generative models

1 Introduction

High-resolution near-surface meteorological fields such as 2-m temperature and 10-m wind speed are key targets for the weather and climate scientific and operational communities. Such high-resolution information is of essential importance for a wide variety of applications (e.g. available wind potential, predicted energy consumption, water availability, …), across diverse timescales, from weather forecasting (nowcasting, medium-range forecasting, and seasonal predictions) to climate projections. The hunger for high-resolution data is directly linked to and justified by the information that such data hold: extreme weather events and localized phenomena can typically be described by highly resolved fields only.

Downscaling is a well-known approach that allows obtaining local high-resolution data (predictands) starting from low-resolution information (predictors) by applying suitable refinement techniques. The two most traditional approaches are dynamical downscaling and statistical downscaling ([4], [5], [6]), applied alternatively depending on the final goal of each specific application.

Traditionally, high-resolution fields are achieved in operational weather forecasting by performing dynamical downscaling of lower-resolution simulations. Examples of this approach are all the Local Area Models (LAMs) run in every operational center worldwide, fed with global circulation models at a low resolution, and producing high-resolution fields for a localized area, typically nationwide (e.g. [7], [8], [9]). As for the climate community, this approach materializes, for example, in applications run within the WCRP Coordinated Regional Downscaling Experiment (CORDEX, [10]), performing dynamical downscaling of climate projections with Regional Climate Models (RCM) going from the $\sim$ 100-km resolution of Global Climate Models down to a $\sim$ 16-km resolution (e.g. [11]). The dynamical downscaling approach is well-established and provides physically and temporally consistent fields. However, it still has significant drawbacks due to the high resource demands required for its execution that limit its application, e.g. to deterministic runs instead of (or limited to small) ensemble runs.

On the other hand, statistical downscaling uses coarse data from numerical simulations to infer data at high resolution by applying empirical relationships or transfer functions derived from a set of known predictors–predictands data pairs ([6]). Statistical downscaling methods have evolved through the years since the 1990s with increasingly greater levels of complexity and amount of data. Canonical examples of statistical downscaling methods are linear or multilinear regression methods (e.g. [12], [13]), analog ensemble downscaling (e.g. [14]) or quantile mapping ([15]).

In recent years, the advent of machine-learning techniques introduced many other powerful methods. These approaches have the potential to outperform classical statistical models, introducing nonlinear components in the downscaling process and learning from the provided high-resolution data. Specifically, Convolutional Neural Networks (CNNs) are particularly well suited for handling spatially distributed data and for the super-resolution task, being able to capture complex, nonlinear mappings identifying crucial features, and have already been successfully applied to weather downscaling (e.g. [16, 17, 18]). Building on CNN frameworks, two Deep Learning approaches are currently the most promising for improving atmospheric downscaling: Generative Adversarial Networks (GANs, [19, 20]) and Diffusion Models ([21]), which both allow for a probabilistic approach to the problem. The potential and drawbacks of these approaches are widely reported in the following Section 2.

In this work, we focus on and test a Latent Diffusion model, which is novel in its application to the atmospheric downscaling task. Its advantages should be twofold. First, the diffusion approach trains much more stably than GAN models while retaining the ability to generate small-scale features and the potential for ensemble production, and it shows superior results to GANs in image-processing applications ([22, 23]). Second, the latent approach should be seen as an improvement over pixel-space diffusion approaches, as it lowers the computational cost of both inference and training ([24]). This feature makes it intrinsically appealing for scaling the downscaling task to wider (and longer) domains. Lastly, we use the high-resolution output from a numerical model dynamical downscaling simulation as our target-reference-truth. In doing so, we aim to determine whether a properly trained deep learning model can effectively emulate dynamical downscaling. If successful, this model could serve as a versatile dynamical downscaling emulator, much faster than traditional numerical dynamical models, with a broad range of highly significant applications.

2 Related work and Contribution

Building on CNN frameworks, two Deep Learning (DL) approaches are currently the most promising for improving atmospheric downscaling: Generative Adversarial Networks (GANs) and Diffusion Models, which both allow for the generation of small-scale features and a probabilistic approach to the problem. GANs have already shown promising results in downscaling different meteorological variables, in different regions [25, 26, 27] but pose relevant challenges (e.g., instabilities, mode collapses) during the training procedure ([28, 29]).

On the other hand, Diffusion models introduce a relatively younger approach and have already been proven very effective in weather forecasting and nowcasting applications (e.g. [30, 31]). Diffusion models have yet to be widely tested and evaluated on the atmospheric downscaling task, but their characteristics and capabilities are undoubtedly promising for this application as shown for example in [32, 33].

Building on these encouraging results, in this work, we approach the downscaling task with a Latent diffusion model, comparing it against some standard baselines and a GAN baseline. Specifically, we re-adapted the Latent Diffusion Cast (LDCast) model [30], recently developed for precipitation nowcasting. LDCast has shown superior performance in the generation of highly realistic precipitation forecast ensembles, and in the representation of uncertainty, compared to traditional Generative Adversarial Network (GAN)-based methods. Our resulting model for downscaling, similar to fully convolutional models, can be trained on examples of smaller spatial domains (patches) and used at the evaluation stage on domains of arbitrary sizes, making it suitable for the generation of high-resolution data covering wider domains. As also suggested in [33], we propose the application of the diffusion model with a residual approach, relying on a standard UNET architecture for capturing the bigger scales and training the latent diffusion model to generate the residual, small scales only.

Our work differs from most of the above-mentioned works also in terms of the chosen pair of low-resolution and high-resolution data for the training. This choice highly influences the level of complexity that the DL downscaling model must achieve. Indeed, downscaling a coarsening of the high-resolution data (e.g. [27, 25, 34]) is a much easier task than downscaling modeled low-resolution data (e.g. short-term forecasts as in [26], seasonal predictions, or climate projections) to independent high-resolution data, either coming from observations or numerical model simulation. While the first exercise falls into a purely super-resolution task, the latter includes learning model biases and correcting them, including the identification of the inability of large-scale numerical models to detect local phenomena that cannot be resolved at their coarse resolution. In our work, we focus on reanalyses products and we train our models using a set of 14 ERA5 variables as low-resolution input and high-resolution data from a dynamical downscaling of ERA5 (run with the COSMO-CLM model) as target data (2-m temperature and 10-m wind speed components). This approach is similar to that followed, for example, by [35, 33]. In doing so, we intentionally force the model to learn to generate the effects of those local phenomena resolved by the dynamical numerical model, emulating its behavior.

3 Datasets

3.1 Low- and High-resolution data

The goal of this experiment is to train DL models to mimic a dynamical downscaling performed with a convection-permitting Regional Climate Model (RCM). The target high-resolution data consist of the hourly Italian Very High-Resolution Reanalyses produced with COSMO5.0_CLM9 (VHR-REA CCLM) by dynamically downscaling ERA5 reanalyses [1] from their native resolution ($\sim$ 25 km) to 2.2 km over Italy ([2, 3]). Consistently with these target numerical simulations, the input low-resolution data fed to our DL models are ERA5 data.

3.2 Data alignment and preprocessing

ERA5 data have a resolution of 0.25° worldwide, which roughly corresponds to 22 km at the latitudes of the focus domain, while VHR-REA CCLM data have a native resolution of 0.02° (2.2 km). Data from both datasets were preprocessed to re-project, trim, and align the low- and high-resolution fields. Specifically, the coordinate reference system (CRS) chosen for the experiment is ETRS89-LAEA Europe (Lambert Azimuthal Equal Area), identified in the EPSG Geodetic Parameter Dataset as EPSG:3035 ([36]), and the experiment grids align with the European Environment Agency reference grid (EEA Reference grid, [37]). ERA5 was reprojected and interpolated (with nearest-neighbor interpolation) on the EEA 16-km reference grid, while VHR-REA CCLM was reprojected and interpolated on the EEA 2-km reference grid. The downscaling factor is therefore 8x: over the target domain, low-resolution data consist of images of 72x86 16-km pixels, while high-resolution data consist of images of 576x672 2-km pixels.

3.3 Experimental domain

The experiment target domain spans from 36°N to 48°N, and from 5°E to 20°E (Figure 1). This area corresponds to the region where VHR-REA CCLM data are available. The region includes a wide variety of topographically different sub-areas (mountainous areas such as the Alps and the Appennini, flat areas such as the Po Valley and coastal lines) which trigger local phenomena whose effects are challenging to identify for the DL downscaling models, as they are not present in the low-resolution data.

Figure 1: Experimental domain with 2-km digital elevation model.

3.4 Target variables and predictors

The target variables of the study are (i) 2-m temperature and (ii) horizontal wind components at 10 m; a separate, dedicated model has been trained for each target variable. The input ERA5 low-resolution data comprise the target variables themselves plus additional fields used as dynamical predictors to improve the models’ performance. The choice of the set of input fields was based on previous literature (e.g. [18], [17], [26]) and on hardware constraints for the experiment. The same predictor fields are used for both target variables; they are listed below and correspond to a total of 14 input channels to our networks:

• 2-m temperature
• 10-m zonal and meridional wind speed
• pressure mean sea level
• sea surface temperature
• snow depth
• dew-point 2-m temperature
• incoming surface solar radiation
• temperature at 850 hPa
• zonal, meridional and vertical wind speed at 850 hPa
• specific humidity at 850 hPa
• total precipitation

In addition, high-resolution static data have been fed to the models to guide the training and improve performance. These fields include:

• Digital Elevation Model (DEM)
• land cover categories
• latitude

DEM data consist of the Copernicus Digital Elevation Model (DEM, [38]) interpolated from a resolution of 90 m to a resolution of 2 km. Land cover data were retrieved from the Copernicus Land Service, Global Land Cover data (GLC, [39]) interpolated from a resolution of 100 m to a resolution of 2 km. Given that land cover was used as a static variable in our analysis, we selected data from 2015: this year is the earliest epoch available for the selected GLC dataset and falls within our experimental period of 2000-2020. Land cover class data have been converted to single-channel class masks for the DL models, one per class (16 channels in total). All static fields have been re-projected and aligned to the high-resolution 2-km EEA reference grid.
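The conversion of a categorical land cover map into per-class masks can be sketched as follows. This is a minimal illustration, assuming the masks are binary (one-hot) channels; the function name and the three class codes in the toy example are hypothetical and do not reproduce the actual GLC legend or the authors' preprocessing code.

```python
import numpy as np

def land_cover_to_masks(class_map, class_ids):
    """Turn a 2-D map of class codes into one binary mask per class."""
    class_map = np.asarray(class_map)
    # Comparing against each class id yields a (n_classes, H, W) boolean stack.
    masks = np.stack([class_map == cid for cid in class_ids], axis=0)
    return masks.astype(np.float32)

# Toy 2x3 map with three hypothetical class codes (not the real GLC legend).
toy_map = np.array([[10, 20, 20],
                    [30, 10, 30]])
toy_masks = land_cover_to_masks(toy_map, class_ids=[10, 20, 30])
```

With 16 land cover classes, the same call would return a stack of 16 channels on the 2-km grid, each pixel belonging to exactly one of them.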

3.5 Dataset splitting strategy

The experiment database comprises hourly data from 2000 to 2020 ($\sim$ 184000 timestamps), for both low and high-resolution data. 70% of data were used for training, 15% for validation and 5% for testing (corresponding to 15, 3 and 1 year, respectively). The testing dataset was fixed to 1 year (5% of the dataset) because of time constraints in running all the models for the evaluation, especially the diffusion model.

4 Methods

In this work, we test a deep generative Latent Diffusion Model (LDM) for the downscaling task, conditioned with low-resolution predictors and high-resolution static data. The implemented LDM is trained to predict the residual error between a previously trained reference UNET and the target variables; the model is therefore referred to as LDM_res (LDM_residual) hereafter. This residual approach has shown great performance in the application of pixel-space diffusion models [33] and is tested here in the latent diffusion context. The underlying idea is to exploit the ability of a relatively simple network (a UNET) to capture the main, larger-scale variability of the atmospheric high-resolution data, and to leverage the power of the generative diffusion model to focus only on reconstructing the smaller-scale, locally driven variability of the fields. Figure 2 shows the high-level flow chart of the training and inference setup for LDM_res, while Section 4.3 gives the detailed description of the LDM_res architecture.
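The residual idea can be illustrated with a short sketch. It assumes the residual is defined as the reference truth minus the UNET prediction and that the generated residual is simply added back at inference; the function names are hypothetical, and a constant field stands in for the output of a trained UNET.

```python
import numpy as np

def residual_target(high_res_truth, unet_prediction):
    """Training target for the generative model: what the UNET missed."""
    return high_res_truth - unet_prediction

def recombine(unet_prediction, generated_residual):
    """Inference: add the generated small-scale residual to the UNET field."""
    return unet_prediction + generated_residual

rng = np.random.default_rng(0)
truth = rng.normal(size=(1, 8, 8))              # stand-in for a high-resolution field
unet_out = np.full_like(truth, truth.mean())    # stand-in for a smooth UNET output
residual = residual_target(truth, unet_out)

# Pretend the generative model reproduced the residual perfectly.
downscaled = recombine(unet_out, residual)
```

In practice the generated residual is only one plausible sample, so the recombined field matches the truth statistically rather than pixel by pixel.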

LDM_res is compared against three different baselines with increasing levels of complexity: the quadratic interpolation of ERA5, a UNET, and a Generative Adversarial Network (GAN). The implemented UNET is the core base for each tested deep-learning architecture. Indeed, the same reference UNET network is used (i) as a baseline, (ii) as the Generator of the implemented GAN, and (iii) for the calculation of the residual on which LDM_res is trained. With this approach, we aim to fairly compare the power of generating small-scale features of the adversarial and the diffusion methods.

A dedicated network has been trained for each model type, for the two target variables: the 2-m temperature and 10-m horizontal wind components. The downscaling is performed for fixed time steps, with an image-to-image approach.

Given the incremental complexity of the tested models, in the following sections, we start by describing the core reference UNET architecture (Section 4.1), then the GAN architecture (Section 4.2), and finally LDM_res architecture (Section 4.3). The number of trainable parameters of the models is summarized in Table 1.

Table 1: Number of trainable parameters for each tested model.

Model	Sub-net	# of trainable parameters	Total # of trainable parameters
UNET	-	$\sim$ 31M	$\sim$ 31M
GAN	UNET-Generator	$\sim$ 31M	$\sim$ 34M
GAN	Discriminator	$\sim$ 3M	$\sim$ 34M
LDM_res	VAE	$\sim$ 115K (2mT), $\sim$ 430K (UV)	$\sim$ 300M
LDM_res	Conditioner	$\sim$ 21M	$\sim$ 300M
LDM_res	Denoiser	$\sim$ 275M	$\sim$ 300M

4.1 UNET Architecture

The core UNET network implemented for our experiments is a standard UNET architecture ([40]), featuring an encoder (contracting path), a bottleneck, and a decoder (expansive path), with skip connections bridging corresponding levels between the encoder and decoder to preserve spatial information. To use a standard UNET for downscaling, the input low-resolution data are upsampled with nearest-neighbor interpolation to the target high resolution before being fed to the network. The encoder is composed of four blocks, each consisting of two consecutive 2D convolutions with ReLU activation, interspersed with batch normalization to ensure stable learning ([41]), followed by a max-pooling operation. The max-pooling layer halves the spatial resolution, enabling the model to capture increasingly complex features while reducing the image dimensions. The output of each encoder block is passed both to the next encoder block and, through skip connections, to the corresponding decoder block. The decoder mirrors the encoder with transposed 2D convolutional layers and upsampling steps to return to the starting resolution. Details on the UNET structure and resolutions are depicted in Figure 3.
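The nearest-neighbor upsampling applied before the UNET can be sketched in a few lines. This is an illustrative NumPy version assuming an integer factor of 8 between the 16-km and 2-km grids; the function name is hypothetical and the toy array is far smaller than the real domain.

```python
import numpy as np

def upsample_nearest(field, factor=8):
    """Repeat every pixel `factor` times along the last two (spatial) axes."""
    field = np.asarray(field)
    return field.repeat(factor, axis=-2).repeat(factor, axis=-1)

# Toy low-resolution input with layout (channels, height, width).
low_res = np.arange(6, dtype=np.float32).reshape(1, 2, 3)
high_res_input = upsample_nearest(low_res, factor=8)
```

Each coarse pixel becomes an 8x8 block of identical values, so the network receives an input at the target resolution that carries no information beyond the coarse field.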

4.2 GAN Architecture

The Generative Adversarial Network (GAN, [19, 20]) tested in this experiment consists of deep, fully convolutional Generator and Discriminator networks, conditioned with low-resolution predictors and high-resolution static data. The generator is trained to output fields that cannot be distinguished from ground truth images by a discriminator, which is trained on the other hand to detect the generator’s ’fake’ outputs. Our reference GAN consists of a UNET generator upgraded with a Patch-GAN discriminator ([42]). The input data to the generator are low-resolution predictors and high-resolution static data only (no noise addition is performed) and we, therefore, obtain a deterministic GAN.

The generator architecture consists exactly of the UNET described in Section 4.1, as shown in Figure 3.

The discriminator is a PatchGAN convolutional classifier ([42]), which focuses on structures at the scale of image patches. It is composed of modules of the form convolution-BatchNorm-ReLU. Run convolutionally across the image, it assigns a "realness" score to each N × N patch, and an overall score is obtained by averaging the responses over all patches. High-quality results can be obtained with patches much smaller than the full size of the image, with relevant advantages in terms of training resources and applicability to arbitrarily large images. Details on the discriminator network structure and resolutions are depicted in Figure 3.

The training procedure follows the combined loss function approach for GANs [20], including recent improvements to promote training stability ([43]), with the primary goal of balancing the minimization of the generator’s and discriminator’s losses, which are adversarial. The pixel loss is the mean absolute error, while the discriminator loss is the hinge loss. The discriminator is activated after 50000 training steps, giving the generator time to learn to generate consistent outputs and thus stabilizing the adversarial training ([43]). After the discriminator is activated, the network is trained by alternately updating the generator and the discriminator. The network is always fed with the whole target domain, and no patch training is applied.
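The loss terms described above can be sketched as follows. This uses the standard hinge formulation for the discriminator and a mean-absolute-error pixel loss for the generator; the relative weight `adv_weight`, the function names, and the exact way the two generator terms are combined are assumptions made for illustration, not the authors' configuration.

```python
import numpy as np

def discriminator_hinge_loss(real_scores, fake_scores):
    """Standard hinge loss on patch scores for real and generated fields."""
    real_term = np.maximum(0.0, 1.0 - real_scores).mean()
    fake_term = np.maximum(0.0, 1.0 + fake_scores).mean()
    return real_term + fake_term

def adversarial_weight(step, start_step=50000, weight=1.0):
    """Keep the adversarial term switched off during the warm-up steps."""
    return weight if step >= start_step else 0.0

def generator_loss(prediction, target, fake_scores, adv_weight):
    """MAE pixel loss plus a weighted adversarial term (assumed combination)."""
    pixel_loss = np.abs(prediction - target).mean()
    adversarial = -fake_scores.mean()
    return pixel_loss + adv_weight * adversarial

real_scores = np.array([2.0, 0.5])
fake_scores = np.array([-2.0, 0.0])
prediction = np.array([1.0, 2.0])
target = np.array([1.5, 2.0])

d_loss = discriminator_hinge_loss(real_scores, fake_scores)
g_loss_warmup = generator_loss(prediction, target, fake_scores, adversarial_weight(0))
g_loss_adv = generator_loss(prediction, target, fake_scores, adversarial_weight(50000))
```

During the warm-up the generator is driven by the pixel loss alone, which is what lets it produce consistent fields before the adversarial signal is introduced.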

4.3 Latent Diffusion Model Architecture

Diffusion Models ([21]) are probabilistic models that learn a data distribution p(x) by corrupting the training data through the successive addition of Gaussian noise (a fixed forward process) and then learning to recover the data by reversing this noising process (the generative process).
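The fixed noising process can be written in closed form, which the sketch below illustrates. It assumes a linear variance schedule with 1000 steps and the commonly used endpoints 1e-4 and 2e-2; these values, like the function names, are illustrative assumptions and are not taken from the configuration used in this work.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear variance schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0): scale the clean data and add Gaussian noise."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()
x0 = rng.standard_normal((1, 16, 16))       # stand-in for a clean training sample
x_t, noise = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

As the step index grows, the signal coefficient shrinks toward zero and the sample approaches pure Gaussian noise, which is the starting point of the learned reverse process.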

The Latent Diffusion Model (LDM) applied for this experiment is an architecture derived from Stable Diffusion ([24]), specifically a re-adaptation of the conditional LDM LDCast ([30]), developed for precipitation nowcasting and already successfully applied to other variables (e.g. [44]). The latent diffusion model derived for the downscaling task is composed of three main elements: (i) a convolutional Variational AutoEncoder (VAE), used to project the residual high-resolution target variables to and from a latent space; (ii) a conditioner, used to embed low-resolution data and high-resolution static data; and (iii) a denoiser, used to manage the diffusion process within the latent space. In the following sections, we report a detailed description of each model component. The number of trainable parameters for each component is summarized in Table 1, while Figures 4 and 5 summarize the main structure of each component architecture and the training and inference procedures for the model, respectively.

4.3.1 Variational Autoencoder

The Variational Autoencoder (VAE) projects the residual high-resolution data from the pixel space to a continuous latent space (encoder) and projects them back to the pixel space (decoder). We train a dedicated VAE for the 2-m temperature and a dedicated VAE for the 10-m wind speed components, independently from the conditioner and denoiser. Once trained, the VAE weights are kept constant during the training of the rest of the network architecture. During inference, only the decoder of the VAE is used (see Figure 5).

The encoder and the decoder are structured as 2D convolutional networks composed of blocks of a ResNet residual block ([45]) and a downsampling/upsampling convolutional layer. Three levels of such blocks are used, each reducing each spatial dimension by a factor of 2, while the number of channels is bottlenecked at 32 times the number of input target variables (i.e. 32x1 for the 2-m temperature and 32x2 for the 10-m wind speed components). The VAE bottleneck latent space is regularised with a loss based on Kullback-Leibler divergence (KL) between the latent variable and a multivariate standard normal variable.

While the space dimensions are reduced by a factor of $8^{2}$ (from 512x512 to 64x64 pixel patches), the number of channels is increased from 1 (for the 2-m temperature) and 2 (for the 10-m wind speed components) to 32 and 64, respectively: the overall amount of data is therefore compressed only by a factor of 2, for both the VAEs (from 1x512x512 to 32x64x64 and from 2x512x512 to 64x64x64). Nevertheless, the gain in training performance of the denoiser and conditioner is much greater than the data reduction factor as the compression along the space dimension is more relevant for reducing the computational cost of the training than the increase in channel number ([24]).
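The compression arithmetic above can be checked with a few lines of code. The helper names are hypothetical; the numbers (three downsampling levels, 32 latent channels per target variable, 512x512 patches) are those stated in the text.

```python
from math import prod

def latent_shape(pixel_shape, levels=3, latent_channels_per_var=32):
    """Shape of the VAE latent for a (channels, height, width) input patch."""
    channels, height, width = pixel_shape
    scale = 2 ** levels                      # each level halves both spatial axes
    return (latent_channels_per_var * channels, height // scale, width // scale)

def compression_factor(pixel_shape, latent):
    """Ratio between the number of values in pixel space and in latent space."""
    return prod(pixel_shape) / prod(latent)

t2m_latent = latent_shape((1, 512, 512))     # 2-m temperature
uv_latent = latent_shape((2, 512, 512))      # 10-m wind components

t2m_factor = compression_factor((1, 512, 512), t2m_latent)
uv_factor = compression_factor((2, 512, 512), uv_latent)
```

Both factors evaluate to 2, while the number of spatial positions the denoiser has to process drops by a factor of 64.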

Figure 4 depicts details of the VAE’s structure.

4.3.2 Conditioner

The conditioner stack acts as a context encoder that processes the low-resolution predictors and high-resolution static data and embeds them into each level of the denoiser UNET architecture. Initially, both datasets are preprocessed by passing through a dedicated encoder, a projection layer, and an analysis sequence, before being merged. The predictors’ encoder is a basic Identity layer, as the predictors already have the same spatial dimensions as the latent space (64x64). The static data encoder is a variational encoder with the same structure as the VAE described in Section 4.3.1. For both datasets, the projection layer is a 2D convolutional layer with unitary kernel size, used to increase the number of channels, and the analysis is a sequence of four 2D Adaptive Fourier Neural Operator (AFNO) blocks (following [46]), used to extract relevant features. After preprocessing, the conditioning information is prepared to be fed into each level of the denoiser UNET by applying a combination of average pooling and 2D ResNet layers. Figure 4 depicts details of the conditioner’s structure.

4.3.3 Denoiser

Our denoising stack is structured like that of the LDCast model ([30]), a re-adaptation of the U-Net-type network applied in the original latent diffusion model ([24]). The resulting denoiser network consists of a UNET backbone equipped with a conditioning mechanism based on 2D AFNO blocks ([30]), aiming at a cross-attention-like operation (as suggested in [47]). This structure is meant to control the high-resolution synthesis process by feeding the conditioning into each level of the UNET architecture.

For the downscaling task, the conditioning information consists of the low-resolution predictors’ data and the high-resolution static data, elaborated by the conditioner. Figure 4 depicts details of the denoiser’s structure.

To improve the reconstruction of extreme values (for both temperature and wind speed), we implemented the v-prediction parameterization in our LDM, following [48]. This parameterization trains the denoiser to model a weighted combination of the noise and the start image, instead of only the noise or only the start image as done in the more traditional eps and x0 parameterizations, respectively.
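The relation between the three parameterizations can be made concrete with a short sketch. It assumes the usual variance-preserving convention, in which the noisy sample is a * x0 + s * eps with a^2 + s^2 = 1, and the common definition v = a * eps - s * x0; the function names and the chosen noise level are illustrative only.

```python
import numpy as np

def coefficients(alpha_bar_t):
    """Signal and noise coefficients of a variance-preserving process."""
    return np.sqrt(alpha_bar_t), np.sqrt(1.0 - alpha_bar_t)

def v_target(x0, eps, alpha_bar_t):
    """Training target under v-prediction: a mix of noise and start image."""
    a, s = coefficients(alpha_bar_t)
    return a * eps - s * x0

def x0_from_v(z_t, v, alpha_bar_t):
    """Recover the start image from the noisy sample and a predicted v."""
    a, s = coefficients(alpha_bar_t)
    return a * z_t - s * v

def eps_from_v(z_t, v, alpha_bar_t):
    """Recover the noise from the noisy sample and a predicted v."""
    a, s = coefficients(alpha_bar_t)
    return s * z_t + a * v

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))
eps = rng.standard_normal((4, 4))
alpha_bar_t = 0.36                            # gives a = 0.6 and s = 0.8
a, s = coefficients(alpha_bar_t)
z_t = a * x0 + s * eps
v = v_target(x0, eps, alpha_bar_t)
```

Because both the start image and the noise can be recovered from v, the target stays informative at every noise level, including the extremes where an eps-only or x0-only target becomes degenerate.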

As shown in Figure 5, the conditioner and the denoiser are trained together, minimizing the mean square error (L2), and feeding the network with random patches of ERA5 predictors (64x64 pixels) and static data (512x512 pixels) for the conditioning, and high-resolution target variables (512x512 pixels) for the ground truth.
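Extracting co-located patches at the two resolutions can be sketched as follows. The function name, the channel counts, and the small array sizes in the example are hypothetical; the only assumptions carried over from the text are the integer factor of 8 between the grids and the fact that the low-resolution patch and the high-resolution patches cover the same area.

```python
import numpy as np

def sample_paired_patch(low_res, static_hr, target_hr, rng, lr_size=64, factor=8):
    """Crop co-located patches from low-res predictors and high-res fields."""
    _, h_lr, w_lr = low_res.shape
    # Random top-left corner on the low-resolution grid.
    i = int(rng.integers(0, h_lr - lr_size + 1))
    j = int(rng.integers(0, w_lr - lr_size + 1))
    hr_size = lr_size * factor
    hi, hj = i * factor, j * factor           # same corner on the high-res grid
    lr_patch = low_res[:, i:i + lr_size, j:j + lr_size]
    static_patch = static_hr[:, hi:hi + hr_size, hj:hj + hr_size]
    target_patch = target_hr[:, hi:hi + hr_size, hj:hj + hr_size]
    return lr_patch, static_patch, target_patch

rng = np.random.default_rng(0)
low_res = rng.normal(size=(14, 10, 12))                       # toy predictors
target_hr = low_res[:1].repeat(8, axis=1).repeat(8, axis=2)   # toy aligned target
static_hr = np.zeros((3, 80, 96))                             # toy static fields

lr_patch, static_patch, target_patch = sample_paired_patch(
    low_res, static_hr, target_hr, rng, lr_size=4, factor=8
)
```

With the sizes quoted in the text, `lr_size=64` and `factor=8` give 64x64 predictor patches paired with 512x512 static and target patches.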

5 Results

All the presented models were tested on a one-year dataset, which was held out during the training and validation processes. The results from LDM_res are evaluated based on a single inference run, obtained using 100 denoising steps; the assessment of its potential to produce ensemble results is left to future analyses.

The following sections compare the results from LDM_res against the baselines using various verification metrics and distributions. In the additional material, Section 8.4, we report the comparison of results from the LDM trained with and without the residual approach to provide an overview of the improvements introduced by this method.

5.1 Qualitative evaluation

To provide a qualitative and perceptual overview of the obtained results, we present a random snapshot of downscaled variables compared with both the input ERA5 low-resolution data and the COSMO-CLM high-resolution reference-truth (Figure 6). The second and third columns show a zoom-in on Sardinia Island, providing a deeper overview of models’ performance over complex terrain, coastal shores, and open sea. Both generative models, the GAN and LDM_res, effectively overcome the blurriness observed in both the quadratic interpolation and the UNET for the target variables. Particularly for 2-meter temperatures, LDM_res demonstrates a remarkable ability to identify and reconstruct discontinuities in the variable field (zoomed-in view in Figure 6). Figure 6 also includes results for 10-meter wind speed (in color), which is a derived field obtained by combining the two actual target variables of the models, U and V. Perceptually, the results for this variable from both the GAN and LDM_res appear similar and equally plausible, displaying significantly more small-scale features compared to the UNET. A deeper qualitative examination reveals that the GAN aligns well with the reference truth, particularly over land, but exhibits mode collapse over the sea for both target variables. An example of this effect is shown in the additional material, Section 8.1, Figure 11. Conversely, the LDM_res consistently generates plausible high-resolution data across the entire domain, over both land and sea, and for both target variables.

5.2 Verification deterministic metrics

Figure 7 compares model results for different deterministic metrics, averaging results over the whole domain for each test timestep. In addition to results from the baseline and tested models, Figure 7 also reports results for the VAE of the LDM. These results are obtained using the VAE offline, feeding it with COSMO_CLM high-resolution data and calculating the metrics on the reconstructed data: this allows quantifying the LDM error resulting from the data decompression from the latent space only. We present three distance metrics, the Root Mean Square Error (RMSE), the mean bias (BIAS), and the coefficient of determination (R2), and one correlation metric, the Pearson Correlation Coefficient (PCC). Details on how these metrics are calculated are presented as additional material in Section 8.2, following [49].
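For orientation, the four metrics can be computed for a single timestep as in the sketch below. It uses the textbook definitions, with the bias taken as prediction minus reference so that positive values indicate overestimation; the exact formulas used in this work are those given in Section 8.2 and may differ in detail. The function name is hypothetical.

```python
import numpy as np

def deterministic_metrics(pred, truth):
    """RMSE, mean bias, R2 and Pearson correlation over all grid points."""
    pred = np.asarray(pred, dtype=float).ravel()
    truth = np.asarray(truth, dtype=float).ravel()
    diff = pred - truth
    ss_res = np.sum(diff ** 2)
    ss_tot = np.sum((truth - truth.mean()) ** 2)
    return {
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "bias": float(np.mean(diff)),
        "r2": float(1.0 - ss_res / ss_tot),
        "pcc": float(np.corrcoef(pred, truth)[0, 1]),
    }

truth = np.array([[1.0, 2.0], [3.0, 4.0]])
pred = truth + 0.5                            # uniformly too warm by 0.5
metrics = deterministic_metrics(pred, truth)
```

The toy case shows why the metrics complement each other: a constant offset leaves the correlation at 1 while it degrades RMSE, bias, and R2.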

The UNET and GAN models show comparable results, the best among the tested models, for all metrics except the bias. Indeed, minimizing the MSE is the exact goal of their training procedure. Conversely, LDM_res has been trained on a very different objective but still performs very well on all metrics. As expected, all models struggle more in downscaling the wind components than the 2-m temperature. Biases show that all models perform very well, with LDM_res excelling, especially for temperature. The UNET and the GAN show spatially averaged biases within 1°C for temperature, while LDM_res shrinks this variability to less than 0.5°C. Spatially averaged biases amount to 1 m/s for wind speed, with a narrower spread for LDM_res. The UNET and LDM_res show a less skewed distribution than the GAN for the 2-m temperature: while the GAN mostly tends to underestimate the average 2-m temperature, LDM_res shows a well-balanced distribution of over- and under-estimations. As for the wind speed biases, all models consistently show a slight underestimation of the target variable.

The results show that the VAE contribution to the LDM_res error is the lowest for the RMSE of 2-m temperature, while for bias, R2, and PCC the VAE has very little to no effect on either temperature or wind speed.

5.3 Spatial distribution of errors

The spatial distribution of time-averaged magnitude differences for both target variables and all tested models is illustrated in Figure 8. Within each panel, the numbers in square brackets represent the 0.5th and 99.5th percentile values, offering insight into the highest errors recorded. Negative and positive values signify underestimation and overestimation, respectively, for both variables. Results from the quadratic interpolation of ERA5 data provide information on the original input data: 2-m temperature tends to be highly overestimated over complex terrain but underestimated on flat terrain, with smaller errors over the sea. Wind speed, conversely, is largely underestimated over land, particularly over mountain ridges, with a tendency towards overestimation along coastal shores.
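As a minimal sketch of how such an error map and its bracketed percentile range could be computed, assume predictions and reference fields are stored as arrays of shape (time, y, x). The function names and the synthetic data below are illustrative assumptions, not the code used in this study.

```python
import numpy as np

def time_mean_difference(pred, ref):
    """Time-averaged signed difference (pred - ref) at each pixel.

    pred, ref: arrays of shape (time, y, x).
    Positive values indicate overestimation, negative values underestimation.
    """
    return (pred - ref).mean(axis=0)

def percentile_range(diff_map, low=0.5, high=99.5):
    """Return the (low, high) percentiles of a 2-D difference map."""
    p_low, p_high = np.percentile(diff_map, [low, high])
    return float(p_low), float(p_high)

# Synthetic example: the "model" overestimates the reference by 0.2 everywhere.
rng = np.random.default_rng(0)
ref = rng.normal(size=(24, 8, 10))
pred = ref + 0.2
diff = time_mean_difference(pred, ref)
p_low, p_high = percentile_range(diff)
```

For wind speed, the same functions would be applied to the speed magnitude derived from the two horizontal components rather than to each component separately.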

On the contrary, all DL-based models, including the UNET baseline, exhibit substantially smaller errors. For 2-m temperature, errors remain below 0.3°C, while for wind speed, they stay under 0.8 m/s across the entire domain. Notably, the UNET and GAN models perform comparably well for 2-m temperature, whereas LDM_res excels, leveraging diffusion processes to reduce the UNET errors homogeneously.

Regarding wind speed, all models tend to underestimate it. LDM_res demonstrates superior performance, minimizing errors to nearly zero over most of the domain, with a uniform distribution over land and sea. The GAN displays traces of mode collapse, particularly evident over the sea, indicating that these collapses persist statically in specific locations over time, consistent with the training approach of always presenting the network with the entire, fixed domain.

5.4 Frequency distributions

Figure 9 presents the results in terms of frequency distributions. LDM_res precisely captures the reconstruction of the 2-m temperature frequency distribution, surpassing all other models. All DL models effectively mitigate the occurrence of cold extremes evident in the low-resolution data (as demonstrated by the quadratic interpolation distribution) while increasing the incidence of warm extremes. Notably, the adversarial training of the UNET yields marginal enhancements in capturing the frequency distribution, with the GAN slightly outperforming the UNET, particularly regarding cold extremes. Conversely, the diffusion process performed by LDM_res significantly corrects the UNET residual errors, aligning closely with the reference-truth distribution across all temperature values.

Reconstructing the distribution of 10-m wind speed proves more challenging for all models, given the inherent chaotic nature of the $u$ and $v$ wind components compared to temperature, which is strongly influenced by terrain elevation. Nonetheless, performance outcomes mirror those of the 2-m temperature. The GAN modestly improves upon UNET results, primarily in reducing occurrences of low wind speeds. LDM_res exhibits the highest proficiency in capturing both the tail and center of the wind speed distribution.
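A minimal sketch of how such frequency distributions could be compared is given below, assuming the wind speed is derived from the $u$ and $v$ components and that all models are binned on common edges. The bin range, the function names, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def wind_speed(u, v):
    """Wind speed magnitude from the two horizontal components."""
    return np.hypot(u, v)

def frequency_distribution(values, bin_edges):
    """Relative frequency of values in each bin.

    Values falling outside bin_edges are ignored by np.histogram,
    so the edges should cover the full data range.
    """
    counts, _ = np.histogram(np.ravel(values), bins=bin_edges)
    return counts / counts.sum()

# Synthetic example: a "model" whose wind components are 20% too weak.
rng = np.random.default_rng(0)
u_ref, v_ref = rng.normal(size=(2, 1000))
u_mod, v_mod = 0.8 * u_ref, 0.8 * v_ref

edges = np.linspace(0.0, 10.0, 41)  # common bins for all models (assumed range)
freq_ref = frequency_distribution(wind_speed(u_ref, v_ref), edges)
freq_mod = frequency_distribution(wind_speed(u_mod, v_mod), edges)
```

Using identical bin edges for every model keeps the distributions directly comparable, and plotting the frequencies on a logarithmic axis makes differences in the tails easier to see.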

5.5 Radially Averaged Power Spectral Density (RAPSD)

Figure 10 showcases the results in terms of Radially Averaged Power Spectral Density (RAPSD), computed following the implementation outlined in [50]. The top row of the figure illustrates a single RAPSD, representing the average of each RAPSD calculated for every timestamp within the test dataset. To provide insight into the distribution of these values across all timestamps, the distributions of single-time RAPSD for fixed wavelengths are displayed in the bottom rows of Figure 10.
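To make the quantity concrete, the following is a simplified NumPy sketch of a radially averaged power spectral density. It is not the implementation of [50], and the binning and normalization choices here are assumptions made for illustration.

```python
import numpy as np

def rapsd(field, d=1.0):
    """Radially averaged power spectral density of a 2-D field.

    field: 2-D array; d: grid spacing (e.g. in km).
    Returns (frequencies in cycles per unit of d, mean power per radial bin),
    excluding the zero-frequency (domain-mean) bin.
    """
    ny, nx = field.shape
    # 2-D power spectrum (simple normalization by the number of pixels)
    power = np.abs(np.fft.fft2(field)) ** 2 / (ny * nx)
    # Radial frequency of every spectral coefficient
    fy = np.fft.fftfreq(ny, d=d)
    fx = np.fft.fftfreq(nx, d=d)
    radius = np.hypot(fy[:, None], fx[None, :])
    # Bin width set by the longer side of the domain
    size = max(ny, nx)
    df = 1.0 / (size * d)
    index = np.rint(radius / df).astype(int)
    n_bins = size // 2
    # Mean power in each radial bin (bins beyond n_bins are discarded)
    sums = np.bincount(index.ravel(), weights=power.ravel())[:n_bins]
    counts = np.bincount(index.ravel())[:n_bins]
    spectrum = sums / counts
    freqs = np.arange(n_bins) * df
    return freqs[1:], spectrum[1:]

# Example: a sine wave with 8 cycles across a 64-pixel domain at 2 km spacing,
# i.e. a wavelength of 16 km.
x = np.arange(64)
field = np.tile(np.sin(2 * np.pi * 8 * x / 64), (64, 1))
freqs, spectrum = rapsd(field, d=2.0)
peak_wavelength_km = 1.0 / freqs[np.argmax(spectrum)]
```

Averaging the spectra returned for every test timestamp would give a single curve per model, while keeping the per-timestamp values at a fixed wavelength gives the distributions described above.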

Overall, all DL models effectively reconstruct the 2-m temperature power spectra down to wavelengths of 10 km. However, LDM_res consistently outperforms both the UNET and the GAN, as evident in panels (b) and (c) of Figure 10. The UNET and the GAN yield similar results, with marginal yet consistent enhancements stemming from the adversarial training of the UNET. The diffusion process of LDM_res improves the generation of small-scale features, showing precise reconstruction of the spectra down to 9 km (see panel c). For scales smaller than 9-10 km, all models exhibit decreased performance, albeit still showing improvements over the quadratic interpolation of ERA5. LDM_res still outperforms the other models, but the very small-scale variability of the original data is slightly underestimated. This behavior is to be ascribed to the VAE, as further elucidated in Section 8.3. Indeed, original reference-truth data compressed and reconstructed by the VAE show the very same spectra underestimation for scales smaller than 9 km. The loss of information is therefore inherent to the projection into the latent space.

In contrast, results for wind speed distinctly demonstrate that generative models surpass both quadratic interpolation and UNET, effectively matching the slope of the energy power spectra and remaining competitive with each other. Specifically, LDM_res consistently outperforms the GAN down to 7 km, as emphasized in panels (b) and (c) of Figure 10 (right column). Similar to the 2-m temperature, for scales smaller than 9-10 km, both the GAN and LDM_res experience reduced performance, although they consistently exhibit improvements over the UNET. For LDM_res, this behavior is in this case only partly to be ascribed to the VAE, as further shown in Section 8.3. Original wind reference-truth data compressed and reconstructed by the VAE show a similar spectra underestimation for scales smaller than 9 km, but an additional loss is to be attributed intrinsically to the extraction of features by the diffusion process conditioned on the low-resolution data and high-resolution static data. The loss of information is therefore due only in part to the projection into the latent space.

5.6 Run-time performance

In this section, we compare the run-time performance of our tested models. These characteristics are of fundamental importance given the potential target applications of such models. Table 2 reports data for each model: to give a complete picture of the needed resources, we provide information on both the training and inference requirements. The hour budgets indicated for LDM_res also include the hours to train/run the UNET, which is needed to provide the residual data. LDM_res budgets therefore cover the computational time required to train and run the whole modeling chain from scratch. Simulations were run either on NVIDIA GeForce RTX 4090 or NVIDIA A100 GPUs. The training dataset comprises 129000 hourly samples over a target domain of 576x672 pixels (at high resolution) and 72x86 pixels (at low resolution). The UNET and GAN training ran with batches of and of for 2-m temperature and 10-m wind speed components, respectively. LDM_res training ran with batches of size 16 and 8 for the corresponding two target variables. Inference times are calculated running with a batch size of 1.

As shown in Table 2, LDM_res implies more expensive training and inference processes when compared with the tested DL baselines. This is expected, given the more complex structure of the diffusion model and its number of trainable parameters, which is an order of magnitude greater than that of the baselines. Nevertheless, the required computational time for both training and inference remains contained and competitive with the other available options. LDM_res requires 10 days over 8 GPUs to train both models and 30 hours on a single GPU to downscale one year of hourly data. For comparison, we note here that a 1-year-long COSMO-CLM simulation run over the very same domain requires 61 h, running on 2160 cores [2] (of course producing many more high-resolution variables than the sole 2-m temperature and 10-m horizontal wind components).

Table 2: Number of GPU hours required for the training and inference (of a 1-year-long test set) by each tested model.

Model	Training (GPU hours)		Inference (GPU hours)	
	2mT	UV	2mT	UV
UNET	$\sim$ 250	$\sim$ 380	$\sim$ 1	$\sim$ 1
GAN	$\sim$ 300	$\sim$ 100	$\sim$ 1	$\sim$ 1
LDM_res	$\sim$ 870	$\sim$ 1100	$\sim$ 15	$\sim$ 16
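The budgets quoted in the text of this section can be cross-checked against Table 2 with a few lines of arithmetic. The sketch below uses the rounded values from the table, so the totals are approximate. The final figure is a CPU core-hour count for COSMO-CLM, which is not directly comparable to GPU hours and is reported only as an order of magnitude.

```python
# Approximate GPU hours from Table 2 (rounded "~" values).
gpu_hours = {
    "UNET": {"train": {"2mT": 250, "UV": 380}, "infer": {"2mT": 1, "UV": 1}},
    "GAN": {"train": {"2mT": 300, "UV": 100}, "infer": {"2mT": 1, "UV": 1}},
    "LDM_res": {"train": {"2mT": 870, "UV": 1100}, "infer": {"2mT": 15, "UV": 16}},
}

def total(model, phase):
    """Sum the GPU hours of both target variables for a model and phase."""
    return sum(gpu_hours[model][phase].values())

# Training both LDM_res models: total GPU hours and wall-clock days on 8 GPUs.
ldm_train_gpu_hours = total("LDM_res", "train")
ldm_train_days_on_8_gpus = ldm_train_gpu_hours / 8 / 24

# Downscaling one year of hourly data for both variables on a single GPU.
ldm_infer_gpu_hours = total("LDM_res", "infer")

# COSMO-CLM: 61 h on 2160 cores for a 1-year simulation [2].
cosmo_core_hours = 61 * 2160

print(ldm_train_gpu_hours, round(ldm_train_days_on_8_gpus, 1))
print(ldm_infer_gpu_hours, cosmo_core_hours)
```

The totals (about 1970 GPU hours, or roughly 10 days on 8 GPUs, for training and about 31 GPU hours for inference) agree with the rounded figures given in the text.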

6 Discussion and Conclusions

This study compares the performance of various downscaling models, focusing on their ability to reconstruct high-resolution meteorological variables from low-resolution input data. The models evaluated include a baseline UNET, a Generative Adversarial Network (GAN), and a Latent Diffusion Model with a residual approach against the reference UNET (LDM_res). The results are analyzed using qualitative evaluations, deterministic metrics, spatial error distributions, frequency distributions, Radially Averaged Power Spectral Density (RAPSD), and runtime performance.

LDM_res demonstrates superior performance across most metrics, particularly in reconstructing fine-scale details and maintaining accuracy in frequency distributions (especially for the extreme values) and spatial error distributions. LDM_res also outperforms the other models in reconstructing the power spectra, especially for wind speed, with outstanding results up to 7 km wavelengths. Residual errors at smaller scales can be attributed to the data projection into the latent space, specifically to the usage of the VAE. This performance loss might be mitigated by conducting the diffusion process directly in the pixel space. However, this alternative approach would substantially increase the computational costs for both training and inference.

The remarkable results of LDM_res are to be ascribed equally to two fundamental aspects of the proposed model:

• the incomparable effectiveness of the diffusion process in extracting features and leveraging the provided conditioning;

• the residual approach, which allows the diffusion process to focus only on smaller scales and more subtle characteristics of the fields, delegating the estimates of large-scale variation of the atmospheric fields to a simpler, yet effective, network.

However, the great performance of LDM_res comes at the cost of significantly higher computational requirements for both training and inference, when compared to the other DL models, the UNET or the GAN. Nonetheless, LDM_res still offers a significant advantage in terms of inference speed and computational efficiency once the model is trained when compared to the extensive computational resources required by COSMO-CLM.

In conclusion, the ability of LDM_res to accurately reproduce the statistics of the COSMO-CLM reference-truth data, provided with the same input, demonstrates its potential as an effective and versatile dynamical downscaling emulator. This approach significantly accelerates the downscaling process compared to traditional numerical dynamical models, making it highly suitable for a broad range of important applications, such as downscaling seasonal forecasts or climate projections. However, a primary limitation of such deep learning models remains the temporal consistency of the generated fields, which is not guaranteed by the construction of the models.

7 Future work

The results presented in this work suggest several promising directions for further investigation into the application of latent diffusion models for downscaling. Future research could explore the ensemble generation capabilities of our Latent Diffusion Model, its effectiveness in downscaling discontinuous and chaotic variables such as precipitation—crucial for many applications—and the temporal consistency of the downscaled data, including methods to enforce this consistency within the model architecture. Long-term developments may include applying latent diffusion models to real-time weather forecasts, seasonal forecasts, and climate projections, with adjustments in the training procedure, particularly in selecting low-resolution input predictors and target reference truths.

Acknowledgments

This work was developed with financial support from ICSC–Centro Nazionale di Ricerca in High Performance Computing, Big Data and Quantum Computing, Spoke4 - Earth and Climate, funded by European Union – NextGenerationEU.

Disclaimer

This Work has not yet been peer-reviewed and is provided by the contributing Author(s) as a means to ensure timely dissemination of scholarly and technical Work on a noncommercial basis. Copyright and all rights therein are maintained by the Author(s) or by other copyright owners. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each Author’s copyright. This Work may not be reposted without explicit permission of the copyright owner.

8 Additional Material

8.1 Additional random snapshots of downscaled data

Some additional snapshots of downscaled data from all the tested models are shown in the following.

8.2 Calculation of verification metrics

In Section 5.2, we discuss the following metrics: three distance metrics, namely the Root Mean Square Error (RMSE), the mean bias (BIAS), and the coefficient of determination (R2), and one correlation metric, the Pearson Correlation Coefficient (PCC). All the metrics are calculated with the xskillscore library [49]. Definitions are the following:

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(a_{i}-b_{i})^{2}} \tag{1}$$

$$\mathrm{BIAS}=\frac{1}{n}\sum_{i=1}^{n}(a_{i}-b_{i}) \tag{2}$$

$$SS_{\mathrm{tot}}=\sum_{i=1}^{n}(a_{i}-\bar{a})^{2} \tag{3}$$

$$SS_{\mathrm{res}}=\sum_{i=1}^{n}(a_{i}-b_{i})^{2} \tag{4}$$

$$R^{2}=1-\frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}} \tag{5}$$

$$\mathrm{PCC}=\frac{\sum_{i=1}^{n}(a_{i}-\bar{a})(b_{i}-\bar{b})}{\sqrt{\sum_{i=1}^{n}(a_{i}-\bar{a})^{2}}\sqrt{\sum_{i=1}^{n}(b_{i}-\bar{b})^{2}}} \tag{6}$$

where $a_{i}$ and $b_{i}$ are the predicted and reference truth values, respectively, at the $i^{th}$ pixel of the domain; $\bar{a}$ and $\bar{b}$ are their averages over the domain; and $n$ is the total number of pixels in the domain.
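For illustration, Equations (1)-(6) can be written directly in NumPy as in the sketch below. This is not the xskillscore code used in the study; the function names and the small example arrays are assumptions, and the $R^{2}$ function follows Equations (3)-(5) exactly as written above.

```python
import numpy as np

def rmse(a, b):
    """Root Mean Square Error, Eq. (1)."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def bias(a, b):
    """Mean bias, Eq. (2): positive when a exceeds b on average."""
    return float(np.mean(a - b))

def r2(a, b):
    """Coefficient of determination, Eqs. (3)-(5)."""
    ss_tot = np.sum((a - a.mean()) ** 2)
    ss_res = np.sum((a - b) ** 2)
    return float(1.0 - ss_res / ss_tot)

def pcc(a, b):
    """Pearson Correlation Coefficient, Eq. (6)."""
    da, db = a - a.mean(), b - b.mean()
    return float(np.sum(da * db) / (np.sqrt(np.sum(da ** 2)) * np.sqrt(np.sum(db ** 2))))

# Tiny example with four "pixels": a is the prediction, b the reference truth.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, 2.0, 3.0, 5.0])
scores = {"RMSE": rmse(a, b), "BIAS": bias(a, b), "R2": r2(a, b), "PCC": pcc(a, b)}
```

In this example the prediction is 1 unit too low at a single pixel, which gives an RMSE of 0.5 and a bias of -0.25. Applying the functions to each test timestep separately would yield the per-timestep, domain-averaged values discussed in Section 5.2.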

8.3 On the contribution of the VAE

Figure 12 compares spectra from the COSMO-CLM test reference truth data with spectra from high-resolution test data generated using three models: the quadratic interpolation of ERA5 data, VAE_res, and LDM_res. This figure aims to highlight the contribution of the decompression stage, from the latent space to the pixel space, to the errors of LDM_res.

VAE_res takes as input the original reference truth high-resolution target variables. LDM_res takes as input the low-resolution ERA5 predictors, the high-resolution static data, and random noise, produces information in the latent space, and projects it into the pixel space using the VAE Decoder. Both VAE_res and LDM_res also use the corresponding UNET estimates for each target variable, but these quantities are subtracted before the encoding step (when applied) and added back after the decompression stage, acting essentially as constants. Therefore, the reconstruction errors in the spectra for VAE_res are solely attributed to the compression/decompression processes, while the reconstruction errors in the spectra for LDM_res arise from both the diffusion and decompression processes.
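The role of the UNET estimate as a constant can be illustrated with a toy sketch. Here the VAE encoder and decoder are replaced by block averaging and nearest-neighbour upsampling, which are hypothetical stand-ins chosen only because they are lossy and easy to verify; they do not represent the trained VAE. All function names are assumptions.

```python
import numpy as np

def toy_encode(x, factor=4):
    """Stand-in for the VAE encoder: block-average by `factor` (lossy).

    Both dimensions of x must be divisible by `factor`.
    """
    ny, nx = x.shape
    return x.reshape(ny // factor, factor, nx // factor, factor).mean(axis=(1, 3))

def toy_decode(z, factor=4):
    """Stand-in for the VAE decoder: nearest-neighbour upsampling."""
    return np.repeat(np.repeat(z, factor, axis=0), factor, axis=1)

def vae_res_reconstruction(target, unet_estimate, factor=4):
    """Subtract the UNET estimate, compress/decompress the residual, add it back."""
    residual = target - unet_estimate
    decoded_residual = toy_decode(toy_encode(residual, factor), factor)
    return decoded_residual + unet_estimate

rng = np.random.default_rng(0)
target = rng.normal(size=(16, 16))
unet_estimate = target + 0.1 * rng.normal(size=(16, 16))  # imperfect estimate

reconstruction = vae_res_reconstruction(target, unet_estimate)
residual = target - unet_estimate
# The reconstruction error depends only on what is lost from the residual.
error = target - reconstruction
lost_from_residual = residual - toy_decode(toy_encode(residual))
```

In this sketch the reconstruction error equals the information lost by compressing and decompressing the residual, which mirrors the argument that VAE_res errors come only from the compression/decompression stage.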

For 2-m temperature, the spectra from the VAE_res and LDM_res are nearly identical across all wavelengths, including the smallest scales. This indicates that the errors in reconstructing the spectra with LDM_res are attributable solely to the decompression stage, with the diffusion process effectively and accurately extracting latent features from the conditional data. On the contrary, for 10-m wind speed, which is a more chaotic field, discrepancies between the spectra from LDM_res and VAE_res are observed at scales smaller than approximately 7 km. These differences highlight the errors introduced by the sole extraction of features by the diffusion process.

8.4 LDM residual versus non-residual results

In this section, we report the comparison of results from the LDM trained with and without the residual approach, to provide an overview of the improvements introduced by this approach. We compare results in terms of frequency distribution and of radially averaged power spectral density, shown in Figures 13 and 14.

The graphs show relevant differences in the performance of the two models, especially in: (i) the estimation of the most frequent values of 2-m temperature, (ii) the reconstruction of the whole frequency distribution of the wind speed, (iii) the reconstruction of the 2-m temperature power spectra at small scales, with performances of LDM with no residual approach degrading below the quadratic interpolation of ERA5, and (iv) the reconstruction of the 10-m wind speed power spectra at all scales, with a quasi-constant lag between the two models throughout all wavelengths.

The corresponding VAEs for the two models (VAE and VAE_res) perform the same except for the small scales of the 2-m temperature power spectra (not shown). Thus, the lower performance of the LDM can be attributed to the VAE only in this case (i.e., (iii)). All the other listed deficiencies are to be attributed to the diffusion process itself, which, when trained to reconstruct a residual field instead of the original target variable field, gains performance in the reconstruction of all the frequency distributions, and of the power spectra across all wavelengths (not only the small scales), especially for chaotic variables such as wind speed.

References

[1] Hans Hersbach, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, Julien Nicolas, Carole Peubey, Raluca Radu, Dinand Schepers, Adrian Simmons, Cornel Soci, Saleh Abdalla, Xavier Abellan, Gianpaolo Balsamo, Peter Bechtold, Gionata Biavati, Jean Bidlot, Massimo Bonavita, Giovanna De Chiara, Per Dahlgren, Dick Dee, Michail Diamantakis, Rossana Dragani, Johannes Flemming, Richard Forbes, Manuel Fuentes, Alan Geer, Leo Haimberger, Sean Healy, Robin J. Hogan, Elías Hólm, Marta Janisková, Sarah Keeley, Patrick Laloyaux, Philippe Lopez, Cristina Lupu, Gabor Radnoti, Patricia de Rosnay, Iryna Rozum, Freja Vamborg, Sebastien Villaume, and Jean-Noël Thépaut. The era5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 146(730):1999–2049, 2020.
[2] Mario Raffa, Alfredo Reder, Gian Franco Marras, Marco Mancini, Gabriella Scipione, Monia Santini, and Paola Mercogliano. Vhr-rea_it dataset: Very high resolution dynamical downscaling of era5 reanalysis over italy by cosmo-clm. Data, 6(8), 2021.
[3] Marianna Adinolfi, Mario Raffa, Alfredo Reder, and Paola Mercogliano. Investigation on potential and limitations of era5 reanalysis downscaled on italy by a convection-permitting model. Climate Dynamics, 61(9), 2023.
[4] BC Hewitson and RG Crane. Climate downscaling: techniques and application. Climate Research, 07:85–95, 1996.
[5] R.L. Wilby and T.M.L. Wigley. Downscaling general circulation model output: a review of methods and limitations. Progress in Physical Geography: Earth and Environment, 21(4):530–548, 1997.
[6] Douglas Maraun and Martin Widmann. Statistical Downscaling and Bias Correction for Climate Research. Cambridge University Press, 2018.
[7] Michael Baldauf, Axel Seifert, Jochen Förstner, Detlev Majewski, Matthias Raschendorfer, and Thorsten Reinhardt. Operational convective-scale numerical weather prediction with the cosmo model: Description and sensitivities. Monthly Weather Review, 139(12):3887 – 3905, 2011.
[8] Y. Seity, P. Brousseau, S. Malardel, G. Hello, P. Bénard, F. Bouttier, C. Lac, and V. Masson. The arome-france convective-scale operational model. Monthly Weather Review, 139(3):976 – 991, 2011.
[9] Cosmo arpae-simc. http://www.cosmo-model.org/content/tasks/operational/cosmo/arpae-simc/default.htm. Accessed: 20 May 2024.
[10] Filippo Giorgi, Colin Jones, and Ghassem R. Asrar. Addressing climate information needs at the regional level: the cordex framework. World Meteorological Organization (WMO) Bulletin, 58(3):175, 2009.
[11] Daniela Jacob, Juliane Petersen, Bastian Eggert, Antoinette Alias, Ole Bøssing Christensen, Laurens M Bouwer, Alain Braun, Augustin Colette, Michel Déqué, Goran Georgievski, et al. Euro-cordex: new high-resolution climate change projections for european impact research. Regional environmental change, 14:563–578, 2014.
[12] Hans von Storch, Eduardo Zorita, and Ulrich Cubasch. Downscaling of global climate change estimates to regional scales: An application to iberian rainfall in wintertime. Journal of Climate, 6(6):1161 – 1171, 1993.
[13] E. Sharifi, B. Saghafian, and R. Steinacker. Downscaling satellite precipitation estimates with multiple linear regression, artificial neural networks, and spline interpolation techniques. Journal of Geophysical Research: Atmospheres, 124(2):789–805, 2019.
[14] Simone Sperati, Stefano Alessandrini, Filippo D’Amico, Will Cheng, Christopher M. Rozoff, Riccardo Bonanno, Matteo Lacavalla, Martina Aiello, Davide Airoldi, Alessandro Amaranto, Goffredo Decimi, and Milena Angelina Vergata. A new wind atlas to support the expansion of the italian wind power fleet. Wind Energy, 27(3):298–316, 2024.
[15] H.A. Panofsky and G.W. Brier. Some Applications of Statistics to Meteorology. Earth and Mineral Sciences Continuing Education, College of Earth and Mineral Sciences, 1968.
[16] J. Baño Medina, R. Manzanas, and J. M. Gutiérrez. Configuration and intercomparison of deep learning neural models for statistical downscaling. Geoscientific Model Development, 13(4):2109–2124, 2020.
[17] Neelesh Rampal, Peter B. Gibson, Abha Sood, Stephen Stuart, Nicolas C. Fauchereau, Chris Brandolino, Ben Noll, and Tristan Meyers. High-resolution downscaling with interpretable deep learning: Rainfall extremes over new zealand. Weather and Climate Extremes, 38:100525, 2022.
[18] Kevin Höhlein, Michael Kern, Timothy Hewson, and Rüdiger Westermann. A comparative study of convolutional neural network models for wind field downscaling. Meteorological Applications, 27(6):e1961, 2020.
[19] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
[20] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Commun. ACM, 63(11):139–144, oct 2020.
[21] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics, 2015.
[22] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713–4726, 2023.
[23] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
[24] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. CoRR, abs/2112.10752, 2021.
[25] Karen Stengel, Andrew Glaws, Dylan Hettinger, and Ryan N. King. Adversarial super-resolution of climatological wind and solar data. Proceedings of the National Academy of Sciences, 117(29):16805–16815, 2020.
[26] Lucy Harris, Andrew T. T. McRae, Matthew Chantry, Peter D. Dueben, and Tim N. Palmer. A generative deep learning approach to stochastic downscaling of precipitation forecasts. Journal of Advances in Modeling Earth Systems, 14(10):e2022MS003120, 2022.
[27] Jussi Leinonen, Daniele Nerini, and Alexis Berne. Stochastic super-resolution for downscaling time-evolving atmospheric fields with a generative adversarial network. IEEE Transactions on Geoscience and Remote Sensing, 59(9):7211–7223, 2021.
[28] Martin Arjovsky and Leon Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2017.
[29] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge?, 2018.
[30] Jussi Leinonen, Ulrich Hamann, Daniele Nerini, Urs Germann, and Gabriele Franch. Latent diffusion models for generative precipitation nowcasting with accurate uncertainty quantification, 2023.
[31] Lizao Li, Robert Carver, Ignacio Lopez-Gomez, Fei Sha, and John Anderson. Generative emulation of weather forecast ensembles with diffusion models. Science Advances, 10(13):eadk4489, 2024.
[32] Henry Addison, Elizabeth Kendon, Suman Ravuri, Laurence Aitchison, and Peter AG Watson. Machine learning emulation of a local-scale uk climate model, 2022.
[33] Morteza Mardani, Noah Brenowitz, Yair Cohen, Jaideep Pathak, Chieh-Yu Chen, Cheng-Chin Liu, Arash Vahdat, Karthik Kashinath, Jan Kautz, and Mike Pritchard. Residual diffusion modeling for km-scale atmospheric downscaling, 2023.
[34] Thomas Vandal, Evan Kodra, Sangram Ganguly, Andrew Michaelis, Ramakrishna Nemani, and Auroop R. Ganguly. Deepsd: Generating high resolution climate change projections through single image super-resolution. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, page 1663–1672, New York, NY, USA, 2017. Association for Computing Machinery.
[35] J. Wang, Z. Liu, I. Foster, W. Chang, R. Kettimuthu, and V. R. Kotamarthi. Fast and accurate learned multiresolution dynamical downscaling for precipitation. Geoscientific Model Development, 14(10):6355–6372, 2021.
[36] Epsg geodetic parameter dataset. https://epsg.org/home.html.
[37] European environmental agency reference grid. https://www.eea.europa.eu/data-and-maps/data/eea-reference-grids-2.
[38] Copernicus digital elevation model. https://registry.opendata.aws/copernicus-dem, 2023. Accessed: 2nd February 2023.
[39] Marcel Buchhorn, Bruno Smets, Luc Bertels, Bert De Roo, Myroslava Lesiv, Nandin Erdene Tsendbazar, Martin Herold, and Steffen Fritz. Copernicus global land service: Land cover 100m: collection 3: epoch 2015: Globe (v3.0.1) [data set]. Zenodo, 2020.
[40] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing.
[41] Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 448–456. JMLR.org, 2015.
[42] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[43] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12873–12883, June 2021.
[44] Alberto Carpentieri, Doris Folini, Jussi Leinonen, and Angela Meyer. Extending intraday solar forecast horizons with deep generative models, 2023.
[45] K. Stephan, S. Klink, and C. Schraff. Assimilation of radar-derived rain rates into the convective-scale model cosmo-de at dwd. Quarterly Journal of the Royal Meteorological Society, 134(634):1315–1326, 2008.
[46] Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, Pedram Hassanzadeh, Karthik Kashinath, and Animashree Anandkumar. Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators, 2022.
[47] John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, and Bryan Catanzaro. Efficient token mixing for transformers via adaptive fourier neural operators. In International Conference on Learning Representations, 2022.
[48] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. CoRR, abs/2202.00512, 2022.
[49] R. Bell, A. Spring, R. Brady, Andrew, Dougie Squire, Zachary Blackwood, Maximilian Cosmo Sitter, and Taher Chegini. xarray-contrib/xskillscore: Release v0.0.23, aug 2021.
[50] S. Pulkkinen, D. Nerini, A. A. Pérez Hortal, C. Velasco-Forero, A. Seed, U. Germann, and L. Foresti. Pysteps: an open-source python library for probabilistic precipitation nowcasting (v1.0). Geoscientific Model Development, 12(10):4185–4219, 2019.