-
DSSIM: a structural similarity index for floating-point data
Authors:
Allison H. Baker,
Alexander Pinard,
Dorit M. Hammerling
Abstract:
Data visualization is a critical component in terms of interacting with floating-point output data from large model simulation codes. Indeed, postprocessing analysis workflows on simulation data often generate a large number of images from the raw data, many of which are then compared to each other or to specified reference images. In this image-comparison scenario, image quality assessment (IQA) measures are quite useful, and the Structural Similarity Index (SSIM) continues to be a popular choice. However, generating large numbers of images can be costly, and plot-specific (but data independent) choices can affect the SSIM value. A natural question is whether we can apply the SSIM directly to the floating-point simulation data and obtain an indication of whether differences in the data are likely to impact a visual assessment, effectively bypassing the creation of a specific set of images from the data. To this end, we propose an alternative to the popular SSIM that can be applied directly to the floating point data, which we refer to as the Data SSIM (DSSIM). While we demonstrate the usefulness of the DSSIM in the context of evaluating differences due to lossy compression on large volumes of simulation data from a popular climate model, the DSSIM may prove useful for many other applications involving simulation or image data.
Submitted 19 March, 2023; v1 submitted 5 February, 2022;
originally announced February 2022.
-
Nonstationary Spatial Modeling of Massive Global Satellite Data
Authors:
Huang Huang,
Lewis R. Blake,
Matthias Katzfuss,
Dorit M. Hammerling
Abstract:
Earth-observing satellite instruments obtain a massive number of observations every day. For example, tens of millions of sea surface temperature (SST) observations on a global scale are collected daily by the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument. Despite their size, such datasets are incomplete and noisy, necessitating spatial statistical inference to obtain complete, high-resolution fields with quantified uncertainties. Such inference is challenging due to the high computational cost, the nonstationary behavior of environmental processes on a global scale, and land barriers affecting the dependence of SST. In this work, we develop a multi-resolution approximation (M-RA) of a Gaussian process (GP) whose nonstationary, global covariance function is obtained using local fits. The M-RA requires domain partitioning, which can be set up application-specifically. In the SST case, we partition the domain purposefully to account for and weaken dependence across land barriers. Our M-RA implementation is tailored to distributed-memory computation in high-performance-computing environments. We analyze a MODIS SST dataset consisting of more than 43 million observations, to our knowledge the largest dataset ever analyzed using a probabilistic GP model. We show that our nonstationary model based on local fits provides substantially improved predictive performance relative to a stationary approach.
Submitted 26 November, 2021;
originally announced November 2021.
-
Modeling massive highly-multivariate nonstationary spatial data with the basis graphical lasso
Authors:
Mitchell Krock,
William Kleiber,
Dorit Hammerling,
Stephen Becker
Abstract:
We propose a new modeling framework for highly-multivariate spatial processes that synthesizes ideas from recent multiscale and spectral approaches with graphical models. The basis graphical lasso writes a univariate Gaussian process as a linear combination of basis functions weighted with entries of a Gaussian graphical vector whose graph is estimated from optimizing an $\ell_1$ penalized likelihood. This paper extends the setting to a multivariate Gaussian process where the basis functions are weighted with Gaussian graphical vectors. We motivate a model where the basis functions represent different levels of resolution and the graphical vectors for each level are assumed to be independent. Using an orthogonal basis grants linear complexity and memory usage in the number of spatial locations, the number of basis functions, and the number of realizations. An additional fusion penalty encourages a parsimonious conditional independence structure in the multilevel graphical model. We illustrate our method on a large climate ensemble from the National Center for Atmospheric Research's Community Atmosphere Model that involves 40 spatial processes.
Submitted 9 June, 2021; v1 submitted 7 January, 2021;
originally announced January 2021.
-
HECT: High-Dimensional Ensemble Consistency Testing for Climate Models
Authors:
Niccolò Dalmasso,
Galen Vincent,
Dorit Hammerling,
Ann B. Lee
Abstract:
Climate models play a crucial role in understanding the effect of environmental and man-made changes on climate to help mitigate climate risks and inform governmental decisions. Large global climate models such as the Community Earth System Model (CESM), developed by the National Center for Atmospheric Research, are very complex with millions of lines of code describing interactions of the atmosphere, land, oceans, and ice, among other components. As development of the CESM is constantly ongoing, simulation outputs need to be continuously controlled for quality. To be able to distinguish a "climate-changing" modification of the code base from a true climate-changing physical process or intervention, there needs to be a principled way of assessing statistical reproducibility that can handle both spatial and temporal high-dimensional simulation outputs. Our proposed work uses probabilistic classifiers like tree-based algorithms and deep neural networks to perform a statistically rigorous goodness-of-fit test of high-dimensional spatio-temporal data.
Submitted 30 November, 2020; v1 submitted 8 October, 2020;
originally announced October 2020.
-
Combining interdependent climate model outputs in CMIP5: A spatial Bayesian approach
Authors:
Huang Huang,
Dorit Hammerling,
Bo Li,
Richard Smith
Abstract:
Projections of future climate change rely heavily on climate models, and combining climate models through a multi-model ensemble is both more accurate than a single climate model and valuable for uncertainty quantification. However, Bayesian approaches to multi-model ensembles have been criticized for making oversimplified assumptions about bias and variability, as well as treating different models as statistically independent. This paper extends the Bayesian hierarchical approach of Sansom et al. (2017) by explicitly accounting for spatial variability and inter-model dependence. We propose a Bayesian hierarchical model that accounts for bias between climate models and observations, spatial and inter-model dependence, the emergent relationship between historical and future periods, and natural variability. Extensive simulations show that our model provides better estimates and uncertainty quantification than the commonly used simple model mean. These results are illustrated using data from the CMIP5 model archive. As examples, for Central North America our projected mean temperature for 2070--2100 is about 0.8 K lower than the simple model mean, while for East Asia it is about 0.5 K higher; however, in both cases, the widths of the 90% credible intervals are of the order 3--6 K, so the uncertainties overwhelm the relatively small differences in projected mean temperatures.
Submitted 26 February, 2020; v1 submitted 31 December, 2019;
originally announced January 2020.
-
Unlocking GOES: A Statistical Framework for Quantifying the Evolution of Convective Structure in Tropical Cyclones
Authors:
Trey McNeely,
Ann B. Lee,
Kimberly M. Wood,
Dorit Hammerling
Abstract:
Tropical cyclones (TCs) rank among the most costly natural disasters in the United States, and accurate forecasts of track and intensity are critical for emergency response. Intensity guidance has improved steadily but slowly, as processes which drive intensity change are not fully understood. Because most TCs develop far from land-based observing networks, geostationary satellite imagery is critical to monitor these storms. However, these complex data can be challenging to analyze in real time, and off-the-shelf machine learning algorithms have limited applicability on this front due to their ``black box'' structure. This study presents analytic tools that quantify convective structure patterns in infrared satellite imagery for over-ocean TCs, yielding lower-dimensional but rich representations that support analysis and visualization of how these patterns evolve during rapid intensity change. The proposed ORB feature suite targets the global Organization, Radial structure, and Bulk morphology of TCs. By combining ORB and empirical orthogonal functions, we arrive at an interpretable and rich representation of convective structure patterns that serve as inputs to machine learning methods. This study uses the logistic lasso, a penalized generalized linear model, to relate predictors to rapid intensity change. Using ORB alone, binary classifiers identifying the presence (versus absence) of such intensity change events can achieve accuracy comparable to classifiers using environmental predictors alone, with a combined predictor set improving classification accuracy in some settings. More complex nonlinear machine learning methods did not perform better than the linear logistic lasso model for current data.
Submitted 3 August, 2020; v1 submitted 25 November, 2019;
originally announced November 2019.
-
Pushing the Limit: A Hybrid Parallel Implementation of the Multi-resolution Approximation for Massive Data
Authors:
Huang Huang,
Lewis R. Blake,
Dorit M. Hammerling
Abstract:
The multi-resolution approximation (MRA) of Gaussian processes was recently proposed to conduct likelihood-based inference for massive spatial data sets. An advantage of the methodology is that it can be parallelized. We implemented the MRA in C++ for both serial and parallel versions. In the parallel implementation, we use a hybrid parallelism that employs both distributed and shared memory computing for communications between and within nodes by using the Message Passing Interface (MPI) and OpenMP, respectively. The performance of the serial code is compared between the C++ and MATLAB implementations over a small data set on a personal laptop. The C++ parallel program is further carefully studied under different configurations by applications to data sets from around a tenth of a million to 47 million observations. We show the practicality of this implementation by demonstrating that we can get quick inference for massive real-world data sets. The serial and parallel C++ code can be found at https://github.com/hhuang90.
Submitted 5 May, 2019; v1 submitted 30 April, 2019;
originally announced May 2019.
-
Marginally Parametrized Spatio-Temporal Models and Stepwise Maximum Likelihood Estimation
Authors:
Matthew Edwards,
Stefano Castruccio,
Dorit Hammerling
Abstract:
In order to learn the complex features of large spatio-temporal data, models with large parameter sets are often required. However, estimating a large number of parameters is often infeasible due to the computational and memory costs of maximum likelihood estimation (MLE). We introduce the class of marginally parametrized (MP) models, where inference can be performed efficiently with a sequence of marginal (estimated) likelihood functions via stepwise maximum likelihood estimation (SMLE). We provide the conditions under which the stepwise estimators are consistent, and we prove that this class of models includes the diagonal vector autoregressive moving average model. We demonstrate that the parameters of this model can be obtained at least three orders of magnitude faster using SMLE compared to MLE, with only a small loss in statistical efficiency. We apply an MP model to a spatio-temporal global climate data set (in order to learn complex features of interest to climate scientists) consisting of over five million data points, and we demonstrate how estimation can be performed in less than an hour on a laptop.
Submitted 29 June, 2018;
originally announced June 2018.
-
Modeling and emulation of nonstationary Gaussian fields
Authors:
Douglas Nychka,
Dorit Hammerling,
Mitchell Krock,
Ashton Wiens
Abstract:
Geophysical and other natural processes often exhibit non-stationary covariances, and this feature is important to take into account for statistical models that attempt to emulate the physical process. A convolution-based model is used to represent non-stationary Gaussian processes that allows for variation in the correlation range and variance of the process across space. Application of this model has two steps: windowed estimates of the covariance function under the assumption of local stationarity, and encoding the local estimates into a single spatial process model that allows for efficient simulation. Specifically, we give evidence to show that non-stationary covariance functions based on the Matérn family can be reproduced by the LatticeKrig model, a flexible, multi-resolution representation of Gaussian processes. We propose to fit locally stationary models based on the Matérn covariance and then assemble these estimates into a single, global LatticeKrig model. One advantage of the LatticeKrig model is that it is efficient for simulating non-stationary fields even at $10^5$ locations. This work is motivated by the interest in emulating spatial fields derived from numerical model simulations such as Earth system models. We successfully apply these ideas to emulate fields that describe the uncertainty in the pattern scaling of mean summer (JJA) surface temperature from a series of climate model experiments. This example is significant because it emulates tens of thousands of locations, typical in geophysical model fields, and leverages embarrassingly parallel computation to speed up the local covariance fitting.
Submitted 21 November, 2017;
originally announced November 2017.
-
A Case Study Competition Among Methods for Analyzing Large Spatial Data
Authors:
Matthew J. Heaton,
Abhirup Datta,
Andrew Finley,
Reinhard Furrer,
Rajarshi Guhaniyogi,
Florian Gerber,
Robert B. Gramacy,
Dorit Hammerling,
Matthias Katzfuss,
Finn Lindgren,
Douglas W. Nychka,
Furong Sun,
Andrew Zammit-Mangion
Abstract:
The Gaussian process is an indispensable tool for spatial data analysts. The onset of the "big data" era, however, has led to the traditional Gaussian process being computationally infeasible for modern spatial data. As such, various alternatives to the full Gaussian process that are more amenable to handling big spatial data have been proposed. These modern methods often exploit low rank structures and/or multi-core and multi-threaded computing environments to facilitate computation. This study provides, first, an introductory overview of several methods for analyzing large spatial data. Second, this study describes the results of a predictive competition among the described methods as implemented by different groups with strong expertise in the methodology. Specifically, each research group was provided with two training datasets (one simulated and one observed) along with a set of prediction locations. Each group then wrote their own implementation of their method to produce predictions at the given locations, and each implementation was subsequently run on a common computing environment. The methods were then compared in terms of various predictive diagnostics. Supplementary materials regarding implementation details of the methods and code are available for this article online.
Submitted 25 April, 2018; v1 submitted 13 October, 2017;
originally announced October 2017.
-
Compression and Conditional Emulation of Climate Model Output
Authors:
Joseph Guinness,
Dorit Hammerling
Abstract:
Numerical climate model simulations run at high spatial and temporal resolutions generate massive quantities of data. As our computing capabilities continue to increase, storing all of the data is not sustainable, and thus it is important to develop methods for representing the full datasets by smaller compressed versions. We propose a statistical compression and decompression algorithm based on storing a set of summary statistics as well as a statistical model describing the conditional distribution of the full dataset given the summary statistics. The statistical model can be used to generate realizations representing the full dataset, along with characterizations of the uncertainties in the generated data. Thus, the methods are capable of both compression and conditional emulation of the climate models. Considerable attention is paid to accurately modeling the original dataset--one year of daily mean temperature data--particularly with regard to the inherent spatial nonstationarity in global fields, and to determining the statistics to be stored, so that the variation in the original data can be closely captured, while allowing for fast decompression and conditional emulation on modest computers.
Submitted 19 February, 2018; v1 submitted 25 May, 2016;
originally announced May 2016.
-
Parallel inference for massive distributed spatial data using low-rank models
Authors:
Matthias Katzfuss,
Dorit Hammerling
Abstract:
Due to rapid data growth, statistical analysis of massive datasets often has to be carried out in a distributed fashion, either because several datasets stored in separate physical locations are all relevant to a given problem, or simply to achieve faster (parallel) computation through a divide-and-conquer scheme. In both cases, the challenge is to obtain valid inference that does not require processing all data at a single central computing node. We show that for a very widely used class of spatial low-rank models, which can be written as a linear combination of spatial basis functions plus a fine-scale-variation component, parallel spatial inference and prediction for massive distributed data can be carried out exactly, meaning that the results are the same as for a traditional, non-distributed analysis. The communication cost of our distributed algorithms does not depend on the number of data points. After extending our results to the spatio-temporal case, we illustrate our methodology by carrying out distributed spatio-temporal particle filtering inference on total precipitable water measured by three different satellite sensor systems.
Submitted 5 February, 2016; v1 submitted 6 February, 2014;
originally announced February 2014.