-
SNSeg: An R Package for Time Series Segmentation via Self-Normalization
Authors:
Shubo Sun,
Zifeng Zhao,
Feiyu Jiang,
Xiaofeng Shao
Abstract:
Time series segmentation aims to identify potential change-points in a sequence of temporally dependent data, so that the original sequence can be partitioned into several homogeneous subsequences. It is useful for modeling and predicting non-stationary time series and is widely applied in natural and social sciences. Existing segmentation methods primarily focus on only one type of parameter change, such as mean or variance, and they typically depend on laborious tuning or smoothing parameters, which can be challenging to choose in practice. The self-normalization based change-point estimation framework SNCP by Zhao et al. (2022), however, offers users more flexibility and convenience as it allows for change-point estimation of different types of parameters (e.g. mean, variance, quantile and autocovariance) in a unified fashion, and requires effortless tuning. In this paper, the R package SNSeg is introduced to implement SNCP for segmentation of univariate and multivariate time series. An extension of SNCP, named SNHD, is also designed and implemented for change-point estimation in the mean vector of high-dimensional time series. The estimated change-points as well as segmented time series are available with graphical tools. Detailed examples of SNSeg are given in simulations of multivariate autoregressive processes with change-points.
Submitted 10 April, 2024;
originally announced April 2024.
-
Modern extreme value statistics for Utopian extremes
Authors:
Jordan Richards,
Noura Alotaibi,
Daniela Cisneros,
Yan Gong,
Matheus B. Guerrero,
Paolo Redondo,
Xuanjie Shao
Abstract:
Capturing the extremal behaviour of data often requires bespoke marginal and dependence models which are grounded in rigorous asymptotic theory, and hence provide reliable extrapolation into the upper tails of the data-generating distribution. We present a toolbox of four methodological frameworks, motivated by modern extreme value theory, that can be used to accurately estimate extreme exceedance probabilities or the corresponding level in either a univariate or multivariate setting. Our frameworks were used to facilitate the winning contribution of Team Yalla to the EVA (2023) Conference Data Challenge, which was organised for the 13$^\text{th}$ International Conference on Extreme Value Analysis. This competition comprised seven teams competing across four separate sub-challenges, with each requiring the modelling of data simulated from known, yet highly complex, statistical distributions, and extrapolation far beyond the range of the available samples in order to predict probabilities of extreme events. Data were constructed to be representative of real environmental data, sampled from the fantasy country of "Utopia".
Submitted 1 May, 2024; v1 submitted 18 November, 2023;
originally announced November 2023.
-
Change-point Inference for High-dimensional Heteroscedastic Data
Authors:
Teng Wu,
Stanislav Volgushev,
Xiaofeng Shao
Abstract:
We propose a bootstrap-based test to detect a mean shift in a sequence of high-dimensional observations with unknown time-varying heteroscedasticity. The proposed test builds on the U-statistic based approach in Wang et al. (2022), targets a dense alternative, and adopts a wild bootstrap procedure to generate critical values. The bootstrap-based test is free of tuning parameters and is capable of accommodating unconditional time varying heteroscedasticity in the high-dimensional observations, as demonstrated in our theory and simulations. Theoretically, we justify the bootstrap consistency by using the recently proposed unconditional approach in Bucher and Kojadinovic (2019). Extensions to testing for multiple change-points and estimation using wild binary segmentation are also presented. Numerical simulations demonstrate the robustness of the proposed testing and estimation procedures with respect to different kinds of time-varying heteroscedasticity.
Submitted 15 November, 2023;
originally announced November 2023.
-
Two-Sample and Change-Point Inference for Non-Euclidean Valued Time Series
Authors:
Feiyu Jiang,
Changbo Zhu,
Xiaofeng Shao
Abstract:
Data objects taking value in a general metric space have become increasingly common in modern data analysis. In this paper, we study two important statistical inference problems, namely, two-sample testing and change-point detection, for such non-Euclidean data under temporal dependence. Typical examples of non-Euclidean valued time series include yearly mortality distributions, time-varying networks, and covariance matrix time series. To accommodate unknown temporal dependence, we advance the self-normalization (SN) technique (Shao, 2010) to the inference of non-Euclidean time series, which is substantially different from the existing SN-based inference for functional time series that reside in Hilbert space (Zhang et al., 2011). Theoretically, we propose new regularity conditions that could be easier to check than those in the recent literature, and derive the limiting distributions of the proposed test statistics under both null and local alternatives. For the change-point detection problem, we also establish the consistency of the change-point location estimator, and combine our proposed change-point test with wild binary segmentation to perform multiple change-point estimation. Numerical simulations demonstrate the effectiveness and robustness of our proposed tests compared with existing methods in the literature. Finally, we apply our tests to two-sample inference in mortality data and change-point detection in cryptocurrency data.
Submitted 9 July, 2023;
originally announced July 2023.
-
Slicing-free Inverse Regression in High-dimensional Sufficient Dimension Reduction
Authors:
Qing Mai,
Xiaofeng Shao,
Runmin Wang,
Xin Zhang
Abstract:
Sliced inverse regression (SIR, Li 1991) is a pioneering work and the most recognized method in sufficient dimension reduction. While promising progress has been made in theory and methods of high-dimensional SIR, two remaining challenges are still nagging high-dimensional multivariate applications. First, choosing the number of slices in SIR is a difficult problem, and it depends on the sample size, the distribution of variables, and other practical considerations. Second, the extension of SIR from univariate response to multivariate is not trivial. Targeting the same dimension reduction subspace as SIR, we propose a new slicing-free method that provides a unified solution to sufficient dimension reduction with high-dimensional covariates and univariate or multivariate response. We achieve this by adopting the recently developed martingale difference divergence matrix (MDDM, Lee & Shao 2018) and penalized eigen-decomposition algorithms. To establish the consistency of our method with a high-dimensional predictor and a multivariate response, we develop a new concentration inequality for sample MDDM around its population counterpart using theories for U-statistics, which may be of independent interest. Simulations and real data analysis demonstrate the favorable finite sample performance of the proposed method.
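To make the MDDM ingredient mentioned in this abstract more concrete, here is a minimal sketch of a sample MDDM, assuming the population definition MDDM(V|U) = -E[(V - EV)(V' - EV')^T |U - U'|] and its plug-in (V-statistic) estimate. The function name `sample_mddm` and the toy data are hypothetical, and the penalized eigen-decomposition step of the paper is not shown.

```python
import numpy as np

def sample_mddm(V, U):
    """Plug-in estimate of MDDM(V | U) (assumed form, see lead-in).

    V: (n, p) array, the variable whose conditional mean is of interest.
    U: (n,) or (n, q) array, the conditioning variable.
    Returns a (p, p) symmetric matrix.
    """
    V = np.asarray(V, dtype=float)
    U = np.asarray(U, dtype=float)
    if V.ndim == 1:
        V = V[:, None]
    if U.ndim == 1:
        U = U[:, None]
    n = V.shape[0]
    Vc = V - V.mean(axis=0)  # centre each column of V
    # Pairwise Euclidean distances |U_h - U_l|
    D = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)
    return -(Vc.T @ D @ Vc) / n**2

# Toy usage: the first column of V depends on U in mean, the second does not.
rng = np.random.default_rng(0)
U = rng.normal(size=50)
V = np.column_stack([U**2 + 0.1 * rng.normal(size=50), rng.normal(size=50)])
M = sample_mddm(V, U)
```

In a slicing-free SIR setting one would presumably take V to be the covariates and U the response, so that the leading eigenvectors of the matrix carry the dimension-reduction directions; that reading is an interpretation of the abstract, not a statement of the authors' algorithm.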
Submitted 12 April, 2023;
originally announced April 2023.
-
Dimension-agnostic Change Point Detection
Authors:
Hanjia Gao,
Runmin Wang,
Xiaofeng Shao
Abstract:
Change point testing for high-dimensional data has attracted a lot of attention in statistics and machine learning owing to the emergence of high-dimensional data with structural breaks from many fields. In practice, when the dimension is less than the sample size but is not small, it is often unclear whether a method that is tailored to high-dimensional data or simply a classical method that is developed and justified for low-dimensional data is preferred. In addition, the methods designed for low-dimensional data may not work well in the high-dimensional environment and vice versa. In this paper, we propose a dimension-agnostic testing procedure targeting a single change point in the mean of a multivariate time series. Specifically, we can show that the limiting null distribution for our test statistic is the same regardless of the dimensionality and the magnitude of cross-sectional dependence. The power analysis is also conducted to understand the large sample behavior of the proposed test. Through Monte Carlo simulations and a real data illustration, we demonstrate that the finite sample results strongly corroborate the theory and suggest that the proposed test can be used as a benchmark for change-point detection of time series of low, medium, and high dimensions.
Submitted 3 December, 2023; v1 submitted 19 March, 2023;
originally announced March 2023.
-
Adaptive Testing for High-dimensional Data
Authors:
Yangfan Zhang,
Runmin Wang,
Xiaofeng Shao
Abstract:
In this article, we propose a class of $L_q$-norm based U-statistics for a family of global testing problems related to high-dimensional data. This includes testing of mean vector and its spatial sign, simultaneous testing of linear model coefficients, and testing of component-wise independence for high-dimensional observations, among others. Under the null hypothesis, we derive asymptotic normality and independence between $L_q$-norm based U-statistics for several $q$s under mild moment and cumulant conditions. A simple combination of two studentized $L_q$-based test statistics via their $p$-values is proposed and is shown to attain great power against alternatives of different sparsity. Our work is a substantial extension of He et al. (2021), which is mostly focused on mean and covariance testing, and we manage to provide a general treatment of asymptotic independence of $L_q$-norm based U-statistics for a wide class of kernels. To alleviate the computation burden, we introduce a variant of the proposed U-statistics by using the monotone indices in the summation, resulting in a U-statistic with asymmetric kernel. A dynamic programming method is introduced to reduce the computational cost from $O(n^{qr})$, which is required for the calculation of the full U-statistic, to $O(n^r)$ where $r$ is the order of the kernel. Numerical studies further corroborate the advantage of the proposed adaptive test as compared to some existing competitors.
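The combination of test statistics via their $p$-values can be illustrated with a minimal sketch. It assumes a minimum-$p$ rule: if $m$ $p$-values are independent and uniform under the null, then P(min p <= x) = 1 - (1 - x)^m, so the adjusted $p$-value is 1 - (1 - min p)^m. The function name `combine_min_p` is hypothetical, and the abstract does not confirm that this is the exact rule used.

```python
def combine_min_p(p_values):
    """Combine independent p-values through their minimum (assumed rule)."""
    p = list(p_values)
    if not p:
        raise ValueError("need at least one p-value")
    if any(not (0.0 <= v <= 1.0) for v in p):
        raise ValueError("p-values must lie in [0, 1]")
    m = len(p)
    # Distribution of the minimum of m independent Uniform(0, 1) variables
    return 1.0 - (1.0 - min(p)) ** m

# Example: one small p-value (say from q = 2) and one large (say from q = 6)
p_adaptive = combine_min_p([0.01, 0.5])
```

The appeal of such a rule is that the combined test rejects whenever either component test is strongly significant, which is how power against both sparse and dense alternatives can arise.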
Submitted 14 March, 2023;
originally announced March 2023.
-
Testing Serial Independence of Object-Valued Time Series
Authors:
Feiyu Jiang,
Hanjia Gao,
Xiaofeng Shao
Abstract:
We propose a novel method for testing serial independence of object-valued time series in metric spaces, which is more general than Euclidean or Hilbert spaces. The proposed method is fully nonparametric, free of tuning parameters, and can capture all nonlinear pairwise dependence. The key concept used in this paper is the distance covariance in metric spaces, which is extended to auto distance covariance for object-valued time series. Furthermore, we propose a generalized spectral density function to account for pairwise dependence at all lags and construct a Cramer-von Mises type test statistic. New theoretical arguments are developed to establish the asymptotic behavior of the test statistic. A wild bootstrap is also introduced to obtain the critical values of the non-pivotal limiting null distribution. Extensive numerical simulations and two real data applications are conducted to illustrate the effectiveness and versatility of our proposed method.
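As a rough illustration of the auto distance covariance idea, the sketch below computes a squared sample distance covariance between a series and its lagged copy, using the classical double-centred V-statistic form. The function names, the default metric, and the choice of the V-statistic form are assumptions for illustration; the paper's estimator and its spectral test statistic may differ.

```python
import numpy as np

def double_center(D):
    """Subtract row and column means and add back the grand mean."""
    return (D - D.mean(axis=0, keepdims=True)
              - D.mean(axis=1, keepdims=True) + D.mean())

def auto_distance_cov(x, lag, dist=lambda a, b: abs(a - b)):
    """Squared sample distance covariance between x_t and x_{t+lag}.

    x: sequence of objects; dist: a metric on those objects (hypothetical
    default: absolute difference for real-valued data).
    """
    n = len(x) - lag
    if lag < 0 or n < 2:
        raise ValueError("lag must be non-negative and leave >= 2 pairs")
    head, tail = x[:n], x[lag:lag + n]
    A = double_center(np.array([[dist(a, b) for b in head] for a in head], dtype=float))
    B = double_center(np.array([[dist(a, b) for b in tail] for a in tail], dtype=float))
    return float((A * B).mean())

# Toy usage on a smooth, strongly dependent real-valued series
series = np.sin(np.arange(60) / 5.0)
adcov_lag1 = auto_distance_cov(series, lag=1)
```

Only the metric enters the computation, which is why the same code applies to any object type once `dist` is replaced by a suitable distance.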
Submitted 27 July, 2023; v1 submitted 23 February, 2023;
originally announced February 2023.
-
Statistical inference for high-dimensional spectral density matrix
Authors:
Jinyuan Chang,
Qing Jiang,
Tucker S. McElroy,
Xiaofeng Shao
Abstract:
The spectral density matrix is a fundamental object of interest in time series analysis, and it encodes both contemporary and dynamic linear relationships between component processes of the multivariate system. In this paper we develop novel inference procedures for the spectral density matrix in the high-dimensional setting. Specifically, we introduce a new global testing procedure to test the nullity of the cross-spectral density for a given set of frequencies and across pairs of component indices. For the first time, both Gaussian approximation and parametric bootstrap methodologies are employed to conduct inference for a high-dimensional parameter formulated in the frequency domain, and new technical tools are developed to provide asymptotic guarantees of the size accuracy and power for global testing. We further propose a multiple testing procedure for simultaneously testing the nullity of the cross-spectral density at a given set of frequencies. The method is shown to control the false discovery rate. Both numerical simulations and a real data illustration demonstrate the usefulness of the proposed testing methods.
Submitted 25 February, 2023; v1 submitted 27 December, 2022;
originally announced December 2022.
-
Dynamics of Fecal Coliform Bacteria along Canada's Coast
Authors:
Shuai You,
Xiaolin Huang,
Li Xing,
Mary Lesperance,
Charles LeBlanc,
Paul Moccia,
Vincent Mercier,
Xiaojian Shao,
Youlian Pan,
Xuekui Zhang
Abstract:
The vast coastline provides Canada with a flourishing seafood industry including bivalve shellfish production. To sustain a healthy bivalve molluscan shellfish production, the Canadian Shellfish Sanitation Program was established to monitor the health of shellfish harvesting habitats, and fecal coliform bacteria data have been collected at nearly 15,000 marine sample sites across six coastal provinces in Canada since 1979. We applied Functional Principal Component Analysis and subsequent correlation analyses to find annual variation patterns of bacteria levels at sites in each province. The overall magnitude and the seasonality of fecal contamination were modelled by functional principal component one and two, respectively. The amplitude was related to human and warm-blooded animal activities; the seasonality was strongly correlated with river discharge driven by precipitation and snow melt in British Columbia, but such correlation in provinces along the Atlantic coast could not be properly evaluated due to lack of data during winter.
Submitted 27 November, 2022;
originally announced November 2022.
-
Flexible Modeling of Nonstationary Extremal Dependence using Spatially-Fused LASSO and Ridge Penalties
Authors:
Xuanjie Shao,
Arnab Hazra,
Jordan Richards,
Raphaël Huser
Abstract:
Statistical modeling of a nonstationary spatial extremal dependence structure is challenging. Max-stable processes are common choices for modeling spatially-indexed block maxima, where an assumption of stationarity is usual to make inference feasible. However, this assumption is often unrealistic for data observed over a large or complex domain. We propose a computationally-efficient method for estimating extremal dependence using a globally nonstationary, but locally-stationary, max-stable process by exploiting nonstationary kernel convolutions. We divide the spatial domain into a fine grid of subregions, assign each of them its own dependence parameters, and use LASSO ($L_1$) or ridge ($L_2$) penalties to obtain spatially-smooth parameter estimates. We then develop a novel data-driven algorithm to merge homogeneous neighboring subregions. The algorithm facilitates model parsimony and interpretability. To make our model suitable for high-dimensional data, we exploit a pairwise likelihood to draw inferences and discuss computational and statistical efficiency. An extensive simulation study demonstrates the superior performance of our proposed model and the subregion-merging algorithm over the approaches that either do not model nonstationarity or do not update the domain partition. We apply our proposed method to model monthly maximum temperatures at over 1400 sites in Nepal and the surrounding Himalayan and sub-Himalayan regions; we again observe significant improvements in model fit compared to a stationary process and a nonstationary process without subregion-merging. Furthermore, we demonstrate that the estimated merged partition is interpretable from a geographic perspective and leads to better model diagnostics by adequately reducing the number of subregion-specific parameters.
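The spatially-fused penalties described in this abstract can be sketched as a sum over neighboring subregion pairs of the absolute ($L_1$) or squared ($L_2$) difference between their parameters. The snippet below is a minimal illustration under that assumption; `fused_penalty`, the neighbor list, and the single tuning parameter `lam` are hypothetical, and the exact penalty used in the paper is not given here.

```python
import numpy as np

def fused_penalty(theta, neighbors, lam, kind="lasso"):
    """Penalty on differences between parameters of neighboring subregions.

    theta: one parameter value per subregion.
    neighbors: list of (i, j) index pairs of adjacent subregions.
    lam: non-negative tuning parameter.
    """
    theta = np.asarray(theta, dtype=float)
    diffs = np.array([theta[i] - theta[j] for i, j in neighbors])
    if kind == "lasso":   # L1: encourages exactly equal neighbors
        return lam * np.abs(diffs).sum()
    if kind == "ridge":   # L2: encourages smoothly varying neighbors
        return lam * (diffs ** 2).sum()
    raise ValueError("kind must be 'lasso' or 'ridge'")

# Three subregions on a line: 0 - 1 - 2
pen_l1 = fused_penalty([1.0, 1.0, 3.0], [(0, 1), (1, 2)], lam=0.5, kind="lasso")
pen_l2 = fused_penalty([1.0, 1.0, 3.0], [(0, 1), (1, 2)], lam=0.5, kind="ridge")
```

In a penalized fit this term would be subtracted from the (pairwise) log-likelihood, so that neighboring subregions with similar dependence behaviour are pulled towards common parameter values.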
Submitted 30 April, 2024; v1 submitted 11 October, 2022;
originally announced October 2022.
-
Testing the martingale difference hypothesis in high dimension
Authors:
Jinyuan Chang,
Qing Jiang,
Xiaofeng Shao
Abstract:
In this paper, we consider testing the martingale difference hypothesis for high-dimensional time series. Our test is built on the sum of squares of the element-wise max-norm of the proposed matrix-valued nonlinear dependence measure at different lags. To conduct the inference, we approximate the null distribution of our test statistic by Gaussian approximation and provide a simulation-based approach to generate critical values. The asymptotic behavior of the test statistic under the alternative is also studied. Our approach is nonparametric as the null hypothesis only assumes the time series concerned is martingale difference without specifying any parametric forms of its conditional moments. As an advantage of Gaussian approximation, our test is robust to the cross-series dependence of unknown magnitude. To the best of our knowledge, this is the first valid test for the martingale difference hypothesis that not only allows for large dimension but also captures nonlinear serial dependence. The practical usefulness of our test is illustrated via simulation and a real data analysis. The test is implemented in a user-friendly R-function.
Submitted 30 September, 2022; v1 submitted 10 September, 2022;
originally announced September 2022.
-
Robust Inference for Change Points in High Dimension
Authors:
Feiyu Jiang,
Runmin Wang,
Xiaofeng Shao
Abstract:
This paper proposes a new test for a change point in the mean of high-dimensional data based on the spatial sign and self-normalization. The test is easy to implement with no tuning parameters, robust to heavy-tailedness and theoretically justified with both fixed-$n$ and sequential asymptotics under both null and alternatives, where $n$ is the sample size. We demonstrate that the fixed-$n$ asymptotics provide a better approximation to the finite sample distribution and thus should be preferred in both testing and testing-based estimation. To estimate the number and locations when multiple change-points are present, we propose to combine the p-value under the fixed-$n$ asymptotics with the seeded binary segmentation (SBS) algorithm. Through numerical experiments, we show that the spatial sign based procedures are robust with respect to the heavy-tailedness and strong coordinate-wise dependence, whereas their non-robust counterparts proposed in Wang et al. (2022) appear to under-perform. A real data example is also provided to illustrate the robustness and broad applicability of the proposed test and its corresponding estimation algorithm.
Submitted 6 June, 2022;
originally announced June 2022.
-
On Variance Estimation of Random Forests with Infinite-Order U-statistics
Authors:
Tianning Xu,
Ruoqing Zhu,
Xiaofeng Shao
Abstract:
Infinite-order U-statistics (IOUS) have been used extensively on subbagging ensemble learning algorithms such as random forests to quantify their uncertainty. While normality results of IOUS have been studied extensively, their variance estimation approaches and theoretical properties remain mostly unexplored. Existing approaches mainly utilize the leading term dominance property in the Hoeffding decomposition. However, such a view usually leads to biased estimation when the kernel size is large or the sample size is small. On the other hand, while several unbiased estimators exist in the literature, their relationships and theoretical properties, especially the ratio consistency, have never been studied. These limitations lead to unguaranteed performance of the constructed confidence intervals. To bridge these gaps in the literature, we propose a new view of the Hoeffding decomposition for variance estimation that leads to an unbiased estimator. Instead of leading term dominance, our view utilizes the dominance of the peak region. Moreover, we establish the connection and equivalence of our estimator with several existing unbiased variance estimators. Theoretically, we are the first to establish the ratio consistency of such a variance estimator, which justifies the coverage rate of confidence intervals constructed from random forests. Numerically, we further propose a local smoothing procedure to improve the estimator's finite sample performance. Extensive simulation studies show that our estimators enjoy lower bias and achieve targeted coverage rates.
Submitted 14 February, 2023; v1 submitted 17 February, 2022;
originally announced February 2022.
-
Segmenting Time Series via Self-Normalization
Authors:
Zifeng Zhao,
Feiyu Jiang,
Xiaofeng Shao
Abstract:
We propose a novel and unified framework for change-point estimation in multivariate time series. The proposed method is fully nonparametric, enjoys effortless tuning and is robust to temporal dependence. One salient and distinct feature of the proposed method is its versatility, where it allows change-point detection for a broad class of parameters (such as mean, variance, correlation and quantile) in a unified fashion. At the core of our method, we couple the self-normalization (SN) based tests with a novel nested local-window segmentation algorithm, which seems new in the growing literature of change-point analysis. Due to the presence of an inconsistent long-run variance estimator in the SN test, non-standard theoretical arguments are further developed to derive the consistency and convergence rate of the proposed SN-based change-point detection method. Extensive numerical experiments and relevant real data analysis are conducted to illustrate the effectiveness and broad applicability of our proposed method in comparison with state-of-the-art approaches in the literature.
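To give a feel for the SN-based tests mentioned in this abstract, here is a minimal sketch of a self-normalized CUSUM statistic for a single change in the mean of a univariate series. It assumes the contrast-over-self-normalizer form max_k D_n(k)^2 / V_n(k), where V_n(k) is built from recursive partial sums within the two segments instead of a long-run variance estimate. The function name `sn_cusum_mean` is hypothetical, and the nested local-window segmentation algorithm of the paper is not implemented here.

```python
import numpy as np

def sn_cusum_mean(x):
    """Self-normalized CUSUM statistic for one change in mean (sketch).

    Returns (max statistic, estimated change-point k_hat), where k_hat is
    the number of observations in the first segment. Assumes len(x) >= 3
    and a non-constant series so the self-normalizer is positive.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    stats = np.empty(n - 1)
    for k in range(1, n):
        left, right = x[:k], x[k:]
        # Scaled contrast between the two segment means
        d = k * (n - k) / n**1.5 * (left.mean() - right.mean())
        # Forward partial sums of the left segment, centred at its mean
        fwd = np.cumsum(left) - np.arange(1, k + 1) / k * left.sum()
        # Backward partial sums of the right segment, centred at its mean
        bwd = np.cumsum(right[::-1]) - np.arange(1, n - k + 1) / (n - k) * right.sum()
        v = (np.sum(fwd**2) + np.sum(bwd**2)) / n**2  # self-normalizer
        stats[k - 1] = d**2 / v
    return float(stats.max()), int(np.argmax(stats)) + 1

# Toy usage: mean shifts from 0 to 5 after the 100th observation
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(5.0, 1.0, 100)])
stat, k_hat = sn_cusum_mean(x)
```

No bandwidth or other smoothing parameter appears anywhere in the computation, which is the practical point of self-normalization; critical values would come from the non-standard limiting distribution and are not computed in this sketch.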
Submitted 8 September, 2022; v1 submitted 9 December, 2021;
originally announced December 2021.
-
Adaptive Inference for Change Points in High-Dimensional Data
Authors:
Yangfan Zhang,
Runmin Wang,
Xiaofeng Shao
Abstract:
In this article, we propose a class of test statistics for a change point in the mean of high-dimensional independent data. Our test integrates the U-statistic based approach in a recent work by \cite{hdcp} and the $L_q$-norm based high-dimensional test in \cite{he2018}, and inherits several appealing features such as being tuning parameter free and asymptotic independence for test statistics corresponding to even $q$s. A simple combination of test statistics corresponding to several different $q$s leads to a test with adaptive power property, that is, it can be powerful against both sparse and dense alternatives. On the estimation front, we obtain the convergence rate of the maximizer of our test statistic standardized by sample size when there is one change-point in mean and $q=2$, and propose to combine our tests with a wild binary segmentation (WBS) algorithm to estimate the change-point number and locations when there are multiple change-points. Numerical comparisons using both simulated and real data demonstrate the advantage of our adaptive test and its corresponding estimation method.
Submitted 28 January, 2021;
originally announced January 2021.
-
Adaptive Change Point Monitoring for High-Dimensional Data
Authors:
Teng Wu,
Runmin Wang,
Hao Yan,
Xiaofeng Shao
Abstract:
In this paper, we propose a class of monitoring statistics for a mean shift in a sequence of high-dimensional observations. Inspired by the recent U-statistic based retrospective tests developed by Wang et al. (2019) and Zhang et al. (2020), we advance the U-statistic based approach to the sequential monitoring problem by developing a new adaptive monitoring procedure that can detect both dense and sparse changes in real time. Unlike Wang et al. (2019) and Zhang et al. (2020), where self-normalization was used in their tests, we instead introduce a class of estimators for $q$-norm of the covariance matrix and prove their ratio consistency. To facilitate fast computation, we further develop recursive algorithms to improve the computational efficiency of the monitoring procedure. The advantage of the proposed methodology is demonstrated via simulation studies and real data illustrations.
Submitted 17 January, 2021;
originally announced January 2021.
-
Time Series Analysis of COVID-19 Infection Curve: A Change-Point Perspective
Authors:
Feiyu Jiang,
Zifeng Zhao,
Xiaofeng Shao
Abstract:
In this paper, we model the trajectory of the cumulative confirmed cases and deaths of COVID-19 (in log scale) via a piecewise linear trend model. The model naturally captures the phase transitions of the epidemic growth rate via change-points and further enjoys great interpretability due to its semiparametric nature. On the methodological front, we advance the nascent self-normalization (SN) technique (Shao, 2010) to testing and estimation of a single change-point in the linear trend of a nonstationary time series. We further combine the SN-based change-point test with the NOT algorithm (Baranowski et al., 2019) to achieve multiple change-point estimation. Using the proposed method, we analyze the trajectory of the cumulative COVID-19 cases and deaths for 30 major countries and discover interesting patterns with potentially relevant implications for effectiveness of the pandemic responses by different countries. Furthermore, based on the change-point detection algorithm and a flexible extrapolation function, we design a simple two-stage forecasting scheme for COVID-19 and demonstrate its promising performance in predicting cumulative deaths in the U.S.
Submitted 9 July, 2020;
originally announced July 2020.
-
Fully Asynchronous Policy Evaluation in Distributed Reinforcement Learning over Networks
Authors:
Xingyu Sha,
Jiaqi Zhang,
Keyou You,
Kaiqing Zhang,
Tamer Başar
Abstract:
This paper proposes a \emph{fully asynchronous} scheme for the policy evaluation problem of distributed reinforcement learning (DisRL) over directed peer-to-peer networks. Without waiting for any other node of the network, each node can locally update its value function at any time by using (possibly delayed) information from its neighbors. This is in sharp contrast to the gossip-based scheme where a pair of nodes concurrently update. Though the fully asynchronous setting involves a difficult multi-timescale decision problem, we design a novel stochastic average gradient (SAG) based distributed algorithm and develop a push-pull augmented graph approach to prove its exact convergence at a linear rate of $\mathcal{O}(c^k)$ where $c\in(0,1)$ and $k$ increases by one no matter which node updates. Finally, numerical experiments validate that our method speeds up linearly with respect to the number of nodes, and is robust to straggler nodes.
Submitted 22 January, 2021; v1 submitted 1 March, 2020;
originally announced March 2020.
-
Dating the Break in High-dimensional Data
Authors:
Runmin Wang,
Xiaofeng Shao
Abstract:
This paper is concerned with estimation and inference for the location of a change point in the mean of independent high-dimensional data. Our change point location estimator maximizes a new U-statistic based objective function, and its convergence rate and asymptotic distribution after suitable centering and normalization are obtained under mild assumptions. Our estimator turns out to have better efficiency as compared to the least squares based counterpart in the literature. Based on the asymptotic theory, we construct a confidence interval by plugging in consistent estimates of several quantities in the normalization. We also provide a bootstrap-based confidence interval and state its asymptotic validity under suitable conditions. Through simulation studies, we demonstrate favorable finite sample performance of the new change point location estimator as compared to its least squares based counterpart, and our bootstrap-based confidence intervals, as compared to several existing competitors. The asymptotic theory based on high-dimensional U-statistic is substantially different from those developed in the literature and is of independent interest.
Submitted 10 February, 2020;
originally announced February 2020.
-
Making deep neural networks right for the right scientific reasons by interacting with their explanations
Authors:
Patrick Schramowski,
Wolfgang Stammer,
Stefano Teso,
Anna Brugger,
Xiaoting Shao,
Hans-Georg Luigs,
Anne-Katrin Mahlein,
Kristian Kersting
Abstract:
Deep neural networks have shown excellent performances in many real-world applications. Unfortunately, they may show "Clever Hans"-like behavior -- making use of confounding factors within datasets -- to achieve high performance. In this work, we introduce the novel learning setting of "explanatory interactive learning" (XIL) and illustrate its benefits on a plant phenotyping research task. XIL adds the scientist into the training loop such that she interactively revises the original model via providing feedback on its explanations. Our experimental results demonstrate that XIL can help avoid Clever Hans moments in machine learning and encourages (or discourages, if appropriate) trust in the underlying model.
Submitted 5 March, 2024; v1 submitted 15 January, 2020;
originally announced January 2020.
-
Conditional Sum-Product Networks: Imposing Structure on Deep Probabilistic Architectures
Authors:
Xiaoting Shao,
Alejandro Molina,
Antonio Vergari,
Karl Stelzner,
Robert Peharz,
Thomas Liebig,
Kristian Kersting
Abstract:
Probabilistic graphical models are a central tool in AI; however, they are generally not as expressive as deep neural models, and inference is notoriously hard and slow. In contrast, deep probabilistic models such as sum-product networks (SPNs) capture joint distributions in a tractable fashion, but still lack the expressive power of intractable models based on deep neural networks. Therefore, we introduce conditional SPNs (CSPNs), conditional density estimators for multivariate and potentially hybrid domains which allow harnessing the expressive power of neural networks while still maintaining tractability guarantees. One way to implement CSPNs is to use an existing SPN structure and condition its parameters on the input, e.g., via a deep neural network. This approach, however, might misrepresent the conditional independence structure present in data. Consequently, we also develop a structure-learning approach that derives both the structure and parameters of CSPNs from data. Our experimental evidence demonstrates that CSPNs are competitive with other probabilistic models and yield superior performance on multilabel image classification compared to mean field and mixture density networks. Furthermore, they can successfully be employed as building blocks for structured probabilistic models, such as autoregressive image models.
Submitted 29 September, 2019; v1 submitted 21 May, 2019;
originally announced May 2019.
-
Inference for Change Points in High Dimensional Data via Self-Normalization
Authors:
Runmin Wang,
Changbo Zhu,
Stanislav Volgushev,
Xiaofeng Shao
Abstract:
This article considers change point testing and estimation for a sequence of high-dimensional data. In the case of testing for a mean shift for high-dimensional independent data, we propose a new test which is based on $U$-statistic in Chen and Qin (2010) and utilizes the self-normalization principle [Shao (2010), Shao and Zhang (2010)]. Our test targets dense alternatives in the high-dimensional setting and involves no tuning parameters. To extend to change point testing for high-dimensional time series, we introduce a trimming parameter and formulate a self-normalized test statistic with trimming to accommodate the weak temporal dependence. On the theory front, we derive the limiting distributions of self-normalized test statistics under both the null and alternatives for both independent and dependent high-dimensional data. At the core of our asymptotic theory, we obtain weak convergence of a sequential U-statistic based process for high-dimensional independent data, and weak convergence of sequential trimmed U-statistic based processes for high-dimensional linear processes, both of which are of independent interest. Additionally, we illustrate how our tests can be used in combination with wild binary segmentation to estimate the number and location of multiple change points. Numerical simulations demonstrate the competitiveness of our proposed testing and estimation procedures in comparison with several existing methods in the literature.
Submitted 8 August, 2021; v1 submitted 21 May, 2019;
originally announced May 2019.
-
The $CI$-index: a new index to characterize the scientific output of researchers
Authors:
Xuehua Yin,
Xiuyan Sha,
Chuancun Yin
Abstract:
We propose a simple new index, named the $CI$-index, based on the Choquet integral to characterize the scientific output of researchers. This index is an improvement of the $A$-index and $R$-index and has the notable feature that highly cited papers receive high weights and lowly cited papers receive low weights. In applications many researchers may have the same $h$-index, $g$-index or $R$-index. The $CI$-index can provide an effective method of distinguishing among such researchers.
Submitted 15 May, 2019; v1 submitted 15 March, 2019;
originally announced March 2019.
-
Interpoint Distance Based Two Sample Tests in High Dimension
Authors:
Changbo Zhu,
Xiaofeng Shao
Abstract:
In this paper, we study a class of two sample test statistics based on inter-point distances in the high dimensional and low sample size setting. Our test statistics include the well-known energy distance and maximum mean discrepancy with Gaussian and Laplacian kernels, and the critical values are obtained via permutations. We show that all these tests are inconsistent when the two high dimensional distributions correspond to the same marginal distributions but differ in other aspects of the distributions. The tests based on energy distance and maximum mean discrepancy are mainly targeting the differences between marginal means and variances, whereas the test based on $L^1$-distance can capture the difference in marginal distributions. Our theory sheds new light on the limitation of inter-point distance based tests, the impact of different distance metrics, and the behavior of permutation tests in high dimension. Some simulation results and a real data illustration are also presented to corroborate our theoretical findings.
Submitted 10 April, 2020; v1 submitted 19 February, 2019;
originally announced February 2019.
-
Distance-based and RKHS-based Dependence Metrics in High Dimension
Authors:
Changbo Zhu,
Shun Yao,
Xianyang Zhang,
Xiaofeng Shao
Abstract:
In this paper, we study distance covariance, Hilbert-Schmidt covariance (aka Hilbert-Schmidt independence criterion [Gretton et al. (2008)]) and related independence tests under the high dimensional scenario. We show that the sample distance/Hilbert-Schmidt covariance between two random vectors can be approximated by the sum of squared componentwise sample cross-covariances up to an asymptotically constant factor, which indicates that the distance/Hilbert-Schmidt covariance based test can only capture linear dependence in high dimension. As a consequence, the distance correlation based t-test developed by Szekely and Rizzo (2013) for independence is shown to have trivial limiting power when the two random vectors are nonlinearly dependent but component-wisely uncorrelated. This new and surprising phenomenon, which seems to be discovered for the first time, is further confirmed in our simulation study. As a remedy, we propose tests based on an aggregation of marginal sample distance/Hilbert-Schmidt covariances and show their superior power behavior against their joint counterparts in simulations. We further extend the distance correlation based t-test to those based on Hilbert-Schmidt covariance and marginal distance/Hilbert-Schmidt covariance. A novel unified approach is developed to analyze the studentized sample distance/Hilbert-Schmidt covariance as well as the studentized sample marginal distance covariance under both null and alternative hypothesis. Our theoretical and simulation results shed light on the limitation of distance/Hilbert-Schmidt covariance when used jointly in the high dimensional setting and suggest the aggregation of marginal distance/Hilbert-Schmidt covariance as a useful alternative.
Submitted 8 February, 2019;
originally announced February 2019.
-
Semantic Segmentation for Urban Planning Maps based on U-Net
Authors:
Zhiling Guo,
Hiroaki Shengoku,
Guangming Wu,
Qi Chen,
Wei Yuan,
Xiaodan Shi,
Xiaowei Shao,
Yongwei Xu,
Ryosuke Shibasaki
Abstract:
The automatic digitizing of paper maps is a significant and challenging task for both academia and industry. As an important procedure of map digitizing, the semantic segmentation section mainly relies on manual visual interpretation with low efficiency. In this study, we select urban planning maps as a representative sample and investigate the feasibility of utilizing U-shape fully convolutional based architecture to perform end-to-end map semantic segmentation. The experimental results obtained from the test area in Shibuya district, Tokyo, demonstrate that our proposed method could achieve a very high Jaccard similarity coefficient of 93.63% and an overall accuracy of 99.36%. For implementation on GPGPU and cuDNN, the required processing time for the whole Shibuya district can be less than three minutes. The results indicate the proposed method can serve as a viable tool for urban planning map semantic segmentation task with high accuracy and efficiency.
Submitted 30 September, 2018; v1 submitted 28 September, 2018;
originally announced September 2018.
-
Testing mutual independence in high dimension via distance covariance
Authors:
Shun Yao,
Xianyang Zhang,
Xiaofeng Shao
Abstract:
In this paper, we introduce a ${\mathcal L}_2$ type test for testing mutual independence and banded dependence structure for high dimensional data. The test is constructed based on the pairwise distance covariance and it accounts for the non-linear and non-monotone dependences among the data, which cannot be fully captured by the existing tests based on either Pearson correlation or rank correlation. Our test can be conveniently implemented in practice as the limiting null distribution of the test statistic is shown to be standard normal. It exhibits excellent finite sample performance in our simulation studies even when the sample size is small albeit dimension is high, and is shown to successfully identify nonlinear dependence in empirical data analysis. On the theory side, asymptotic normality of our test statistic is shown under quite mild moment assumptions and with little restriction on the growth rate of the dimension as a function of sample size. As a demonstration of good power properties for our distance covariance based test, we further show that an infeasible version of our test statistic has the rate optimality in the class of Gaussian distribution with equal correlation.
Submitted 18 September, 2017; v1 submitted 29 September, 2016;
originally announced September 2016.
-
A subsampled double bootstrap for massive data
Authors:
Srijan Sengupta,
Stanislav Volgushev,
Xiaofeng Shao
Abstract:
The bootstrap is a popular and powerful method for assessing precision of estimators and inferential methods. However, for massive datasets which are increasingly prevalent, the bootstrap becomes prohibitively costly in computation and its feasibility is questionable even with modern parallel computing platforms. Recently Kleiner, Talwalkar, Sarkar, and Jordan (2014) proposed a method called BLB (Bag of Little Bootstraps) for massive data which is more computationally scalable with little sacrifice of statistical accuracy. Building on BLB and the idea of fast double bootstrap, we propose a new resampling method, the subsampled double bootstrap, for both independent data and time series data. We establish consistency of the subsampled double bootstrap under mild conditions for both independent and dependent cases. Methodologically, the subsampled double bootstrap is superior to BLB in terms of running time, more sample coverage and automatic implementation with fewer tuning parameters for a given time budget. Its advantage relative to BLB and bootstrap is also demonstrated in numerical simulations and a data illustration.
Submitted 5 August, 2015;
originally announced August 2015.
-
On the Coverage Bound Problem of Empirical Likelihood Methods For Time Series
Authors:
Xianyang Zhang,
Xiaofeng Shao
Abstract:
The upper bounds on the coverage probabilities of the confidence regions based on blockwise empirical likelihood [Kitamura (1997)] and nonstandard expansive empirical likelihood [Nordman et al. (2013)] methods for time series data are investigated via studying the probability for the violation of the convex hull constraint. The large sample bounds are derived on the basis of the pivotal limit of the blockwise empirical log-likelihood ratio obtained under the fixed-b asymptotics, which has been recently shown to provide a more accurate approximation to the finite sample distribution than the conventional chi-square approximation. Our theoretical and numerical findings suggest that both the finite sample and large sample upper bounds for coverage probabilities are strictly less than one and the blockwise empirical likelihood confidence region can exhibit serious undercoverage when (i) the dimension of moment conditions is moderate or large; (ii) the time series dependence is positively strong; or (iii) the block size is large relative to sample size. A similar finite sample coverage problem occurs for the nonstandard expansive empirical likelihood. To alleviate the coverage bound problem, we propose to penalize both empirical likelihood methods by relaxing the convex hull constraint. Numerical simulations and data illustration demonstrate the effectiveness of our proposed remedies in terms of delivering confidence sets with more accurate coverage.
Submitted 31 July, 2014; v1 submitted 20 January, 2014;
originally announced January 2014.
-
A self-normalized approach to confidence interval construction in time series
Authors:
Xiaofeng Shao
Abstract:
We propose a new method to construct confidence intervals for quantities that are associated with a stationary time series, which avoids direct estimation of the asymptotic variances. Unlike the existing tuning-parameter-dependent approaches, our method has the attractive convenience of being free of choosing any user-chosen number or smoothing parameter. The interval is constructed on the basis of an asymptotically distribution-free self-normalized statistic, in which the normalizing matrix is computed using recursive estimates. Under mild conditions, we establish the theoretical validity of our method for a broad class of statistics that are functionals of the empirical distribution of fixed or growing dimension. From a practical point of view, our method is conceptually simple, easy to implement and can be readily used by the practitioner. Monte-Carlo simulations are conducted to compare the finite sample performance of the new method with those delivered by the normal approximation and the block bootstrap approach.
Submitted 12 May, 2010;
originally announced May 2010.
-
Testing for white noise under unknown dependence and its applications to goodness-of-fit for time series models
Authors:
Xiaofeng Shao
Abstract:
Testing for white noise has been well studied in the literature of econometrics and statistics. For most of the proposed test statistics, such as the well-known Box-Pierce's test statistic with fixed lag truncation number, the asymptotic null distributions are obtained under independent and identically distributed assumptions and may not be valid for the dependent white noise. Due to recent popularity of conditional heteroscedastic models (e.g., GARCH models), which imply nonlinear dependence with zero autocorrelation, there is a need to understand the asymptotic properties of the existing test statistics under unknown dependence. In this paper, we showed that the asymptotic null distribution of Box-Pierce's test statistic with general weights still holds under unknown weak dependence so long as the lag truncation number grows at an appropriate rate with increasing sample size. Further applications to diagnostic checking of the ARMA and FARIMA models with dependent white noise errors are also addressed. Our results go beyond earlier ones by allowing non-Gaussian and conditional heteroscedastic errors in the ARMA and FARIMA models and provide theoretical support for some empirical findings reported in the literature.
Submitted 29 June, 2009;
originally announced June 2009.
-
Nonstationarity-extended Whittle Estimation
Authors:
Xiaofeng Shao
Abstract:
For long memory time series models with uncorrelated but dependent errors, we establish the asymptotic normality of the Whittle estimator under mild conditions. Our framework includes the widely used FARIMA models with GARCH-type innovations. To cover nonstationary fractionally integrated processes, we extend the idea of Abadir, Distaso and Giraitis (2007, Journal of Econometrics 141, 1353-1384) and develop the nonstationarity-extended Whittle estimation. The resulting estimator is shown to be asymptotically normal and is more efficient than the tapered Whittle estimator. Finally, the results from a small simulation study are presented to corroborate our theoretical findings.
Submitted 18 March, 2009;
originally announced March 2009.