Search | arXiv e-print repository

Automatic Outlier Rectification via Optimal Transport

Authors: Jose Blanchet, Jia** Li, Markus Pelger, Greg Zanotti

Abstract: In this paper, we propose a novel conceptual framework to detect outliers using optimal transport with a concave cost function. Conventional outlier detection approaches typically use a two-stage procedure: first, outliers are detected and removed, and then estimation is performed on the cleaned data. However, this approach does not inform outlier removal with the estimation task, leaving room for… ▽ More In this paper, we propose a novel conceptual framework to detect outliers using optimal transport with a concave cost function. Conventional outlier detection approaches typically use a two-stage procedure: first, outliers are detected and removed, and then estimation is performed on the cleaned data. However, this approach does not inform outlier removal with the estimation task, leaving room for improvement. To address this limitation, we propose an automatic outlier rectification mechanism that integrates rectification and estimation within a joint optimization framework. We take the first step to utilize an optimal transport distance with a concave cost function to construct a rectification set in the space of probability distributions. Then, we select the best distribution within the rectification set to perform the estimation task. Notably, the concave cost function we introduced in this paper is the key to making our estimator effectively identify the outlier during the optimization process. We discuss the fundamental differences between our estimator and optimal transport-based distributionally robust optimization estimator. finally, we demonstrate the effectiveness and superiority of our approach over conventional approaches in extensive simulation and empirical analyses for mean estimation, least absolute regression, and the fitting of option implied volatility surfaces. △ Less

Submitted 20 March, 2024; originally announced March 2024.

arXiv:2308.15627 [pdf, other]

Target PCA: Transfer Learning Large Dimensional Panel Data

Authors: Junting Duan, Markus Pelger, Ruoxuan Xiong

Abstract: This paper develops a novel method to estimate a latent factor model for a large target panel with missing observations by optimally using the information from auxiliary panel data sets. We refer to our estimator as target-PCA. Transfer learning from auxiliary panel data allows us to deal with a large fraction of missing observations and weak signals in the target panel. We show that our estimator… ▽ More This paper develops a novel method to estimate a latent factor model for a large target panel with missing observations by optimally using the information from auxiliary panel data sets. We refer to our estimator as target-PCA. Transfer learning from auxiliary panel data allows us to deal with a large fraction of missing observations and weak signals in the target panel. We show that our estimator is more efficient and can consistently estimate weak factors, which are not identifiable with conventional methods. We provide the asymptotic inferential theory for target-PCA under very general assumptions on the approximate factor model and missing patterns. In an empirical study of imputing data in a mixed-frequency macroeconomic panel, we demonstrate that target-PCA significantly outperforms all benchmark methods. △ Less

Submitted 29 August, 2023; originally announced August 2023.

Comments: Journal of Econometrics, accepted. The Internet Appendix (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4556029) collects the detailed proofs for all the theoretical statements in the main text, the data description, and additional simulation results

arXiv:2202.00871 [pdf, other]

Bayesian Imputation with Optimal Look-Ahead-Bias and Variance Tradeoff

Authors: Jose Blanchet, Fernando Hernandez, Viet Anh Nguyen, Markus Pelger, Xuhui Zhang

Abstract: Missing time-series data is a prevalent problem in many prescriptive analytics models in operations management, healthcare and finance. Imputation methods for time-series data are usually applied to the full panel data with the purpose of training a prescriptive model for a downstream out-of-sample task. For example, the imputation of missing asset returns may be applied before estimating an optim… ▽ More Missing time-series data is a prevalent problem in many prescriptive analytics models in operations management, healthcare and finance. Imputation methods for time-series data are usually applied to the full panel data with the purpose of training a prescriptive model for a downstream out-of-sample task. For example, the imputation of missing asset returns may be applied before estimating an optimal portfolio allocation. However, this practice can result in a look-ahead-bias in the future performance of the downstream task, and there is an inherent trade-off between the look-ahead-bias of using the entire data set for imputation and the larger variance of using only the training portion of the data set for imputation. By connecting layers of information revealed in time, we propose a Bayesian consensus posterior that fuses an arbitrary number of posteriors to optimize the variance and look-ahead-bias trade-off in the imputation. We derive tractable two-step optimization procedures for finding the optimal consensus posterior, with Kullback-Leibler divergence and Wasserstein distance as the dissimilarity measure between posterior distributions. We demonstrate in simulations and in an empirical study the benefit of our imputation mechanism for portfolio allocation with missing returns. △ Less

Submitted 11 April, 2023; v1 submitted 1 February, 2022; originally announced February 2022.

Comments: This work merges and supersedes arXiv:2102.12736

arXiv:2102.12736

Time-Series Imputation with Wasserstein Interpolation for Optimal Look-Ahead-Bias and Variance Tradeoff

Authors: Jose Blanchet, Fernando Hernandez, Viet Anh Nguyen, Markus Pelger, Xuhui Zhang

Abstract: Missing time-series data is a prevalent practical problem. Imputation methods in time-series data often are applied to the full panel data with the purpose of training a model for a downstream out-of-sample task. For example, in finance, imputation of missing returns may be applied prior to training a portfolio optimization model. Unfortunately, this practice may result in a look-ahead-bias in the… ▽ More Missing time-series data is a prevalent practical problem. Imputation methods in time-series data often are applied to the full panel data with the purpose of training a model for a downstream out-of-sample task. For example, in finance, imputation of missing returns may be applied prior to training a portfolio optimization model. Unfortunately, this practice may result in a look-ahead-bias in the future performance on the downstream task. There is an inherent trade-off between the look-ahead-bias of using the full data set for imputation and the larger variance in the imputation from using only the training data. By connecting layers of information revealed in time, we propose a Bayesian posterior consensus distribution which optimally controls the variance and look-ahead-bias trade-off in the imputation. We demonstrate the benefit of our methodology both in synthetic and real financial data. △ Less

Submitted 11 April, 2023; v1 submitted 25 February, 2021; originally announced February 2021.

Comments: This paper has been superseded by arXiv:2202.00871

arXiv:1904.00745 [pdf, other]

Deep Learning in Asset Pricing

Authors: Luyang Chen, Markus Pelger, Jason Zhu

Abstract: We use deep neural networks to estimate an asset pricing model for individual stock returns that takes advantage of the vast amount of conditioning information, while kee** a fully flexible form and accounting for time-variation. The key innovations are to use the fundamental no-arbitrage condition as criterion function, to construct the most informative test assets with an adversarial approach… ▽ More We use deep neural networks to estimate an asset pricing model for individual stock returns that takes advantage of the vast amount of conditioning information, while kee** a fully flexible form and accounting for time-variation. The key innovations are to use the fundamental no-arbitrage condition as criterion function, to construct the most informative test assets with an adversarial approach and to extract the states of the economy from many macroeconomic time series. Our asset pricing model outperforms out-of-sample all benchmark approaches in terms of Sharpe ratio, explained variation and pricing errors and identifies the key factors that drive asset prices. △ Less

Submitted 10 August, 2021; v1 submitted 10 March, 2019; originally announced April 2019.

arXiv:1809.02303 [pdf, other]

Change-Point Testing for Risk Measures in Time Series

Authors: Lin Fan, Peter W. Glynn, Markus Pelger

Abstract: We propose novel methods for change-point testing for nonparametric estimators of expected shortfall and related risk measures in weakly dependent time series. We can detect general multiple structural changes in the tails of marginal distributions of time series under general assumptions. Self-normalization allows us to avoid the issues of standard error estimation. The theoretical foundations fo… ▽ More We propose novel methods for change-point testing for nonparametric estimators of expected shortfall and related risk measures in weakly dependent time series. We can detect general multiple structural changes in the tails of marginal distributions of time series under general assumptions. Self-normalization allows us to avoid the issues of standard error estimation. The theoretical foundations for our methods are functional central limit theorems, which we develop under weak assumptions. An empirical study of S&P 500 and US Treasury bond returns illustrates the practical use of our methods in detecting and quantifying market instability via the tails of financial time series. △ Less

Submitted 31 July, 2023; v1 submitted 7 September, 2018; originally announced September 2018.

arXiv:1805.03373 [pdf, other]

Interpretable Sparse Proximate Factors for Large Dimensions

Authors: Markus Pelger, Ruoxuan Xiong

Abstract: This paper proposes sparse and easy-to-interpret proximate factors to approximate statistical latent factors. Latent factors in a large-dimensional factor model can be estimated by principal component analysis (PCA), but are usually hard to interpret. We obtain proximate factors that are easier to interpret by shrinking the PCA factor weights and setting them to zero except for the largest absolut… ▽ More This paper proposes sparse and easy-to-interpret proximate factors to approximate statistical latent factors. Latent factors in a large-dimensional factor model can be estimated by principal component analysis (PCA), but are usually hard to interpret. We obtain proximate factors that are easier to interpret by shrinking the PCA factor weights and setting them to zero except for the largest absolute ones. We show that proximate factors constructed with only 5-10% of the data are usually sufficient to almost perfectly replicate the population and PCA factors without actually assuming a sparse structure in the weights or loadings. Using extreme value theory we explain why sparse proximate factors can be substitutes for non-sparse PCA factors. We derive analytical asymptotic bounds for the correlation of appropriately rotated proximate factors with the population factors. These bounds provide guidance on how to construct the proximate factors. In simulations and empirical analyses of financial portfolio and macroeconomic data we illustrate that sparse proximate factors are close substitutes for PCA factors with average correlations of around 97.5% while being interpretable. △ Less

Submitted 1 August, 2020; v1 submitted 9 May, 2018; originally announced May 2018.

Showing 1–7 of 7 results for author: Pelger, M