
Automatic Outlier Rectification via Optimal Transport

Authors: Jose Blanchet, Jia** Li, Markus Pelger, Greg Zanotti

Abstract: In this paper, we propose a novel conceptual framework to detect outliers using optimal transport with a concave cost function. Conventional outlier detection approaches typically use a two-stage procedure: first, outliers are detected and removed, and then estimation is performed on the cleaned data. However, this approach does not inform outlier removal with the estimation task, leaving room for improvement. To address this limitation, we propose an automatic outlier rectification mechanism that integrates rectification and estimation within a joint optimization framework. We take the first step to utilize the optimal transport distance with a concave cost function to construct a rectification set in the space of probability distributions. Then, we select the best distribution within the rectification set to perform the estimation task. Notably, the concave cost function we introduced in this paper is the key to making our estimator effectively identify the outlier during the optimization process. We demonstrate the effectiveness of our approach over conventional approaches in simulations and empirical analyses for mean estimation, least absolute regression, and the fitting of option implied volatility surfaces.

Submitted 11 July, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

arXiv:2106.04028

Deep Learning Statistical Arbitrage

Authors: Jorge Guijarro-Ordonez, Markus Pelger, Greg Zanotti

Abstract: Statistical arbitrage exploits temporal price differences between similar assets. We develop a unifying conceptual framework for statistical arbitrage and a novel data-driven solution. First, we construct arbitrage portfolios of similar assets as residual portfolios from conditional latent asset pricing factors. Second, we extract their time series signals with a powerful machine-learning time-series solution, a convolutional transformer. Lastly, we use these signals to form an optimal trading policy that maximizes risk-adjusted returns under constraints. Our comprehensive empirical study on daily US equities shows a high compensation for arbitrageurs to enforce the law of one price. Our arbitrage strategies obtain consistently high out-of-sample mean returns and Sharpe ratios, and substantially outperform all benchmark approaches.

Submitted 7 October, 2022; v1 submitted 7 June, 2021; originally announced June 2021.

arXiv:2102.12736

Time-Series Imputation with Wasserstein Interpolation for Optimal Look-Ahead-Bias and Variance Tradeoff

Authors: Jose Blanchet, Fernando Hernandez, Viet Anh Nguyen, Markus Pelger, Xuhui Zhang

Abstract: Missing time-series data is a prevalent practical problem. Imputation methods in time-series data often are applied to the full panel data with the purpose of training a model for a downstream out-of-sample task. For example, in finance, imputation of missing returns may be applied prior to training a portfolio optimization model. Unfortunately, this practice may result in a look-ahead-bias in the future performance on the downstream task. There is an inherent trade-off between the look-ahead-bias of using the full data set for imputation and the larger variance in the imputation from using only the training data. By connecting layers of information revealed in time, we propose a Bayesian posterior consensus distribution which optimally controls the variance and look-ahead-bias trade-off in the imputation. We demonstrate the benefit of our methodology both in synthetic and real financial data.

Submitted 11 April, 2023; v1 submitted 25 February, 2021; originally announced February 2021.

Comments: This paper has been superseded by arXiv:2202.00871

arXiv:2101.06323

doi 10.1145/3442381.3449842

TextGNN: Improving Text Encoder via Graph Neural Network in Sponsored Search

Authors: Jason Yue Zhu, Yanling Cui, Yuming Liu, Hao Sun, Xue Li, Markus Pelger, Tianqi Yang, Liangjie Zhang, Ruofei Zhang, Huasha Zhao

Abstract: Text encoders based on C-DSSM or transformers have demonstrated strong performance in many Natural Language Processing (NLP) tasks. Low-latency variants of these models have also been developed in recent years in order to apply them in the field of sponsored search, which has strict computational constraints. However, these models are not a panacea for all Natural Language Understanding (NLU) challenges, as the pure semantic information in the data is not sufficient to fully identify user intents. We propose the TextGNN model, which naturally extends the strong twin-tower structured encoders with complementary graph information from users' historical behaviors; this information serves as a natural guide to help us better understand the intents and hence generate better language representations. The model inherits all the benefits of twin-tower models such as C-DSSM and TwinBERT, so it can still be used in low-latency environments while achieving a significant performance gain over the strong encoder-only baseline models in both offline evaluations and the online production system. In offline experiments, the model achieves a 0.14% overall increase in ROC-AUC with a 1% increase in accuracy for long-tail low-frequency Ads, and in online A/B testing, the model shows a 2.03% increase in Revenue Per Mille with a 2.32% decrease in Ad defect rate.

Submitted 1 May, 2021; v1 submitted 15 January, 2021; originally announced January 2021.

Comments: Jason Yue Zhu, Yanling Cui, Yuming Liu, Hao Sun, Xue Li, Markus Pelger, Tianqi Yang, Liangjie Zhang, Ruofei Zhang, and Huasha Zhao. 2021. TextGNN: Improving Text Encoder via Graph Neural Network in Sponsored Search. In Proceedings of the Web Conference 2021 (WWW 21), April 19-23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3442381.3449842

Showing 1–4 of 4 results for author: Pelger, M