
Credit Ratings: Heterogeneous Effect on Capital Structure

Authors: Helmut Wasserbacher, Martin Spindler

Abstract: Why do companies choose particular capital structures? A compelling answer to this question remains elusive despite extensive research. In this article, we use double machine learning to examine the heterogeneous causal effect of credit ratings on leverage. Taking advantage of the flexibility of random forests within the double machine learning framework, we model the relationship between variables associated with leverage and credit ratings without imposing strong assumptions about their functional form. This approach also allows for data-driven variable selection from a large set of individual company characteristics, supporting valid causal inference. We report three findings: First, credit ratings causally affect the leverage ratio. Having a rating, as opposed to having none, increases leverage by approximately 7 to 9 percentage points, or 30% to 40% relative to the sample mean leverage. However, this result comes with an important caveat, captured in our second finding: the effect is highly heterogeneous and varies depending on the specific rating. For AAA and AA ratings, the effect is negative, reducing leverage by about 5 percentage points. For A and BBB ratings, the effect is approximately zero. From BB ratings onwards, the effect becomes positive, exceeding 10 percentage points. Third, contrary to what the second finding might imply at first glance, the change from no effect to a positive effect does not occur abruptly at the boundary between investment and speculative grade ratings. Rather, it is gradual, taking place across the granular rating notches ("+/-") within the BBB and BB categories.

Submitted 27 June, 2024; originally announced June 2024.

Comments: 288 pages, 13 figures

arXiv:2406.11308 [pdf, other]

Management Decisions in Manufacturing using Causal Machine Learning -- To Rework, or not to Rework?

Authors: Philipp Schwarz, Oliver Schacht, Sven Klaassen, Daniel Grünbaum, Sebastian Imhof, Martin Spindler

Abstract: In this paper, we present a data-driven model for estimating optimal rework policies in manufacturing systems. We consider a single production stage within a multistage, lot-based system that allows for optional rework steps. While the rework decision depends on an intermediate state of the lot and system, the final product inspection, and thus the assessment of the actual yield, is delayed until production is complete. Repair steps are applied uniformly to the lot, potentially improving some of the individual items while degrading others. The challenge is thus to balance potential yield improvement with the rework costs incurred. Given the inherently causal nature of this decision problem, we propose a causal model to estimate yield improvement. We apply methods from causal machine learning, in particular double/debiased machine learning (DML) techniques, to estimate conditional treatment effects from data and derive policies for rework decisions. We validate our decision model using real-world data from opto-electronic semiconductor manufacturing, achieving a yield improvement of 2-3% during the color-conversion process of white light-emitting diodes (LEDs).

Submitted 17 June, 2024; originally announced June 2024.

Comments: 30 pages, 10 figures
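
Illustrative note (not part of the listing): the abstract above describes turning estimated conditional treatment effects into a rework decision by weighing yield improvement against rework cost. A minimal sketch of such a threshold rule follows. The function names, the example effect estimates, and the assumption that cost is expressed in the same units as yield are all hypothetical; the paper's actual policy may differ.

```python
def rework_policy(effect_estimates, rework_cost):
    """Decide per lot: rework only if the estimated yield gain exceeds the cost.

    effect_estimates: estimated conditional effect of rework on yield, one per lot
    rework_cost: cost of a rework step, in the same (assumed) units as yield
    """
    return [tau > rework_cost for tau in effect_estimates]


def estimated_policy_gain(effect_estimates, decisions, rework_cost):
    """Average net gain per lot versus never reworking, taking estimates at face value."""
    gains = [
        (tau - rework_cost) if rework else 0.0
        for tau, rework in zip(effect_estimates, decisions)
    ]
    return sum(gains) / len(gains)


# Hypothetical effect estimates for four lots (fractions of yield).
estimates = [0.04, -0.01, 0.015, 0.0]
cost = 0.01
decisions = rework_policy(estimates, cost)
gain = estimated_policy_gain(estimates, decisions, cost)
```

The rule reworks the first and third lots only; lots whose estimated effect is negative or below cost are left alone, which reflects the point that rework can also degrade items.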

arXiv:2403.02467 [pdf]

Applied Causal Inference Powered by ML and AI

Authors: Victor Chernozhukov, Christian Hansen, Nathan Kallus, Martin Spindler, Vasilis Syrgkanis

Abstract: An introduction to the emerging fusion of machine learning and causal inference. The book presents ideas from classical structural equation models (SEMs) and their modern AI equivalent, directed acyclic graphs (DAGs) and structural causal models (SCMs), and covers Double/Debiased Machine Learning methods to do inference in such models using modern predictive tools.

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.04674 [pdf, other]

Hyperparameter Tuning for Causal Inference with Double Machine Learning: A Simulation Study

Authors: Philipp Bach, Oliver Schacht, Victor Chernozhukov, Sven Klaassen, Martin Spindler

Abstract: Proper hyperparameter tuning is essential for achieving optimal performance of modern machine learning (ML) methods in predictive tasks. While there is an extensive literature on tuning ML learners for prediction, there is little guidance available on tuning ML learners for causal machine learning and how to select among different ML learners. In this paper, we empirically assess the relationship between the predictive performance of ML methods and the resulting causal estimation based on the Double Machine Learning (DML) approach by Chernozhukov et al. (2018). DML relies on estimating so-called nuisance parameters by treating them as supervised learning problems and using them as plug-in estimates to solve for the (causal) parameter. We conduct an extensive simulation study using data from the 2019 Atlantic Causal Inference Conference Data Challenge. We provide empirical insights on the role of hyperparameter tuning and other practical decisions for causal estimation with DML. First, we assess the importance of data splitting schemes for tuning ML learners within Double Machine Learning. Second, we investigate how the choice of ML methods and hyperparameters, including recent AutoML frameworks, impacts the estimation performance for a causal parameter of interest. Third, we assess to what extent the choice of a particular causal model, as characterized by incorporated parametric assumptions, can be based on predictive performance metrics.

Submitted 7 February, 2024; originally announced February 2024.

arXiv:2402.01785 [pdf, other]

DoubleMLDeep: Estimation of Causal Effects with Multimodal Data

Authors: Sven Klaassen, Jan Teichert-Kluge, Philipp Bach, Victor Chernozhukov, Martin Spindler, Suhas Vijaykumar

Abstract: This paper explores the use of unstructured, multimodal data, namely text and images, in causal inference and treatment effect estimation. We propose a neural network architecture that is adapted to the double machine learning (DML) framework, specifically the partially linear model. An additional contribution of our paper is a new method to generate a semi-synthetic dataset which can be used to evaluate the performance of causal effect estimation in the presence of text and images as confounders. The proposed methods and architectures are evaluated on the semi-synthetic dataset and compared to standard approaches, highlighting the potential benefit of using text and images directly in causal studies. Our findings have implications for researchers and practitioners in economics, marketing, finance, medicine and data science in general who are interested in estimating causal quantities using non-traditional data.

Submitted 1 February, 2024; originally announced February 2024.

MSC Class: 62; 91

ACM Class: I.2.0

arXiv:2306.04223 [pdf, other]

Causally Learning an Optimal Rework Policy

Authors: Oliver Schacht, Sven Klaassen, Philipp Schwarz, Martin Spindler, Daniel Grünbaum, Sebastian Imhof

Abstract: In manufacturing, rework refers to an optional step of a production process which aims to eliminate errors or remedy products that do not meet the desired quality standards. Reworking a production lot involves repeating a previous production stage with adjustments to ensure that the final product meets the required specifications. While offering the chance to improve the yield and thus increase the revenue of a production lot, a rework step also incurs additional costs. Additionally, the rework of parts that already meet the target specifications may damage them and decrease the yield. In this paper, we apply double/debiased machine learning (DML) to estimate the conditional treatment effect of a rework step during the color conversion process in opto-electronic semiconductor manufacturing on the final product yield. We utilize the implementation DoubleML to develop policies for the rework of components and estimate their value empirically. From our causal machine learning analysis we derive implications for the coating of monochromatic LEDs with conversion layers.

Submitted 7 June, 2023; originally announced June 2023.

Comments: 22 pages, 15 figures

arXiv:2303.00280 [pdf, other]

Label Attention Network for sequential multi-label classification: you were looking at a wrong self-attention

Authors: Elizaveta Kovtun, Galina Boeva, Artem Zabolotnyi, Evgeny Burnaev, Martin Spindler, Alexey Zaytsev

Abstract: Most of the available user information can be represented as a sequence of timestamped events. Each event is assigned a set of categorical labels whose future structure is of great interest. For instance, our goal is to predict a group of items in the next customer's purchase or tomorrow's client transactions. This is a multi-label classification problem for sequential data. Modern approaches focus on the transformer architecture for sequential data, introducing self-attention for the elements in a sequence. In that case, we take into account events' time interactions but lose information on label inter-dependencies. Motivated by this shortcoming, we propose leveraging a self-attention mechanism over labels preceding the predicted step. As our approach is a Label-Attention NETwork, we call it LANET. Experimental evidence suggests that LANET outperforms established models and captures interconnections between labels well. For example, the micro-AUC of our approach is 0.9536 compared to 0.7501 for a vanilla transformer. We provide an implementation of LANET to facilitate its wider usage.

Submitted 4 April, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

arXiv:2208.11481 [pdf, ps, other]

An Improved Bernstein-type Inequality for C-Mixing-type Processes and Its Application to Kernel Smoothing

Authors: Zihao Yuan, Martin Spindler

Abstract: There are many processes, particularly dynamic systems, that cannot be described as strong mixing processes. \citet{maume2006exponential} introduced a new mixing coefficient called C-mixing, which includes a large class of dynamic systems. Based on this, \citet{hang2017bernstein} obtained a Bernstein-type inequality for a geometric C-mixing process, which, modulo a logarithmic factor and some constants, coincides with the standard result for the iid case. In order to honor this pioneering work, we conduct follow-up research in this paper and obtain an improved result under more general preconditions. We allow for a weaker requirement for the semi-norm condition, full non-stationarity, and non-isotropic sampling behavior. Our result covers the case in which the index set of processes lies in $\mathbf{Z}^{d+}$ for any given positive integer $d$. Here $\mathbf{Z}^{d+}$ denotes the collection of all nonnegative integer-valued $d$-dimensional vectors. This setting of the index set takes both time and spatial data into consideration. For our application, we investigate the theoretical guarantee of multiple kernel-based nonparametric curve estimators for C-mixing-type processes. More specifically, we first obtain the $L^{\infty}$-convergence rate of the kernel density estimator and then discuss the attainability of optimality, which can also be regarded as an update of the result of \citet{hang2018kernel}. Furthermore, we investigate the uniform convergence of the kernel-based estimators of the conditional mean and variance function in a heteroscedastic nonparametric regression model. Under a mild smoothing condition, these estimators are optimal. Finally, we obtain the uniform convergence rate of the conditional mode function.

Submitted 7 October, 2022; v1 submitted 24 August, 2022; originally announced August 2022.
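
Illustrative note (not part of the listing): the abstract above studies convergence rates of the kernel density estimator. For readers unfamiliar with the estimator itself, here is a minimal sketch of the standard Gaussian-kernel form, f_hat(x) = (1 / (n h)) * sum_i K((x - X_i) / h). The sample values and bandwidth are hypothetical, and the sketch says nothing about the dependence conditions or rates analyzed in the paper.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function x -> kernel density estimate with a Gaussian kernel."""
    n = len(sample)
    # Normalizing constant 1 / (n * h * sqrt(2 * pi)).
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# Hypothetical observations and bandwidth, chosen only for illustration.
density = gaussian_kde([-1.0, 0.0, 0.5, 2.0], bandwidth=0.5)
value_at_zero = density(0.0)
```

The bandwidth controls the smoothness of the estimate; results on optimal rates typically prescribe how it should shrink as the sample size grows.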

arXiv:2208.11433 [pdf, other]

Bernstein-type Inequalities and Nonparametric Estimation under Near-Epoch Dependence

Authors: Zihao Yuan, Martin Spindler

Abstract: The major contributions of this paper lie in two aspects. Firstly, we focus on deriving Bernstein-type inequalities for both geometric and algebraic irregularly-spaced NED random fields, which contain time series as a special case. Furthermore, by introducing the idea of "effective dimension" to the index set of the random field, our results reflect that the sharpness of the inequalities is only associated with this "effective dimension". To the best of our knowledge, our paper may be the first one reflecting this phenomenon. Hence, the first contribution of this paper can be more or less regarded as an update of the pioneering work from \citeA{xu2018sieve}. Additionally, as a corollary of our first contribution, a Bernstein-type inequality for geometric irregularly-spaced $\alpha$-mixing random fields is also obtained. The second aspect of our contributions is that, based on the inequalities mentioned above, we show the $L_{\infty}$ convergence rate of many interesting kernel-based nonparametric estimators. To do this, two deviation inequalities for the supremum of the empirical process are derived under NED and $\alpha$-mixing conditions respectively. Then, for irregularly-spaced NED random fields, we prove the attainability of the optimal rate for the local linear estimator of nonparametric regression, which updates another pioneering work on this topic, \citeA{jenish2012nonparametric}. Subsequently, we analyze the uniform convergence rate of uni-modal regression under the same NED conditions as well. Furthermore, following \citeA{rigollet2009optimal}, we also prove that the kernel-based plug-in density level set estimator could be optimal up to a logarithmic factor. Meanwhile, when the data is collected from $\alpha$-mixing random fields, we also derive the uniform convergence rate of a simple local polynomial density estimator \cite{cattaneo2020simple}.

Submitted 17 October, 2022; v1 submitted 24 August, 2022; originally announced August 2022.

arXiv:2107.04851 [pdf, other]

Machine Learning for Financial Forecasting, Planning and Analysis: Recent Developments and Pitfalls

Authors: Helmut Wasserbacher, Martin Spindler

Abstract: This article is an introduction to machine learning for financial forecasting, planning and analysis (FP&A). Machine learning appears well suited to support FP&A with the highly automated extraction of information from large amounts of data. However, because most traditional machine learning techniques focus on forecasting (prediction), we discuss the particular care that must be taken to avoid the pitfalls of using them for planning and resource allocation (causal inference). While the naive application of machine learning usually fails in this context, the recently developed double machine learning framework can address causal questions of interest. We review the current literature on machine learning in FP&A and illustrate in a simulation study how machine learning can be used for both forecasting and planning. We also investigate how forecasting and planning improve as the number of data points increases.

Submitted 10 July, 2021; originally announced July 2021.

Comments: 31 pages, 3 figures, 4 tables

arXiv:2104.03220 [pdf, other]

DoubleML -- An Object-Oriented Implementation of Double Machine Learning in Python

Authors: Philipp Bach, Victor Chernozhukov, Malte S. Kurz, Martin Spindler

Abstract: DoubleML is an open-source Python library implementing the double machine learning framework of Chernozhukov et al. (2018) for a variety of causal models. It contains functionalities for valid statistical inference on causal parameters when the estimation of nuisance parameters is based on machine learning methods. The object-oriented implementation of DoubleML provides high flexibility in terms of model specifications and makes it easily extendable. The package is distributed under the MIT license and relies on core libraries from the scientific Python ecosystem: scikit-learn, numpy, pandas, scipy, statsmodels and joblib. Source code, documentation and an extensive user guide can be found at https://github.com/DoubleML/doubleml-for-py and https://docs.doubleml.org.

Submitted 20 December, 2021; v1 submitted 7 April, 2021; originally announced April 2021.

Comments: 6 pages, 2 figures

MSC Class: 62-04

Journal ref: Journal of Machine Learning Research 23 (53), 2022, 1-6
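
Illustrative note (not part of the listing): the abstract above refers to estimating nuisance parameters with machine learning and then performing inference on a causal parameter. The sketch below shows the cross-fitted "partialling-out" idea for a partially linear model y = theta*d + g(x) + e in plain Python. It does not use the DoubleML package or its API; simple linear regression stands in for the ML learner, and the simulated data, function names, and parameter values are all assumptions made for illustration.

```python
import random


def ols_fit(x, y):
    """Fit y = a + b*x by least squares and return (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def dml_partialling_out(y, d, x, n_folds=5):
    """Cross-fitted estimate of theta in y = theta*d + g(x) + e.

    For each fold, the nuisance regressions E[y|x] and E[d|x] are fitted on the
    other folds and used to residualize the held-out observations.
    """
    n = len(y)
    folds = [list(range(k, n, n_folds)) for k in range(n_folds)]
    res_y, res_d = [0.0] * n, [0.0] * n
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        x_train = [x[i] for i in train_idx]
        a_l, b_l = ols_fit(x_train, [y[i] for i in train_idx])  # learner for E[y|x]
        a_m, b_m = ols_fit(x_train, [d[i] for i in train_idx])  # learner for E[d|x]
        for i in test_idx:
            res_y[i] = y[i] - (a_l + b_l * x[i])
            res_d[i] = d[i] - (a_m + b_m * x[i])
    # Regress outcome residuals on treatment residuals (no intercept).
    return sum(u * v for u, v in zip(res_d, res_y)) / sum(v * v for v in res_d)


def simulate(n=2000, theta=0.5, seed=0):
    """Simulated data with a single confounder x (hypothetical design)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    d = [0.8 * xi + rng.gauss(0.0, 1.0) for xi in x]
    y = [theta * di + 1.5 * xi + rng.gauss(0.0, 1.0) for di, xi in zip(d, x)]
    return y, d, x


y, d, x = simulate()
theta_hat = dml_partialling_out(y, d, x)  # should be near the true value 0.5
naive_slope = ols_fit(d, y)[1]            # ignores x, so it is confounded
```

In this design the naive regression of y on d is biased upward because x drives both variables, while the residual-on-residual estimate removes that confounding. In practice the linear fits would be replaced by flexible learners, which is where sample splitting matters most.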

arXiv:2103.09603 [pdf, other]

doi 10.18637/jss.v108.i03

DoubleML -- An Object-Oriented Implementation of Double Machine Learning in R

Authors: Philipp Bach, Victor Chernozhukov, Malte S. Kurz, Martin Spindler, Sven Klaassen

Abstract: The R package DoubleML implements the double/debiased machine learning framework of Chernozhukov et al. (2018). It provides functionalities to estimate parameters in causal models based on machine learning methods. The double machine learning framework consists of three key ingredients: Neyman orthogonality, high-quality machine learning estimation and sample splitting. Estimation of nuisance components can be performed by various state-of-the-art machine learning methods that are available in the mlr3 ecosystem. DoubleML makes it possible to perform inference in a variety of causal models, including partially linear and interactive regression models and their extensions to instrumental variable estimation. The object-oriented implementation of DoubleML enables high flexibility for the model specification and makes it easily extendable. This paper serves as an introduction to the double machine learning framework and the R package DoubleML. In reproducible code examples with simulated and real data sets, we demonstrate how DoubleML users can perform valid inference based on machine learning methods.

Submitted 5 June, 2024; v1 submitted 17 March, 2021; originally announced March 2021.

Comments: 56 pages, 8 Figures, 1 Table; Updated version for DoubleML 1.0.0; Updated version due to changes in R package paradox (for parameter tuning with mlr3)

MSC Class: 62-04

Journal ref: Journal of Statistical Software 2024

arXiv:2102.08994 [pdf, other]

Big Data meets Causal Survey Research: Understanding Nonresponse in the Recruitment of a Mixed-mode Online Panel

Authors: Barbara Felderer, Jannis Kueck, Martin Spindler

Abstract: Survey scientists increasingly face the problem of high-dimensionality in their research as digitization makes it much easier to construct high-dimensional (or "big") data sets through tools such as online surveys and mobile applications. Machine learning methods are able to handle such data, and they have been successfully applied to solve predictive problems. However, in many situations, survey statisticians want to learn about causal relationships to draw conclusions and be able to transfer the findings of one survey to another. Standard machine learning methods provide biased estimates of such relationships. We introduce into survey statistics the double machine learning approach, which gives approximately unbiased estimators of causal parameters, and show how it can be used to analyze survey nonresponse in a high-dimensional panel setting.

Submitted 17 February, 2021; originally announced February 2021.

Comments: 33 pages, 3 figures, 3 tables

arXiv:2011.01092 [pdf, other]

Insights from Optimal Pandemic Shielding in a Multi-Group SEIR Framework

Authors: Philipp Bach, Victor Chernozhukov, Martin Spindler

Abstract: The COVID-19 pandemic constitutes one of the largest threats in recent decades to the health and economic welfare of populations globally. In this paper, we analyze different types of policy measures designed to fight the spread of the virus and minimize economic losses. Our analysis builds on a multi-group SEIR model, which extends the multi-group SIR model introduced by Acemoglu et al. (2020). We adjust the underlying social interaction patterns and consider an extended set of policy measures. The model is calibrated for Germany. Despite the trade-off between COVID-19 prevention and economic activity that is inherent to shielding policies, our results show that efficiency gains can be achieved by targeting such policies towards different age groups. Alternative policies such as physical distancing can be employed to reduce the degree of targeting and the intensity and duration of shielding. Our results show that a comprehensive approach that combines multiple policy measures simultaneously can effectively mitigate population mortality and economic harm.

Submitted 2 November, 2020; originally announced November 2020.

Comments: 39 pages, 23 figures
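
Illustrative note (not part of the listing): the abstract above builds on a multi-group SEIR model. As background, the sketch below simulates the basic single-group SEIR compartments (susceptible, exposed, infectious, removed) with a simple Euler discretization. It omits the multiple groups, the economic component, and the calibration described in the paper; all parameter values are hypothetical, and lowering the contact rate is used only as a crude stand-in for a contact-reducing policy.

```python
def simulate_seir(beta, sigma, gamma, s0, e0, i0, r0, days, dt=0.1):
    """Simulate a single-group SEIR model and return one (s, e, i, r) tuple per day.

    beta: transmission rate, sigma: rate of leaving the exposed state,
    gamma: removal rate. Compartments are population shares.
    """
    s, e, i, r = s0, e0, i0, r0
    total = s + e + i + r
    steps_per_day = round(1.0 / dt)
    path = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / total * dt
            new_infectious = sigma * e * dt
            new_removed = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
            r += new_removed
        path.append((s, e, i, r))
    return path


# Hypothetical parameters: same disease, two contact rates.
start = dict(s0=0.999, e0=0.0, i0=0.001, r0=0.0)
base = simulate_seir(beta=0.30, sigma=0.2, gamma=0.1, days=300, **start)
reduced = simulate_seir(beta=0.15, sigma=0.2, gamma=0.1, days=300, **start)
final_removed_base = base[-1][3]
final_removed_reduced = reduced[-1][3]
```

Comparing the final removed share across the two runs shows how reducing contacts lowers the cumulative number of infections in this simplified setting.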

arXiv:2004.01623 [pdf, other]

Estimation and Uniform Inference in Sparse High-Dimensional Additive Models

Authors: Philipp Bach, Sven Klaassen, Jannis Kueck, Martin Spindler

Abstract: We develop a novel method to construct uniformly valid confidence bands for a nonparametric component $f_1$ in the sparse additive model $Y=f_1(X_1)+\ldots + f_p(X_p) + \varepsilon$ in a high-dimensional setting. Our method integrates sieve estimation into a high-dimensional Z-estimation framework, facilitating the construction of uniformly valid confidence bands for the target component $f_1$. To form these confidence bands, we employ a multiplier bootstrap procedure. Additionally, we provide rates for the uniform lasso estimation in high dimensions, which may be of independent interest. Through simulation studies, we demonstrate that our proposed method delivers reliable results in terms of estimation and coverage, even in small samples.

Submitted 23 April, 2024; v1 submitted 3 April, 2020; originally announced April 2020.

MSC Class: 62G08; 62-07

arXiv:2002.12710 [pdf, ps, other]

Causal mediation analysis with double machine learning

Authors: Helmut Farbmacher, Martin Huber, Lukáš Lafférs, Henrika Langen, Martin Spindler

Abstract: This paper combines causal mediation analysis with double machine learning to control for observed confounders in a data-driven way under a selection-on-observables assumption in a high-dimensional setting. We consider the average indirect effect of a binary treatment operating through an intermediate variable (or mediator) on the causal path between the treatment and the outcome, as well as the unmediated direct effect. Estimation is based on efficient score functions, which possess a multiple robustness property w.r.t. misspecifications of the outcome, mediator, and treatment models. This property is key for selecting these models by double machine learning, which is combined with data splitting to prevent overfitting in the estimation of the effects of interest. We demonstrate that the direct and indirect effect estimators are asymptotically normal and root-n consistent under specific regularity conditions and investigate the finite sample properties of the suggested methods in a simulation study when considering lasso as machine learner. We also provide an empirical application to the U.S. National Longitudinal Survey of Youth, assessing the indirect effect of health insurance coverage on general health operating via routine checkups as mediator, as well as the direct effect. We find a moderate short term effect of health insurance coverage on general health which is, however, not mediated by routine checkups.

Submitted 16 February, 2021; v1 submitted 28 February, 2020; originally announced February 2020.

arXiv:1912.12867 [pdf, other]

Adaptive Discrete Smoothing for High-Dimensional and Nonlinear Panel Data

Authors: Xi Chen, Ye Luo, Martin Spindler

Abstract: In this paper we develop a data-driven smoothing technique for high-dimensional and non-linear panel data models. We allow for individual specific (non-linear) functions and estimation with econometric or machine learning methods by using weighted observations from other individuals. The weights are determined in a data-driven way, depend on the similarity between the corresponding functions, and are measured based on initial estimates. The key feature of such a procedure is that it clusters individuals based on the distance / similarity between them, estimated in a first stage. Our estimation method can be combined with various statistical estimation procedures, in particular modern machine learning methods, which are particularly fruitful in the high-dimensional case and with complex, heterogeneous data. The approach can be interpreted as a "soft-clustering" in comparison to traditional "hard clustering" that assigns each individual to exactly one group. We conduct a simulation study which shows that the prediction can be greatly improved by using our estimator. Finally, we analyze a big data set from didichuxing.com, a leading company in the transportation industry, to analyze and predict the gap between supply and demand based on a large set of covariates. Our estimator clearly performs much better in out-of-sample prediction compared to existing linear panel data estimators.

Submitted 3 January, 2020; v1 submitted 30 December, 2019; originally announced December 2019.

Comments: 18 pages, 1 figure, 6 tables

MSC Class: I.2.6; G.3 ACM Class: I.2.6; G.3

arXiv:1910.03072 [pdf, other]

Sequence embeddings help to identify fraudulent cases in healthcare insurance

Authors: I. Fursov, A. Zaytsev, R. Khasyanov, M. Spindler, E. Burnaev

Abstract: Fraud causes substantial costs and losses for companies and clients in the finance and insurance industries. Examples are fraudulent credit card transactions or fraudulent claims. It has been estimated that roughly 10 percent of the insurance industry's incurred losses and loss adjustment expenses each year stem from fraudulent claims. The rise and proliferation of digitization in finance and insurance have led to big data sets, consisting in particular of text data, which can be used for fraud detection. In this paper, we propose architectures for text embeddings via deep learning, which help to improve the detection of fraudulent claims compared to other machine learning methods. We illustrate our methods using a data set from a large international health insurance company. The empirical results show that our approach outperforms other state-of-the-art methods and can help make the claims management process more efficient. As (unstructured) text data become increasingly available to economists and econometricians, our proposed methods will be valuable for many similar applications, particularly when variables have a large number of categories, as is typical, for example, of the International Classification of Diseases (ICD) codes in health economics and health services.

Submitted 7 October, 2019; originally announced October 2019.

arXiv:1903.01375 [pdf, other]

An Extension of the Normal Play Convention to $N$-player Combinatorial Games

Authors: Mark Spindler

Abstract: We examine short combinatorial games for three or more players under a new play convention in which a player who cannot move on their turn is the unique loser. We show that many theorems of impartial and partizan two-player games under normal play have natural analogues in this setting. For impartial games with three players, we investigate the possible outcomes of a sum in detail, and determine the outcomes and structure of three-player Nim.

Submitted 4 March, 2019; originally announced March 2019.

Comments: 45 pages; Presented at Integers Conference 2018, submitted to Proceedings

MSC Class: 91A46 (Primary) 91A06 (Secondary)

arXiv:1812.04345 [pdf, other]

Closing the U.S. gender wage gap requires understanding its heterogeneity

Authors: Philipp Bach, Victor Chernozhukov, Martin Spindler

Abstract: In 2016, the majority of full-time employed women in the U.S. earned significantly less than comparable men. The extent to which women were affected by gender inequality in earnings, however, depended greatly on socio-economic characteristics, such as marital status or educational attainment. In this paper, we analyzed data from the 2016 American Community Survey using a high-dimensional wage regression and applying double lasso to quantify heterogeneity in the gender wage gap. We found that the gap varied substantially across women and was driven primarily by marital status, having children at home, race, occupation, industry, and educational attainment. We recommend that policy makers use these insights to design policies that will reduce discrimination and unequal pay more effectively.

Submitted 7 June, 2021; v1 submitted 11 December, 2018; originally announced December 2018.

Comments: Main text: 8 pages, 3 figures; Supplementary Material available online

arXiv:1809.04951 [pdf, other]

Valid Simultaneous Inference in High-Dimensional Settings (with the hdm package for R)

Authors: Philipp Bach, Victor Chernozhukov, Martin Spindler

Abstract: Due to the increasing availability of high-dimensional empirical applications in many research disciplines, valid simultaneous inference becomes more and more important. For instance, high-dimensional settings might arise in economic studies due to very rich data sets with many potential covariates or in the analysis of treatment heterogeneities. Also, the evaluation of potentially more complicated (non-linear) functional forms of the regression relationship leads to many potential variables for which simultaneous inferential statements might be of interest. Here we provide a review of classical and modern methods for simultaneous inference in (high-dimensional) settings and illustrate their use by a case study using the R package hdm. The R package hdm implements valid, powerful, and efficient joint hypothesis tests for a potentially large number of coefficients as well as the construction of simultaneous confidence intervals and, therefore, provides useful methods to perform valid post-selection inference based on the LASSO.

Submitted 13 September, 2018; originally announced September 2018.

Comments: 25 pages, 2 figures, 4 tables

arXiv:1808.10543 [pdf, other]

A Self-Attention Network for Hierarchical Data Structures with an Application to Claims Management

Authors: Leander Löw, Martin Spindler, Eike Brechmann

Abstract: Insurance companies must manage millions of claims per year. While most of these claims are non-fraudulent, fraud detection is core for insurance companies. The ultimate goal is a predictive model to single out the fraudulent claims and pay out the non-fraudulent ones immediately. Modern machine learning methods are well suited for this kind of problem. Health care claims often have a data structure that is hierarchical and of variable length. We propose one model based on piecewise feed forward neural networks (deep learning) and another model based on self-attention neural networks for the task of claim management. We show that the proposed methods outperform bag-of-words based models, hand designed features, and models based on convolutional neural networks, on a data set of two million health care claims. The proposed self-attention method performs the best.

Submitted 30 August, 2018; originally announced August 2018.

Comments: 7 pages, 6 figures, 2 tables

arXiv:1808.10532 [pdf, other]

Uniform Inference in High-Dimensional Gaussian Graphical Models

Authors: Sven Klaassen, Jannis Kück, Martin Spindler, Victor Chernozhukov

Abstract: Graphical models have become a very popular tool for representing dependencies within a large set of variables and are key for representing causal structures. We provide results for uniform inference on high-dimensional graphical models with the number of target parameters $d$ being possibly much larger than the sample size. This is particularly important when certain features or structures of a causal model should be recovered. Our results highlight how in high-dimensional settings graphical models can be estimated and recovered with modern machine learning methods in complex data sets. To construct simultaneous confidence regions on many target parameters, sufficiently fast estimation rates of the nuisance functions are crucial. In this context, we establish uniform estimation rates and sparsity guarantees of the square-root estimator in a random design under approximate sparsity conditions that might be of independent interest for related problems in high dimensions. We also demonstrate in a comprehensive simulation study that our procedure has good small sample properties.

Submitted 3 December, 2018; v1 submitted 30 August, 2018; originally announced August 2018.

Comments: 59 pages, 2 figures, 6 tables

MSC Class: 62H15; 62J07;

arXiv:1801.00364 [pdf, other]

Estimation and Inference of Treatment Effects with $L_2$-Boosting in High-Dimensional Settings

Authors: Jannis Kueck, Ye Luo, Martin Spindler, Zigan Wang

Abstract: Empirical researchers are increasingly faced with rich data sets containing many controls or instrumental variables, making it essential to choose an appropriate approach to variable selection. In this paper, we provide results for valid inference after post- or orthogonal $L_2$-Boosting is used for variable selection. We consider treatment effects after selecting among many control variables and instrumental variable models with potentially many instruments. To achieve this, we establish new results for the rate of convergence of iterated post-$L_2$-Boosting and orthogonal $L_2$-Boosting in a high-dimensional setting similar to Lasso, i.e., under approximate sparsity without assuming the beta-min condition. These results are extended to the 2SLS framework and valid inference is provided for treatment effect analysis. We give extensive simulation results for the proposed methods and compare them with Lasso. In an empirical application, we construct efficient IVs with our proposed methods to estimate the effect of pre-merger overlap of bank branch networks in the US on the post-merger stock returns of the acquirer bank.

Submitted 1 July, 2021; v1 submitted 31 December, 2017; originally announced January 2018.

Comments: 17 pages, 1 figure

MSC Class: 62J07; 62F12

arXiv:1712.07364 [pdf, other]

Transformation Models in High-Dimensions

Authors: Sven Klaassen, Jannis Kueck, Martin Spindler

Abstract: Transformation models are a very important tool for applied statisticians and econometricians. In many applications, the dependent variable is transformed so that homogeneity or normal distribution of the error holds. In this paper, we analyze transformation models in a high-dimensional setting, where the set of potential covariates is large. We propose an estimator for the transformation parameter and we show that it is asymptotically normally distributed using an orthogonalized moment condition where the nuisance functions depend on the target parameter. In a simulation study, we show that the proposed estimator works well in small samples. A common practice in labor economics is to transform wage with the log-function. In this study, we test if this transformation holds in CPS data from the United States.

Submitted 20 December, 2017; originally announced December 2017.

Comments: 63 pages, 4 figures

MSC Class: 62H; 62F

arXiv:1702.03244 [pdf, ps, other]

$L_2$Boosting for Economic Applications

Authors: Ye Luo, Martin Spindler

Abstract: In recent years, more and more high-dimensional data sets, where the number of parameters $p$ is high compared to the number of observations $n$ or even larger, have become available to applied researchers. Boosting algorithms represent one of the major advances in machine learning and statistics in recent years and are suitable for the analysis of such data sets. While Lasso has been applied very successfully to high-dimensional data sets in Economics, boosting has been underutilized in this field, although it has proven very powerful in fields like Biostatistics and Pattern Recognition. We attribute this to missing theoretical results for boosting. The goal of this paper is to fill this gap and show that boosting is a competitive method for inference of a treatment effect or instrumental variable (IV) estimation in a high-dimensional setting. First, we present the $L_2$Boosting with componentwise least squares algorithm and variants tailored for regression problems, which are the workhorse for most Econometric problems. Then we show how $L_2$Boosting can be used for estimation of treatment effects and IV estimation. We highlight the methods and illustrate them with simulations and empirical examples. For further results and technical details we refer to Luo and Spindler (2016, 2017) and to the online supplement of the paper.

Submitted 10 February, 2017; originally announced February 2017.

Comments: Submitted to American Economic Review, Papers and Proceedings 2017. arXiv admin note: text overlap with arXiv:1602.08927

arXiv:1608.00354 [pdf, ps, other]

hdm: High-Dimensional Metrics

Authors: Victor Chernozhukov, Chris Hansen, Martin Spindler

Abstract: In this article the package High-dimensional Metrics (hdm) is introduced. It is a collection of statistical methods for estimation and quantification of uncertainty in high-dimensional approximately sparse models. It focuses on providing confidence intervals and significance testing for (possibly many) low-dimensional subcomponents of the high-dimensional parameter vector. Efficient estimators and uniformly valid confidence intervals for regression coefficients on target variables (e.g., treatment or policy variable) in a high-dimensional approximately sparse regression model, for average treatment effect (ATE) and average treatment effect for the treated (ATET), as well as for extensions of these parameters to the endogenous setting are provided. Theory-grounded, data-driven methods for selecting the penalization parameter in Lasso regressions under heteroscedastic and non-Gaussian errors are implemented. Moreover, joint/simultaneous confidence intervals for regression coefficients of a high-dimensional sparse regression are implemented. Data sets which have been used in the literature and might be useful for classroom demonstration and for testing new estimators are included.

Submitted 1 August, 2016; originally announced August 2016.

Comments: arXiv admin note: substantial text overlap with arXiv:1603.01700

arXiv:1603.01700 [pdf, ps, other]

High-Dimensional Metrics in R

Authors: Victor Chernozhukov, Chris Hansen, Martin Spindler

Abstract: The package High-dimensional Metrics (hdm) is an evolving collection of statistical methods for estimation and quantification of uncertainty in high-dimensional approximately sparse models. It focuses on providing confidence intervals and significance testing for (possibly many) low-dimensional subcomponents of the high-dimensional parameter vector. Efficient estimators and uniformly valid confidence intervals for regression coefficients on target variables (e.g., treatment or policy variable) in a high-dimensional approximately sparse regression model, for average treatment effect (ATE) and average treatment effect for the treated (ATET), as well as for extensions of these parameters to the endogenous setting are provided. Theory-grounded, data-driven methods for selecting the penalization parameter in Lasso regressions under heteroscedastic and non-Gaussian errors are implemented. Moreover, joint/simultaneous confidence intervals for regression coefficients of a high-dimensional sparse regression are implemented, including a joint significance test for Lasso regression. Data sets which have been used in the literature and might be useful for classroom demonstration and for testing new estimators are included. R and the package hdm are open-source software projects and can be freely downloaded from CRAN: http://cran.r-project.org.

Submitted 1 August, 2016; v1 submitted 5 March, 2016; originally announced March 2016.

Comments: 34 pages; vignette for the R package hdm, available at http://cran.r-project.org/web/packages/hdm/ and http://r-forge.r-project.org/R/?group_id=2084 (development version)

MSC Class: 62-01; 62-04; 62J07; 62G05

arXiv:1602.08927 [pdf, other]

High-Dimensional $L_2$Boosting: Rate of Convergence

Authors: Ye Luo, Martin Spindler, Jannis Kück

Abstract: Boosting is one of the most significant developments in machine learning. This paper studies the rate of convergence of $L_2$Boosting, which is tailored for regression, in a high-dimensional setting. Moreover, we introduce so-called "post-Boosting". This is a post-selection estimator which applies ordinary least squares to the variables selected in the first stage by $L_2$Boosting. Another variant is "Orthogonal Boosting", where after each step an orthogonal projection is conducted. We show that both post-$L_2$Boosting and the orthogonal boosting achieve the same rate of convergence as LASSO in a sparse, high-dimensional setting. We show that the rate of convergence of the classical $L_2$Boosting depends on the design matrix described by a sparse eigenvalue constant. To show the latter results, we derive new approximation results for the pure greedy algorithm, based on analyzing the revisiting behavior of $L_2$Boosting. We also introduce feasible rules for early stopping, which can be easily implemented and used in applied work. Our results also allow a direct comparison between LASSO and boosting which has been missing from the literature. Finally, we present simulation studies and applications to illustrate the relevance of our theoretical results and to provide insights into the practical aspects of boosting. In these simulation studies, post-$L_2$Boosting clearly outperforms LASSO.

Submitted 21 July, 2022; v1 submitted 29 February, 2016; originally announced February 2016.

Comments: 19 pages, 4 tables; AMS 2000 subject classifications: Primary 62J05, 62J07, 41A25; secondary 49M15, 68Q32

MSC Class: 62J05; 62J07; 41A25; 49M15; 68Q32

arXiv:1506.02261 [pdf, ps, other]

Equality Classes of Nim Positions under Misère Play

Authors: Mark Spindler

Abstract: We determine the misère equivalence classes of Nim positions under two equivalence relations: one based on playing disjunctive sums with other impartial games, and one allowing sums with partizan games. In the impartial context, the only identifications we can make are those stemming from the known fact about adding a heap of size 1. In the partizan context, distinct Nim positions are inequivalent.

Submitted 22 July, 2016; v1 submitted 7 June, 2015; originally announced June 2015.

Comments: 10 pages, LaTeX; fixed typos, improved formatting consistency, added section on transfinite Nim

MSC Class: 91A46

Journal ref: Integers 16 (2016) G3

arXiv:1501.03430 [pdf, other]

doi 10.1146/annurev-economics-012315-015826

Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach

Authors: Victor Chernozhukov, Christian Hansen, Martin Spindler

Abstract: Here we present an expository, general analysis of valid post-selection or post-regularization inference about a low-dimensional target parameter, $α$, in the presence of a very high-dimensional nuisance parameter, $η$, which is estimated using modern selection or regularization methods. Our analysis relies on high-level, easy-to-interpret conditions that allow one to clearly see the structures needed for achieving valid post-regularization inference. Simple, readily verifiable sufficient conditions are provided for a class of affine-quadratic models. We focus our discussion on estimation and inference procedures based on using the empirical analog of theoretical equations $$M(α, η) = 0$$ which identify $α$. Within this structure, we show that setting up such equations in a manner such that the orthogonality/immunization condition $$\partial_η M(α, η) = 0$$ at the true parameter values is satisfied, coupled with plausible conditions on the smoothness of $M$ and the quality of the estimator $\hat η$, guarantees that inference on the main parameter $α$ based on testing or point estimation methods discussed below will be regular despite selection or regularization biases occurring in estimation of $η$. In particular, the estimator of $α$ will often be uniformly consistent at the root-$n$ rate and uniformly asymptotically normal even though estimators $\hat η$ will generally not be asymptotically linear and regular. The uniformity holds over large classes of models that do not impose highly implausible "beta-min" conditions.
We also show that inference can be carried out by inverting tests formed from Neyman's $C(α)$ (orthogonal score) statistics.

Submitted 18 August, 2015; v1 submitted 14 January, 2015; originally announced January 2015.

Comments: 47 pages

Journal ref: Annual Review of Economics, Vol. 7: 649-688 (August 2015)

arXiv:1501.03185 [pdf, ps, other]

Post-Selection and Post-Regularization Inference in Linear Models with Many Controls and Instruments

Authors: Victor Chernozhukov, Christian Hansen, Martin Spindler

Abstract: In this note, we offer an approach to estimating causal/structural parameters in the presence of many instruments and controls based on methods for estimating sparse high-dimensional models. We use these high-dimensional methods to select both which instruments and which control variables to use. The approach we take extends BCCH2012, which covers selection of instruments for IV models with a small number of controls, and extends BCH2014, which covers selection of controls in models where the variable of interest is exogenous conditional on observables, to accommodate both a large number of controls and a large number of instruments. We illustrate the approach with a simulation and an empirical example. Technical supporting material is available in a supplementary online appendix.

Submitted 13 January, 2015; originally announced January 2015.

Comments: American Economic Review 2015, Papers and Proceedings

Showing 1–32 of 32 results for author: Spindler, M