Search | arXiv e-print repository

Zero-Inflated Tweedie Boosted Trees with CatBoost for Insurance Loss Analytics

Abstract: In this paper, we explore advanced modifications to the Tweedie regression model in order to address its limitations in modeling aggregate claims for various types of insurance such as automobile, health, and liability. Traditional Tweedie models, while effective in capturing the probability and magnitude of claims, usually fall short in accurately representing the large incidence of zero claims. Our recommended approach involves a refined modeling of the zero-claim process, together with the integration of boosting methods that leverage an iterative process to enhance predictive accuracy. Despite the inherent slowdown in learning algorithms due to this iteration, several efficient implementation techniques that also support precise parameter tuning, such as XGBoost, LightGBM, and CatBoost, have emerged. We chose to utilize CatBoost, an efficient boosting approach that effectively handles categorical and other special types of data. The core contribution of our paper is the combination of separate modeling for zero claims with tree-based boosting ensemble methods within a CatBoost framework, assuming that the inflated probability of zero is a function of the mean parameter. The efficacy of our enhanced Tweedie model is demonstrated through its application to an insurance telematics dataset, which presents the additional complexity of compositional feature variables. Our modeling results reveal a marked improvement in model performance, showcasing its potential to deliver more accurate predictions suitable for insurance claim analytics.

Submitted 23 June, 2024; originally announced June 2024.
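The zero-inflation idea in the abstract above can be made concrete with a small sketch. It uses the standard compound Poisson-gamma form of the Tweedie distribution (power parameter 1 < p < 2), for which the probability of a zero is exp(-lambda) with lambda = mu^(2-p) / (phi (2-p)). The function `inflation_prob` and its parameter `gamma` are hypothetical: the abstract only says that the inflated zero probability is a function of the mean, so the specific form used here is an assumption for illustration and not necessarily the authors' choice.

```python
import math

def tweedie_zero_prob(mu, phi, p):
    """P(Y = 0) for a compound Poisson-gamma Tweedie with 1 < p < 2."""
    if not 1.0 < p < 2.0:
        raise ValueError("power parameter p must lie strictly between 1 and 2")
    # Poisson rate of the underlying claim count.
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))
    return math.exp(-lam)

def inflation_prob(mu, gamma):
    """Hypothetical link (assumption): extra zero mass shrinks as the mean grows."""
    return 1.0 / (1.0 + mu ** gamma)

def zero_inflated_stats(mu, phi, p, gamma):
    """Return (P(Y = 0), E[Y]) for a mixture of a point mass at 0 and a Tweedie."""
    q = inflation_prob(mu, gamma)
    p_zero = q + (1.0 - q) * tweedie_zero_prob(mu, phi, p)
    mean = (1.0 - q) * mu
    return p_zero, mean
```

With `q = 0` the mixture reduces to the ordinary Tweedie model, which is one way to see that zero inflation can only add mass at zero.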

arXiv:2401.16723

Improving Business Insurance Loss Models by Leveraging InsurTech Innovation

Authors: Zhiyu Quan, Changyue Hu, Panyi Dong, Emiliano A. Valdez

Abstract: Recent transformative and disruptive advancements in the insurance industry have embraced various InsurTech innovations. In particular, with the rapid progress in data science and computational capabilities, InsurTech is able to integrate a multitude of emerging data sources, shedding light on opportunities to enhance risk classification and claims management. This paper presents a groundbreaking effort as we combine real-life proprietary insurance claims information with InsurTech data to enhance the loss model, a fundamental component of insurance companies' risk management. Our study further utilizes various machine learning techniques to quantify the predictive improvement of the InsurTech-enhanced loss model over the insurer's in-house model. The quantification process provides a deeper understanding of the value of the InsurTech innovation and points to potential risk factors that are unexplored in traditional insurance loss modeling. This study represents a successful undertaking of an academic-industry collaboration, suggesting an inspiring path for future partnerships between industry and academic institutions.

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2112.14868

The SAMME.C2 algorithm for severely imbalanced multi-class classification

Authors: Banghee So, Emiliano A. Valdez

Abstract: Classification predictive modeling involves the accurate assignment of observations in a dataset to target classes or categories. There is an increasing number of real-world classification problems with severely imbalanced class distributions. In this case, minority classes have far fewer observations to learn from than majority classes. Despite this sparsity, a minority class is often considered the more interesting class, yet developing a scientific learning algorithm suitable for the observations presents countless challenges. In this article, we suggest a novel multi-class classification algorithm specialized to handle severely imbalanced classes based on the method we refer to as SAMME.C2. It blends the flexible mechanics of the boosting techniques from the SAMME algorithm, a multi-class classifier, and the Ada.C2 algorithm, a cost-sensitive binary classifier designed to address high class imbalances. Not only do we provide the resulting algorithm but we also establish a scientific and statistical formulation of our proposed SAMME.C2 algorithm. Through numerical experiments examining various degrees of classifier difficulty, we demonstrate consistently superior performance of our proposed model.

Submitted 29 December, 2021; originally announced December 2021.

Comments: 25 pages, 8 figures, algorithms

MSC Class: 62P99
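For the SAMME.C2 entry above, a minimal sketch of the two ingredients it names may help. `samme_alpha` is the classifier weight of the published SAMME algorithm, log((1 - err) / err) + log(K - 1), which stays positive as long as the weak learner beats random guessing among K classes. `cost_sensitive_update` is only an illustrative assumption of how per-observation costs could enter the reweighting, in the spirit of Ada.C2; it is not taken from the paper and should not be read as the authors' exact update rule.

```python
import math

def samme_alpha(weighted_error, n_classes):
    """SAMME classifier weight: log((1 - err) / err) + log(K - 1)."""
    return math.log((1.0 - weighted_error) / weighted_error) + math.log(n_classes - 1)

def cost_sensitive_update(weights, costs, misclassified, alpha):
    """Illustrative reweighting (assumption): cost times an exponential boost for errors."""
    raw = [
        c * w * math.exp(alpha if miss else 0.0)
        for w, c, miss in zip(weights, costs, misclassified)
    ]
    total = sum(raw)
    # Normalize so the weights form a distribution again.
    return [r / total for r in raw]
```

In this sketch, giving minority-class observations a larger cost makes a misclassified minority case dominate the next round's weights faster than it would under plain SAMME.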

arXiv:2112.14865

Compositional Data Regression in Insurance with Exponential Family PCA

Authors: Guojun Gan, Emiliano A. Valdez

Abstract: Compositional data are multivariate observations that carry only relative information between components. Applying standard multivariate statistical methodology directly to analyze compositional data can lead to paradoxes and misinterpretations. Compositional data also frequently appear in insurance, especially with telematics information. However, this type of data does not receive the special treatment it deserves in most existing actuarial literature. In this paper, we explore and investigate the use of exponential family principal component analysis (EPCA) to analyze compositional data in insurance. The method is applied to analyze a dataset obtained from the U.S. Mine Safety and Health Administration. The numerical results show that EPCA is able to produce principal components that are significant predictors and improve the prediction accuracy of the regression model. The EPCA method can be a promising tool for actuaries to analyze compositional data.

Submitted 29 December, 2021; originally announced December 2021.

Comments: 21 pages, 5 figures, 10 tables

MSC Class: 62P05

arXiv:2102.00252

Synthetic Dataset Generation of Driver Telematics

Authors: Banghee So, Jean-Philippe Boucher, Emiliano A. Valdez

Abstract: This article describes techniques employed in the production of a synthetic dataset of driver telematics emulated from a similar real insurance dataset. The synthetic dataset generated has 100,000 policies that include observations about drivers' claims experience together with associated classical risk variables and telematics-related variables. This work aims to produce a resource that can be used to advance models to assess risks for usage-based insurance. It follows a three-stage process using machine learning algorithms. The first stage is simulating values for the number of claims as multiple binary classifications applying feedforward neural networks. The second stage is simulating values for the aggregated amount of claims as regression using feedforward neural networks, with the number of claims included in the set of feature variables. In the final stage, a synthetic portfolio of the space of feature variables is generated applying an extended $\texttt{SMOTE}$ algorithm. The resulting dataset is evaluated by comparing the synthetic and real datasets when Poisson and gamma regression models are fitted to the respective data. Other visualization and data summarization produce remarkably similar statistics between the two datasets. We hope that researchers interested in obtaining telematics datasets to calibrate models or learning algorithms will find our work valuable.

Submitted 30 January, 2021; originally announced February 2021.

Comments: 24 pages, 11 figures, 6 tables

MSC Class: 62P05
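The final stage described in the entry above relies on an extended SMOTE algorithm. The sketch below shows only the basic SMOTE interpolation step, on which such extensions build: a synthetic point is placed at a random position on the segment between an observation and one of its neighbors. The function name and the assumption that neighbors are supplied by the caller are illustrative; the paper's extension is not reproduced here.

```python
import random

def smote_sample(point, neighbors, rng):
    """Interpolate between `point` and one randomly chosen neighbor."""
    neighbor = rng.choice(neighbors)
    u = rng.random()  # position along the segment, in [0, 1)
    return [x + u * (n - x) for x, n in zip(point, neighbor)]

# Usage: one synthetic feature vector between a policy and a nearby policy.
rng = random.Random(0)
synthetic = smote_sample([0.0, 0.0], [[1.0, 2.0]], rng)
```

Because the new point lies on a line segment between two real observations, it stays inside the region already covered by the data, which is why plain interpolation of this kind suits numeric features better than categorical ones.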

arXiv:2101.10896

Applications of Clustering with Mixed Type Data in Life Insurance

Authors: Shuang Yin, Guojun Gan, Emiliano A. Valdez, Jeyaraj Vadiveloo

Abstract: Death benefits are generally the largest cash flow item that affects financial statements of life insurers, yet some insurers still do not have a systematic process to track and monitor death claims experience. In this article, we explore data clustering to examine and understand how actual death claims differ from expected, an early stage of developing a monitoring system crucial for risk management. We extend the $k$-prototypes clustering algorithm to draw inference from a life insurance dataset using only the insured's characteristics and policy information without regard to known mortality. This clustering has the feature to efficiently handle categorical, numerical, and spatial attributes. Using gap statistics, the optimal clusters obtained from the algorithm are then used to compare actual to expected death claims experience of the life insurance portfolio. Our empirical data contains observations, during 2014, of approximately 1.14 million policies with a total insured amount of over 650 billion dollars. For this portfolio, the algorithm produced three natural clusters, with each cluster having a lower actual to expected death claims but with differing variability. The analytical results provide management a process to identify policyholders' attributes that dominate significant mortality deviations, and thereby enhance decision making for taking necessary actions.

Submitted 26 January, 2021; originally announced January 2021.

Comments: 25 pages, 6 figures, 5 tables

MSC Class: 62P05
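The clustering entry above extends the $k$-prototypes algorithm. As a point of reference, the standard $k$-prototypes dissimilarity combines squared Euclidean distance on numeric attributes with a weighted count of mismatches on categorical attributes. The sketch below implements only that standard measure; the weight `gamma` is a tuning parameter, the example attribute values are made up, and the spatial component added in the paper is not modeled.

```python
def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma):
    """Squared Euclidean distance plus gamma times the number of category mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(num_a, num_b))
    mismatches = sum(a != b for a, b in zip(cat_a, cat_b))
    return numeric + gamma * mismatches

# Usage: two policies with two numeric and two categorical attributes (hypothetical values).
d = mixed_dissimilarity([1.0, 2.0], ["M", "CT"], [1.0, 4.0], ["F", "CT"], gamma=0.5)
```

A larger `gamma` lets categorical attributes drive cluster assignment, while `gamma = 0` reduces the measure to ordinary k-means distance on the numeric attributes.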

arXiv:2008.05968

Flexible Modeling of Hurdle Conway-Maxwell-Poisson Distributions with Application to Mining Injuries

Authors: Shuang Yin, Dipak K. Dey, Emiliano A. Valdez, Xiaomeng Li

Abstract: While the hurdle Poisson regression is a popular class of models for count data with excessive zeros, the link function in the binary component may be unsuitable for highly imbalanced cases. Ordinary Poisson regression is unable to handle the presence of dispersion. In this paper, we introduce the Conway-Maxwell-Poisson (CMP) distribution and integrate the use of flexible skewed Weibull link functions as a better alternative. We take a fully Bayesian approach to draw inference from the underlying models to better explain skewness and quantify dispersion, with the Deviance Information Criterion (DIC) used for model selection. For empirical investigation, we analyze mining injury data for the period 2013-2016 from the U.S. Mine Safety and Health Administration (MSHA). The risk factors describing proportions of employee hours spent in each type of mining work are compositional data; probabilistic principal components analysis (PPCA) is deployed to deal with such covariates. The hurdle CMP regression is additionally adjusted for exposure, measured by the total employee working hours, to make inference on the rate of mining injuries; we tested its competitiveness against other models. This can be used as a predictive model in the mining workplace to identify features that increase the risk of injuries so that prevention can be implemented.

Submitted 13 August, 2020; originally announced August 2020.

Comments: 23 pages, 7 Tables, 3 Figures

MSC Class: 62P99
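To illustrate the hurdle CMP structure named in the entry above, the sketch below evaluates the standard Conway-Maxwell-Poisson probability mass function, proportional to lambda^k / (k!)^nu, and wraps it in a hurdle: zeros get probability `pi_zero`, and positive counts follow the zero-truncated CMP. The normalizing constant is approximated by a truncated sum, and `pi_zero` is passed in directly rather than modeled through a skewed Weibull link, so this is a simplified sketch under those assumptions and not the paper's regression model.

```python
import math

def cmp_pmf(k, lam, nu, terms=200):
    """CMP pmf lam^k / (k!)^nu / Z, with Z approximated by the first `terms` terms."""
    def log_weight(j):
        return j * math.log(lam) - nu * math.lgamma(j + 1)
    z = sum(math.exp(log_weight(j)) for j in range(terms))
    return math.exp(log_weight(k)) / z

def hurdle_cmp_pmf(k, pi_zero, lam, nu):
    """Hurdle model: point mass pi_zero at 0, zero-truncated CMP for k >= 1."""
    if k == 0:
        return pi_zero
    return (1.0 - pi_zero) * cmp_pmf(k, lam, nu) / (1.0 - cmp_pmf(0, lam, nu))
```

Setting `nu = 1` recovers the Poisson distribution, `nu > 1` gives underdispersion, and `nu < 1` gives overdispersion, which is the flexibility the abstract refers to. The truncated sum is adequate for moderate `lam`; for small `nu` combined with large `lam` more terms would be needed.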

arXiv:2008.00048

Analysis of Prescription Drug Utilization with Beta Regression Models

Authors: Guojun Gan, Emiliano A. Valdez

Abstract: The healthcare sector in the U.S. is complex and is also a large sector that generates about 20% of the country's gross domestic product. Healthcare analytics has been used by researchers and practitioners to better understand the industry. In this paper, we examine and demonstrate the use of Beta regression models to study the utilization of brand name drugs in the U.S. to understand the variability of brand name drug utilization across different areas. The models are fitted to public datasets obtained from the Medicare & Medicaid Services and the Internal Revenue Service. Integrated Nested Laplace Approximation (INLA) is used to perform the inference. The numerical results show that Beta regression models can fit the brand name drug claim rates well and including spatial dependence improves the performance of the Beta regression models. Such models can be used to reflect the effect of prescription drug utilization when updating an insured's health risk in a risk scoring model.

Submitted 31 July, 2020; originally announced August 2020.

Comments: 26 pages, 10 Figures, 11 Tables

MSC Class: 91G05
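Beta regression, as used in the entry above, is commonly written in a mean-precision parameterization: a rate in (0, 1) has mean mu and precision phi, which correspond to the usual Beta shape parameters a = mu * phi and b = (1 - mu) * phi. The sketch below shows that conversion and the implied variance. It assumes this common parameterization; the abstract does not state which one the authors use.

```python
def beta_shapes(mu, phi):
    """Convert mean mu in (0, 1) and precision phi > 0 to Beta shape parameters."""
    if not 0.0 < mu < 1.0 or phi <= 0.0:
        raise ValueError("need 0 < mu < 1 and phi > 0")
    return mu * phi, (1.0 - mu) * phi

def beta_variance(mu, phi):
    """Variance of a Beta variable under the mean-precision parameterization."""
    return mu * (1.0 - mu) / (1.0 + phi)
```

The variance formula shows why the precision matters for fitting claim rates: for a fixed mean, a larger `phi` concentrates the distribution more tightly around that mean.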

arXiv:2007.15172

Skewed link regression models for imbalanced binary response with applications to life insurance

Authors: Shuang Yin, Dipak K. Dey, Emiliano A. Valdez, Guojun Gan, Jeyaraj Vadiveloo

Abstract: For a portfolio of life insurance policies observed for a stated period of time, e.g., one year, mortality is typically a rare event. When we examine the outcome of dying or not from such portfolios, we have an imbalanced binary response. The popular logistic and probit regression models can be inappropriate for imbalanced binary response as model estimates may be biased, and if not addressed properly, this can lead to serious adverse predictions. In this paper, we propose the use of skewed link regression models (Generalized Extreme Value, Weibull, and Frechet link models) as superior models to handle imbalanced binary response. We adopt a fully Bayesian approach for the generalized linear models (GLMs) under the proposed link functions to help better explain the high skewness. To calibrate our proposed Bayesian models, we use a real dataset of death claims experience drawn from a life insurance company's portfolio. Bayesian estimates of parameters were obtained using the Metropolis-Hastings algorithm, and for Bayesian model selection and comparison, the Deviance Information Criterion (DIC) statistic has been used. For our mortality dataset, we find that these skewed link models are superior to the widely used binary models with standard link functions. We evaluate the predictive power of the different underlying models by measuring and comparing aggregated death counts and death benefits.

Submitted 29 July, 2020; originally announced July 2020.

Comments: 25 pages, 7 Tables, 2 Figures

MSC Class: 62P05
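The contrast between symmetric and skewed links in the entry above can be sketched numerically. The logit link is symmetric around a probability of 0.5, whereas a Generalized Extreme Value (GEV) link approaches 0 and 1 at different rates. The GEV form below, p = 1 - exp(-(1 - xi * eta)_+^(-1/xi)), is one parameterization found in the literature and is an assumption here; the abstract does not give the authors' exact formula. With xi = 0 it reduces to the complementary log-log link.

```python
import math

def logit_inverse(eta):
    """Symmetric logistic response: p = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def gev_link_inverse(eta, xi):
    """Skewed GEV response (assumed form); xi = 0 gives complementary log-log."""
    if xi == 0.0:
        return 1.0 - math.exp(-math.exp(eta))
    base = 1.0 - xi * eta
    if base <= 0.0:
        # Outside the support of the GEV distribution the probability is at a boundary.
        return 1.0 if xi > 0.0 else 0.0
    return 1.0 - math.exp(-(base ** (-1.0 / xi)))
```

At `eta = 0` the logit gives 0.5 while this GEV form gives 1 - exp(-1) for any `xi`, and the shape parameter `xi` controls how quickly each tail is approached, which is the extra flexibility a rare-event response such as one-year mortality can exploit.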

arXiv:2007.03100

Cost-sensitive Multi-class AdaBoost for Understanding Driving Behavior with Telematics

Authors: Banghee So, Jean-Philippe Boucher, Emiliano A. Valdez

Abstract: Powered with telematics technology, insurers can now capture a wide range of data, such as distance traveled, how drivers brake, accelerate or make turns, and travel frequency each day of the week, to better decode drivers' behavior. Such additional information helps insurers improve risk assessments for usage-based insurance (UBI), an increasingly popular industry innovation. In this article, we explore how to integrate telematics information to better predict claims frequency. For motor insurance during a policy year, we typically observe a large proportion of drivers with zero claims, a smaller proportion with exactly one claim, and far fewer with two or more claims. We introduce the use of a cost-sensitive multi-class adaptive boosting (AdaBoost) algorithm, which we call SAMME.C2, to handle such imbalances. To calibrate the SAMME.C2 algorithm, we use empirical data collected from a telematics program in Canada and we find improved assessment of driving behavior with telematics relative to traditional risk variables. We demonstrate that our algorithm can outperform other models that can handle class imbalances: SAMME, SAMME with SMOTE, RUSBoost, and SMOTEBoost. The sampled data on telematics were observations during 2013-2016, of which 50,301 are used for training and another 21,574 for testing. Broadly speaking, the additional information derived from vehicle telematics helps refine risk classification of drivers of UBI.

Submitted 6 July, 2020; originally announced July 2020.

Comments: 27 pages, 9 figures, 10 tables

MSC Class: 62P05

arXiv:2006.06151

On a Multi-Year Microlevel Collective Risk Model

Authors: Rosy Oh, Himchan Jeong, Jae Youn Ahn, Emiliano A. Valdez

Abstract: For a typical insurance portfolio, the claims process for a short period, typically one year, is characterized by observing frequency of claims together with the associated claims severities. The collective risk model describes this portfolio as a random sum of the aggregation of the claim amounts. In the classical framework, for simplicity, the claim frequency and claim severities are assumed to be mutually independent. However, there is a growing interest in relaxing this independence assumption, which is more realistic and useful for practical insurance ratemaking. While the common thread has been capturing the dependence between frequency and aggregate severity within a single period, the work of Oh et al. (2020a) provides an interesting extension by additionally capturing dependence among individual severities. In this paper, we extend these works within a framework where we have a portfolio of microlevel frequencies and severities for multiple years. This allows us to develop a factor copula model framework that captures various types of dependence between claim frequencies and claim severities over multiple years. It is therefore a clear extension of earlier works on one-year dependent frequency-severity models and on random effects models for capturing serial dependence of claims. We focus on the results using a family of elliptical copulas to model the dependence. The paper further describes how to calibrate the proposed model using illustrative claims data arising from a Singapore insurance company. The estimated results provide strong evidence of all forms of dependencies captured by our model.

Submitted 10 June, 2020; originally announced June 2020.
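As a baseline for the entry above, the classical collective risk model treats the aggregate loss as a random sum S = X_1 + ... + X_N with frequency N independent of the severities X_i. Under that independence assumption, which is exactly the one the paper relaxes, the mean and variance of S follow from the standard formulas E[S] = E[N] E[X] and Var(S) = E[N] Var(X) + Var(N) E[X]^2. The sketch below encodes those two formulas; the numbers in the usage line are made up for illustration.

```python
def aggregate_moments(freq_mean, freq_var, sev_mean, sev_var):
    """Mean and variance of a random sum under frequency-severity independence."""
    mean = freq_mean * sev_mean
    variance = freq_mean * sev_var + freq_var * sev_mean ** 2
    return mean, variance

# Usage: Poisson frequency (mean = variance = 0.1) and severity with mean 2000, variance 1e6.
mean_loss, var_loss = aggregate_moments(0.1, 0.1, 2000.0, 1_000_000.0)
```

When frequency and severity are dependent, these formulas no longer hold in general, which is one motivation for dependence models such as the factor copula framework described in the abstract.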

arXiv:2006.05617

Hybrid Tree-based Models for Insurance Claims

Authors: Zhiyu Quan, Zhiguo Wang, Guojun Gan, Emiliano A. Valdez

Abstract: Two-part models and Tweedie generalized linear models (GLMs) have been used to model loss costs for short-term insurance contracts. For most portfolios of insurance claims, there is typically a large proportion of zero claims that leads to imbalances resulting in inferior prediction accuracy of these traditional approaches. This article proposes the use of tree-based models with a hybrid structure that involves a two-step algorithm as an alternative approach to these traditional models. The first step is the construction of a classification tree to build the probability model for frequency. In the second step, we employ elastic net regression models at each terminal node from the classification tree to build the distribution model for severity. This hybrid structure captures the benefits of tuning hyperparameters at each step of the algorithm; this allows for improved prediction accuracy, and tuning can be performed to meet specific business objectives. We examine and compare the predictive performance of such a hybrid tree-based structure in relation to the traditional Tweedie model using both real and synthetic datasets. Our empirical results show that these hybrid tree-based models produce more accurate predictions without the loss of intuitive interpretation.

Submitted 9 June, 2020; originally announced June 2020.

Comments: 24 pages, 6 figures

MSC Class: 62P05

arXiv:2004.08032

A non-convex regularization approach for stable estimation of loss development factors

Authors: Himchan Jeong, Hyunwoong Chang, Emiliano A. Valdez

Abstract: In this article, we apply non-convex regularization methods in order to obtain stable estimation of loss development factors in insurance claims reserving. Among the non-convex regularization methods, we focus on the use of the log-adjusted absolute deviation (LAAD) penalty and provide discussion on optimization of the LAAD penalized regression model, which we prove to converge with a coordinate descent algorithm under mild conditions. This has the advantage of obtaining a consistent estimator for the regression coefficients while allowing for variable selection, which is linked to the stable estimation of loss development factors. We calibrate our proposed model using a multi-line insurance dataset from a property and casualty insurer where we observed reported aggregate loss along accident years and development periods. When compared to other regression models, our LAAD penalized regression model provides very promising results.

Submitted 6 December, 2020; v1 submitted 16 April, 2020; originally announced April 2020.

Comments: 23 pages, 11 Tables, 6 Figures

MSC Class: 62P05
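For readers unfamiliar with the loss development factors mentioned in the entry above, the sketch below computes the classical volume-weighted (chain ladder) age-to-age factors from a cumulative loss triangle. This is the textbook estimator, shown only to make the quantity concrete; it is not the LAAD penalized regression proposed in the paper, and the triangle values are made up.

```python
def development_factors(triangle):
    """Volume-weighted age-to-age factors from a cumulative loss triangle.

    Each row is one accident year; later accident years have fewer observed periods.
    """
    n_periods = len(triangle[0])
    factors = []
    for j in range(n_periods - 1):
        # Use only accident years observed at both development period j and j + 1.
        rows = [row for row in triangle if len(row) > j + 1]
        numerator = sum(row[j + 1] for row in rows)
        denominator = sum(row[j] for row in rows)
        factors.append(numerator / denominator)
    return factors

# Usage: three accident years with cumulative reported losses (hypothetical values).
factors = development_factors([[100.0, 150.0, 165.0], [110.0, 170.0], [120.0]])
```

Later factors are estimated from fewer accident years, so they are the least stable, which is the kind of instability that regularized estimation aims to reduce.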

Showing 1–13 of 13 results for author: Valdez, E A