Search | arXiv e-print repository

Zero-Inflated Tweedie Boosted Trees with CatBoost for Insurance Loss Analytics

Abstract: In this paper, we explore advanced modifications to the Tweedie regression model in order to address its limitations in modeling aggregate claims for various types of insurance such as automobile, health, and liability. Traditional Tweedie models, while effective in capturing the probability and magnitude of claims, usually fall short in accurately representing the large incidence of zero claims. Our recommended approach involves a refined modeling of the zero-claim process, together with the integration of boosting methods, which leverage an iterative process to enhance predictive accuracy. Despite the inherent slowdown in learning algorithms due to this iteration, several efficient implementation techniques that also help precise tuning of parameters, such as XGBoost, LightGBM, and CatBoost, have emerged. Nonetheless, we chose to utilize CatBoost, an efficient boosting approach that effectively handles categorical and other special types of data. The core contribution of our paper is the assembly of separate modeling for zero claims and the application of tree-based boosting ensemble methods within a CatBoost framework, assuming that the inflated probability of zero is a function of the mean parameter. The efficacy of our enhanced Tweedie model is demonstrated through its application to an insurance telematics dataset, which presents the additional complexity of compositional feature variables. Our modeling results reveal a marked improvement in model performance, showcasing its potential to deliver more accurate predictions suitable for insurance claim analytics.

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2307.07771 [pdf, other]

Enhanced Gradient Boosting for Zero-Inflated Insurance Claims and Comparative Analysis of CatBoost, XGBoost, and LightGBM

Authors: Banghee So

Abstract: The property and casualty (P&C) insurance industry faces challenges in developing claim predictive models due to the highly right-skewed distribution of positive claims with excess zeros. To address this, actuarial science researchers have employed "zero-inflated" models that combine a traditional count model and a binary model. This paper investigates the use of boosting algorithms to process insurance claim data, including zero-inflated telematics data, to construct claim frequency models. Three popular gradient boosting libraries (XGBoost, LightGBM, and CatBoost) are evaluated and compared to determine the most suitable library for training insurance claim data and fitting actuarial frequency models. Through a comprehensive analysis of two distinct datasets, it is determined that CatBoost is the best for developing auto claim frequency models based on predictive performance. Furthermore, we propose a new zero-inflated Poisson boosted tree model, with variation in the assumption about the relationship between inflation probability $p$ and distribution mean $\mu$, and find that it outperforms others depending on data characteristics. This model enables us to take advantage of particular CatBoost tools, which makes it easier and more convenient to investigate the effects and interactions of various risk features on the frequency model when using telematics data.

Submitted 18 June, 2024; v1 submitted 15 July, 2023; originally announced July 2023.

Comments: 26 pages, 6 tables, 7 figures

arXiv:2304.13345 [pdf, ps, other]

Deformation and $K$-theoretic Index Formulae on Boundary Groupoids

Authors: Yu Qiao, Bing Kwan So

Abstract: Boundary groupoids, introduced by the second author, can be used to model many analysis problems on singular spaces. In order to investigate index theory on boundary groupoids, we introduce the notion of a deformation from the pair groupoid. Under the assumption that a deformation from the pair groupoid $M \times M$ exists for a Lie groupoid $\mathcal{G}\rightrightarrows M$, we construct explicitly a deformation index map relating the analytic index on $\mathcal{G}$ and the index on the pair groupoid. We apply this map to boundary groupoids of the form $\mathcal{G} = M_0 \times M_0 \sqcup G \times M_1 \times M_1 \rightrightarrows M=M_0\sqcup M_1$, where $G$ is an exponential Lie group, to obtain index formulae for (fully) elliptic (pseudo)-differential operators on $\mathcal{G}$, with the aid of the index formula by M. J. Pflaum, H. Posthuma, and X. Tang. These results recover and generalize our previous results for renormalizable boundary groupoids via the method of renormalized trace.

Submitted 26 April, 2023; originally announced April 2023.

MSC Class: 58J20; 46L80; 19K56; 58H15

arXiv:2205.13979

Analytic surgery and gluing of the Bismut-Lott torsion form and eta form

Authors: Bing Kwan So

Abstract: Given a fiber bundle with closed connected fibers, and a family of separating hypersurfaces, we study the behavior of the Bismut-Lott analytic torsion form, and the eta form for a duality bundle, under analytic surgery in the sense of Hassell, Mazzeo and Melrose. We find that under the surgery limit, the rescaled heat kernel is non-singular, while both the Bismut-Lott analytic torsion form and eta form can be written as the sum of a logarithmic term, which satisfies the Igusa additivity property, the b-Bismut-Lott analytic torsion form (respectively the b-eta form), and an error term coming from the reduced normal operator. Hence we obtain a gluing formula for these invariants.

Submitted 5 September, 2022; v1 submitted 27 May, 2022; originally announced May 2022.

Comments: There are some major errors in the paper, like clarifying the limit of the torsion and eta forms at the size of the underlying manifold going to infinity

MSC Class: 58J52; 58J28

arXiv:2202.01764 [pdf, ps, other]

JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension

Authors: ByungHoon So, Kyuhong Byun, Kyungwon Kang, Seong** Cho

Abstract: Question Answering (QA) is a task in which a machine understands a given document and a question to find an answer. Despite impressive progress in the NLP area, QA is still a challenging problem, especially for non-English languages due to the lack of annotated datasets. In this paper, we present the Japanese Question Answering Dataset, JaQuAD, which is annotated by humans. JaQuAD consists of 39,696 extractive question-answer pairs on Japanese Wikipedia articles. We fine-tuned a baseline model, which achieves an F1 score of 78.92% and an EM score of 63.38% on the test set. The dataset and our experiments are available at https://github.com/SkelterLabsInc/JaQuAD.

Submitted 3 February, 2022; originally announced February 2022.

Comments: 11 pages, 3 figures, 6 tables

arXiv:2112.14868 [pdf, other]

The SAMME.C2 algorithm for severely imbalanced multi-class classification

Authors: Banghee So, Emiliano A. Valdez

Abstract: Classification predictive modeling involves the accurate assignment of observations in a dataset to target classes or categories. There is a growing number of real-world classification problems with severely imbalanced class distributions. In this case, minority classes have far fewer observations to learn from than majority classes. Despite this sparsity, a minority class is often considered the more interesting class, yet developing a scientific learning algorithm suitable for the observations presents countless challenges. In this article, we suggest a novel multi-class classification algorithm specialized to handle severely imbalanced classes based on the method we refer to as SAMME.C2. It blends the flexible mechanics of the boosting techniques from the SAMME algorithm, a multi-class classifier, and the Ada.C2 algorithm, a cost-sensitive binary classifier designed to address high class imbalances. Not only do we provide the resulting algorithm, but we also establish a scientific and statistical formulation of our proposed SAMME.C2 algorithm. Through numerical experiments examining various degrees of classifier difficulty, we demonstrate consistently superior performance of our proposed model.

Submitted 29 December, 2021; originally announced December 2021.

Comments: 25 pages, 8 figures, algorithms

MSC Class: 62P99

arXiv:2108.04592 [pdf, ps, other]

Renormalized Index Formulas for Elliptic Differential Operators on Boundary Groupoids

Authors: Yu Qiao, Bing Kwan So

Abstract: We consider the index problem of certain boundary groupoids of the form $\mathcal{G} = M_0 \times M_0 \cup \mathbb{R}^q \times M_1 \times M_1$. Since it has been shown that for the case that $q \geq 3$ is odd, $K_0 (C^* (\mathcal{G})) \cong \mathbb{Z}$, and moreover the $K$-theoretic index coincides with the Fredholm index, we attempt in this paper to derive a numerical formula for elliptic differential operators on $\mathcal{G}$. Our approach is similar to that of the renormalized trace of Moroianu and Nistor \cite{Nistor;Hom2}. However, we find that when $q \geq 3$, the eta term vanishes, and hence the $K$-theoretic and Fredholm indices of elliptic (respectively fully elliptic) pseudo-differential operators on these groupoids are given only by the Atiyah-Singer term. As for the $q=1$ case, we find that the result depends on how the singularity set $M_1$ lies in $M$.

Submitted 22 October, 2021; v1 submitted 10 August, 2021; originally announced August 2021.

Comments: 20 pages. arXiv admin note: substantial text overlap with arXiv:1804.10426

MSC Class: 58J20 (primary); 46L80, 19K56 (secondary)

arXiv:2102.00252 [pdf, other]

Synthetic Dataset Generation of Driver Telematics

Authors: Banghee So, Jean-Philippe Boucher, Emiliano A. Valdez

Abstract: This article describes techniques employed in the production of a synthetic dataset of driver telematics emulated from a similar real insurance dataset. The synthetic dataset generated has 100,000 policies that include observations about drivers' claims experience together with associated classical risk variables and telematics-related variables. This work aims to produce a resource that can be used to advance models to assess risks for usage-based insurance. It follows a three-stage process using machine learning algorithms. The first stage is simulating values for the number of claims as multiple binary classifications applying feedforward neural networks. The second stage is simulating values for the aggregated amount of claims as regression using feedforward neural networks, with the number of claims included in the set of feature variables. In the final stage, a synthetic portfolio of the space of feature variables is generated applying an extended $\texttt{SMOTE}$ algorithm. The resulting dataset is evaluated by comparing the synthetic and real datasets when Poisson and gamma regression models are fitted to the respective data. Other visualization and data summarization produce remarkably similar statistics between the two datasets. We hope that researchers interested in obtaining telematics datasets to calibrate models or learning algorithms will find our work valuable.

Submitted 30 January, 2021; originally announced February 2021.

Comments: 24 pages, 11 figures, 6 tables

MSC Class: 62P05

arXiv:2007.03100 [pdf, other]

Cost-sensitive Multi-class AdaBoost for Understanding Driving Behavior with Telematics

Authors: Banghee So, Jean-Philippe Boucher, Emiliano A. Valdez

Abstract: Powered with telematics technology, insurers can now capture a wide range of data, such as distance traveled, how drivers brake, accelerate or make turns, and travel frequency each day of the week, to better decode drivers' behavior. Such additional information helps insurers improve risk assessments for usage-based insurance (UBI), an increasingly popular industry innovation. In this article, we explore how to integrate telematics information to better predict claims frequency. For motor insurance during a policy year, we typically observe a large proportion of drivers with zero claims, a smaller proportion with exactly one claim, and far fewer with two or more claims. We introduce the use of a cost-sensitive multi-class adaptive boosting (AdaBoost) algorithm, which we call SAMME.C2, to handle such imbalances. To calibrate the SAMME.C2 algorithm, we use empirical data collected from a telematics program in Canada, and we find improved assessment of driving behavior with telematics relative to traditional risk variables. We demonstrate that our algorithm can outperform other models that can handle class imbalances: SAMME, SAMME with SMOTE, RUSBoost, and SMOTEBoost. The sampled data on telematics were observations during 2013-2016, of which 50,301 are used for training and another 21,574 for testing. Broadly speaking, the additional information derived from vehicle telematics helps refine risk classification of drivers of UBI.

Submitted 6 July, 2020; originally announced July 2020.

Comments: 27 pages, 9 figures, 10 tables

MSC Class: 62P05

arXiv:1804.10426 [pdf, ps, other]

K-theory and index formulas for boundary groupoid C*-algebras

Authors: Bing Kwan So

Abstract: We compute explicitly the K-groups of some boundary groupoid C*-algebras with exponential isotropy subgroups. Then we derive index formulas that compute the K-theoretic and Fredholm indexes of elliptic (respectively totally elliptic) pseudo-differential operators on these groupoids.

Submitted 27 April, 2018; originally announced April 2018.

arXiv:1701.04513 [pdf, ps, other]

Non-commutative analytic torsion form on the transformation groupoid convolution algebra

Authors: Bing Kwan So, GuangXiang Su

Abstract: Given a fiber bundle $Z \to M \to B$ and a flat vector bundle $E \to M$ with a compatible action of a discrete group $G$, and regarding $B / G$ as the non-commutative space corresponding to the crossed product algebra, we construct an analytic torsion form as a non-commutative de Rham differential form. We show that our construction is well defined under the weaker assumption of positive Novikov-Shubin invariant. We prove that this torsion form appears in a transgression formula, from which a non-commutative Riemann-Roch-Grothendieck index formula follows.

Submitted 17 January, 2017; v1 submitted 16 January, 2017; originally announced January 2017.

arXiv:1405.4631 [pdf, ps, other]

Regularity of analytic torsion form on families of normal coverings

Authors: Bing Kwan So, GuangXiang Su

Abstract: We prove the smoothness of the L^2-analytic torsion form on some fiber bundles with non-compact fibers of positive Novikov-Shubin invariant. We do so by generalizing the arguments of Azzali-Goette-Schick to an appropriate Sobolev space, and proving that the Novikov-Shubin invariant remains positive in the Sobolev settings, using an argument of Alvarez Lopez-Kordyukov.

Submitted 21 November, 2016; v1 submitted 19 May, 2014; originally announced May 2014.

Comments: Major revision 21 Nov 2016

arXiv:1210.4729 [pdf, ps, other]

Exponential coordinates and regularity of groupoid heat kernels

Authors: Bing Kwan So

Abstract: We prove that on an asymptotically Euclidean boundary groupoid, the heat kernel of the Laplacian is a smooth groupoid pseudo-differential operator.

Submitted 17 October, 2012; originally announced October 2012.

MSC Class: 58H05; 58J05; 35S05

arXiv:1111.7274 [pdf, ps, other]

On the full calculus of pseudo-differential operators on boundary groupoids with polynomial growth

Authors: Bing Kwan So

Abstract: In this paper, we enlarge the space of uniformly supported pseudo-differential operators on some groupoids by considering kernels satisfying certain asymptotic estimates. We show that this enlarged space contains the compact parametrix, and the generalized inverse of uniformly supported operators with Fredholm vector representation.

Submitted 26 February, 2013; v1 submitted 30 November, 2011; originally announced November 2011.

Comments: v3: final version; v2: corrected 4.12 and proof of 4.11

MSC Class: 58H05; 58J05; 35S05

Journal ref: Advances in Mathematics 237 (2013) 1-32

arXiv:1006.5623 [pdf, other]

Pseudo-differential operators, heat calculus and index theory of groupoids satisfying the Lauter-Nistor condition

Authors: Bing Kwan So

Abstract: In this thesis, we study singular pseudo-differential operators defined by groupoids satisfying the Lauter-Nistor condition, by a method parallel to that of manifolds with boundary and edge differential operators. The example of the Bruhat sphere is studied in detail. In particular, we construct an extension to the calculus of uniformly supported pseudo-differential operators that is analogous to the calculus with bounds defined on manifolds with boundary. We derive a Fredholmness criterion for operators on the Bruhat sphere, and prove that their parametrices up to compact operators lie inside the extended calculus; we construct the heat kernel of perturbed Laplacian operators; and prove an Atiyah-Singer type renormalized index formula for perturbed Dirac operators on the Bruhat sphere using the heat kernel method.

Submitted 29 June, 2010; originally announced June 2010.

Comments: Warwick PhD Thesis

Showing 1–15 of 15 results for author: So, B