-
Causal Inference for Banking, Finance, and Insurance: A Survey
Authors:
Satyam Kumar,
Yelleti Vivek,
Vadlamani Ravi,
Indranil Bose
Abstract:
Causal inference plays a significant role in explaining the decisions taken by statistical models and artificial intelligence models. Of late, this field has started attracting the attention of researchers and practitioners alike. This paper presents a comprehensive survey of 37 papers, published during 1992-2023, on the application of causal inference to banking, finance, and insurance. The papers are categorized according to the following families of domains: (i) banking, (ii) finance and its subdomains, such as corporate finance, governance finance (including financial risk and financial policy), financial economics, and behavioral finance, and (iii) insurance. Further, the paper covers the primary ingredients of causal inference, namely statistical methods such as Bayesian causal networks and Granger causality, and the associated terminology, such as counterfactuals. The review also recommends some important directions for future research. In conclusion, we observed that the application of causal inference in the banking and insurance sectors is still in its infancy, leaving room for further research to turn it into a viable method.
Submitted 31 July, 2023;
originally announced July 2023.
-
FedPNN: One-shot Federated Classification via Evolving Clustering Method and Probabilistic Neural Network hybrid
Authors:
Polaki Durga Prasad,
Yelleti Vivek,
Vadlamani Ravi
Abstract:
Protecting data privacy is paramount in fields such as finance, banking, and healthcare. Federated Learning (FL) has attracted widespread attention due to its decentralized, distributed training and its ability to protect privacy while obtaining a global shared model. However, FL presents challenges such as communication overhead and limited resource capability. This motivated us to propose a two-stage federated learning approach aimed at privacy protection, a first-of-its-kind study, as follows: (i) in the first stage, a synthetic dataset is generated by employing two different distributions as noise to the vanilla conditional tabular generative adversarial neural network (CTGAN), resulting in a modified CTGAN, and (ii) in the second stage, the Federated Probabilistic Neural Network (FedPNN) is developed and employed for building a globally shared classification model. We also employed synthetic dataset metrics to check the quality of the generated synthetic dataset. Further, we proposed a meta-clustering algorithm whereby the cluster centers obtained from the clients are clustered at the server for training the global model. Although PNN is a one-pass learning classifier, its complexity depends on the training data size. Therefore, we employed a modified evolving clustering method (ECM), another one-pass algorithm, to cluster the training data, thereby increasing the speed further. Moreover, we conducted a sensitivity analysis by varying Dthr, a hyperparameter of ECM, at the server and at the client, one at a time. The effectiveness of our approach is validated on four finance and medical datasets.
Submitted 8 April, 2023;
originally announced April 2023.
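The FedPNN abstract notes that a probabilistic neural network is trained in one pass but that its cost grows with the number of stored training patterns. A minimal sketch of a generic Parzen-window PNN may make this concrete. It is not the authors' FedPNN: the function name `pnn_predict`, the Gaussian kernel, and the smoothing parameter `sigma` are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def pnn_predict(patterns, labels, x, sigma=0.5):
    """Classify x with a generic Gaussian-kernel PNN (illustrative sketch).

    "Training" is a single pass that simply stores the patterns, so each
    prediction costs one kernel evaluation per stored pattern.
    """
    kernel_sums = defaultdict(float)
    counts = defaultdict(int)
    for pattern, label in zip(patterns, labels):
        sq_dist = sum((a - b) ** 2 for a, b in zip(pattern, x))
        kernel_sums[label] += math.exp(-sq_dist / (2.0 * sigma ** 2))
        counts[label] += 1
    # Average kernel response per class; the largest average wins.
    scores = {label: kernel_sums[label] / counts[label] for label in kernel_sums}
    return max(scores, key=scores.get)


# Toy data (made up): two well-separated classes.
toy_patterns = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
toy_labels = [0, 0, 1, 1]
print(pnn_predict(toy_patterns, toy_labels, [0.2, 0.3]))
```

Because every stored pattern is visited at prediction time, passing a small set of cluster centers as `patterns` instead of the raw samples reduces the work proportionally, which is the motivation the abstract gives for clustering first.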
-
ATM Fraud Detection using Streaming Data Analytics
Authors:
Yelleti Vivek,
Vadlamani Ravi,
Abhay Anand Mane,
Laveti Ramesh Naidu
Abstract:
Gaining the trust and confidence of customers is essential to the growth and success of financial institutions and organizations. Of late, the financial industry has been significantly impacted by numerous instances of fraudulent activity. Further, owing to the generation of voluminous datasets, it is essential that the underlying framework is scalable and meets real-time needs. To address this issue, in this study, we proposed ATM fraud detection in both static and streaming contexts. In the static context, we investigated a parallel and scalable machine learning approach for ATM fraud detection that is built on Spark and trained with a variety of machine learning (ML) models, including Naive Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Gradient Boosting Tree (GBT), and Multi-layer Perceptron (MLP). We also employed several balancing techniques, such as the Synthetic Minority Oversampling Technique (SMOTE) and its variants and Generative Adversarial Networks (GAN), to address the rarity of fraudulent transactions in the dataset. In the streaming context, we proposed a sliding-window-based method that collects the ATM transactions performed within a specified time interval and then uses them to train several ML models, including NB, RF, DT, and K-Nearest Neighbour (KNN). We selected these models for their lower model complexity and quicker response time. In both contexts, RF turned out to be the best model. RF obtained the best mean AUC of 0.975 in the static context and a mean AUC of 0.910 in the streaming context. RF was also empirically shown to be statistically significantly better than the next-best performing models.
Submitted 8 March, 2023;
originally announced March 2023.
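The sliding-window idea in the streaming part of this abstract, collecting the transactions that fall within a specified time interval before training on them, can be sketched in plain Python. This is only an illustration under stated assumptions: the class name `SlidingWindow`, the use of numeric timestamps in seconds, and in-order arrival are all hypothetical, and the paper itself works on Spark rather than an in-memory deque.

```python
from collections import deque


class SlidingWindow:
    """Keep only the records whose timestamp lies within the last `width` seconds."""

    def __init__(self, width):
        self.width = width
        self.buffer = deque()  # (timestamp, record) pairs, oldest first

    def add(self, timestamp, record):
        # Assumes records arrive in non-decreasing timestamp order.
        self.buffer.append((timestamp, record))
        # Evict everything that has fallen out of the window.
        while self.buffer and self.buffer[0][0] <= timestamp - self.width:
            self.buffer.popleft()

    def contents(self):
        """Records currently in the window, e.g. to refit a lightweight model."""
        return [record for _, record in self.buffer]


window = SlidingWindow(width=10)
for ts, txn in [(0, "a"), (5, "b"), (12, "c"), (15, "d")]:
    window.add(ts, txn)
print(window.contents())
```

Each time the window advances, `contents()` would be handed to whichever model is being retrained; keeping the window short is one reason low-complexity models are attractive in this setting.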
-
Chaotic Variational Auto encoder-based Adversarial Machine Learning
Authors:
Pavan Venkata Sainadh Reddy,
Yelleti Vivek,
Gopi Pranay,
Vadlamani Ravi
Abstract:
Machine Learning (ML) has become the new contrivance in almost every field. This makes ML models a target for fraudsters through various adversarial attacks, which hinder the models' performance. Evasion and data-poisoning attacks are well recognized, especially in finance, healthcare, etc. This motivated us to propose a novel, computationally less expensive attack mechanism based on adversarial sample generation by a Variational Auto Encoder (VAE). It is well known that the Wavelet Neural Network (WNN) is considered computationally efficient in solving image and audio processing, speech recognition, and time-series forecasting problems. This paper proposed the VAE-Deep-Wavelet Neural Network (VAE-Deep-WNN), where the encoder and decoder employ WNN networks. Further, we proposed chaotic variants of both VAE with Multi-layer Perceptron (MLP) and Deep-WNN and named them C-VAE-MLP and C-VAE-Deep-WNN, respectively. Here, we employed a logistic map to generate random noise in the latent space. In this paper, we performed VAE-based adversarial sample generation and applied it to various finance and cybersecurity problems such as loan default, credit card fraud, and churn modelling. We performed both evasion and data-poisoning attacks on Logistic Regression (LR) and Decision Tree (DT) models. The results indicated that VAE-Deep-WNN outperformed the rest in the majority of the datasets and models. However, its chaotic variant C-VAE-Deep-WNN performed almost similarly to VAE-Deep-WNN in the majority of the datasets.
Submitted 24 February, 2023;
originally announced February 2023.
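The abstract states that a logistic map generates the noise in the latent space. The logistic map itself is the standard recurrence x_{n+1} = r x_n (1 - x_n). A minimal sketch follows; the seed `x0`, the choice r = 4, and the helper `chaotic_latent` (which plugs the sequence into a reparameterization-style sample z = mu + sigma * eps) are illustrative assumptions, since the abstract does not say how the authors scale or inject the noise.

```python
def logistic_map_sequence(n, x0=0.7, r=4.0):
    """Return n iterates of the logistic map x <- r * x * (1 - x)."""
    values = []
    x = x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        values.append(x)
    return values


def chaotic_latent(mu, sigma, noise):
    """Hypothetical helper: use chaotic values in place of Gaussian noise."""
    return [m + s * e for m, s, e in zip(mu, sigma, noise)]


noise = logistic_map_sequence(3)
z = chaotic_latent(mu=[0.0, 0.0, 0.0], sigma=[1.0, 1.0, 1.0], noise=noise)
print(noise[0], z)
```

Unlike a pseudo-random Gaussian draw, the sequence is fully determined by `x0` and `r`, so the same seed always reproduces the same latent perturbations.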
-
Chaotic Variational Auto Encoder based One Class Classifier for Insurance Fraud Detection
Authors:
K. S. N. V. K. Gangadhar,
B. Akhil Kumar,
Yelleti Vivek,
Vadlamani Ravi
Abstract:
Of late, insurance fraud detection has assumed immense significance owing to the huge financial and reputational losses fraud entails and the phenomenal success of fraud detection techniques. Insurance is broadly divided into two categories: (i) life and (ii) non-life. Non-life insurance in turn includes health insurance and auto insurance, among other things. In either category, fraud detection techniques should be designed to capture as many fraudulent transactions as possible. Owing to the rarity of fraudulent transactions, in this paper, we propose a chaotic variational autoencoder (C-VAE) to perform one-class classification (OCC) on genuine transactions. Here, we employed the logistic chaotic map to generate random noise in the latent space. The effectiveness of C-VAE is demonstrated on health insurance fraud and auto insurance datasets. We considered the vanilla Variational Auto Encoder (VAE) as the baseline. It is observed that C-VAE outperformed VAE on both datasets. C-VAE achieved classification rates of 77.9% and 87.25% on the health and automobile insurance datasets, respectively. Further, a t-test conducted at the 1% level of significance with 18 degrees of freedom indicates that C-VAE is statistically significantly better than VAE.
Submitted 15 December, 2022;
originally announced December 2022.
-
Explainable Artificial Intelligence and Causal Inference based ATM Fraud Detection
Authors:
Yelleti Vivek,
Vadlamani Ravi,
Abhay Anand Mane,
Laveti Ramesh Naidu
Abstract:
Gaining the trust of customers and showing them empathy are critical in the financial domain. Frequent occurrence of fraudulent activities affects both factors. Hence, financial organizations and banks must take utmost care to mitigate such activities. Among them, fraudulent ATM transactions are a common problem faced by banks. The following are the critical challenges involved in fraud datasets: the dataset is highly imbalanced, the fraud pattern keeps changing, etc. Owing to the rarity of fraudulent activities, fraud detection can be formulated as either a binary classification problem or a one-class classification (OCC) problem. In this study, we applied both formulations to an ATM transactions dataset collected from India. In binary classification, we investigated the effectiveness of various oversampling techniques, such as the Synthetic Minority Oversampling Technique (SMOTE) and its variants and Generative Adversarial Networks (GAN). Further, we employed various machine learning techniques, viz., Naive Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Gradient Boosting Tree (GBT), and Multi-layer Perceptron (MLP). GBT outperformed the rest of the models by achieving 0.963 AUC, and DT stood second with 0.958 AUC. DT is the winner if complexity and interpretability are considered. Among all the oversampling approaches, SMOTE and its variants were observed to perform better. In OCC, IForest attained 0.959 CR, and OCSVM secured second place with 0.947 CR. Further, we incorporated explainable artificial intelligence (XAI) and causal inference (CI) in the fraud detection framework and studied it through various analyses.
Submitted 19 November, 2022;
originally announced November 2022.
-
Parallel and Streaming Wavelet Neural Networks for Classification and Regression under Apache Spark
Authors:
Eduru Harindra Venkatesh,
Yelleti Vivek,
Vadlamani Ravi,
Orsu Shiva Shankar
Abstract:
Wavelet neural networks (WNN) have been applied in many fields to solve regression as well as classification problems. With the advent of big data, as data gets generated at a brisk pace, it is imperative to analyze it as soon as it is generated, because the nature of the data may change dramatically in short time intervals. This is necessitated by the fact that big data is all-pervasive and poses computational challenges for data scientists. Therefore, in this paper, we built an efficient Scalable, Parallelized Wavelet Neural Network (SPWNN), which employs the parallel stochastic gradient descent (SGD) algorithm. SPWNN is designed and developed under both static and streaming environments in the horizontal parallelization framework. SPWNN is implemented using Morlet and Gaussian functions as activation functions. This study is conducted on big datasets such as gas sensor data, which has more than 4 million samples, and medical research data, which has more than 10,000 features and is thus high-dimensional. The experimental analysis indicates that in the static environment, SPWNN with the Morlet activation function outperformed SPWNN with Gaussian on the classification datasets. However, in the case of regression, the opposite was observed. In contrast, in the streaming environment, Gaussian outperformed Morlet on the classification datasets and Morlet outperformed Gaussian on the regression datasets. Overall, the proposed SPWNN architecture achieved a speedup of 1.32-1.40.
Submitted 7 September, 2022;
originally announced September 2022.
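The abstract names Morlet and Gaussian functions as the two wavelet activations. The abstract does not give their exact forms, so the sketch below uses forms that commonly appear in the WNN literature, cos(1.75 t) exp(-t^2 / 2) for Morlet and exp(-t^2) for Gaussian, as an assumption. The helper `wnn_hidden_output` and its parameter names (`translation`, `dilation`) are likewise hypothetical.

```python
import math


def morlet(t):
    """A commonly used Morlet wavelet form (assumed, not taken from the paper)."""
    return math.cos(1.75 * t) * math.exp(-0.5 * t * t)


def gaussian(t):
    """A commonly used Gaussian wavelet form (assumed, not taken from the paper)."""
    return math.exp(-t * t)


def wnn_hidden_output(x, weights, translation, dilation, wavelet):
    """Output of one wavelet hidden node: wavelet((w . x - translation) / dilation)."""
    z = (sum(w * xi for w, xi in zip(weights, x)) - translation) / dilation
    return wavelet(z)


sample = [1.0, 2.0]
print(wnn_hidden_output(sample, [0.5, 0.25], 1.0, 2.0, morlet))
print(wnn_hidden_output(sample, [0.5, 0.25], 1.0, 2.0, gaussian))
```

Passing the wavelet as an argument mirrors the comparison in the abstract: the same network is run once with each activation and the results are compared per task.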
-
Scalable mRMR feature selection to handle high dimensional datasets: Vertical partitioning based Iterative MapReduce framework
Authors:
Yelleti Vivek,
P. S. V. S. Sai Prasad
Abstract:
While building machine learning models, feature selection (FS) stands out as an essential preprocessing step used to handle the uncertainty and vagueness in the data. Recently, the minimum Redundancy and Maximum Relevance (mRMR) approach has proven effective in obtaining an irredundant feature subset. Owing to the generation of voluminous datasets, it is essential to design scalable solutions using distributed/parallel paradigms. MapReduce has proven to be one of the best approaches to designing fault-tolerant and scalable solutions. This work analyses the existing MapReduce approaches for mRMR feature selection and identifies their limitations. In the current study, we proposed VMR_mRMR, an efficient vertical-partitioning-based approach using a memorization approach, thereby overcoming the limitations of extant approaches. The experimental analysis shows that VMR_mRMR significantly outperformed extant approaches and achieved a better computational gain (C.G). In addition, we conducted a comparative analysis with the horizontal partitioning approach HMR_mRMR [1] to assess the strengths and limitations of the proposed approach.
Submitted 21 August, 2022;
originally announced August 2022.
-
Parallel bi-objective evolutionary algorithms for scalable feature subset selection via migration strategy under Spark
Authors:
Yelleti Vivek,
Vadlamani Ravi,
P. Radha Krishna
Abstract:
Feature subset selection (FSS) for classification is inherently a bi-objective optimization problem, where the task is to obtain a feature subset that yields the maximum possible area under the receiver operating characteristic curve (AUC) with the minimum cardinality of the feature subset. In today's world, a humongous amount of data is generated in all human activities. To mine such voluminous data, which is often high-dimensional, there is a need to develop parallel and scalable frameworks. In this first-of-its-kind study, we propose and develop an iterative MapReduce-based framework for bi-objective evolutionary algorithm (EA) based wrappers under Apache Spark with a migration strategy. To accomplish this, we parallelized the non-dominated-sorting-based algorithms, namely the non-dominated sorting genetic algorithm II (NSGA-II) and non-dominated sorting particle swarm optimization (NSPSO), as well as the decomposition-based algorithm, namely the multi-objective evolutionary algorithm based on decomposition (MOEA-D), and named them P-NSGA-II-IS, P-NSPSO-IS, and P-MOEA-D-IS, respectively. We proposed a modified MOEA-D by incorporating the non-dominated sorting principle while parallelizing it. Throughout the study, AUC is computed by logistic regression (LR). We test the effectiveness of the proposed methodology on various datasets. It is noteworthy that P-NSGA-II turns out to be statistically significant, ranking in the top two positions on most datasets. We also reported the empirical attainment plots, speedup analysis, the mean AUC obtained by the most repeated feature subset and by the least cardinal feature subset with the highest AUC, and a diversity analysis using hypervolume.
Submitted 19 May, 2022;
originally announced May 2022.
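The bi-objective formulation in this abstract (maximize AUC, minimize subset cardinality) rests on Pareto dominance, the relation that non-dominated sorting is built on. A minimal sketch of that relation and of extracting the first non-dominated front is given below. The function names and the toy (AUC, cardinality) pairs are made up for illustration; this is not the parallel P-NSGA-II-IS procedure itself.

```python
def dominates(a, b):
    """True if solution a dominates b, where each is (auc, cardinality).

    AUC is maximized and cardinality is minimized.
    """
    auc_a, size_a = a
    auc_b, size_b = b
    no_worse = auc_a >= auc_b and size_a <= size_b
    strictly_better = auc_a > auc_b or size_a < size_b
    return no_worse and strictly_better


def pareto_front(solutions):
    """Return the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]


# Hypothetical feature subsets summarized as (AUC, number of features).
candidates = [(0.90, 5), (0.85, 3), (0.80, 4), (0.90, 7), (0.70, 1)]
print(pareto_front(candidates))
```

Here (0.80, 4) drops out because (0.85, 3) is both more accurate and smaller, and (0.90, 7) drops out because (0.90, 5) matches its AUC with fewer features; the remaining subsets are mutually incomparable trade-offs.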
-
Feature subset selection for Big Data via Chaotic Binary Differential Evolution under Apache Spark
Authors:
Yelleti Vivek,
Vadlamani Ravi,
P. Radhakrishna
Abstract:
Feature subset selection (FSS) using a wrapper approach is essentially a combinatorial optimization problem with two objective functions, namely the cardinality of the selected feature subset, which should be minimized, and the corresponding area under the ROC curve (AUC), which should be maximized. In this research study, we propose a novel multiplicative single objective function involving cardinality and AUC. The randomness involved in Binary Differential Evolution (BDE) may yield less diverse solutions, causing the search to get trapped in local minima. Hence, we embed Logistic and Tent chaotic maps into BDE and name the result Chaotic Binary Differential Evolution (CBDE). Designing a scalable solution to FSS is critical when dealing with high-dimensional and voluminous datasets. Hence, we propose a scalable island (iS) based parallelization approach in which the data is divided into multiple partitions/islands, so that solutions evolve individually and are eventually combined through a migration strategy. The results empirically show that the proposed Parallel Chaotic Binary Differential Evolution (P-CBDE-iS) is able to find better-quality feature subsets than Parallel Binary Differential Evolution (P-BDE-iS). Logistic Regression (LR) is used as the classifier owing to its simplicity and power. The speedup attained by the proposed parallel approach signifies its importance.
Submitted 8 February, 2022;
originally announced February 2022.
-
Scalable Feature Subset Selection for Big Data using Parallel Hybrid Evolutionary Algorithm based Wrapper in Apache Spark
Authors:
Yelleti Vivek,
Vadlamani Ravi,
Pisipati Radhakrishna
Abstract:
Owing to the emergence of large datasets, applying current sequential wrapper-based feature subset selection (FSS) algorithms increases the complexity. This limitation motivated us to propose a wrapper for FSS based on parallel and distributed hybrid evolutionary algorithms (EAs) under the Apache Spark environment. The hybrid EAs are based on binary differential evolution (BDE) and Binary Threshold Accepting (BTA), a point-based EA that is invoked to enhance the search capability and avoid premature convergence of the parallel binary differential evolution (PB-DE). Thus, we designed the hybrid variants (i) parallel binary differential evolution and threshold accepting (PB-DETA), where DE and TA work in tandem in every iteration, and (ii) parallel binary threshold accepting and differential evolution (PB-TADE), where TA and DE work in tandem in every iteration, under the Apache Spark environment. Both PB-DETA and PB-TADE are compared with the baseline, viz., PB-DE. All three proposed approaches use logistic regression (LR) to compute the fitness function, namely, the area under the ROC curve (AUC). The effectiveness of the proposed algorithms is tested on five large datasets of varying feature space dimension, taken from the cyber security and biology domains. It is noteworthy that PB-TADE turned out to be statistically significant compared to PB-DE and PB-DETA. We reported the speedup analysis, the average AUC obtained by the most repeated feature subset, and the feature subset with high AUC and least cardinality.
Submitted 25 January, 2022; v1 submitted 26 June, 2021;
originally announced June 2021.
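Threshold accepting, the point-based search that this abstract pairs with differential evolution, accepts a neighbouring solution whenever it is not worse than the current one by more than a threshold, and the threshold shrinks over time. A minimal single-machine sketch on binary feature masks follows. It is a generic illustration, not the authors' PB-DETA or PB-TADE: the single-bit-flip neighbourhood, the geometric `decay`, and the toy `matching_bits_fitness` (standing in for an LR-based AUC) are all assumptions.

```python
import random


def binary_threshold_accepting(fitness, n_bits, iters=200,
                               threshold=0.05, decay=0.98, seed=0):
    """Maximize `fitness` over binary masks with threshold accepting."""
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(n_bits)]
    current_fit = fitness(current)
    best, best_fit = current[:], current_fit
    for _ in range(iters):
        # Neighbour: flip one randomly chosen bit (assumed neighbourhood).
        candidate = current[:]
        j = rng.randrange(n_bits)
        candidate[j] = 1 - candidate[j]
        cand_fit = fitness(candidate)
        # Accept if the loss in fitness stays below the current threshold.
        if current_fit - cand_fit < threshold:
            current, current_fit = candidate, cand_fit
            if current_fit > best_fit:
                best, best_fit = current[:], current_fit
        threshold *= decay  # tolerate less deterioration as the search proceeds
    return best, best_fit


TARGET = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up "ideal" feature mask


def matching_bits_fitness(mask):
    """Toy stand-in for a wrapper fitness: fraction of bits matching TARGET."""
    return sum(int(m == t) for m, t in zip(mask, TARGET)) / len(TARGET)


mask, score = binary_threshold_accepting(matching_bits_fitness, n_bits=len(TARGET))
print(mask, score)
```

In a wrapper setting the fitness call is the expensive step, since each evaluation trains and scores a classifier on the selected features, which is why the abstract emphasizes parallel execution.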