-
fakenewsbr: A Fake News Detection Platform for Brazilian Portuguese
Authors:
Luiz Giordani,
Gilsiley Darú,
Rhenan Queiroz,
Vitor Buzinaro,
Davi Keglevich Neiva,
Daniel Camilo Fuentes Guzmán,
Marcos Jardel Henriques,
Oilson Alberto Gonzatto Junior,
Francisco Louzada
Abstract:
The proliferation of fake news has become a significant concern in recent times due to its potential to spread misinformation and manipulate public opinion. This paper presents a comprehensive study on detecting fake news in Brazilian Portuguese, focusing on journalistic-type news. We propose a machine learning-based approach that leverages natural language processing techniques, including TF-IDF and Word2Vec, to extract features from textual data. We evaluate the performance of various classification algorithms, such as logistic regression, support vector machine, random forest, AdaBoost, and LightGBM, on a dataset containing both true and fake news articles. The proposed approach achieves high accuracy and F1-Score, demonstrating its effectiveness in identifying fake news. Additionally, we developed a user-friendly web platform, fakenewsbr.com, to facilitate the verification of news articles' veracity. Our platform provides real-time analysis, allowing users to assess the likelihood of fake news articles. Through empirical analysis and comparative studies, we demonstrate the potential of our approach to contribute to the fight against the spread of fake news and promote more informed media consumption.
Submitted 20 September, 2023; v1 submitted 20 September, 2023;
originally announced September 2023.
-
Generalizing the normality: a novel towards different estimation methods for skewed information
Authors:
Diego C Nascimento,
Pedro Luiz Ramos,
David Elal-Olivero,
Milton Cortes-Araya,
Francisco Louzada
Abstract:
Normality is the mathematical assumption most often used in data modeling. Nonetheless, even when motivated by the law of large numbers (LLN), normality is a strong assumption, given that asymmetry and multi-modality are expected in real-world problems. Thus, a flexible modification of the Normal distribution proposed by Elal-Olivero [12] adds a skewness parameter, yielding the Alpha-skew Normal (ASN) distribution, which accommodates bimodality and fat tails if needed, although this third parameter (beyond location and scale) is sometimes not trivial to estimate. This work analyzes seven different statistical inference methods for the ASN distribution on synthetic data and on historical water-flux data from 21 rivers (channels) in the Atacama region. Moreover, the contribution of this paper relates to probability estimation for river flux levels in the neighborhood of Copiapo city, the most important economic city of the third Chilean region, which is known to be located in one of the driest areas on Earth, besides the North and South Poles.
Submitted 30 April, 2021;
originally announced May 2021.
-
Asymptotic properties of generalized closed-form maximum likelihood estimators
Authors:
Pedro L. Ramos,
Eduardo Ramos,
Francisco A. Rodrigues,
Francisco Louzada
Abstract:
The maximum likelihood estimator (MLE) is pivotal in statistical inference, yet its application is often hindered by the absence of closed-form solutions for many models. This poses challenges in real-time computation scenarios, particularly within embedded systems technology, where numerical methods are impractical. This study introduces a generalized form of the MLE that yields closed-form estimators under certain conditions. We derive the asymptotic properties of the proposed estimator and demonstrate that our approach retains key properties such as invariance under one-to-one transformations, strong consistency, and an asymptotic normal distribution. The effectiveness of the generalized MLE is exemplified through its application to the Gamma, Nakagami, and Beta distributions, showcasing improvements over the traditional MLE. Additionally, we extend this methodology to a bivariate gamma distribution, successfully deriving closed-form estimators. This advancement presents significant implications for real-time statistical analysis across various applications.
Submitted 14 November, 2023; v1 submitted 15 February, 2021;
originally announced February 2021.
-
Objective Bayesian Analysis for the Differential Entropy of the Gamma Distribution
Authors:
Eduardo Ramos,
Osafu A. Egbon,
Pedro L. Ramos,
Francisco A. Rodrigues,
Francisco Louzada
Abstract:
The present paper introduces a fully objective Bayesian analysis to obtain the posterior distribution of an entropy measure. Notably, we consider the gamma distribution, which describes many natural phenomena in physics, engineering, and biology. We reparametrize the model in terms of entropy, and different objective priors are derived, such as Jeffreys prior, reference prior, and matching priors. Since the obtained priors are improper, we prove that the obtained posterior distributions are proper and that their respective posterior means are finite. An intensive simulation study is conducted to select the prior that returns better results regarding bias, mean square error, and coverage probabilities. The proposed approach is illustrated in two datasets: the first relates to the Achaemenid dynasty reign period, and the second describes the time to failure of an electronic component in a sugarcane harvest machine.
Submitted 13 November, 2023; v1 submitted 27 December, 2020;
originally announced December 2020.
-
Sampling with censored data: a practical guide
Authors:
Pedro L. Ramos,
Daniel C. F. Guzman,
Alex L. Mota,
Daniel A. Saavedra,
Francisco A. Rodrigues,
Francisco Louzada
Abstract:
In this review, we present a simple guide for researchers to obtain pseudo-random samples with censored data. We focus on the most common types of censoring: type I, type II, and random censoring. We discuss the steps needed to sample pseudo-random values from long-term survival models that include an additional cure fraction. For illustrative purposes, these techniques are applied to the Weibull distribution. The algorithms and R code are presented, enabling the reproducibility of our study. Finally, we have developed an R package that encapsulates these methodologies, providing researchers with practical tools for implementation.
Submitted 10 March, 2024; v1 submitted 16 November, 2020;
originally announced November 2020.
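The censoring schemes named in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' R code or their R package; the function names, the Weibull shape/scale parametrization, and the choice to censor cured units at a fixed follow-up time `tau` are all assumptions made for illustration. Each function returns pairs `(observed_time, delta)`, where `delta = 1` marks an observed failure and `delta = 0` a censored observation.

```python
import random


def weibull_sample(n, shape, scale, rng):
    """Draw n complete (uncensored) Weibull lifetimes."""
    # random.weibullvariate takes (scale, shape), in that order.
    return [rng.weibullvariate(scale, shape) for _ in range(n)]


def type_i_censor(times, tau):
    """Type I censoring: the study ends at a fixed time tau."""
    return [(min(t, tau), 1 if t <= tau else 0) for t in times]


def type_ii_censor(times, r):
    """Type II censoring: the study ends at the r-th observed failure."""
    cutoff = sorted(times)[r - 1]
    return [(min(t, cutoff), 1 if t <= cutoff else 0) for t in times]


def random_censor(times, censor_times):
    """Random censoring: each unit has its own censoring time."""
    return [(min(t, c), 1 if t <= c else 0)
            for t, c in zip(times, censor_times)]


def cure_fraction_sample(n, shape, scale, cure_prob, tau, rng):
    """Long-term survival sketch: a cured unit never fails, so it is
    always censored at the end of follow-up tau (an assumption here)."""
    sample = []
    for _ in range(n):
        if rng.random() < cure_prob:
            sample.append((tau, 0))  # cured: censored at tau
        else:
            t = rng.weibullvariate(scale, shape)
            sample.append((min(t, tau), 1 if t <= tau else 0))
    return sample
```

For example, `type_ii_censor(weibull_sample(100, 1.5, 2.0, random.Random(1)), 30)` yields 100 pairs of which 30 are observed failures, the remainder being censored at the 30th smallest lifetime.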
-
Spatial Statistical Models: an overview under the Bayesian Approach
Authors:
Francisco Louzada,
Diego C. Nascimento,
Osafu Augustine Egbon
Abstract:
Spatial documentation is increasing exponentially given the availability of Big IoT Data, enabled by device miniaturization and data storage capacity. Bayesian spatial statistics is a useful statistical tool for determining the dependence structure and hidden patterns over space through prior knowledge and the data likelihood. Nevertheless, this modeling class is not as well explored as classification and regression machine learning models, given their simplicity and often weak (data) independence assumption. This systematic review therefore aims to unravel the main models presented in the literature over the past 20 years and to identify gaps and research opportunities. Elements such as random fields, spatial domains, prior specification, covariance functions, and numerical approximations are discussed. This work explores the two subclasses of spatial smoothing: global and local.
Submitted 29 September, 2020;
originally announced September 2020.
-
Power laws in the Roman Empire: a survival analysis
Authors:
Pedro L. Ramos,
Luciano da F. Costa,
Francisco Louzada,
Francisco A. Rodrigues
Abstract:
The Roman Empire shaped Western civilization, and many Roman principles are embodied in modern institutions. Although its political institutions proved both resilient and adaptable, allowing it to incorporate diverse populations, the Empire suffered from many internal conflicts. Indeed, most emperors died violently, from assassination, suicide, or in battle. These internal conflicts produced patterns in the length of time that can be identified by statistical analysis. In this paper, we study the underlying patterns associated with the reign of the Roman emperors by using statistical tools of survival data analysis. We consider all the 175 Roman emperors and propose a new power-law model with change points to predict the time-to-violent-death of the Roman emperors. This model encompasses data in the presence of censoring and long-term survivors, providing more accurate predictions than previous models. Our results show that power-law distributions can also occur in survival data, as verified in other data types from natural and artificial systems, reinforcing the ubiquity of power law distributions. The generality of our approach paves the way to further related investigations not only in other ancient civilizations but also in applications in engineering and medicine.
Submitted 20 August, 2020;
originally announced August 2020.
-
Incorporation of frailties into a non-proportional hazard regression model and its diagnostics for reliability modeling of downhole safety valves
Authors:
Francisco Louzada,
José A. Cuminato,
Oscar M. H. Rodriguez,
Vera L. D. Tomazella,
Eder A. Milani,
Paulo H. Ferreira,
Pedro L. Ramos,
Gustavo Bochio,
Ivan C. Perissini,
Oilson A. Gonzatto Junior,
Alex L. Mota,
Luis F. A. Alegría,
Danilo Colombo,
Paulo G. O. Oliveira,
Hugo F. L. Santos,
Marcus V. C. Magalhães
Abstract:
In this paper, we propose incorporating frailty into a statistical methodology for modeling time-to-event data based on a non-proportional hazards regression model. Specifically, we use the generalized time-dependent logistic (GTDL) model with a frailty term introduced in the hazard function to control for unobservable heterogeneity among the sampling units. We also add a regression in the parameter that measures the effect of time, since it can directly reflect the influence of covariates on the effect of time-to-failure. The practical relevance of the proposed model is illustrated in a real problem based on a data set for downhole safety valves (DHSVs) used in offshore oil and gas production wells. The reliability estimation of DHSVs can be used, among other purposes, to predict blowout occurrence, assess workover demand, and aid decision-making.
Submitted 1 December, 2020; v1 submitted 18 August, 2020;
originally announced August 2020.
-
Power laws distributions in objective priors
Authors:
Pedro L. Ramos,
Francisco A. Rodrigues,
Eduardo Ramos,
Dipak K. Dey,
Francisco Louzada
Abstract:
The use of objective priors in Bayesian applications has become a common practice for analyzing data without subjective information. These prior distributions are usually obtained from formal rules, and the data provide the dominant information in the posterior distribution. However, these priors are typically improper and may lead to improper posteriors. Here, we show, for a general family of distributions, that the obtained objective priors for the parameters either follow a power-law distribution or have an asymptotic power-law behavior. As a result, we observed that the exponents of the model are between 0.5 and 1. Understanding these behaviors allows us to easily verify whether such priors lead to proper or improper posteriors directly from the exponent of the power law. The general family considered in our study includes essential models such as the Exponential, Gamma, Weibull, Nakagami-m, Half-Normal, Rayleigh, Erlang, and Maxwell-Boltzmann distributions, to list a few. In summary, we show that comprehending the mechanisms describing the shapes of the priors provides essential information that can be used in situations where additional complexity is present.
Submitted 15 May, 2020;
originally announced May 2020.
-
Multiple repairable systems under dependent competing risks with nonparametric Frailty
Authors:
Marco Pollo Almeida,
Rafael Paixao,
Pedro Ramos,
Vera Tomazella,
Francisco Louzada,
Ricardo Ehlers
Abstract:
The aim of this article is to analyze data from multiple repairable systems in the presence of dependent competing risks. To model this dependence structure, we adopt the well-known shared frailty model. This model provides a suitable theoretical basis for generating dependence between the component failure times in the dependent competing risks model. It is known that the dependence effect in this scenario influences the estimates of the model parameters. Hence, under the assumption that the cause-specific intensities follow a power law process (PLP), we propose a frailty-induced dependence approach to incorporate the dependence among the cause-specific recurrent processes. Moreover, misspecification of the frailty distribution may lead to errors when estimating the parameters of interest. Because of this, we consider a Bayesian nonparametric approach to model the frailty density in order to offer more flexibility and to provide consistent estimates for the PLP model, as well as insights about heterogeneity among the systems. Both simulation studies and real case studies are provided to illustrate the proposed approaches and demonstrate their validity.
Submitted 10 April, 2020;
originally announced April 2020.
-
Random Machines Regression Approach: an ensemble support vector regression model with free kernel choice
Authors:
Anderson Ara,
Mateus Maia,
Samuel Macêdo,
Francisco Louzada
Abstract:
Machine learning techniques always aim to reduce the generalized prediction error. Ensemble methods are a good way to reduce it, combining several models to achieve greater forecasting capacity. Random Machines have already been shown to be a strong technique, i.e., one with high predictive power, for classification tasks. In this article, we propose a procedure that applies the bagged-weighted support vector model to regression problems. Simulation studies were carried out on artificial datasets and on real data benchmarks. The results show good performance of Regression Random Machines, with lower generalization error and no need to choose the best kernel function during the tuning process.
Submitted 27 March, 2020;
originally announced March 2020.
-
Random Machines: A bagged-weighted support vector model with free kernel choice
Authors:
Anderson Ara,
Mateus Maia,
Samuel Macêdo,
Francisco Louzada
Abstract:
Improving statistical learning models to solve classification or regression problems more efficiently is still a goal pursued by the scientific community. The support vector machine is one of the most successful and powerful algorithms for those tasks. However, its performance depends directly on the choice of the kernel function and its hyperparameters, and the traditional kernel selection and tuning processes can be computationally expensive. In this article, we propose a novel framework for kernel function selection called Random Machines. The results show improved accuracy and reduced computational time. The study was performed on simulated data and on 27 real benchmarking datasets.
Submitted 21 November, 2019;
originally announced November 2019.
-
On the unified zero-inflated cure-rate survival models
Authors:
Francisco Louzada,
Pedro Luiz Ramos,
Hayala C. C. Souza,
Gleici da Silva Castro Perdona
Abstract:
In this paper, we propose a unified version for survival models that includes zero-inflation and cure rate proportions, and allows different distributions for the unknown competitive causes. Our model has as particular cases several usual cure rate survival models.
Submitted 18 May, 2020; v1 submitted 26 January, 2019;
originally announced January 2019.
-
An Extended Poisson Family of Life Distribution: A Unified Approach in Competitive and Complementary Risks
Authors:
Pedro L. Ramos,
Dipak K. Dey,
Francisco Louzada,
Victor H. Lachos
Abstract:
In this paper, we introduce a new approach to generate flexible parametric families of distributions. These models arise in competitive and complementary risks scenarios, in which the lifetime associated with a particular risk is not observable; rather, we observe only the minimum/maximum lifetime value among all risks. The latent variables have a zero-truncated Poisson distribution. For the proposed family of distributions, the extra shape parameter has an important physical interpretation in the competing and complementary risks scenario. The mathematical properties and inferential procedures are discussed. The proposed approach is applied to some existing distributions and is fully illustrated with an important data set.
Submitted 19 May, 2018;
originally announced May 2018.
-
Objective Bayesian Inference for Repairable System Subject to Competing Risks
Authors:
Marco Pollo,
Vera Tomazella,
Gustavo Gilardoni,
Pedro L. Ramos,
Marcio J. Nicola,
Francisco Louzada
Abstract:
Competing risks models for a repairable system subject to several failure modes are discussed. Under minimal repair, it is assumed that each failure mode has a power law intensity. An orthogonal reparametrization is used to obtain an objective Bayesian prior which is invariant under relabelling of the failure modes. The resulting posterior is a product of gamma distributions and has appealing properties: one-to-one invariance, consistent marginalization and consistent sampling properties. Moreover, the resulting Bayes estimators have closed-form expressions and are naturally unbiased for all the parameters of the model. The methodology is applied in the analysis of (i) a previously unpublished dataset about recurrent failure history of a sugarcane harvester and (ii) records of automotive warranty claims introduced in [1]. A simulation study was carried out to study the efficiency of the methods proposed.
Submitted 17 April, 2018;
originally announced April 2018.
-
The Frechet distribution: Estimation and Application an Overview
Authors:
Pedro Luiz Ramos,
Francisco Louzada,
Eduardo Ramos,
Sanku Dey
Abstract:
In this article, we consider the problem of estimating the parameters of the Fréchet distribution from both frequentist and Bayesian points of view. First, we briefly describe different frequentist approaches, namely, maximum likelihood, method of moments, percentile estimators, L-moments, ordinary and weighted least squares, maximum product of spacings, and maximum goodness-of-fit estimators, and compare them with respect to mean relative estimates, mean squared errors, and the 95% coverage probability of the asymptotic confidence intervals using extensive numerical simulations. Next, we consider the Bayesian inference approach using reference priors. The Metropolis-Hastings algorithm is used to draw Markov Chain Monte Carlo samples, which are in turn used to compute the Bayes estimates and to construct the corresponding credible intervals. Five real data sets related to the minimum flow of water on the Piracicaba river in Brazil are used to illustrate the applicability of the discussed procedures.
Submitted 16 January, 2018;
originally announced January 2018.
-
Reliability-centered maintenance: analyzing failure in harvest sugarcane machine using some generalizations of the Weibull distribution
Authors:
Pedro Luiz Ramos,
Diego Nascimento,
Camila Cocolo,
Márcio José Nicola,
Carlos Alonso,
Luiz Gustavo Ribeiro,
André Ennes,
Francisco Louzada
Abstract:
In this study, we consider five generalizations of the standard Weibull distribution to describe the lifetime of two important components of sugarcane harvesting machines. The harvesters considered in the analysis harvest an average of 20 tons of sugarcane per hour, and their malfunction may lead to major losses; an effective maintenance approach is therefore of major interest for cost savings. For the considered distributions, the mathematical background is presented. Maximum likelihood is used for parameter estimation. Further, different discrimination procedures are used to obtain the best fit for each component. Finally, we propose a maintenance schedule for the harvester components using predictive analysis.
Submitted 8 December, 2017;
originally announced December 2017.
-
The Inverse Weighted Lindley Distribution: Properties, Estimation and an Application on a Failure Time Data
Authors:
Pedro L. Ramos,
Francisco Louzada,
Taciana K. O. Shimizu,
Aline O. Luiz
Abstract:
In this paper, a new distribution is proposed. The new model provides more flexibility for modeling data with an upside-down bathtub hazard rate function. A significant account of the mathematical properties of the new distribution is presented. The maximum likelihood estimators for the parameters in the presence of complete and censored data are presented. Two corrective approaches are considered to derive modified estimators that are bias-free to second order. A numerical simulation is carried out to examine the efficiency of the bias correction. Finally, an application using a real data set is presented to illustrate our proposed distribution.
Submitted 26 November, 2017;
originally announced November 2017.
-
The Long Term Fréchet distribution: Estimation, Properties and its Application
Authors:
Pedro Luiz Ramos,
Diego Nascimento,
Francisco Louzada
Abstract:
In this paper, a new long-term survival distribution is proposed. The so-called long-term Fréchet distribution allows us to fit data where part of the population is not susceptible to the event of interest. This model may be used, for example, in clinical studies where a portion of the population can be cured during treatment. An account of the mathematical properties of the new distribution, such as its moments and survival properties, is given. The maximum likelihood estimators (MLEs) for the parameters are also presented. A numerical simulation is carried out to verify the performance of the MLEs. Finally, an important application to leukemia-free survival times for transplant patients is discussed to illustrate our proposed distribution.
Submitted 22 September, 2017;
originally announced September 2017.
-
Cooperative Parallel Particle Filters for online model selection and applications to Urban Mobility
Authors:
Luca Martino,
Jesse Read,
Victor Elvira,
Francisco Louzada
Abstract:
We design a sequential Monte Carlo scheme for the dual purpose of Bayesian inference and model selection. We consider the application context of urban mobility, where several modalities of transport and different measurement devices can be employed. Therefore, we address the joint problem of online tracking and detection of the current modality. For this purpose, we use interacting parallel particle filters, each one addressing a different model. They cooperate for providing a global estimator of the variable of interest and, at the same time, an approximation of the posterior density of each model given the data. The interaction occurs by a parsimonious distribution of the computational effort, with online adaptation for the number of particles of each filter according to the posterior probability of the corresponding model. The resulting scheme is simple and flexible. We have tested the novel technique in different numerical experiments with artificial and real data, which confirm the robustness of the proposed scheme.
Submitted 25 September, 2016;
originally announced September 2016.
-
MWStat: A Modulated Web-Based Statistical System
Authors:
Francisco Louzada,
Anderson Ara
Abstract:
In this paper we present the development of a modulated web-based statistical system, hereafter MWStat, which shifts the statistical paradigm of analyzing data into a real-time structure. The MWStat system is useful for both online data storage and questionnaire analysis, as well as for providing real-time delivery of results from analyses based on several statistical methodologies in a customizable fashion. Overall, it can be seen as a useful technical solution applicable to a wide range of statistical applications that need a scheme for returning real-time results, accessible to anyone with internet access. We give step-by-step instructions for implementing the system. The structure is accessible, built with an easily interpretable language, and can be strategically applied to online statistical applications. We rely on the combination of several free tools, namely PHP, R, a MySQL database and an Apache HTTP server, and on the use of software tools such as phpMyAdmin. We present three didactic examples of the MWStat system on institutional evaluation, statistical quality control and multivariate analysis. The methodology is also illustrated in a real example on institutional evaluation.
Submitted 29 April, 2016;
originally announced May 2016.
-
Objective Bayesian Analysis for the Lomax Distribution
Authors:
Paulo Ferreira,
Jhon Gonzales,
Vera Tomazella,
Ricardo Ehlers,
Francisco Louzada,
Eveliny Silva
Abstract:
In this paper we propose to make Bayesian inferences for the parameters of the Lomax distribution using non-informative priors, namely the Jeffreys prior and the reference prior. We assess Bayesian estimation through a Monte Carlo study with 500 simulated data sets. To evaluate the possible impact of prior specification on estimation, two criteria were considered: the bias and square root of the mean square error. The developed procedures are illustrated on a real data set.
Submitted 26 February, 2016;
originally announced February 2016.
-
Effective Sample Size for Importance Sampling based on discrepancy measures
Authors:
L. Martino,
V. Elvira,
F. Louzada
Abstract:
The Effective Sample Size (ESS) is an important measure of efficiency of Monte Carlo methods such as Markov Chain Monte Carlo (MCMC) and Importance Sampling (IS) techniques. In the IS context, an approximation $\widehat{ESS}$ of the theoretical ESS definition is widely applied, involving the inverse of the sum of the squares of the normalized importance weights. This formula, $\widehat{ESS}$, has become an essential piece within Sequential Monte Carlo (SMC) methods, to assess the convenience of a resampling step. From another perspective, the expression $\widehat{ESS}$ is related to the Euclidean distance between the probability mass described by the normalized weights and the discrete uniform probability mass function (pmf). In this work, we derive other possible ESS functions based on different discrepancy measures between these two pmfs. Several examples are provided involving, for instance, the geometric mean of the weights, the discrete entropy (including the perplexity measure, already proposed in the literature) and the Gini coefficient, among others. We list five theoretical requirements which a generic ESS function should satisfy, allowing us to classify different ESS measures. We also compare the most promising ones by means of numerical simulations.
Submitted 25 September, 2016; v1 submitted 10 February, 2016;
originally announced February 2016.
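As an illustration of the abstract above, the following minimal Python sketch computes the usual $\widehat{ESS}$ (the inverse of the sum of the squared normalized weights) and a perplexity-based alternative (the exponential of the discrete entropy of the normalized weights). The function names are hypothetical, and the perplexity variant follows the common textbook definition, which is assumed here rather than taken from the paper.

```python
import math


def normalize(weights):
    """Return the weights rescaled to sum to one."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def ess_inverse_sum_squares(weights):
    """Usual approximation: 1 / sum of squared normalized weights."""
    w = normalize(weights)
    return 1.0 / sum(x * x for x in w)


def ess_perplexity(weights):
    """Perplexity: exp of the discrete entropy of the normalized weights."""
    w = normalize(weights)
    entropy = -sum(x * math.log(x) for x in w if x > 0)
    return math.exp(entropy)
```

Both functions return N for N equal weights and 1 when a single weight carries all the mass, which are the two extreme cases one would expect any ESS measure to reproduce.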
-
Classification methods applied to credit scoring: A systematic review and overall comparison
Authors:
Francisco Louzada,
Anderson Ara,
Guilherme B. Fernandes
Abstract:
The need for controlling and effectively managing credit risk has led financial institutions to excel in improving techniques designed for this purpose, resulting in the development of various quantitative models by financial institutions and consulting companies. Hence, the growing number of academic studies about credit scoring shows a variety of classification methods applied to discriminate good and bad borrowers. This paper, therefore, aims to present a systematic literature review relating theory and application of binary classification techniques for credit scoring financial analysis. The general results show the use and importance of the main techniques for credit rating, as well as some of the scientific paradigm changes throughout the years.
Submitted 5 February, 2016;
originally announced February 2016.
-
The Generalized Weighted Lindley Distribution: Properties, Estimation and Applications
Authors:
P. L. Ramos,
F. Louzada
Abstract:
In this paper, we propose a new lifetime distribution, namely the generalized weighted Lindley (GWL) distribution. The GWL distribution is a useful generalization of the weighted Lindley distribution, which accommodates increasing, decreasing, decreasing-increasing-decreasing, bathtub, or unimodal hazard functions, making the GWL distribution a flexible model for reliability data. A significant account of the mathematical properties of the new distribution is presented. Different estimation procedures are also given, such as maximum likelihood estimators, the method of moments, ordinary and weighted least-squares, percentile, maximum product of spacings and minimum distance estimators. The different estimators are compared by extensive numerical simulations. Finally, we analyze two data sets for illustrative purposes, proving that the GWL distribution outperforms several usual three-parameter lifetime distributions.
Submitted 18 July, 2016; v1 submitted 27 January, 2016;
originally announced January 2016.
-
The zero-inflated promotion cure rate regression model applied to fraud propensity in bank loan applications
Authors:
Francisco Louzada,
Mauro R. de Oliveira Jr,
Fernando F. Moreira
Abstract:
In this paper we extend the promotion cure rate model proposed by Chen et al. (1999) by incorporating an excess of zeros in the modelling. Although it allows the covariates to be related to the cure fraction, the current approach, which is based on a biological interpretation of the causes that trigger the event of interest, does not allow the covariates to be related to the fraction of zeros. The presence of zeros in survival data, unusual in medical studies, can frequently occur in banking loan portfolios, as presented in Louzada et al. (2015), who deal with propensity to fraud in lending loans in a major Brazilian bank. To illustrate the new cure rate survival method, the same real dataset analyzed in Louzada et al. (2015) is fitted here, and the results are compared.
Submitted 1 October, 2015;
originally announced October 2015.
-
Parallel Metropolis chains with cooperative adaptation
Authors:
L. Martino,
V. Elvira,
D. Luengo,
F. Louzada
Abstract:
Monte Carlo methods, such as Markov chain Monte Carlo (MCMC) algorithms, have become very popular in signal processing in recent years. In this work, we introduce a novel MCMC scheme where parallel MCMC chains interact, cooperatively adapting the parameters of their proposal functions. Furthermore, the novel algorithm distributes the computational effort adaptively, rewarding the chains which are providing better performance and possibly even stopping other ones. These extinct chains can be reactivated if the algorithm considers it necessary. Numerical simulations show the benefits of the novel scheme.
Submitted 26 September, 2015;
originally announced September 2015.
-
Adaptive Rejection Sampling with fixed number of nodes
Authors:
L. Martino,
F. Louzada
Abstract:
The adaptive rejection sampling (ARS) algorithm is a universal random generator for drawing samples efficiently from a univariate log-concave target probability density function (pdf). ARS generates independent samples from the target via rejection sampling with high acceptance rates. Indeed, ARS yields a sequence of proposal functions that converge toward the target pdf, so that the probability of accepting a sample approaches one. However, sampling from the proposal pdf becomes more computationally demanding each time it is updated. In this work, we propose a novel ARS scheme, called Cheap Adaptive Rejection Sampling (CARS), where the computational effort for drawing from the proposal remains constant, decided in advance by the user. For generating a large number of desired samples, CARS is faster than ARS.
Submitted 8 October, 2017; v1 submitted 26 September, 2015;
originally announced September 2015.
-
The zero-inflated cure rate regression model: Applications to fraud detection in bank loan portfolios
Authors:
Francisco Louzada,
Mauro R. de Oliveira Jr,
Fernando F. Moreira
Abstract:
In this paper, we introduce a methodology based on the zero-inflated cure rate model to detect fraudsters in bank loan applications. In fact, our approach enables us to accommodate three different types of loan applicants, i.e., fraudsters, those who are susceptible to default and finally, those who are not susceptible to default. An advantage of our approach is to accommodate zero-inflated times, which is not possible in the standard cure rate model. To illustrate the proposed method, a real dataset of loan survival times is fitted by the zero-inflated Weibull cure rate model. The parameter estimation is reached by maximum likelihood estimation procedure and Monte Carlo simulations are carried out to check its finite sample performance.
Submitted 19 September, 2015; v1 submitted 17 September, 2015;
originally announced September 2015.
-
Issues in the Multiple Try Metropolis mixing
Authors:
L. Martino,
F. Louzada
Abstract:
The Multiple Try Metropolis (MTM) algorithm is an advanced MCMC technique based on drawing and testing several candidates at each iteration of the algorithm. One of them is selected according to certain weights and then it is tested according to a suitable acceptance probability. Clearly, since the computational cost increases as the employed number of tries grows, one expects that the performance of an MTM scheme improves as the number of tries increases, as well. However, there are scenarios where increasing the number of tries does not produce a corresponding enhancement of the performance. In this work, we describe these scenarios and then we introduce possible solutions to these issues.
Submitted 19 February, 2016; v1 submitted 18 August, 2015;
originally announced August 2015.
-
Orthogonal parallel MCMC methods for sampling and optimization
Authors:
L. Martino,
V. Elvira,
D. Luengo,
J. Corander,
F. Louzada
Abstract:
Monte Carlo (MC) methods are widely used for Bayesian inference and optimization in statistics, signal processing and machine learning. A well-known class of MC methods are Markov Chain Monte Carlo (MCMC) algorithms. In order to foster better exploration of the state space, especially in high-dimensional applications, several schemes employing multiple parallel MCMC chains have been recently introduced. In this work, we describe a novel parallel interacting MCMC scheme, called orthogonal MCMC (O-MCMC), where a set of "vertical" parallel MCMC chains share information using some "horizontal" MCMC techniques working on the entire population of current states. More specifically, the vertical chains are led by random-walk proposals, whereas the horizontal MCMC techniques employ independent proposals, thus allowing an efficient combination of global exploration and local approximation. The interaction is contained in these horizontal iterations. Within the analysis of different implementations of O-MCMC, novel schemes to reduce the overall computational cost of parallel multiple try Metropolis (MTM) chains are also presented. Furthermore, a modified version of O-MCMC for optimization is provided by considering parallel simulated annealing (SA) algorithms. Numerical results show the advantages of the proposed sampling scheme in terms of efficiency in the estimation, as well as robustness in terms of independence with respect to initial values and the choice of the parameters.
Submitted 25 September, 2016; v1 submitted 30 July, 2015;
originally announced July 2015.
-
Modeling Compositional Regression with uncorrelated and correlated errors: a Bayesian approach
Authors:
Taciana K. O. Shimizu,
Francisco Louzada,
Adriano K. Suzuki,
Ricardo S. Ehlers
Abstract:
Compositional data consist of known composition vectors whose components are positive and defined in the interval (0,1), representing proportions or fractions of a "whole". The sum of these components must be equal to one. Compositional data are present in different knowledge areas, such as geology, economics and medicine, among many others. In this paper, we introduce a Bayesian analysis for compositional regression applying the additive log-ratio (ALR) transformation and assuming uncorrelated and correlated errors. The Bayesian inference procedure is based on Markov Chain Monte Carlo (MCMC) methods. The methodology is illustrated on an artificial data set and a real data set of volleyball.
Submitted 1 July, 2015;
originally announced July 2015.
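The additive log-ratio (ALR) transformation mentioned in the abstract above maps a composition with D positive parts summing to one onto D-1 unconstrained real values. The following is a minimal Python sketch of the transformation and its inverse, assuming the last component is used as the reference part (a common convention, not confirmed by the abstract). The function names are hypothetical.

```python
import math


def alr(composition):
    """Additive log-ratio: log of each part over the last (reference) part."""
    if any(x <= 0 for x in composition):
        raise ValueError("all components must be strictly positive")
    reference = composition[-1]
    return [math.log(x / reference) for x in composition[:-1]]


def alr_inverse(values):
    """Map D-1 real values back to a composition of D parts summing to one."""
    exps = [math.exp(v) for v in values]
    denom = 1.0 + sum(exps)
    return [e / denom for e in exps] + [1.0 / denom]
```

For example, `alr([0.2, 0.3, 0.5])` gives `[log(0.4), log(0.6)]`, and `alr_inverse` recovers the original proportions, so a regression model can be fitted on the unconstrained scale and mapped back afterwards.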
-
Maximum Likelihood Estimation for the Weight Lindley Distribution Parameters under Different Types of Censoring
Authors:
Pedro L. Ramos,
Francisco Louzada,
Vicente G. Cancho
Abstract:
In this paper the maximum likelihood equations for the parameters of the Weight Lindley distribution are studied considering different types of censoring, such as type I, type II and random censoring mechanisms. A numerical simulation study is performed to evaluate the maximum likelihood estimates. The proposed methodology is illustrated on a real data set.
Submitted 29 March, 2015;
originally announced March 2015.
-
The Optimised Theta Method
Authors:
José Augusto Fioruci,
Tiago Ribeiro Pellegrini,
Francisco Louzada,
Fotios Petropoulos
Abstract:
Accurate and robust forecasting methods for univariate time series are very important when the objective is to produce estimates for a large number of time series. In this context, the Theta method attracted researchers' attention due to its performance in the largest up-to-date forecasting competition, the M3-Competition. The Theta method proposes the decomposition of the deseasonalised data into two "theta lines". The first theta line completely removes the curvatures of the data, thus being a good estimator of the long-term trend component. The second theta line doubles the curvatures of the series, so as to better approximate the short-term behaviour. In this paper, we propose a generalisation of the Theta method by optimising the selection of the second theta line, based on various validation schemes where the out-of-sample accuracy of the candidate variants is measured. The recomposition process of the original time series builds on the asymmetry of the decomposed theta lines. An empirical investigation through the M3-Competition data set shows improvements in the forecasting accuracy of the proposed optimised Theta method.
Submitted 11 March, 2015;
originally announced March 2015.
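The theta lines described in the abstract above can be sketched in Python using the commonly cited closed form, in which a theta line is a weighted combination of the data and its least-squares linear trend. This form is an assumption here (the abstract does not state it), and the function names are hypothetical. With theta = 0 the curvature is removed entirely, and with theta = 2 the second differences of the series are doubled.

```python
def linear_trend(y):
    """Fitted values of the ordinary least-squares line of y on t = 1..n."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two observations")
    t = list(range(1, n + 1))
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return [intercept + slope * ti for ti in t]


def theta_line(y, theta):
    """Theta line: theta * data + (1 - theta) * linear trend."""
    trend = linear_trend(y)
    return [theta * yi + (1.0 - theta) * li for yi, li in zip(y, trend)]


def second_differences(z):
    """Discrete curvature: z[t+2] - 2*z[t+1] + z[t]."""
    return [z[i + 2] - 2.0 * z[i + 1] + z[i] for i in range(len(z) - 2)]
```

The optimised variant in the abstract would then select the theta of the second line by out-of-sample validation; that selection step is not shown here.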
-
Analyzing Volleyball Data on a Compositional Regression Model Approach: An Application to the Brazilian Men's Volleyball Super League 2011/2012 Data
Authors:
Taciana K. O. Shimizu,
Francisco Louzada,
Adriano K. Suzuki
Abstract:
Volleyball has become a competitive sport with high physical and technical performance. Match results are based on the players' and teams' skills as technical and tactical strategies to succeed in a championship. In this context, some studies have been carried out on the performance analysis of different match elements, contributing to the development of this sport. In this paper, we propose a new approach to analyze volleyball data. The study is based on compositional data methodology within a regression model. The parameters are estimated by maximum likelihood. We perform a simulation study to evaluate the estimation procedure in the compositional regression model, and we illustrate the proposed methodology on a real volleyball data set.
Submitted 18 December, 2014;
originally announced December 2014.
-
A Modified Reference Prior for the Generalized Gamma Distribution
Authors:
Pedro L. Ramos,
Francisco Louzada
Abstract:
In this paper we propose an objective Bayesian estimation approach for the parameters of the generalized gamma distribution. Various reference priors are obtained, but we show that they lead to improper posterior distributions. We overcome this problem by proposing a modification of a reference prior distribution, allowing for a proper posterior distribution for the parameters of the generalized gamma distribution. We perform a simulation study in order to assess the efficiency of the proposed methodology, which is also fully illustrated on a real data set.
Submitted 18 December, 2014;
originally announced December 2014.
-
BayesDccGarch - An Implementation of Multivariate GARCH DCC Models
Authors:
Jose A. Fioruci,
Ricardo S. Ehlers,
Francisco Louzada
Abstract:
Multivariate GARCH models are important tools to describe the dynamics of multivariate time series of financial returns. Nevertheless, these models have been much less used in practice due to the lack of reliable software. This paper describes the R package BayesDccGarch, which was developed to implement recently proposed inference procedures to estimate and compare multivariate GARCH models allowing for asymmetric and heavy-tailed distributions.
Submitted 9 December, 2014;
originally announced December 2014.
-
Recovery Risk: Application of the Latent Competing Risks Model to Non performing Loans
Authors:
Mauro R. Oliveira,
Francisco Louzada
Abstract:
This article proposes a method for measuring the latent risks involved in the recovery process of non performing loans in financial institutions and business firms that deal with collection and recovery processes. To that end, we apply the competing risks model referred to in the literature as the promotion time model. The result achieved is the probability of credit recovery for a portfolio segmented into groups based on the information available. Within the context of competing risks, application of the technique yielded an estimation of the number of latent events that concur to the credit recovery event. With these results in hand, we were able to compare groups of defaulters in terms of risk or susceptibility to the recovery event during the collection process, and thereby determine where collection actions are most efficient. We specify the Poisson distribution for the number of latent causes leading to recovery, and the Weibull distribution for the time up to recovery. To estimate the model parameters, we use the maximum likelihood method. Finally, the model was applied to a sample of defaulted loans from a financial institution.
Submitted 19 August, 2014;
originally announced August 2014.
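The model structure named in the abstract above (a Poisson number of latent causes and a Weibull time to the event) can be illustrated with the standard promotion time formulation, in which the population survival function is exp(-theta * F(t)), with F the Weibull distribution function. The Python sketch below assumes that formulation and a shape/scale parametrization of the Weibull; the function and parameter names are hypothetical and are not taken from the paper.

```python
import math


def weibull_cdf(t, shape, scale):
    """Weibull distribution function F(t) = 1 - exp(-(t / scale) ** shape)."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))


def promotion_time_survival(t, theta, shape, scale):
    """Population survival when the number of latent causes is Poisson(theta)."""
    return math.exp(-theta * weibull_cdf(t, shape, scale))


def never_event_fraction(theta):
    """Long-run fraction that never experiences the event: P(no latent cause)."""
    return math.exp(-theta)
```

As t grows, `promotion_time_survival` levels off at `never_event_fraction(theta)` instead of going to zero, which is the feature that lets such a model represent a portfolio segment in which some loans are never recovered.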
-
An Evidence of Link between Default and Loss of Bank Loans from the Modeling of Competing Risks
Authors:
Mauro R. Oliveira,
Francisco Louzada
Abstract:
In this paper, we propose a method for comparing the risks involved in a customer becoming a defaulter with those of the debt collection process that might lead this defaulter to recover. Through the estimation of competing risks that lead to the realization of the event of interest, we show that there is a significant relation between the intensity of default and the losses from defaulted loans in collection processes. To reach this goal, we investigate a competing risks model applied to the whole credit risk cycle of a bank loan portfolio. We estimate the competing causes related to the occurrence of default and then compare them with the estimated competing causes that lead loans to the write-off condition. In the context of modeling competing risks, we use a Poisson distribution for the number of competing causes and a Weibull distribution for the failure times. Maximum likelihood estimation is used for parameter estimation, and the model is applied to a real data set of personal loans.
Submitted 19 August, 2014;
originally announced August 2014.
-
A modified version of the inference function for margins and interval estimation for the bivariate Clayton copula SUR Tobit model: An simulation approach
Authors:
Paulo H. Ferreira,
Francisco Louzada
Abstract:
This paper extends the analysis of bivariate seemingly unrelated regression (SUR) Tobit model by modeling its nonlinear dependence structure through the Clayton copula. The ability in capturing/modeling the lower tail dependence of the SUR Tobit model where some data are censored (generally, at zero point) is an additionally useful feature of the Clayton copula. We propose a modified version of the inference function for margins (IFM) method (Joe and Xu, 1996), which we refer to as MIFM method, to obtain the estimates of the marginal parameters and a better (satisfactory) estimate of the copula association parameter. More specifically, we employ the data augmentation technique in the second stage of the IFM method to generate the censored observations (i.e. to obtain continuous marginal distributions, which ensures the uniqueness of the copula) and then estimate the dependence parameter. Resampling procedures (bootstrap methods) are also proposed for obtaining confidence intervals for the model parameters. A simulation study is performed in order to verify the behavior of the MIFM estimates (we focus on the copula parameter estimation) and the coverage probability of different confidence intervals in datasets with different percentages of censoring and degrees of dependence. The satisfactory results from the simulation (under certain conditions) and empirical study indicate the good performance of our proposed model and methods where they are applied to model the U.S. ready-to-eat breakfast cereals and fluid milk consumption data.
Submitted 12 April, 2014;
originally announced April 2014.