-
The Analysis of Criminal Recidivism: A Hierarchical Model-Based Approach for the Analysis of Zero-Inflated, Spatially Correlated recurrent events Data
Authors:
Alisson C. C. Silva,
Fábio N. Demarqui,
Bráulio F. Silva,
Marcos O. Prates
Abstract:
The life course perspective in criminology has become prominent in recent years, offering valuable insights into various patterns of criminal offending and pathways. The study of criminal trajectories aims to understand the beginning, persistence and desistance in crime, providing intriguing explanations about these moments in life. Central to this analysis is the identification of patterns in the frequency of criminal victimization and recidivism, along with the factors that contribute to them. Specifically, this work introduces a new class of models that overcome limitations in traditional methods used to analyze criminal recidivism. These models are designed for recurrent events data characterized by an excess of zeros and spatial correlation. They extend the Non-Homogeneous Poisson Process, incorporating spatial dependence in the model through random effects, enabling the analysis of associations among individuals within the same spatial stratum. To deal with the excess of zeros in the data, a zero-inflated Poisson mixed model was incorporated. In addition to parametric models following the Power Law process for baseline intensity functions, we propose flexible semi-parametric versions approximating the intensity function using Bernstein Polynomials. The Bayesian approach offers advantages such as incorporating external evidence and modeling specific correlations between random effects and observed data. The performance of these models was evaluated in a simulation study with various scenarios, and we applied them to analyze criminal recidivism data in the Metropolitan Region of Belo Horizonte, Brazil. The results provide a detailed analysis of high-risk areas for recurrent crimes and the behavior of recidivism rates over time. This research significantly enhances our understanding of criminal trajectories, paving the way for more effective strategies in combating criminal recidivism.
Submitted 4 May, 2024;
originally announced May 2024.
-
Imputation of missing data using multivariate Gaussian Linear Cluster-Weighted Modeling
Authors:
Luis Alejandro Masmela-Caita,
Thais Paiva Galletti,
Marcos Oliveira Prates
Abstract:
Missing data arises when certain values are not recorded or observed for variables of interest. However, most statistical theory assumes complete data availability. To address incomplete databases, one approach is to fill the gaps corresponding to the missing information based on specific criteria, known as imputation. In this study, we propose a novel imputation methodology for databases with non-response units by leveraging additional information from fully observed auxiliary variables. We assume that the variables included in the database are continuous and that the auxiliary variables, which are fully observed, help to improve the imputation capacity of the model. Within a fully Bayesian framework, our method utilizes a flexible mixture of multivariate normal distributions to jointly model the response and auxiliary variables. By employing the principles of Gaussian Cluster-Weighted modeling, we construct a predictive model to impute the missing values by leveraging information from the covariates. We present simulation studies and a real data illustration to demonstrate the imputation capacity of our method across various scenarios, comparing it to other methods in the literature.
Submitted 12 August, 2023;
originally announced August 2023.
-
Is augmentation effective to improve prediction in imbalanced text datasets?
Authors:
Gabriel O. Assunção,
Rafael Izbicki,
Marcos O. Prates
Abstract:
Imbalanced datasets present a significant challenge for machine learning models, often leading to biased predictions. To address this issue, data augmentation techniques are widely used in natural language processing (NLP) to generate new samples for the minority class. However, in this paper, we challenge the common assumption that data augmentation is always necessary to improve predictions on imbalanced datasets. Instead, we argue that adjusting the classifier cutoffs without data augmentation can produce similar results to oversampling techniques. Our study provides theoretical and empirical evidence to support this claim. Our findings contribute to a better understanding of the strengths and limitations of different approaches to dealing with imbalanced data, and help researchers and practitioners make informed decisions about which methods to use for a given task.
Submitted 20 April, 2023;
originally announced April 2023.
-
An unified framework for point-level, areal, and mixed spatial data: the Hausdorff-Gaussian Process
Authors:
Lucas da Cunha Godoy,
Marcos Oliveira Prates,
Jun Yan
Abstract:
More realistic models can be built taking into account spatial dependence when analyzing areal data. Most of the models for areal data employ adjacency matrices to assess the spatial structure of the data. Such methodologies impose some limitations. Notably, spatial polygons of different shapes and sizes are not treated differently, and it becomes difficult, if not impractical, to compute predictions based on these models. Moreover, spatial misalignment (when spatial information is available at different spatial levels) becomes harder to handle. These limitations can be circumvented by formulating models using other structures to quantify spatial dependence. In this paper, we introduce the Hausdorff-Gaussian process (HGP). The HGP relies on the Hausdorff distance, valid for both point and areal data, allowing for simultaneously accommodating geostatistical and areal models under the same modeling framework. We present the benefits of using the HGP as a random effect for Bayesian spatial generalized mixed-effects models via a simulation study comparing the performance of the HGP to the most popular models for areal data. Finally, the HGP is applied to respiratory cancer data observed in Greater Glasgow.
Submitted 16 August, 2022;
originally announced August 2022.
-
Beyond Gaussian processes: Flexible Bayesian modeling and inference for geostatistical processes
Authors:
F. B. Gonçalves,
M. O. Prates,
G. A. S. Aguilar
Abstract:
This paper proposes a novel family of geostatistical models to account for features that cannot be properly accommodated by traditional Gaussian processes. The family is specified hierarchically and combines the infinite-dimensional dynamics of Gaussian processes with that of any multivariate continuous distribution. This combination is stochastically defined through a latent Poisson process and the new family is called the Poisson-Gaussian Mixture Process (POGAMP). Whilst attempting to define geostatistical processes by assigning some arbitrary continuous distribution to be the finite-dimensional distributions usually leads to non-valid processes, the finite-dimensional distributions of the POGAMP can be arbitrarily close to any continuous distribution and still define a valid process. Formal results to establish the existence and some important properties of the POGAMP, such as absolute continuity with respect to a Gaussian process measure, are provided. Also, an MCMC algorithm is carefully devised to perform Bayesian inference when the POGAMP is discretely observed in some space domain.
Submitted 6 April, 2023; v1 submitted 12 March, 2022;
originally announced March 2022.
-
Imputation of Missing Data Using Linear Gaussian Cluster-Weighted Modeling
Authors:
Luis Alejandro Masmela-Caita,
Thais Paiva Galletti,
Marcos Oliveira Prates
Abstract:
Missing data theory deals with statistical methods for handling missing data. Missing data occurs when some values are not stored or observed for variables of interest. However, most statistical theory assumes that data is fully observed. One way to deal with incomplete databases is to fill in the spaces corresponding to the missing information based on some criteria; this technique is called imputation. We introduce a new imputation methodology for databases with univariate missing patterns based on additional information from fully-observed auxiliary variables. We assume that the non-observed variable is continuous, and that auxiliary variables help improve the imputation capacity of the model. In a fully Bayesian framework, our method uses a flexible mixture of multivariate normal distributions to model the response and the auxiliary variables jointly. Under this framework, we use the properties of Gaussian Cluster-Weighted modeling to construct a predictive model to impute the missing values using the information from the covariates. Simulation studies and a real data illustration are presented to show the method's imputation capacity under a variety of scenarios and in comparison to other methods in the literature.
Submitted 24 October, 2021;
originally announced October 2021.
-
Alleviating Spatial Confounding in Spatial Frailty Models
Authors:
Douglas Roberto Mesquita Azevedo,
Marcos Oliveira Prates,
Dipankar Bandyopadhyay
Abstract:
Spatial confounding is the name given to the confounding between fixed and spatial random effects. It has been widely studied and has gained attention in recent years in the spatial statistics literature, as it may generate unexpected results in modeling. The projection-based approach, also known as restricted models, appears as a good alternative to overcome spatial confounding in generalized linear mixed models. However, when the support of the fixed effects differs from that of the spatial effect, this approach can no longer be applied directly. In this work, we introduce a method to alleviate spatial confounding for the family of spatial frailty models. This class of models can incorporate spatially structured effects, and it is usual to observe more than one sample unit per area, which means that the supports of the fixed and spatial effects differ. In this case, we introduce a two-fold projection-based approach, projecting the design matrix to the dimension of the space and then projecting the random effect to the orthogonal space of the new design matrix. To provide fast inference in our analysis we employ the integrated nested Laplace approximation methodology. The method is illustrated with an application to lung and bronchus cancer in California, US, that confirms the methodology's efficiency.
Submitted 16 August, 2020;
originally announced August 2020.
-
A robust nonlinear mixed-effects model for COVID-19 deaths data
Authors:
Fernanda L. Schumacher,
Clecio S. Ferreira,
Marcos O. Prates,
Alberto Lachos,
Victor H. Lachos
Abstract:
The analysis of complex longitudinal data such as COVID-19 deaths is challenging due to several inherent features: (i) similarly-shaped profiles with different decay patterns; (ii) unexplained variation among repeated measurements within each country (these repeated measurements may be viewed as clustered data since they are taken in the same country at roughly the same time); (iii) skewness, outliers or skew-heavy-tailed noises possibly embodied within response variables. This article formulates a robust nonlinear mixed-effects model based on the class of scale mixtures of skew-normal distributions for modeling COVID-19 deaths, which allows analysts to model such data in the presence of the above-described features simultaneously. An efficient EM-type algorithm is proposed to carry out maximum likelihood estimation of model parameters. The bootstrap method is used to determine inherent characteristics of the nonlinear individual profiles, such as confidence intervals of the predicted deaths and fitted curves. The target is to model COVID-19 death curves from some Latin American countries, since this region is the new epicenter of the disease. Moreover, since a mixed-effect framework borrows information from the population-average effects, in our analysis we include some countries from Europe and North America that are in a more advanced stage of their COVID-19 death curve.
Submitted 1 August, 2020; v1 submitted 1 July, 2020;
originally announced July 2020.
-
Heckman selection-t model: parameter estimation via the EM-algorithm
Authors:
Victor H. Lachos Davila,
Marcos O. Prates,
Dipak K. Dey
Abstract:
The Heckman selection model is perhaps the most popular econometric model in the analysis of data with sample selection. The analyses of this model are based on the normality assumption for the error terms; however, in some applications, the distribution of the error term departs significantly from normality, for instance, in the presence of heavy tails and/or atypical observations. In this paper, we explore the Heckman selection-t model where the random errors follow a bivariate Student's-t distribution. We develop an analytically tractable and efficient EM-type algorithm for iteratively computing maximum likelihood estimates of the parameters, with standard errors as a by-product. The algorithm has closed-form expressions at the E-step that rely on formulas for the mean and variance of the truncated Student's-t distributions. Simulation studies show the vulnerability of the Heckman selection-normal model, as well as the robustness aspects of the Heckman selection-t model. Two real examples are analyzed, illustrating the usefulness of the proposed methods. The proposed algorithms and methods are implemented in the new R package HeckmanEM.
Submitted 14 June, 2020;
originally announced June 2020.
-
Non-Separable Spatio-temporal Models via Transformed Gaussian Markov Random Fields
Authors:
Douglas R. M. Azevedo,
Marcos O. Prates,
Michael R. Willig
Abstract:
Models that capture the spatial and temporal dynamics are applicable in many science fields. Non-separable spatio-temporal models were introduced in the literature to capture these features. However, these models are generally complicated in construction and interpretation. We introduce a class of non-separable Transformed Gaussian Markov Random Fields (TGMRF) in which the dependence structure is flexible and facilitates simple interpretations concerning spatial, temporal and spatio-temporal parameters. Moreover, TGMRF models have the advantage of allowing specialists to define any desired marginal distribution in model construction without suffering from spatio-temporal confounding. Consequently, the use of spatio-temporal models under the TGMRF framework leads to a new class of general models, such as spatio-temporal Gamma random fields, that can be directly used to model Poisson intensity for space-time data. The proposed model was applied to identify important environmental characteristics that affect variation in the abundance of Nenia tridens, a dominant species of snail in a well-studied tropical ecosystem, and to characterize its spatial and temporal trends, which are particularly critical during the Anthropocene, an epoch of time characterized by human-induced environmental change associated with climate and land use.
Submitted 11 May, 2020;
originally announced May 2020.
-
Objective Bayesian analysis for spatial Student-t regression models
Authors:
Jose A. Ordoñez,
Marcos O. Prates,
Larissa A. Matos,
Victor H. Lachos
Abstract:
The choice of the prior distribution is a key aspect of Bayesian analysis. For the spatial regression setting, a subjective prior choice for the parameters may not be trivial. From this perspective, using the objective Bayesian analysis framework, a reference prior is introduced for the spatial Student-t regression model with unknown degrees of freedom. The spatial Student-t regression model poses two main challenges when eliciting priors: one for the spatial dependence parameter and the other one for the degrees of freedom. It is well-known that the propriety of the posterior distribution over objective priors is not always guaranteed, whereas the use of proper prior distributions may dominate and bias the posterior analysis. In this paper, we show the conditions under which our proposed reference prior yields a proper posterior distribution. Simulation studies are used to evaluate the performance of the reference prior relative to a commonly used vague proper prior.
Submitted 8 April, 2020;
originally announced April 2020.
-
Fast Bayesian inference of Block Nearest Neighbor Gaussian process for large data
Authors:
Zaida C. Quiroz,
Marcos O. Prates,
Dipak K. Dey,
Håvard Rue
Abstract:
This paper presents the development of a spatial block-Nearest Neighbor Gaussian process (block-NNGP) for location-referenced large spatial data. The key idea behind this approach is to divide the spatial domain into several blocks which are dependent under some constraints. The cross-blocks capture the large-scale spatial dependence, while each block captures the small-scale spatial dependence. The resulting block-NNGP enjoys Markov properties reflected in its sparse precision matrix. It is embedded as a prior within the class of latent Gaussian models; thus, Bayesian inference is obtained using the integrated nested Laplace approximation (INLA). The performance of the block-NNGP is illustrated on simulated examples and massive real data with the number of locations on the order of $10^4$.
Submitted 4 February, 2021; v1 submitted 18 August, 2019;
originally announced August 2019.
-
Dynamic Time Scan Forecasting
Authors:
Marcelo Azevedo Costa,
Leandro Brioschi Mineti,
Marcos Oliveira Prates,
Ramiro Ruiz Cardenas
Abstract:
The dynamic time scan forecasting method relies on the premise that the most important pattern in a time series precedes the forecasting window, i.e., the last observed values. Thus, a scan procedure is applied to identify similar patterns, or best matches, throughout the time series. As opposed to Euclidean distance, or any distance function, a similarity function is dynamically estimated in order to match previous values to the last observed values. Goodness-of-fit statistics are used to find the best matches. Using the respective similarity functions, the observed values following the best matches are used to create a forecasting pattern, as well as forecasting intervals. Notably, the proposed method outperformed statistical and machine learning approaches in a real-world wind speed forecasting problem.
Submitted 12 June, 2019;
originally announced June 2019.
-
Typed Graph Networks
Authors:
Marcelo O. R. Prates,
Pedro H. C. Avelar,
Henrique Lemos,
Marco Gori,
Luis Lamb
Abstract:
Recently, the deep learning community has given growing attention to neural architectures engineered to learn problems in relational domains. Convolutional Neural Networks employ parameter sharing over the image domain, tying the weights of neural connections on a grid topology and thus enforcing the learning of a number of convolutional kernels. By instantiating trainable neural modules and assembling them in varied configurations (apart from grids), one can enforce parameter sharing over graphs, yielding models which can effectively be fed with relational data. In this context, vertices in a graph can be projected into a hyperdimensional real space and iteratively refined over many message-passing iterations in an end-to-end differentiable architecture. Architectures of this family have been referred to by several names in the literature, such as Graph Neural Networks, Message-passing Neural Networks, Relational Networks and Graph Networks. In this paper, we revisit the original Graph Neural Network model and show that it generalises many of the recent models, which in turn benefit from the insight of thinking about vertex types. To illustrate the generality of the original model, we present a Graph Neural Network formalisation, which partitions the vertices of a graph into a number of types. Each type represents an entity in the ontology of the problem one wants to learn. This allows one, for instance, to assign embeddings to edges, hyperedges, and any number of global attributes of the graph. As a companion to this paper we provide a Python/TensorFlow library to facilitate the development of such architectures, with which we instantiate the formalisation to reproduce a number of models proposed in the current literature.
Submitted 24 February, 2019; v1 submitted 23 January, 2019;
originally announced January 2019.
-
Multitask Learning on Graph Neural Networks: Learning Multiple Graph Centrality Measures with a Unified Network
Authors:
Pedro H. C. Avelar,
Henrique Lemos,
Marcelo O. R. Prates,
Luis Lamb
Abstract:
The application of deep learning to symbolic domains remains an active research endeavour. Graph neural networks (GNN), consisting of trained neural modules which can be arranged in different topologies at run time, are sound alternatives to tackle relational problems which lend themselves to graph representations. In this paper, we show that GNNs are capable of multitask learning, which can be naturally enforced by training the model to refine a single set of multidimensional embeddings $\in \mathbb{R}^d$ and decode them into multiple outputs by connecting MLPs at the end of the pipeline. We demonstrate the multitask learning capability of the model in the relevant relational problem of estimating network centrality measures, focusing primarily on producing rankings based on these measures, i.e. is vertex $v_1$ more central than vertex $v_2$ given centrality $c$? We then show that a GNN can be trained to develop a lingua franca of vertex embeddings from which all relevant information about any of the trained centrality measures can be decoded. The proposed model achieves $89\%$ accuracy on a test dataset of random instances with up to 128 vertices and is shown to generalise to larger problem sizes. The model is also shown to obtain reasonable accuracy on a dataset of real world instances with up to 4k vertices, vastly surpassing the sizes of the largest instances with which the model was trained ($n=128$). Finally, we believe that our contributions attest to the potential of GNNs in symbolic domains in general and in relational learning in particular.
Submitted 28 November, 2019; v1 submitted 11 September, 2018;
originally announced September 2018.
-
Learning to Solve NP-Complete Problems - A Graph Neural Network for Decision TSP
Authors:
Marcelo O. R. Prates,
Pedro H. C. Avelar,
Henrique Lemos,
Luis Lamb,
Moshe Vardi
Abstract:
Graph Neural Networks (GNN) are a promising technique for bridging differential programming and combinatorial domains. GNNs employ trainable modules which can be assembled in different configurations that reflect the relational structure of each problem instance. In this paper, we show that GNNs can learn to solve, with very little supervision, the decision variant of the Traveling Salesperson Problem (TSP), a highly relevant $\mathcal{NP}$-Complete problem. Our model is trained to function as an effective message-passing algorithm in which edges (embedded with their weights) communicate with vertices for a number of iterations after which the model is asked to decide whether a route with cost $<C$ exists. We show that such a network can be trained with sets of dual examples: given the optimal tour cost $C^{*}$, we produce one decision instance with target cost $x\%$ smaller and one with target cost $x\%$ larger than $C^{*}$. We were able to obtain $80\%$ accuracy training with $-2\%,+2\%$ deviations, and the same trained model can generalize for more relaxed deviations with increasing performance. We also show that the model is capable of generalizing for larger problem sizes. Finally, we provide a method for predicting the optimal route cost within $2\%$ deviation from the ground truth. In summary, our work shows that Graph Neural Networks are powerful enough to solve $\mathcal{NP}$-Complete problems which combine symbolic and numeric data.
Submitted 16 November, 2018; v1 submitted 7 September, 2018;
originally announced September 2018.
-
Assessing Gender Bias in Machine Translation -- A Case Study with Google Translate
Authors:
Marcelo O. R. Prates,
Pedro H. C. Avelar,
Luis Lamb
Abstract:
Recently there has been a growing concern about machine bias, where trained statistical models grow to reflect controversial societal asymmetries, such as gender or racial bias. A significant number of AI tools have recently been suggested to be harmfully biased towards some minority, with reports of racist criminal behavior predictors, iPhone X failing to differentiate between two Asian people and Google Photos mistakenly classifying black people as gorillas. Although a systematic study of such biases can be difficult, we believe that automated translation tools can be exploited through gender neutral languages to yield a window into the phenomenon of gender bias in AI.
In this paper, we start with a comprehensive list of job positions from the U.S. Bureau of Labor Statistics (BLS) and use it to build sentences in constructions like "He/She is an Engineer" in 12 different gender neutral languages such as Hungarian, Chinese, Yoruba, and several others. We translate these sentences into English using the Google Translate API, and collect statistics about the frequency of female, male and gender-neutral pronouns in the translated output. We show that GT exhibits a strong tendency towards male defaults, in particular for fields linked to unbalanced gender distribution such as STEM jobs. We ran these statistics against BLS data for the frequency of female participation in each job position, showing that GT fails to reproduce a real-world distribution of female workers. We provide experimental evidence that even if one does not expect in principle a 50:50 pronominal gender distribution, GT yields male defaults much more frequently than what would be expected from demographic data alone.
We are hopeful that this work will ignite a debate about the need to augment current statistical translation tools with debiasing techniques which can already be found in the scientific literature.
Submitted 11 March, 2019; v1 submitted 6 September, 2018;
originally announced September 2018.
-
Bayesian linear regression models with flexible error distributions
Authors:
Nívea B. da Silva,
Marcos O. Prates,
Flávio B. Gonçalves
Abstract:
This work introduces a novel methodology based on finite mixtures of Student-t distributions to model the error distribution in linear regression models. The novelty lies in a particular hierarchical structure for the mixture distribution in which the first level models the number of modes, accommodating multimodality and skewness, and the second level models tail behavior. Moreover, the latter is specified in a way that no degrees-of-freedom parameters are estimated; therefore, the known statistical difficulties in dealing with those parameters are mitigated, and yet model flexibility is not compromised. Inference is performed via Markov chain Monte Carlo, and simulation studies are conducted to evaluate the performance of the proposed methodology. The analysis of two real data sets is also presented.
Submitted 12 November, 2017;
originally announced November 2017.
-
Robust Bayesian model selection for heavy-tailed linear regression using finite mixtures
Authors:
Flávio B Gonçalves,
Marcos O. Prates,
Victor H. Lachos
Abstract:
In this paper we present a novel methodology to perform Bayesian model selection in linear models with heavy-tailed distributions. We consider a finite mixture of distributions to model a latent variable where each component of the mixture corresponds to one possible model within the symmetrical class of normal independent distributions. Naturally, the Gaussian model is one of the possibilities. This allows for a simultaneous analysis based on the posterior probability of each model. Inference is performed via Markov chain Monte Carlo, specifically a Gibbs sampler with Metropolis-Hastings steps for a class of parameters. Simulated examples highlight the advantages of this approach compared to a segregated analysis based on arbitrarily chosen model selection criteria. Examples with real data are presented, and an extension to censored linear regression is introduced and discussed.
Submitted 17 August, 2017; v1 submitted 1 September, 2015;
originally announced September 2015.
-
Where geography lives? A projection approach for spatial confounding
Authors:
Marcos O. Prates,
Erica C. Rodrigues,
Renato M. Assunção
Abstract:
Spatial confounding between the spatial random effects and fixed-effects covariates has recently been identified and shown to potentially lead to misleading interpretations of model results. Solutions to alleviate this problem are based on decomposing the spatial random effect and fitting a restricted spatial regression. In this paper, we propose a different approach: a transformation of the geographic space to ensure that the unobserved spatial random effect added to the regression is orthogonal to the fixed-effects covariates. Our approach, named SPOCK, has the additional benefit of providing a fast and simple computational method to estimate the parameters. Furthermore, it does not constrain the distribution class assumed for the spatial error term. A simulation study and a real data analysis are presented to better understand the advantages of the new method in comparison with the existing ones.
Submitted 16 May, 2016; v1 submitted 20 July, 2014;
originally announced July 2014.
-
Inference on Dynamic Models for non-Gaussian Random Fields using INLA: A Homicide Rate Analysis of Brazilian Cities
Authors:
Renan Xavier Cortes,
Thiago Guerrera Martins,
Marcos Oliveira Prates,
Bráulio Figueiredo Alves da Silva
Abstract:
Robust time series analysis is an important subject in statistical modeling. Models based on the Gaussian distribution are sensitive to outliers, which may lead to significant degradation in estimation performance as well as in prediction accuracy. State-space models, also referred to as Dynamic Models, are a very useful way to describe the evolution of a time series variable through a structured latent evolution system. Integrated Nested Laplace Approximation (INLA) is a recent approach proposed to perform fast Bayesian inference in Latent Gaussian Models, which naturally comprise Dynamic Models. We present how to perform fast and accurate non-Gaussian dynamic modeling with INLA and show how these models can provide a more robust time series analysis when compared with standard dynamic models based on Gaussian distributions. We formalize the framework used to fit complex non-Gaussian state-space models using the R package INLA and illustrate our approach in both a simulation study and on the Brazilian homicide rate dataset.
Submitted 13 February, 2015; v1 submitted 24 December, 2013;
originally announced December 2013.
-
Transformed Gaussian Markov Random Fields and Spatial Modeling
Authors:
Marcos O. Prates,
Dipak K. Dey,
Michael R. Willig,
Jun Yan
Abstract:
The Gaussian random field (GRF) and the Gaussian Markov random field (GMRF) have been widely used to accommodate spatial dependence under the generalized linear mixed model framework. These models have limitations rooted in the symmetry and thin tails of the Gaussian distribution. We introduce a new class of random fields, termed transformed GRF (TGRF), and a new class of Markov random fields, termed transformed GMRF (TGMRF). They are constructed by transforming the margins of GRFs and GMRFs, respectively, to desired marginal distributions to accommodate asymmetry and heavy tails as needed in practice. The Gaussian copula that characterizes the dependence structure facilitates inferences and applications in modeling spatial dependence. This construction leads to new models such as gamma or beta Markov fields with Gaussian copulas, which can be used to model Poisson intensity or Bernoulli rate in a spatial generalized linear mixed model. The method is naturally implemented in a Bayesian framework. We illustrate the utility of the methodology in an ecological application with spatial count data and spatial presence/absence data of some snail species, where the new models are shown to outperform the traditional spatial models. The validity of Bayesian inferences and model selection is assessed through simulation studies for both spatial Poisson regression and spatial Bernoulli regression.
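The margin-transformation idea in this abstract can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: it assumes a one-dimensional stationary AR(1) chain as a stand-in for a GMRF, and it uses an exponential margin (a gamma with shape 1) only because its inverse CDF has a closed form. All function names are hypothetical. Each Gaussian value is pushed through the standard normal CDF and then through the inverse CDF of the target margin, so the dependence structure remains that of a Gaussian copula.

```python
import math
import random
from statistics import NormalDist


def gaussian_markov_chain(n, rho, rng):
    """Stationary AR(1) chain with N(0, 1) margins (a toy 1-D Gaussian Markov field)."""
    x = [rng.gauss(0.0, 1.0)]
    innovation_sd = math.sqrt(1.0 - rho ** 2)  # keeps the marginal variance at 1
    for _ in range(n - 1):
        x.append(rho * x[-1] + innovation_sd * rng.gauss(0.0, 1.0))
    return x


def exponential_inverse_cdf(rate):
    """Inverse CDF of an exponential (gamma with shape 1) distribution."""
    def inverse_cdf(u):
        u = min(u, 1.0 - 1e-12)  # guard against log(0) in the extreme upper tail
        return -math.log1p(-u) / rate
    return inverse_cdf


def transform_margins(x, inverse_cdf):
    """Map standard normal values to the target margin: y = F^{-1}(Phi(x))."""
    std_normal = NormalDist()
    return [inverse_cdf(std_normal.cdf(v)) for v in x]
```

For example, `transform_margins(gaussian_markov_chain(200, 0.8, random.Random(0)), exponential_inverse_cdf(2.0))` yields a positive, right-skewed sequence whose ordering matches that of the underlying Gaussian chain.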
Submitted 24 May, 2012;
originally announced May 2012.