
Identifying bias in cluster quality metrics

Authors: Martí Renedo-Mirambell, Argimiro Arratia

Abstract: We study potential biases of popular cluster quality metrics, such as conductance or modularity. We propose a method that uses both stochastic and preferential attachment block models construction to generate networks with preset community structures, to which quality metrics will be applied. These models also allow us to generate multi-level structures of varying strength, which will show if metrics favour partitions into a larger or smaller number of clusters. Additionally, we propose another quality metric, the density ratio. We observed that most of the studied metrics tend to favour partitions into a smaller number of big clusters, even when their relative internal and external connectivity are the same. The metrics found to be less biased are modularity and density ratio.

Submitted 12 December, 2021; originally announced December 2021.

ACM Class: I.5.3; I.5.2

arXiv:2105.01557 [pdf, other]

Good distribution modelling with the R package good

Authors: Jordi Tur, David Moriña, Pedro Puig, Alejandra Cabaña, Argimiro Arratia, Amanda Fernández-Fontelo

Abstract: Although models for count data with over-dispersion have been widely considered in the literature, models for under-dispersion -- the opposite phenomenon -- have received less attention as it is only relatively common in particular research fields such as biodosimetry and ecology. The Good distribution is a flexible alternative for modelling count data showing either over-dispersion or under-dispersion, although no R packages are still available to the best of our knowledge. We aim to present in the following the R package good that computes the standard probabilistic functions (i.e., probability density function, cumulative distribution function, and quantile function) and generates random samples from a population following a Good distribution. The package also considers a function for Good regression, including covariates in a similar way to that of the standard glm function. We finally show the use of such a package with some real-world data examples addressing both over-dispersion and especially under-dispersion.

Submitted 4 May, 2021; originally announced May 2021.

Comments: 15 pages, 2 figures

arXiv:2104.07575 [pdf, other]

Bayesian Synthetic Likelihood Estimation for Underreported Non-Stationary Time Series: Covid-19 Incidence in Spain

Authors: David Moriña, Amanda Fernández-Fontelo, Alejandra Cabaña, Argimiro Arratia, Pedro Puig

Abstract: The problem of dealing with misreported data is very common in a wide range of contexts for different reasons. The current situation caused by the Covid-19 worldwide pandemic is a clear example, where the data provided by official sources were not always reliable due to data collection issues and to the high proportion of asymptomatic cases. In this work, we explore the performance of Bayesian Synthetic Likelihood to estimate the parameters of a model capable of dealing with misreported information and to reconstruct the most likely evolution of the phenomenon. The performance of the proposed methodology is evaluated through a comprehensive simulation study and illustrated by reconstructing the weekly Covid-19 incidence in each Spanish Autonomous Community in 2020.

Submitted 19 July, 2022; v1 submitted 15 April, 2021; originally announced April 2021.

arXiv:2008.00262 [pdf, other]

doi 10.1371/journal.pone.0242956

Estimating the real burden of disease under a pandemic situation: The SARS-CoV2 case

Authors: Amanda Fernández-Fontelo, David Moriña, Alejandra Cabaña, Argimiro Arratia, Pere Puig

Abstract: The present paper introduces a new model used to study and analyse the severe acute respiratory syndrome coronavirus 2 (SARS-CoV2) epidemic-reported-data from Spain. This is a Hidden Markov Model whose hidden layer is a regeneration process with Poisson immigration, Po-INAR(1), together with a mechanism that allows the estimation of the under-reporting in non-stationary count time series. A novelty of the model is that the expectation of the innovations in the unobserved process is a time-dependent function defined in such a way that information about the spread of an epidemic, as modelled through a Susceptible-Infectious-Removed dynamical system, is incorporated into the model. In addition, the parameter controlling the intensity of the under-reporting is also made to vary with time to adjust to possible seasonality or trend in the data. Maximum likelihood methods are used to estimate the parameters of the model.

Submitted 1 August, 2020; originally announced August 2020.

Comments: 18 pages, 4 figures

arXiv:2007.15727 [pdf, other]

doi 10.1093/eurpub/ckab118

Cumulated burden of Covid-19 in Spain from a Bayesian perspective

Authors: David Moriña, Amanda Fernández-Fontelo, Alejandra Cabaña, Argimiro Arratia, Gustavo Ávalos, Pedro Puig

Abstract: The main goal of this work is to estimate the actual number of cases of Covid-19 in Spain in the period 01-31-2020 / 06-01-2020 by Autonomous Communities. Based on these estimates, this work allows us to accurately re-estimate the lethality of the disease in Spain, taking into account unreported cases. A hierarchical Bayesian model recently proposed in the literature has been adapted to model the actual number of Covid-19 cases in Spain. The results of this work show that the real load of Covid-19 in Spain in the period considered is well above the data registered by the public health system. Specifically, the model estimates show that, cumulatively until June 1st, 2020, there were 2,425,930 cases of Covid-19 in Spain with characteristics similar to those reported (95% credibility interval: 2,148,261 - 2,813,864), from which were actually registered only 518,664. Considering the results obtained from the second wave of the Spanish seroprevalence study, which estimates 2,350,324 cases of Covid-19 produced in Spain, in the period of time considered, it can be seen that the estimates provided by the model are quite good. This work clearly shows the key importance of having good quality data to optimize decision-making in the critical context of dealing with a pandemic.

Submitted 30 July, 2020; originally announced July 2020.

arXiv:1711.09708 [pdf, other]

Classifier Selection with Permutation Tests

Authors: Marta Arias, Argimiro Arratia, Ariel Duarte-Lopez

Abstract: This work presents a content-based recommender system for machine learning classifier algorithms. Given a new data set, a recommendation of what classifier is likely to perform best is made based on classifier performance over similar known data sets. This similarity is measured according to a data set characterization that includes several state-of-the-art metrics taking into account physical structure, statistics, and information theory. A novelty with respect to prior work is the use of a robust approach based on permutation tests to directly assess whether a given learning algorithm is able to exploit the attributes in a data set to predict class labels, and compare it to the more commonly used F-score metric for evaluating classifier performance. To evaluate our approach, we have conducted an extensive experimentation including 8 of the main machine learning classification methods with varying configurations and 65 binary data sets, leading to over 2331 experiments. Our results show that using the information from the permutation test clearly improves the quality of the recommendations.

Submitted 27 November, 2017; originally announced November 2017.

Comments: 20th International Conference of the Catalan Association for Artificial Intelligence (CCIA 2017)

arXiv:1511.02175 [pdf, ps, other]

Methods of Class Field Theory to Separate Logics over Finite Residue Classes and Circuit Complexity

Authors: Argimiro Arratia, Carlos E. Ortiz

Abstract: Separations among the first order logic ${\cal R}ing(0,+,*)$ of finite residue class rings, its extensions with generalized quantifiers, and in the presence of a built-in order are shown, using algebraic methods from class field theory. These methods include classification of spectra of sentences over finite residue classes as systems of congruences, and the study of their $h$-densities over the set of all prime numbers, for various functions $h$ on the natural numbers. Over ordered structures the logic of finite residue class rings and extensions are known to capture DLOGTIME-uniform circuit complexity classes ranging from $AC^0$ to $TC^0$. Separating these circuit complexity classes is directly related to classifying the $h$-density of spectra of sentences in the corresponding logics of finite residue classes. We further give general conditions under which a logic over the finite residue class rings has a sentence whose spectrum has no $h$-density. One application of this result is that in ${\cal R}ing(0,+,*,<) + M$, the logic of finite residue class rings with built-in order and extended with the majority quantifier $M$, there are sentences whose spectrum have no exponential density.

Submitted 6 November, 2015; originally announced November 2015.

arXiv:1210.0312 [pdf, other]

Modeling stationary data by a class of generalised Ornstein-Uhlenbeck processes

Authors: Argimiro Arratia, Alejandra Cabaña, Enrique M. Cabaña

Abstract: An Ornstein-Uhlenbeck (OU) process can be considered as a continuous time interpolation of the discrete time AR$(1)$ process. Departing from this fact, we analyse in this work the effect of iterating OU treated as a linear operator that maps a Wiener process onto Ornstein-Uhlenbeck process, so as to build a family of higher order Ornstein-Uhlenbeck processes, OU$(p)$, in a similar spirit as the higher order autoregressive processes AR$(p)$. We show that for $p \ge 2$ we obtain in general a process with covariances different than those of an AR$(p)$, and that for various continuous time processes, sampled from real data at equally spaced time instants, the OU$(p)$ model outperforms the appropriate AR$(p)$ model. Technically our composition of the OU operator is easy to manipulate and its parameters can be computed efficiently because, as we show, the iteration of OU operators leads to a process that can be expressed as a linear combination of basic OU processes. Using this expression we obtain a closed formula for the covariance of the iterated OU process, and consequently estimate the parameters of an OU$(p)$ process by maximum likelihood or, as an alternative, by matching correlations, the latter being a procedure resembling the method of moments.

Submitted 1 October, 2012; originally announced October 2012.

Comments: 23 pages, 39 figures, original work

MSC Class: 60G10; 60G15; 60G20

arXiv:1111.3127 [pdf, other]

doi 10.1007/s10614-012-9327-x

Tracing the temporal evolution of clusters in a financial stock market

Authors: Argimiro Arratia, Alejandra Cabaña

Abstract: We propose a methodology for clustering financial time series of stocks' returns, and a graphical set-up to quantify and visualise the evolution of these clusters through time. The proposed graphical representation allows for the application of well known algorithms for solving classical combinatorial graph problems, which can be interpreted as problems relevant to portfolio design and investment strategies. We illustrate this graph representation of the evolution of clusters in time and its use on real data from the Madrid Stock Exchange market.

Submitted 14 November, 2011; originally announced November 2011.

Comments: 22 pages, 3 figures (submitted for publication)

MSC Class: 62P05; 68R10

arXiv:1105.1595 [pdf, ps, other]

Ranking pages and the topology of the web

Authors: Argimiro Arratia, Carlos Marijuán

Abstract: This paper presents our studies on the rearrangement of links from the structure of websites for the purpose of improving the valuation of a page or group of pages as established by a ranking function as Google's PageRank. We build our topological taxonomy starting from unidirectional and bidirectional rooted trees, and up to more complex hierarchical structures as cyclical rooted trees (obtained by closing cycles on bidirectional trees) and PR--digraph rooted trees (digraphs whose condensation digraph is a rooted tree that behave like cyclical rooted trees). We give different modifications on the structure of these trees and its effect on the valuation given by the PageRank function. We derive closed formulas for the PageRank of the root of various types of trees, and establish a hierarchy of these topologies in terms of PageRank. We show that the PageRank of the root of cyclical and PR--digraph trees basically depends on the number of vertices per level and the number of cycles of distinct lengths among levels, and we give a closed vector formula to compute PageRank.

Submitted 23 April, 2012; v1 submitted 9 May, 2011; originally announced May 2011.

Comments: 27 pages, 5 figures. Revised version. Corrected some typos, and improve the presentation on the bidirectional case and further complex structures (section 8 and on): we extend the fmla for PR to any general bidirectional trees by considering the contribution to PR of the additional structure hanging from the end nodes of bidirectional arcs (the subtrees)

MSC Class: 05C99; 68R10; 94C15

Showing 1–10 of 10 results for author: Arratia, A