-
A Principled Approach for a New Bias Measure
Authors:
Bruno Scarone,
Alfredo Viola,
Ricardo Baeza-Yates
Abstract:
The widespread use of machine learning and data-driven algorithms for decision making has been increasing steadily for many years. The areas in which this is happening are diverse (healthcare, employment, finance, education, and the legal system, to name a few), and the associated negative side effects are increasingly harmful to society. Negative data \emph{bias} is one of those side effects, and it tends to result in harmful consequences for specific groups of people. Any mitigation strategy or effective policy that addresses the negative consequences of bias must start with awareness that bias exists, together with a way to understand and quantify it. However, there is a lack of consensus on how to measure data bias, and the intended meaning is often context dependent and not uniform within the research community. The main contributions of our work are: (1) a general algorithmic framework for defining and efficiently quantifying the bias level of a dataset with respect to a protected group; and (2) the definition of a new bias measure. Our results are validated experimentally on nine publicly available datasets and analyzed theoretically, which provides novel insights about the problem. Based on our approach, we also derive a bias mitigation algorithm that might be useful to policymakers.
Submitted 20 May, 2024;
originally announced May 2024.
-
Qubo model for the Closest Vector Problem
Authors:
Eduardo Canale,
Claudio Qureshi,
Alfredo Viola
Abstract:
In this paper we consider the closest vector problem (CVP) for lattices $Λ\subseteq \mathbb{Z}^n$ given by a generator matrix $A\in \mathcal{M}_{n\times n}(\mathbb{Z})$. Let $b>0$ be the maximum of the absolute values of the entries of the matrix $A$. We prove that the CVP can be reduced in polynomial time to a quadratic unconstrained binary optimization (QUBO) problem in $O(n^2(\log(n)+\log(b)))$ binary variables, where the length of the coefficients in the corresponding quadratic form is $O(n(\log(n)+\log(b)))$.
Submitted 7 April, 2023;
originally announced April 2023.
-
Asymptotic analysis and efficient random sampling of directed ordered acyclic graphs
Authors:
Martin Pépin,
Alfredo Viola
Abstract:
Directed acyclic graphs (DAGs) are directed graphs in which there is no path from a vertex to itself. DAGs are an omnipresent data structure in computer science, and the problems of counting the DAGs with a given number of vertices and of sampling them uniformly at random were solved in the 1970s and the 2000s, respectively. In this paper, we propose to explore a new variation of this model where DAGs are endowed with an independent ordering of the out-edges of each vertex, which makes it possible to model a wide range of existing data structures. We provide efficient algorithms for sampling objects of this new class, both with and without control on the number of edges, and obtain an asymptotic equivalent of their number. We also show the applicability of our method by providing an effective algorithm for the random generation of classical labelled DAGs with a prescribed number of vertices and edges, based on a similar approach. This is the first known algorithm for sampling labelled DAGs with full control on the number of edges, and it meets a need in terms of applications that had already been acknowledged in the literature.
Submitted 9 July, 2023; v1 submitted 26 March, 2023;
originally announced March 2023.
-
Beyond series-parallel concurrent systems: the case of arch processes
Authors:
Olivier Bodini,
Matthieu Dien,
Antoine Genitrini,
Alfredo Viola
Abstract:
In this paper we focus on concurrent processes built on synchronization by means of futures. This concept is an abstraction for processes based on a main execution thread that allows some computations to be delayed. The structure of a general concurrent process with futures is more or less a directed acyclic graph. Since the quantitative study of such increasingly labeled graphs (directly related to processes) seems out of reach, we restrict ourselves to the study of arch processes, a simplistic model of processes with futures. They are based on two parameters related to their sizes and their numbers of arches. The increasingly labeled structures do not seem to be specifiable in the sense of Analytic Combinatorics, but we manage to derive a recurrence equation for the enumeration. For this model we first exhibit an exact and an asymptotic formula for the number of runs of a given process. The second main contribution consists of a uniform random sampling algorithm and an unranking algorithm that allow efficient generation and exhaustive enumeration of the runs of a given arch process.
Submitted 2 March, 2018;
originally announced March 2018.
-
Analysis of the Continued Logarithm Algorithm
Authors:
Pablo Rotondo,
Brigitte Vallee,
Alfredo Viola
Abstract:
The Continued Logarithm Algorithm (CL for short), introduced by Gosper in 1978, computes the gcd of two integers; it seems very efficient, as it only performs shifts and subtractions. Shallit studied its worst-case complexity in 2016 and showed it to be linear. We here perform the average-case analysis of the algorithm: we study its main parameters (number of iterations, total number of shifts) and obtain precise asymptotics for their mean values. Our 'dynamical' analysis involves the dynamical system underlying the algorithm, which produces continued fraction expansions whose quotients are powers of 2. Even though this CL system has already been studied by Chan (around 2005), the presence of powers of 2 in the quotients ingrains into the central parameters a dyadic flavour that cannot be grasped solely by studying the CL system. We thus introduce a dyadic component and deal with a two-component system. With this new mixed system at hand, we then provide a complete average-case analysis of the CL algorithm, with explicit constants.
Submitted 1 February, 2018; v1 submitted 30 January, 2018;
originally announced January 2018.
-
Robin Hood Hashing really has constant average search cost and variance in full tables
Authors:
Patricio V. Poblete,
Alfredo Viola
Abstract:
Thirty years ago, the Robin Hood collision resolution strategy was introduced for open addressing hash tables, and a recurrence equation was found for the distribution of its search cost. Although this recurrence could not be solved analytically, it allowed for numerical computations that, remarkably, suggested that the variance of the search cost approached a value of $1.883$ when the table was full. Furthermore, by using a non-standard mean-centered search algorithm, this would imply that searches could be performed in expected constant time even in a full table.
In spite of the time elapsed since these observations were made, no progress has been made in proving them. In this paper we introduce a technique to work around the intractability of the recurrence equation by solving instead an associated differential equation. While this does not provide an exact solution, it is sufficiently powerful to prove a bound for the variance, and thus obtain a proof that the variance of Robin Hood is bounded by a small constant for load factors arbitrarily close to 1. As a corollary, this proves that the mean-centered search algorithm runs in expected constant time.
We also use this technique to study the performance of Robin Hood hash tables under a long sequence of insertions and deletions, where deletions are implemented by marking elements as deleted. We prove that, in this case, the variance is bounded by $1/(1-α)+O(1)$, where $α$ is the load factor.
To model the behavior of these hash tables, we use a unified approach that can be applied also to study the First-Come-First-Served and Last-Come-First-Served collision resolution disciplines, both with and without deletions.
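As an illustration of the collision resolution strategy the abstract refers to, here is a minimal sketch of Robin Hood insertion under linear probing. It is not taken from the paper: the hash function `key % size`, the integer keys, and the names `robin_hood_insert` and `displacements` are assumptions made for the example. On a collision, the element that is farther from its home slot keeps the slot, which evens out displacements and so lowers their variance.

```python
from typing import List, Optional


def robin_hood_insert(table: List[Optional[int]], key: int) -> None:
    """Insert an integer key using linear probing with the Robin Hood rule.

    Assumption for this sketch: the home slot of a key is key % len(table).
    """
    size = len(table)
    if all(slot is not None for slot in table):
        raise ValueError("table is full")
    pos = key % size
    dist = 0  # displacement of the key currently being carried
    while table[pos] is not None:
        resident = table[pos]
        resident_dist = (pos - resident % size) % size
        # Robin Hood rule: a resident that is closer to its home slot
        # than the carried key gives up the slot and is carried onward.
        if resident_dist < dist:
            table[pos], key = key, resident
            dist = resident_dist
        pos = (pos + 1) % size
        dist += 1
    table[pos] = key


def displacements(table: List[Optional[int]]) -> List[int]:
    """Distance of every stored key from its home slot."""
    size = len(table)
    return [(pos - key % size) % size
            for pos, key in enumerate(table) if key is not None]


table: List[Optional[int]] = [None] * 7
for k in (1, 0, 7):
    robin_hood_insert(table, k)
```

In this run, key 7 collides at slot 0, then takes slot 1 from key 1, which moves on to slot 2. Plain first-come-first-served probing would instead leave key 1 in place and put key 7 at displacement 2; the total displacement is the same, but it is spread more evenly.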
Submitted 12 May, 2016;
originally announced May 2016.
-
A unified approach to linear probing hashing with buckets
Authors:
Svante Janson,
Alfredo Viola
Abstract:
We give a unified analysis of linear probing hashing with a general bucket size. We use both a combinatorial approach, giving exact formulas for generating functions, and a probabilistic approach, giving simple derivations of asymptotic results. The two approaches complement each other nicely and give good insight into the relation between linear probing and random walks. A key methodological contribution, at the core of Analytic Combinatorics, is the use of the symbolic method (based on q-calculus) to directly derive the generating functions to analyze.
Submitted 22 October, 2014;
originally announced October 2014.
-
Optimal prefix codes for pairs of geometrically-distributed random variables
Authors:
Frédérique Bassino,
Julien Clément,
Gadiel Seroussi,
Alfredo Viola
Abstract:
Optimal prefix codes are studied for pairs of independent, integer-valued symbols emitted by a source with a geometric probability distribution of parameter $q$, $0{<}q{<}1$. By encoding pairs of symbols, it is possible to reduce the redundancy penalty of symbol-by-symbol encoding, while preserving the simplicity of the encoding and decoding procedures typical of Golomb codes and their variants. It is shown that optimal codes for these so-called two-dimensional geometric distributions are \emph{singular}, in the sense that a prefix code that is optimal for one value of the parameter $q$ cannot be optimal for any other value of $q$. This is in sharp contrast to the one-dimensional case, where codes are optimal for positive-length intervals of the parameter $q$. Thus, in the two-dimensional case, it is infeasible to give a compact characterization of optimal codes for all values of the parameter $q$, as was done in the one-dimensional case. Instead, optimal codes are characterized for a discrete sequence of values of $q$ that provide good coverage of the unit interval. Specifically, optimal prefix codes are described for $q=2^{-1/k}$ ($k\ge 1$), covering the range $q\ge 1/2$, and $q=2^{-k}$ ($k>1$), covering the range $q<1/2$. The described codes produce the expected reduction in redundancy with respect to the one-dimensional case, while maintaining low complexity coding operations.
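For readers unfamiliar with the Golomb codes mentioned in the abstract, the following is a minimal sketch of the classical one-dimensional Golomb encoder: a unary-coded quotient followed by a truncated-binary remainder. It illustrates only the symbol-by-symbol baseline, not the two-dimensional codes constructed in the paper. The function name `golomb_encode` and the bit-string output format are assumptions made for the example.

```python
def golomb_encode(n: int, m: int) -> str:
    """Return the Golomb codeword of the integer n >= 0 with parameter m >= 1.

    The codeword is the quotient n // m in unary (ones closed by a zero),
    followed by the remainder n % m in truncated binary.
    """
    if n < 0 or m < 1:
        raise ValueError("need n >= 0 and m >= 1")
    quotient, remainder = divmod(n, m)
    prefix = "1" * quotient + "0"
    if m == 1:
        return prefix  # the remainder is always 0 and needs no bits
    width = (m - 1).bit_length()   # ceil(log2(m)) for m >= 2
    cutoff = (1 << width) - m      # number of remainders that get short codes
    if remainder < cutoff:
        return prefix + format(remainder, "b").zfill(width - 1)
    return prefix + format(remainder + cutoff, "b").zfill(width)


# Codewords of 0..5 for m = 3
codewords = [golomb_encode(n, 3) for n in range(6)]
```

With `m = 3`, the remainder 0 gets a one-bit code and the remainders 1 and 2 get two-bit codes, so small integers, which are the most probable under a geometric distribution, receive the shortest codewords.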
Submitted 6 January, 2013; v1 submitted 11 February, 2011;
originally announced February 2011.