Search | arXiv e-print repository

When Large Language Models contradict humans? Large Language Models' Sycophantic Behaviour

Abstract: Large Language Models have been demonstrating the ability to solve complex tasks by delivering answers that are positively evaluated by humans due in part to the intensive use of human feedback that refines responses. However, the suggestibility transmitted through human feedback increases the inclination to produce responses that correspond to the users' beliefs or misleading prompts as opposed to true facts, a behaviour known as sycophancy. This phenomenon decreases the bias, robustness, and, consequently, their reliability. In this paper, we shed light on the suggestibility of Large Language Models (LLMs) to sycophantic behaviour, demonstrating these tendencies via human-influenced prompts over different tasks. Our investigation reveals that LLMs show sycophantic tendencies when responding to queries involving subjective opinions and statements that should elicit a contrary response based on facts. In contrast, when confronted with mathematical tasks or queries that have an objective answer, these models at various scales seem not to follow the users' hints by demonstrating confidence in delivering the correct answers.

Submitted 28 April, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

arXiv:2311.08097 [pdf, other]

Empowering Multi-step Reasoning across Languages via Tree-of-Thoughts

Authors: Leonardo Ranaldi, Giulia Pucci, Federico Ranaldi, Elena Sofia Ruzzetti, Fabio Massimo Zanzotto

Abstract: Reasoning methods, best exemplified by the well-known Chain-of-Thought (CoT), empower the reasoning abilities of Large Language Models (LLMs) by eliciting them to solve complex tasks in a step-by-step manner. Although they are achieving significant success, the ability to deliver multi-step reasoning remains limited to English because of the imbalance in the distribution of pre-training data, which makes other languages a barrier. In this paper, we propose Cross-lingual Tree-of-Thoughts (Cross-ToT), a method for aligning Cross-lingual CoT reasoning across languages. The proposed method, through a self-consistent cross-lingual prompting mechanism inspired by the Tree-of-Thoughts approach, provides multi-step reasoning paths in different languages that, during the steps, lead to the final solution. Experimental evaluations show that our method significantly outperforms existing prompting methods by reducing the number of interactions and achieving state-of-the-art performance.

Submitted 21 June, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

Comments: Findings of the Association for Computational Linguistics: NAACL 2024

Report number: 2024.findings-naacl.78

Journal ref: 2024.findings-naacl.78

arXiv:2308.14186 [pdf, other]

Empowering Cross-lingual Abilities of Instruction-tuned Large Language Models by Translation-following demonstrations

Authors: Leonardo Ranaldi, Giulia Pucci, Andre Freitas

Abstract: The language ability of Large Language Models (LLMs) is often unbalanced towards English because of the imbalance in the distribution of the pre-training data. This disparity is demanded in further fine-tuning and affecting the cross-lingual abilities of LLMs. In this paper, we propose to empower Instruction-tuned LLMs (It-LLMs) in languages other than English by building semantic alignment between them. Hence, we propose CrossAlpaca, an It-LLM with cross-lingual instruction-following and Translation-following demonstrations to improve semantic alignment between languages. We validate our approach on the multilingual Question Answering (QA) benchmarks XQUAD and MLQA and adapted versions of MMLU and BBH. Our models, tested over six different languages, outperform the It-LLMs tuned on monolingual data. The final results show that instruction tuning on non-English data is not enough and that semantic alignment can be further improved by Translation-following demonstrations.

Submitted 27 August, 2023; originally announced August 2023.

arXiv:2302.07771 [pdf, ps, other]

doi 10.1007/978-3-031-38906-1_41

Fully dynamic clustering and diversity maximization in doubling metrics

Authors: Paolo Pellizzoni, Andrea Pietracaprina, Geppino Pucci

Abstract: We present approximation algorithms for some variants of center-based clustering and related problems in the fully dynamic setting, where the pointset evolves through an arbitrary sequence of insertions and deletions. Specifically, we target the following problems: $k$-center (with and without outliers), matroid-center, and diversity maximization. All algorithms employ a coreset-based strategy and rely on the use of the cover tree data structure, which we crucially augment to maintain, at any time, some additional information enabling the efficient extraction of the solution for the specific problem. For all of the aforementioned problems our algorithms yield $(α+\varepsilon)$-approximations, where $α$ is the best known approximation attainable in polynomial time in the standard off-line setting (except for $k$-center with $z$ outliers where $α= 2$ but we get a $(3+\varepsilon)$-approximation) and $\varepsilon>0$ is a user-provided accuracy parameter. The analysis of the algorithms is performed in terms of the doubling dimension of the underlying metric. Remarkably, and unlike previous works, the data structure and the running times of the insertion and deletion procedures do not depend in any way on the accuracy parameter $\varepsilon$ and, for the two $k$-center variants, on the parameter $k$. For spaces of bounded doubling dimension, the running times are dramatically smaller than those that would be required to compute solutions on the entire pointset from scratch. To the best of our knowledge, ours are the first solutions for the matroid-center and diversity maximization problems in the fully dynamic setting.

Submitted 15 February, 2023; originally announced February 2023.

Journal ref: WADS 2023. Lecture Notes in Computer Science, vol 14079. Springer, Cham

arXiv:2202.08173 [pdf, other]

Distributed k-Means with Outliers in General Metrics

Authors: Enrico Dandolo, Andrea Pietracaprina, Geppino Pucci

Abstract: Center-based clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is undoubtedly the k-means problem, which, given a set $P$ of points from a metric space and a parameter $k<|P|$, requires to determine a subset $S$ of $k$ centers minimizing the sum of all squared distances of points in $P$ from their closest center. A more general formulation, known as k-means with $z$ outliers, introduced to deal with noisy datasets, features a further parameter $z$ and allows up to $z$ points of $P$ (outliers) to be disregarded when computing the aforementioned sum. We present a distributed coreset-based 3-round approximation algorithm for k-means with $z$ outliers for general metric spaces, using MapReduce as a computational model. Our distributed algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term $O(γ)$ away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where $γ$ can be made arbitrarily small. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by the doubling dimension $D$ of the metric space. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for general metrics.

Submitted 18 February, 2022; v1 submitted 16 February, 2022; originally announced February 2022.

arXiv:2201.02448 [pdf, other]

doi 10.3390/a15020052

k-Center Clustering with Outliers in Sliding Windows

Authors: Paolo Pellizzoni, Andrea Pietracaprina, Geppino Pucci

Abstract: Metric $k$-center clustering is a fundamental unsupervised learning primitive. Although widely used, this primitive is heavily affected by noise in the data, so that a more sensible variant seeks for the best solution that disregards a given number $z$ of points of the dataset, called outliers. We provide efficient algorithms for this important variant in the streaming model under the sliding window setting, where, at each time step, the dataset to be clustered is the window $W$ of the most recent data items. Our algorithms achieve $O(1)$ approximation and, remarkably, require a working memory linear in $k+z$ and only logarithmic in $|W|$. As a by-product, we show how to estimate the effective diameter of the window $W$, which is a measure of the spread of the window points, disregarding a given fraction of noisy distances. We also provide experimental evidence of the practical viability of our theoretical results.

Submitted 7 January, 2022; originally announced January 2022.

Journal ref: Algorithms. 2022; 15(2):52

arXiv:2003.01430 [pdf, other]

Scalable Distributed Approximation of Internal Measures for Clustering Evaluation

Authors: Federico Altieri, Andrea Pietracaprina, Geppino Pucci, Fabio Vandin

Abstract: The most widely used internal measure for clustering evaluation is the silhouette coefficient, whose naive computation requires a quadratic number of distance calculations, which is clearly unfeasible for massive datasets. Surprisingly, there are no known general methods to efficiently approximate the silhouette coefficient of a clustering with rigorously provable high accuracy. In this paper, we present the first scalable algorithm to compute such a rigorous approximation for the evaluation of clusterings based on any metric distances. Our algorithm hinges on a Probability Proportional to Size (PPS) sampling scheme, and, for any fixed $\varepsilon, δ\in (0,1)$, it approximates the silhouette coefficient within a mere additive error $O(\varepsilon)$ with probability $1-δ$, using a very small number of distance calculations. We also prove that the algorithm can be adapted to obtain rigorous approximations of other internal measures of clustering quality, such as cohesion and separation. Importantly, we provide a distributed implementation of the algorithm using the MapReduce model, which runs in constant rounds and requires only sublinear local space at each worker, which makes our estimation approach applicable to big data scenarios. We perform an extensive experimental evaluation of our silhouette approximation algorithm, comparing its performance to a number of baseline heuristics on real and synthetic datasets. The experiments provide evidence that, unlike other heuristics, our estimation strategy not only provides tight theoretical guarantees but is also able to return highly accurate estimations while running in a fraction of the time required by the exact computation, and that its distributed implementation is highly scalable, thus enabling the computation of internal measures for very large datasets for which the exact computation is prohibitive.

Submitted 20 January, 2021; v1 submitted 3 March, 2020; originally announced March 2020.

Comments: 16 pages, 4 tables, 1 figure

ACM Class: I.5.3; I.5.4; I.5.5
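
Note: the quadratic exact computation that the abstract above calls naive can be sketched as follows. This is a minimal illustration of the standard silhouette definition, assuming Euclidean points and a flat label list; it is not the paper's PPS-based approximation, and the function name `silhouette` and the singleton convention are choices made here for illustration.

```python
import math


def silhouette(points, labels):
    """Exact mean silhouette using all pairwise distances (quadratic time)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # assumed convention: a singleton cluster contributes 0
        # a: mean distance from p to the other points of its own cluster.
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance from p to the points of any other cluster.
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        if denom > 0:  # all-coincident points contribute 0
            total += (b - a) / denom
    return total / len(points)
```

Every point is compared against every other point, which is where the quadratic number of distance calculations comes from.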

arXiv:2002.07463 [pdf, ps, other]

Coreset-based Strategies for Robust Center-type Problems

Authors: Andrea Pietracaprina, Geppino Pucci, Federico Soldà

Abstract: Given a dataset $V$ of points from some metric space, the popular $k$-center problem requires to identify a subset of $k$ points (centers) in $V$ minimizing the maximum distance of any point of $V$ from its closest center. The \emph{robust} formulation of the problem features a further parameter $z$ and allows up to $z$ points of $V$ (outliers) to be disregarded when computing the maximum distance from the centers. In this paper, we focus on two important constrained variants of the robust $k$-center problem, namely, the Robust Matroid Center (RMC) problem, where the set of returned centers are constrained to be an independent set of a matroid of rank $k$ built on $V$, and the Robust Knapsack Center (RKC) problem, where each element $i\in V$ is given a positive weight $w_i<1$ and the aggregate weight of the returned centers must be at most 1. We devise coreset-based strategies for the two problems which yield efficient sequential, MapReduce, and Streaming algorithms. More specifically, for any fixed $ε>0$, the algorithms return solutions featuring a $(3+ε)$-approximation ratio, which is a mere additive term $ε$ away from the 3-approximations achievable by the best known polynomial-time sequential algorithms for the two problems. Moreover, the algorithms obliviously adapt to the intrinsic complexity of the dataset, captured by its doubling dimension $D$. For wide ranges of the parameters $k,z,ε, D$, we obtain a sequential algorithm with running time linear in $|V|$, and MapReduce/Streaming algorithms with few rounds/passes and substantially sublinear local/working memory.

Submitted 18 February, 2020; originally announced February 2020.

Comments: 16 pages

arXiv:2002.03175 [pdf, ps, other]

A General Coreset-Based Approach to Diversity Maximization under Matroid Constraints

Authors: Matteo Ceccarello, Andrea Pietracaprina, Geppino Pucci

Abstract: Diversity maximization is a fundamental problem in web search and data mining. For a given dataset $S$ of $n$ elements, the problem requires to determine a subset of $S$ containing $k\ll n$ "representatives" which maximize some diversity function expressed in terms of pairwise distances, where distance models dissimilarity. An important variant of the problem prescribes that the solution satisfy an additional orthogonal requirement, which can be specified as a matroid constraint (i.e., a feasible solution must be an independent set of size $k$ of a given matroid). While unconstrained diversity maximization admits efficient coreset-based strategies for several diversity functions, known approaches dealing with the additional matroid constraint apply only to one diversity function (sum of distances), and are based on an expensive, inherently sequential, local search over the entire input dataset. We devise the first coreset-based algorithms for diversity maximization under matroid constraints for various diversity functions, together with efficient sequential, MapReduce and Streaming implementations. Technically, our algorithms rely on the construction of a small coreset, that is, a subset of $S$ containing a feasible solution which is no more than a factor $1-ε$ away from the optimal solution for $S$. While our algorithms are fully general, for the partition and transversal matroids, if $ε$ is a constant in $(0,1)$ and $S$ has bounded doubling dimension, the coreset size is independent of $n$ and it is small enough to afford the execution of a slow sequential algorithm to extract a final, accurate, solution in reasonable time. Extensive experiments show that our algorithms are accurate, fast and scalable, and therefore they are capable of dealing with the large input instances typical of the big data scenario.

Submitted 8 February, 2020; originally announced February 2020.

arXiv:1904.12728 [pdf, ps, other]

Accurate MapReduce Algorithms for $k$-median and $k$-means in General Metric Spaces

Authors: Alessio Mazzetto, Andrea Pietracaprina, Geppino Pucci

Abstract: Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular $k$-median and $k$-means variants which, given a set $P$ of points from a metric space and a parameter $k<|P|$, require to identify a set $S$ of $k$ centers minimizing, respectively, the sum of the distances and of the squared distances of all points in $P$ from their closest centers. Our specific focus is on general metric spaces, for which it is reasonable to require that the centers belong to the input set (i.e., $S \subseteq P$). We present coreset-based 3-round distributed approximation algorithms for the above problems using the MapReduce computational model. The algorithms are rather simple and obliviously adapt to the intrinsic complexity of the dataset, captured by the doubling dimension $D$ of the metric space. Remarkably, the algorithms attain approximation ratios that can be made arbitrarily close to those achievable by the best known polynomial-time sequential approximations, and they are very space efficient for small $D$, requiring local memory sizes substantially sublinear in the input size. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance guarantees in general metric spaces.

Submitted 29 September, 2019; v1 submitted 29 April, 2019; originally announced April 2019.

arXiv:1802.09205 [pdf, other]

Solving $k$-center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially

Authors: Matteo Ceccarello, Andrea Pietracaprina, Geppino Pucci

Abstract: Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular $k$-center variant which, given a set $S$ of points from some metric space and a parameter $k<|S|$, requires to identify a subset of $k$ centers in $S$ minimizing the maximum distance of any point of $S$ from its closest center. A more general formulation, introduced to deal with noisy datasets, features a further parameter $z$ and allows up to $z$ points of $S$ (outliers) to be disregarded when computing the maximum distance from the centers. We present coreset-based 2-round MapReduce algorithms for the above two formulations of the problem, and a 1-pass Streaming algorithm for the case with outliers. For any fixed $ε>0$, the algorithms yield solutions whose approximation ratios are a mere additive term $ε$ away from those achievable by the best known polynomial-time sequential algorithms, a result that substantially improves upon the state of the art. Our algorithms are rather simple and adapt to the intrinsic complexity of the dataset, captured by the doubling dimension $D$ of the metric space. Specifically, our analysis shows that the algorithms become very space-efficient for the important case of small (constant) $D$. These theoretical results are complemented with a set of experiments on real-world and synthetic datasets of up to over a billion points, which show that our algorithms yield better quality solutions over the state of the art while featuring excellent scalability, and that they also lend themselves to sequential implementations much faster than existing ones.

Submitted 1 June, 2021; v1 submitted 26 February, 2018; originally announced February 2018.
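
Note: the $k$-center objective defined in the abstract above (minimize the maximum distance of any point from its closest center) can be made concrete with the classical sequential farthest-first traversal, a well-known 2-approximation. The sketch below assumes Euclidean points and is only a baseline illustration; it is not the coreset-based MapReduce or Streaming algorithm of the paper, and the name `greedy_k_center` is hypothetical.

```python
import math


def greedy_k_center(points, k):
    """Farthest-first traversal; returns (centers, radius)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    centers = [points[0]]  # the first center is arbitrary
    # dist[i] is the distance from points[i] to its closest chosen center.
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Pick the point that is currently worst served as the next center.
        far = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[far])
        dist = [min(d, math.dist(p, points[far])) for p, d in zip(points, dist)]
    # The radius is the k-center objective value of the returned solution.
    return centers, max(dist)
```

The returned radius is the objective value, so solutions from different methods can be compared directly on the same pointset.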

arXiv:1612.06675 [pdf, other]

Clustering Uncertain Graphs

Authors: Matteo Ceccarello, Carlo Fantozzi, Andrea Pietracaprina, Geppino Pucci, Fabio Vandin

Abstract: An uncertain graph $\mathcal{G} = (V, E, p : E \rightarrow (0,1])$ can be viewed as a probability space whose outcomes (referred to as \emph{possible worlds}) are subgraphs of $\mathcal{G}$ where any edge $e\in E$ occurs with probability $p(e)$, independently of the other edges. These graphs naturally arise in many application domains where data management systems are required to cope with uncertainty in interrelated data, such as computational biology, social network analysis, network reliability, and privacy enforcement, among the others. For this reason, it is important to devise fundamental querying and mining primitives for uncertain graphs. This paper contributes to this endeavor with the development of novel strategies for clustering uncertain graphs. Specifically, given an uncertain graph $\mathcal{G}$ and an integer $k$, we aim at partitioning its nodes into $k$ clusters, each featuring a distinguished center node, so to maximize the minimum/average connection probability of any node to its cluster's center, in a random possible world. We assess the NP-hardness of maximizing the minimum connection probability, even in the presence of an oracle for the connection probabilities, and develop efficient approximation algorithms for both problems and some useful variants. Unlike previous works in the literature, our algorithms feature provable approximation guarantees and are capable to keep the granularity of the returned clustering under control. Our theoretical findings are complemented with several experiments that compare our algorithms against some relevant competitors, with respect to both running-time and quality of the returned clusterings.

Submitted 16 October, 2017; v1 submitted 20 December, 2016; originally announced December 2016.

arXiv:1605.05590 [pdf, other]

MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension

Authors: Matteo Ceccarello, Andrea Pietracaprina, Geppino Pucci, Eli Upfal

Abstract: Given a dataset of points in a metric space and an integer $k$, a diversity maximization problem requires determining a subset of $k$ points maximizing some diversity objective measure, e.g., the minimum or the average distance between two points in the subset. Diversity maximization is computationally hard, hence only approximate solutions can be hoped for. Although its applications are mainly in massive data analysis, most of the past research on diversity maximization focused on the sequential setting. In this work we present space and pass/round-efficient diversity maximization algorithms for the Streaming and MapReduce models and analyze their approximation guarantees for the relevant class of metric spaces of bounded doubling dimension. Like other approaches in the literature, our algorithms rely on the determination of high-quality core-sets, i.e., (much) smaller subsets of the input which contain good approximations to the optimal solution for the whole input. For a variety of diversity objective functions, our algorithms attain an $(α+ε)$-approximation ratio, for any constant $ε>0$, where $α$ is the best approximation ratio achieved by a polynomial-time, linear-space sequential algorithm for the same diversity objective. This improves substantially over the approximation ratios attainable in Streaming and MapReduce by state-of-the-art algorithms for general metric spaces. We provide extensive experimental evidence of the effectiveness of our algorithms on both real world and synthetic datasets, scaling up to over a billion points.

Submitted 23 January, 2017; v1 submitted 18 May, 2016; originally announced May 2016.

Comments: Extended version of http://www.vldb.org/pvldb/vol10/p469-ceccarello.pdf, PVLDB Volume 10, No. 5, January 2017
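
Note: the two diversity objectives named in the abstract above (minimum and average distance between two points of the subset) can be written down directly. The sketch below assumes Euclidean points and subsets of at least two points; the exhaustive `best_subset` search is exponential and is included only to make the problem statement concrete on toy inputs. It is not the core-set method of the paper, and all function names are hypothetical.

```python
import math
from itertools import combinations


def min_distance(subset):
    """Minimum pairwise distance within the subset."""
    return min(math.dist(p, q) for p, q in combinations(subset, 2))


def average_distance(subset):
    """Average pairwise distance within the subset."""
    pairs = list(combinations(subset, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def best_subset(points, k, objective):
    """Brute-force optimum over all k-subsets (tiny inputs only)."""
    return max(combinations(points, k), key=objective)
```

For example, `best_subset(points, 3, min_distance)` returns a 3-point subset whose closest pair is as far apart as possible.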

arXiv:1506.03265 [pdf, other]

A Practical Parallel Algorithm for Diameter Approximation of Massive Weighted Graphs

Authors: Matteo Ceccarello, Andrea Pietracaprina, Geppino Pucci, Eli Upfal

Abstract: We present a space and time efficient practical parallel algorithm for approximating the diameter of massive weighted undirected graphs on distributed platforms supporting a MapReduce-like abstraction. The core of the algorithm is a weighted graph decomposition strategy generating disjoint clusters of bounded weighted radius. Theoretically, our algorithm uses linear space and yields a polylogarithmic approximation guarantee; moreover, for important practical classes of graphs, it runs in a number of rounds asymptotically smaller than those required by the natural approximation provided by the state-of-the-art $Δ$-stepping SSSP algorithm, which is its only practical linear-space competitor in the aforementioned computational scenario. We complement our theoretical findings with an extensive experimental analysis on large benchmark graphs, which demonstrates that our algorithm attains substantial improvements on a number of key performance indicators with respect to the aforementioned competitor, while featuring a similar approximation ratio (a small constant less than 1.4, as opposed to the polylogarithmic theoretical bound).

Submitted 9 November, 2015; v1 submitted 10 June, 2015; originally announced June 2015.

arXiv:1407.3144 [pdf, other]

Space and Time Efficient Parallel Graph Decomposition, Clustering, and Diameter Approximation

Authors: Matteo Ceccarello, Andrea Pietracaprina, Geppino Pucci, Eli Upfal

Abstract: We develop a novel parallel decomposition strategy for unweighted, undirected graphs, based on growing disjoint connected clusters from batches of centers progressively selected from yet uncovered nodes. With respect to similar previous decompositions, our strategy exercises a tighter control on both the number of clusters and their maximum radius. We present two important applications of our parallel graph decomposition: (1) $k$-center clustering approximation; and (2) diameter approximation. In both cases, we obtain algorithms which feature a polylogarithmic approximation factor and are amenable to a distributed implementation that is geared for massive (long-diameter) graphs. The total space needed for the computation is linear in the problem size, and the parallel depth is substantially sublinear in the diameter for graphs with low doubling dimension. To the best of our knowledge, ours are the first parallel approximations for these problems which achieve sub-diameter parallel time, for a relevant class of graphs, using only linear space. Besides the theoretical guarantees, our algorithms allow for a very simple implementation on clustered architectures: we report on extensive experiments which demonstrate their effectiveness and efficiency on large graphs as compared to alternative known approaches.

Submitted 6 February, 2015; v1 submitted 11 July, 2014; originally announced July 2014.

Comments: 14 pages

arXiv:1404.3318 [pdf, other]

Network-Oblivious Algorithms

Authors: Gianfranco Bilardi, Andrea Pietracaprina, Geppino Pucci, Michele Scquizzato, Francesco Silvestri

Abstract: A framework is proposed for the design and analysis of \emph{network-oblivious algorithms}, namely, algorithms that can run unchanged, yet efficiently, on a variety of machines characterized by different degrees of parallelism and communication capabilities. The framework prescribes that a network-oblivious algorithm be specified on a parallel model of computation where the only parameter is the problem's input size, and then evaluated on a model with two parameters, capturing parallelism granularity and communication latency. It is shown that, for a wide class of network-oblivious algorithms, optimality in the latter model implies optimality in the Decomposable BSP model, which is known to effectively describe a wide and significant class of parallel platforms. The proposed framework can be regarded as an attempt to port the notion of obliviousness, well established in the context of cache hierarchies, to the realm of parallel computation. Its effectiveness is illustrated by providing optimal network-oblivious algorithms for a number of key problems. Some limitations of the oblivious approach are also discussed.

Submitted 12 April, 2014; originally announced April 2014.

Comments: 34 pages

arXiv:1306.2552 [pdf, other]

doi 10.4230/LIPIcs.STACS.2014.627

Space-Efficient Parallel Algorithms for Combinatorial Search Problems

Authors: Andrea Pietracaprina, Geppino Pucci, Francesco Silvestri, Fabio Vandin

Abstract: We present space-efficient parallel strategies for two fundamental combinatorial search problems, namely, backtrack search and branch-and-bound, both involving the visit of an $n$-node tree of height $h$ under the assumption that a node can be accessed only through its father or its children. For both problems we propose efficient algorithms that run on a $p$-processor distributed-memory machine. For backtrack search, we give a deterministic algorithm running in $O(n/p+h\log p)$ time, and a Las Vegas algorithm requiring optimal $O(n/p+h)$ time, with high probability. Building on the backtrack search algorithm, we also derive a Las Vegas algorithm for branch-and-bound which runs in $O((n/p+h\log p \log n)h\log^2 n)$ time, with high probability. A remarkable feature of our algorithms is the use of only constant space per processor, which constitutes a significant improvement upon previous algorithms whose space requirements per processor depend on the (possibly huge) tree to be explored.

Submitted 26 March, 2014; v1 submitted 11 June, 2013; originally announced June 2013.

Comments: Extended version of the paper in the Proc. of 38th International Symposium on Mathematical Foundations of Computer Science (MFCS)

ACM Class: F.2.2

arXiv:1111.2228 [pdf, ps, other]

doi 10.1145/2304576.2304607

Space-Round Tradeoffs for MapReduce Computations

Authors: Andrea Pietracaprina, Geppino Pucci, Matteo Riondato, Francesco Silvestri, Eli Upfal

Abstract: This work explores fundamental modeling and algorithmic issues arising in the well-established MapReduce framework. First, we formally specify a computational model for MapReduce which captures the functional flavor of the paradigm by allowing for a flexible use of parallelism. Indeed, the model diverges from a traditional processor-centric view by featuring parameters which embody only global and local memory constraints, thus favoring a more data-centric view. Second, we apply the model to the fundamental computation task of matrix multiplication presenting upper and lower bounds for both dense and sparse matrix multiplication, which highlight interesting tradeoffs between space and round complexity. Finally, building on the matrix multiplication results, we derive further space-round tradeoffs on matrix inversion and matching.

Submitted 9 November, 2011; originally announced November 2011.

Journal ref: Final version in Proc. of the 26th ACM international conference on Supercomputing, pages 235-244, 2012

arXiv:1101.4609 [pdf, ps, other]

Tight Bounds on Information Dissemination in Sparse Mobile Networks

Authors: Alberto Pettarin, Andrea Pietracaprina, Geppino Pucci, Eli Upfal

Abstract: Motivated by the growing interest in mobile systems, we study the dynamics of information dissemination between agents moving independently on a plane. Formally, we consider $k$ mobile agents performing independent random walks on an $n$-node grid. At time $0$, each agent is located at a random node of the grid and one agent has a rumor. The spread of the rumor is governed by a dynamic communication graph process $\{G_t(r) \mid t \geq 0\}$, where two agents are connected by an edge in $G_t(r)$ iff their distance at time $t$ is within their transmission radius $r$. Modeling the physical reality that the speed of radio transmission is much faster than the motion of the agents, we assume that the rumor can travel throughout a connected component of $G_t$ before the graph is altered by the motion. We study the broadcast time $T_B$ of the system, which is the time it takes for all agents to know the rumor. We focus on the sparse case (below the percolation point $r_c \approx \sqrt{n/k}$) where, with high probability, no connected component in $G_t$ has more than a logarithmic number of agents and the broadcast time is dominated by the time it takes for many independent random walks to meet each other. Quite surprisingly, we show that for a system below the percolation point the broadcast time does not depend on the relation between the mobility speed and the transmission radius. In fact, we prove that $T_B = \tilde{O}(n / \sqrt{k})$ for any $0 \leq r < r_c$, even when the transmission range is significantly larger than the mobility range in one step, giving a tight characterization up to logarithmic factors. Our result complements a recent result of Peres et al. (SODA 2011) who showed that above the percolation point the broadcast time is polylogarithmic in $k$.
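
The assumption that the rumor floods a whole connected component of $G_t(r)$ before agents move can be sketched for a single time step as follows. This is only an illustration: the abstract does not say which distance is used, so Manhattan distance on grid coordinates is an assumption here, and the function name `spread_within_components` is hypothetical.

```python
from collections import deque

def spread_within_components(positions, informed, r):
    """Return the set of informed agents after one flooding phase.

    positions: list of (x, y) grid coordinates, one per agent.
    informed:  iterable of indices of agents that already know the rumor.
    r:         transmission radius (Manhattan distance is an assumption).
    """
    k = len(positions)

    def connected(a, b):
        (x1, y1), (x2, y2) = positions[a], positions[b]
        return abs(x1 - x2) + abs(y1 - y2) <= r

    reached = set(informed)
    queue = deque(reached)
    # Breadth-first search over the implicit graph G_t(r).
    while queue:
        a = queue.popleft()
        for b in range(k):
            if b not in reached and connected(a, b):
                reached.add(b)
                queue.append(b)
    return reached

# Agents 0, 1, 2 form a chain within radius 1; agent 3 is isolated.
agents = [(0, 0), (0, 1), (0, 2), (5, 5)]
now_informed = spread_within_components(agents, {0}, r=1)
```

In the sparse regime described above, such components contain few agents, so most of the broadcast time is spent waiting for random walks to bring components together rather than in the flooding phase itself.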

Submitted 1 February, 2011; v1 submitted 24 January, 2011; originally announced January 2011.

Comments: 19 pages; we rewrote Lemma 4, fixing a claim which was not fully justified in the first version of the draft

arXiv:1007.1604 [pdf, other]

Infectious Random Walks

Authors: Alberto Pettarin, Andrea Pietracaprina, Geppino Pucci, Eli Upfal

Abstract: We study the dynamics of information (or virus) dissemination by $m$ mobile agents performing independent random walks on an $n$-node grid. We formulate our results in terms of two scenarios: broadcasting and gossiping. In the broadcasting scenario, the mobile agents are initially placed uniformly at random among the grid nodes. At time 0, one agent is informed of a rumor and starts a random walk. When an informed agent meets an uninformed agent, the latter becomes informed and starts a new random walk. We study the broadcasting time of the system, that is, the time it takes for all agents to know the rumor. In the gossiping scenario, each agent is given a distinct rumor at time 0 and all agents start random walks. When two agents meet, they share all rumors they are aware of. We study the gossiping time of the system, that is, the time it takes for all agents to know all rumors. We prove that both the broadcasting and the gossiping times are $\tilde{\Theta}(n/\sqrt{m})$ w.h.p., thus achieving a tight characterization up to logarithmic factors. Previous results for the grid provided bounds which were weaker and only concerned average times. In the context of virus infection, a corollary of our results is that static and dynamically moving agents are infected at about the same speed.
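
The broadcasting scenario can be explored with a small simulation. The sketch below makes several assumptions that the abstract does not fix: two agents "meet" when they occupy the same node, a walking agent moves to a uniformly random in-grid neighbour at each step, and `broadcast_time`, `side`, `seed`, and `max_steps` are hypothetical names.

```python
import random

def broadcast_time(side, m, seed=0, max_steps=1_000_000):
    """Simulate broadcasting with m agents on a side x side grid.

    Returns the number of steps until every agent is informed, or None
    if max_steps is reached first.
    """
    rng = random.Random(seed)
    # Agents are placed uniformly at random among the grid nodes.
    pos = [(rng.randrange(side), rng.randrange(side)) for _ in range(m)]
    informed = [False] * m
    informed[0] = True  # at time 0, one agent is informed

    def spread():
        # An uninformed agent sharing a node with an informed one becomes informed.
        hot = {pos[i] for i in range(m) if informed[i]}
        for i in range(m):
            if pos[i] in hot:
                informed[i] = True

    def step(p):
        x, y = p
        nbrs = [(x + dx, y + dy)
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= x + dx < side and 0 <= y + dy < side]
        return rng.choice(nbrs)

    spread()
    t = 0
    while not all(informed):
        if t >= max_steps:
            return None
        # Only informed agents walk; the others stay put until informed.
        for i in range(m):
            if informed[i]:
                pos[i] = step(pos[i])
        t += 1
        spread()
    return t

steps = broadcast_time(side=4, m=3, seed=1)
```

Averaging `broadcast_time` over many seeds for growing `side` and `m` is one way to compare empirical behaviour with the stated bound, where $n$ corresponds to `side * side`.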

Submitted 25 January, 2011; v1 submitted 9 July, 2010; originally announced July 2010.

Comments: 21 pages, 3 figures --- The results presented in this paper have been extended in: Pettarin et al., Tight Bounds on Information Dissemination in Sparse Mobile Networks, http://arxiv.longhoe.net/abs/1101.4609

arXiv:1002.1104 [pdf, ps, other]

An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

Authors: Adam Kirsch, Michael Mitzenmacher, Andrea Pietracaprina, Geppino Pucci, Eli Upfal, Fabio Vandin

Abstract: As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s* for a dataset, such that the number of itemsets with support at least s* represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate. We present extensive experimental results to substantiate the effectiveness of our methodology.

Submitted 4 February, 2010; originally announced February 2010.

Comments: A preliminary version of this work was presented in ACM PODS 2009. 20 pages, 0 figures

ACM Class: H.2.8

arXiv:1002.0874 [pdf, ps, other]

MADMX: A Novel Strategy for Maximal Dense Motif Extraction

Authors: Roberto Grossi, Andrea Pietracaprina, Nadia Pisanti, Geppino Pucci, Eli Upfal, Fabio Vandin

Abstract: We develop, analyze and experiment with a new tool, called MADMX, which extracts frequent motifs, possibly including don't care characters, from biological sequences. We introduce density, a simple and flexible measure for bounding the number of don't cares in a motif, defined as the ratio of solid (i.e., different from don't care) characters to the total length of the motif. By extracting only maximal dense motifs, MADMX reduces the output size and improves performance, while enhancing the quality of the discoveries. The efficiency of our approach relies on a newly defined combining operation, dubbed fusion, which allows for the construction of maximal dense motifs in a bottom-up fashion, while avoiding the generation of nonmaximal ones. We provide experimental evidence of the efficiency and the quality of the motifs returned by MADMX.
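
The density measure defined in this abstract is simple enough to compute directly. In the sketch below, the choice of `.` as the don't care symbol and the function name `density` are assumptions made for illustration; the abstract does not specify a notation.

```python
def density(motif: str, dont_care: str = ".") -> float:
    """Ratio of solid (non-don't-care) characters to the motif length."""
    if not motif:
        raise ValueError("motif must be non-empty")
    solid = sum(1 for ch in motif if ch != dont_care)
    return solid / len(motif)

# "A.C." has two solid characters out of four.
example = density("A.C.")
```

A motif with no don't cares has density 1, and the value decreases as more positions are left unspecified.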

Submitted 3 February, 2010; originally announced February 2010.

Comments: A preliminary version of this work was presented in WABI 2009. 10 pages, 0 figures

Showing 1–22 of 22 results for author: Pucci, G