Search | arXiv e-print repository

An Adaptive Method for Weak Supervision with Drifting Data

Authors: Alessio Mazzetto, Reza Esfandiarpoor, Eli Upfal, Stephen H. Bach

Abstract: We introduce an adaptive method with formal quality guarantees for weak supervision in a non-stationary setting. Our goal is to infer the unknown labels of a sequence of data by using weak supervision sources that provide independent noisy signals of the correct classification for each data point. This setting includes crowdsourcing and programmatic weak supervision. We focus on the non-stationary… ▽ More We introduce an adaptive method with formal quality guarantees for weak supervision in a non-stationary setting. Our goal is to infer the unknown labels of a sequence of data by using weak supervision sources that provide independent noisy signals of the correct classification for each data point. This setting includes crowdsourcing and programmatic weak supervision. We focus on the non-stationary case, where the accuracy of the weak supervision sources can drift over time, e.g., because of changes in the underlying data distribution. Due to the drift, older data could provide misleading information to infer the label of the current data point. Previous work relied on a priori assumptions on the magnitude of the drift to decide how much data to use from the past. Comparatively, our algorithm does not require any assumptions on the drift, and it adapts based on the input. In particular, at each step, our algorithm guarantees an estimation of the current accuracies of the weak supervision sources over a window of past observations that minimizes a trade-off between the error due to the variance of the estimation and the error due to the drift. Experiments on synthetic and real-world labelers show that our approach indeed adapts to the drift. Unlike fixed-window-size strategies, it dynamically chooses a window size that allows it to consistently maintain good performance. △ Less

Submitted 2 June, 2023; originally announced June 2023.

arXiv:2305.02252 [pdf, ps, other]

An Adaptive Algorithm for Learning with Unknown Distribution Drift

Authors: Alessio Mazzetto, Eli Upfal

Abstract: We develop and analyze a general technique for learning with an unknown distribution drift. Given a sequence of independent observations from the last $T$ steps of a drifting distribution, our algorithm agnostically learns a family of functions with respect to the current distribution at time $T$. Unlike previous work, our technique does not require prior knowledge about the magnitude of the drift… ▽ More We develop and analyze a general technique for learning with an unknown distribution drift. Given a sequence of independent observations from the last $T$ steps of a drifting distribution, our algorithm agnostically learns a family of functions with respect to the current distribution at time $T$. Unlike previous work, our technique does not require prior knowledge about the magnitude of the drift. Instead, the algorithm adapts to the sample data. Without explicitly estimating the drift, the algorithm learns a family of functions with almost the same error as a learning algorithm that knows the magnitude of the drift in advance. Furthermore, since our algorithm adapts to the data, it can guarantee a better learning error than an algorithm that relies on loose bounds on the drift. We demonstrate the application of our technique in two fundamental learning scenarios: binary classification and linear regression. △ Less

Submitted 27 October, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

Comments: Updated version for Camera-ready with minor changes in text for readability, and including a new small section on linear regression

arXiv:2302.02460 [pdf, other]

Nonparametric Density Estimation under Distribution Drift

Authors: Alessio Mazzetto, Eli Upfal

Abstract: We study nonparametric density estimation in non-stationary drift settings. Given a sequence of independent samples taken from a distribution that gradually changes in time, the goal is to compute the best estimate for the current distribution. We prove tight minimax risk bounds for both discrete and continuous smooth densities, where the minimum is over all possible estimates and the maximum is o… ▽ More We study nonparametric density estimation in non-stationary drift settings. Given a sequence of independent samples taken from a distribution that gradually changes in time, the goal is to compute the best estimate for the current distribution. We prove tight minimax risk bounds for both discrete and continuous smooth densities, where the minimum is over all possible estimates and the maximum is over all possible distributions that satisfy the drift constraints. Our technique handles a broad class of drift models, and generalizes previous results on agnostic learning under drift. △ Less

Submitted 27 October, 2023; v1 submitted 5 February, 2023; originally announced February 2023.

Comments: Camera Ready version

arXiv:2205.13068 [pdf, other]

Tight Lower Bounds on Worst-Case Guarantees for Zero-Shot Learning with Attributes

Authors: Alessio Mazzetto, Cristina Menghini, Andrew Yuan, Eli Upfal, Stephen H. Bach

Abstract: We develop a rigorous mathematical analysis of zero-shot learning with attributes. In this setting, the goal is to label novel classes with no training data, only detectors for attributes and a description of how those attributes are correlated with the target classes, called the class-attribute matrix. We develop the first non-trivial lower bound on the worst-case error of the best map from attri… ▽ More We develop a rigorous mathematical analysis of zero-shot learning with attributes. In this setting, the goal is to label novel classes with no training data, only detectors for attributes and a description of how those attributes are correlated with the target classes, called the class-attribute matrix. We develop the first non-trivial lower bound on the worst-case error of the best map from attributes to classes for this setting, even with perfect attribute detectors. The lower bound characterizes the theoretical intrinsic difficulty of the zero-shot problem based on the available information -- the class-attribute matrix -- and the bound is practically computable from it. Our lower bound is tight, as we show that we can always find a randomized map from attributes to classes whose expected error is upper bounded by the value of the lower bound. We show that our analysis can be predictive of how standard zero-shot methods behave in practice, including which classes will likely be confused with others. △ Less

Submitted 28 November, 2022; v1 submitted 25 May, 2022; originally announced May 2022.

arXiv:2111.08118 [pdf, other]

Brain Functional Connectivity Estimation Utilizing Diffusion Kernels on a Structural Connectivity Graph

Authors: Nathan Tung, Jerome Sanes, Eli Upfal, Ani Eloyan

Abstract: Functional connectivity (FC) refers to the investigation of interactions between brain regions to understand integration of neural activity in several regions. FC is often estimated using functional magnetic resonance images (fMRI). There has been increasing interest in the potential of multi-modal imaging to obtain robust estimates of FC in high-dimensional settings. We develop novel algorithms a… ▽ More Functional connectivity (FC) refers to the investigation of interactions between brain regions to understand integration of neural activity in several regions. FC is often estimated using functional magnetic resonance images (fMRI). There has been increasing interest in the potential of multi-modal imaging to obtain robust estimates of FC in high-dimensional settings. We develop novel algorithms adapting graphical methods incorporating diffusion tensor imaging (DTI) to estimate FC with computational efficiency and scalability. We propose leveraging a graphical random walk on DTI to define a new measure of structural connectivity highlighting spurious connected components. Our proposed approach is based on finding appropriate subnetwork topology using permutation testing before selection of subnetwork components comprising FC. Extensive simulations demonstrate that the performance of our methods is comparable to or better than currently used approaches in estimation accuracy, with the advantage of greater speed and simpler implementation. We analyze task-based fMRI data obtained from the Human Connectome Project database using our proposed methods and reveal novel insights into brain interactions during performance of a motor task. We expect that the transparency and flexibility of our approach will prove valuable as further understanding of the structure-function relationship informs future network estimation. △ Less

Submitted 21 January, 2023; v1 submitted 15 November, 2021; originally announced November 2021.

Comments: 29 pages, 6 figures, 1 table, 2 algorithms

arXiv:2111.07372 [pdf, other]

Fast Doubly-Adaptive MCMC to Estimate the Gibbs Partition Function with Weak Mixing Time Bounds

Authors: Shahrzad Haddadan, Yue Zhuang, Cyrus Cousins, Eli Upfal

Abstract: We present a novel method for reducing the computational complexity of rigorously estimating the partition functions (normalizing constants) of Gibbs (Boltzmann) distributions, which arise ubiquitously in probabilistic graphical models. A major obstacle to practical applications of Gibbs distributions is the need to estimate their partition functions. The state of the art in addressing this proble… ▽ More We present a novel method for reducing the computational complexity of rigorously estimating the partition functions (normalizing constants) of Gibbs (Boltzmann) distributions, which arise ubiquitously in probabilistic graphical models. A major obstacle to practical applications of Gibbs distributions is the need to estimate their partition functions. The state of the art in addressing this problem is multi-stage algorithms, which consist of a cooling schedule, and a mean estimator in each step of the schedule. While the cooling schedule in these algorithms is adaptive, the mean estimation computations use MCMC as a black-box to draw approximate samples. We develop a doubly adaptive approach, combining the adaptive cooling schedule with an adaptive MCMC mean estimator, whose number of Markov chain steps adapts dynamically to the underlying chain. Through rigorous theoretical analysis, we prove that our method outperforms the state of the art algorithms in several factors: (1) The computational complexity of our method is smaller; (2) Our method is less sensitive to loose bounds on mixing times, an inherent component in these algorithms; and (3) The improvement obtained by our method is particularly significant in the most challenging regime of high-precision estimation. We demonstrate the advantage of our method in experiments run on classic factor graphs, such as voting models and Ising models. △ Less

Submitted 14 November, 2021; originally announced November 2021.

Comments: A short version of this paper will appear inthe 35th Conference on NeuralInformation Processing Systems, NeurIPS 2021

arXiv:2101.04751 [pdf, other]

doi 10.1145/3437963.3441825

RePBubLik: Reducing the Polarized Bubble Radius with Link Insertions

Authors: Shahrzad Haddadan, Cristina Menghini, Matteo Riondato, Eli Upfal

Abstract: The topology of the hyperlink graph among pages expressing different opinions may influence the exposure of readers to diverse content. Structural bias may trap a reader in a polarized bubble with no access to other opinions. We model readers' behavior as random walks. A node is in a polarized bubble if the expected length of a random walk from it to a page of different opinion is large. The struc… ▽ More The topology of the hyperlink graph among pages expressing different opinions may influence the exposure of readers to diverse content. Structural bias may trap a reader in a polarized bubble with no access to other opinions. We model readers' behavior as random walks. A node is in a polarized bubble if the expected length of a random walk from it to a page of different opinion is large. The structural bias of a graph is the sum of the radii of highly-polarized bubbles. We study the problem of decreasing the structural bias through edge insertions. Healing all nodes with high polarized bubble radius is hard to approximate within a logarithmic factor, so we focus on finding the best $k$ edges to insert to maximally reduce the structural bias. We present RePBubLik, an algorithm that leverages a variant of the random walk closeness centrality to select the edges to insert. RePBubLik obtains, under mild conditions, a constant-factor approximation. It reduces the structural bias faster than existing edge-recommendation methods, including some designed to reduce the polarization of a graph. △ Less

Submitted 12 January, 2021; originally announced January 2021.

arXiv:2011.11129 [pdf, ps, other]

Making mean-estimation more efficient using an MCMC trace variance approach: DynaMITE

Authors: Cyrus Cousins, Shahrzad Haddadan, Eli Upfal

Abstract: We introduce a novel statistical measure for MCMC-mean estimation, the inter-trace variance ${\rm trv}^{(τ_{rel})}({\cal M},f)$, which depends on a Markov chain ${\cal M}$ and a function $f:S\to [a,b]$. The inter-trace variance can be efficiently estimated from observed data and leads to a more efficient MCMC-mean estimator. Prior MCMC mean-estimators receive, as input, upper-bounds on $τ_{mix}$ o… ▽ More We introduce a novel statistical measure for MCMC-mean estimation, the inter-trace variance ${\rm trv}^{(τ_{rel})}({\cal M},f)$, which depends on a Markov chain ${\cal M}$ and a function $f:S\to [a,b]$. The inter-trace variance can be efficiently estimated from observed data and leads to a more efficient MCMC-mean estimator. Prior MCMC mean-estimators receive, as input, upper-bounds on $τ_{mix}$ or $τ_{rel}$, and often also the stationary variance, and their performance is highly dependent to the sharpness of these bounds. In contrast, we introduce DynaMITE, which dynamically adjusts the sample size, it is less sensitive to the looseness of input upper-bounds on $τ_{rel}$, and requires no bound on $v_π$. Receiving only an upper-bound ${\cal T}_{rel}$ on $τ_{rel}$, DynaMITE estimates the mean of $f$ in $\tilde{\cal{O}}\bigl(\smash{\frac{{\cal T}_{rel} R}{\varepsilon}}+\frac{τ_{rel}\cdot {\rm trv}^{(τ{rel})}}{\varepsilon^{2}}\bigr)$ steps, without a priori bounds on the stationary variance $v_π$ or the inter-trace variance ${\rm trv}^{(τrel)}$. Thus we depend minimally on the tightness of ${\cal T}_{mix}$, as the complexity is dominated by $τ_{rel}\rm{trv}^{(τ{rel})}$ as $\varepsilon \to 0$. Note that bounding $τ_{\rm rel}$ is known to be prohibitively difficult, however, DynaMITE is able to reduce its principal dependence on ${\cal T}_{rel}$ to $τ_{rel}$, simply by exploiting properties of the inter-trace variance. To compare our method to known variance-aware bounds, we show ${\rm trv}^{(τ{rel})}({\cal M},f) \leq v_π$. Furthermore, we show when $f$'s image is distributed (semi)symmetrically on ${\cal M}$'s traces, we have ${\rm trv}^{({τ{rel}})}({\cal M},f)=o(v_π(f))$, thus DynaMITE outperforms prior methods in these cases. △ Less

Submitted 4 August, 2021; v1 submitted 22 November, 2020; originally announced November 2020.

arXiv:2007.08197 [pdf, other]

doi 10.1109/BigData52589.2021.9671943

How Inclusive Are Wikipedia's Hyperlinks in Articles Covering Polarizing Topics?

Authors: Cristina Menghini, Aris Anagnostopoulos, Eli Upfal

Abstract: Wikipedia relies on an extensive review process to verify that the content of each individual page is unbiased and presents a neutral point of view. Less attention has been paid to possible biases in the hyperlink structure of Wikipedia, which has a significant influence on the user's exploration process when visiting more than one page. The evaluation of hyperlink bias is challenging because it d… ▽ More Wikipedia relies on an extensive review process to verify that the content of each individual page is unbiased and presents a neutral point of view. Less attention has been paid to possible biases in the hyperlink structure of Wikipedia, which has a significant influence on the user's exploration process when visiting more than one page. The evaluation of hyperlink bias is challenging because it depends on the global view rather than the text of individual pages. In this paper, we focus on the influence of the interconnect topology between articles describing complementary aspects of polarizing topics. We introduce a novel measure of exposure to diverse information to quantify users' exposure to different aspects of a topic throughout an entire surfing session, rather than just one click ahead. We apply this measure to six polarizing topics (e.g., gun control and gun right), and we identify cases in which the network topology significantly limits the exposure of users to diverse information on the topic, encouraging users to remain in a knowledge bubble. Our findings demonstrate the importance of evaluating Wikipedia's network structure in addition to the extensive review of individual articles. △ Less

Submitted 31 March, 2022; v1 submitted 16 July, 2020; originally announced July 2020.

arXiv:1910.03493 [pdf, other]

A Rademacher Complexity Based Method fo rControlling Power and Confidence Level in Adaptive Statistical Analysis

Authors: Lorenzo De Stefani, Eli Upfal

Abstract: While standard statistical inference techniques and machine learning generalization bounds assume that tests are run on data selected independently of the hypotheses, practical data analysis and machine learning are usually iterative and adaptive processes where the same holdout data is often used for testing a sequence of hypotheses (or models), which may each depend on the outcome of the previou… ▽ More While standard statistical inference techniques and machine learning generalization bounds assume that tests are run on data selected independently of the hypotheses, practical data analysis and machine learning are usually iterative and adaptive processes where the same holdout data is often used for testing a sequence of hypotheses (or models), which may each depend on the outcome of the previous tests on the same data. In this work, we present RadaBound a rigorous, efficient and practical procedure for controlling the generalization error when using a holdout sample for multiple adaptive testing. Our solution is based on a new application of the Rademacher Complexity generalization bounds, adapted to dependent tests. We demonstrate the statistical power and practicality of our method through extensive simulations and comparisons to alternative approaches. △ Less

Submitted 4 October, 2019; originally announced October 2019.

arXiv:1905.13379 [pdf, other]

Learning Equilibria of Simulation-Based Games

Authors: Enrique Areyan Viqueira, Cyrus Cousins, Eli Upfal, Amy Greenwald

Abstract: We tackle a fundamental problem in empirical game-theoretic analysis (EGTA), that of learning equilibria of simulation-based games. Such games cannot be described in analytical form; instead, a black-box simulator can be queried to obtain noisy samples of utilities. Our approach to EGTA is in the spirit of probably approximately correct learning. We design algorithms that learn so-called empirical… ▽ More We tackle a fundamental problem in empirical game-theoretic analysis (EGTA), that of learning equilibria of simulation-based games. Such games cannot be described in analytical form; instead, a black-box simulator can be queried to obtain noisy samples of utilities. Our approach to EGTA is in the spirit of probably approximately correct learning. We design algorithms that learn so-called empirical games, which uniformly approximate the utilities of simulation-based games with finite-sample guarantees. These algorithms can be instantiated with various concentration inequalities. Building on earlier work, we first apply Hoeffding's bound, but as the size of the game grows, this bound eventually becomes statistically intractable; hence, we also use the Rademacher complexity. Our main results state: with high probability, all equilibria of the simulation-based game are approximate equilibria in the empirical game (perfect recall); and conversely, all approximate equilibria in the empirical game are approximate equilibria in the simulation-based game (approximately perfect precision). We evaluate our algorithms on several synthetic games, showing that they make frugal use of data, produce accurate estimates more often than the theory predicts, and are robust to different forms of noise. △ Less

Submitted 30 May, 2019; originally announced May 2019.

arXiv:1902.06306 [pdf, ps, other]

On the Complexity of Anonymous Communication Through Public Networks

Authors: Megumi Ando, Anna Lysyanskaya, Eli Upfal

Abstract: Onion routing is the most widely used approach to anonymous communication online. The idea is that Alice wraps her message to Bob in layers of encryption to form an "onion," and routes it through a series of intermediaries. Each intermediary's job is to decrypt ("peel") the onion it receives to obtain instructions for where to send it next, and what to send. The intuition is that, by the time it g… ▽ More Onion routing is the most widely used approach to anonymous communication online. The idea is that Alice wraps her message to Bob in layers of encryption to form an "onion," and routes it through a series of intermediaries. Each intermediary's job is to decrypt ("peel") the onion it receives to obtain instructions for where to send it next, and what to send. The intuition is that, by the time it gets to Bob, the onion will have mixed with so many other onions, that its origin will be hard to trace even for an adversary that observes the entire network and controls a fraction of the participants, possibly including Bob. In spite of its widespread use in practice, until now no onion routing protocol was known that simultaneously achieved, in the presence of an active adversary that observes all network traffic and controls a constant fraction of the participants, (a) fault-tolerance, where even if a few of the onions are dropped, the protocol still delivers the rest; (b) reasonable communication and computational complexity as a function of the security parameter and the number of participants; and (c) anonymity. In this paper, we give the first onion routing protocol that meets these goals: our protocol (a) tolerates a polylogarithmic (in the security parameter) number of dropped onions and still delivers the rest; (b) requires a polylogarithmic number of rounds and a polylogarithmic number of onions sent per participant per round; and (c) achieves anonymity. We also show that to achieve anonymity in a fault-tolerant fashion via onion routing, this number of onions and rounds is necessary. Of independent interest, our analysis introduces two new security properties of onion routing -- mixing and equalizing -- and we show that together they imply anonymity. △ Less

Submitted 28 July, 2021; v1 submitted 17 February, 2019; originally announced February 2019.

arXiv:1812.07568 [pdf, other]

Uniform Convergence Bounds for Codec Selection

Authors: Clayton Sanford, Cyrus Cousins, Eli Upfal

Abstract: We frame the problem of selecting an optimal audio encoding scheme as a supervised learning task. Through uniform convergence theory, we guarantee approximately optimal codec selection while controlling for selection bias. We present rigorous statistical guarantees for the codec selection problem that hold for arbitrary distributions over audio sequences and for arbitrary quality metrics. Our tech… ▽ More We frame the problem of selecting an optimal audio encoding scheme as a supervised learning task. Through uniform convergence theory, we guarantee approximately optimal codec selection while controlling for selection bias. We present rigorous statistical guarantees for the codec selection problem that hold for arbitrary distributions over audio sequences and for arbitrary quality metrics. Our techniques can thus balance sound quality and compression ratio, and use audio samples from the distribution to select a codec that performs well on that particular type of data. The applications of our technique are immense, as it can be used to optimize for quality and bandwidth usage of streaming and other digital media, while significantly outperforming approaches that apply a fixed codec to all data sources. △ Less

Submitted 17 December, 2018; originally announced December 2018.

arXiv:1811.00602 [pdf, other]

VizRec: A framework for secure data exploration via visual representation

Authors: Lorenzo De Stefani, Leonhard F. Spiegelberg, Tim Kraska, Eli Upfal

Abstract: Visual representations of data (visualizations) are tools of great importance and widespread use in data analytics as they provide users visual insight to patterns in the observed data in a simple and effective way. However, since visualizations tools are applied to sample data, there is a a risk of visualizing random fluctuations in the sample rather than a true pattern in the data. This problem… ▽ More Visual representations of data (visualizations) are tools of great importance and widespread use in data analytics as they provide users visual insight to patterns in the observed data in a simple and effective way. However, since visualizations tools are applied to sample data, there is a a risk of visualizing random fluctuations in the sample rather than a true pattern in the data. This problem is even more significant when visualization is used to identify interesting patterns among many possible possibilities, or to identify an interesting deviation in a pair of observations among many possible pairs, as commonly done in visual recommendation systems. We present VizRec, a framework for improving the performance of visual recommendation systems by quantifying the statistical significance of recommended visualizations. The proposed methodology allows to control the probability of misleading visual recommendations using both classical statistical testing procedures and a novel application of the Vapnik Chervonenkis (VC) dimension method which is a fundamental concept in statistical learning theory. △ Less

Submitted 1 November, 2018; originally announced November 2018.

arXiv:1808.08294 [pdf, other]

Unknown Examples & Machine Learning Model Generalization

Authors: Yeounoh Chung, Peter J. Haas, Eli Upfal, Tim Kraska

Abstract: Over the past decades, researchers and ML practitioners have come up with better and better ways to build, understand and improve the quality of ML models, but mostly under the key assumption that the training data is distributed identically to the testing data. In many real-world applications, however, some potential training examples are unknown to the modeler, due to sample selection bias or, m… ▽ More Over the past decades, researchers and ML practitioners have come up with better and better ways to build, understand and improve the quality of ML models, but mostly under the key assumption that the training data is distributed identically to the testing data. In many real-world applications, however, some potential training examples are unknown to the modeler, due to sample selection bias or, more generally, covariate shift, i.e., a distribution shift between the training and deployment stage. The resulting discrepancy between training and testing distributions leads to poor generalization performance of the ML model and hence biased predictions. We provide novel algorithms that estimate the number and properties of these unknown training examples---unknown unknowns. This information can then be used to correct the training set, prior to seeing any test data. The key idea is to combine species-estimation techniques with data-driven methods for estimating the feature values for the unknown unknowns. Experiments on a variety of ML models and datasets indicate that taking the unknown examples into account can yield a more robust ML model that generalizes better. △ Less

Submitted 11 October, 2019; v1 submitted 24 August, 2018; originally announced August 2018.

arXiv:1807.02876 [pdf, other]

Machine Learning in High Energy Physics Community White Paper

Authors: Kim Albertsson, Piero Altoe, Dustin Anderson, John Anderson, Michael Andrews, Juan Pedro Araque Espinosa, Adam Aurisano, Laurent Basara, Adrian Bevan, Wahid Bhimji, Daniele Bonacorsi, Bjorn Burkle, Paolo Calafiura, Mario Campanelli, Louis Capps, Federico Carminati, Stefano Carrazza, Yi-fan Chen, Taylor Childers, Yann Coadou, Elias Coniavitis, Kyle Cranmer, Claire David, Douglas Davis, Andrea De Simone , et al. (103 additional authors not shown)

Abstract: Machine learning has been applied to several problems in particle physics research, beginning with applications to high-level physics analysis in the 1990s and 2000s, followed by an explosion of applications in particle and event identification and reconstruction in the 2010s. In this document we discuss promising future research and development areas for machine learning in particle physics. We d… ▽ More Machine learning has been applied to several problems in particle physics research, beginning with applications to high-level physics analysis in the 1990s and 2000s, followed by an explosion of applications in particle and event identification and reconstruction in the 2010s. In this document we discuss promising future research and development areas for machine learning in particle physics. We detail a roadmap for their implementation, software and hardware resource requirements, collaborative initiatives with the data science community, academia and industry, and training the particle physics community in data science. The main objective of the document is to connect and motivate these areas of research and development with the physics drivers of the High-Luminosity Large Hadron Collider and future neutrino experiments and identify the resource needs for their implementation. Additionally we identify areas where collaboration with external communities will be of great benefit. △ Less

Submitted 16 May, 2019; v1 submitted 8 July, 2018; originally announced July 2018.

Comments: Editors: Sergei Gleyzer, Paul Seyfert and Steven Schramm

arXiv:1710.02108 [pdf, ps, other]

Tiered Sampling: An Efficient Method for Approximate Counting Sparse Motifs in Massive Graph Streams

Authors: Lorenzo De Stefani, Erisa Terolli, Eli Upfal

Abstract: We introduce Tiered Sampling, a novel technique for approximate counting sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass on the data and uses a memory of fixed size $M$, which can be magnitudes smaller than the number of edges. Our methods addresses the challenging task of counting sparse motifs - sub-graph patterns that have low pr… ▽ More We introduce Tiered Sampling, a novel technique for approximate counting sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass on the data and uses a memory of fixed size $M$, which can be magnitudes smaller than the number of edges. Our methods addresses the challenging task of counting sparse motifs - sub-graph patterns that have low probability to appear in a sample of $M$ edges in the graph, which is the maximum amount of data available to the algorithms in each step. To obtain an unbiased and low variance estimate of the count we partition the available memory to tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, other layers are reservoir samples of sub-structures of the desired motif. By storing more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate. We demonstrate the advantage of our method in the specific applications of counting sparse 4 and 5-cliques in massive graphs. We present a complete analytical analysis and extensive experimental results using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations for the number of 4 and 5-cliques for large graphs using a very limited amount of memory, significantly outperforming the single edge sample approach for counting sparse motifs in large scale graphs. △ Less

Submitted 5 October, 2017; originally announced October 2017.

Comments: 28 pages

MSC Class: 68W20 ACM Class: G.1.2; G.2.2

arXiv:1706.05367 [pdf, ps, other]

Practical and Provably Secure Onion Routing

Authors: Megumi Ando, Anna Lysyanskaya, Eli Upfal

Abstract: In an onion routing protocol, messages travel through several intermediaries before arriving at their destinations, they are wrapped in layers of encryption (hence they are called "onions"). The goal is to make it hard to establish who sent the message. It is a practical and widespread tool for creating anonymous channels. For the standard adversary models -- network, passive, and active -- we p… ▽ More In an onion routing protocol, messages travel through several intermediaries before arriving at their destinations, they are wrapped in layers of encryption (hence they are called "onions"). The goal is to make it hard to establish who sent the message. It is a practical and widespread tool for creating anonymous channels. For the standard adversary models -- network, passive, and active -- we present practical and provably secure onion routing protocols. Akin to Tor, in our protocols each party independently chooses the routing paths for his onions. For security parameter $λ$, our differentially private solution for the active adversary takes $O(\log^2λ)$ rounds and requires every participant to transmit $O(\log^{4} λ)$ onions in every round. △ Less

Submitted 29 July, 2021; v1 submitted 16 June, 2017; originally announced June 2017.

arXiv:1612.01040 [pdf, other]

Controlling False Discoveries During Interactive Data Exploration

Authors: Zheguang Zhao, Lorenzo De Stefani, Emanuel Zgraggen, Carsten Binnig, Eli Upfal, Tim Kraska

Abstract: Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. The crux is that these tools implicitly allow the user to test a large body of different hypotheses with just a few clicks thus incurring in the issue commonly known in statistics as the multiple hypothesis testing error. In this paper, we propose solutions to integrate multiple hypot… ▽ More Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. The crux is that these tools implicitly allow the user to test a large body of different hypotheses with just a few clicks thus incurring in the issue commonly known in statistics as the multiple hypothesis testing error. In this paper, we propose solutions to integrate multiple hypothesis testing control into interactive data exploration tools. A key insight is that existing methods for controlling the false discovery rate (such as FDR) are not directly applicable for interactive data exploration. We therefore discuss a set of new control procedures that are better suited and integrated them in our system called Aware. By means of extensive experiments using both real-world and synthetic data sets we demonstrate how Aware can help experts and novice users alike to efficiently control false discoveries. △ Less

Submitted 3 December, 2016; originally announced December 2016.

arXiv:1609.00790 [pdf, other]

Scalable Betweenness Centrality Maximization via Sampling

Authors: Ahmad Mahmoody, Charalampos E. Tsourakakis, Eli Upfal

Abstract: Betweenness centrality is a fundamental centrality measure in social network analysis. Given a large-scale network, how can we find the most central nodes? This question is of key importance to numerous important applications that rely on betweenness centrality, including community detection and understanding graph vulnerability. Despite the large amount of work on designing scalable approximation… ▽ More Betweenness centrality is a fundamental centrality measure in social network analysis. Given a large-scale network, how can we find the most central nodes? This question is of key importance to numerous important applications that rely on betweenness centrality, including community detection and understanding graph vulnerability. Despite the large amount of work on designing scalable approximation algorithms for betweenness centrality, estimating it on large-scale networks remains a computational challenge. In this paper, we study the Betweenness Centrality Maximization problem: given a graph $G=(V,E)$ and a positive integer $k$, find a set $S^* \subseteq V$ that maximizes betweenness centrality subject to the cardinality constraint $|S^*| \leq k$. We present an efficient randomized algorithm that provides a $(1-1/e-ε)$-approximation with high probability, where $ε>0$. Our results improve the current state-of-the-art result by Yoshida~\cite{yoshida2014almost}. Furthermore, we provide theoretical evidence for the validity of a crucial assumption in the literature of betweenness centrality estimation, namely that in real-world networks $O(|V|^2)$ shortest paths pass through the top-$k$ central nodes, where $k$ is a constant. On the experimental side, we perform an extensive experimental analysis of our method on real-world networks, demonstrate its accuracy and scalability, and study different properties of central nodes. Finally, we provide three graph mining applications of our method. △ Less

Submitted 3 September, 2016; originally announced September 2016.

Comments: Accepted in KDD 2016

arXiv:1605.05590 [pdf, other]

MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension

Authors: Matteo Ceccarello, Andrea Pietracaprina, Geppino Pucci, Eli Upfal

Abstract: Given a dataset of points in a metric space and an integer $k$, a diversity maximization problem requires determining a subset of $k$ points maximizing some diversity objective measure, e.g., the minimum or the average distance between two points in the subset. Diversity maximization is computationally hard, hence only approximate solutions can be hoped for. Although its applications are mainly in… ▽ More Given a dataset of points in a metric space and an integer $k$, a diversity maximization problem requires determining a subset of $k$ points maximizing some diversity objective measure, e.g., the minimum or the average distance between two points in the subset. Diversity maximization is computationally hard, hence only approximate solutions can be hoped for. Although its applications are mainly in massive data analysis, most of the past research on diversity maximization focused on the sequential setting. In this work we present space and pass/round-efficient diversity maximization algorithms for the Streaming and MapReduce models and analyze their approximation guarantees for the relevant class of metric spaces of bounded doubling dimension. Like other approaches in the literature, our algorithms rely on the determination of high-quality core-sets, i.e., (much) smaller subsets of the input which contain good approximations to the optimal solution for the whole input. For a variety of diversity objective functions, our algorithms attain an $(α+ε)$-approximation ratio, for any constant $ε>0$, where $α$ is the best approximation ratio achieved by a polynomial-time, linear-space sequential algorithm for the same diversity objective. This improves substantially over the approximation ratios attainable in Streaming and MapReduce by state-of-the-art algorithms for general metric spaces. We provide extensive experimental evidence of the effectiveness of our algorithms on both real world and synthetic datasets, scaling up to over a billion points. △ Less

Submitted 23 January, 2017; v1 submitted 18 May, 2016; originally announced May 2016.

Comments: Extended version of http://www.vldb.org/pvldb/vol10/p469-ceccarello.pdf, PVLDB Volume 10, No. 5, January 2017

arXiv:1602.08298 [pdf, ps, other]

doi 10.1137/1.9781611974331.ch48

Balanced Allocation: Patience is not a Virtue

Authors: John Augustine, William K. Moses Jr., Amanda Redlich, Eli Upfal

Abstract: Load balancing is a well-studied problem, with balls-in-bins being the primary framework. The greedy algorithm $\mathsf{Greedy}[d]$ of Azar et al. places each ball by probing $d > 1$ random bins and placing the ball in the least loaded of them. With high probability, the maximum load under $\mathsf{Greedy}[d]$ is exponentially lower than the result when balls are placed uniformly randomly. Vöcking… ▽ More Load balancing is a well-studied problem, with balls-in-bins being the primary framework. The greedy algorithm $\mathsf{Greedy}[d]$ of Azar et al. places each ball by probing $d > 1$ random bins and placing the ball in the least loaded of them. With high probability, the maximum load under $\mathsf{Greedy}[d]$ is exponentially lower than the result when balls are placed uniformly randomly. Vöcking showed that a slightly asymmetric variant, $\mathsf{Left}[d]$, provides a further significant improvement. However, this improvement comes at an additional computational cost of imposing structure on the bins. Here, we present a fully decentralized and easy-to-implement algorithm called $\mathsf{FirstDiff}[d]$ that combines the simplicity of $\mathsf{Greedy}[d]$ and the improved balance of $\mathsf{Left}[d]$. The key idea in $\mathsf{FirstDiff}[d]$ is to probe until a different bin size from the first observation is located, then place the ball. Although the number of probes could be quite large for some of the balls, we show that $\mathsf{FirstDiff}[d]$ requires only at most $d$ probes on average per ball (in both the standard and the heavily-loaded settings). Thus the number of probes is no greater than either that of $\mathsf{Greedy}[d]$ or $\mathsf{Left}[d]$. More importantly, we show that $\mathsf{FirstDiff}[d]$ closely matches the improved maximum load ensured by $\mathsf{Left}[d]$ in both the standard and heavily-loaded settings. We further provide a tight lower bound on the maximum load up to $O(\log \log \log n)$ terms. We additionally give experimental data that $\mathsf{FirstDiff}[d]$ is indeed as good as $\mathsf{Left}[d]$, if not better, in practice. △ Less

Submitted 22 January, 2018; v1 submitted 26 February, 2016; originally announced February 2016.

Comments: 26 pages, preliminary version accepted at SODA 2016

ACM Class: F.2.2; G.2.0; G.3

Journal ref: In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2016), 655-671

arXiv:1602.07424 [pdf, other]

TRIÈST: Counting Local and Global Triangles in Fully-dynamic Streams with Fixed Memory Size

Authors: Lorenzo De Stefani, Alessandro Epasto, Matteo Riondato, Eli Upfal

Abstract: We present TRIÈST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions. Our algorithms use reservoir sampling and its variants to exploit the user-specified memory space at all… ▽ More We present TRIÈST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions. Our algorithms use reservoir sampling and its variants to exploit the user-specified memory space at all times. This is in contrast with previous approaches which use hard-to-choose parameters (e.g., a fixed sampling probability) and offer no guarantees on the amount of memory they will use. We show a full analysis of the variance of the estimations and novel concentration bounds for these quantities. Our experimental results on very large graphs show that TRIÈST outperforms state-of-the-art approaches in accuracy and exhibits a small update time. △ Less

Submitted 28 June, 2016; v1 submitted 24 February, 2016; originally announced February 2016.

Comments: 49 pages, 7 figures, extended version of the paper appeared at ACM KDD'16

ACM Class: G.2.2; H.2.8

arXiv:1602.05866 [pdf, other]

ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with Rademacher Averages

Authors: Matteo Riondato, Eli Upfal

Abstract: We present ABRA, a suite of algorithms that compute and maintain probabilistically-guaranteed, high-quality, approximations of the betweenness centrality of all nodes (or edges) on both static and fully dynamic graphs. Our algorithms rely on random sampling and their analysis leverages on Rademacher averages and pseudodimension, fundamental concepts from statistical learning theory. To our knowled… ▽ More We present ABRA, a suite of algorithms that compute and maintain probabilistically-guaranteed, high-quality, approximations of the betweenness centrality of all nodes (or edges) on both static and fully dynamic graphs. Our algorithms rely on random sampling and their analysis leverages on Rademacher averages and pseudodimension, fundamental concepts from statistical learning theory. To our knowledge, this is the first application of these concepts to the field of graph analysis. The results of our experimental evaluation show that our approach is much faster than exact methods, and vastly outperforms, in both speed and number of samples, current state-of-the-art algorithms with the same quality guarantees. △ Less

Submitted 18 February, 2016; originally announced February 2016.

ACM Class: G.2.2; H.2.8

arXiv:1509.02487 [pdf, ps, other]

Optimizing Static and Adaptive Probing Schedules for Rapid Event Detection

Authors: Ahmad Mahmoody, Evgenios M. Kornaropoulos, Eli Upfal

Abstract: We formulate and study a fundamental search and detection problem, Schedule Optimization, motivated by a variety of real-world applications, ranging from monitoring content changes on the web, social networks, and user activities to detecting failure on large systems with many individual machines. We consider a large system consists of many nodes, where each node has its own rate of generating n… ▽ More We formulate and study a fundamental search and detection problem, Schedule Optimization, motivated by a variety of real-world applications, ranging from monitoring content changes on the web, social networks, and user activities to detecting failure on large systems with many individual machines. We consider a large system consists of many nodes, where each node has its own rate of generating new events, or items. A monitoring application can probe a small number of nodes at each step, and our goal is to compute a probing schedule that minimizes the expected number of undiscovered items at the system, or equivalently, minimizes the expected time to discover a new item in the system. We study the Schedule Optimization problem both for deterministic and randomized memoryless algorithms. We provide lower bounds on the cost of an optimal schedule and construct close to optimal schedules with rigorous mathematical guarantees. Finally, we present an adaptive algorithm that starts with no prior information on the system and converges to the optimal memoryless algorithms by adapting to observed data. △ Less

Submitted 9 September, 2015; v1 submitted 8 September, 2015; originally announced September 2015.

arXiv:1506.03265 [pdf, other]

A Practical Parallel Algorithm for Diameter Approximation of Massive Weighted Graphs

Authors: Matteo Ceccarello, Andrea Pietracaprina, Geppino Pucci, Eli Upfal

Abstract: We present a space and time efficient practical parallel algorithm for approximating the diameter of massive weighted undirected graphs on distributed platforms supporting a MapReduce-like abstraction. The core of the algorithm is a weighted graph decomposition strategy generating disjoint clusters of bounded weighted radius. Theoretically, our algorithm uses linear space and yields a polylogarith… ▽ More We present a space and time efficient practical parallel algorithm for approximating the diameter of massive weighted undirected graphs on distributed platforms supporting a MapReduce-like abstraction. The core of the algorithm is a weighted graph decomposition strategy generating disjoint clusters of bounded weighted radius. Theoretically, our algorithm uses linear space and yields a polylogarithmic approximation guarantee; moreover, for important practical classes of graphs, it runs in a number of rounds asymptotically smaller than those required by the natural approximation provided by the state-of-the-art $Δ$-step** SSSP algorithm, which is its only practical linear-space competitor in the aforementioned computational scenario. We complement our theoretical findings with an extensive experimental analysis on large benchmark graphs, which demonstrates that our algorithm attains substantial improvements on a number of key performance indicators with respect to the aforementioned competitor, while featuring a similar approximation ratio (a small constant less than 1.4, as opposed to the polylogarithmic theoretical bound). △ Less

Submitted 9 November, 2015; v1 submitted 10 June, 2015; originally announced June 2015.

arXiv:1504.03275 [pdf, other]

Wiggins: Detecting Valuable Information in Dynamic Networks Using Limited Resources

Authors: Ahmad Mahmoody, Matteo Riondato, Eli Upfal

Abstract: Detecting new information and events in a dynamic network by probing individual nodes has many practical applications: discovering new webpages, analyzing influence properties in network, and detecting failure propagation in electronic circuits or infections in public drinkable water systems. In practice, it is infeasible for anyone but the owner of the network (if existent) to monitor all nodes a… ▽ More Detecting new information and events in a dynamic network by probing individual nodes has many practical applications: discovering new webpages, analyzing influence properties in network, and detecting failure propagation in electronic circuits or infections in public drinkable water systems. In practice, it is infeasible for anyone but the owner of the network (if existent) to monitor all nodes at all times. In this work we study the constrained setting when the observer can only probe a small set of nodes at each time step to check whether new pieces of information (items) have reached those nodes. We formally define the problem through an infinite time generating process that places new items in subsets of nodes according to an unknown probability distribution. Items have an exponentially decaying novelty, modeling their decreasing value. The observer uses a probing schedule (i.e., a probability distribution over the set of nodes) to choose, at each time step, a small set of nodes to check for new items. The goal is to compute a schedule that minimizes the average novelty of undetected items. We present an algorithm, WIGGINS, to compute the optimal schedule through convex optimization, and then show how it can be adapted when the parameters of the problem must be learned or change over time. We also present a scalable variant of WIGGINS for the MapReduce framework. The results of our experimental evaluation on real social networks demonstrate the practicality of our approach. △ Less

Submitted 29 July, 2015; v1 submitted 13 April, 2015; originally announced April 2015.

arXiv:1407.3144 [pdf, other]

Space and Time Efficient Parallel Graph Decomposition, Clustering, and Diameter Approximation

Authors: Matteo Ceccarello, Andrea Pietracaprina, Geppino Pucci, Eli Upfal

Abstract: We develop a novel parallel decomposition strategy for unweighted, undirected graphs, based on growing disjoint connected clusters from batches of centers progressively selected from yet uncovered nodes. With respect to similar previous decompositions, our strategy exercises a tighter control on both the number of clusters and their maximum radius. We present two important applications of our pa… ▽ More We develop a novel parallel decomposition strategy for unweighted, undirected graphs, based on growing disjoint connected clusters from batches of centers progressively selected from yet uncovered nodes. With respect to similar previous decompositions, our strategy exercises a tighter control on both the number of clusters and their maximum radius. We present two important applications of our parallel graph decomposition: (1) $k$-center clustering approximation; and (2) diameter approximation. In both cases, we obtain algorithms which feature a polylogarithmic approximation factor and are amenable to a distributed implementation that is geared for massive (long-diameter) graphs. The total space needed for the computation is linear in the problem size, and the parallel depth is substantially sublinear in the diameter for graphs with low doubling dimension. To the best of our knowledge, ours are the first parallel approximations for these problems which achieve sub-diameter parallel time, for a relevant class of graphs, using only linear space. Besides the theoretical guarantees, our algorithms allow for a very simple implementation on clustered architectures: we report on extensive experiments which demonstrate their effectiveness and efficiency on large graphs as compared to alternative known approaches. △ Less

Submitted 6 February, 2015; v1 submitted 11 July, 2014; originally announced July 2014.

Comments: 14 pages

arXiv:1402.5524 [pdf, other]

The Melbourne Shuffle: Improving Oblivious Storage in the Cloud

Authors: Olga Ohrimenko, Michael T. Goodrich, Roberto Tamassia, Eli Upfal

Abstract: We present a simple, efficient, and secure data-oblivious randomized shuffle algorithm. This is the first secure data-oblivious shuffle that is not based on sorting. Our method can be used to improve previous oblivious storage solutions for network-based outsourcing of data. We present a simple, efficient, and secure data-oblivious randomized shuffle algorithm. This is the first secure data-oblivious shuffle that is not based on sorting. Our method can be used to improve previous oblivious storage solutions for network-based outsourcing of data. △ Less

Submitted 22 February, 2014; originally announced February 2014.

arXiv:1312.1277 [pdf, ps, other]

Bandits and Experts in Metric Spaces

Authors: Robert Kleinberg, Aleksandrs Slivkins, Eli Upfal

Abstract: In a multi-armed bandit problem, an online algorithm chooses from a set of strategies in a sequence of trials so as to maximize the total payoff of the chosen strategies. While the performance of bandit algorithms with a small finite strategy set is quite well understood, bandit problems with large strategy sets are still a topic of very active investigation, motivated by practical applications su… ▽ More In a multi-armed bandit problem, an online algorithm chooses from a set of strategies in a sequence of trials so as to maximize the total payoff of the chosen strategies. While the performance of bandit algorithms with a small finite strategy set is quite well understood, bandit problems with large strategy sets are still a topic of very active investigation, motivated by practical applications such as online auctions and web advertisement. The goal of such research is to identify broad and natural classes of strategy sets and payoff functions which enable the design of efficient solutions. In this work we study a very general setting for the multi-armed bandit problem in which the strategies form a metric space, and the payoff function satisfies a Lipschitz condition with respect to the metric. We refer to this problem as the "Lipschitz MAB problem". We present a solution for the multi-armed bandit problem in this setting. That is, for every metric space we define an isometry invariant which bounds from below the performance of Lipschitz MAB algorithms for this metric space, and we present an algorithm which comes arbitrarily close to meeting this bound. Furthermore, our technique gives even better results for benign payoff functions. We also address the full-feedback ("best expert") version of the problem, where after every round the payoffs from all arms are revealed. △ Less

Submitted 15 April, 2019; v1 submitted 4 December, 2013; originally announced December 2013.

Comments: This manuscript is a merged and definitive version of (R. Kleinberg, Slivkins, Upfal: STOC 2008) and (R. Kleinberg, Slivkins: SODA 2010), with a significantly revised presentation

arXiv:1309.4286 [pdf, other]

Accurate Computation of Survival Statistics in Genome-wide Studies

Authors: Fabio Vandin, Alexandra Papoutsaki, Benjamin J. Raphael, Eli Upfal

Abstract: A key challenge in genomics is to identify genetic variants that distinguish patients with different survival time following diagnosis or treatment. While the log-rank test is widely used for this purpose, nearly all implementations of the log-rank test rely on an asymptotic approximation that is not appropriate in many genomics applications. This is because: the two populations determined by a ge… ▽ More A key challenge in genomics is to identify genetic variants that distinguish patients with different survival time following diagnosis or treatment. While the log-rank test is widely used for this purpose, nearly all implementations of the log-rank test rely on an asymptotic approximation that is not appropriate in many genomics applications. This is because: the two populations determined by a genetic variant may have very different sizes; and the evaluation of many possible variants demands highly accurate computation of very small p-values. We demonstrate this problem for cancer genomics data where the standard log-rank test leads to many false positive associations between somatic mutations and survival time. We develop and analyze a novel algorithm, Exact Log-rank Test (ExaLT), that accurately computes the p-value of the log-rank statistic under an exact distribution that is appropriate for any size populations. We demonstrate the advantages of ExaLT on data from published cancer genomics studies, finding significant differences from the reported p-values. We analyze somatic mutations in six cancer types from The Cancer Genome Atlas (TCGA), finding mutations with known association to survival as well as several novel associations. In contrast, standard implementations of the log-rank test report dozens-hundreds of likely false positive associations as more significant than these known associations. △ Less

Submitted 17 September, 2013; originally announced September 2013.

Comments: Full version of RECOMB 2013 paper

arXiv:1305.1121 [pdf, ps, other]

Storage and Search in Dynamic Peer-to-Peer Networks

Authors: John Augustine, Anisur Rahaman Molla, Ehab Morsy, Gopal Pandurangan, Peter Robinson, Eli Upfal

Abstract: We study robust and efficient distributed algorithms for searching, storing, and maintaining data in dynamic Peer-to-Peer (P2P) networks. P2P networks are highly dynamic networks that experience heavy node churn (i.e., nodes join and leave the network continuously over time). Our goal is to guarantee, despite high node churn rate, that a large number of nodes in the network can store, retrieve, an… ▽ More We study robust and efficient distributed algorithms for searching, storing, and maintaining data in dynamic Peer-to-Peer (P2P) networks. P2P networks are highly dynamic networks that experience heavy node churn (i.e., nodes join and leave the network continuously over time). Our goal is to guarantee, despite high node churn rate, that a large number of nodes in the network can store, retrieve, and maintain a large number of data items. Our main contributions are fast randomized distributed algorithms that guarantee the above with high probability (whp) even under high adversarial churn: 1. A randomized distributed search algorithm that (whp) guarantees that searches from as many as $n - o(n)$ nodes ($n$ is the stable network size) succeed in ${O}(\log n)$-rounds despite ${O}(n/\log^{1+δ} n)$ churn, for any small constant $δ> 0$, per round. We assume that the churn is controlled by an oblivious adversary (that has complete knowledge and control of what nodes join and leave and at what time, but is oblivious to the random choices made by the algorithm). 2. A storage and maintenance algorithm that guarantees (whp) data items can be efficiently stored (with only $Θ(\log{n})$ copies of each data item) and maintained in a dynamic P2P network with churn rate up to ${O}(n/\log^{1+δ} n)$ per round. Our search algorithm together with our storage and maintenance algorithm guarantees that as many as $n - o(n)$ nodes can efficiently store, maintain, and search even under ${O}(n/\log^{1+δ} n)$ churn per round. Our algorithms require only polylogarithmic in $n$ bits to be processed and sent (per round) by each node. To the best of our knowledge, our algorithms are the first-known, fully-distributed storage and search algorithms that provably work under highly dynamic settings (i.e., high churn rates per step). △ Less

Submitted 6 May, 2013; originally announced May 2013.

Comments: to appear at SPAA 2013

ACM Class: F.2.2

arXiv:1301.2277 [pdf]

A Clustering Approach to Solving Large Stochastic Matching Problems

Authors: Milos Hauskrecht, Eli Upfal

Abstract: In this work we focus on efficient heuristics for solving a class of stochastic planning problems that arise in a variety of business, investment, and industrial applications. The problem is best described in terms of future buy and sell contracts. By buying less reliable, but less expensive, buy (supply) contracts, a company or a trader can cover a position of more reliable and more expensive sel… ▽ More In this work we focus on efficient heuristics for solving a class of stochastic planning problems that arise in a variety of business, investment, and industrial applications. The problem is best described in terms of future buy and sell contracts. By buying less reliable, but less expensive, buy (supply) contracts, a company or a trader can cover a position of more reliable and more expensive sell contracts. The goal is to maximize the expected net gain (profit) by constructing a dose to optimum portfolio out of the available buy and sell contracts. This stochastic planning problem can be formulated as a two-stage stochastic linear programming problem with recourse. However, this formalization leads to solutions that are exponential in the number of possible failure combinations. Thus, this approach is not feasible for large scale problems. In this work we investigate heuristic approximation techniques alleviating the efficiency problem. We primarily focus on the clustering approach and devise heuristics for finding clusterings leading to good approximations. We illustrate the quality and feasibility of the approach through experimental data. △ Less

Submitted 10 January, 2013; originally announced January 2013.

Comments: Appears in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI2001)

Report number: UAI-P-2001-PG-219-226

arXiv:1208.3071 [pdf, other]

doi 10.1016/j.tcs.2014.04.003

Fast Distributed PageRank Computation

Authors: Atish Das Sarma, Anisur Rahaman Molla, Gopal Pandurangan, Eli Upfal

Abstract: Over the last decade, PageRank has gained importance in a wide range of applications and domains, ever since it first proved to be effective in determining node importance in large graphs (and was a pioneering idea behind Google's search engine). In distributed computing alone, PageRank vector, or more generally random walk based quantities have been used for several different applications ranging… ▽ More Over the last decade, PageRank has gained importance in a wide range of applications and domains, ever since it first proved to be effective in determining node importance in large graphs (and was a pioneering idea behind Google's search engine). In distributed computing alone, PageRank vector, or more generally random walk based quantities have been used for several different applications ranging from determining important nodes, load balancing, search, and identifying connectivity structures. Surprisingly, however, there has been little work towards designing provably efficient fully-distributed algorithms for computing PageRank. The difficulty is that traditional matrix-vector multiplication style iterative methods may not always adapt well to the distributed setting owing to communication bandwidth restrictions and convergence rates. In this paper, we present fast random walk-based distributed algorithms for computing PageRanks in general graphs and prove strong bounds on the round complexity. We first present a distributed algorithm that takes $O\big(\log n/\eps \big)$ rounds with high probability on any graph (directed or undirected), where $n$ is the network size and $\eps$ is the reset probability used in the PageRank computation (typically $\eps$ is a fixed constant). We then present a faster algorithm that takes $O\big(\sqrt{\log n}/\eps \big)$ rounds in undirected graphs. Both of the above algorithms are scalable, as each node sends only small ($\polylog n$) number of bits over each edge per round. To the best of our knowledge, these are the first fully distributed algorithms for computing PageRank vector with provably efficient running time. △ Less

Submitted 25 November, 2015; v1 submitted 15 August, 2012; originally announced August 2012.

Comments: 14 pages

Journal ref: Theoretical Computer Science, Volume 561, Pages 113-121, 2015

arXiv:1111.6937 [pdf, other]

Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees

Authors: Matteo Riondato, Eli Upfal

Abstract: The tasks of extracting (top-$K$) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need of scanning the entire dataset, possibly multiple times. High quality approximations of FI's and AR's are sufficient for most practic… ▽ More The tasks of extracting (top-$K$) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need of scanning the entire dataset, possibly multiple times. High quality approximations of FI's and AR's are sufficient for most practical uses, and a number of recent works explored the application of sampling for fast discovery of approximate solutions to the problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of \emph{Vapnik-Chervonenkis (VC) dimension} to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-$K$) FI's and AR's. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a proof that the VC-dimension of this range space is upper bounded by an easy-to-compute characteristic quantity of the dataset which we call \emph{d-index}, and is the maximum integer $d$ such that the dataset contains at least $d$ transactions of length at least $d$ such that no one of them is a superset of or equal to another. We show that this bound is strict for a large class of datasets. △ Less

Submitted 22 February, 2013; v1 submitted 29 November, 2011; originally announced November 2011.

Comments: 19 pages, 7 figures. A shorter version of this paper appeared in the proceedings of ECML PKDD 2012

ACM Class: H.2.8

arXiv:1111.2228 [pdf, ps, other]

doi 10.1145/2304576.2304607

Space-Round Tradeoffs for MapReduce Computations

Authors: Andrea Pietracaprina, Geppino Pucci, Matteo Riondato, Francesco Silvestri, Eli Upfal

Abstract: This work explores fundamental modeling and algorithmic issues arising in the well-established MapReduce framework. First, we formally specify a computational model for MapReduce which captures the functional flavor of the paradigm by allowing for a flexible use of parallelism. Indeed, the model diverges from a traditional processor-centric view by featuring parameters which embody only global and… ▽ More This work explores fundamental modeling and algorithmic issues arising in the well-established MapReduce framework. First, we formally specify a computational model for MapReduce which captures the functional flavor of the paradigm by allowing for a flexible use of parallelism. Indeed, the model diverges from a traditional processor-centric view by featuring parameters which embody only global and local memory constraints, thus favoring a more data-centric view. Second, we apply the model to the fundamental computation task of matrix multiplication presenting upper and lower bounds for both dense and sparse matrix multiplication, which highlight interesting tradeoffs between space and round complexity. Finally, building on the matrix multiplication results, we derive further space-round tradeoffs on matrix inversion and matching. △ Less

Submitted 9 November, 2011; originally announced November 2011.

Journal ref: Final version in Proc. of the 26th ACM international conference on Supercomputing, pages 235-244, 2012

arXiv:1108.0809 [pdf, ps, other]

Distributed Agreement in Dynamic Peer-to-Peer Networks

Authors: John Augustine, Gopal Pandurangan, Peter Robinson, Eli Upfal

Abstract: Motivated by the need for robust and fast distributed computation in highly dynamic Peer-to-Peer (P2P) networks, we study algorithms for the fundamental distributed agreement problem. P2P networks are highly dynamic networks that experience heavy node {\em churn}. Our goal is to design fast algorithms (running in a small number of rounds) that guarantee, despite high node churn rate, that almost a… ▽ More Motivated by the need for robust and fast distributed computation in highly dynamic Peer-to-Peer (P2P) networks, we study algorithms for the fundamental distributed agreement problem. P2P networks are highly dynamic networks that experience heavy node {\em churn}. Our goal is to design fast algorithms (running in a small number of rounds) that guarantee, despite high node churn rate, that almost all nodes reach a stable agreement. Our main contributions are randomized distributed algorithms that guarantee {\em stable almost-everywhere agreement} with high probability even under high adversarial churn in a polylogarithmic number of rounds: 1. An $O(\log^2 n)$-round ($n$ is the stable network size) randomized algorithm that achieves almost-everywhere agreement with high probability under up to {\em linear} churn {\em per round} (i.e., $εn$, for some small constant $ε> 0$), assuming that the churn is controlled by an oblivious adversary (that has complete knowledge and control of what nodes join and leave and at what time and has unlimited computational power, but is oblivious to the random choices made by the algorithm). Our algorithm requires only polylogarithmic in $n$ bits to be processed and sent (per round) by each node. 2. An $O(\log m\log^3 n)$-round randomized algorithm that achieves almost-everywhere agreement with high probability under up to $ε\sqrt{n}$ churn per round (for some small $ε> 0$), where $m$ is the size of the input value domain, that works even under an adaptive adversary (that also knows the past random choices made by the algorithm). This algorithm requires up to polynomial in $n$ bits (and up to $O(\log m)$ bits) to be processed and sent (per round) by each node. △ Less

Submitted 10 September, 2014; v1 submitted 3 August, 2011; originally announced August 2011.

Comments: to appear at the Journal of Computer and System Sciences; preliminary version appeared at SODA 2012

arXiv:1101.5805 [pdf, ps, other]

The VC-Dimension of Queries and Selectivity Estimation Through Sampling

Authors: Matteo Riondato, Mert Akdere, Ugur Cetintemel, Stanley B. Zdonik, Eli Upfal

Abstract: We develop a novel method, based on the statistical concept of the Vapnik-Chervonenkis dimension, to evaluate the selectivity (output cardinality) of SQL queries - a crucial step in optimizing the execution of large scale database and data-mining operations. The major theoretical contribution of this work, which is of independent interest, is an explicit bound to the VC-dimension of a range space… ▽ More We develop a novel method, based on the statistical concept of the Vapnik-Chervonenkis dimension, to evaluate the selectivity (output cardinality) of SQL queries - a crucial step in optimizing the execution of large scale database and data-mining operations. The major theoretical contribution of this work, which is of independent interest, is an explicit bound to the VC-dimension of a range space defined by all possible outcomes of a collection (class) of queries. We prove that the VC-dimension is a function of the maximum number of Boolean operations in the selection predicate and of the maximum number of select and join operations in any individual query in the collection, but it is neither a function of the number of queries in the collection nor of the size (number of tuples) of the database. We leverage on this result and develop a method that, given a class of queries, builds a concise random sample of a database, such that with high probability the execution of any query in the class on the sample provides an accurate estimate for the selectivity of the query on the original large database. The error probability holds simultaneously for the selectivity estimates of all queries in the collection, thus the same sample can be used to evaluate the selectivity of multiple queries, and the sample needs to be refreshed only following major changes in the database. The sample representation computed by our method is typically sufficiently small to be stored in main memory. We present extensive experimental results, validating our theoretical analysis and demonstrating the advantage of our technique when compared to complex selectivity estimation techniques used in PostgreSQL and the Microsoft SQL Server. △ Less

Submitted 11 August, 2011; v1 submitted 30 January, 2011; originally announced January 2011.

Comments: 20 pages, 3 figures

ACM Class: H.2.4; G.3

arXiv:1101.4609 [pdf, ps, other]

Tight Bounds on Information Dissemination in Sparse Mobile Networks

Authors: Alberto Pettarin, Andrea Pietracaprina, Geppino Pucci, Eli Upfal

Abstract: Motivated by the growing interest in mobile systems, we study the dynamics of information dissemination between agents moving independently on a plane. Formally, we consider $k$ mobile agents performing independent random walks on an $n$-node grid. At time $0$, each agent is located at a random node of the grid and one agent has a rumor. The spread of the rumor is governed by a dynamic communicati… ▽ More Motivated by the growing interest in mobile systems, we study the dynamics of information dissemination between agents moving independently on a plane. Formally, we consider $k$ mobile agents performing independent random walks on an $n$-node grid. At time $0$, each agent is located at a random node of the grid and one agent has a rumor. The spread of the rumor is governed by a dynamic communication graph process ${G_t(r) | t \geq 0}$, where two agents are connected by an edge in $G_t(r)$ iff their distance at time $t$ is within their transmission radius $r$. Modeling the physical reality that the speed of radio transmission is much faster than the motion of the agents, we assume that the rumor can travel throughout a connected component of $G_t$ before the graph is altered by the motion. We study the broadcast time $T_B$ of the system, which is the time it takes for all agents to know the rumor. We focus on the sparse case (below the percolation point $r_c \approx \sqrt{n/k}$) where, with high probability, no connected component in $G_t$ has more than a logarithmic number of agents and the broadcast time is dominated by the time it takes for many independent random walks to meet each other. Quite surprisingly, we show that for a system below the percolation point the broadcast time does not depend on the relation between the mobility speed and the transmission radius. In fact, we prove that $T_B = \tilde{O}(n / \sqrt{k})$ for any $0 \leq r < r_c$, even when the transmission range is significantly larger than the mobility range in one step, giving a tight characterization up to logarithmic factors. Our result complements a recent result of Peres et al. (SODA 2011) who showed that above the percolation point the broadcast time is polylogarithmic in $k$. △ Less

Submitted 1 February, 2011; v1 submitted 24 January, 2011; originally announced January 2011.

Comments: 19 pages; we rewrote Lemma 4, fixing a claim which was not fully justified in the first version of the draft

arXiv:1007.1604 [pdf, other]

Infectious Random Walks

Authors: Alberto Pettarin, Andrea Pietracaprina, Geppino Pucci, Eli Upfal

Abstract: We study the dynamics of information (or virus) dissemination by $m$ mobile agents performing independent random walks on an $n$-node grid. We formulate our results in terms of two scenarios: broadcasting and gossi**. In the broadcasting scenario, the mobile agents are initially placed uniformly at random among the grid nodes. At time 0, one agent is informed of a rumor and starts a random walk.… ▽ More We study the dynamics of information (or virus) dissemination by $m$ mobile agents performing independent random walks on an $n$-node grid. We formulate our results in terms of two scenarios: broadcasting and gossi**. In the broadcasting scenario, the mobile agents are initially placed uniformly at random among the grid nodes. At time 0, one agent is informed of a rumor and starts a random walk. When an informed agent meets an uninformed agent, the latter becomes informed and starts a new random walk. We study the broadcasting time of the system, that is, the time it takes for all agents to know the rumor. In the gossi** scenario, each agent is given a distinct rumor at time 0 and all agents start random walks. When two agents meet, they share all rumors they are aware of. We study the gossi** time of the system, that is, the time it takes for all agents to know all rumors. We prove that both the broadcasting and the gossi** times are $\tildeΘ(n/\sqrt{m})$ w.h.p., thus achieving a tight characterization up to logarithmic factors. Previous results for the grid provided bounds which were weaker and only concerned average times. In the context of virus infection, a corollary of our results is that static and dynamically moving agents are infected at about the same speed. △ Less

Submitted 25 January, 2011; v1 submitted 9 July, 2010; originally announced July 2010.

Comments: 21 pages, 3 figures --- The results presented in this paper have been extended in: Pettarin et al., Tight Bounds on Information Dissemination in Sparse Mobile Networks, http://arxiv.longhoe.net/abs/1101.4609

arXiv:1006.5235 [pdf, ps, other]

doi 10.1007/s10618-010-0185-7

Mining Top-K Frequent Itemsets Through Progressive Sampling

Authors: Andrea Pietracaprina, Matteo Riondato, Eli Upfal, Fabio Vandin

Abstract: We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this purpose, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets' frequencies with a bounded error. Our first result is an uppe… ▽ More We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this purpose, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets' frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stop** conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stop** conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real bench- mark datasets show that using samples substantially smaller than the original dataset (i.e., of size defined by the upper bound or reached through the progressive sampling approach) enable to approximate the actual top-K frequent itemsets with accuracy much higher than what analytically proved. △ Less

Submitted 27 June, 2010; originally announced June 2010.

Comments: 16 pages, 2 figures, accepted for presentation at ECML PKDD 2010 and publication in the ECML PKDD 2010 special issue of the Data Mining and Knowledge Discovery journal

arXiv:1002.1104 [pdf, ps, other]

An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

Authors: Adam Kirsch, Michael Mitzenmacher, Andrea Pietracaprina, Geppino Pucci, Eli Upfal, Fabio Vandin

Abstract: As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support thres… ▽ More As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s* for a dataset, such that the number of itemsets with support at least s* represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate. We present extensive experimental results to substantiate the effectiveness of our methodology. △ Less

Submitted 4 February, 2010; originally announced February 2010.

Comments: A preliminary version of this work was presented in ACM PODS 2009. 20 pages, 0 figures

ACM Class: H.2.8

arXiv:1002.0874 [pdf, ps, other]

MADMX: A Novel Strategy for Maximal Dense Motif Extraction

Authors: Roberto Grossi, Andrea Pietracaprina, Nadia Pisanti, Geppino Pucci, Eli Upfal, Fabio Vandin

Abstract: We develop, analyze and experiment with a new tool, called MADMX, which extracts frequent motifs, possibly including don't care characters, from biological sequences. We introduce density, a simple and flexible measure for bounding the number of don't cares in a motif, defined as the ratio of solid (i.e., different from don't care) characters to the total length of the motif. By extracting only… ▽ More We develop, analyze and experiment with a new tool, called MADMX, which extracts frequent motifs, possibly including don't care characters, from biological sequences. We introduce density, a simple and flexible measure for bounding the number of don't cares in a motif, defined as the ratio of solid (i.e., different from don't care) characters to the total length of the motif. By extracting only maximal dense motifs, MADMX reduces the output size and improves performance, while enhancing the quality of the discoveries. The efficiency of our approach relies on a newly defined combining operation, dubbed fusion, which allows for the construction of maximal dense motifs in a bottom-up fashion, while avoiding the generation of nonmaximal ones. We provide experimental evidence of the efficiency and the quality of the motifs returned by MADMX △ Less

Submitted 3 February, 2010; originally announced February 2010.

Comments: A preliminary version of this work was presented in WABI 2009. 10 pages, 0 figures

arXiv:0809.4882 [pdf, ps, other]

Multi-Armed Bandits in Metric Spaces

Authors: Robert Kleinberg, Aleksandrs Slivkins, Eli Upfal

Abstract: In a multi-armed bandit problem, an online algorithm chooses from a set of strategies in a sequence of trials so as to maximize the total payoff of the chosen strategies. While the performance of bandit algorithms with a small finite strategy set is quite well understood, bandit problems with large strategy sets are still a topic of very active investigation, motivated by practical applications… ▽ More In a multi-armed bandit problem, an online algorithm chooses from a set of strategies in a sequence of trials so as to maximize the total payoff of the chosen strategies. While the performance of bandit algorithms with a small finite strategy set is quite well understood, bandit problems with large strategy sets are still a topic of very active investigation, motivated by practical applications such as online auctions and web advertisement. The goal of such research is to identify broad and natural classes of strategy sets and payoff functions which enable the design of efficient solutions. In this work we study a very general setting for the multi-armed bandit problem in which the strategies form a metric space, and the payoff function satisfies a Lipschitz condition with respect to the metric. We refer to this problem as the "Lipschitz MAB problem". We present a complete solution for the multi-armed problem in this setting. That is, for every metric space (L,X) we define an isometry invariant which bounds from below the performance of Lipschitz MAB algorithms for X, and we present an algorithm which comes arbitrarily close to meeting this bound. Furthermore, our technique gives even better results for benign payoff functions. △ Less

Submitted 28 September, 2008; originally announced September 2008.

Comments: 16 pages, 0 figures

arXiv:math/0209357 [pdf, ps, other]

Steady state analysis of balanced-allocation routing

Authors: Aris Anagnostopoulos, Ioannis Kontoyiannis, Eli Upfal

Abstract: We compare the long-term, steady-state performance of a variant of the standard Dynamic Alternative Routing (DAR) technique commonly used in telephone and ATM networks, to the performance of a path-selection algorithm based on the "balanced-allocation" principle; we refer to this new algorithm as the Balanced Dynamic Alternative Routing (BDAR) algorithm. While DAR checks alternative routes seque… ▽ More We compare the long-term, steady-state performance of a variant of the standard Dynamic Alternative Routing (DAR) technique commonly used in telephone and ATM networks, to the performance of a path-selection algorithm based on the "balanced-allocation" principle; we refer to this new algorithm as the Balanced Dynamic Alternative Routing (BDAR) algorithm. While DAR checks alternative routes sequentially until available bandwidth is found, the BDAR algorithm compares and chooses the best among a small number of alternatives. We show that, at the expense of a minor increase in routing overhead, the BDAR algorithm gives a substantial improvement in network performance, in terms both of network congestion and of bandwidth requirement. △ Less

Submitted 25 September, 2002; originally announced September 2002.

Comments: 22 pages, 1 figure

Showing 1–45 of 45 results for author: Upfal, E