Search | arXiv e-print repository

arXiv:2405.19977 [pdf, other]

Consistent Submodular Maximization

Authors: Paul Dütting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Morteza Zadimoghaddam

Abstract: Maximizing monotone submodular functions under cardinality constraints is a classic optimization task with several applications in data mining and machine learning. In this paper we study this problem in a dynamic environment with consistency constraints: elements arrive in a streaming fashion and the goal is maintaining a constant approximation to the optimal solution while having a stable solution (i.e., the number of changes between two consecutive solutions is bounded). We provide algorithms in this setting with different trade-offs between consistency and approximation quality. We also complement our theoretical results with an experimental analysis showing the effectiveness of our algorithms in real-world instances.

Submitted 30 May, 2024; originally announced May 2024.

Comments: To appear at ICML 24

arXiv:2312.14299 [pdf, ps, other]

Fairness in Submodular Maximization over a Matroid Constraint

Authors: Marwa El Halabi, Jakub Tarnawski, Ashkan Norouzi-Fard, Thuy-Duong Vuong

Abstract: Submodular maximization over a matroid constraint is a fundamental problem with various applications in machine learning. Some of these applications involve decision-making over datapoints with sensitive attributes such as gender or race. In such settings, it is crucial to guarantee that the selected solution is fairly distributed with respect to this attribute. Recently, fairness has been investigated in submodular maximization under a cardinality constraint in both the streaming and offline settings, however the more general problem with matroid constraint has only been considered in the streaming setting and only for monotone objectives. This work fills this gap. We propose various algorithms and impossibility results offering different trade-offs between quality, fairness, and generality.

Submitted 21 December, 2023; originally announced December 2023.

arXiv:2305.19918 [pdf, ps, other]

Fully Dynamic Submodular Maximization over Matroids

Authors: Paul Dütting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Morteza Zadimoghaddam

Abstract: Maximizing monotone submodular functions under a matroid constraint is a classic algorithmic problem with multiple applications in data mining and machine learning. We study this classic problem in the fully dynamic setting, where elements can be both inserted and deleted in real-time. Our main result is a randomized algorithm that maintains an efficient data structure with an $\tilde{O}(k^2)$ amortized update time (in the number of additions and deletions) and yields a $4$-approximate solution, where $k$ is the rank of the matroid.

Submitted 31 May, 2023; originally announced May 2023.

Comments: Accepted at ICML 2023

arXiv:2305.15118 [pdf, other]

Fairness in Streaming Submodular Maximization over a Matroid Constraint

Authors: Marwa El Halabi, Federico Fusco, Ashkan Norouzi-Fard, Jakab Tardos, Jakub Tarnawski

Abstract: Streaming submodular maximization is a natural model for the task of selecting a representative subset from a large-scale dataset. If datapoints have sensitive attributes such as gender or race, it becomes important to enforce fairness to avoid bias and discrimination. This has spurred significant interest in developing fair machine learning algorithms. Recently, such algorithms have been developed for monotone submodular maximization under a cardinality constraint. In this paper, we study the natural generalization of this problem to a matroid constraint. We give streaming algorithms as well as impossibility results that provide trade-offs between efficiency, quality and fairness. We validate our findings empirically on a range of well-known real-world applications: exemplar-based clustering, movie recommendation, and maximum coverage in social networks.

Submitted 19 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: Accepted to ICML 23

arXiv:2208.07582 [pdf, ps, other]

Deletion Robust Non-Monotone Submodular Maximization over Matroids

Authors: Paul Dütting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Morteza Zadimoghaddam

Abstract: Maximizing a submodular function is a fundamental task in machine learning and in this paper we study the deletion robust version of the problem under the classic matroids constraint. Here the goal is to extract a small size summary of the dataset that contains a high value independent set even after an adversary deleted some elements. We present constant-factor approximation algorithms, whose space complexity depends on the rank $k$ of the matroid and the number $d$ of deleted elements. In the centralized setting we present a $(4.597+O(\varepsilon))$-approximation algorithm with summary size $O( \frac{k+d}{\varepsilon^2}\log \frac{k}{\varepsilon})$ that is improved to a $(3.582+O(\varepsilon))$-approximation with $O(k + \frac{d}{\varepsilon^2}\log \frac{k}{\varepsilon})$ summary size when the objective is monotone. In the streaming setting we provide a $(9.435 + O(\varepsilon))$-approximation algorithm with summary size and memory $O(k + \frac{d}{\varepsilon^2}\log \frac{k}{\varepsilon})$; the approximation factor is then improved to $(5.582+O(\varepsilon))$ in the monotone case.

Submitted 16 August, 2022; originally announced August 2022.

Comments: Preliminary versions of this work appeared as arXiv:2201.13128 and in ICML'22. The main difference with respect to these versions consists in extending our results to non-monotone submodular functions

arXiv:2204.05154 [pdf, ps, other]

Submodular Maximization Subject to Matroid Intersection on the Fly

Authors: Moran Feldman, Ashkan Norouzi-Fard, Ola Svensson, Rico Zenklusen

Abstract: Despite a surge of interest in submodular maximization in the data stream model, there remain significant gaps in our knowledge about what can be achieved in this setting, especially when dealing with multiple constraints. In this work, we nearly close several basic gaps in submodular maximization subject to $k$ matroid constraints in the data stream model. We present a new hardness result showing that super polynomial memory in $k$ is needed to obtain an $o(k / \log k)$-approximation. This implies near optimality of prior algorithms. For the same setting, we show that one can nevertheless obtain a constant-factor approximation by maintaining a set of elements whose size is independent of the stream size. Finally, for bipartite matching constraints, a well-known special case of matroid intersection, we present a new technique to obtain hardness bounds that are significantly stronger than those obtained with prior approaches. Prior results left it open whether a $2$-approximation may exist in this setting, and only a complexity-theoretic hardness of $1.91$ was known. We prove an unconditional hardness of $2.69$.

Submitted 11 April, 2022; originally announced April 2022.

Comments: 41 pages, 1 figure. arXiv admin note: text overlap with arXiv:2107.07183

MSC Class: 68R05 (Primary) 68W27; 68Q25; 90C27 (Secondary) ACM Class: F.2.2; G.2.1

arXiv:2203.01440 [pdf, ps, other]

Near-Optimal Correlation Clustering with Privacy

Authors: Vincent Cohen-Addad, Chenglin Fan, Silvio Lattanzi, Slobodan Mitrović, Ashkan Norouzi-Fard, Nikos Parotsidis, Jakub Tarnawski

Abstract: Correlation clustering is a central problem in unsupervised learning, with applications spanning community detection, duplicate detection, automated labelling and many more. In the correlation clustering problem one receives as input a set of nodes and for each node a list of co-clustering preferences, and the goal is to output a clustering that minimizes the disagreement with the specified nodes' preferences. In this paper, we introduce a simple and computationally efficient algorithm for the correlation clustering problem with provable privacy guarantees. Our approximation guarantees are stronger than those shown in prior work and are optimal up to logarithmic factors.

Submitted 2 March, 2022; originally announced March 2022.

arXiv:2201.13128 [pdf, other]

Deletion Robust Submodular Maximization over Matroids

Authors: Paul Dütting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Morteza Zadimoghaddam

Abstract: Maximizing a monotone submodular function is a fundamental task in machine learning. In this paper, we study the deletion robust version of the problem under the classic matroids constraint. Here the goal is to extract a small size summary of the dataset that contains a high value independent set even after an adversary deleted some elements. We present constant-factor approximation algorithms, whose space complexity depends on the rank $k$ of the matroid and the number $d$ of deleted elements. In the centralized setting we present a $(3.582+O(\varepsilon))$-approximation algorithm with summary size $O(k + \frac{d \log k}{\varepsilon^2})$. In the streaming setting we provide a $(5.582+O(\varepsilon))$-approximation algorithm with summary size and memory $O(k + \frac{d \log k}{\varepsilon^2})$. We complement our theoretical results with an in-depth experimental analysis showing the effectiveness of our algorithms on real-world datasets.

Submitted 31 January, 2022; originally announced January 2022.

Journal ref: Proceedings of the 39th International Conference on Machine Learning, PMLR 162:5671-5693, 2022

arXiv:2107.07183 [pdf, other]

Streaming Submodular Maximization under Matroid Constraints

Authors: Moran Feldman, Paul Liu, Ashkan Norouzi-Fard, Ola Svensson, Rico Zenklusen

Abstract: Recent progress in (semi-)streaming algorithms for monotone submodular function maximization has led to tight results for a simple cardinality constraint. However, current techniques fail to give a similar understanding for natural generalizations, including matroid constraints. This paper aims at closing this gap. For a single matroid of rank $k$ (i.e., any solution has cardinality at most $k$), our main results are: 1) a single-pass streaming algorithm that uses $\widetilde{O}(k)$ memory and achieves an approximation guarantee of $0.3178$, and 2) a multi-pass streaming algorithm that uses $\widetilde{O}(k)$ memory and achieves an approximation guarantee of $(1-1/e - \varepsilon)$ by taking a constant (depending on $\varepsilon$) number of passes over the stream. This improves on the previously best approximation guarantees of $1/4$ and $1/2$ for single-pass and multi-pass streaming algorithms, respectively. In fact, our multi-pass streaming algorithm is tight in that any algorithm with a better guarantee than $1/2$ must make several passes through the stream and any algorithm that beats our guarantee of $1-1/e$ must make linearly many passes (as well as an exponential number of value oracle queries). Moreover, we show how the approach we use for multi-pass streaming can be further strengthened if the elements of the stream arrive in uniformly random order, implying an improved result for $p$-matchoid constraints.

Submitted 16 February, 2022; v1 submitted 15 July, 2021; originally announced July 2021.

Comments: 44 pages

MSC Class: 68W27 (Primary) 68R05; 68Q11 (Secondary) ACM Class: F.2.2; G.2.1

arXiv:2106.08448 [pdf, other]

Correlation Clustering in Constant Many Parallel Rounds

Authors: Vincent Cohen-Addad, Silvio Lattanzi, Slobodan Mitrović, Ashkan Norouzi-Fard, Nikos Parotsidis, Jakub Tarnawski

Abstract: Correlation clustering is a central topic in unsupervised learning, with many applications in ML and data mining. In correlation clustering, one receives as input a signed graph and the goal is to partition it to minimize the number of disagreements. In this work we propose a massively parallel computation (MPC) algorithm for this problem that is considerably faster than prior work. In particular, our algorithm uses machines with memory sublinear in the number of nodes in the graph and returns a constant approximation while running only for a constant number of rounds. To the best of our knowledge, our algorithm is the first that can provably approximate a clustering problem on graphs using only a constant number of MPC rounds in the sublinear memory regime. We complement our analysis with an experimental analysis of our techniques.

Submitted 15 June, 2021; originally announced June 2021.

Comments: ICML 2021 (long talk)

arXiv:2106.04805 [pdf, other]

Streaming Belief Propagation for Community Detection

Authors: Yuchen Wu, MohammadHossein Bateni, Andre Linhares, Filipe Miguel Goncalves de Almeida, Andrea Montanari, Ashkan Norouzi-Fard, Jakab Tardos

Abstract: The community detection problem requires to cluster the nodes of a network into a small number of well-connected "communities". There has been substantial recent progress in characterizing the fundamental statistical limits of community detection under simple stochastic block models. However, in real-world applications, the network structure is typically dynamic, with nodes that join over time. In this setting, we would like a detection algorithm to perform only a limited number of updates at each node arrival. While standard voting approaches satisfy this constraint, it is unclear whether they exploit the network information optimally. We introduce a simple model for networks growing over time which we refer to as streaming stochastic block model (StSBM). Within this model, we prove that voting algorithms have fundamental limitations. We also develop a streaming belief-propagation (StreamBP) approach, for which we prove optimality in certain regimes. We validate our theoretical findings on synthetic and real data.

Submitted 10 June, 2021; v1 submitted 9 June, 2021; originally announced June 2021.

Comments: 36 pages, 13 figures

arXiv:2012.11891 [pdf, ps, other]

Fast and Accurate $k$-means++ via Rejection Sampling

Authors: Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, Ola Svensson

Abstract: $k$-means++ \cite{arthur2007k} is a widely used clustering algorithm that is easy to implement, has nice theoretical guarantees and strong empirical performance. Despite its wide adoption, $k$-means++ sometimes suffers from being slow on large data-sets so a natural question has been to obtain more efficient algorithms with similar guarantees. In this paper, we present a near linear time algorithm for $k$-means++ seeding. Interestingly our algorithm obtains the same theoretical guarantees as $k$-means++ and significantly improves earlier results on fast $k$-means++ seeding. Moreover, we show empirically that our algorithm is significantly faster than $k$-means++ and obtains solutions of equivalent quality.

Submitted 22 December, 2020; originally announced December 2020.

arXiv:2011.06888 [pdf, other]

Consistent k-Clustering for General Metrics

Authors: Hendrik Fichtenberger, Silvio Lattanzi, Ashkan Norouzi-Fard, Ola Svensson

Abstract: Given a stream of points in a metric space, is it possible to maintain a constant approximate clustering by changing the cluster centers only a small number of times during the entire execution of the algorithm? This question received attention in recent years in the machine learning literature and, before our work, the best known algorithm performs $\widetilde{O}(k^2)$ center swaps (the $\widetilde{O}(\cdot)$ notation hides polylogarithmic factors in the number of points $n$ and the aspect ratio $Δ$ of the input instance). This is a quadratic increase compared to the offline case -- the whole stream is known in advance and one is interested in keeping a constant approximation at any point in time -- for which $\widetilde{O}(k)$ swaps are known to be sufficient and simple examples show that $Ω(k \log(n Δ))$ swaps are necessary. We close this gap by developing an algorithm that, perhaps surprisingly, matches the guarantees in the offline setting. Specifically, we show how to maintain a constant-factor approximation for the $k$-median problem by performing an optimal (up to polylogarithmic factors) number $\widetilde{O}(k)$ of center swaps. To obtain our result we leverage new structural properties of $k$-median clustering that may be of independent interest.

Submitted 13 November, 2020; originally announced November 2020.

arXiv:2010.07431 [pdf, other]

Fairness in Streaming Submodular Maximization: Algorithms and Hardness

Authors: Marwa El Halabi, Slobodan Mitrović, Ashkan Norouzi-Fard, Jakab Tardos, Jakub Tarnawski

Abstract: Submodular maximization has become established as the method of choice for the task of selecting representative and diverse summaries of data. However, if datapoints have sensitive attributes such as gender or age, such machine learning algorithms, left unchecked, are known to exhibit bias: under- or over-representation of particular groups. This has made the design of fair machine learning algorithms increasingly important. In this work we address the question: Is it possible to create fair summaries for massive datasets? To this end, we develop the first streaming approximation algorithms for submodular maximization under fairness constraints, for both monotone and non-monotone functions. We validate our findings empirically on exemplar-based clustering, movie recommendation, DPP-based summarization, and maximum coverage in social networks, showing that fairness constraints do not significantly impact utility.

Submitted 18 October, 2020; v1 submitted 14 October, 2020; originally announced October 2020.

Comments: Accepted to NeurIPS 2020

arXiv:2006.04704 [pdf, ps, other]

doi 10.5555/3495724.3496808

Fully Dynamic Algorithm for Constrained Submodular Optimization

Authors: Silvio Lattanzi, Slobodan Mitrović, Ashkan Norouzi-Fard, Jakub Tarnawski, Morteza Zadimoghaddam

Abstract: The task of maximizing a monotone submodular function under a cardinality constraint is at the core of many machine learning and data mining applications, including data summarization, sparse regression and coverage problems. We study this classic problem in the fully dynamic setting, where elements can be both inserted and removed. Our main result is a randomized algorithm that maintains an efficient data structure with a poly-logarithmic amortized update time and yields a $(1/2-ε)$-approximate solution. We complement our theoretical analysis with an empirical study of the performance of our algorithm.

Submitted 24 May, 2023; v1 submitted 8 June, 2020; originally announced June 2020.

Journal ref: NeurIPS 2020

arXiv:2003.13459 [pdf, ps, other]

The One-way Communication Complexity of Submodular Maximization with Applications to Streaming and Robustness

Authors: Moran Feldman, Ashkan Norouzi-Fard, Ola Svensson, Rico Zenklusen

Abstract: We consider the classical problem of maximizing a monotone submodular function subject to a cardinality constraint, which, due to its numerous applications, has recently been studied in various computational models. We consider a clean multi-player model that lies between the offline and streaming model, and study it under the aspect of one-way communication complexity. Our model captures the streaming setting (by considering a large number of players), and, in addition, two player approximation results for it translate into the robust setting. We present tight one-way communication complexity results for our model, which, due to the above-mentioned connections, have multiple implications in the data stream and robust setting. Even for just two players, a prior information-theoretic hardness result implies that no approximation factor above $1/2$ can be achieved in our model, if only queries to feasible sets are allowed. We show that the possibility of querying infeasible sets can actually be exploited to beat this bound, by presenting a tight $2/3$-approximation taking exponential time, and an efficient $0.514$-approximation. To the best of our knowledge, this is the first example where querying a submodular function on infeasible sets leads to provably better results. Through the above-mentioned link to the robust setting, both of these algorithms improve on the current state-of-the-art for robust submodular maximization, showing that approximation factors beyond $1/2$ are possible.
Moreover, exploiting the link of our model to streaming, we settle the approximability for streaming algorithms by presenting a tight $1/2+\varepsilon$ hardness result, based on the construction of a new family of coverage functions. This improves on a prior $1-1/e+\varepsilon$ hardness and matches, up to an arbitrarily small margin, the best known approximation algorithm.

Submitted 30 March, 2020; originally announced March 2020.

Comments: 56 pages, no figures, to appear in STOC 2020 in the form of an extended abstract

MSC Class: 68R05 (Primary) 68W27; 68Q25 (Secondary) ACM Class: F.2.2; G.2.1

arXiv:1907.05725 [pdf, other]

Space Efficient Approximation to Maximum Matching Size from Uniform Edge Samples

Authors: Michael Kapralov, Slobodan Mitrović, Ashkan Norouzi-Fard, Jakab Tardos

Abstract: Given a source of iid samples of edges of an input graph $G$ with $n$ vertices and $m$ edges, how many samples does one need to compute a constant factor approximation to the maximum matching size in $G$? Moreover, is it possible to obtain such an estimate in a small amount of space? We show that, on the one hand, this problem cannot be solved using a nontrivially sublinear (in $m$) number of samples: $m^{1-o(1)}$ samples are needed. On the other hand, a surprisingly space efficient algorithm for processing the samples exists: $O(\log^2 n)$ bits of space suffice to compute an estimate. Our main technical tool is a new peeling type algorithm for matching that we simulate using a recursive sampling process that crucially ensures that local neighborhood information from `dense' regions of the graph is provided at appropriately higher sampling rates. We show that a delicate balance between exploration depth and sampling rate allows our simulation to not lose precision over a logarithmic number of levels of recursion and achieve a constant factor approximation. The previous best result on matching size estimation from random samples was a $\log^{O(1)} n$ approximation [Kapralov et al'14]. Our algorithm also yields a constant factor approximate local computation algorithm (LCA) for matching with $O(d\log n)$ exploration starting from any vertex.
Previous approaches were based on local simulations of randomized greedy, which take $O(d)$ time {\em in expectation over the starting vertex or edge} (Yoshida et al'09, Onak et al'12), and could not achieve a better than $d^2$ runtime. Interestingly, we also show that unlike our algorithm, the local simulation of randomized greedy that is the basis of the most efficient prior results does take $\widetilde{Ω}(d^2)\gg O(d\log n)$ time for a worst case edge even for $d=\exp(Θ(\sqrt{\log n}))$.

Submitted 12 July, 2019; originally announced July 2019.

arXiv:1808.01842 [pdf, other]

Beyond $1/2$-Approximation for Submodular Maximization on Massive Data Streams

Authors: Ashkan Norouzi-Fard, Jakub Tarnawski, Slobodan Mitrović, Amir Zandieh, Aida Mousavifar, Ola Svensson

Abstract: Many tasks in machine learning and data mining, such as data diversification, non-parametric learning, kernel machines, clustering etc., require extracting a small but representative summary from a massive dataset. Often, such problems can be posed as maximizing a submodular set function subject to a cardinality constraint. We consider this question in the streaming setting, where elements arrive over time at a fast pace and thus we need to design an efficient, low-memory algorithm. One such method, proposed by Badanidiyuru et al. (2014), always finds a $0.5$-approximate solution. Can this approximation factor be improved? We answer this question affirmatively by designing a new algorithm SALSA for streaming submodular maximization. It is the first low-memory, single-pass algorithm that improves the factor $0.5$, under the natural assumption that elements arrive in a random order. We also show that this assumption is necessary, i.e., that there is no such algorithm with better than $0.5$-approximation when elements arrive in arbitrary order. Our experiments demonstrate that SALSA significantly outperforms the state of the art in applications related to exemplar-based clustering, social graph analysis, and recommender systems.

Submitted 6 August, 2018; originally announced August 2018.

Journal ref: Proc. of 35th International Conference on Machine Learning (ICML), 2018, pages 3829-3838

arXiv:1711.02598 [pdf, other]

Streaming Robust Submodular Maximization: A Partitioned Thresholding Approach

Authors: Slobodan Mitrović, Ilija Bogunovic, Ashkan Norouzi-Fard, Jakub Tarnawski, Volkan Cevher

Abstract: We study the classical problem of maximizing a monotone submodular function subject to a cardinality constraint k, with two additional twists: (i) elements arrive in a streaming fashion, and (ii) m items from the algorithm's memory are removed after the stream is finished. We develop a robust submodular algorithm STAR-T. It is based on a novel partitioning structure and an exponentially decreasing thresholding rule. STAR-T makes one pass over the data and retains a short but robust summary. We show that after the removal of any m elements from the obtained summary, a simple greedy algorithm STAR-T-GREEDY that runs on the remaining elements achieves a constant-factor approximation guarantee. In two different data summarization tasks, we demonstrate that it matches or outperforms existing greedy and streaming methods, even if they are allowed the benefit of knowing the removed subset in advance.

Submitted 7 November, 2017; originally announced November 2017.

Comments: To appear in NIPS 2017

Journal ref: Proc. of 30th Advances in Neural Information Processing Systems (NIPS) 2017, pages 4558-4567

arXiv:1612.07925 [pdf, ps, other]

Better Guarantees for k-Means and Euclidean k-Median by Primal-Dual Algorithms

Authors: Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, Justin Ward

Abstract: Clustering is a classic topic in optimization with $k$-means being one of the most fundamental such problems. In the absence of any restrictions on the input, the best known algorithm for $k$-means with a provable guarantee is a simple local search heuristic yielding an approximation guarantee of $9+ε$, a ratio that is known to be tight with respect to such methods. We overcome this barrier by presenting a new primal-dual approach that allows us to (1) exploit the geometric structure of $k$-means and (2) to satisfy the hard constraint that at most $k$ clusters are selected without deteriorating the approximation guarantee. Our main result is a $6.357$-approximation algorithm with respect to the standard LP relaxation. Our techniques are quite general and we also show improved guarantees for the general version of $k$-means where the underlying metric is not required to be Euclidean and for $k$-median in Euclidean metrics.

Submitted 10 April, 2017; v1 submitted 23 December, 2016; originally announced December 2016.

arXiv:1611.08574 [pdf, other]

An Efficient Streaming Algorithm for the Submodular Cover Problem

Authors: Ashkan Norouzi-Fard, Abbas Bazzi, Marwa El Halabi, Ilija Bogunovic, Ya-Ping Hsieh, Volkan Cevher

Abstract: We initiate the study of the classical Submodular Cover (SC) problem in the data streaming model which we refer to as the Streaming Submodular Cover (SSC). We show that any single pass streaming algorithm using sublinear memory in the size of the stream will fail to provide any non-trivial approximation guarantees for SSC. Hence, we consider a relaxed version of SSC, where we only seek to find a partial cover. We design the first Efficient bicriteria Submodular Cover Streaming (ESC-Streaming) algorithm for this problem, and provide theoretical guarantees for its performance supported by numerical evidence. Our algorithm finds solutions that are competitive with the near-optimal offline greedy algorithm despite requiring only a single pass over the data stream. In our numerical experiments, we evaluate the performance of ESC-Streaming on active set selection and large-scale graph cover problems.

Submitted 25 November, 2016; originally announced November 2016.

Comments: To appear in NIPS'16

arXiv:1507.01906 [pdf, other]

Towards Tight Lower Bounds for Scheduling Problems

Authors: Abbas Bazzi, Ashkan Norouzi-Fard

Abstract: We show a close connection between structural hardness for $k$-partite graphs and tight inapproximability results for scheduling problems with precedence constraints. Assuming a natural but nontrivial generalisation of the bipartite structural hardness result of Bansal and Khot, we obtain a hardness of $2-ε$ for the problem of minimising the makespan for scheduling precedence-constrained jobs with preemption on identical parallel machines. This matches the best approximation guarantee for this problem. Assuming the same hypothesis, we also obtain a super constant inapproximability result for the problem of scheduling precedence-constrained jobs on related parallel machines, making progress towards settling an open question in both lists of ten open questions by Williamson and Shmoys, and by Schuurman and Woeginger. The study of structural hardness of $k$-partite graphs is of independent interest, as it captures the intrinsic hardness for a large family of scheduling problems. Other than the ones already mentioned, this generalisation also implies tight inapproximability to the problem of minimising the weighted completion time for precedence-constrained jobs on a single machine, and the problem of minimising the makespan of precedence-constrained jobs on identical parallel machine, and hence unifying the results of Bansal and Khot, and Svensson, respectively.

Submitted 7 July, 2015; originally announced July 2015.

Comments: 25 pages, 3 figures, To appear in the Proceedings of the 23rd Annual European Symposium on Algorithms 2015

arXiv:1411.4476 [pdf, other]

Dynamic Facility Location via Exponential Clocks

Authors: Hyung-Chan An, Ashkan Norouzi-Fard, Ola Svensson

Abstract: The \emph{dynamic facility location problem} is a generalization of the classic facility location problem proposed by Eisenstat, Mathieu, and Schabanel to model the dynamics of evolving social/infrastructure networks. The generalization lies in that the distance metric between clients and facilities changes over time. This leads to a trade-off between optimizing the classic objective function and the "stability" of the solution: there is a switching cost charged every time a client changes the facility to which it is connected. While the standard linear program (LP) relaxation for the classic problem naturally extends to this problem, traditional LP-rounding techniques do not, as they are often sensitive to small changes in the metric resulting in frequent switches. We present a new LP-rounding algorithm for facility location problems, which yields the first constant approximation algorithm for the dynamic facility location problem. Our algorithm installs competing exponential clocks on the clients and facilities, and connect every client by the path that repeatedly follows the smallest clock in the neighborhood. The use of exponential clocks gives rise to several properties that distinguish our approach from previous LP-roundings for facility location problems. In particular, we use \emph{no clustering} and we allow clients to connect through paths of \emph{arbitrary lengths}. In fact, the clustering-free nature of our algorithm is crucial for applying our LP-rounding approach to the dynamic problem.

Submitted 17 November, 2014; originally announced November 2014.

Showing 1–23 of 23 results for author: Norouzi-Fard, A