Search | arXiv e-print repository

Why the Metric Backbone Preserves Community Structure

Authors: Maximilien Dreveton, Charbel Chucri, Matthias Grossglauser, Patrick Thiran

Abstract: The metric backbone of a weighted graph is the union of all-pairs shortest paths. It is obtained by removing all edges $(u,v)$ that are not the shortest path between $u$ and $v$. In networks with well-separated communities, the metric backbone tends to preserve many inter-community edges, because these edges serve as bridges connecting two communities, but tends to delete many intra-community edge… ▽ More The metric backbone of a weighted graph is the union of all-pairs shortest paths. It is obtained by removing all edges $(u,v)$ that are not the shortest path between $u$ and $v$. In networks with well-separated communities, the metric backbone tends to preserve many inter-community edges, because these edges serve as bridges connecting two communities, but tends to delete many intra-community edges because the communities are dense. This suggests that the metric backbone would dilute or destroy the community structure of the network. However, this is not borne out by prior empirical work, which instead showed that the metric backbone of real networks preserves the community structure of the original network well. In this work, we analyze the metric backbone of a broad class of weighted random graphs with communities, and we formally prove the robustness of the community structure with respect to the deletion of all the edges that are not in the metric backbone. An empirical comparison of several graph sparsification techniques confirms our theoretical finding and shows that the metric backbone is an efficient sparsifier in the presence of communities. △ Less

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2405.14547 [pdf, other]

Causal Effect Identification in a Sub-Population with Latent Variables

Authors: Amir Mohammad Abouei, Ehsan Mokhtarian, Negar Kiyavash, Matthias Grossglauser

Abstract: The s-ID problem seeks to compute a causal effect in a specific sub-population from the observational data pertaining to the same sub population (Abouei et al., 2023). This problem has been addressed when all the variables in the system are observable. In this paper, we consider an extension of the s-ID problem that allows for the presence of latent variables. To tackle the challenges induced by t… ▽ More The s-ID problem seeks to compute a causal effect in a specific sub-population from the observational data pertaining to the same sub population (Abouei et al., 2023). This problem has been addressed when all the variables in the system are observable. In this paper, we consider an extension of the s-ID problem that allows for the presence of latent variables. To tackle the challenges induced by the presence of latent variables in a sub-population, we first extend the classical relevant graphical definitions, such as c-components and Hedges, initially defined for the so-called ID problem (Pearl, 1995; Tian & Pearl, 2002), to their new counterparts. Subsequently, we propose a sound algorithm for the s-ID problem with latent variables. △ Less

Submitted 23 May, 2024; originally announced May 2024.

Comments: 19 pages, 5 figures

arXiv:2402.15432 [pdf, ps, other]

Universal Lower Bounds and Optimal Rates: Achieving Minimax Clustering Error in Sub-Exponential Mixture Models

Authors: Maximilien Dreveton, Alperen Gözeten, Matthias Grossglauser, Patrick Thiran

Abstract: Clustering is a pivotal challenge in unsupervised machine learning and is often investigated through the lens of mixture models. The optimal error rate for recovering cluster labels in Gaussian and sub-Gaussian mixture models involves ad hoc signal-to-noise ratios. Simple iterative algorithms, such as Lloyd's algorithm, attain this optimal error rate. In this paper, we first establish a universal… ▽ More Clustering is a pivotal challenge in unsupervised machine learning and is often investigated through the lens of mixture models. The optimal error rate for recovering cluster labels in Gaussian and sub-Gaussian mixture models involves ad hoc signal-to-noise ratios. Simple iterative algorithms, such as Lloyd's algorithm, attain this optimal error rate. In this paper, we first establish a universal lower bound for the error rate in clustering any mixture model, expressed through a Chernoff divergence, a more versatile measure of model information than signal-to-noise ratios. We then demonstrate that iterative algorithms attain this lower bound in mixture models with sub-exponential tails, notably emphasizing location-scale mixtures featuring Laplace-distributed errors. Additionally, for datasets better modelled by Poisson or Negative Binomial mixtures, we study mixture models whose distributions belong to an exponential family. In such mixtures, we establish that Bregman hard clustering, a variant of Lloyd's algorithm employing a Bregman divergence, is rate optimal. △ Less

Submitted 23 February, 2024; originally announced February 2024.

MSC Class: 62H30; 62F12; 62B10

arXiv:2311.08914 [pdf, other]

Efficiently Esca** Saddle Points for Non-Convex Policy Optimization

Authors: Sadegh Khorasani, Saber Salehkaleybar, Negar Kiyavash, Niao He, Matthias Grossglauser

Abstract: Policy gradient (PG) is widely used in reinforcement learning due to its scalability and good performance. In recent years, several variance-reduced PG methods have been proposed with a theoretical guarantee of converging to an approximate first-order stationary point (FOSP) with the sample complexity of $O(ε^{-3})$. However, FOSPs could be bad local optima or saddle points. Moreover, these algori… ▽ More Policy gradient (PG) is widely used in reinforcement learning due to its scalability and good performance. In recent years, several variance-reduced PG methods have been proposed with a theoretical guarantee of converging to an approximate first-order stationary point (FOSP) with the sample complexity of $O(ε^{-3})$. However, FOSPs could be bad local optima or saddle points. Moreover, these algorithms often use importance sampling (IS) weights which could impair the statistical effectiveness of variance reduction. In this paper, we propose a variance-reduced second-order method that uses second-order information in the form of Hessian vector products (HVP) and converges to an approximate second-order stationary point (SOSP) with sample complexity of $\tilde{O}(ε^{-3})$. This rate improves the best-known sample complexity for achieving approximate SOSPs by a factor of $O(ε^{-0.5})$. Moreover, the proposed variance reduction technique bypasses IS weights by using HVP terms. Our experimental results show that the proposed algorithm outperforms the state of the art and is more robust to changes in random seeds. △ Less

Submitted 15 November, 2023; originally announced November 2023.

Comments: arXiv admin note: text overlap with arXiv:2205.08253

MSC Class: ACM-class:I.2.6

arXiv:2309.11381 [pdf, other]

Studying Lobby Influence in the European Parliament

Authors: Aswin Suresh, Lazar Radojevic, Francesco Salvi, Antoine Magron, Victor Kristof, Matthias Grossglauser

Abstract: We present a method based on natural language processing (NLP), for studying the influence of interest groups (lobbies) in the law-making process in the European Parliament (EP). We collect and analyze novel datasets of lobbies' position papers and speeches made by members of the EP (MEPs). By comparing these texts on the basis of semantic similarity and entailment, we are able to discover interpr… ▽ More We present a method based on natural language processing (NLP), for studying the influence of interest groups (lobbies) in the law-making process in the European Parliament (EP). We collect and analyze novel datasets of lobbies' position papers and speeches made by members of the EP (MEPs). By comparing these texts on the basis of semantic similarity and entailment, we are able to discover interpretable links between MEPs and lobbies. In the absence of a ground-truth dataset of such links, we perform an indirect validation by comparing the discovered links with a dataset, which we curate, of retweet links between MEPs and lobbies, and with the publicly disclosed meetings of MEPs. Our best method achieves an AUC score of 0.77 and performs significantly better than several baselines. Moreover, an aggregate analysis of the discovered links, between groups of related lobbies and political groups of MEPs, correspond to the expectations from the ideology of the groups (e.g., center-left groups are associated with social causes). We believe that this work, which encompasses the methodology, datasets, and results, is a step towards enhancing the transparency of the intricate decision-making processes within democratic institutions. △ Less

Submitted 20 September, 2023; originally announced September 2023.

Comments: 11 pages, 5 figures. Under review for presentation at ICWSM 2024

arXiv:2307.08139 [pdf, other]

It's All Relative: Interpretable Models for Scoring Bias in Documents

Authors: Aswin Suresh, Chi-Hsuan Wu, Matthias Grossglauser

Abstract: We propose an interpretable model to score the bias present in web documents, based only on their textual content. Our model incorporates assumptions reminiscent of the Bradley-Terry axioms and is trained on pairs of revisions of the same Wikipedia article, where one version is more biased than the other. While prior approaches based on absolute bias classification have struggled to obtain a high… ▽ More We propose an interpretable model to score the bias present in web documents, based only on their textual content. Our model incorporates assumptions reminiscent of the Bradley-Terry axioms and is trained on pairs of revisions of the same Wikipedia article, where one version is more biased than the other. While prior approaches based on absolute bias classification have struggled to obtain a high accuracy for the task, we are able to develop a useful model for scoring bias by learning to perform pairwise comparisons of bias accurately. We show that we can interpret the parameters of the trained model to discover the words most indicative of bias. We also apply our model in three different settings - studying the temporal evolution of bias in Wikipedia articles, comparing news sources based on bias, and scoring bias in law amendments. In each case, we demonstrate that the outputs of the model can be explained and validated, even for the two domains that are outside the training-data domain. We also use the model to compare the general level of bias between domains, where we see that legal texts are the least biased and news media are the most biased, with Wikipedia articles in between. Given its high performance, simplicity, interpretability, and wide applicability, we hope the model will be useful for a large community, including Wikipedia and news editors, political and social scientists, and the general public. △ Less

Submitted 16 July, 2023; originally announced July 2023.

Comments: 12 pages

arXiv:2306.01814 [pdf, other]

Fast Interactive Search with a Scale-Free Comparison Oracle

Authors: Daniyar Chumbalov, Lars Klein, Lucas Maystre, Matthias Grossglauser

Abstract: A comparison-based search algorithm lets a user find a target item $t$ in a database by answering queries of the form, ``Which of items $i$ and $j$ is closer to $t$?'' Instead of formulating an explicit query (such as one or several keywords), the user navigates towards the target via a sequence of such (typically noisy) queries. We propose a scale-free probabilistic oracle model called $γ$-CKL… ▽ More A comparison-based search algorithm lets a user find a target item $t$ in a database by answering queries of the form, ``Which of items $i$ and $j$ is closer to $t$?'' Instead of formulating an explicit query (such as one or several keywords), the user navigates towards the target via a sequence of such (typically noisy) queries. We propose a scale-free probabilistic oracle model called $γ$-CKL for such similarity triplets $(i,j;t)$, which generalizes the CKL triplet model proposed in the literature. The generalization affords independent control over the discriminating power of the oracle and the dimension of the feature space containing the items. We develop a search algorithm with provably exponential rate of convergence under the $γ$-CKL oracle, thanks to a backtracking strategy that deals with the unavoidable errors in updating the belief region around the target. We evaluate the performance of the algorithm both over the posited oracle and over several real-world triplet datasets. We also report on a comprehensive user study, where human subjects navigate a database of face portraits. △ Less

Submitted 2 June, 2023; originally announced June 2023.

arXiv:2306.00833 [pdf, other]

When Does Bottom-up Beat Top-down in Hierarchical Community Detection?

Authors: Maximilien Dreveton, Daichi Kuroda, Matthias Grossglauser, Patrick Thiran

Abstract: Hierarchical clustering of networks consists in finding a tree of communities, such that lower levels of the hierarchy reveal finer-grained community structures. There are two main classes of algorithms tackling this problem. Divisive ($\textit{top-down}$) algorithms recursively partition the nodes into two communities, until a stop** rule indicates that no further split is needed. In contrast,… ▽ More Hierarchical clustering of networks consists in finding a tree of communities, such that lower levels of the hierarchy reveal finer-grained community structures. There are two main classes of algorithms tackling this problem. Divisive ($\textit{top-down}$) algorithms recursively partition the nodes into two communities, until a stop** rule indicates that no further split is needed. In contrast, agglomerative ($\textit{bottom-up}$) algorithms first identify the smallest community structure and then repeatedly merge the communities using a $\textit{linkage}$ method. In this article, we establish theoretical guarantees for the recovery of the hierarchical tree and community structure of a Hierarchical Stochastic Block Model by a bottom-up algorithm. We also establish that this bottom-up algorithm attains the information-theoretic threshold for exact recovery at intermediate levels of the hierarchy. Notably, these recovery conditions are less restrictive compared to those existing for top-down algorithms. This shows that bottom-up algorithms extend the feasible region for achieving exact recovery at intermediate levels. Numerical experiments on both synthetic and real data sets confirm the superiority of bottom-up algorithms over top-down algorithms. We also observe that top-down algorithms can produce dendrograms with inversions. These findings contribute to a better understanding of hierarchical clustering techniques and their applications in network analysis. △ Less

Submitted 1 June, 2023; originally announced June 2023.

arXiv:2006.11325 [pdf, other]

Self-Supervised Prototypical Transfer Learning for Few-Shot Classification

Authors: Carlos Medina, Arnout Devos, Matthias Grossglauser

Abstract: Most approaches in few-shot learning rely on costly annotated data related to the goal task domain during (pre-)training. Recently, unsupervised meta-learning methods have exchanged the annotation requirement for a reduction in few-shot classification performance. Simultaneously, in settings with realistic domain shift, common transfer learning has been shown to outperform supervised meta-learning… ▽ More Most approaches in few-shot learning rely on costly annotated data related to the goal task domain during (pre-)training. Recently, unsupervised meta-learning methods have exchanged the annotation requirement for a reduction in few-shot classification performance. Simultaneously, in settings with realistic domain shift, common transfer learning has been shown to outperform supervised meta-learning. Building on these insights and on advances in self-supervised learning, we propose a transfer learning approach which constructs a metric embedding that clusters unlabeled prototypical samples and their augmentations closely together. This pre-trained embedding is a starting point for few-shot classification by summarizing class clusters and fine-tuning. We demonstrate that our self-supervised prototypical transfer learning approach ProtoTransfer outperforms state-of-the-art unsupervised meta-learning methods on few-shot tasks from the mini-ImageNet dataset. In few-shot experiments with domain shift, our approach even has comparable performance to supervised methods, but requires orders of magnitude fewer labels. △ Less

Submitted 19 June, 2020; originally announced June 2020.

Comments: Extended version of work presented at the 7th ICML Workshop on Automated Machine Learning (2020). Code available at https://github.com/indy-lab/ProtoTransfer ; 17 pages, 3 figures, 12 tables

arXiv:1911.11658 [pdf, other]

A User Study of Perceived Carbon Footprint

Authors: Victor Kristof, Valentin Quelquejay-Leclère, Robin Zbinden, Lucas Maystre, Matthias Grossglauser, Patrick Thiran

Abstract: We propose a statistical model to understand people's perception of their carbon footprint. Driven by the observation that few people think of CO2 impact in absolute terms, we design a system to probe people's perception from simple pairwise comparisons of the relative carbon footprint of their actions. The formulation of the model enables us to take an active-learning approach to selecting the pa… ▽ More We propose a statistical model to understand people's perception of their carbon footprint. Driven by the observation that few people think of CO2 impact in absolute terms, we design a system to probe people's perception from simple pairwise comparisons of the relative carbon footprint of their actions. The formulation of the model enables us to take an active-learning approach to selecting the pairs of actions that are maximally informative about the model parameters. We define a set of 18 actions and collect a dataset of 2183 comparisons from 176 users on a university campus. The early results reveal promising directions to improve climate communication and enhance climate mitigation. △ Less

Submitted 4 December, 2019; v1 submitted 26 November, 2019; originally announced November 2019.

arXiv:1911.00292 [pdf, other]

Learning Hawkes Processes from a Handful of Events

Authors: Farnood Salehi, William Trouleau, Matthias Grossglauser, Patrick Thiran

Abstract: Learning the causal-interaction network of multivariate Hawkes processes is a useful task in many applications. Maximum-likelihood estimation is the most common approach to solve the problem in the presence of long observation sequences. However, when only short sequences are available, the lack of data amplifies the risk of overfitting and regularization becomes critical. Due to the challenges of… ▽ More Learning the causal-interaction network of multivariate Hawkes processes is a useful task in many applications. Maximum-likelihood estimation is the most common approach to solve the problem in the presence of long observation sequences. However, when only short sequences are available, the lack of data amplifies the risk of overfitting and regularization becomes critical. Due to the challenges of hyper-parameter tuning, state-of-the-art methods only parameterize regularizers by a single shared hyper-parameter, hence limiting the power of representation of the model. To solve both issues, we develop in this work an efficient algorithm based on variational expectation-maximization. Our approach is able to optimize over an extended set of hyper-parameters. It is also able to take into account the uncertainty in the model parameters by learning a posterior distribution over them. Experimental results on both synthetic and real datasets show that our approach significantly outperforms state-of-the-art methods under short observation sequences. △ Less

Submitted 1 November, 2019; originally announced November 2019.

Comments: Appearing at NeurIPS 2019

arXiv:1905.13613 [pdf, other]

Regression Networks for Meta-Learning Few-Shot Classification

Authors: Arnout Devos, Matthias Grossglauser

Abstract: We propose regression networks for the problem of few-shot classification, where a classifier must generalize to new classes not seen in the training set, given only a small number of examples of each class. In high dimensional embedding spaces the direction of data generally contains richer information than magnitude. Next to this, state-of-the-art few-shot metric methods that compare distances w… ▽ More We propose regression networks for the problem of few-shot classification, where a classifier must generalize to new classes not seen in the training set, given only a small number of examples of each class. In high dimensional embedding spaces the direction of data generally contains richer information than magnitude. Next to this, state-of-the-art few-shot metric methods that compare distances with aggregated class representations, have shown superior performance. Combining these two insights, we propose to meta-learn classification of embedded points by regressing the closest approximation in every class subspace while using the regression error as a distance metric. Similarly to recent approaches for few-shot learning, regression networks reflect a simple inductive bias that is beneficial in this limited-data regime and they achieve excellent results, especially when more aggregate class representations can be formed with multiple shots. △ Less

Submitted 18 June, 2020; v1 submitted 31 May, 2019; originally announced May 2019.

Comments: 7th ICML Workshop on Automated Machine Learning (2020)

Journal ref: ICML Workshop on Automated Machine Learning (2020)

arXiv:1905.05049 [pdf, other]

Scalable and Efficient Comparison-based Search without Features

Authors: Daniyar Chumbalov, Lucas Maystre, Matthias Grossglauser

Abstract: We consider the problem of finding a target object $t$ using pairwise comparisons, by asking an oracle questions of the form \emph{"Which object from the pair $(i,j)$ is more similar to $t$?"}. Objects live in a space of latent features, from which the oracle generates noisy answers. First, we consider the {\em non-blind} setting where these features are accessible. We propose a new Bayesian compa… ▽ More We consider the problem of finding a target object $t$ using pairwise comparisons, by asking an oracle questions of the form \emph{"Which object from the pair $(i,j)$ is more similar to $t$?"}. Objects live in a space of latent features, from which the oracle generates noisy answers. First, we consider the {\em non-blind} setting where these features are accessible. We propose a new Bayesian comparison-based search algorithm with noisy answers; it has low computational complexity yet is efficient in the number of queries. We provide theoretical guarantees, deriving the form of the optimal query and proving almost sure convergence to the target $t$. Second, we consider the \emph{blind} setting, where the object features are hidden from the search algorithm. In this setting, we combine our search method and a new distributional triplet embedding algorithm into one scalable learning framework called \textsc{Learn2Search}. We show that the query complexity of our approach on two real-world datasets is on par with the non-blind setting, which is not achievable using any of the current state-of-the-art embedding methods. Finally, we demonstrate the efficacy of our framework by conducting an experiment with users searching for movie actors. △ Less

Submitted 3 September, 2020; v1 submitted 13 May, 2019; originally announced May 2019.

arXiv:1903.07746 [pdf, other]

Pairwise Comparisons with Flexible Time-Dynamics

Authors: Lucas Maystre, Victor Kristof, Matthias Grossglauser

Abstract: Inspired by applications in sports where the skill of players or teams competing against each other varies over time, we propose a probabilistic model of pairwise-comparison outcomes that can capture a wide range of time dynamics. We achieve this by replacing the static parameters of a class of popular pairwise-comparison models by continuous-time Gaussian processes; the covariance function of the… ▽ More Inspired by applications in sports where the skill of players or teams competing against each other varies over time, we propose a probabilistic model of pairwise-comparison outcomes that can capture a wide range of time dynamics. We achieve this by replacing the static parameters of a class of popular pairwise-comparison models by continuous-time Gaussian processes; the covariance function of these processes enables expressive dynamics. We develop an efficient inference algorithm that computes an approximate Bayesian posterior distribution. Despite the flexbility of our model, our inference algorithm requires only a few linear-time iterations over the data and can take advantage of modern multiprocessor computer architectures. We apply our model to several historical databases of sports outcomes and find that our approach outperforms competing approaches in terms of predictive performance, scales to millions of observations, and generates compelling visualizations that help in understanding and interpreting the data. △ Less

Submitted 17 May, 2019; v1 submitted 18 March, 2019; originally announced March 2019.

Comments: Accepted at KDD 2019

arXiv:1804.10029 [pdf, other]

doi 10.1109/TCBB.2019.2914050

MPGM: Scalable and Accurate Multiple Network Alignment

Authors: Ehsan Kazemi, Matthias Grossglauser

Abstract: Protein-protein interaction (PPI) network alignment is a canonical operation to transfer biological knowledge among species. The alignment of PPI-networks has many applications, such as the prediction of protein function, detection of conserved network motifs, and the reconstruction of species' phylogenetic relationships. A good multiple-network alignment (MNA), by considering the data related to… ▽ More Protein-protein interaction (PPI) network alignment is a canonical operation to transfer biological knowledge among species. The alignment of PPI-networks has many applications, such as the prediction of protein function, detection of conserved network motifs, and the reconstruction of species' phylogenetic relationships. A good multiple-network alignment (MNA), by considering the data related to several species, provides a deep understanding of biological networks and system-level cellular processes. With the massive amounts of available PPI data and the increasing number of known PPI networks, the problem of MNA is gaining more attention in the systems-biology studies. In this paper, we introduce a new scalable and accurate algorithm, called MPGM, for aligning multiple networks. The MPGM algorithm has two main steps: (i) SEEDGENERATION and (ii) MULTIPLEPERCOLATION. In the first step, to generate an initial set of seed tuples, the SEEDGENERATION algorithm uses only protein sequence similarities. In the second step, to align remaining unmatched nodes, the MULTIPLEPERCOLATION algorithm uses network structures and the seed tuples generated from the first step. We show that, with respect to different evaluation criteria, MPGM outperforms the other state-of-the-art algorithms. In addition, we guarantee the performance of MPGM under certain classes of network models. We introduce a sampling-based stochastic model for generating k correlated networks. We prove that for this model, if a sufficient number of seed tuples are available, the MULTIPLEPERCOLATION algorithm correctly aligns almost all the nodes. Our theoretical results are supported by experimental evaluations over synthetic networks. △ Less

Submitted 13 May, 2019; v1 submitted 26 April, 2018; originally announced April 2018.

Journal ref: IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2019

arXiv:1804.09758 [pdf, other]

Analysis of a Canonical Labeling Algorithm for the Alignment of Correlated Erdős-Rényi Graphs

Authors: Osman Emre Dai, Daniel Cullina, Negar Kiyavash, Matthias Grossglauser

Abstract: Graph alignment in two correlated random graphs refers to the task of identifying the correspondence between vertex sets of the graphs. Recent results have characterized the exact information-theoretic threshold for graph alignment in correlated Erdős-Rényi graphs. However, very little is known about the existence of efficient algorithms to achieve graph alignment without seeds. In this work we… ▽ More Graph alignment in two correlated random graphs refers to the task of identifying the correspondence between vertex sets of the graphs. Recent results have characterized the exact information-theoretic threshold for graph alignment in correlated Erdős-Rényi graphs. However, very little is known about the existence of efficient algorithms to achieve graph alignment without seeds. In this work we identify a region in which a straightforward $O(n^{11/5} \log n )$-time canonical labeling algorithm, initially introduced in the context of graph isomorphism, succeeds in aligning correlated Erdős-Rényi graphs. The algorithm has two steps. In the first step, all vertices are labeled by their degrees and a trivial minimum distance alignment (i.e., sorting vertices according to their degrees) matches a fixed number of highest degree vertices in the two graphs. Having identified this subset of vertices, the remaining vertices are matched using a alignment algorithm for bipartite graphs. △ Less

Submitted 1 September, 2019; v1 submitted 25 April, 2018; originally announced April 2018.

arXiv:1801.04159 [pdf, other]

doi 10.1145/3219819.3219979

Can Who-Edits-What Predict Edit Survival?

Authors: Ali Batuhan Yardım, Victor Kristof, Lucas Maystre, Matthias Grossglauser

Abstract: As the number of contributors to online peer-production systems grows, it becomes increasingly important to predict whether the edits that users make will eventually be beneficial to the project. Existing solutions either rely on a user reputation system or consist of a highly specialized predictor that is tailored to a specific peer-production system. In this work, we explore a different point in… ▽ More As the number of contributors to online peer-production systems grows, it becomes increasingly important to predict whether the edits that users make will eventually be beneficial to the project. Existing solutions either rely on a user reputation system or consist of a highly specialized predictor that is tailored to a specific peer-production system. In this work, we explore a different point in the solution space that goes beyond user reputation but does not involve any content-based feature of the edits. We view each edit as a game between the editor and the component of the project. We posit that the probability that an edit is accepted is a function of the editor's skill, of the difficulty of editing the component and of a user-component interaction term. Our model is broadly applicable, as it only requires observing data about who makes an edit, what the edit affects and whether the edit survives or not. We apply our model on Wikipedia and the Linux kernel, two examples of large-scale peer-production systems, and we seek to understand whether it can effectively predict edit survival: in both cases, we provide a positive answer. Our approach significantly outperforms those based solely on user reputation and bridges the gap with specialized predictors that use content-based features. It is simple to implement, computationally inexpensive, and in addition it enables us to discover interesting structure in the data. △ Less

Submitted 5 July, 2018; v1 submitted 12 January, 2018; originally announced January 2018.

Comments: Accepted at KDD 2018

arXiv:1610.06525 [pdf, other]

ChoiceRank: Identifying Preferences from Node Traffic in Networks

Authors: Lucas Maystre, Matthias Grossglauser

Abstract: Understanding how users navigate in a network is of high interest in many applications. We consider a setting where only aggregate node-level traffic is observed and tackle the task of learning edge transition probabilities. We cast it as a preference learning problem, and we study a model where choices follow Luce's axiom. In this case, the $O(n)$ marginal counts of node visits are a sufficient s… ▽ More Understanding how users navigate in a network is of high interest in many applications. We consider a setting where only aggregate node-level traffic is observed and tackle the task of learning edge transition probabilities. We cast it as a preference learning problem, and we study a model where choices follow Luce's axiom. In this case, the $O(n)$ marginal counts of node visits are a sufficient statistic for the $O(n^2)$ transition probabilities. We show how to make the inference problem well-posed regardless of the network's structure, and we present ChoiceRank, an iterative algorithm that scales to networks that contains billions of nodes and edges. We apply the model to two clickstream datasets and show that it successfully recovers the transition probabilities using only the network structure and marginal (node-level) traffic data. Finally, we also consider an application to mobility networks and apply the model to one year of rides on New York City's bicycle-sharing system. △ Less

Submitted 15 June, 2017; v1 submitted 20 October, 2016; originally announced October 2016.

Comments: Accepted at ICML 2017

arXiv:1609.01176 [pdf, other]

The Player Kernel: Learning Team Strengths Based on Implicit Player Contributions

Authors: Lucas Maystre, Victor Kristof, Antonio J. González Ferrer, Matthias Grossglauser

Abstract: In this work, we draw attention to a connection between skill-based models of game outcomes and Gaussian process classification models. The Gaussian process perspective enables a) a principled way of dealing with uncertainty and b) rich models, specified through kernel functions. Using this connection, we tackle the problem of predicting outcomes of football matches between national teams. We deve… ▽ More In this work, we draw attention to a connection between skill-based models of game outcomes and Gaussian process classification models. The Gaussian process perspective enables a) a principled way of dealing with uncertainty and b) rich models, specified through kernel functions. Using this connection, we tackle the problem of predicting outcomes of football matches between national teams. We develop a player kernel that relates any two football matches through the players lined up on the field. This makes it possible to share knowledge gained from observing matches between clubs (available in large quantities) and matches between national teams (available only in limited quantities). We evaluate our approach on the Euro 2008, 2012 and 2016 final tournaments. △ Less

Submitted 5 September, 2016; originally announced September 2016.

arXiv:1602.00668 [pdf, other]

On the Structure and Efficient Computation of IsoRank Node Similarities

Authors: Ehsan Kazemi, Matthias Grossglauser

Abstract: The alignment of protein-protein interaction (PPI) networks has many applications, such as the detection of conserved biological network motifs, the prediction of protein interactions, and the reconstruction of phylogenetic trees [1, 2, 3]. IsoRank is one of the first global network alignment algorithms [4, 5, 6], where the goal is to match all (or most) of the nodes of two PPI networks. The IsoRa… ▽ More The alignment of protein-protein interaction (PPI) networks has many applications, such as the detection of conserved biological network motifs, the prediction of protein interactions, and the reconstruction of phylogenetic trees [1, 2, 3]. IsoRank is one of the first global network alignment algorithms [4, 5, 6], where the goal is to match all (or most) of the nodes of two PPI networks. The IsoRank algorithm first computes a pairwise node similarity metric, and then generates a matching between the two node sets based on this metric. The metric is a convex combination of a structural similarity score (with weight $ α$) and an extraneous amino-acid sequence similarity score for two proteins (with weight $ 1 - α$). In this short paper, we make two contributions. First, we show that when IsoRank similarity depends only on network structure ($α= 1$), the similarity of two nodes is only a function of their degrees. In other words, IsoRank similarity is invariant to any network rewiring that does not affect the node degrees. This result suggests a reason for the poor performance of IsoRank in structure-only ($ α= 1 $) alignment. Second, using ideas from [7, 8], we develop an approximation algorithm that outperforms IsoRank (including recent versions with better scaling, e.g., [9]) by several orders of magnitude in time and memory complexity, despite only a negligible loss in precision. △ Less

Submitted 24 February, 2016; v1 submitted 1 February, 2016; originally announced February 2016.

Comments: 8 pages and 1 figure

arXiv:1502.05556 [pdf, other]

Just Sort It! A Simple and Effective Approach to Active Preference Learning

Authors: Lucas Maystre, Matthias Grossglauser

Abstract: We address the problem of learning a ranking by using adaptively chosen pairwise comparisons. Our goal is to recover the ranking accurately but to sample the comparisons sparingly. If all comparison outcomes are consistent with the ranking, the optimal solution is to use an efficient sorting algorithm, such as Quicksort. But how do sorting algorithms behave if some comparison outcomes are inconsis… ▽ More We address the problem of learning a ranking by using adaptively chosen pairwise comparisons. Our goal is to recover the ranking accurately but to sample the comparisons sparingly. If all comparison outcomes are consistent with the ranking, the optimal solution is to use an efficient sorting algorithm, such as Quicksort. But how do sorting algorithms behave if some comparison outcomes are inconsistent with the ranking? We give favorable guarantees for Quicksort for the popular Bradley-Terry model, under natural assumptions on the parameters. Furthermore, we empirically demonstrate that sorting algorithms lead to a very simple and effective active learning strategy: repeatedly sort the items. This strategy performs as well as state-of-the-art methods (and much better than random sampling) at a minuscule fraction of the computational cost. △ Less

Submitted 15 June, 2017; v1 submitted 19 February, 2015; originally announced February 2015.

Comments: Accepted at ICML 2017

arXiv:1307.2084 [pdf, other]

Mitigating Epidemics through Mobile Micro-measures

Authors: Mohamed Kafsi, Ehsan Kazemi, Lucas Maystre, Lyudmila Yartseva, Matthias Grossglauser, Patrick Thiran

Abstract: Epidemics of infectious diseases are among the largest threats to the quality of life and the economic and social well-being of develo** countries. The arsenal of measures against such epidemics is well-established, but costly and insufficient to mitigate their impact. In this paper, we argue that mobile technology adds a powerful weapon to this arsenal, because (a) mobile devices endow us with… ▽ More Epidemics of infectious diseases are among the largest threats to the quality of life and the economic and social well-being of develo** countries. The arsenal of measures against such epidemics is well-established, but costly and insufficient to mitigate their impact. In this paper, we argue that mobile technology adds a powerful weapon to this arsenal, because (a) mobile devices endow us with the unprecedented ability to measure and model the detailed behavioral patterns of the affected population, and (b) they enable the delivery of personalized behavioral recommendations to individuals in real time. We combine these two ideas and propose several strategies to generate such recommendations from mobility patterns. The goal of each strategy is a large reduction in infections, with a small impact on the normal course of daily life. We evaluate these strategies over the Orange D4D dataset and show the benefit of mobile micro-measures, even if only a fraction of the population participates. These preliminary results demonstrate the potential of mobile technology to complement other measures like vaccination and quarantines against disease epidemics. △ Less

Submitted 8 July, 2013; originally announced July 2013.

Comments: Presented at NetMob 2013, Boston

arXiv:1212.2831 [pdf, ps, other]

doi 10.1109/TIT.2013.2262497

The Entropy of Conditional Markov Trajectories

Authors: Mohamed Kafsi, Matthias Grossglauser, Patrick Thiran

Abstract: To quantify the randomness of Markov trajectories with fixed initial and final states, Ekroot and Cover proposed a closed-form expression for the entropy of trajectories of an irreducible finite state Markov chain. Numerous applications, including the study of random walks on graphs, require the computation of the entropy of Markov trajectories conditioned on a set of intermediate states. However,… ▽ More To quantify the randomness of Markov trajectories with fixed initial and final states, Ekroot and Cover proposed a closed-form expression for the entropy of trajectories of an irreducible finite state Markov chain. Numerous applications, including the study of random walks on graphs, require the computation of the entropy of Markov trajectories conditioned on a set of intermediate states. However, the expression of Ekroot and Cover does not allow for computing this quantity. In this paper, we propose a method to compute the entropy of conditional Markov trajectories through a transformation of the original Markov chain into a Markov chain that exhibits the desired conditional distribution of trajectories. Moreover, we express the entropy of Markov trajectories - a global quantity - as a linear combination of local entropies associated with the Markov chain states. △ Less

Submitted 14 May, 2013; v1 submitted 12 December, 2012; originally announced December 2012.

Comments: Accepted for publication in IEEE Transactions on Information Theory

arXiv:0909.2504 [pdf, ps, other]

Hierarchical Routing over Dynamic Wireless Networks

Authors: Dominique Tschopp, Suhas Diggavi, Matthias Grossglauser

Abstract: Wireless network topologies change over time and maintaining routes requires frequent updates. Updates are costly in terms of consuming throughput available for data transmission, which is precious in wireless networks. In this paper, we ask whether there exist low-overhead schemes that produce low-stretch routes. This is studied by using the underlying geometric properties of the connectivity g… ▽ More Wireless network topologies change over time and maintaining routes requires frequent updates. Updates are costly in terms of consuming throughput available for data transmission, which is precious in wireless networks. In this paper, we ask whether there exist low-overhead schemes that produce low-stretch routes. This is studied by using the underlying geometric properties of the connectivity graph in wireless networks. △ Less

Submitted 16 September, 2009; v1 submitted 14 September, 2009; originally announced September 2009.

Comments: 29 pages, 19 figures, a shorter version was published in the proceedings of the 2008 ACM Sigmetrics conference

Report number: LICOS-REPORT-2007-005

Showing 1–24 of 24 results for author: Grossglauser, M