-
Federated singular value decomposition for high dimensional data
Authors:
Anne Hartebrodt,
Richard Röttger,
David B. Blumenthal
Abstract:
Federated learning (FL) is emerging as a privacy-aware alternative to classical cloud-based machine learning. In FL, the sensitive data remains in data silos and only aggregated parameters are exchanged. Hospitals and research institutions that are not willing to share their data can join a federated study without breaching confidentiality. In addition to the extreme sensitivity of biomedical data, the high dimensionality poses a challenge in the context of federated genome-wide association studies (GWAS). In this article, we present a federated singular value decomposition (SVD) algorithm, suitable for the privacy-related and computational requirements of GWAS. Notably, the algorithm has a transmission cost independent of the number of samples and is only weakly dependent on the number of features, because the singular vectors associated with the samples are never exchanged and the vectors associated with the features are exchanged only for a fixed number of iterations. Although motivated by GWAS, the algorithm is generically applicable to both horizontally and vertically partitioned data.
Submitted 24 May, 2022;
originally announced May 2022.
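The idea of keeping sample-associated singular vectors local can be illustrated with a minimal sketch of federated subspace iteration on horizontally partitioned data (rows are samples, split across silos). This is not the authors' algorithm: it exchanges the feature-associated matrix in every iteration, and the names `local_update` and `federated_top_k_right_singular_vectors` are hypothetical.

```python
import numpy as np

def local_update(A_i, G):
    """Runs inside one silo. Returns A_i^T (A_i G), a (features x k) matrix.

    The sample-associated product A_i G never leaves the silo.
    """
    return A_i.T @ (A_i @ G)

def federated_top_k_right_singular_vectors(silos, k, iterations=200, seed=0):
    """Subspace iteration on sum_i A_i^T A_i, aggregated across silos.

    Assumption for this sketch: every silo holds the same feature columns.
    """
    rng = np.random.default_rng(seed)
    n_features = silos[0].shape[1]
    # Random orthonormal start, shared with all silos by the aggregator.
    G, _ = np.linalg.qr(rng.standard_normal((n_features, k)))
    for _ in range(iterations):
        # Each silo sends a (features x k) matrix; its size does not
        # depend on how many samples the silo holds.
        H = sum(local_update(A_i, G) for A_i in silos)
        # The aggregator re-orthonormalizes and broadcasts the result.
        G, _ = np.linalg.qr(H)
    return G
```

In this sketch the per-iteration message size is features × k regardless of the sample count. Each silo could then project its own rows onto `G` locally to obtain its share of the sample-associated vectors.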
-
Metric Indexing for Graph Similarity Search
Authors:
Franka Bause,
David B. Blumenthal,
Erich Schubert,
Nils M. Kriege
Abstract:
Finding the graphs that are most similar to a query graph in a large database is a common task with various applications. A widely used similarity measure is the graph edit distance, which provides an intuitive notion of similarity and naturally supports graphs with vertex and edge attributes. Since its computation is NP-hard, techniques for accelerating similarity search have been studied extensively. However, index-based approaches to this task are almost exclusively designed for graphs with categorical vertex and edge labels and uniform edit costs. We propose a filter-verification framework for similarity search, which supports non-uniform edit costs for graphs with arbitrary attributes. We employ an expensive lower bound obtained by solving an optimal assignment problem. This filter distance satisfies the triangle inequality, making it suitable for acceleration by metric indexing. In subsequent stages, assignment-based upper bounds are used to avoid further exact distance computations. Our extensive experimental evaluation shows that a significant runtime advantage over both a linear scan and state-of-the-art methods is achieved.
Submitted 4 October, 2021;
originally announced October 2021.
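The filter-verification pipeline described in this abstract can be sketched on a toy domain. Here the objects are sets rather than graphs, the exact distance is the size of the symmetric difference, and the lower and upper bounds are simple stand-ins chosen only because they provably bracket that distance. All names (`PivotIndex`, `lower_bound`, and so on) are hypothetical, and a single pivot replaces a real metric index.

```python
def exact_distance(a, b):
    """Stand-in for the expensive exact distance (graph edit distance in the paper)."""
    return len(a ^ b)

def lower_bound(a, b):
    """Never exceeds exact_distance and satisfies the triangle inequality."""
    return abs(len(a) - len(b))

def upper_bound(a, b):
    """Never falls below exact_distance."""
    return len(a) + len(b)

class PivotIndex:
    """Single-pivot index over the lower-bound distance (toy metric index)."""

    def __init__(self, database, pivot):
        self.database = list(database)
        self.pivot = pivot
        # Precomputed once: lower-bound distance of every object to the pivot.
        self.to_pivot = [lower_bound(pivot, x) for x in self.database]

    def range_query(self, query, radius):
        q_to_pivot = lower_bound(query, self.pivot)
        hits, exact_calls = [], 0
        for x, x_to_pivot in zip(self.database, self.to_pivot):
            # Triangle inequality: lb(q, x) >= |lb(q, p) - lb(p, x)|.
            if abs(q_to_pivot - x_to_pivot) > radius:
                continue  # pruned without evaluating any bound on (q, x)
            if lower_bound(query, x) > radius:
                continue  # filtered by the lower bound
            if upper_bound(query, x) <= radius:
                hits.append(x)  # accepted without an exact computation
                continue
            exact_calls += 1  # verification step
            if exact_distance(query, x) <= radius:
                hits.append(x)
        return hits, exact_calls
```

The second return value counts how often the exact distance had to be evaluated, which is the quantity the filtering stages are meant to reduce.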
-
The Minimum Edit Arborescence Problem and Its Use in Compressing Graph Collections [Extended Version]
Authors:
Lucas Gnecco,
Nicolas Boria,
Sébastien Bougleux,
Florian Yger,
David B. Blumenthal
Abstract:
The inference of minimum spanning arborescences within a set of objects is a general problem which translates into numerous application-specific unsupervised learning tasks. We introduce a unified and generic structure called edit arborescence that relies on edit paths between data in a collection, as well as the Min Edit Arborescence Problem, which asks for an edit arborescence that minimizes the sum of costs of its inner edit paths. Through the use of suitable cost functions, this generic framework allows a variety of problems to be modeled. In particular, we show that by introducing encoding size preserving edit costs, it can be used as an efficient method for compressing collections of labeled graphs. Experiments on various graph datasets, with comparisons to standard compression tools, show the potential of our method.
Submitted 30 July, 2021;
originally announced July 2021.
-
Federated Multi-Mini-Batch: An Efficient Training Approach to Federated Learning in Non-IID Environments
Authors:
Reza Nasirigerdeh,
Mohammad Bakhtiari,
Reihaneh Torkzadehmahani,
Amirhossein Bayat,
Markus List,
David B. Blumenthal,
Jan Baumbach
Abstract:
Federated learning has faced performance and network communication challenges, especially in environments where the data is not independent and identically distributed (IID) across the clients. To address the former challenge, we introduce the federated-centralized concordance property and show that the federated single-mini-batch training approach can achieve performance comparable to the corresponding centralized training in Non-IID environments. To deal with the latter, we present the federated multi-mini-batch approach and illustrate that it can establish a trade-off between performance and communication efficiency and that it outperforms federated averaging in Non-IID settings.
Submitted 3 July, 2021; v1 submitted 13 November, 2020;
originally announced November 2020.
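The abstract does not define the federated-centralized concordance property. One simple setting in which a federated update coincides with a centralized one can be sketched as follows, assuming a linear model with mean squared error and a server that weights each client's mini-batch gradient by its batch size. The function names are hypothetical and the sketch is an illustration, not the paper's training procedure.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of the mean squared error of a linear model on one batch."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def federated_single_batch_step(w, client_batches, lr):
    """Each client computes one mini-batch gradient; the server averages
    them weighted by batch size and applies a single update."""
    total = sum(len(y) for _, y in client_batches)
    grad = sum(len(y) / total * mse_gradient(w, X, y) for X, y in client_batches)
    return w - lr * grad

def centralized_step(w, client_batches, lr):
    """Same update computed on the pooled batch (only possible if data may be shared)."""
    X = np.vstack([X for X, _ in client_batches])
    y = np.concatenate([y for _, y in client_batches])
    return w - lr * mse_gradient(w, X, y)
```

Under these assumptions the two steps agree up to floating-point error, even when the client batches differ in size and distribution. Letting clients take several local steps before aggregation would reduce communication at the price of losing this exact agreement.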
-
Privacy-preserving Artificial Intelligence Techniques in Biomedicine
Authors:
Reihaneh Torkzadehmahani,
Reza Nasirigerdeh,
David B. Blumenthal,
Tim Kacprowski,
Markus List,
Julian Matschinske,
Julian Späth,
Nina Kerstin Wenke,
Béla Bihari,
Tobias Frisch,
Anne Hartebrodt,
Anne-Christin Hausschild,
Dominik Heider,
Andreas Holzinger,
Walter Hötzendorfer,
Markus Kastelitz,
Rudolf Mayer,
Cristian Nogales,
Anastasia Pustozerova,
Richard Röttger,
Harald H. H. W. Schmidt,
Ameli Schwalber,
Christof Tschohl,
Andrea Wohner,
Jan Baumbach
Abstract:
Artificial intelligence (AI) has been successfully applied in numerous scientific domains. In biomedicine, AI has already shown tremendous potential, e.g. in the interpretation of next-generation sequencing data and in the design of clinical decision support systems. However, training an AI model on sensitive data raises concerns about the privacy of individual participants. For example, summary statistics of a genome-wide association study can be used to determine the presence or absence of an individual in a given dataset. This considerable privacy risk has led to restrictions in accessing genomic and other biomedical data, which is detrimental to collaborative research and impedes scientific progress. Hence, there has been a substantial effort to develop AI methods that can learn from sensitive data while protecting individuals' privacy. This paper provides a structured overview of recent advances in privacy-preserving AI techniques in biomedicine. It places the most important state-of-the-art approaches within a unified taxonomy and discusses their strengths, limitations, and open problems. As the most promising direction, we suggest combining federated machine learning, as a more scalable approach, with additional privacy-preserving techniques. This would merge their advantages and provide privacy guarantees in a distributed way for biomedical applications. Nonetheless, more research is necessary as hybrid approaches pose new challenges such as additional network or computation overhead.
Submitted 6 November, 2020; v1 submitted 22 July, 2020;
originally announced July 2020.
-
New Techniques for Graph Edit Distance Computation
Authors:
David B. Blumenthal
Abstract:
Due to their capacity to encode rich structural information, labeled graphs are often used for modeling various kinds of objects such as images, molecules, and chemical compounds. If pattern recognition problems such as clustering and classification are to be solved on these domains, a (dis-)similarity measure for labeled graphs has to be defined. A widely used measure is the graph edit distance (GED), which, intuitively, is defined as the minimum amount of distortion that has to be applied to a source graph in order to transform it into a target graph. The main advantage of GED is its flexibility and sensitivity to small differences between the input graphs. Its main drawback is that it is hard to compute.
In this thesis, new results and techniques for several aspects of computing GED are presented. Firstly, theoretical aspects are discussed: competing definitions of GED are harmonized, the problem of computing GED is characterized in terms of complexity, and several reductions from GED to the quadratic assignment problem (QAP) are presented. Secondly, solvers for the linear sum assignment problem with error-correction (LSAPE) are discussed. LSAPE is a generalization of the well-known linear sum assignment problem (LSAP), and has to be solved as a subproblem by many GED algorithms. In particular, a new solver is presented that efficiently reduces LSAPE to LSAP. Thirdly, exact algorithms for computing GED are presented in a systematic way, and improvements of existing algorithms as well as a new mixed integer programming (MIP) based approach are introduced. Fourthly, a detailed overview of heuristic algorithms that approximate GED via upper and lower bounds is provided, and eight new heuristics are described. Finally, a new easily extensible C++ library for exactly or approximately computing GED is presented.
Submitted 1 August, 2019;
originally announced August 2019.
-
Improved local search for graph edit distance
Authors:
Nicolas Boria,
David B. Blumenthal,
Sébastien Bougleux,
Luc Brun
Abstract:
The graph edit distance (GED) measures the dissimilarity between two graphs as the minimal cost of a sequence of elementary operations transforming one graph into another. This measure is fundamental in many areas such as structural pattern recognition or classification. However, exactly computing GED is NP-hard. Among different classes of heuristic algorithms that were proposed to compute approximate solutions, local search based algorithms provide the tightest upper bounds for GED. In this paper, we present K-REFINE and RANDPOST. K-REFINE generalizes and improves an existing local search algorithm and performs particularly well on small graphs. RANDPOST is a general warm start framework that stochastically generates promising initial solutions to be used by any local search based GED algorithm. It is particularly efficient on large graphs. An extensive empirical evaluation demonstrates that both K-REFINE and RANDPOST perform excellently in practice.
Submitted 26 November, 2019; v1 submitted 5 July, 2019;
originally announced July 2019.
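Why local search yields upper bounds for GED can be seen in a minimal sketch: every node map induces an edit path, so its cost bounds GED from above, and swapping assignments can only tighten that bound. The code below is a plain 2-swap search, not K-REFINE or RANDPOST. It assumes unit edit costs, undirected simple graphs with the same number of nodes, and node maps that are permutations; the function names are hypothetical.

```python
from itertools import combinations

def induced_cost(labels1, edges1, labels2, edges2, node_map):
    """Cost of the edit path induced by node_map (an upper bound on GED).

    node_map[i] is the node of graph 2 assigned to node i of graph 1.
    """
    substitutions = sum(
        labels1[i] != labels2[node_map[i]] for i in range(len(node_map))
    )
    mapped = {frozenset((node_map[u], node_map[v])) for u, v in edges1}
    target = {frozenset(e) for e in edges2}
    # Edges present on only one side must be deleted or inserted.
    return substitutions + len(mapped ^ target)

def two_swap_local_search(labels1, edges1, labels2, edges2, node_map):
    """Repeatedly swap two assignments while the induced cost decreases."""
    node_map = list(node_map)
    best = induced_cost(labels1, edges1, labels2, edges2, node_map)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(node_map)), 2):
            node_map[i], node_map[j] = node_map[j], node_map[i]
            cost = induced_cost(labels1, edges1, labels2, edges2, node_map)
            if cost < best:
                best, improved = cost, True  # keep the swap
            else:
                node_map[i], node_map[j] = node_map[j], node_map[i]  # undo
    return node_map, best
```

The result depends on the initial node map, which is why a scheme that generates good starting solutions can matter for this family of heuristics.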
-
Upper Bounding the Graph Edit Distance Based on Rings and Machine Learning
Authors:
David B. Blumenthal,
Johann Gamper,
Sébastien Bougleux,
Luc Brun
Abstract:
The graph edit distance (GED) is a flexible distance measure which is widely used for inexact graph matching. Since its exact computation is NP-hard, heuristics are used in practice. A popular approach is to obtain upper bounds for GED via transformations to the linear sum assignment problem with error-correction (LSAPE). Typically, local structures and distances between them are employed for carrying out this transformation, but recently machine learning techniques have also been used. In this paper, we formally define a unifying framework LSAPE-GED for transformations from GED to LSAPE. We also introduce rings, a new kind of local structure designed for graphs where most information resides in the topology rather than in the node labels. Furthermore, we propose two new ring based heuristics RING and RING-ML, which instantiate LSAPE-GED using the traditional and the machine learning based approach for transforming GED to LSAPE, respectively. Extensive experiments show that using rings for upper bounding GED significantly improves the state of the art on datasets where most information resides in the graphs' topologies. This closes the gap between fast but rather inaccurate LSAPE based heuristics and more accurate but significantly slower GED algorithms based on local search.
Submitted 28 January, 2021; v1 submitted 29 June, 2019;
originally announced July 2019.
-
Finding k-Dissimilar Paths with Minimum Collective Length
Authors:
Theodoros Chondrogiannis,
Panagiotis Bouros,
Johann Gamper,
Ulf Leser,
David B. Blumenthal
Abstract:
Shortest path computation is a fundamental problem in road networks. However, in many real-world scenarios, determining solely the shortest path is not enough. In this paper, we study the problem of finding k-Dissimilar Paths with Minimum Collective Length (kDPwML), which aims at computing a set of paths from a source s to a target t such that all paths are pairwise dissimilar by at least θ and the sum of the path lengths is minimal. We introduce an exact algorithm for the kDPwML problem, which iterates over all possible s-t paths while employing two pruning techniques to reduce the prohibitively expensive computational cost. To achieve scalability, we also define the much smaller set of the simple single-via paths, and we adapt two algorithms for kDPwML queries to iterate over this set. Our experimental analysis on real road networks shows that iterating over all paths is impractical, while iterating over the set of simple single-via paths can lead to scalable solutions with only a small trade-off in the quality of the results.
Submitted 24 October, 2018; v1 submitted 18 September, 2018;
originally announced September 2018.
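The kDPwML problem statement can be made concrete with a brute-force sketch that enumerates all simple s-t paths of a tiny directed graph and picks the cheapest set of k pairwise dissimilar ones. It omits the pruning techniques and the single-via restriction mentioned in the abstract. The dissimilarity measure used here (one minus shared edge length over the shorter path's length) is an assumption, since the abstract does not specify one, and the sketch assumes s ≠ t and positive edge weights. All function names are hypothetical.

```python
from itertools import combinations

def simple_paths(graph, source, target, path=None):
    """Yield every simple path from source to target (graph: node -> {neighbor: weight})."""
    path = [source] if path is None else path
    if source == target:
        yield list(path)
        return
    for nxt in graph[source]:
        if nxt not in path:
            path.append(nxt)
            yield from simple_paths(graph, nxt, target, path)
            path.pop()

def length(graph, path):
    return sum(graph[u][v] for u, v in zip(path, path[1:]))

def dissimilarity(graph, p, q):
    """Assumed measure: 1 - (length of shared edges) / (length of the shorter path)."""
    shared_edges = set(zip(p, p[1:])) & set(zip(q, q[1:]))
    shared = sum(graph[u][v] for u, v in shared_edges)
    return 1.0 - shared / min(length(graph, p), length(graph, q))

def k_dissimilar_paths(graph, source, target, k, theta):
    """Exhaustive search; exponential, suitable for toy graphs only."""
    paths = list(simple_paths(graph, source, target))
    best, best_total = None, float("inf")
    for combo in combinations(paths, k):
        if all(dissimilarity(graph, p, q) >= theta
               for p, q in combinations(combo, 2)):
            total = sum(length(graph, p) for p in combo)
            if total < best_total:
                best, best_total = combo, total
    return best, best_total
```

Even on small graphs the number of simple paths grows quickly, which is consistent with the abstract's observation that iterating over all paths is impractical on real road networks.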