KaMPIng: Flexible and (Near) Zero-overhead C++ Bindings for MPI

Authors: Demian Hespe, Lukas Hübner, Florian Kurpicz, Peter Sanders, Matthias Schimek, Daniel Seemaier, Christoph Stelz, Tim Niklas Uhl

Abstract: The Message-Passing Interface (MPI) and C++ form the backbone of high-performance computing, but MPI only provides C and Fortran bindings. While this offers great language interoperability, high-level programming languages like C++ make software development quicker and less error-prone. We propose novel C++ language bindings that cover all abstraction levels from low-level MPI calls to convenient STL-style bindings, where most parameters are inferred from a small subset of parameters, by bringing named parameters to C++. This enables rapid prototyping and fine-tuning runtime behavior and memory management. A flexible type system and additional safeness guarantees help to prevent programming errors. By exploiting C++'s template-metaprogramming capabilities, this has (near) zero overhead, as only required code paths are generated at compile time. We demonstrate that our library is a strong foundation for a future distributed standard library using multiple application benchmarks, ranging from text-book sorting algorithms to phylogenetic inference.

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2203.01107 [pdf, other]

doi 10.1109/FTXS56515.2022.00008

ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms

Authors: Lukas Hübner, Demian Hespe, Peter Sanders, Alexandros Stamatakis

Abstract: Fault-tolerant distributed applications require mechanisms to recover data lost via a process failure. On modern cluster systems it is typically impractical to request replacement resources after such a failure. Therefore, applications have to continue working with the remaining resources. This requires redistributing the workload and that the non-failed processes reload data. We present an algorithmic framework and its C++ library implementation ReStore for MPI programs that enables recovery of data after process failures. By storing all required data in memory via an appropriate data distribution and replication, recovery is substantially faster than with standard checkpointing schemes that rely on a parallel file system. As the application developer can specify which data to load, we also support shrinking recovery instead of recovery using spare compute nodes. We evaluate ReStore in both controlled, isolated environments and real applications. Our experiments show loading times of lost input data in the range of milliseconds on up to 24 576 processors and a substantial speedup of the recovery time for the fault-tolerant version of a widely used bioinformatics application.

Submitted 25 January, 2023; v1 submitted 2 March, 2022; originally announced March 2022.

Journal ref: 2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), Dallas, TX, USA, 2022, pp. 24-35

arXiv:2102.01540 [pdf, other]

Targeted Branching for the Maximum Independent Set Problem

Authors: Demian Hespe, Sebastian Lamm, Christian Schorr

Abstract: Finding a maximum independent set is a fundamental NP-hard problem that is used in many real-world applications. Given an unweighted graph, this problem asks for a maximum cardinality set of pairwise non-adjacent vertices. Some of the most successful algorithms for this problem are based on the branch-and-bound or branch-and-reduce paradigms. In particular, branch-and-reduce algorithms, which combine branch-and-bound with reduction rules, achieved substantial results, solving many previously infeasible instances. These results were to a large part achieved by developing new, more practical reduction rules. However, other components that have been shown to have an impact on the performance of these algorithms have not received as much attention. One of these is the branching strategy, which determines what vertex is included or excluded in a potential solution. The most commonly used strategy selects vertices based on their degree and does not take into account other factors that contribute to the performance. In this work, we develop and evaluate several novel branching strategies for both branch-and-bound and branch-and-reduce algorithms. Our strategies are based on one of two approaches. They either (1) aim to decompose the graph into two or more connected components which can then be solved independently, or (2) try to remove vertices that hinder the application of a reduction rule. Our experiments on a large set of real-world instances indicate that our strategies are able to improve the performance of the state-of-the-art branch-and-reduce algorithms.
To be more specific, our reduction-based packing branching rule is able to outperform the default branching strategy of selecting a vertex of highest degree on 65% of all instances tested. Furthermore, our decomposition-based strategy based on edge cuts is able to achieve a speedup of 2.29 on sparse networks (1.22 on all instances).

Submitted 29 March, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

arXiv:1908.06795 [pdf, other]

WeGotYouCovered: The Winning Solver from the PACE 2019 Implementation Challenge, Vertex Cover Track

Authors: Demian Hespe, Sebastian Lamm, Christian Schulz, Darren Strash

Abstract: We present the winning solver of the PACE 2019 Implementation Challenge, Vertex Cover Track. The minimum vertex cover problem is one of a handful of problems for which kernelization---the repeated reducing of the input size via data reduction rules---is known to be highly effective in practice. Our algorithm uses a portfolio of techniques, including an aggressive kernelization strategy, local search, branch-and-reduce, and a state-of-the-art branch-and-bound solver. Of particular interest is that several of our techniques were not from the literature on the vertex cover problem: they were originally published to solve the (complementary) maximum independent set and maximum clique problems. Aside from illustrating our solver's performance in the PACE 2019 Implementation Challenge, our experiments provide several key insights not yet seen before in the literature. First, kernelization can boost the performance of branch-and-bound clique solvers enough to outperform branch-and-reduce solvers. Second, local search can significantly boost the performance of branch-and-reduce solvers. And finally, somewhat surprisingly, kernelization can sometimes make branch-and-bound algorithms perform worse than running branch-and-bound alone.

Submitted 20 August, 2019; v1 submitted 19 August, 2019; originally announced August 2019.

arXiv:1907.03535 [pdf, other]

More Hierarchy in Route Planning Using Edge Hierarchies

Authors: Demian Hespe, Peter Sanders

Abstract: A highly successful approach to route planning in networks (particularly road networks) is to identify a hierarchy in the network that allows faster queries after some preprocessing that basically inserts additional "shortcut"-edges into a graph. In the past there has been a succession of techniques that infer a more and more fine grained hierarchy enabling increasingly more efficient queries. This appeared to culminate in contraction hierarchies that assign one hierarchy level to each vertex. In this paper we show how to identify an even more fine grained hierarchy that assigns one level to each edge of the network. Our findings indicate that this can lead to considerably smaller search spaces in terms of visited edges. Currently, this rarely implies improved query times so that it remains an open question whether edge hierarchies can lead to consistently improved performance. However, we believe that the technique as such is a noteworthy enrichment of the portfolio of available techniques that might prove useful in the future.

Submitted 10 September, 2019; v1 submitted 8 July, 2019; originally announced July 2019.

arXiv:1905.10902 [pdf, other]

Engineering Kernelization for Maximum Cut

Authors: Damir Ferizovic, Demian Hespe, Sebastian Lamm, Matthias Mnich, Christian Schulz, Darren Strash

Abstract: Kernelization is a general theoretical framework for preprocessing instances of NP-hard problems into (generally smaller) instances with bounded size, via the repeated application of data reduction rules. For the fundamental Max Cut problem, kernelization algorithms are theoretically highly efficient for various parameterizations. However, the efficacy of these reduction rules in practice---to aid solving highly challenging benchmark instances to optimality---remains entirely unexplored. We engineer a new suite of efficient data reduction rules that subsume most of the previously published rules, and demonstrate their significant impact on benchmark data sets, including synthetic instances, and data sets from the VLSI and image segmentation application domains. Our experiments reveal that current state-of-the-art solvers can be sped up by up to multiple orders of magnitude when combined with our data reduction rules. On social and biological networks in particular, kernelization enables us to solve four instances that were previously unsolved in a ten-hour time limit with state-of-the-art solvers; three of these instances are now solved in less than two seconds.

Submitted 26 May, 2019; originally announced May 2019.

Comments: 16 pages, 4 tables, 2 figures

ACM Class: F.2.2; G.2.2

arXiv:1709.05183 [pdf, other]

Fast OLAP Query Execution in Main Memory on Large Data in a Cluster

Authors: Demian Hespe, Martin Weidner, Jonathan Dees, Peter Sanders

Abstract: Main memory column-stores have proven to be efficient for processing analytical queries. Still, there has been much less work in the context of clusters. Using only a single machine poses several restrictions: Processing power and data volume are bounded by the number of cores and main memory fitting on one tightly coupled system. To enable the processing of larger data sets, switching to a cluster becomes necessary. In this work, we explore techniques for efficient execution of analytical SQL queries on large amounts of data in a parallel database cluster while making maximal use of the available hardware. This includes precompiled query plans for efficient CPU utilization, full parallelization on single nodes and across the cluster, and efficient inter-node communication. We implement all features in a prototype for running a subset of TPC-H benchmark queries. We evaluate our implementation using a 128 node cluster running TPC-H queries with 30 000 gigabytes of uncompressed data.

Submitted 15 September, 2017; originally announced September 2017.

arXiv:1708.06151 [pdf, other]

Scalable Kernelization for Maximum Independent Sets

Authors: Demian Hespe, Christian Schulz, Darren Strash

Abstract: The most efficient algorithms for finding maximum independent sets in both theory and practice use reduction rules to obtain a much smaller problem instance called a kernel. The kernel can then be solved quickly using exact or heuristic algorithms---or by repeatedly kernelizing recursively in the branch-and-reduce paradigm. It is of critical importance for these algorithms that kernelization is fast and returns a small kernel. Current algorithms are either slow but produce a small kernel, or fast and give a large kernel. We attempt to accomplish both of these goals simultaneously, by giving an efficient parallel kernelization algorithm based on graph partitioning and parallel bipartite maximum matching. We combine our parallelization techniques with two techniques to accelerate kernelization further: dependency checking that prunes reductions that cannot be applied, and reduction tracking that allows us to stop kernelization when reductions become less fruitful. Our algorithm produces kernels that are orders of magnitude smaller than the fastest kernelization methods, while having a similar execution time. Furthermore, our algorithm is able to compute kernels with size comparable to the smallest known kernels, but up to two orders of magnitude faster than previously possible. Finally, we show that our kernelization algorithm can be used to accelerate existing state-of-the-art heuristic algorithms, allowing us to find larger independent sets faster on large real-world networks and synthetic instances.

Submitted 10 September, 2019; v1 submitted 21 August, 2017; originally announced August 2017.

Comments: Extended version

Showing 1–8 of 8 results for author: Hespe, D