Search | arXiv e-print repository

Towards Communication-Efficient Peer-to-Peer Networks

Authors: Khalid Hourani, William K. Moses Jr., Gopal Pandurangan

Abstract: We focus on designing Peer-to-Peer (P2P) networks that enable efficient communication. Over the last two decades, there has been substantial algorithmic research on distributed protocols for building P2P networks with various desirable properties such as high expansion, low diameter, and robustness to a large number of deletions. A key underlying theme in all of these works is to distributively build a \emph{random graph} topology that guarantees the above properties. Moreover, the random connectivity topology is widely deployed in many P2P systems today, including those that implement blockchains and cryptocurrencies. However, a major drawback of using a random graph topology for a P2P network is that the random topology does not respect the \emph{underlying} (Internet) communication topology. This creates a large \emph{propagation delay}, which is a major communication bottleneck in modern P2P networks. In this paper, we work towards designing P2P networks that are communication-efficient (having small propagation delay) with provable guarantees. Our main contribution is an efficient, decentralized protocol, $\textsc{Close-Weaver}$, that transforms a random graph topology embedded in an underlying Euclidean space into a topology that also respects the underlying metric. We then present efficient point-to-point routing and broadcast protocols that achieve essentially optimal performance with respect to the underlying space.

Submitted 24 June, 2024; originally announced June 2024.

Comments: 33 pages, 7 figures, full version of paper to appear in ESA 2024

ACM Class: F.2

arXiv:2406.08843 [pdf, other]

Input-Gen: Guided Generation of Stateful Inputs for Testing, Tuning, and Training

Authors: Ivan R. Ivanov, Joachim Meyer, Aiden Grossman, William S. Moses, Johannes Doerfert

Abstract: The size and complexity of software applications is increasing at an accelerating pace. Source code repositories (along with their dependencies) require vast amounts of labor to keep them tested, maintained, and up to date. As the discipline now begins to also incorporate automatically generated programs, automation in testing and tuning is required to keep up with the pace - let alone reduce the present level of complexity. While machine learning has been used to understand and generate code in various contexts, machine learning models themselves are trained almost exclusively on static code without inputs, traces, or other execution time information. This lack of training data limits the ability of these models to understand real-world problems in software. In this work we show that inputs, like code, can be generated automatically at scale. Our generated inputs are stateful, and appear to faithfully reproduce the arbitrary data structures and system calls required to rerun a program function. By building our tool within the compiler, it both can be applied to arbitrary programming languages and architectures and can leverage static analysis and transformations for improved performance. Our approach is able to produce valid inputs, including initial memory states, for 90% of the ComPile dataset modules we explored, for a total of 21.4 million executable functions. Further, we find that a single generated input results in an average block coverage of 37%, whereas guided generation of five inputs improves it to 45%.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2312.07140 [pdf, other]

Exploiting Automorphisms of Temporal Graphs for Fast Exploration and Rendezvous

Authors: Konstantinos Dogeas, Thomas Erlebach, Frank Kammer, Johannes Meintrup, William K. Moses Jr

Abstract: Temporal graphs are dynamic graphs where the edge set can change in each time step, while the vertex set stays the same. Exploration of temporal graphs whose snapshot in each time step is a connected graph, called connected temporal graphs, has been widely studied. In this paper, we extend the concept of graph automorphisms from static graphs to temporal graphs for the first time and show that symmetries enable faster exploration: We prove that a connected temporal graph with $n$ vertices and orbit number $r$ (i.e., $r$~is the number of automorphism orbits) can be explored in $O(r n^{1+ε})$ time steps, for any fixed $ε>0$. For $r=O(n^c)$ for constant $c<1$, this is a significant improvement over the known tight worst-case bound of $Θ(n^2)$ time steps for arbitrary connected temporal graphs. We also give two lower bounds for temporal exploration, showing that $Ω(n \log n)$ time steps are required for some inputs with $r=O(1)$ and that $Ω(rn)$ time steps are required for some inputs for any $r$ with $1\le r\le n$. Moreover, we show that the techniques we develop for fast exploration can be used to derive the following result for rendezvous: Two agents with different programs and without communication ability are placed by an adversary at arbitrary vertices and given full information about the connected temporal graph, except that they do not have consistent vertex labels. Then the two agents can meet at a common vertex after $O(n^{1+ε})$ time steps, for any constant $ε>0$.
For some connected temporal graphs with the orbit number being a constant, we also present a complementary lower bound of $Ω(n\log n)$ time steps.

Submitted 12 December, 2023; originally announced December 2023.
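
The improvement claimed in the abstract above for $r=O(n^c)$ can be checked with a short calculation. The following is only a worked sketch: the particular choice $ε=(1-c)/2$ is an illustrative assumption and does not come from the paper.

```latex
% Substituting r = O(n^c), with constant c < 1, into the exploration bound:
\[
  O\!\left(r\, n^{1+\varepsilon}\right) = O\!\left(n^{\,c+1+\varepsilon}\right).
\]
% This is polynomially below the general \Theta(n^2) bound exactly when
\[
  c + 1 + \varepsilon < 2 \iff \varepsilon < 1 - c .
\]
% Since \varepsilon may be any fixed positive constant, one admissible choice
% (an assumption made here for illustration) is \varepsilon = (1-c)/2, giving
\[
  O\!\left(n^{\,c+1+\frac{1-c}{2}}\right) = O\!\left(n^{\,2-\frac{1-c}{2}}\right) = o\!\left(n^{2}\right).
\]
```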

arXiv:2311.17115 [pdf, ps, other]

Time- and Communication-Efficient Overlay Network Construction via Gossip

Authors: Fabien Dufoulon, Michael Moorman, William K. Moses Jr., Gopal Pandurangan

Abstract: We focus on the well-studied problem of distributed overlay network construction. We consider a synchronous gossip-based communication model where in each round a node can send a message of small size to another node whose identifier it knows. The network is assumed to be reconfigurable, i.e., a node can add new connections (edges) to other nodes whose identifier it knows or drop existing connections. Each node initially has only knowledge of its own identifier and the identifiers of its neighbors. The overlay construction problem is, given an arbitrary (connected) graph, to reconfigure it to obtain a bounded-degree expander graph as efficiently as possible. The overlay construction problem is relevant to building real-world peer-to-peer network topologies that have desirable properties such as low diameter, high conductance, robustness to adversarial deletions, etc. Our main result is that we show that starting from any arbitrary (connected) graph $G$ on $n$ nodes and $m$ edges, we can construct an overlay network that is a constant-degree expander in polylog $n$ rounds using only $\tilde{O}(n)$ messages. Our time and message bounds are both essentially optimal (up to polylogarithmic factors). Our distributed overlay construction protocol is very lightweight as it uses gossip (each node communicates with only one neighbor in each round) and also scalable as it uses only $\tilde{O}(n)$ messages, which is sublinear in $m$ (even when $m$ is moderately dense).
To the best of our knowledge, this is the first result that achieves overlay network construction in polylog $n$ rounds and $o(m)$ messages. Our protocol uses graph sketches in a novel way to construct an expander overlay that is both time and communication efficient. A consequence of our overlay construction protocol is that distributed computation can be performed very efficiently in this model.

Submitted 28 November, 2023; originally announced November 2023.

Comments: Slightly shortened abstract

arXiv:2311.01511 [pdf, ps, other]

Dispersion, Capacitated Nodes, and the Power of a Trusted Shepherd

Authors: William K. Moses Jr., Amanda Redlich

Abstract: In this paper, we look at and expand the problems of dispersion and Byzantine dispersion of mobile robots on a graph, introduced by Augustine and Moses~Jr.~[ICDCN~2018] and by Molla, Mondal, and Moses~Jr.~[ALGOSENSORS~2020], respectively, to graphs where nodes have variable capacities. We use the idea of a single shepherd, a more powerful robot that will never act in a Byzantine manner, to achieve fast Byzantine dispersion, even when other robots may be strong Byzantine in nature. We also show the benefit of a shepherd for dispersion on capacitated graphs when no Byzantine robots are present.

Submitted 2 November, 2023; originally announced November 2023.

ACM Class: F.2

arXiv:2310.15505 [pdf, other]

The Quantum Tortoise and the Classical Hare: A simple framework for understanding which problems quantum computing will accelerate (and which it will not)

Authors: Sukwoong Choi, William S. Moses, Neil Thompson

Abstract: Quantum computing promises transformational gains for solving some problems, but little to none for others. For anyone hoping to use quantum computers now or in the future, it is important to know which problems will benefit. In this paper, we introduce a framework for answering this question both intuitively and quantitatively. The underlying structure of the framework is a race between quantum and classical computers, where their relative strengths determine when each wins. While classical computers operate faster, quantum computers can sometimes run more efficient algorithms. Whether the speed advantage or the algorithmic advantage dominates determines whether a problem will benefit from quantum computing or not. Our analysis reveals that many problems, particularly those of small to moderate size that can be important for typical businesses, will not benefit from quantum computing. Conversely, larger problems or those with particularly big algorithmic gains will benefit from near-term quantum computing. Since very large algorithmic gains are rare in practice and theorized to be rare even in principle, our analysis suggests that the benefits from quantum computing will flow either to users of these rare cases, or practitioners processing very large data.

Submitted 24 October, 2023; originally announced October 2023.
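
The race between a speed advantage and an algorithmic advantage described in the abstract above can be sketched numerically. This is a minimal illustration, not the paper's model: the linear classical cost, the square-root quantum cost, and both operation rates (`C_RATE`, `Q_RATE`) are assumptions chosen only to show how a crossover problem size arises.

```python
import math


def classical_seconds(n, classical_ops_per_sec):
    """Time for a classical algorithm that needs n steps (assumed linear cost)."""
    return n / classical_ops_per_sec


def quantum_seconds(n, quantum_ops_per_sec):
    """Time for a quantum algorithm that needs sqrt(n) steps (assumed quadratic speedup)."""
    return math.sqrt(n) / quantum_ops_per_sec


def crossover_size(classical_ops_per_sec, quantum_ops_per_sec):
    """Problem size at which both machines take equal time.

    Solving n / c = sqrt(n) / q for n gives n = (c / q) ** 2.
    """
    return (classical_ops_per_sec / quantum_ops_per_sec) ** 2


# Hypothetical rates: the classical machine is 1000x faster per operation.
C_RATE = 1e9
Q_RATE = 1e6
n_star = crossover_size(C_RATE, Q_RATE)  # 1000 ** 2 = 1e6 under these rates
```

Under these assumed rates, problems smaller than `n_star` finish sooner on the classical machine and larger ones finish sooner on the quantum machine. A larger speed gap pushes the crossover out quadratically.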

arXiv:2309.15432 [pdf, other]

ComPile: A Large IR Dataset from Production Sources

Authors: Aiden Grossman, Ludger Paehler, Konstantinos Parasyris, Tal Ben-Nun, Jacob Hegna, William Moses, Jose M Monsalve Diaz, Mircea Trofin, Johannes Doerfert

Abstract: Code is increasingly becoming a core data modality of modern machine learning research impacting not only the way we write code with conversational agents like OpenAI's ChatGPT, Google's Bard, or Anthropic's Claude, the way we translate code from one language into another, but also the compiler infrastructure underlying the language. While modeling approaches may vary and representations differ, the targeted tasks often remain the same within the individual classes of models. Relying solely on the ability of modern models to extract information from unstructured code does not take advantage of 70 years of programming language and compiler development by not utilizing the structure inherent to programs in the data collection. This detracts from the performance of models working over a tokenized representation of input code and precludes the use of these models in the compiler itself. To work towards the first intermediate representation (IR) based models, we fully utilize the LLVM compiler infrastructure, shared by a number of languages, to generate a 182B token dataset of LLVM IR. We generated this dataset from programming languages built on the shared LLVM infrastructure, including Rust, Swift, Julia, and C/C++, by hooking into LLVM code generation either through the language's package manager or the compiler directly to extract the dataset of intermediate representations from production grade programs.
Statistical analysis proves the utility of our dataset not only for large language model training, but also for the introspection into the code generation process itself with the dataset showing great promise for machine-learned compiler components.

Submitted 30 April, 2024; v1 submitted 27 September, 2023; originally announced September 2023.

arXiv:2305.07546 [pdf, other]

Understanding Automatic Differentiation Pitfalls

Authors: Jan Hückelheim, Harshitha Menon, William Moses, Bruce Christianson, Paul Hovland, Laurent Hascoët

Abstract: Automatic differentiation, also known as backpropagation, AD, autodiff, or algorithmic differentiation, is a popular technique for computing derivatives of computer programs accurately and efficiently. Sometimes, however, the derivatives computed by AD could be interpreted as incorrect. These pitfalls occur systematically across tools and approaches. In this paper we broadly categorize problematic usages of AD and illustrate each category with examples such as chaos, time-averaged oscillations, discretizations, fixed-point loops, lookup tables, and linear solvers. We also review debugging techniques and their effectiveness in these situations. With this article we hope to help readers avoid unexpected behavior, detect problems more easily when they occur, and have more realistic expectations from AD tools.

Submitted 12 May, 2023; originally announced May 2023.

arXiv:2305.01753 [pdf, other]

Fast Deterministic Gathering with Detection on Arbitrary Graphs: The Power of Many Robots

Authors: Anisur Rahaman Molla, Kaushik Mondal, William K. Moses Jr

Abstract: Over the years, much research involving mobile computational entities has been performed. From modeling actual microscopic (and smaller) robots, to modeling software processes on a network, many important problems have been studied in this context. Gathering is one such fundamental problem in this area. The problem of gathering $k$ robots, initially arbitrarily placed on the nodes of an $n$-node graph, asks that these robots coordinate and communicate in a local manner, as opposed to global, to move around the graph, find each other, and settle down on a single node as fast as possible. A more difficult problem to solve is gathering with detection, where once the robots gather, they must subsequently realize that gathering has occurred and then terminate. In this paper, we propose a deterministic approach to solve gathering with detection for any arbitrary connected graph that is faster than existing deterministic solutions for even just gathering (without the requirement of detection) for arbitrary graphs. In contrast to earlier work on gathering, it leverages the fact that there are more robots present in the system to achieve gathering with detection faster than those previous papers that focused on just gathering. The state of the art solution for deterministic gathering~[Ta-Shma and Zwick, TALG, 2014] takes $\tilde{O}(n^5 \log \ell)$ rounds, where $\ell$ is the smallest label among robots and $\tilde{O}$ hides a polylog factor.
We design a deterministic algorithm for gathering with detection with the following trade-offs depending on how many robots are present: (i) when $k \geq \lfloor n/2 \rfloor + 1$, the algorithm takes $O(n^3)$ rounds, (ii) when $k \geq \lfloor n/3 \rfloor + 1$, the algorithm takes $O(n^4 \log n)$ rounds, and (iii) otherwise, the algorithm takes $\tilde{O}(n^5)$ rounds. The algorithm is not required to know $k$, but only $n$.

Submitted 2 May, 2023; originally announced May 2023.

Comments: 19 pages, accepted at IPDPS 2023

ACM Class: F.2.3
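
The three trade-off regimes listed in the abstract above depend only on how $k$ compares with $\lfloor n/2 \rfloor + 1$ and $\lfloor n/3 \rfloor + 1$. The sketch below simply restates those thresholds as a lookup. The function name `gathering_bound` is hypothetical, and the code does not implement the gathering algorithm itself.

```python
def gathering_bound(n, k):
    """Return the round bound stated in the abstract for n nodes and k robots.

    Hypothetical helper: it only maps (n, k) to the matching regime.
    """
    if k >= n // 2 + 1:   # regime (i): more than half as many robots as nodes
        return "O(n^3)"
    if k >= n // 3 + 1:   # regime (ii): more than a third
        return "O(n^4 log n)"
    return "~O(n^5)"      # regime (iii): fewer robots, polylog factors hidden


# Example on a 12-node graph: thresholds are 7 robots and 5 robots.
bounds = [gathering_bound(12, k) for k in (7, 5, 4)]
```

The abstract notes that the algorithm knows only $n$ and not $k$, so a classification like this describes the analysis and is not something the robots compute.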

arXiv:2210.01173 [pdf, other]

An Almost Singularly Optimal Asynchronous Distributed MST Algorithm

Authors: Fabien Dufoulon, Shay Kutten, William K. Moses Jr., Gopal Pandurangan, David Peleg

Abstract: A singularly (near) optimal distributed algorithm is one that is (near) optimal in \emph{two} criteria, namely, its time and message complexities. For \emph{synchronous} CONGEST networks, such algorithms are known for fundamental distributed computing problems such as leader election [Kutten et al., JACM 2015] and Minimum Spanning Tree (MST) construction [Pandurangan et al., STOC 2017, Elkin, PODC 2017]. However, it is open whether a singularly (near) optimal bound can be obtained for the MST construction problem in general \emph{asynchronous} CONGEST networks. We present a randomized distributed MST algorithm that, with high probability, computes an MST in \emph{asynchronous} CONGEST networks and takes $\tilde{O}(D^{1+ε} + \sqrt{n})$ time and $\tilde{O}(m)$ messages, where $n$ is the number of nodes, $m$ the number of edges, $D$ is the diameter of the network, and $ε>0$ is an arbitrarily small constant (both time and message bounds hold with high probability). Our algorithm is message optimal (up to a polylog$(n)$ factor) and almost time optimal (except for a $D^ε$ factor). Our result answers an open question raised in Mashregi and King [DISC 2019] by giving the first known asynchronous MST algorithm that has sublinear time (for all $D = O(n^{1-ε})$) and uses $\tilde{O}(m)$ messages. Using a result of Mashregi and King [DISC 2019], this also yields the first asynchronous MST algorithm that is sublinear in both time and messages in the $KT_1$ CONGEST model.
A key tool in our algorithm is the construction of a low diameter rooted spanning tree in asynchronous CONGEST that has depth $\tilde{O}(D^{1+ε})$ (for an arbitrarily small constant $ε> 0$) in $\tilde{O}(D^{1+ε})$ time and $\tilde{O}(m)$ messages. To the best of our knowledge, this is the first such construction that is almost singularly optimal in the asynchronous setting.

Submitted 3 October, 2022; originally announced October 2022.

Comments: 27 pages, accepted to DISC 2022

ACM Class: F.2.2; F.2.3; G.3

arXiv:2207.00257 [pdf, other]

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

Authors: William S. Moses, Ivan R. Ivanov, Jens Domke, Toshio Endo, Johannes Doerfert, Oleksandr Zinenko

Abstract: While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model. We propose an alternative approach that automatically translates programs written in one programming model (CUDA), into another (CPU threads) based on Polygeist/MLIR. Our approach includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification and enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU and achieve a 76% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only Supercomputer Fugaku without user intervention. Our PyTorch compatibility layer making use of transpiled CUDA PyTorch kernels outperforms the PyTorch CPU native backend by 2.7$\times$.

Submitted 1 July, 2022; originally announced July 2022.

arXiv:2204.08385 [pdf, other]

Awake Complexity of Distributed Minimum Spanning Tree

Authors: John Augustine, William K. Moses Jr., Gopal Pandurangan

Abstract: We study the distributed minimum spanning tree (MST) problem, a fundamental problem in distributed computing. It is well-known that distributed MST can be solved in $\tilde{O}(D+\sqrt{n})$ rounds in the standard CONGEST model (where $n$ is the network size and $D$ is the network diameter) and this is essentially the best possible round complexity (up to logarithmic factors). However, in resource-constrained networks such as ad hoc wireless and sensor networks, nodes spending so much time can lead to significant spending of resources such as energy. Motivated by the above consideration, we study distributed algorithms for MST under the \emph{sleeping model} [Chatterjee et al., PODC 2020], a model for design and analysis of resource-efficient distributed algorithms. In the sleeping model, a node can be in one of two modes in any round -- \emph{sleeping} or \emph{awake} (unlike the traditional model where nodes are always awake). Only the rounds in which a node is \emph{awake} are counted, while \emph{sleeping} rounds are ignored. A node spends resources only in the awake rounds and hence the main goal is to minimize the \emph{awake complexity} of a distributed algorithm, the worst-case number of rounds any node is awake. We present deterministic and randomized distributed MST algorithms that have an \emph{optimal} awake complexity of $O(\log n)$ time with a matching lower bound.
We also show that our randomized awake-optimal algorithm has essentially the best possible round complexity by presenting a lower bound of $\tilde{Ω}(n)$ on the product of the awake and round complexity of any distributed algorithm (including randomized) that outputs an MST. To complement our trade-off lower bound, we present a parameterized family of distributed algorithms that gives an essentially optimal trade-off between the awake complexity and the round complexity.

Submitted 19 December, 2023; v1 submitted 18 April, 2022; originally announced April 2022.

Comments: 29 pages, 1 table, 5 figures, abstract modified to fit arXiv constraints

ACM Class: F.2.2; F.2.3; G.2.2; G.3
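
The two measures contrasted in the abstract above, awake complexity and round complexity, can be computed from a per-node wake-up schedule. The sketch below is a minimal illustration under stated assumptions: the schedule is a hypothetical dictionary mapping each node to the set of 1-indexed rounds in which it is awake, and the run is assumed to end at the last round in which any node is awake.

```python
def awake_complexity(schedule):
    """Worst-case number of rounds that any single node is awake."""
    return max((len(rounds) for rounds in schedule.values()), default=0)


def round_complexity(schedule):
    """Total rounds elapsed, assuming the run ends at the last awake round."""
    return max((max(rounds) for rounds in schedule.values() if rounds), default=0)


# Hypothetical schedule for three nodes over an 8-round execution.
schedule = {
    "a": {1, 2, 8},
    "b": {1, 8},
    "c": {3, 4, 5, 8},
}
awake = awake_complexity(schedule)    # node "c" is awake in 4 rounds
rounds = round_complexity(schedule)   # the last awake round is round 8
product = awake * rounds              # the quantity bounded below in the abstract
```

In this toy schedule every node sleeps through most of the 8 rounds, which is how an algorithm can have a small awake complexity while its round complexity stays large.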

arXiv:2204.08359 [pdf, other]

Distributed MIS in $O(\log\log{n} )$ Awake Complexity

Authors: Fabien Dufoulon, William K. Moses Jr., Gopal Pandurangan

Abstract: Maximal Independent Set (MIS) is one of the fundamental and most well-studied problems in distributed graph algorithms. Even after four decades of intensive research, the best-known (randomized) MIS algorithms have $O(\log{n})$ round complexity on general graphs [Luby, STOC 1986] (where $n$ is the number of nodes), while the best-known lower bound is $Ω(\sqrt{\log{n}/\log\log{n}})$ [Kuhn, Moscibroda, Wattenhofer, JACM 2016]. Breaking past the $O(\log{n})$ round complexity upper bound or showing stronger lower bounds have been longstanding open problems. Our main contribution is to show that MIS can be computed in awake complexity that is exponentially better compared to the best known round complexity of $O(\log n)$ and also bypassing its fundamental $Ω(\sqrt{\log{n}/\log\log{n}})$ round complexity lower bound exponentially. Specifically, we show that MIS can be computed by a randomized distributed (Monte Carlo) algorithm in $O(\log\log{n} )$ awake complexity with high probability (i.e., with probability at least $1 - n^{-1}$). This algorithm has a round complexity of $O((\log^7 n) \log \log n)$. We also show that we can improve the round complexity at the cost of a slight increase in awake complexity, by presenting a randomized distributed (Monte Carlo) algorithm for MIS that, with high probability, computes an MIS in $O((\log\log{n})\log^*n)$ awake complexity and $O((\log^3 n) (\log \log n) \log^*n)$ round complexity. Our algorithms work in the CONGEST model where messages of size $O(\log n)$ bits can be sent per edge per round.

Submitted 19 December, 2023; v1 submitted 18 April, 2022; originally announced April 2022.

Comments: Abstract shortened to fit arXiv constraints

ACM Class: F.2.2; F.2.3; G.2

arXiv:2204.01722 [pdf, other]

Performance Portable Solid Mechanics via Matrix-Free $p$-Multigrid

Authors: Jed Brown, Valeria Barra, Natalie Beams, Leila Ghaffari, Matthew Knepley, William Moses, Rezgar Shakeri, Karen Stengel, Jeremy L. Thompson, Junchao Zhang

Abstract: Finite element analysis of solid mechanics is a foundational tool of modern engineering, with low-order finite element methods and assembled sparse matrices representing the industry standard for implicit analysis. We use performance models and numerical experiments to demonstrate that high-order methods greatly reduce the costs to reach engineering tolerances while enabling effective use of GPUs; these data structures also offer up to 2x benefit for linear elements. We demonstrate the reliability, efficiency, and scalability of matrix-free $p$-multigrid methods with algebraic multigrid coarse solvers through large deformation hyperelastic simulations of multiscale structures. We investigate accuracy, cost, and execution time on multi-node CPU and GPU systems for moderate to large models (millions to billions of degrees of freedom) using AMD MI250X (OLCF Crusher), NVIDIA A100 (NERSC Perlmutter), and V100 (LLNL Lassen and OLCF Summit), resulting in order of magnitude efficiency improvements over a broad range of model properties and scales. We discuss efficient matrix-free representation of Jacobians and demonstrate how automatic differentiation enables rapid development of nonlinear material models without impacting debuggability and workflows targeting GPUs. The methods are broadly applicable and amenable to common workflows, presented here via open source libraries that encapsulate all GPU-specific aspects and are accessible to both new and legacy code, allowing application code to be GPU-oblivious without compromising end-to-end performance on GPUs.

Submitted 23 May, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

ACM Class: G.1.8; G.1.5; G.1.10; G.4; J.2; J.6; D.1.3

arXiv:2108.02197 [pdf, other]

Singularly Near Optimal Leader Election in Asynchronous Networks

Authors: Shay Kutten, William K. Moses Jr., Gopal Pandurangan, David Peleg

Abstract: This paper concerns designing distributed algorithms that are {\em singularly optimal}, i.e., algorithms that are {\em simultaneously} time and message {\em optimal}, for the fundamental leader election problem in {\em asynchronous} networks. Kutten et al. (JACM 2015) presented a singularly near optimal randomized leader election algorithm for general {\em synchronous} networks that ran in $O(D)$ time and used $O(m \log n)$ messages (where $D$, $m$, and $n$ are the network's diameter, number of edges and number of nodes, respectively) with high probability.\footnote{Throughout, "with high probability" means "with probability at least $1-1/n^c$, for constant $c$."} Both bounds are near optimal (up to a logarithmic factor), since $Ω(D)$ and $Ω(m)$ are the respective lower bounds for time and messages for leader election even for synchronous networks and even for (Monte-Carlo) randomized algorithms. On the other hand, for general asynchronous networks, leader election algorithms are only known that are either time or message optimal, but not both. Kutten et al. (DISC 2020) presented a randomized asynchronous leader election algorithm that is singularly near optimal for \emph{complete networks}, but left open the problem for general networks. This paper shows that singularly near optimal (up to polylogarithmic factors) bounds can be achieved for general {\em asynchronous} networks. We present a randomized singularly near optimal leader election algorithm that runs in $O(D + \log^2n)$ time and $O(m\log^2 n)$ messages with high probability.
Our result is the first known distributed leader election algorithm for asynchronous networks that is near optimal with respect to both time and message complexity and improves over a long line of results including the classical results of Gallager et al. (ACM TOPLAS, 1983), Peleg (JPDC, 1989), and Awerbuch (STOC 89).

Submitted 9 August, 2021; v1 submitted 4 August, 2021; originally announced August 2021.

Comments: 22 pages. Accepted to DISC 2021

ACM Class: F.2.2; F.2.3; G.3

arXiv:2106.01108 [pdf, other]

Efficient Deterministic Leader Election for Programmable Matter

Authors: Fabien Dufoulon, Shay Kutten, William K. Moses Jr.

Abstract: It was suggested that a programmable matter system (composed of multiple computationally weak mobile particles) should remain connected at all times since otherwise, reconnection is difficult and may be impossible. At the same time, it was not clear that allowing the system to disconnect carried a significant advantage in terms of time complexity. We demonstrate for a fundamental task, that of leader election, an algorithm where the system disconnects and then reconnects automatically in a non-trivial way (particles can move far away from their former neighbors and later reconnect to others). Moreover, the runtime of the temporarily disconnecting deterministic leader election algorithm is linear in the diameter. Hence, the disconnecting -- reconnecting algorithm is as fast as previous randomized algorithms. When comparing to previous deterministic algorithms, we note that some of the previous work assumed weaker schedulers. Still, the runtime of all the previous deterministic algorithms that did not assume special shapes of the particle system (shapes with no holes) was at least quadratic in $n$, where $n$ is the number of particles in the system. (Moreover, the new algorithm is even faster in some parameters than the deterministic algorithms that did assume special initial shapes.) Since leader election is an important module in algorithms for various other tasks, the presented algorithm can be useful for speeding up other algorithms under the assumption of a strong scheduler. This leaves open the question: "can a deterministic algorithm be as fast as the randomized ones also under weaker schedulers?"

Submitted 2 June, 2021; originally announced June 2021.

Comments: PODC 2021

arXiv:2102.07528 [pdf, ps, other]

Byzantine Dispersion on Graphs

Authors: Anisur Rahaman Molla, Kaushik Mondal, William K. Moses Jr

Abstract: This paper considers the problem of Byzantine dispersion and extends previous work along several parameters. The problem of Byzantine dispersion asks: given $n$ robots, up to $f$ of which are Byzantine, initially placed arbitrarily on an $n$ node anonymous graph, design a terminating algorithm to be run by the robots such that they eventually reach a configuration where each node has at most one non-Byzantine robot on it. Previous work solved this problem for rings and tolerated up to $n-1$ Byzantine robots. In this paper, we investigate the problem on more general graphs. We first develop an algorithm that tolerates up to $n-1$ Byzantine robots and works for a more general class of graphs. We then develop an algorithm that works for any graph but tolerates a lesser number of Byzantine robots. We subsequently turn our focus to the strength of the Byzantine robots. Previous work considers only "weak" Byzantine robots that cannot fake their IDs. We develop an algorithm that solves the problem when Byzantine robots are not weak and can fake IDs. Finally, we study the situation where the number of the robots is not $n$ but some $k$. We show that in such a scenario, the number of Byzantine robots that can be tolerated is severely restricted. Specifically, we show that it is impossible to deterministically solve Byzantine dispersion when $\lceil k/n \rceil > \lceil (k-f)/n \rceil$.
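
The impossibility condition quoted in this abstract, $\lceil k/n \rceil > \lceil (k-f)/n \rceil$, is simple to evaluate for concrete parameters. The following is a minimal sketch, not code from the paper: the function names `ceil_div` and `condition_rules_out_dispersion` are hypothetical, and the check only evaluates the stated inequality. When it returns `False`, nothing is implied about whether an algorithm exists.

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for integers a >= 0 and b > 0, without floating point."""
    return -(-a // b)


def condition_rules_out_dispersion(k: int, n: int, f: int) -> bool:
    """Evaluate ceil(k/n) > ceil((k-f)/n) for k robots, n nodes, f Byzantine robots.

    Hypothetical helper: True means the inequality quoted in the abstract holds,
    i.e. deterministic Byzantine dispersion is stated to be impossible.
    """
    if n <= 0 or not 0 <= f <= k:
        raise ValueError("expected n > 0 and 0 <= f <= k")
    return ceil_div(k, n) > ceil_div(k - f, n)


if __name__ == "__main__":
    # k = 5 robots on n = 4 nodes with f = 1 Byzantine robot: 2 > 1.
    print(condition_rules_out_dispersion(5, 4, 1))
    # k = 6 robots on n = 4 nodes with f = 1 Byzantine robot: 2 > 2 is false.
    print(condition_rules_out_dispersion(6, 4, 1))
```

Integer arithmetic is used for the ceilings so that large values of $k$ and $n$ are not affected by floating-point rounding.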

Submitted 26 September, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

arXiv:2010.01709 [pdf, other]

doi 10.5555/3495724.3496770

Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients

Authors: William S. Moses, Valentin Churavy

Abstract: Applying differentiable programming techniques and machine learning algorithms to foreign programs requires developers to either rewrite their code in a machine learning framework, or otherwise provide derivatives of the foreign code. This paper presents Enzyme, a high-performance automatic differentiation (AD) compiler plugin for the LLVM compiler framework capable of synthesizing gradients of statically analyzable programs expressed in the LLVM intermediate representation (IR). Enzyme synthesizes gradients for programs written in any language whose compiler targets LLVM IR including C, C++, Fortran, Julia, Rust, Swift, MLIR, etc., thereby providing native AD capabilities in these languages. Unlike traditional source-to-source and operator-overloading tools, Enzyme performs AD on optimized IR. On a machine-learning focused benchmark suite including Microsoft's ADBench, AD on optimized IR achieves a geometric mean speedup of 4.5x over AD on IR before optimization, allowing Enzyme to achieve state-of-the-art performance. Packaging Enzyme for PyTorch and TensorFlow provides convenient access to gradients of foreign code with state-of-the-art performance, enabling foreign code to be directly incorporated into existing machine learning workflows.

Submitted 4 October, 2020; originally announced October 2020.

Comments: To be published in NeurIPS 2020

arXiv:2008.02782 [pdf, other]

Singularly Optimal Randomized Leader Election

Authors: Shay Kutten, William K. Moses Jr., Gopal Pandurangan, David Peleg

Abstract: This paper concerns designing distributed algorithms that are singularly optimal, i.e., algorithms that are simultaneously time and message optimal, for the fundamental leader election problem in networks. Our main result is a randomized distributed leader election algorithm for asynchronous complete networks that is essentially (up to a polylogarithmic factor) singularly optimal. Our algorithm uses $O(n)$ messages with high probability and runs in $O(\log^2 n)$ time (with high probability) to elect a unique leader. The $O(n)$ message complexity should be contrasted with the $Ω(n \log n)$ lower bounds for the deterministic message complexity of leader election algorithms (regardless of time), proven by Korach, Moran, and Zaks (TCS, 1989) for asynchronous algorithms and by Afek and Gafni (SIAM J. Comput., 1991) for synchronous networks. Hence, our result also separates the message complexities of randomized and deterministic leader election. More importantly, our (randomized) time complexity of $O(\log^2 n)$ for obtaining the optimal $O(n)$ message complexity is significantly smaller than the long-standing $\tilde{Θ}(n)$ time complexity obtained by Afek and Gafni and by Singh (SIAM J. Comput., 1997) for message optimal (deterministic) election in asynchronous networks. In synchronous complete networks, Afek and Gafni showed an essentially singularly optimal deterministic algorithm with $O(\log n)$ time and $O(n \log n)$ messages. Ramanathan et al. (Distrib. Comput. 2007) used randomization to improve the message complexity, and showed a randomized algorithm with $O(n)$ messages and $O(\log n)$ time (with failure probability $O(1 / \log^{Ω(1)}n)$). Our second result is a tightly singularly optimal randomized algorithm, with $O(1)$ time and $O(n)$ messages, for this setting, whose time bound holds with certainty and message bound holds with high probability.

Submitted 16 August, 2020; v1 submitted 6 August, 2020; originally announced August 2020.

Comments: 24 pages. Full version of paper accepted at DISC 2020

ACM Class: F.2.2; F.2.3; G.2.2; G.3

arXiv:2005.13685 [pdf, other]

ProTuner: Tuning Programs with Monte Carlo Tree Search

Authors: Ameer Haj-Ali, Hasan Genc, Qijing Huang, William Moses, John Wawrzynek, Krste Asanović, Ion Stoica

Abstract: We explore applying the Monte Carlo Tree Search (MCTS) algorithm in a notoriously difficult task: tuning programs for high-performance deep learning and image processing. We build our framework on top of Halide and show that MCTS can outperform the state-of-the-art beam-search algorithm. Unlike beam search, which is guided by greedy intermediate performance comparisons between partial and less meaningful schedules, MCTS compares complete schedules and looks ahead before making any intermediate scheduling decision. We further explore modifications to the standard MCTS algorithm as well as combining real execution time measurements with the cost model. Our results show that MCTS can outperform beam search on a suite of 16 real benchmarks.

Submitted 27 May, 2020; originally announced May 2020.

arXiv:2004.11439 [pdf, ps, other]

Efficient Dispersion on an Anonymous Ring in the Presence of Weak Byzantine Robots

Authors: Anisur Rahaman Molla, Kaushik Mondal, William K. Moses Jr

Abstract: The problem of dispersion of mobile robots on a graph asks that $n$ robots initially placed arbitrarily on the nodes of an $n$-node anonymous graph, autonomously move to reach a final configuration where each node has at most one robot on it. This problem is of significant interest due to its relationship to other fundamental robot coordination problems, such as exploration, scattering, load balancing, relocation of self-driving electric cars to recharge stations, etc. The robots have unique IDs, typically in the range $[1,poly(n)]$ and limited memory, whereas the graph is anonymous, i.e., the nodes do not have identifiers. The objective is to simultaneously minimize two performance metrics: (i) time to achieve dispersion and (ii) memory requirement at each robot. This problem has been relatively well-studied when robots are non-faulty. In this paper, we introduce the notion of Byzantine faults to this problem, i.e., we formalize the problem of dispersion in the presence of up to $f$ Byzantine robots. We then study the problem on a ring while simultaneously optimizing the time complexity of algorithms and the memory requirement per robot. Specifically, we design deterministic algorithms that attempt to match the time lower bound ($Ω(n)$ rounds) and memory lower bound ($Ω(\log n)$ bits per robot). Our main result is a deterministic algorithm that is both time and memory optimal, i.e., $O(n)$ rounds and $O(\log n)$ bits of memory required per robot, subject to certain constraints. We subsequently provide results that require fewer assumptions but are either only time or memory optimal but not both. We also provide a primitive, utilized often, that takes robots initially gathered at a node of the ring and disperses them in a time and memory optimal manner without additional assumptions required.

Submitted 3 September, 2020; v1 submitted 23 April, 2020; originally announced April 2020.

Comments: 20 pages

arXiv:2003.00671 [pdf, other]

AutoPhase: Juggling HLS Phase Orderings in Random Forests with Deep Reinforcement Learning

Authors: Qijing Huang, Ameer Haj-Ali, William Moses, John Xiang, Ion Stoica, Krste Asanovic, John Wawrzynek

Abstract: The performance of the code a compiler generates depends on the order in which it applies the optimization passes. Choosing a good order, often referred to as the phase-ordering problem, is an NP-hard problem. As a result, existing solutions rely on a variety of heuristics. In this paper, we evaluate a new technique to address the phase-ordering problem: deep reinforcement learning. To this end, we implement AutoPhase: a framework that takes a program and uses deep reinforcement learning to find a sequence of compilation passes that minimizes its execution time. Without loss of generality, we construct this framework in the context of the LLVM compiler toolchain and target high-level synthesis programs. We use random forests to quantify the correlation between the effectiveness of a given pass and the program's features. This helps us reduce the search space by avoiding phase orderings that are unlikely to improve the performance of a given program. We compare the performance of AutoPhase to state-of-the-art algorithms that address the phase-ordering problem. In our evaluation, we show that AutoPhase improves circuit performance by 28% when compared to using the -O3 compiler flag, and achieves competitive results compared to the state-of-the-art solutions, while requiring fewer samples. Furthermore, unlike existing state-of-the-art solutions, our deep reinforcement learning solution shows promising results in generalizing to real benchmarks and 12,874 different randomly generated programs, after training on a hundred randomly generated programs.

Submitted 4 March, 2020; v1 submitted 2 March, 2020; originally announced March 2020.

Comments: arXiv admin note: text overlap with arXiv:1901.04615

arXiv:2001.04525 [pdf, other]

Live Exploration with Mobile Robots in a Dynamic Ring, Revisited

Authors: Subhrangsu Mandal, Anisur Rahaman Molla, William K. Moses Jr

Abstract: The graph exploration problem requires a group of mobile robots, initially placed arbitrarily on the nodes of a graph, to work collaboratively to explore the graph such that each node is eventually visited by at least one robot. One important requirement of exploration is the {\em termination} condition, i.e., the robots must know that exploration is completed. The problem of live exploration of a dynamic ring using mobile robots was recently introduced in [Di Luna et al., ICDCS 2016]. In it, they proposed multiple algorithms to solve exploration in fully synchronous and semi-synchronous settings with various guarantees when $2$ robots were involved. They also provided guarantees that with certain assumptions, exploration of the ring using two robots was impossible. An important question left open was how the presence of $3$ robots would affect the results. In this paper, we try to settle this question in a fully synchronous setting and also show how to extend our results to a semi-synchronous setting. In particular, we present algorithms for exploration with explicit termination using $3$ robots in conjunction with either (i) unique IDs of the robots and edge crossing detection capability (i.e., two robots moving in opposite directions through an edge in the same round can detect each other), or (ii) access to randomness. The time complexity of our deterministic algorithm is asymptotically optimal. We also provide complementary impossibility results showing that there does not exist any explicit termination algorithm for $2$ robots. The theoretical analysis and comprehensive simulations of our algorithm show the effectiveness and efficiency of the algorithm in dynamic rings. We also present an algorithm to achieve exploration with partial termination using $3$ robots in the semi-synchronous setting.

Submitted 13 January, 2020; originally announced January 2020.

Comments: 13 pages

arXiv:1910.05664 [pdf, other]

Extracting Incentives from Black-Box Decisions

Authors: Yonadav Shavit, William S. Moses

Abstract: An algorithmic decision-maker incentivizes people to act in certain ways to receive better decisions. These incentives can dramatically influence subjects' behaviors and lives, and it is important that both decision-makers and decision-recipients have clarity on which actions are incentivized by the chosen model. While for linear functions, the changes a subject is incentivized to make may be clear, we prove that for many non-linear functions (e.g. neural networks, random forests), classical methods for interpreting the behavior of models (e.g. input gradients) provide poor advice to individuals on which actions they should take. In this work, we propose a mathematical framework for understanding algorithmic incentives as the challenge of solving a Markov Decision Process, where the state includes the set of input features, and the reward is a function of the model's output. We can then leverage the many toolkits for solving MDPs (e.g. tree-based planning, reinforcement learning) to identify the optimal actions each individual is incentivized to take to improve their decision under a given model. We demonstrate the utility of our method by estimating the maximally-incentivized actions in two real-world settings: a recidivism risk predictor we train using ProPublica's COMPAS dataset, and an online credit scoring tool published by the Fair Isaac Corporation (FICO).

Submitted 12 October, 2019; originally announced October 2019.

Comments: Accepted to the NeurIPS 2019 Workshop on Robust AI in Financial Services: Data, Fairness, Explainability, Trustworthiness, and Privacy

arXiv:1905.00580 [pdf, other]

Deterministic Leader Election in Programmable Matter

Authors: Yuval Emek, Shay Kutten, Ron Lavi, William K. Moses Jr

Abstract: Addressing a fundamental problem in programmable matter, we present the first deterministic algorithm to elect a unique leader in a system of connected amoebots assuming only that amoebots are initially contracted. Previous algorithms either used randomization, made various assumptions (shapes with no holes, or known shared chirality), or elected several co-leaders in some cases. Some of the building blocks we introduce in constructing the algorithm are of interest by themselves, especially the procedure we present for reaching common chirality among the amoebots. Given the leader election and the chirality agreement building block, it is known that various tasks in programmable matter can be performed or improved. The main idea of the new algorithm is the usage of the ability of the amoebots to move, which previous leader election algorithms have not used.

Submitted 2 May, 2019; originally announced May 2019.

Comments: 33 pages, 16 figures, to appear in the proceedings of ICALP 2019

arXiv:1902.10489 [pdf, ps, other]

doi 10.1007/978-3-030-14812-6_30

Dispersion of Mobile Robots: The Power of Randomness

Authors: Anisur Rahaman Molla, William K. Moses Jr

Abstract: We consider cooperation among insects, modeled as cooperation between mobile robots on a graph. Within this setting, we consider the problem of mobile robot dispersion on graphs. The study of mobile robots on a graph is an interesting paradigm with many interesting problems and applications. The problem of dispersion in this context, introduced by Augustine and Moses Jr., asks that $n$ robots, initially placed arbitrarily on an $n$ node graph, work together to quickly reach a configuration with exactly one robot at each node. Previous work on this problem has looked at the trade-off between the time to achieve dispersion and the amount of memory required by each robot. However, the trade-off was analyzed for \textit{deterministic algorithms} and the minimum memory required to achieve dispersion was found to be $Ω(\log n)$ bits at each robot. In this paper, we show that by harnessing the power of \textit{randomness}, one can achieve dispersion with $O(\log Δ)$ bits of memory at each robot, where $Δ$ is the maximum degree of the graph. Furthermore, we show a matching lower bound of $Ω(\log Δ)$ bits for any \textit{randomized algorithm} to solve dispersion. We further extend the problem to a general $k$-dispersion problem where $k> n$ robots need to disperse over $n$ nodes such that at most $\lceil k/n \rceil$ robots are at each node in the final configuration.
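
The target configuration of the $k$-dispersion problem described in this abstract (at most $\lceil k/n \rceil$ robots at each of the $n$ nodes) can be stated as a short predicate. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function name `is_k_dispersed` is hypothetical, nodes are assumed to be labeled `0..n-1` for illustration only, and the code checks a final configuration without showing how robots reach it.

```python
from collections import Counter
from typing import Sequence


def is_k_dispersed(positions: Sequence[int], n: int) -> bool:
    """Check the k-dispersion condition for a final configuration.

    positions[i] is the node occupied by robot i (hypothetical labels 0..n-1),
    so k = len(positions). The configuration qualifies when no node holds more
    than ceil(k / n) robots.
    """
    if n <= 0:
        raise ValueError("expected n > 0")
    k = len(positions)
    cap = -(-k // n)  # integer ceiling of k / n
    counts = Counter(positions)
    if any(node < 0 or node >= n for node in counts):
        return False  # a robot sits on a node that is not in the graph
    return max(counts.values(), default=0) <= cap


if __name__ == "__main__":
    # k = 5 robots on n = 3 nodes: the cap is ceil(5/3) = 2 robots per node.
    print(is_k_dispersed([0, 1, 2, 0, 1], 3))
    print(is_k_dispersed([0, 0, 0, 1, 2], 3))
```

With $k = n$ the cap is 1, so the predicate reduces to the requirement of at most one robot per node, which for $n$ robots on $n$ nodes means exactly one robot at each node.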

Submitted 27 February, 2019; originally announced February 2019.

Comments: 20 pages, 1 table. Accepted at TAMC 2019: Theory & Applications of Models of Computation. The final authenticated version is available online at https://doi.org/10.1007/978-3-030-14812-6_30

arXiv:1901.04615 [pdf, other]

AutoPhase: Compiler Phase-Ordering for High Level Synthesis with Deep Reinforcement Learning

Authors: Ameer Haj-Ali, Qijing Huang, William Moses, John Xiang, Ion Stoica, Krste Asanovic, John Wawrzynek

Abstract: The performance of the code generated by a compiler depends on the order in which the optimization passes are applied. In high-level synthesis, the quality of the generated circuit relates directly to the code generated by the front-end compiler. Choosing a good order--often referred to as the phase-ordering problem--is an NP-hard problem. In this paper, we evaluate a new technique to address the phase-ordering problem: deep reinforcement learning. We implement a framework in the context of the LLVM compiler to optimize the ordering for HLS programs and compare the performance of deep reinforcement learning to state-of-the-art algorithms that address the phase-ordering problem. Overall, our framework runs one to two orders of magnitude faster than these algorithms, and achieves a 16% improvement in circuit performance over the -O3 compiler flag.

Submitted 3 April, 2019; v1 submitted 14 January, 2019; originally announced January 2019.

arXiv:1802.04730 [pdf, other]

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

Authors: Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, Albert Cohen

Abstract: Deep learning models with convolutional and recurrent networks are now ubiquitous and analyze massive amounts of audio, image, video, text and graph data, with applications in automatic translation, speech-to-text, scene understanding, ranking user preferences, ad placement, etc. Competing frameworks for building these networks such as TensorFlow, Chainer, CNTK, Torch/PyTorch, Caffe1/2, MXNet and Theano, explore different tradeoffs between usability and expressiveness, research or production orientation and supported hardware. They operate on a DAG of computational operators, wrapping high-performance libraries such as CUDNN (for NVIDIA GPUs) or NNPACK (for various CPUs), and automate memory allocation, synchronization, distribution. Custom operators are needed where the computation does not fit existing high-performance library calls, usually at a high engineering cost. This is frequently required when new operators are invented by researchers: such operators suffer a severe performance penalty, which limits the pace of innovation. Furthermore, even if there is an existing runtime call these frameworks can use, it often doesn't offer optimal performance for a user's particular network architecture and dataset, missing optimizations between operators as well as optimizations that can be done knowing the size and shape of data. Our contributions include (1) a language close to the mathematics of deep learning called Tensor Comprehensions, (2) a polyhedral Just-In-Time compiler to convert a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization, also providing optimizations such as operator fusion and specialization for specific sizes, (3) a compilation cache populated by an autotuner. [Abstract cutoff]

Submitted 28 June, 2018; v1 submitted 13 February, 2018; originally announced February 2018.

arXiv:1707.06391 [pdf, other]

Deterministic Dispersion of Mobile Robots in Dynamic Rings

Authors: Ankush Agarwalla, John Augustine, William K. Moses Jr., Madhav Sankar K., Arvind Krishna Sridhar

Abstract: In this work, we study the problem of dispersion of mobile robots on dynamic rings. The problem of dispersion of $n$ robots on an $n$ node graph, introduced by Augustine and Moses Jr. [1], requires robots to coordinate with each other and reach a configuration where exactly one robot is present on each node. This problem has real world applications and applies whenever we want to minimize the total cost of $n$ agents sharing $n$ resources, located at various places, subject to the constraint that the cost of an agent moving to a different resource is comparatively much smaller than the cost of multiple agents sharing a resource (e.g. smart electric cars sharing recharge stations). The study of this problem also provides indirect benefits to the study of scattering on graphs, the study of exploration by mobile robots, and the study of load balancing on graphs. We solve the problem of dispersion in the presence of two types of dynamism in the underlying graph: (i) vertex permutation and (ii) 1-interval connectivity. We introduce the notion of vertex permutation dynamism and have it mean that for a given set of nodes, in every round, the adversary ensures a ring structure is maintained, but the connections between the nodes may change. We use the idea of 1-interval connectivity from Di Luna et al. [10], where for a given ring, in each round, the adversary chooses at most one edge to remove. We assume robots have full visibility and present asymptotically time optimal algorithms to achieve dispersion in the presence of both types of dynamism when robots have chirality. When robots do not have chirality, we present asymptotically time optimal algorithms to achieve dispersion subject to certain constraints. Finally, we provide impossibility results for dispersion when robots have no visibility.

Submitted 16 October, 2017; v1 submitted 20 July, 2017; originally announced July 2017.

Comments: 21 pages, 10 figures, concise version of paper to appear in ICDCN 2018

ACM Class: F.2.2; G.2.2

arXiv:1707.05629 [pdf, ps, other]

Dispersion of Mobile Robots: A Study of Memory-Time Trade-offs

Authors: John Augustine, William K. Moses Jr

Abstract: We introduce a new problem in the domain of mobile robots, which we term dispersion. In this problem, $n$ robots are placed in an $n$ node graph arbitrarily and must coordinate with each other to reach a final configuration such that exactly one robot is at each node. We study this problem through the lenses of minimizing the memory required by each robot and of minimizing the number of rounds required to achieve dispersion. Dispersion is of interest due to its relationship to the problems of scattering on a graph, exploration using mobile robots, and load balancing on a graph. Additionally, dispersion has an immediate real world application due to its relationship to the problem of recharging electric cars, as each car can be considered a robot and recharging stations and the roads connecting them nodes and edges of a graph respectively. Since recharging is a costly affair relative to traveling, we want to distribute these cars amongst the various available recharge points where communication should be limited to car-to-car interactions. We provide lower bounds on both the memory required for robots to achieve dispersion and the minimum running time to achieve dispersion on any type of graph. We then analyze the trade-offs between time and memory for various types of graphs. We provide time optimal and memory optimal algorithms for several types of graphs and show the power of a little memory in terms of running time.

Submitted 17 July, 2018; v1 submitted 18 July, 2017; originally announced July 2017.

Comments: 18 pages, conference version appeared in ICDCN 2018

ACM Class: F.2.2; G.2.2

arXiv:1702.02460 [pdf, ps, other]

Deterministic Backbone Creation in an SINR Network without Knowledge of Location

Authors: Dariusz R. Kowalski, William K. Moses Jr., Shailesh Vaya

Abstract: For a given network, a backbone is an overlay network consisting of a connected dominating set with additional accessibility properties. Once a backbone is created for a network, it can be utilized for fast communication amongst the nodes of the network. The Signal-to-Interference-plus-Noise-Ratio (SINR) model has become the standard for modeling communication among devices in wireless networks. For this model, the community has pondered what the most realistic solutions for communication problems in wireless networks would look like. Such solutions would make the fewest assumptions about the availability of information about the participating nodes. Solving problems when nothing at all is known about the network and having nodes just start participating would be ideal. However, this is quite challenging and most likely not feasible. The pragmatic approach is then to make meaningful assumptions about the available information and present efficient solutions based on this information. We present a solution for the creation of a backbone in the SINR model when nodes do not have access to their physical coordinates or the coordinates of other nodes in the network. This restriction models the deployment of nodes in various situations for sensing hurricanes, cyclones, and so on, where only information about nodes prior to their deployment may be known but not their actual locations post deployment.
We assume that nodes have access to knowledge of their label, the labels of nodes within their neighborhood, the range from which labels are taken $[N]$, and the total number of participating nodes $n$. We also assume that nodes wake up spontaneously. We present an efficient deterministic protocol to create a backbone with a round complexity of $O(\Delta \lg^2 N)$.

Submitted 8 February, 2017; originally announced February 2017.

Comments: 12 pages, 1 table

ACM Class: F.2.2; G.2.2

arXiv:1702.02455 [pdf, other]

Deterministic Protocols in the SINR Model without Knowledge of Coordinates

Authors: William K. Moses Jr., Shailesh Vaya

Abstract: Much work has been developed for studying the classical broadcasting problem in the SINR (Signal-to-Interference-plus-Noise-Ratio) model for wireless device transmission. The setting typically studied is when all radio nodes transmit a signal of the same strength. This work studies the challenging problem of devising a distributed algorithm for multi-broadcasting, assuming a subset of nodes are initially awake, for the SINR model when each device only has access to knowledge about the total number of nodes in the network $n$, the range from which each node's label is taken $\lbrace 1,\dots,N \rbrace$, and the label of the device itself. Specifically, we assume no knowledge of the physical coordinates of devices and also no knowledge of the neighborhood of each node. We present a deterministic protocol for this problem in $O(n \lg N \lg n)$ rounds. There is no known polynomial time deterministic algorithm in the literature for this setting, and it remains the principal open problem in this domain. A lower bound of $\Omega(n \lg N)$ rounds is known for deterministic broadcasting without local knowledge. In addition to the above result, we present algorithms to achieve multi-broadcast in $O(n \lg N)$ rounds and create a backbone in $O(n \lg N)$ rounds, assuming that all nodes are initially awake. For a given backbone, messages can be exchanged between every pair of connected nodes in the backbone in $O(\lg N)$ rounds and between any node and its designated contact node in the backbone in $O(\Delta \lg N)$ rounds.

Submitted 15 August, 2020; v1 submitted 8 February, 2017; originally announced February 2017.

Comments: This is the author version of the paper which will appear in the Journal of Computer and System Sciences. 36 pages, 1 table, 4 figures; v3 improves the presentation, style, and some technical matter of the paper

ACM Class: F.2.2; G.2.2

arXiv:1702.01973 [pdf, ps, other]

Achieving Dilution without Knowledge of Coordinates in the SINR Model

Authors: William K. Moses Jr., Shailesh Vaya

Abstract: Considerable literature has been developed for various fundamental distributed problems in the SINR (Signal-to-Interference-plus-Noise-Ratio) model for radio transmission. A setting typically studied is when all nodes transmit a signal of the same strength, and each device only has access to knowledge about the total number of nodes in the network $n$, the range from which each node's label is taken $[1,\dots,N]$, and the label of the device itself. In addition, an assumption is made that each node also knows its coordinates in the Euclidean plane. In this paper, we create a technique which allows algorithm designers to remove that last assumption. The assumption about the unavailability of the knowledge of the physical coordinates of the nodes truly captures the `ad-hoc' nature of wireless networks. Previous work in this area uses a flavor of a technique called dilution, in which nodes transmit in a (predetermined) round-robin fashion, and are able to reach all their neighbors. However, without knowing the physical coordinates, it's not possible to know the coordinates of their containing (pivotal) grid box and seemingly not possible to use dilution (to coordinate their transmissions). We propose a new technique to achieve dilution without using the knowledge of physical coordinates. This technique exploits the understanding that the transmitting nodes lie in 2-D space, segmented by an appropriate pivotal grid, without explicitly referring to the actual physical coordinates of these nodes.
Using this technique, it is possible for every weak device to successfully transmit its message to all of its neighbors in $\Theta(\lg N)$ rounds, as long as the density of transmitting nodes in any physical grid box is bounded by a known constant. This technique, we feel, is an important generic tool for devising practical protocols when physical coordinates of the nodes are not known.

Submitted 7 February, 2017; originally announced February 2017.

Comments: 10 pages

ACM Class: F.2.2

arXiv:1607.04220 [pdf, other]

Computational Complexity of Arranging Music

Authors: William S. Moses, Erik D. Demaine

Abstract: This paper proves that arrangement of music is NP-hard when subject to various constraints: avoiding musical dissonance, limiting how many notes can be played simultaneously, and limiting transition speed between chords. These results imply the computational complexity of related musical problems, including musical choreography and rhythm games.

Submitted 14 July, 2016; originally announced July 2016.

arXiv:1602.08298 [pdf, ps, other]

doi 10.1137/1.9781611974331.ch48

Balanced Allocation: Patience is not a Virtue

Authors: John Augustine, William K. Moses Jr., Amanda Redlich, Eli Upfal

Abstract: Load balancing is a well-studied problem, with balls-in-bins being the primary framework. The greedy algorithm $\mathsf{Greedy}[d]$ of Azar et al. places each ball by probing $d > 1$ random bins and placing the ball in the least loaded of them. With high probability, the maximum load under $\mathsf{Greedy}[d]$ is exponentially lower than the result when balls are placed uniformly randomly. Vöcking showed that a slightly asymmetric variant, $\mathsf{Left}[d]$, provides a further significant improvement. However, this improvement comes at an additional computational cost of imposing structure on the bins. Here, we present a fully decentralized and easy-to-implement algorithm called $\mathsf{FirstDiff}[d]$ that combines the simplicity of $\mathsf{Greedy}[d]$ and the improved balance of $\mathsf{Left}[d]$. The key idea in $\mathsf{FirstDiff}[d]$ is to probe until a different bin size from the first observation is located, then place the ball. Although the number of probes could be quite large for some of the balls, we show that $\mathsf{FirstDiff}[d]$ requires only at most $d$ probes on average per ball (in both the standard and the heavily-loaded settings). Thus the number of probes is no greater than either that of $\mathsf{Greedy}[d]$ or $\mathsf{Left}[d]$. More importantly, we show that $\mathsf{FirstDiff}[d]$ closely matches the improved maximum load ensured by $\mathsf{Left}[d]$ in both the standard and heavily-loaded settings. We further provide a tight lower bound on the maximum load up to $O(\log \log \log n)$ terms.
We additionally give experimental data that $\mathsf{FirstDiff}[d]$ is indeed as good as $\mathsf{Left}[d]$, if not better, in practice.

Submitted 22 January, 2018; v1 submitted 26 February, 2016; originally announced February 2016.

Comments: 26 pages, preliminary version accepted at SODA 2016

ACM Class: F.2.2; G.2.0; G.3

Journal ref: In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2016), 655-671
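
The probing rules described in the abstract above can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation. `greedy_d` follows the abstract's description of $\mathsf{Greedy}[d]$, with probes drawn with replacement as an assumption. `first_diff_sketch` is only one plausible reading of "probe until a different bin size from the first observation is located, then place the ball": the choice to place the ball in the less loaded of the two bins, and the hard cap `max_probes`, are assumptions introduced here for illustration and are not taken from the paper. All function and parameter names are hypothetical.

```python
import random


def greedy_d(n_bins, n_balls, d, rng):
    """Greedy[d]: probe d random bins, place the ball in the least loaded one.

    Probes are drawn with replacement (an assumption). With d = 1 this is
    plain uniform random placement.
    """
    loads = [0] * n_bins
    for _ in range(n_balls):
        probes = [rng.randrange(n_bins) for _ in range(d)]
        target = min(probes, key=loads.__getitem__)
        loads[target] += 1
    return loads


def first_diff_sketch(n_bins, n_balls, max_probes, rng):
    """Hypothetical FirstDiff-style rule (assumptions labeled below).

    Probe random bins until one has a load different from the first bin
    probed, then place the ball. ASSUMPTION: the ball goes to the less
    loaded of the two bins. ASSUMPTION: probing stops after max_probes
    probes, in which case the ball goes to the first bin probed.
    Returns the final loads and the total number of probes used.
    """
    loads = [0] * n_bins
    total_probes = 0
    for _ in range(n_balls):
        first = rng.randrange(n_bins)
        target = first
        probes = 1
        while probes < max_probes:
            other = rng.randrange(n_bins)
            probes += 1
            if loads[other] != loads[first]:
                if loads[other] < loads[first]:
                    target = other
                break
        loads[target] += 1
        total_probes += probes
    return loads, total_probes


if __name__ == "__main__":
    rng = random.Random(0)
    n = 2000
    print("uniform max load:  ", max(greedy_d(n, n, 1, rng)))
    print("Greedy[2] max load:", max(greedy_d(n, n, 2, rng)))
    fd_loads, fd_probes = first_diff_sketch(n, n, 8, rng)
    print("FirstDiff-style max load:", max(fd_loads))
    print("average probes per ball: ", fd_probes / n)
```

Running the script prints the maximum load under each rule for the same number of balls and bins, which gives a rough feel for the gap between one random choice and several. The numbers depend on the seed and say nothing about the bounds proved in the paper.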

arXiv:1311.6209 [pdf]

Distributed Algorithms for Large-Scale Graphs

Authors: Khalid Hourani, Hartmut Klauck, William K. Moses Jr., Danupon Nanongkai, Gopal Pandurangan, Peter Robinson, Michele Scquizzato

Abstract: Motivated by the increasing need for fast processing of large-scale graphs, we study a number of fundamental graph problems in a message-passing model for distributed computing, called the $k$-machine model, where we have $k$ machines that jointly perform computations on $n$-node graphs. The graph is assumed to be partitioned in a balanced fashion among the $k$ machines, a common implementation in many real-world systems. Communication is point-to-point via bandwidth-constrained links, and the goal is to minimize the round complexity, i.e., the number of communication rounds required to finish a computation. We present a generic methodology that allows one to obtain efficient algorithms in the $k$-machine model using distributed algorithms for the classical CONGEST model of distributed computing. Using this methodology, we obtain algorithms for various fundamental graph problems such as connectivity, minimum spanning trees, shortest paths, maximal independent sets, and finding subgraphs, showing that many of these problems can be solved in $\tilde{O}(n/k)$ rounds; this shows that one can achieve speedup nearly linear in $k$. To complement our upper bounds, we present lower bounds on the round complexity that quantify the fundamental limitations of solving graph problems distributively. We first show a lower bound of $\Omega(n/k)$ rounds for computing a spanning tree of the input graph. This result implies the same bound for other fundamental problems such as computing a minimum spanning tree, breadth-first tree, or shortest paths tree.
We also show a $\tilde{\Omega}(n/k^2)$ lower bound for connectivity, spanning tree verification and other related problems. The latter lower bounds follow from the development and application of novel results in a random-partition variant of the classical communication complexity model.

Submitted 8 February, 2023; v1 submitted 25 November, 2013; originally announced November 2013.

Comments: Preliminary version appeared at SODA 2015

ACM Class: C.2.4; F.1.1

arXiv:1112.4033 [pdf]

doi 10.5121/ijnsa.2011.3601

Rational Secret Sharing over an Asynchronous Broadcast Channel with Information Theoretic Security

Authors: William K. Moses Jr., C. Pandu Rangan

Abstract: We consider the problem of rational secret sharing introduced by Halpern and Teague [1], where the players involved in secret sharing play only if it is to their advantage. This can be characterized in the form of preferences. Players would prefer to get the secret rather than not get it and, with lesser preference, they would like as few other players as possible to get the secret. Several positive results have already been published to efficiently solve the problem of rational secret sharing, but only a handful of papers have touched upon the use of an asynchronous broadcast channel. [2] used cryptographic primitives, [3] used an interactive dealer, and [4] used an honest minority of players in order to handle an asynchronous broadcast channel. In our paper, we propose an m-out-of-n rational secret sharing scheme which can function over an asynchronous broadcast channel without the use of cryptographic primitives and with a non-interactive dealer. This is possible because our scheme uses a small number, k+1, of honest players. The protocol is resilient to coalitions of size up to k and furthermore it is ε-resilient to coalitions of size up to and including m-1. The protocol will have a strict Nash equilibrium with probability Pr((k+1)/n) and an ε-Nash equilibrium with probability Pr((n-k-1)/n). Furthermore, our protocol is immune to backward induction. Later on in the paper, we extend our results to include malicious players as well.
We also show that our protocol handles the possibility of a player deviating in order to force another player to get a wrong value in what we believe to be a more time efficient manner than was done in Asharov and Lindell [5].

Submitted 17 December, 2011; originally announced December 2011.

Comments: 18 pages, 2 tables

Journal ref: International Journal of Network Security & Its Applications (IJNSA), Volume 3, Number 6, November 2011, pp. 1-18

Showing 1–37 of 37 results for author: Moses, W