-
Rhizomes and Diffusions for Processing Highly Skewed Graphs on Fine-Grain Message-Driven Systems
Authors:
Bibrak Qamar Chandio,
Prateek Srivastava,
Maciej Brodowicz,
Martin Swany,
Thomas Sterling
Abstract:
The paper provides a unified co-design of 1) a programming and execution model that allows spawning tasks from within the vertex data at runtime, 2) language constructs for \textit{actions} that send work to where the data resides, combining parallel expressiveness of local control objects (LCOs) to implement asynchronous graph processing primitives, 3) and an innovative vertex-centric data-struct…
▽ More
The paper provides a unified co-design of 1) a programming and execution model that allows spawning tasks from within the vertex data at runtime, 2) language constructs for \textit{actions} that send work to where the data resides, combining parallel expressiveness of local control objects (LCOs) to implement asynchronous graph processing primitives, 3) and an innovative vertex-centric data-structure, using the concept of Rhizomes, that parallelizes both the out and in-degree load of vertex objects across many cores and yet provides a single programming abstraction to the vertex objects. The data structure hierarchically parallelizes the out-degree load of vertices and the in-degree load laterally. The rhizomes internally communicate and remain consistent, using event-driven synchronization mechanisms, to provide a unified and correct view of the vertex.
Simulated experimental results show performance gains for BFS, SSSP, and Page Rank on large chip sizes for the tested input graph datasets containing highly skewed degree distribution. The improvements come from the ability to express and create fine-grain dynamic computing task in the form of \textit{actions}, language constructs that aid the compiler to generate code that the runtime system uses to optimally schedule tasks, and the data structure that shares both in and out-degree compute workload among memory-processing elements.
△ Less
Submitted 7 May, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
Flexible Communication for Optimal Distributed Learning over Unpredictable Networks
Authors:
Sahil Tyagi,
Martin Swany
Abstract:
Gradient compression alleviates expensive communication in distributed deep learning by sending fewer values and its corresponding indices, typically via Allgather (AG). Training with high compression ratio (CR) achieves high accuracy like DenseSGD, but has lower parallel scaling due to high communication cost (i.e., parallel efficiency). Using lower CRs improves parallel efficiency by lowering sy…
▽ More
Gradient compression alleviates expensive communication in distributed deep learning by sending fewer values and its corresponding indices, typically via Allgather (AG). Training with high compression ratio (CR) achieves high accuracy like DenseSGD, but has lower parallel scaling due to high communication cost (i.e., parallel efficiency). Using lower CRs improves parallel efficiency by lowering synchronization cost, but degrades model accuracy as well (statistical efficiency). Further, speedup attained with different models and CRs also varies with network latency, effective bandwidth and collective op used for aggregation. In many cases, collectives like Allreduce (AR) have lower cost than AG to exchange the same amount of data. In this paper, we propose an AR-compatible Topk compressor that is bandwidth-optimal and thus performs better than AG in certain network configurations. We develop a flexible communication strategy that switches between AG and AR based on which collective is optimal in the current settings, and model the pareto-relationship between parallel and statistical efficiency as a multi-objective optimization (MOO) problem to dynamically adjust CR and accelerate training while still converging to high accuracy.
△ Less
Submitted 29 January, 2024; v1 submitted 4 December, 2023;
originally announced December 2023.
-
Accelerating Distributed ML Training via Selective Synchronization
Authors:
Sahil Tyagi,
Martin Swany
Abstract:
In distributed training, deep neural networks (DNNs) are launched over multiple workers concurrently and aggregate their local updates on each step in bulk-synchronous parallel (BSP) training. However, BSP does not linearly scale-out due to high communication cost of aggregation. To mitigate this overhead, alternatives like Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either r…
▽ More
In distributed training, deep neural networks (DNNs) are launched over multiple workers concurrently and aggregate their local updates on each step in bulk-synchronous parallel (BSP) training. However, BSP does not linearly scale-out due to high communication cost of aggregation. To mitigate this overhead, alternatives like Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce synchronization frequency or eliminate it altogether, usually at the cost of lower final accuracy. In this paper, we present \texttt{SelSync}, a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step either by calling the aggregation op or applying local updates based on their significance. We propose various optimizations as part of \texttt{SelSync} to improve convergence in the context of \textit{semi-synchronous} training. Our system converges to the same or better accuracy than BSP while reducing training time by up to 14$\times$.
△ Less
Submitted 29 January, 2024; v1 submitted 16 July, 2023;
originally announced July 2023.
-
GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training
Authors:
Sahil Tyagi,
Martin Swany
Abstract:
Distributed data-parallel (DDP) training improves overall application throughput as multiple devices train on a subset of data and aggregate updates to produce a globally shared model. The periodic synchronization at each iteration incurs considerable overhead, exacerbated by the increasing size and complexity of state-of-the-art neural networks. Although many gradient compression techniques propo…
▽ More
Distributed data-parallel (DDP) training improves overall application throughput as multiple devices train on a subset of data and aggregate updates to produce a globally shared model. The periodic synchronization at each iteration incurs considerable overhead, exacerbated by the increasing size and complexity of state-of-the-art neural networks. Although many gradient compression techniques propose to reduce communication cost, the ideal compression factor that leads to maximum speedup or minimum data exchange remains an open-ended problem since it varies with the quality of compression, model size and structure, hardware, network topology and bandwidth. We propose GraVAC, a framework to dynamically adjust compression factor throughout training by evaluating model progress and assessing gradient information loss associated with compression. GraVAC works in an online, black-box manner without any prior assumptions about a model or its hyperparameters, while achieving the same or better accuracy than dense SGD (i.e., no compression) in the same number of iterations/epochs. As opposed to using a static compression factor, GraVAC reduces end-to-end training time for ResNet101, VGG16 and LSTM by 4.32x, 1.95x and 6.67x respectively. Compared to other adaptive schemes, our framework provides 1.94x to 5.63x overall speedup.
△ Less
Submitted 29 January, 2024; v1 submitted 20 May, 2023;
originally announced May 2023.
-
GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs
Authors:
Boyuan Zhang,
Jiannan Tian,
Sheng Di,
Xiaodong Yu,
Martin Swany,
Dingwen Tao,
Franck Cappello
Abstract:
Today's graphics processing unit (GPU) applications produce vast volumes of data, which are challenging to store and transfer efficiently. Thus, data compression is becoming a critical technique to mitigate the storage burden and communication cost. LZSS is the core algorithm in many widely used compressors, such as Deflate. However, existing GPU-based LZSS compressors suffer from low throughput d…
▽ More
Today's graphics processing unit (GPU) applications produce vast volumes of data, which are challenging to store and transfer efficiently. Thus, data compression is becoming a critical technique to mitigate the storage burden and communication cost. LZSS is the core algorithm in many widely used compressors, such as Deflate. However, existing GPU-based LZSS compressors suffer from low throughput due to the sequential nature of the LZSS algorithm. Moreover, many GPU applications produce multi-byte data (e.g., int16/int32 index, floating-point numbers), while the current LZSS compression only takes single-byte data as input. To this end, in this work, we propose GPULZ, a highly efficient LZSS compression on modern GPUs for multi-byte data. The contribution of our work is fourfold: First, we perform an in-depth analysis of existing LZ compressors for GPUs and investigate their main issues. Then, we propose two main algorithm-level optimizations. Specifically, we (1) change prefix sum from one pass to two passes and fuse multiple kernels to reduce data movement between shared memory and global memory, and (2) optimize existing pattern-matching approach for multi-byte symbols to reduce computation complexity and explore longer repeated patterns. Third, we perform architectural performance optimizations, such as maximizing shared memory utilization by adapting data partitions to different GPU architectures. Finally, we evaluate GPULZ on six datasets of various types with NVIDIA A100 and A4000 GPUs. Results show that GPULZ achieves up to 272.1X speedup on A4000 and up to 1.4X higher compression ratio compared to state-of-the-art solutions.
△ Less
Submitted 2 May, 2023; v1 submitted 14 April, 2023;
originally announced April 2023.
-
QTrojan: A Circuit Backdoor Against Quantum Neural Networks
Authors:
Cheng Chu,
Lei Jiang,
Martin Swany,
Fan Chen
Abstract:
We propose a circuit-level backdoor attack, \textit{QTrojan}, against Quantum Neural Networks (QNNs) in this paper. QTrojan is implemented by few quantum gates inserted into the variational quantum circuit of the victim QNN. QTrojan is much stealthier than a prior Data-Poisoning-based Backdoor Attack (DPBA), since it does not embed any trigger in the inputs of the victim QNN or require the access…
▽ More
We propose a circuit-level backdoor attack, \textit{QTrojan}, against Quantum Neural Networks (QNNs) in this paper. QTrojan is implemented by few quantum gates inserted into the variational quantum circuit of the victim QNN. QTrojan is much stealthier than a prior Data-Poisoning-based Backdoor Attack (DPBA), since it does not embed any trigger in the inputs of the victim QNN or require the access to original training datasets. Compared to a DPBA, QTrojan improves the clean data accuracy by 21\% and the attack success rate by 19.9\%.
△ Less
Submitted 16 February, 2023;
originally announced February 2023.
-
Graph Attention Multi-Agent Fleet Autonomy for Advanced Air Mobility
Authors:
Malintha Fernando,
Ransalu Senanayake,
Heeyoul Choi,
Martin Swany
Abstract:
Autonomous mobility is emerging as a new disruptive mode of urban transportation for moving cargo and passengers. However, designing scalable autonomous fleet coordination schemes to accommodate fast-growing mobility systems is challenging primarily due to the increasing heterogeneity of the fleets, time-varying demand patterns, service area expansions, and communication limitations. We introduce…
▽ More
Autonomous mobility is emerging as a new disruptive mode of urban transportation for moving cargo and passengers. However, designing scalable autonomous fleet coordination schemes to accommodate fast-growing mobility systems is challenging primarily due to the increasing heterogeneity of the fleets, time-varying demand patterns, service area expansions, and communication limitations. We introduce the concept of partially observable advanced air mobility games to coordinate a fleet of aerial vehicles by accounting for the heterogeneity of the interacting agents and the self-interested nature inherent to commercial mobility fleets. To model the complex interactions among the agents and the observation uncertainty in the mobility networks, we propose a novel heterogeneous graph attention encoder-decoder (HetGAT Enc-Dec) neural network-based stochastic policy. We train the policy by leveraging deep multi-agent reinforcement learning, allowing decentralized decision-making for the agents using their local observations. Through extensive experimentation, we show that the learned policy generalizes to various fleet compositions, demand patterns, and observation topologies. Further, fleets operating under the HetGAT Enc-Dec policy outperform other state-of-the-art graph neural network policies by achieving the highest fleet reward and fulfillment ratios in on-demand mobility networks.
△ Less
Submitted 1 August, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
ScaDLES: Scalable Deep Learning over Streaming data at the Edge
Authors:
Sahil Tyagi,
Martin Swany
Abstract:
Distributed deep learning (DDL) training systems are designed for cloud and data-center environments that assumes homogeneous compute resources, high network bandwidth, sufficient memory and storage, as well as independent and identically distributed (IID) data across all nodes. However, these assumptions don't necessarily apply on the edge, especially when training neural networks on streaming da…
▽ More
Distributed deep learning (DDL) training systems are designed for cloud and data-center environments that assumes homogeneous compute resources, high network bandwidth, sufficient memory and storage, as well as independent and identically distributed (IID) data across all nodes. However, these assumptions don't necessarily apply on the edge, especially when training neural networks on streaming data in an online manner. Computing on the edge suffers from both systems and statistical heterogeneity. Systems heterogeneity is attributed to differences in compute resources and bandwidth specific to each device, while statistical heterogeneity comes from unbalanced and skewed data on the edge. Different streaming-rates among devices can be another source of heterogeneity when dealing with streaming data. If the streaming rate is lower than training batch-size, device needs to wait until enough samples have streamed in before performing a single iteration of stochastic gradient descent (SGD). Thus, low-volume streams act like stragglers slowing down devices with high-volume streams in synchronous training. On the other hand, data can accumulate quickly in the buffer if the streaming rate is too high and the devices can't train at line-rate. In this paper, we introduce ScaDLES to efficiently train on streaming data at the edge in an online fashion, while also addressing the challenges of limited bandwidth and training with non-IID data. We empirically show that ScaDLES converges up to 3.29 times faster compared to conventional distributed SGD.
△ Less
Submitted 29 January, 2024; v1 submitted 21 January, 2023;
originally announced January 2023.
-
Graphical Games for UAV Swarm Control Under Time-Varying Communication Networks
Authors:
Malintha Fernando,
Ransalu Senanayake,
Ariful Azad,
Martin Swany
Abstract:
We propose a unified framework for coordinating Unmanned Aerial Vehicle (UAV) swarms operating under time-varying communication networks. Our framework builds on the concept of graphical games, which we argue provides a compelling paradigm to subsume the interaction structures found in networked UAV swarms thanks to the shared local neighborhood properties. We present a general-sum, factorizable p…
▽ More
We propose a unified framework for coordinating Unmanned Aerial Vehicle (UAV) swarms operating under time-varying communication networks. Our framework builds on the concept of graphical games, which we argue provides a compelling paradigm to subsume the interaction structures found in networked UAV swarms thanks to the shared local neighborhood properties. We present a general-sum, factorizable payoff function for cooperative UAV swarms based on the aggregated local states and yield a Nash equilibrium for the stage games. Further, we propose a decomposition-based approach to solve stage-graphical games in a scalable and decentralized fashion by approximating virtual, mean neighborhoods. Finally, we discuss extending the proposed framework toward general-sum stochastic games by leveraging deep Q-learning and model-predictive control.
△ Less
Submitted 4 May, 2022;
originally announced May 2022.
-
CoCo Games: Graphical Game-Theoretic Swarm Control for Communication-Aware Coverage
Authors:
Malintha Fernando,
Ransalu Senanayake,
Martin Swany
Abstract:
We propose a novel framework for real-time communication-aware coverage control in networked robot swarms. Our framework unifies the robot dynamics with network-level message-routing to reach consensus on swarm formations in the presence of communication uncertainties by leveraging local information. Specifically, we formulate the communication-aware coverage as a cooperative graphical game, and u…
▽ More
We propose a novel framework for real-time communication-aware coverage control in networked robot swarms. Our framework unifies the robot dynamics with network-level message-routing to reach consensus on swarm formations in the presence of communication uncertainties by leveraging local information. Specifically, we formulate the communication-aware coverage as a cooperative graphical game, and use variational inference to reach mixed strategy Nash equilibria of the stage games. We experimentally validate the proposed approach in a mobile ad-hoc wireless network scenario using teams of aerial vehicles and terrestrial user equipment (UE) operating over a large geographic region of interest. We show that our approach can provide wireless coverage to stationary and mobile UEs under realistic network conditions.
△ Less
Submitted 28 April, 2022; v1 submitted 8 November, 2021;
originally announced November 2021.
-
Cybercosm: New Foundations for a Converged Science Data Ecosystem
Authors:
Mark Asch,
François Bodin,
Micah Beck,
Terry Moore,
Michela Taufer,
Martin Swany,
Jean-Pierre Vilotte
Abstract:
Scientific communities naturally tend to organize around data ecosystems created by the combination of their observational devices, their data repositories, and the workflows essential to carry their research from observation to discovery. However, these legacy data ecosystems are now breaking down under the pressure of the exponential growth in the volume and velocity of these workflows, which ar…
▽ More
Scientific communities naturally tend to organize around data ecosystems created by the combination of their observational devices, their data repositories, and the workflows essential to carry their research from observation to discovery. However, these legacy data ecosystems are now breaking down under the pressure of the exponential growth in the volume and velocity of these workflows, which are further complicated by the need to integrate the highly data intensive methods of the Artificial Intelligence revolution. Enabling ground breaking science that makes full use of this new, data saturated research environment will require distributed systems that support dramatically improved resource sharing, workflow portability and composability, and data ecosystem convergence.
The Cybercosm vision presented in this white paper describes a radically different approach to the architecture of distributed systems for data-intensive science and its application workflows. As opposed to traditional models that restrict interoperability by hiving off storage, networking, and computing resources in separate technology silos, Cybercosm defines a minimally sufficient hypervisor as a spanning layer for its data plane that virtualizes and converges the local resources of the system's nodes in a fully interoperable manner. By building on a common, universal interface into which the problems that infect today's data-intensive workflows can be decomposed and attacked, Cybercosm aims to support scalable, portable and composable workflows that span and merge the distributed data ecosystems that characterize leading edge research communities today.
△ Less
Submitted 29 June, 2021; v1 submitted 22 May, 2021;
originally announced May 2021.
-
Energy Aware Routing with Computational Offloading for Wireless Sensor Networks
Authors:
Adam Barker,
Martin Swany
Abstract:
Wireless sensor networks (WSN) are characterized by a network of small, battery powered devices, operating remotely with no pre-existing infrastructure. The unique structure of WSN allow for novel approaches to data reduction and energy preservation. This paper presents a modification to the existing Q-routing protocol by providing an alternate action of performing sensor data reduction in place t…
▽ More
Wireless sensor networks (WSN) are characterized by a network of small, battery powered devices, operating remotely with no pre-existing infrastructure. The unique structure of WSN allow for novel approaches to data reduction and energy preservation. This paper presents a modification to the existing Q-routing protocol by providing an alternate action of performing sensor data reduction in place thereby reducing energy consumption, bandwidth usage, and message transmission time. The algorithm is further modified to include an energy factor which increases the cost of forwarding as energy reserves deplete. This encourages the network to conserve energy in favor of network preservation when energy reserves are low. Our experimental results show that this approach can, in periods of high network traffic, simultaneously reduce bandwidth, conserve energy, and maintain low message transition times.
△ Less
Submitted 30 November, 2020;
originally announced November 2020.
-
An information services algorithm to heuristically summarize IP addresses for a distributed, hierarchical directory service
Authors:
Marcos Portnoi,
Jason Zurawsky,
Martin Swany
Abstract:
A distributed, hierarchical information service for computer networks might rely in several instances, located in different layers. A distributed directory service, for example, might be comprised of upper level listings, and local directories. The upper level listings contain a compact version of the local directories. Clients desiring to access the information contained in local directories migh…
▽ More
A distributed, hierarchical information service for computer networks might rely in several instances, located in different layers. A distributed directory service, for example, might be comprised of upper level listings, and local directories. The upper level listings contain a compact version of the local directories. Clients desiring to access the information contained in local directories might first access the high-level listings, in order to locate the appropriate local instance. One of the keys for the competent operation of such service is the ability of properly summarizing the information, which will be maintained in the upper level directories. We analyze the case of the Lookup Service in the Information Services plane of perfSONAR performance monitoring distributed architecture, which implements IPv4 summarization in its functions. We propose an empirical method, or heuristic, to achieve the summarizations, based on the PATRICIA tree. We further apply the heuristic on a simulated distributed test bed and contemplate the results.
△ Less
Submitted 7 January, 2015; v1 submitted 31 December, 2014;
originally announced January 2015.
-
Offloading MPI Parallel Prefix Scan (MPI_Scan) with the NetFPGA
Authors:
Omer Arap,
Martin Swany
Abstract:
Parallel programs written using the standard Message Passing Interface (MPI) frequently depend upon the ability to efficiently execute collective operations. MPI_Scan is a collective operation defined in MPI that implements parallel prefix scan which is very useful primitive operation in several parallel applications. This operation can be very time consuming. In this paper, we explore the use of…
▽ More
Parallel programs written using the standard Message Passing Interface (MPI) frequently depend upon the ability to efficiently execute collective operations. MPI_Scan is a collective operation defined in MPI that implements parallel prefix scan which is very useful primitive operation in several parallel applications. This operation can be very time consuming. In this paper, we explore the use of hardware programmable network interface cards utilizing standard media access protocols for offloading the MPI_Scan operation to the underlying network. Our work is based upon the NetFPGA - a programmable network interface with an on-board Virtex FPGA and four Ethernet interfaces. We have implemented a network-level MPI_Scan operation using the NetFPGA for use in MPI environments. This paper compares the performance of this implementation with MPI over Ethernet for a small configuration.
△ Less
Submitted 21 August, 2014;
originally announced August 2014.