Search | arXiv e-print repository

Rhizomes and Diffusions for Processing Highly Skewed Graphs on Fine-Grain Message-Driven Systems

Authors: Bibrak Qamar Chandio, Prateek Srivastava, Maciej Brodowicz, Martin Swany, Thomas Sterling

Abstract: The paper provides a unified co-design of 1) a programming and execution model that allows spawning tasks from within the vertex data at runtime, 2) language constructs for \textit{actions} that send work to where the data resides, combining parallel expressiveness of local control objects (LCOs) to implement asynchronous graph processing primitives, and 3) an innovative vertex-centric data-structure, using the concept of Rhizomes, that parallelizes both the out and in-degree load of vertex objects across many cores and yet provides a single programming abstraction to the vertex objects. The data structure hierarchically parallelizes the out-degree load of vertices and the in-degree load laterally. The rhizomes internally communicate and remain consistent, using event-driven synchronization mechanisms, to provide a unified and correct view of the vertex. Simulated experimental results show performance gains for BFS, SSSP, and Page Rank on large chip sizes for the tested input graph datasets containing highly skewed degree distribution. The improvements come from the ability to express and create fine-grain dynamic computing tasks in the form of \textit{actions}, language constructs that aid the compiler to generate code that the runtime system uses to optimally schedule tasks, and the data structure that shares both in and out-degree compute workload among memory-processing elements.

Submitted 7 May, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

Comments: arXiv admin note: text overlap with arXiv:2402.02576

ACM Class: C.1.4; C.3; C.4; D.1.3

arXiv:2312.02493

doi 10.1109/BigData59044.2023.10386724

Flexible Communication for Optimal Distributed Learning over Unpredictable Networks

Authors: Sahil Tyagi, Martin Swany

Abstract: Gradient compression alleviates expensive communication in distributed deep learning by sending fewer values and their corresponding indices, typically via Allgather (AG). Training with high compression ratio (CR) achieves high accuracy like DenseSGD, but has lower parallel scaling due to high communication cost (i.e., parallel efficiency). Using lower CRs improves parallel efficiency by lowering synchronization cost, but degrades model accuracy as well (statistical efficiency). Further, speedup attained with different models and CRs also varies with network latency, effective bandwidth and collective op used for aggregation. In many cases, collectives like Allreduce (AR) have lower cost than AG to exchange the same amount of data. In this paper, we propose an AR-compatible Topk compressor that is bandwidth-optimal and thus performs better than AG in certain network configurations. We develop a flexible communication strategy that switches between AG and AR based on which collective is optimal in the current settings, and model the Pareto relationship between parallel and statistical efficiency as a multi-objective optimization (MOO) problem to dynamically adjust CR and accelerate training while still converging to high accuracy.

Submitted 29 January, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

Comments: 2023 IEEE International Conference on Big Data (BigData)

Journal ref: 2023 IEEE International Conference on Big Data (BigData), 925-935
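
As a rough illustration of the "fewer values and their corresponding indices" idea in the abstract above, the following is a minimal sketch of generic Top-k gradient sparsification. It is not the paper's AR-compatible compressor. The names `topk_compress`, `decompress`, and `keep_fraction` are hypothetical, and `keep_fraction` reads CR as the fraction of entries kept, which is how the abstract's wording appears to use the term.

```python
from typing import List, Tuple


def topk_compress(grad: List[float], keep_fraction: float) -> Tuple[List[int], List[float]]:
    """Keep the k largest-magnitude entries of grad.

    keep_fraction is a hypothetical parameter: the fraction of entries kept,
    so k = max(1, round(keep_fraction * len(grad))).
    """
    k = max(1, int(round(keep_fraction * len(grad))))
    # Rank indices by descending magnitude; ties go to the lower index.
    ranked = sorted(range(len(grad)), key=lambda i: (-abs(grad[i]), i))
    kept = sorted(ranked[:k])
    return kept, [grad[i] for i in kept]


def decompress(indices: List[int], values: List[float], size: int) -> List[float]:
    """Rebuild a dense vector, filling dropped entries with zeros."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


grad = [0.1, -2.0, 0.5, 3.0, -0.2, 0.0, 1.5, -0.4]
indices, values = topk_compress(grad, keep_fraction=0.25)
restored = decompress(indices, values, len(grad))
```

Because each worker selects its own indices, the (index, value) pairs generally differ across workers, which is why such payloads are typically exchanged with Allgather rather than summed element-wise.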

arXiv:2307.07950

doi 10.1109/CLUSTER52292.2023.00008

Accelerating Distributed ML Training via Selective Synchronization

Authors: Sahil Tyagi, Martin Swany

Abstract: In distributed training, deep neural networks (DNNs) are launched over multiple workers concurrently and aggregate their local updates on each step in bulk-synchronous parallel (BSP) training. However, BSP does not linearly scale-out due to high communication cost of aggregation. To mitigate this overhead, alternatives like Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce synchronization frequency or eliminate it altogether, usually at the cost of lower final accuracy. In this paper, we present \texttt{SelSync}, a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step either by calling the aggregation op or applying local updates based on their significance. We propose various optimizations as part of \texttt{SelSync} to improve convergence in the context of \textit{semi-synchronous} training. Our system converges to the same or better accuracy than BSP while reducing training time by up to 14$\times$.

Submitted 29 January, 2024; v1 submitted 16 July, 2023; originally announced July 2023.

Journal ref: Tyagi, S., & Swany, M. (2023). Accelerating Distributed ML Training via Selective Synchronization. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 1-12

arXiv:2305.12201

doi 10.1109/CLOUD60044.2023.00045

GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training

Authors: Sahil Tyagi, Martin Swany

Abstract: Distributed data-parallel (DDP) training improves overall application throughput as multiple devices train on a subset of data and aggregate updates to produce a globally shared model. The periodic synchronization at each iteration incurs considerable overhead, exacerbated by the increasing size and complexity of state-of-the-art neural networks. Although many gradient compression techniques propose to reduce communication cost, the ideal compression factor that leads to maximum speedup or minimum data exchange remains an open-ended problem since it varies with the quality of compression, model size and structure, hardware, network topology and bandwidth. We propose GraVAC, a framework to dynamically adjust compression factor throughout training by evaluating model progress and assessing gradient information loss associated with compression. GraVAC works in an online, black-box manner without any prior assumptions about a model or its hyperparameters, while achieving the same or better accuracy than dense SGD (i.e., no compression) in the same number of iterations/epochs. As opposed to using a static compression factor, GraVAC reduces end-to-end training time for ResNet101, VGG16 and LSTM by 4.32x, 1.95x and 6.67x respectively. Compared to other adaptive schemes, our framework provides 1.94x to 5.63x overall speedup.

Submitted 29 January, 2024; v1 submitted 20 May, 2023; originally announced May 2023.

Journal ref: Tyagi, S., & Swany, M. (2023). GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training. 2023 IEEE 16th International Conference on Cloud Computing (CLOUD), 319-329

arXiv:2304.07342

doi 10.1145/3577193.3593706

GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs

Authors: Boyuan Zhang, Jiannan Tian, Sheng Di, Xiaodong Yu, Martin Swany, Dingwen Tao, Franck Cappello

Abstract: Today's graphics processing unit (GPU) applications produce vast volumes of data, which are challenging to store and transfer efficiently. Thus, data compression is becoming a critical technique to mitigate the storage burden and communication cost. LZSS is the core algorithm in many widely used compressors, such as Deflate. However, existing GPU-based LZSS compressors suffer from low throughput due to the sequential nature of the LZSS algorithm. Moreover, many GPU applications produce multi-byte data (e.g., int16/int32 index, floating-point numbers), while the current LZSS compression only takes single-byte data as input. To this end, in this work, we propose GPULZ, a highly efficient LZSS compression on modern GPUs for multi-byte data. The contribution of our work is fourfold: First, we perform an in-depth analysis of existing LZ compressors for GPUs and investigate their main issues. Then, we propose two main algorithm-level optimizations. Specifically, we (1) change prefix sum from one pass to two passes and fuse multiple kernels to reduce data movement between shared memory and global memory, and (2) optimize existing pattern-matching approach for multi-byte symbols to reduce computation complexity and explore longer repeated patterns. Third, we perform architectural performance optimizations, such as maximizing shared memory utilization by adapting data partitions to different GPU architectures. Finally, we evaluate GPULZ on six datasets of various types with NVIDIA A100 and A4000 GPUs. Results show that GPULZ achieves up to 272.1X speedup on A4000 and up to 1.4X higher compression ratio compared to state-of-the-art solutions.

Submitted 2 May, 2023; v1 submitted 14 April, 2023; originally announced April 2023.

Comments: 12 pages, 9 figures, 3 tables, accepted by ACM ICS '23

arXiv:2302.08090

QTrojan: A Circuit Backdoor Against Quantum Neural Networks

Authors: Cheng Chu, Lei Jiang, Martin Swany, Fan Chen

Abstract: We propose a circuit-level backdoor attack, \textit{QTrojan}, against Quantum Neural Networks (QNNs) in this paper. QTrojan is implemented by a few quantum gates inserted into the variational quantum circuit of the victim QNN. QTrojan is much stealthier than a prior Data-Poisoning-based Backdoor Attack (DPBA), since it does not embed any trigger in the inputs of the victim QNN or require access to the original training datasets. Compared to a DPBA, QTrojan improves the clean data accuracy by 21\% and the attack success rate by 19.9\%.

Submitted 16 February, 2023; originally announced February 2023.

Journal ref: ICASSP2023

arXiv:2302.07337

doi 10.15607/RSS.2023.XIX.105

Graph Attention Multi-Agent Fleet Autonomy for Advanced Air Mobility

Authors: Malintha Fernando, Ransalu Senanayake, Heeyoul Choi, Martin Swany

Abstract: Autonomous mobility is emerging as a new disruptive mode of urban transportation for moving cargo and passengers. However, designing scalable autonomous fleet coordination schemes to accommodate fast-growing mobility systems is challenging primarily due to the increasing heterogeneity of the fleets, time-varying demand patterns, service area expansions, and communication limitations. We introduce the concept of partially observable advanced air mobility games to coordinate a fleet of aerial vehicles by accounting for the heterogeneity of the interacting agents and the self-interested nature inherent to commercial mobility fleets. To model the complex interactions among the agents and the observation uncertainty in the mobility networks, we propose a novel heterogeneous graph attention encoder-decoder (HetGAT Enc-Dec) neural network-based stochastic policy. We train the policy by leveraging deep multi-agent reinforcement learning, allowing decentralized decision-making for the agents using their local observations. Through extensive experimentation, we show that the learned policy generalizes to various fleet compositions, demand patterns, and observation topologies. Further, fleets operating under the HetGAT Enc-Dec policy outperform other state-of-the-art graph neural network policies by achieving the highest fleet reward and fulfillment ratios in on-demand mobility networks.

Submitted 1 August, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

Comments: Accepted to Robotics: Science and Systems, 2023. 14 pages, 13 figures, 3 tables

Journal ref: Robotics: Science and Systems, 2023

arXiv:2301.08897

doi 10.1109/BigData55660.2022.10020597

ScaDLES: Scalable Deep Learning over Streaming data at the Edge

Authors: Sahil Tyagi, Martin Swany

Abstract: Distributed deep learning (DDL) training systems are designed for cloud and data-center environments that assume homogeneous compute resources, high network bandwidth, sufficient memory and storage, as well as independent and identically distributed (IID) data across all nodes. However, these assumptions don't necessarily apply on the edge, especially when training neural networks on streaming data in an online manner. Computing on the edge suffers from both systems and statistical heterogeneity. Systems heterogeneity is attributed to differences in compute resources and bandwidth specific to each device, while statistical heterogeneity comes from unbalanced and skewed data on the edge. Different streaming-rates among devices can be another source of heterogeneity when dealing with streaming data. If the streaming rate is lower than training batch-size, a device needs to wait until enough samples have streamed in before performing a single iteration of stochastic gradient descent (SGD). Thus, low-volume streams act like stragglers slowing down devices with high-volume streams in synchronous training. On the other hand, data can accumulate quickly in the buffer if the streaming rate is too high and the devices can't train at line-rate. In this paper, we introduce ScaDLES to efficiently train on streaming data at the edge in an online fashion, while also addressing the challenges of limited bandwidth and training with non-IID data. We empirically show that ScaDLES converges up to 3.29 times faster compared to conventional distributed SGD.

Submitted 29 January, 2024; v1 submitted 21 January, 2023; originally announced January 2023.

Journal ref: Tyagi, S., & Swany, M. (2022). ScaDLES: Scalable Deep Learning over Streaming data at the Edge. 2022 IEEE International Conference on Big Data (Big Data), 2113-2122

arXiv:2205.02203

Graphical Games for UAV Swarm Control Under Time-Varying Communication Networks

Authors: Malintha Fernando, Ransalu Senanayake, Ariful Azad, Martin Swany

Abstract: We propose a unified framework for coordinating Unmanned Aerial Vehicle (UAV) swarms operating under time-varying communication networks. Our framework builds on the concept of graphical games, which we argue provides a compelling paradigm to subsume the interaction structures found in networked UAV swarms thanks to the shared local neighborhood properties. We present a general-sum, factorizable payoff function for cooperative UAV swarms based on the aggregated local states and yield a Nash equilibrium for the stage games. Further, we propose a decomposition-based approach to solve stage-graphical games in a scalable and decentralized fashion by approximating virtual, mean neighborhoods. Finally, we discuss extending the proposed framework toward general-sum stochastic games by leveraging deep Q-learning and model-predictive control.

Submitted 4 May, 2022; originally announced May 2022.

Comments: Presented in Workshop on Intelligent Aerial Robotics, International Conference on Robotics and Automation, 2022

arXiv:2111.04576

doi 10.1109/LRA.2022.3160968

CoCo Games: Graphical Game-Theoretic Swarm Control for Communication-Aware Coverage

Authors: Malintha Fernando, Ransalu Senanayake, Martin Swany

Abstract: We propose a novel framework for real-time communication-aware coverage control in networked robot swarms. Our framework unifies the robot dynamics with network-level message-routing to reach consensus on swarm formations in the presence of communication uncertainties by leveraging local information. Specifically, we formulate the communication-aware coverage as a cooperative graphical game, and use variational inference to reach mixed strategy Nash equilibria of the stage games. We experimentally validate the proposed approach in a mobile ad-hoc wireless network scenario using teams of aerial vehicles and terrestrial user equipment (UE) operating over a large geographic region of interest. We show that our approach can provide wireless coverage to stationary and mobile UEs under realistic network conditions.

Submitted 28 April, 2022; v1 submitted 8 November, 2021; originally announced November 2021.

Comments: 8 pages, 7 figures

Journal ref: 2022 - IEEE Robotics and Automation Letters

arXiv:2105.10680

Cybercosm: New Foundations for a Converged Science Data Ecosystem

Authors: Mark Asch, François Bodin, Micah Beck, Terry Moore, Michela Taufer, Martin Swany, Jean-Pierre Vilotte

Abstract: Scientific communities naturally tend to organize around data ecosystems created by the combination of their observational devices, their data repositories, and the workflows essential to carry their research from observation to discovery. However, these legacy data ecosystems are now breaking down under the pressure of the exponential growth in the volume and velocity of these workflows, which are further complicated by the need to integrate the highly data intensive methods of the Artificial Intelligence revolution. Enabling ground breaking science that makes full use of this new, data saturated research environment will require distributed systems that support dramatically improved resource sharing, workflow portability and composability, and data ecosystem convergence. The Cybercosm vision presented in this white paper describes a radically different approach to the architecture of distributed systems for data-intensive science and its application workflows. As opposed to traditional models that restrict interoperability by hiving off storage, networking, and computing resources in separate technology silos, Cybercosm defines a minimally sufficient hypervisor as a spanning layer for its data plane that virtualizes and converges the local resources of the system's nodes in a fully interoperable manner. By building on a common, universal interface into which the problems that infect today's data-intensive workflows can be decomposed and attacked, Cybercosm aims to support scalable, portable and composable workflows that span and merge the distributed data ecosystems that characterize leading edge research communities today.

Submitted 29 June, 2021; v1 submitted 22 May, 2021; originally announced May 2021.

Comments: Updated author list

arXiv:2011.14795

Energy Aware Routing with Computational Offloading for Wireless Sensor Networks

Authors: Adam Barker, Martin Swany

Abstract: Wireless sensor networks (WSN) are characterized by a network of small, battery powered devices, operating remotely with no pre-existing infrastructure. The unique structure of WSN allows for novel approaches to data reduction and energy preservation. This paper presents a modification to the existing Q-routing protocol by providing an alternate action of performing sensor data reduction in place, thereby reducing energy consumption, bandwidth usage, and message transmission time. The algorithm is further modified to include an energy factor which increases the cost of forwarding as energy reserves deplete. This encourages the network to conserve energy in favor of network preservation when energy reserves are low. Our experimental results show that this approach can, in periods of high network traffic, simultaneously reduce bandwidth, conserve energy, and maintain low message transition times.

Submitted 30 November, 2020; originally announced November 2020.

Comments: 17 pages, NeTIOT 2020

arXiv:1501.00182

doi 10.1109/GRID.2010.5697950

An information services algorithm to heuristically summarize IP addresses for a distributed, hierarchical directory service

Authors: Marcos Portnoi, Jason Zurawsky, Martin Swany

Abstract: A distributed, hierarchical information service for computer networks might rely on several instances, located in different layers. A distributed directory service, for example, might be comprised of upper level listings, and local directories. The upper level listings contain a compact version of the local directories. Clients desiring to access the information contained in local directories might first access the high-level listings, in order to locate the appropriate local instance. One of the keys for the competent operation of such a service is the ability to properly summarize the information, which will be maintained in the upper level directories. We analyze the case of the Lookup Service in the Information Services plane of perfSONAR performance monitoring distributed architecture, which implements IPv4 summarization in its functions. We propose an empirical method, or heuristic, to achieve the summarizations, based on the PATRICIA tree. We further apply the heuristic on a simulated distributed test bed and contemplate the results.

Submitted 7 January, 2015; v1 submitted 31 December, 2014; originally announced January 2015.

Comments: Grid Computing (GRID), 2010 11th IEEE/ACM International Conference on, 25-28 Oct. 2010
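
To make "IPv4 summarization" in the abstract above concrete, here is a minimal sketch using Python's standard `ipaddress` module. It does not reproduce the paper's PATRICIA-tree heuristic. It only contrasts an exact aggregation (`ipaddress.collapse_addresses`) with a lossy single covering prefix, which is the kind of compact entry an upper-level listing might store. The helper names `exact_summary` and `covering_prefix` are hypothetical.

```python
import ipaddress
from typing import List


def exact_summary(cidrs: List[str]) -> List[str]:
    """Merge adjacent or overlapping prefixes without covering any extra address."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]


def covering_prefix(cidrs: List[str]) -> str:
    """Smallest single IPv4 prefix that covers every input (may over-cover)."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    lo = min(int(n.network_address) for n in nets)
    hi = max(int(n.broadcast_address) for n in nets)
    # Bits at and below the highest differing bit cannot be part of the shared prefix.
    host_bits = (lo ^ hi).bit_length()
    base = (lo >> host_bits) << host_bits
    return str(ipaddress.IPv4Network((base, 32 - host_bits)))


local_directory = ["192.168.0.0/25", "192.168.0.128/25", "192.168.1.0/24", "10.0.0.1/32"]
exact = exact_summary(local_directory)
lossy = covering_prefix(["192.168.0.0/24", "192.168.3.0/24"])
```

The exact form never advertises addresses the local directory lacks, while the covering prefix is smaller but can send clients to a local instance that does not hold the address they asked about. A summarization heuristic has to balance those two costs.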

arXiv:1408.4939

Offloading MPI Parallel Prefix Scan (MPI_Scan) with the NetFPGA

Authors: Omer Arap, Martin Swany

Abstract: Parallel programs written using the standard Message Passing Interface (MPI) frequently depend upon the ability to efficiently execute collective operations. MPI_Scan is a collective operation defined in MPI that implements parallel prefix scan, which is a very useful primitive operation in several parallel applications. This operation can be very time consuming. In this paper, we explore the use of hardware programmable network interface cards utilizing standard media access protocols for offloading the MPI_Scan operation to the underlying network. Our work is based upon the NetFPGA - a programmable network interface with an on-board Virtex FPGA and four Ethernet interfaces. We have implemented a network-level MPI_Scan operation using the NetFPGA for use in MPI environments. This paper compares the performance of this implementation with MPI over Ethernet for a small configuration.

Submitted 21 August, 2014; originally announced August 2014.

Comments: Presented at First International Workshop on FPGAs for Software Programmers (FSP 2014) (arXiv:1408.4423)

Report number: FSP/2014/06
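
For readers unfamiliar with the operation named in the abstract above, the following is a small sketch of what an inclusive prefix scan computes: with MPI_Scan, rank i ends up with the reduction of the contributions from ranks 0 through i. The code simulates the ranks as list positions and uses a recursive-doubling schedule. It is not the NetFPGA implementation and uses no MPI library, and the function name `inclusive_scan_doubling` is hypothetical.

```python
import operator
from itertools import accumulate
from typing import Callable, List, Sequence


def inclusive_scan_doubling(values: Sequence[int],
                            op: Callable[[int, int], int] = operator.add) -> List[int]:
    """Inclusive scan in about log2(n) rounds; position i plays the role of rank i."""
    result = list(values)
    n = len(result)
    step = 1
    while step < n:
        # Snapshot so every "rank" combines values from the previous round only.
        prev = list(result)
        for rank in range(step, n):
            # Left operand comes from the lower rank, preserving rank order
            # for operators that are associative but not commutative.
            result[rank] = op(prev[rank - step], prev[rank])
        step *= 2
    return result


contributions = [3, 1, 4, 1, 5]
scanned = inclusive_scan_doubling(contributions)
reference = list(accumulate(contributions))
```

Each round corresponds to one message exchange per rank, so the cost of the operation is dominated by communication latency, which is the motivation for pushing it into the network interface.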

Showing 1–14 of 14 results for author: Swany, M