doi 10.1145/3620665.3640384

LazyDP: Co-Designing Algorithm-Software for Scalable Training of Differentially Private Recommendation Models

Authors: Juntaek Lim, Youngeun Kwon, Ranggi Hwang, Kiwan Maeng, G. Edward Suh, Minsoo Rhu

Abstract: Differential privacy (DP) is widely being employed in the industry as a practical standard for privacy protection. While private training of computer vision or natural language processing applications has been studied extensively, the computational challenges of training of recommender systems (RecSys) with DP have not been explored. In this work, we first present our detailed characterization of private RecSys training using DP-SGD, root-causing its several performance bottlenecks. Specifically, we identify DP-SGD's noise sampling and noisy gradient update stage to suffer from a severe compute and memory bandwidth limitation, respectively, causing significant performance overhead in training private RecSys. Based on these findings, we propose LazyDP, an algorithm-software co-design that addresses the compute and memory challenges of training RecSys with DP-SGD. Compared to a state-of-the-art DP-SGD training system, we demonstrate that LazyDP provides an average 119x training throughput improvement while also ensuring mathematically equivalent, differentially private RecSys models to be trained.

Submitted 12 April, 2024; originally announced April 2024.

Journal ref: Published at 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-29), 2024

arXiv:2308.12066 [pdf, other]

Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

Authors: Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, Mao Yang

Abstract: Large language models (LLMs) based on transformers have made significant strides in recent years, the success of which is driven by scaling up their model size. Despite their high algorithmic performance, the computational and memory requirements of LLMs present unprecedented challenges. To tackle the high compute requirements of LLMs, the Mixture-of-Experts (MoE) architecture was introduced which is able to scale its model size without proportionally scaling up its computational requirements. Unfortunately, MoE's high memory demands and dynamic activation of sparse experts restrict its applicability to real-world problems. Previous solutions that offload MoE's memory-hungry expert parameters to CPU memory fall short because the latency to migrate activated experts from CPU to GPU incurs high performance overhead. Our proposed Pre-gated MoE system effectively tackles the compute and memory challenges of conventional MoE architectures using our algorithm-system co-design. Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation, allowing our proposed system to address the large memory footprint of MoEs while also achieving high performance. We demonstrate that Pre-gated MoE is able to improve performance, reduce GPU memory consumption, while also maintaining the same level of model quality. These features allow our Pre-gated MoE system to cost-effectively deploy large-scale LLMs using just a single GPU with high performance.

Submitted 27 April, 2024; v1 submitted 23 August, 2023; originally announced August 2023.

arXiv:2212.03280 [pdf, other]

Optimizing Resource Allocation with High-Reliability Constraint for Multicasting Automotive Messages in 5G NR C-V2X Networks

Authors: Kuan-Lin Chen, Wei-Yu Chen, Ren-Hung Hwang

Abstract: Cellular vehicle-to-everything (C-V2X) has been continuously evolving since Release 14 of the 3rd Generation Partnership Project (3GPP) for future autonomous vehicles. Apart from automotive safety, 5G NR further bring new capabilities to C-V2X for autonomous driving, such as real-time local update, and coordinated driving. These capabilities rely on the provision of low latency and high reliability from 5G NR. Among them, a basic demand is broadcasting or multicasting environment update messages, such as cooperative perception data, with high reliability and low latency from a Road Side Unit (RSU) or a base station (BS). In other words, broadcasting multiple types of automotive messages with high reliability and low latency is one of the key issues in 5G NR C-V2X. In this work, we consider how to select Modulation and Coding Scheme (MCS), RSU/BS, Forward Error Correction (FEC) code rate, to maximize the system utility, which is a function of message delivery reliability. We formulate the optimization problem as a nonlinear integer programming problem. Since the optimization problem is NP-hard, we propose an approximation algorithm, referred to as the Hyperbolic Successive Convex Approximation (HSCA) algorithm, which uses the successive convex approximation to find the optimal solution. In our simulations, we compare the performance of HSCA with those of three algorithms respectively, including the baseline algorithm, the heuristic algorithm, and the optimal solution. Our simulation results show that HSCA outperforms the baseline and the heuristic algorithms and is very competitive to the optimal solution.

Submitted 29 September, 2022; originally announced December 2022.

Comments: 13 pages, submitted to IEEE Transactions on Vehicular Technology

MSC Class: C.2

arXiv:2209.01349 [pdf, other]

Towards the Age of Intelligent Vehicular Networks for Connected and Autonomous Vehicles in 6G

Authors: Van-Linh Nguyen, Ren-Hung Hwang, Po-Ching Lin, Abhishek Vyas, Van-Tao Nguyen

Abstract: Twenty-two years after the advent of the first-generation vehicular network, i.e., dedicated short-range communications (DSRC) standard/IEEE 802.11p, the vehicular technology market has become very competitive with a new player, Cellular Vehicle-to-Everything (C-V2X). Currently, C-V2X technology likely dominates the race because of the big advantages of comprehensive coverage and high throughput/reliability. Meanwhile, DSRC-based technologies are struggling to survive and rebound with many hopes betting on the success of the second-generation standard, IEEE P802.11bd. While the standards battle to attract automotive makers and dominate the commercial market landing, the research community has started thinking about the shape of the next-generation vehicular networks. This article details the state-of-the-art progress of vehicular networks, particularly the cellular V2X-related technologies in specific use cases, compared to the features of the current generation. Through the typical examples, we also highlight why 5G is inadequate to provide the best connectivity for vehicular applications, and then 6G technologies can fill up the vacancy.

Submitted 3 September, 2022; originally announced September 2022.

arXiv:2208.12392 [pdf, other]

DiVa: An Accelerator for Differentially Private Machine Learning

Authors: Beomsik Park, Ranggi Hwang, Dongho Yoon, Yoonhyuk Choi, Minsoo Rhu

Abstract: The widespread deployment of machine learning (ML) is raising serious concerns on protecting the privacy of users who contributed to the collection of training data. Differential privacy (DP) is rapidly gaining momentum in the industry as a practical standard for privacy protection. Despite DP's importance, however, little has been explored within the computer systems community regarding the implication of this emerging ML algorithm on system designs. In this work, we conduct a detailed workload characterization on a state-of-the-art differentially private ML training algorithm named DP-SGD. We uncover several unique properties of DP-SGD (e.g., its high memory capacity and computation requirements vs. non-private ML), root-causing its key bottlenecks. Based on our analysis, we propose an accelerator for differentially private ML named DiVa, which provides a significant improvement in compute utilization, leading to 2.6x higher energy-efficiency vs. conventional systolic arrays.

Submitted 25 August, 2022; originally announced August 2022.

Comments: Accepted for publication at the 55th IEEE/ACM International Symposium on Microarchitecture (MICRO-55), 2022

arXiv:2203.00158 [pdf, other]

GROW: A Row-Stationary Sparse-Dense GEMM Accelerator for Memory-Efficient Graph Convolutional Neural Networks

Authors: Ranggi Hwang, Minhoo Kang, Jiwon Lee, Dongyun Kam, Youngjoo Lee, Minsoo Rhu

Abstract: Graph convolutional neural networks (GCNs) have emerged as a key technology in various application domains where the input data is relational. A unique property of GCNs is that its two primary execution stages, aggregation and combination, exhibit drastically different dataflows. Consequently, prior GCN accelerators tackle this research space by casting the aggregation and combination stages as a series of sparse-dense matrix multiplication. However, prior work frequently suffers from inefficient data movements, leaving significant performance left on the table. We present GROW, a GCN accelerator based on Gustavson's algorithm to architect a row-wise product based sparse-dense GEMM accelerator. GROW co-designs the software/hardware that strikes a balance in locality and parallelism for GCNs, achieving significant energy-efficiency improvements vs. state-of-the-art GCN accelerators.

Submitted 30 November, 2022; v1 submitted 28 February, 2022; originally announced March 2022.

Comments: Accepted for publication at the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023

arXiv:2111.04941 [pdf, other]

doi 10.1609/aaai.v36i4.20373

Solving PDE-constrained Control Problems Using Operator Learning

Authors: Rakhoon Hwang, Jae Yong Lee, ** Young Shin, Hyung Ju Hwang

Abstract: The modeling and control of complex physical systems are essential in real-world problems. We propose a novel framework that is generally applicable to solving PDE-constrained optimal control problems by introducing surrogate models for PDE solution operators with special regularizers. The procedure of the proposed framework is divided into two phases: solution operator learning for PDE constraints (Phase 1) and searching for optimal control (Phase 2). Once the surrogate model is trained in Phase 1, the optimal control can be inferred in Phase 2 without intensive computations. Our framework can be applied to both data-driven and data-free cases. We demonstrate the successful application of our method to various optimal control problems for different control variables with diverse PDE constraints from the Poisson equation to Burgers' equation.

Submitted 26 December, 2023; v1 submitted 8 November, 2021; originally announced November 2021.

Comments: 15 pages, 12 figures. Published as a conference paper at Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI 2022)

MSC Class: 68U07

arXiv:2108.11861 [pdf, other]

doi 10.1109/COMST.2021.3108618

Security and privacy for 6G: A survey on prospective technologies and challenges

Authors: Van-Linh Nguyen, Po-Ching Lin, Bo-Chao Cheng, Ren-Hung Hwang, Ying-Dar Lin

Abstract: Sixth-generation (6G) mobile networks will have to cope with diverse threats on a space-air-ground integrated network environment, novel technologies, and an accessible user information explosion. However, for now, security and privacy issues for 6G remain largely in concept. This survey provides a systematic overview of security and privacy issues based on prospective technologies for 6G in the physical, connection, and service layers, as well as through lessons learned from the failures of existing security architectures and state-of-the-art defenses. Two key lessons learned are as follows. First, other than inheriting vulnerabilities from the previous generations, 6G has new threat vectors from new radio technologies, such as the exposed location of radio stripes in ultra-massive MIMO systems at Terahertz bands and attacks against pervasive intelligence. Second, physical layer protection, deep network slicing, quantum-safe communications, artificial intelligence (AI) security, platform-agnostic security, real-time adaptive security, and novel data protection mechanisms such as distributed ledgers and differential privacy are the top promising techniques to mitigate the attack magnitude and personal data breaches substantially.

Submitted 31 August, 2021; v1 submitted 26 August, 2021; originally announced August 2021.

Comments: 45 pages, 28 figures, accepted at IEEE Communications Surveys and Tutorials, 2021

arXiv:2107.05015 [pdf]

Offloading Optimization with Delay Distribution in the 3-tier Federated Cloud, Edge, and Fog Systems

Authors: Ren-Hung Hwang, Yuan-Cheng Lai, Ying-Dar Lin

Abstract: Mobile edge computing and fog computing are promising techniques providing computation service closer to users to achieve lower latency. In this work, we study the optimal offloading strategy in the three-tier federated computation offloading system. We first present queueing models and closed-form solutions for computing the service delay distribution and the probability of the delay of a task exceeding a given threshold. We then propose an optimal offloading probability algorithm based on the sub-gradient method. Our numerical results show that our simulation results match very well with that of our closed-form solutions, and our sub-gradient-based search algorithm can find the optimal offloading probabilities. Specifically, for the given system parameters, our algorithm yields the optimal QoS violating probability of 0.188 with offloading probabilities of 0.675 and 0.37 from Fog to edge and from edge to cloud, respectively.

Submitted 11 July, 2021; originally announced July 2021.

Comments: submitted to IEEE Globecom 2021

arXiv:2005.05968 [pdf, other]

Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations

Authors: Ranggi Hwang, Taehun Kim, Youngeun Kwon, Minsoo Rhu

Abstract: Personalized recommendations are the backbone machine learning (ML) algorithm that powers several important application domains (e.g., ads, e-commerce, etc) serviced from cloud datacenters. Sparse embedding layers are a crucial building block in designing recommendations yet little attention has been paid in properly accelerating this important ML algorithm. This paper first provides a detailed workload characterization on personalized recommendations and identifies two significant performance limiters: memory-intensive embedding layers and compute-intensive multi-layer perceptron (MLP) layers. We then present Centaur, a chiplet-based hybrid sparse-dense accelerator that addresses both the memory throughput challenges of embedding layers and the compute limitations of MLP layers. We implement and demonstrate our proposal on an Intel HARPv2, a package-integrated CPU+FPGA device, which shows a 1.7-17.2x performance speedup and 1.7-19.5x energy-efficiency improvement than conventional approaches.

Submitted 12 May, 2020; originally announced May 2020.

Comments: Accepted for publication at the 47th IEEE/ACM International Symposium on Computer Architecture (ISCA-47), 2020

arXiv:1911.02723 [pdf, ps, other]

Option Compatible Reward Inverse Reinforcement Learning

Authors: Rakhoon Hwang, Han** Lee, Hyung Ju Hwang

Abstract: Reinforcement learning in complex environments is a challenging problem. In particular, the success of reinforcement learning algorithms depends on a well-designed reward function. Inverse reinforcement learning (IRL) solves the problem of recovering reward functions from expert demonstrations. In this paper, we solve a hierarchical inverse reinforcement learning problem within the options framework, which allows us to utilize intrinsic motivation of the expert demonstrations. A gradient method for parametrized options is used to deduce a defining equation for the Q-feature space, which leads to a reward feature space. Using a second-order optimality condition for option parameters, an optimal reward function is selected. Experimental results in both discrete and continuous domains confirm that our recovered rewards provide a solution to the IRL problem using temporal abstraction, which in turn are effective in accelerating transfer learning tasks. We also show that our method is robust to noises contained in expert demonstrations.

Submitted 18 January, 2021; v1 submitted 6 November, 2019; originally announced November 2019.

Comments: This paper is under consideration at Pattern Recognition Letters

arXiv:1903.05470 [pdf, other]

Preventing the attempts of abusing cheap-hosting Web-servers for monetization attacks

Authors: Van-Linh Nguyen, Po-Ching Lin, Ren-Hung Hwang

Abstract: Over the past decades, the web is always one of the most popular targets of hackers. Today, along with the popular usage of open sources such as Wordpress and Joomla, the explosion of the vulnerabilities in such frameworks causes the websites using them to face numerous security threats. Unfortunately, many clients and small companies may not be aware of these serious security threats and call a rescuer only when the website is hacked, compromised, or blocked by the search engines. In this paper, we present an effective counter against such threats, including monetization attempts in the less valuable targets such as small websites.

Submitted 13 March, 2019; v1 submitted 13 March, 2019; originally announced March 2019.

arXiv:1902.02905 [pdf, other]

Mobile Artificial Intelligence Technology for Detecting Macula Edema and Subretinal Fluid on OCT Scans: Initial Results from the DATUM alpha Study

Authors: Stephen G. Odaibo, Mikelson MomPremier, Richard Y. Hwang, Salman J. Yousuf, Steven L. Williams, Joshua Grant

Abstract: Artificial Intelligence (AI) is necessary to address the large and growing deficit in retina and healthcare access globally. And mobile AI diagnostic platforms running in the Cloud may effectively and efficiently distribute such AI capability. Here we sought to evaluate the feasibility of Cloud-based mobile artificial intelligence for detection of retinal disease. And to evaluate the accuracy of a particular such system for detection of subretinal fluid (SRF) and macula edema (ME) on OCT scans. A multicenter retrospective image analysis was conducted in which board-certified ophthalmologists with fellowship training in retina evaluated OCT images of the macula. They noted the presence or absence of ME or SRF, then compared their assessment to that obtained from Fluid Intelligence, a mobile AI app that detects SRF and ME on OCT scans. Investigators consecutively selected retinal OCTs, while making effort to balance the number of scans with retinal fluid and scans without. Exclusion criteria included poor scan quality, ambiguous features, macula holes, retinoschisis, and dense epiretinal membranes. Accuracy in the form of sensitivity and specificity of the AI mobile App was determined by comparing its assessments to those of the retina specialists. At the time of this submission, five centers have completed their initial studies. This consists of a total of 283 OCT scans of which 155 had either ME or SRF ("wet") and 128 did not ("dry"). The sensitivity ranged from 82.5% to 97% with a weighted average of 89.3%. The specificity ranged from 52% to 100% with a weighted average of 81.23%. CONCLUSION: Cloud-based Mobile AI technology is feasible for the detection retinal disease. In particular, Fluid Intelligence (alpha version), is sufficiently accurate as a screening tool for SRF and ME, especially in underserved areas. Further studies and technology development is needed.

Submitted 12 February, 2019; v1 submitted 7 February, 2019; originally announced February 2019.

Comments: Initial results of the DATUM alpha Study were initially presented on August 13th 2018 in the Keynote Address at the 116th National Medical Association Annual Meeting & Scientific Assembly's New Innovations in Ophthalmology Session. The results were also presented on September 21st 2018 in a Podium Lecture during Alumni Day at the University of Michigan--Ann Arbor Kellogg Eye Center

Showing 1–13 of 13 results for author: Hwang, R