
Showing 1–50 of 69 results for author: Krishna, T

Searching in archive cs.
  1. arXiv:2406.19580  [pdf, other]

    cs.AR cs.LG

    FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models

    Authors: Saeed Rashidi, William Won, Sudarshan Srinivasan, Puneet Gupta, Tushar Krishna

    Abstract: Distributed Deep Neural Network (DNN) training is a technique to reduce the training overhead by distributing the training tasks into multiple accelerators, according to a parallelization strategy. However, high-performance compute and interconnects are needed for maximum speed-up and linear scaling of the system. Wafer-scale systems are a promising technology that allows for tightly integrating h…

    Submitted 27 June, 2024; originally announced June 2024.

  2. arXiv:2406.13868  [pdf, other]

    cs.LG cs.AI

    SDQ: Sparse Decomposed Quantization for LLM Inference

    Authors: Geonhwa Jeong, Po-An Tsai, Stephen W. Keckler, Tushar Krishna

    Abstract: Recently, large language models (LLMs) have shown surprising performance in task-specific workloads as well as general tasks with the given prompts. However, to achieve unprecedented performance, recent LLMs use billions to trillions of parameters, which hinder the wide adaptation of those models due to their extremely large compute and memory requirements. To resolve the issue, various model comp…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: Preprint

  3. arXiv:2406.01698  [pdf, other]

    cs.AR cs.AI cs.DC cs.LG

    Demystifying Platform Requirements for Diverse LLM Inference Use Cases

    Authors: Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna

    Abstract: Large language models (LLMs) have shown remarkable performance across a wide range of applications, often outperforming human experts. However, deploying these parameter-heavy models efficiently for diverse inference use cases requires carefully designed hardware platforms with ample computing, memory, and network resources. With LLM deployment scenarios and models evolving at breakneck speed, the…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: 12 Pages, https://github.com/abhibambhaniya/GenZ-LLM-Analyzer

  4. arXiv:2405.13170  [pdf, other]

    cs.AR

    FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching

    Authors: Jianming Tong, Anirudh Itagi, Prasanth Chatarasi, Tushar Krishna

    Abstract: The inference of ML models composed of diverse structures, types, and sizes boils down to the execution of different dataflows (i.e. different tiling, ordering, parallelism, and shapes). Using the optimal dataflow for every layer of workload can reduce latency by up to two orders of magnitude over a suboptimal dataflow. Unfortunately, reconfiguring hardware for different dataflows involves on-chip…

    Submitted 21 May, 2024; originally announced May 2024.

    Comments: 17 pages, 14 figures. International Symposium on Computer Architecture (ISCA), Jun 2024

  5. arXiv:2405.01736  [pdf, other]

    cs.AR

    PipeOrgan: Efficient Inter-operation Pipelining with Flexible Spatial Organization and Interconnects

    Authors: Raveesh Garg, Hyoukjun Kwon, Eric Qin, Yu-Hsin Chen, Tushar Krishna, Liangzhen Lai

    Abstract: Because of the recent trends in Deep Neural Networks (DNN) models being memory-bound, inter-operator pipelining for DNN accelerators is emerging as a promising optimization. Inter-operator pipelining reduces costly on-chip global memory and off-chip memory accesses by forwarding the output of a layer as the input of the next layer within the compute array, which is proven to be an effective optimi…

    Submitted 2 May, 2024; originally announced May 2024.

  6. arXiv:2404.04173  [pdf, other]

    cs.AR cs.LG

    H3DFact: Heterogeneous 3D Integrated CIM for Factorization with Holographic Perceptual Representations

    Authors: Zishen Wan, Che-Kai Liu, Mohamed Ibrahim, Hanchen Yang, Samuel Spetalnick, Tushar Krishna, Arijit Raychowdhury

    Abstract: Disentangling attributes of various sensory signals is central to human-like perception and reasoning and a critical task for higher-order cognitive and neuro-symbolic AI systems. An elegant approach to represent this intricate factorization is via high-dimensional holographic vectors drawing on brain-inspired vector symbolic architectures. However, holographic factorization involves iterative com…

    Submitted 5 April, 2024; originally announced April 2024.

    Comments: 2024 Design Automation and Test in Europe (DATE); The first two authors have equal contributions

  7. arXiv:2404.03216  [pdf, other]

    cs.CR

    Accurate Low-Degree Polynomial Approximation of Non-polynomial Operators for Fast Private Inference in Homomorphic Encryption

    Authors: Jianming Tong, Jingtian Dang, Anupam Golder, Callie Hao, Arijit Raychowdhury, Tushar Krishna

    Abstract: As machine learning (ML) permeates fields like healthcare, facial recognition, and blockchain, the need to protect sensitive data intensifies. Fully Homomorphic Encryption (FHE) allows inference on encrypted data, preserving the privacy of both data and the ML model. However, it slows down non-secure inference by up to five magnitudes, with a root cause of replacing non-polynomial operators (ReLU…

    Submitted 7 May, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

    Comments: Proceedings of the 5th MLSys Conference, Santa Clara, CA, USA, 2024. Copyright 2024 by the author(s)

  8. arXiv:2403.07953  [pdf, other]

    cs.LG cs.AI cs.AR

    Abstracting Sparse DNN Acceleration via Structured Sparse Tensor Decomposition

    Authors: Geonhwa Jeong, Po-An Tsai, Abhimanyu R. Bambhaniya, Stephen W. Keckler, Tushar Krishna

    Abstract: Exploiting sparsity in deep neural networks (DNNs) has been a promising area to meet the growing computation need of modern DNNs. However, in practice, sparse DNN acceleration still faces a key challenge. To minimize the overhead of sparse acceleration, hardware designers have proposed structured sparse hardware support recently, which provides limited flexibility and requires extra model fine-tun…

    Submitted 31 March, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

  9. arXiv:2403.05527  [pdf, other]

    cs.LG cs.AI cs.CL

    GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

    Authors: Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao

    Abstract: Key-value (KV) caching has become the de-facto to accelerate generation speed for large language models (LLMs) inference. However, the growing cache demand with increasing sequence length has transformed LLM inference to be a memory bound problem, significantly constraining the system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods…

    Submitted 11 March, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  10. arXiv:2403.05465  [pdf, other]

    cs.AR cs.AI cs.LG cs.NE

    Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference

    Authors: Akshat Ramachandran, Zishen Wan, Geonhwa Jeong, John Gustafson, Tushar Krishna

    Abstract: Traditional Deep Neural Network (DNN) quantization methods using integer, fixed-point, or floating-point data types struggle to capture diverse DNN parameter distributions at low precision, and often require large silicon overhead and intensive quantization-aware training. In this study, we introduce Logarithmic Posits (LP), an adaptive, hardware-friendly data type inspired by posits that dynamica…

    Submitted 26 March, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

    Comments: 2024 61st IEEE/ACM Design Automation Conference (DAC)

  11. arXiv:2402.04744  [pdf, other]

    cs.LG cs.AR

    Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers

    Authors: Abhimanyu Rajeshkumar Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, Tushar Krishna

    Abstract: N:M Structured sparsity has garnered significant interest as a result of relatively modest overhead and improved efficiency. Additionally, this form of sparsity holds considerable appeal for reducing the memory footprint owing to their modest representation overhead. There have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions (…

    Submitted 7 February, 2024; originally announced February 2024.

    Comments: 18 pages, 8 figures, 17 tables. Code is available at https://github.com/abhibambhaniya/progressive_gradient_flow_nm_sparsity

  12. arXiv:2401.01040  [pdf, other]

    cs.AI cs.AR

    Towards Cognitive AI Systems: a Survey and Prospective on Neuro-Symbolic AI

    Authors: Zishen Wan, Che-Kai Liu, Hanchen Yang, Chaojian Li, Haoran You, Yonggan Fu, Cheng Wan, Tushar Krishna, Yingyan Lin, Arijit Raychowdhury

    Abstract: The remarkable advancements in artificial intelligence (AI), primarily driven by deep neural networks, have significantly impacted various aspects of our lives. However, the current challenges surrounding unsustainable computational trajectories, limited robustness, and a lack of explainability call for the development of next-generation AI systems. Neuro-symbolic AI (NSAI) emerges as a promising…

    Submitted 2 January, 2024; originally announced January 2024.

    Comments: Workshop on Systems for Next-Gen AI Paradigms, 6th Conference on Machine Learning and Systems (MLSys), June 4-8, 2023, Miami, FL, USA

  13. arXiv:2311.16514  [pdf, other]

    cs.CV cs.AI cs.LG

    Video Anomaly Detection via Spatio-Temporal Pseudo-Anomaly Generation : A Unified Approach

    Authors: Ayush K. Rai, Tarun Krishna, Feiyan Hu, Alexandru Drimbarean, Kevin McGuinness, Alan F. Smeaton, Noel E. O'Connor

    Abstract: Video Anomaly Detection (VAD) is an open-set recognition task, which is usually formulated as a one-class classification (OCC) problem, where training data is comprised of videos with normal instances while test data contains both normal and anomalous instances. Recent works have investigated the creation of pseudo-anomalies (PAs) using only the normal data and making strong assumptions about real…

    Submitted 7 April, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: Accepted in CVPRW 2024 - VAND Workshop

  14. arXiv:2306.17266  [pdf, other]

    cs.DC cs.LG

    Subgraph Stationary Hardware-Software Inference Co-Design

    Authors: Payman Behnam, Jianming Tong, Alind Khare, Yangyu Chen, Yue Pan, Pranav Gadikar, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexey Tumanov

    Abstract: A growing number of applications depend on Machine Learning (ML) functionality and benefits from both higher quality ML predictions and better timeliness (latency) at the same time. A growing body of research in computer architecture, ML, and systems software literature focuses on reaching better latency-accuracy tradeoffs for ML models. Efforts include compression, quantization, pruning, early-ex…

    Submitted 21 June, 2023; originally announced June 2023.

    Comments: 16 pages; MLSYS 2023

  15. arXiv:2305.14516  [pdf, other]

    cs.LG cs.DC

    Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

    Authors: Srinivas Sridharan, Taekyung Heo, Louis Feng, Zhaodong Wang, Matt Bergeron, Wenyin Fu, Shengbao Zheng, Brian Coutinho, Saeed Rashidi, Changhai Man, Tushar Krishna

    Abstract: Benchmarking and co-design are essential for driving optimizations and innovation around ML models, ML software, and next-generation hardware. Full workload benchmarks, e.g. MLPerf, play an essential role in enabling fair comparison across different software and hardware stacks especially once systems are fully designed and deployed. However, the pace of AI innovation demands a more agile methodol…

    Submitted 26 May, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

  16. arXiv:2304.05301  [pdf, other]

    cs.DC cs.LG

    TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning

    Authors: William Won, Midhilesh Elavazhagan, Sudarshan Srinivasan, Ajaya Durg, Samvit Kaul, Swati Gupta, Tushar Krishna

    Abstract: The surge of artificial intelligence, specifically large language models, has led to a rapid advent towards the development of large-scale machine learning training clusters. Collective communications within these clusters tend to be heavily bandwidth-bound, necessitating techniques to optimally utilize the available network bandwidth. This puts the routing algorithm for the collective at the fore…

    Submitted 29 March, 2024; v1 submitted 11 April, 2023; originally announced April 2023.

  17. arXiv:2304.03748  [pdf, other]

    cs.LG cs.AI physics.comp-ph physics.data-an

    Perspectives on AI Architectures and Co-design for Earth System Predictability

    Authors: Maruti K. Mudunuru, James A. Ang, Mahantesh Halappanavar, Simon D. Hammond, Maya B. Gokhale, James C. Hoe, Tushar Krishna, Sarat S. Sreepathi, Matthew R. Norman, Ivy B. Peng, Philip W. Jones

    Abstract: Recently, the U.S. Department of Energy (DOE), Office of Science, Biological and Environmental Research (BER), and Advanced Scientific Computing Research (ASCR) programs organized and held the Artificial Intelligence for Earth System Predictability (AI4ESP) workshop series. From this workshop, a critical conclusion that the DOE BER and ASCR community came to is the requirement to develop a new par…

    Submitted 7 April, 2023; originally announced April 2023.

    Comments: 23 pages, 1 figure

  18. arXiv:2303.14006  [pdf, other]

    cs.DC cs.LG

    ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

    Authors: William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna

    Abstract: As deep learning models and input data are scaling at an unprecedented rate, it is inevitable to move towards distributed training platforms to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies, have been actively adopted by emergin…

    Submitted 24 March, 2023; originally announced March 2023.

  19. arXiv:2303.11499  [pdf, other]

    cs.DC cs.AR

    Exploiting Inter-Operation Data Reuse in Scientific Applications using GOGETA

    Authors: Raveesh Garg, Michael Pellauer, Sivasankaran Rajamanickam, Tushar Krishna

    Abstract: HPC applications are critical in various scientific domains ranging from molecular dynamics to chemistry to fluid dynamics. Conjugate Gradient (CG) is a popular application kernel used in iterative linear HPC solvers and has applications in numerous scientific domains. However, the HPCG benchmark shows that the performance achieved by Top500 HPC systems on CG is a small fraction of the performance…

    Submitted 20 March, 2023; originally announced March 2023.

  20. arXiv:2302.08687  [pdf, other]

    cs.AR cs.AI cs.LG

    VEGETA: Vertically-Integrated Extensions for Sparse/Dense GEMM Tile Acceleration on CPUs

    Authors: Geonhwa Jeong, Sana Damani, Abhimanyu Rajeshkumar Bambhaniya, Eric Qin, Christopher J. Hughes, Sreenivas Subramoney, Hyesoon Kim, Tushar Krishna

    Abstract: Deep Learning (DL) acceleration support in CPUs has recently gained a lot of traction, with several companies (Arm, Intel, IBM) announcing products with specialized matrix engines accessible via GEMM instructions. CPUs are pervasive and need to handle diverse requirements across DL workloads running in edge/HPC/cloud platforms. Therefore, as DL workloads embrace sparsity to reduce the computations…

    Submitted 23 February, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

    Comments: This paper is accepted to HPCA 2023

  21. arXiv:2301.10852  [pdf, other]

    cs.AR

    Flexagon: A Multi-Dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing

    Authors: Francisco Muñoz-Martínez, Raveesh Garg, José L. Abellán, Michael Pellauer, Manuel E. Acacio, Tushar Krishna

    Abstract: Sparsity is a growing trend in modern DNN models. Existing Sparse-Sparse Matrix Multiplication (SpMSpM) accelerators are tailored to a particular SpMSpM dataflow (i.e., Inner Product, Outer Product or Gustavsons), that determines their overall efficiency. We demonstrate that this static decision inherently results in a suboptimal dynamic solution. This is because different SpMSpM kernels show vary…

    Submitted 25 January, 2023; originally announced January 2023.

    Comments: To appear on ASPLOS 2023

  22. arXiv:2301.09164  [pdf, other]

    cs.LG cs.CV

    Unifying Synergies between Self-supervised Learning and Dynamic Computation

    Authors: Tarun Krishna, Ayush K Rai, Alexandru Drimbarean, Eric Arazo, Paul Albert, Alan F Smeaton, Kevin McGuinness, Noel E O'Connor

    Abstract: Computationally expensive training strategies make self-supervised learning (SSL) impractical for resource constrained industrial settings. Techniques like knowledge distillation (KD), dynamic computation (DC), and pruning are often used to obtain a lightweight model, which usually involves multiple epochs of fine-tuning (or distilling steps) of a large pre-trained model, making it more computation…

    Submitted 9 September, 2023; v1 submitted 22 January, 2023; originally announced January 2023.

    Comments: Accepted in BMVC 2023

  23. arXiv:2211.16648  [pdf, other]

    cs.DC cs.AI cs.LG

    COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training

    Authors: Divya Kiran Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexandros Daglis

    Abstract: Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization--to amortize their steep cost--is a challenging task requiring careful balance of compute, memory, and network resources. Moreover, a plethora of each model's tuning knobs drastically affect the performance, wi…

    Submitted 14 March, 2024; v1 submitted 29 November, 2022; originally announced November 2022.

  24. arXiv:2211.08675  [pdf, other]

    cs.LG cs.ET

    XRBench: An Extended Reality (XR) Machine Learning Benchmark Suite for the Metaverse

    Authors: Hyoukjun Kwon, Krishnakumar Nair, Jamin Seo, Jason Yik, Debabrata Mohapatra, Dongyuan Zhan, Jinook Song, Peter Capak, Peizhao Zhang, Peter Vajda, Colby Banbury, Mark Mazumder, Liangzhen Lai, Ashish Sirasao, Tushar Krishna, Harshit Khaitan, Vikas Chandra, Vijay Janapa Reddi

    Abstract: Real-time multi-task multi-model (MTMM) workloads, a new form of deep learning inference workloads, are emerging for applications areas like extended reality (XR) to support metaverse use cases. These workloads combine user interactivity with computationally complex machine learning (ML) activities. Compared to standard ML applications, these ML workloads present unique difficulties and constraint…

    Submitted 19 May, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

  25. arXiv:2210.05574  [pdf, other]

    cs.CV cs.AI cs.LG

    Motion Aware Self-Supervision for Generic Event Boundary Detection

    Authors: Ayush K. Rai, Tarun Krishna, Julia Dietlmeier, Kevin McGuinness, Alan F. Smeaton, Noel E. O'Connor

    Abstract: The task of Generic Event Boundary Detection (GEBD) aims to detect moments in videos that are naturally perceived by humans as generic and taxonomy-free event boundaries. Modeling the dynamically evolving temporal and spatial changes in a video makes GEBD a difficult problem to solve. Existing approaches involve very complex and sophisticated pipelines in terms of architectural design choices, hen…

    Submitted 12 October, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: Accepted in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023

  26. arXiv:2210.04578  [pdf, other]

    cs.CV cs.LG

    Is your noise correction noisy? PLS: Robustness to label noise with two stage detection

    Authors: Paul Albert, Eric Arazo, Tarun Krishna, Noel E. O'Connor, Kevin McGuinness

    Abstract: Designing robust algorithms capable of training accurate neural networks on uncurated datasets from the web has been the subject of much research as it reduces the need for time consuming human labor. The focus of many previous research contributions has been on the detection of different types of label noise; however, this paper proposes to improve the correction accuracy of noisy samples once th…

    Submitted 15 October, 2022; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: 9 pages 4 figures. Accepted at WACV 2023

  27. arXiv:2210.03731  [pdf, other]

    cs.LG cs.DC

    Demystifying Map Space Exploration for NPUs

    Authors: Sheng-Chun Kao, Angshuman Parashar, Po-An Tsai, Tushar Krishna

    Abstract: Map Space Exploration is the problem of finding optimized mappings of a Deep Neural Network (DNN) model on an accelerator. It is known to be extremely computationally expensive, and there has been active research looking at both heuristics and learning-based methods to make the problem computationally tractable. However, while there are dozens of mappers out there (all empirically claiming to find…

    Submitted 7 October, 2022; originally announced October 2022.

  28. arXiv:2209.07617  [pdf, other]

    cs.LG cs.AI cs.AR cs.PF

    Training Recipe for N:M Structured Sparsity with Decaying Pruning Mask

    Authors: Sheng-Chun Kao, Amir Yazdanbakhsh, Suvinay Subramanian, Shivani Agrawal, Utku Evci, Tushar Krishna

    Abstract: Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DNNs). Among different categories of sparsity, structured sparsity has gained more attention due to its efficient execution on modern accelerators. Particularly, N:M sparsity is attractive because there are already hardware accelerator architectures that can leverage certain forms of N:M structured sp…

    Submitted 15 September, 2022; originally announced September 2022.

    Comments: 11 pages, 2 figures, and 9 tables. Published at the ICML Workshop on Sparsity in Neural Networks Advancing Understanding and Practice, 2022. First two authors contributed equally

  29. arXiv:2207.12065  [pdf, other]

    cs.CV

    Dynamic Channel Selection in Self-Supervised Learning

    Authors: Tarun Krishna, Ayush K. Rai, Yasser A. D. Djilali, Alan F. Smeaton, Kevin McGuinness, Noel E. O'Connor

    Abstract: Whilst computer vision models built using self-supervised approaches are now commonplace, some important questions remain. Do self-supervised models learn highly redundant channel features? What if a self-supervised network could dynamically select the important channels and get rid of the unnecessary ones? Currently, convnets pre-trained with self-supervision have obtained comparable performance…

    Submitted 16 December, 2022; v1 submitted 25 July, 2022; originally announced July 2022.

    Comments: Accepted in Irish Machine Vision and Image Processing Conference 2022

  30. arXiv:2207.10898  [pdf, other]

    cs.NI cs.AI

    Impact of RoCE Congestion Control Policies on Distributed Training of DNNs

    Authors: Tarannum Khan, Saeed Rashidi, Srinivas Sridharan, Pallavi Shurpali, Aditya Akella, Tushar Krishna

    Abstract: RDMA over Converged Ethernet (RoCE) has gained significant attraction for datacenter networks due to its compatibility with conventional Ethernet-based fabric. However, the RDMA protocol is efficient only on (nearly) lossless networks, emphasizing the vital role of congestion control on RoCE networks. Unfortunately, the native RoCE congestion control scheme, based on Priority Flow Control (PFC), s…

    Submitted 22 July, 2022; originally announced July 2022.

  31. arXiv:2206.02987  [pdf, other]

    cs.AR

    A Formalism of DNN Accelerator Flexibility

    Authors: Sheng-Chun Kao, Hyoukjun Kwon, Michael Pellauer, Angshuman Parashar, Tushar Krishna

    Abstract: The high efficiency of domain-specific hardware accelerators for machine learning (ML) has come from specialization, with the trade-off of less configurability/flexibility. There is growing interest in developing flexible ML accelerators to make them future-proof to the rapid evolution of Deep Neural Networks (DNNs). However, the notion of accelerator flexibility has always been used in an inform…

    Submitted 6 June, 2022; originally announced June 2022.

  32. arXiv:2201.11220  [pdf, other]

    cs.NE cs.AI

    DiGamma: Domain-aware Genetic Algorithm for HW-Mapping Co-optimization for DNN Accelerators

    Authors: Sheng-Chun Kao, Michael Pellauer, Angshuman Parashar, Tushar Krishna

    Abstract: The design of DNN accelerators includes two key parts: HW resource configuration and mapping strategy. Intensive research has been conducted to optimize each of them independently. Unfortunately, optimizing for both together is extremely challenging due to the extremely large cross-coupled search space. To address this, in this paper, we propose a HW-Mapping co-optimization framework, an efficient…

    Submitted 26 January, 2022; originally announced January 2022.

  33. arXiv:2201.11218  [pdf, other]

    cs.LG cs.AI

    DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in DNN Accelerators

    Authors: Sheng-Chun Kao, Xiaoyu Huang, Tushar Krishna

    Abstract: Dataflow/mapping decides the compute and energy efficiency of DNN accelerators. Many mappers have been proposed to tackle the intra-layer map-space. However, mappers for inter-layer map-space (aka layer-fusion map-space), have been rarely discussed. In this work, we propose a mapper, DNNFuser, specifically focusing on this layer-fusion map-space. While existing SOTA DNN mapping explorations rely o…

    Submitted 6 June, 2022; v1 submitted 26 January, 2022; originally announced January 2022.

  34. arXiv:2201.08916  [pdf, other]

    cs.AR

    Enabling Flexibility for Sparse Tensor Acceleration via Heterogeneity

    Authors: Eric Qin, Raveesh Garg, Abhimanyu Bambhaniya, Michael Pellauer, Angshuman Parashar, Sivasankaran Rajamanickam, Cong Hao, Tushar Krishna

    Abstract: Recently, numerous sparse hardware accelerators for Deep Neural Networks (DNNs), Graph Neural Networks (GNNs), and scientific computing applications have been proposed. A common characteristic among all of these accelerators is that they target tensor algebra (typically matrix multiplications); yet dozens of new accelerators are proposed for every new application. The motivation is that the size a…

    Submitted 21 January, 2022; originally announced January 2022.

  35. arXiv:2110.04478  [pdf, other]

    cs.DC cs.AR cs.LG cs.NI

    Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

    Authors: Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, Tushar Krishna

    Abstract: Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activation, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multi-dimensional n…

    Submitted 7 July, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

  36. arXiv:2110.01752  [pdf, other]

    cs.AR cs.AI cs.LG

    RASA: Efficient Register-Aware Systolic Array Matrix Engine for CPU

    Authors: Geonhwa Jeong, Eric Qin, Ananda Samajdar, Christopher J. Hughes, Sreenivas Subramoney, Hyesoon Kim, Tushar Krishna

    Abstract: As AI-based applications become pervasive, CPU vendors are starting to incorporate matrix engines within the datapath to boost efficiency. Systolic arrays have been the premier architectural choice as matrix engines in offload accelerators. However, we demonstrate that incorporating them inside CPUs can introduce under-utilization and stalls due to limited register storage to amortize the fill and…

    Submitted 4 October, 2021; originally announced October 2021.

    Comments: This paper is accepted to DAC 2021

  37. LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models

    Authors: William Won, Saeed Rashidi, Sudarshan Srinivasan, Tushar Krishna

    Abstract: As model sizes in machine learning continue to scale, distributed training is necessary to accommodate model weights within each device and to reduce training time. However, this comes with the expense of increased communication overhead due to the exchange of gradients and activations, which become the critical bottleneck of the end-to-end training process. In this work, we motivate the design of…

    Submitted 5 May, 2024; v1 submitted 24 September, 2021; originally announced September 2021.

    Comments: Contains 10 main pages, 21 figures, 3 tables

    Journal ref: Proceedings of the 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS '24)

  38. arXiv:2109.07419  [pdf, other]

    cs.AR cs.DC cs.LG

    Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators

    Authors: Geonhwa Jeong, Gokcen Kestor, Prasanth Chatarasi, Angshuman Parashar, Po-An Tsai, Sivasankaran Rajamanickam, Roberto Gioiosa, Tushar Krishna

    Abstract: To meet the extreme compute demands for deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these "domain-specific" accelerators are not fully programmable like CPUs and GPUs, they retain varying levels of flexibility with respect to data orchestration, i.e., dataflow and tiling optimizations to enhance efficiency. There are s…

    Submitted 6 November, 2021; v1 submitted 15 September, 2021; originally announced September 2021.

    Comments: This paper is accepted to PACT 2021

  39. arXiv:2108.08295  [pdf, other]

    cs.LG cs.AI cs.AR

    AIRCHITECT: Learning Custom Architecture Design and Mapping Space

    Authors: Ananda Samajdar, Jan Moritz Joseph, Matthew Denton, Tushar Krishna

    Abstract: Design space exploration is an important but costly step involved in the design/deployment of custom architectures to squeeze out maximum possible performance and energy efficiency. Conventionally, optimizations require iterative sampling of the design space using simulation or heuristic tools. In this paper we investigate the possibility of learning the optimization task using machine learning an…

    Submitted 16 August, 2021; originally announced August 2021.

  40. arXiv:2107.06419  [pdf, other]

    cs.LG cs.AR

    FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks

    Authors: Sheng-Chun Kao, Suvinay Subramanian, Gaurav Agrawal, Amir Yazdanbakhsh, Tushar Krishna

    Abstract: Attention mechanisms, primarily designed to capture pairwise correlations between words, have become the backbone of machine learning, expanding beyond natural language processing into other domains. This growth in adaptation comes at the cost of prohibitively large memory requirements and computational complexity, especially at higher number of input elements. This limitation is due to inherently… ▽ More

    Submitted 23 September, 2022; v1 submitted 13 July, 2021; originally announced July 2021.

  41. arXiv:2106.10499  [pdf, other

    cs.DC cs.AI cs.AR

    Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication

    Authors: Gordon E. Moon, Hyoukjun Kwon, Geonhwa Jeong, Prasanth Chatarasi, Sivasankaran Rajamanickam, Tushar Krishna

    Abstract: There is a growing interest in custom spatial accelerators for machine learning applications. These accelerators employ a spatial array of processing elements (PEs) interacting via custom buffer hierarchies and networks-on-chip. The efficiency of these accelerators comes from employing optimized dataflow (i.e., spatial/temporal partitioning of data across the PEs and fine-grained scheduling) strat… ▽ More

    Submitted 19 June, 2021; originally announced June 2021.

  42. arXiv:2106.10090  [pdf, other

    cs.CV cs.AI

    Discerning Generic Event Boundaries in Long-Form Wild Videos

    Authors: Ayush K Rai, Tarun Krishna, Julia Dietlmeier, Kevin McGuinness, Alan F Smeaton, Noel E O'Connor

    Abstract: Detecting generic, taxonomy-free event boundaries in videos represents a major stride forward towards holistic video understanding. In this paper we present a technique for generic event boundary detection based on a two stream inflated 3D convolutions architecture, which can learn spatio-temporal features from videos. Our work is inspired from the Generic Event Boundary Detection Challenge (part of… ▽ More

    Submitted 18 June, 2021; originally announced June 2021.

    Comments: Technical Report for Generic Event Boundary Challenge - LOVEU Challenge (CVPR 2021)

  43. Evaluating Contrastive Models for Instance-based Image Retrieval

    Authors: Tarun Krishna, Kevin McGuinness, Noel O'Connor

    Abstract: In this work, we evaluate contrastive models for the task of image retrieval. We hypothesise that models that are learned to encode semantic similarity among instances via discriminative learning should perform well on the task of image retrieval, where relevancy is defined in terms of instances of the same object. Through our extensive evaluation, we find that representations from models trained… ▽ More

    Submitted 30 April, 2021; originally announced April 2021.

    Comments: Accepted In Proceedings of the 2021 International Conference on Multimedia Retrieval (ICMR 21)

  44. arXiv:2104.13997  [pdf, other

    cs.AR cs.AI

    MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores

    Authors: Sheng-Chun Kao, Tushar Krishna

    Abstract: As Deep Learning continues to drive a variety of applications in edge and cloud data centers, there is a growing trend towards building large accelerators with several sub-accelerator cores/chiplets. This work looks at the problem of supporting multi-tenancy on such accelerators. In particular, we focus on the problem of mapping jobs from several DNNs simultaneously on an accelerator. Given the ex… ▽ More

    Submitted 26 January, 2022; v1 submitted 28 April, 2021; originally announced April 2021.

  45. arXiv:2103.10452  [pdf

    cs.DC

    Extending Sparse Tensor Accelerators to Support Multiple Compression Formats

    Authors: Eric Qin, Geonhwa Jeong, William Won, Sheng-Chun Kao, Hyoukjun Kwon, Sudarshan Srinivasan, Dipankar Das, Gordon E. Moon, Sivasankaran Rajamanickam, Tushar Krishna

    Abstract: Sparsity, which occurs in both scientific applications and Deep Learning (DL) models, has been a key target of optimization within recent ASIC accelerators due to the potential memory and compute savings. These applications use data stored in a variety of compression formats. We demonstrate that both the compactness of different compression formats and the compute efficiency of the algorithms enab… ▽ More

    Submitted 18 March, 2021; originally announced March 2021.

    Comments: Accepted for publication at the 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021)

  46. arXiv:2103.07977  [pdf, other

    cs.DC cs.AR

    Understanding the Design-Space of Sparse/Dense Multiphase GNN dataflows on Spatial Accelerators

    Authors: Raveesh Garg, Eric Qin, Francisco Muñoz-Martínez, Robert Guirado, Akshay Jain, Sergi Abadal, José L. Abellán, Manuel E. Acacio, Eduard Alarcón, Sivasankaran Rajamanickam, Tushar Krishna

    Abstract: Graph Neural Networks (GNNs) have garnered a lot of recent interest because of their success in learning representations from graph-structured data across several critical applications in cloud and HPC. Owing to their unique compute and memory characteristics that come from an interplay between dense and sparse phases of computations, the emergence of reconfigurable dataflow (aka spatial) accelera… ▽ More

    Submitted 6 March, 2022; v1 submitted 14 March, 2021; originally announced March 2021.

    Comments: Accepted for publication at the 36th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2022)

  47. arXiv:2101.04799  [pdf, other

    cs.AR cs.LG

    Self-Adaptive Reconfigurable Arrays (SARA): Using ML to Assist Scaling GEMM Acceleration

    Authors: Ananda Samajdar, Michael Pellauer, Tushar Krishna

    Abstract: With increasing diversity in Deep Neural Network (DNN) models in terms of layer shapes and sizes, the research community has been investigating flexible/reconfigurable accelerator substrates. This line of research has opened up two challenges. The first is to determine the appropriate amount of flexibility within an accelerator array that can trade-off the performance benefits versus the area… ▽ More

    Submitted 23 April, 2022; v1 submitted 12 January, 2021; originally announced January 2021.

  48. arXiv:2012.12563  [pdf, other

    cs.AR

    Architecture, Dataflow and Physical Design Implications of 3D-ICs for DNN-Accelerators

    Authors: Jan Moritz Joseph, Ananda Samajdar, Lingjun Zhu, Rainer Leupers, Sung-Kyu Lim, Thilo Pionteck, Tushar Krishna

    Abstract: The everlasting demand for higher computing power for deep neural networks (DNNs) drives the development of parallel computing architectures. 3D integration, in which chips are integrated and connected vertically, can further increase performance because it introduces another level of spatial parallelism. Therefore, we analyze dataflows, performance, area, power and temperature of such 3D-DNN-acce… ▽ More

    Submitted 18 February, 2021; v1 submitted 23 December, 2020; originally announced December 2020.

  49. arXiv:2011.14755  [pdf, other

    cs.AR

    Dataflow-Architecture Co-Design for 2.5D DNN Accelerators using Wireless Network-on-Package

    Authors: Robert Guirado, Hyoukjun Kwon, Sergi Abadal, Eduard Alarcón, Tushar Krishna

    Abstract: Deep neural network (DNN) models continue to grow in size and complexity, demanding higher computational power to enable real-time inference. To efficiently deliver such computational demands, hardware accelerators are being developed and deployed across scales. This naturally requires an efficient scale-out mechanism for increasing compute density as required by the application. 2.5D integration… ▽ More

    Submitted 30 November, 2020; originally announced November 2020.

    Comments: ASPDAC '21

  50. arXiv:2009.02010  [pdf, other

    cs.AR cs.LG eess.SP

    ConfuciuX: Autonomous Hardware Resource Assignment for DNN Accelerators using Reinforcement Learning

    Authors: Sheng-Chun Kao, Geonhwa Jeong, Tushar Krishna

    Abstract: DNN accelerators provide efficiency by leveraging reuse of activations/weights/outputs during the DNN computations to reduce data movement from DRAM to the chip. The reuse is captured by the accelerator's dataflow. While there has been significant prior work in exploring and comparing various dataflows, the strategy for assigning on-chip hardware resources (i.e., compute and memory) given a datafl… ▽ More

    Submitted 4 September, 2020; originally announced September 2020.