Showing 1–50 of 60 results for author: Hwu, W

Searching in archive cs.
  1. arXiv:2403.04690

    cs.CV cs.AI cs.LG

    Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level

    Authors: Ali Hassani, Wen-Mei Hwu, Humphrey Shi

    Abstract: Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infra…

    Submitted 22 March, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

    Comments: Project page: https://github.com/SHI-Labs/NATTEN

  2. RackBlox: A Software-Defined Rack-Scale Storage System with Network-Storage Co-Design

    Authors: Benjamin Reidys, Yuqi Xue, Daixuan Li, Bharat Sukhwani, Wen-mei Hwu, Deming Chen, Sameh Asaad, Jian Huang

    Abstract: Software-defined networking (SDN) and software-defined flash (SDF) have been serving as the backbone of modern data centers. They are managed separately to handle I/O requests. At first glance, this is a reasonable design by following the rack-scale hierarchical design principles. However, it suffers from suboptimal end-to-end performance, due to the lack of coordination between SDN and SDF. In…

    Submitted 12 September, 2023; originally announced September 2023.

    Comments: 14 pages. Published in the ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP'23)

  3. arXiv:2307.03760

    cs.DC

    CODAG: Characterizing and Optimizing Decompression Algorithms for GPUs

    Authors: Jeongmin Park, Zaid Qureshi, Vikram Mailthody, Andrew Gacek, Shunfan Shao, Mohammad AlMasri, Isaac Gelado, Jinjun Xiong, Chris Newburn, I-hsin Chung, Michael Garland, Nikolay Sakharnykh, Wen-mei Hwu

    Abstract: Data compression and decompression have become vital components of big-data applications to manage the exponential growth in the amount of data collected and stored. Furthermore, big-data applications have increasingly adopted GPUs due to their high compute throughput and memory bandwidth. Prior works presume that decompression is memory-bound and have dedicated most of the GPU's threads to data m…

    Submitted 7 July, 2023; originally announced July 2023.

  4. arXiv:2306.16384

    cs.DC cs.AI cs.AR cs.LG

    Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses

    Authors: Jeongmin Brian Park, Vikram Sharma Mailthody, Zaid Qureshi, Wen-mei Hwu

    Abstract: Graph Neural Networks (GNNs) are emerging as a powerful tool for learning from graph-structured data and performing sophisticated inference tasks in various application domains. Although GNNs have been shown to be effective on modest-sized graphs, training them on large-scale graphs remains a significant challenge due to lack of efficient data access and data movement methods. Existing frameworks…

    Submitted 6 March, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

    Comments: Under Submission. Source code: https://github.com/jeongminpark417/GIDS

  5. arXiv:2302.13522

    cs.LG cs.AI cs.DC cs.IR

    IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research

    Authors: Arpandeep Khatua, Vikram Sharma Mailthody, Bhagyashree Taleka, Tengfei Ma, Xiang Song, Wen-mei Hwu

    Abstract: Graph neural networks (GNNs) have shown high potential for a variety of real-world, challenging applications, but one of the major obstacles in GNN research is the lack of large-scale flexible datasets. Most existing public datasets for GNNs are relatively small, which limits the ability of GNNs to generalize to unseen data. The few existing large-scale graph datasets provide very limited labeled…

    Submitted 21 June, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

    Comments: Accepted at the KDD'23 conference. This is the final preprint version

    Journal ref: KDD 2023

  6. Hector: An Efficient Programming and Compilation Framework for Implementing Relational Graph Neural Networks in GPU Architectures

    Authors: Kun Wu, Mert Hidayetoğlu, Xiang Song, Sitao Huang, Da Zheng, Israt Nisa, Wen-mei Hwu

    Abstract: Relational graph neural networks (RGNNs) are graph neural networks with dedicated structures for modeling the different types of nodes and edges in heterogeneous graphs. While RGNNs have been increasingly adopted in many real-world applications due to their versatility and accuracy, they pose performance and system design challenges: inherent memory-intensive computation patterns, the gap between…

    Submitted 9 April, 2024; v1 submitted 16 January, 2023; originally announced January 2023.

    Comments: Accepted by ASPLOS

    ACM Class: D.1.3; D.2.11; I.2

  7. arXiv:2212.01473

    cs.DC

    Parallelizing Maximal Clique Enumeration on GPUs

    Authors: Mohammad Almasri, Yen-Hsiang Chang, Izzat El Hajj, Rakesh Nagi, Jinjun Xiong, Wen-mei Hwu

    Abstract: We present a GPU solution for exact maximal clique enumeration (MCE) that performs a search tree traversal following the Bron-Kerbosch algorithm. Prior works on parallelizing MCE on GPUs perform a breadth-first traversal of the tree, which has limited scalability because of the explosion in the number of tree nodes at deep levels. We propose to parallelize MCE on GPUs by performing depth-first tra…

    Submitted 25 October, 2023; v1 submitted 2 December, 2022; originally announced December 2022.

  8. arXiv:2211.04194

    cs.IR cs.AI

    Submission-Aware Reviewer Profiling for Reviewer Recommender System

    Authors: Omer Anjum, Alok Kamatar, Toby Liang, Jinjun Xiong, Wen-mei Hwu

    Abstract: Assigning qualified, unbiased and interested reviewers to paper submissions is vital for maintaining the integrity and quality of the academic publishing system and providing valuable reviews to authors. However, matching thousands of submissions with thousands of potential reviewers within a limited time is a daunting challenge for a conference program committee. Prior efforts based on topic mode…

    Submitted 8 November, 2022; originally announced November 2022.

  9. arXiv:2210.05159

    cs.CL cs.AI

    Can Language Models Be Specific? How?

    Authors: Jie Huang, Kevin Chen-Chuan Chang, Jinjun Xiong, Wen-mei Hwu

    Abstract: "He is a person", "Paris is located on the earth". Both statements are correct but meaningless - due to lack of specificity. In this paper, we propose to measure how specific the language of pre-trained language models (PLMs) is. To achieve this, we introduce a novel approach to build a benchmark for specificity testing by forming masked token prediction tasks with prompts. For instance, given "To…

    Submitted 26 May, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: Findings of ACL 2023

  10. arXiv:2205.10479

    cs.CL cs.AI

    DEER: Descriptive Knowledge Graph for Explaining Entity Relationships

    Authors: Jie Huang, Kerui Zhu, Kevin Chen-Chuan Chang, Jinjun Xiong, Wen-mei Hwu

    Abstract: We propose DEER (Descriptive Knowledge Graph for Explaining Entity Relationships) - an open and informative form of modeling entity relationships. In DEER, relationships between entities are represented by free-text relation descriptions. For instance, the relationship between entities of machine learning and algorithm can be represented as "Machine learning explores the study and construction of…

    Submitted 20 October, 2022; v1 submitted 20 May, 2022; originally announced May 2022.

    Comments: Accepted to EMNLP 2022

  11. arXiv:2203.04910

    cs.DC cs.AR cs.OS cs.PF

    GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture

    Authors: Zaid Qureshi, Vikram Sharma Mailthody, Isaac Gelado, Seung Won Min, Amna Masood, Jeongmin Park, Jinjun Xiong, CJ Newburn, Dmitri Vainbrand, I-Hsin Chung, Michael Garland, William Dally, Wen-mei Hwu

    Abstract: Graphics Processing Units (GPUs) have traditionally relied on the host CPU to initiate access to the data storage. This approach is well-suited for GPU applications with known data access patterns that enable partitioning of their dataset to be processed in a pipelined fashion in the GPU. However, emerging applications such as graph and data analytics, recommender systems, or graph neural networks…

    Submitted 6 February, 2023; v1 submitted 9 March, 2022; originally announced March 2022.

    Comments: This is an extension to the published conference paper at ASPLOS'23: https://dl.acm.org/doi/abs/10.1145/3575693.3575748

    Journal ref: ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2

  12. arXiv:2201.02789

    cs.DC cs.AR

    A Compiler Framework for Optimizing Dynamic Parallelism on GPUs

    Authors: Mhd Ghaith Olabi, Juan Gómez Luna, Onur Mutlu, Wen-mei Hwu, Izzat El Hajj

    Abstract: Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted beforehand. However, prior works have shown that dynamic parallelism may impose a high performance penalty when a large number of small grids are launched. The large num…

    Submitted 8 January, 2022; originally announced January 2022.

  13. arXiv:2111.07267

    cs.CL cs.AI

    Understanding Jargon: Combining Extraction and Generation for Definition Modeling

    Authors: Jie Huang, Hanyin Shao, Kevin Chen-Chuan Chang, Jinjun Xiong, Wen-mei Hwu

    Abstract: Can machines know what twin prime is? From the composition of this phrase, machines may guess twin prime is a certain kind of prime, but it is still difficult to deduce exactly what twin stands for without additional knowledge. Here, twin prime is a jargon - a specialized term used by experts in a particular field. Explaining jargon is challenging since it usually requires domain knowledge to unde…

    Submitted 20 October, 2022; v1 submitted 14 November, 2021; originally announced November 2021.

    Comments: Accepted to EMNLP 2022

  14. arXiv:2111.05894

    cs.LG

    Graph Neural Network Training with Data Tiering

    Authors: Seung Won Min, Kun Wu, Mert Hidayetoğlu, Jinjun Xiong, Xiang Song, Wen-mei Hwu

    Abstract: Graph Neural Networks (GNNs) have shown success in learning from graph-structured data, with applications to fraud detection, recommendation, and knowledge graph reasoning. However, training GNN efficiently is challenging because: 1) GPU memory capacity is limited and can be insufficient for large datasets, and 2) the graph-based data structure causes irregular data access patterns. In this work,…

    Submitted 10 November, 2021; originally announced November 2021.

  15. arXiv:2111.05231

    cs.LG

    MLHarness: A Scalable Benchmarking System for MLCommons

    Authors: Yen-Hsiang Chang, Jianhao Pu, Wen-mei Hwu, Jinjun Xiong

    Abstract: With the society's growing adoption of machine learning (ML) and deep learning (DL) for various intelligent solutions, it becomes increasingly imperative to standardize a common set of measures for ML/DL models with large scale open datasets under common development practices and resources so that people can benchmark and compare models quality and performance on a common ground. MLCommons has eme…

    Submitted 9 November, 2021; originally announced November 2021.

  16. arXiv:2108.09241

    cs.CL cs.AI

    Open Relation Modeling: Learning to Define Relations between Entities

    Authors: Jie Huang, Kevin Chen-Chuan Chang, Jinjun Xiong, Wen-mei Hwu

    Abstract: Relations between entities can be represented by different instances, e.g., a sentence containing both entities or a fact in a Knowledge Graph (KG). However, these instances may not well capture the general relations between entities, may be difficult to understand by humans, even may not be found due to the incompleteness of the knowledge source. In this paper, we introduce the Open Relation Mode…

    Submitted 2 March, 2022; v1 submitted 20 August, 2021; originally announced August 2021.

    Comments: Accepted to Findings of ACL 2022

  17. arXiv:2105.13255

    cs.CL cs.LG

    Measuring Fine-Grained Domain Relevance of Terms: A Hierarchical Core-Fringe Approach

    Authors: Jie Huang, Kevin Chen-Chuan Chang, Jinjun Xiong, Wen-mei Hwu

    Abstract: We propose to measure fine-grained domain relevance - the degree that a term is relevant to a broad (e.g., computer science) or narrow (e.g., deep learning) domain. Such measurement is crucial for many downstream tasks in natural language processing. To handle long-tail terms, we build a core-anchored semantic graph, which uses core terms with rich description information to bridge the vast remain…

    Submitted 27 May, 2021; originally announced May 2021.

    Comments: Accepted to ACL 2021

  18. arXiv:2104.14082

    cs.CV

    Pseudo-IoU: Improving Label Assignment in Anchor-Free Object Detection

    Authors: Jiachen Li, Bowen Cheng, Rogerio Feris, Jinjun Xiong, Thomas S. Huang, Wen-Mei Hwu, Humphrey Shi

    Abstract: Current anchor-free object detectors are quite simple and effective yet lack accurate label assignment methods, which limits their potential in competing with classic anchor-based models that are supported by well-designed assignment methods based on the Intersection-over-Union (IoU) metric. In this paper, we present Pseudo-Intersection-over-Union (Pseudo-IoU): a simple metric that brings…

    Submitted 28 April, 2021; originally announced April 2021.

    Comments: CVPR 2021 Workshop

  19. arXiv:2104.13209

    cs.DC cs.DS

    Parallel K-Clique Counting on GPUs

    Authors: Mohammad Almasri, Izzat El Hajj, Rakesh Nagi, Jinjun Xiong, Wen-mei Hwu

    Abstract: Counting k-cliques in a graph is an important problem in graph analysis with many applications such as community detection and graph partitioning. Counting k-cliques is typically done by traversing search trees starting at each vertex in the graph. Parallelizing k-clique counting has been well-studied on CPUs and many solutions exist. However, there are no performant solutions for k-clique countin…

    Submitted 6 June, 2022; v1 submitted 27 April, 2021; originally announced April 2021.

  20. arXiv:2103.03330

    cs.LG

    Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture

    Authors: Seung Won Min, Kun Wu, Sitao Huang, Mert Hidayetoğlu, Jinjun Xiong, Eiman Ebrahimi, Deming Chen, Wen-mei Hwu

    Abstract: Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems. Training GCN requires the minibatch generator traversing graphs and sampling the sparsely located neighboring nodes to obtain their features. Since real-world graphs often exceed the capacity of GPU memory, current GCN training systems keep the feature table in host memory and rely on the C…

    Submitted 14 August, 2021; v1 submitted 4 March, 2021; originally announced March 2021.

    Comments: Paper accepted for PVLDB Vol 14

  21. arXiv:2101.07956

    cs.LG cs.PF

    PyTorch-Direct: Enabling GPU Centric Data Access for Very Large Graph Neural Network Training with Irregular Accesses

    Authors: Seung Won Min, Kun Wu, Sitao Huang, Mert Hidayetoğlu, Jinjun Xiong, Eiman Ebrahimi, Deming Chen, Wen-mei Hwu

    Abstract: With the increasing adoption of graph neural networks (GNNs) in the machine learning community, GPUs have become an essential tool to accelerate GNN training. However, training GNNs on very large graphs that do not fit in GPU memory is still a challenging task. Unlike conventional neural networks, mini-batching input samples in GNNs requires complicated tasks such as traversing neighboring nodes a…

    Submitted 19 January, 2021; originally announced January 2021.

  22. arXiv:2101.07897

    cs.CR cs.CY

    Safer Illinois and RokWall: Privacy Preserving University Health Apps for COVID-19

    Authors: Vikram Sharma Mailthody, James Wei, Nicholas Chen, Mohammad Behnia, Ruihao Yao, Qihao Wang, Vedant Agrawal, Churan He, Lijian Wang, Leihao Chen, Amit Agarwal, Edward Richter, Wen-Mei Hwu, Christopher W. Fletcher, Jinjun Xiong, Andrew Miller, Sanjay Patel

    Abstract: COVID-19 has fundamentally disrupted the way we live. Government bodies, universities, and companies worldwide are rapidly developing technologies to combat the COVID-19 pandemic and safely reopen society. Essential analytics tools such as contact tracing, super-spreader event detection, and exposure mapping require collecting and analyzing sensitive user information. The increasing use of such po…

    Submitted 17 March, 2021; v1 submitted 19 January, 2021; originally announced January 2021.

    Comments: Appears in the Workshop on Secure IT Technologies against COVID-19 (CoronaDef) 2021

  23. arXiv:2012.14363

    cs.DC

    TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-aware Datatypes

    Authors: Carl Pearson, Kun Wu, I-Hsin Chung, Jinjun Xiong, Wen-Mei Hwu

    Abstract: MPI derived datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations has encouraged the transition of distributed high-performance MPI codes to use GPUs. Such implement…

    Submitted 20 April, 2021; v1 submitted 28 December, 2020; originally announced December 2020.

    Comments: 12 pages

  24. arXiv:2011.11603

    cs.CV cs.AI cs.CL cs.LG

    Interpretable Visual Reasoning via Induced Symbolic Space

    Authors: Zhonghao Wang, Kai Wang, Mo Yu, Jinjun Xiong, Wen-mei Hwu, Mark Hasegawa-Johnson, Humphrey Shi

    Abstract: We study the problem of concept induction in visual reasoning, i.e., identifying concepts and their hierarchical relationships from question-answer pairs associated with images; and achieve an interpretable model via working on the induced symbolic concept space. To this end, we first design a new framework named object-centric compositional attention model (OCCAM) to perform the visual reasoning…

    Submitted 24 August, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

    Comments: ICCV 2021

  25. arXiv:2010.07185

    cs.AR cs.LG

    Effective Algorithm-Accelerator Co-design for AI Solutions on Edge Devices

    Authors: Cong Hao, Yao Chen, Xiaofan Zhang, Yuhong Li, Jinjun Xiong, Wen-mei Hwu, Deming Chen

    Abstract: High quality AI solutions require joint optimization of AI algorithms, such as deep neural networks (DNNs), and their hardware accelerators. To improve the overall solution quality as well as to boost the design productivity, efficient algorithm and accelerator co-design methodologies are indispensable. In this paper, we first discuss the motivations and challenges for the Algorithm/Accelerator co…

    Submitted 15 October, 2020; v1 submitted 14 October, 2020; originally announced October 2020.

    Comments: GLSVLSI, September 7-9, 2020

  26. arXiv:2010.01898

    cs.CL cs.AI

    Exploring Semantic Capacity of Terms

    Authors: Jie Huang, Zilong Wang, Kevin Chen-Chuan Chang, Wen-mei Hwu, Jinjun Xiong

    Abstract: We introduce and study semantic capacity of terms. For example, the semantic capacity of artificial intelligence is higher than that of linear regression since artificial intelligence possesses a broader meaning scope. Understanding semantic capacity of terms will help many downstream tasks in natural language processing. For this purpose, we propose a two-step model to investigate semantic capaci…

    Submitted 5 October, 2020; originally announced October 2020.

    Comments: Accepted to EMNLP 2020

  27. arXiv:2009.07226

    cs.DC

    Petascale XCT: 3D Image Reconstruction with Hierarchical Communications on Multi-GPU Nodes

    Authors: Mert Hidayetoglu, Tekin Bicer, Simon Garcia de Gonzalo, Bin Ren, Vincent De Andrade, Doga Gursoy, Raj Kettimuthu, Ian T. Foster, Wen-mei W. Hwu

    Abstract: X-ray computed tomography is a commonly used technique for noninvasive imaging at synchrotron facilities. Iterative tomographic reconstruction algorithms are often preferred for recovering high quality 3D volumetric images from 2D X-ray images, however, their use has been limited to small/medium datasets due to their computational requirements. In this paper, we propose a high-performance iterativ…

    Submitted 15 September, 2020; originally announced September 2020.

  28. arXiv:2008.12745

    cs.AR

    DNNExplorer: A Framework for Modeling and Exploring a Novel Paradigm of FPGA-based DNN Accelerator

    Authors: Xiaofan Zhang, Hanchen Ye, Junsong Wang, Yonghua Lin, Jinjun Xiong, Wen-mei Hwu, Deming Chen

    Abstract: Existing FPGA-based DNN accelerators typically fall into two design paradigms. Either they adopt a generic reusable architecture to support different DNN networks but leave some performance and efficiency on the table because of the sacrifice of design specificity. Or they apply a layer-wise tailor-made architecture to optimize layer-specific demands for computation and resources but lose the sca…

    Submitted 23 March, 2021; v1 submitted 28 August, 2020; originally announced August 2020.

    Comments: Published as a conference paper at International Conference on Computer Aided Design 2020 (ICCAD'20)

  29. arXiv:2008.10169

    cs.AR cs.DC cs.PF

    Tearing Down the Memory Wall

    Authors: Zaid Qureshi, Vikram Sharma Mailthody, Seung Won Min, I-Hsin Chung, Jinjun Xiong, Wen-mei Hwu

    Abstract: We present a vision for the Erudite architecture that redefines the compute and memory abstractions such that memory bandwidth and capacity become first-class citizens along with compute throughput. In this architecture, we envision coupling a high-density, massively parallel memory technology like Flash with programmable near-data accelerators, like the streaming multiprocessors in modern GPUs. E…

    Submitted 23 August, 2020; originally announced August 2020.

    Comments: SRC Techcon 2020 paper. Discusses vision of GPU-Centric architecture, Erudite

  30. arXiv:2007.14152

    cs.DC cs.LG

    At-Scale Sparse Deep Neural Network Inference with Efficient GPU Implementation

    Authors: Mert Hidayetoglu, Carl Pearson, Vikram Sharma Mailthody, Eiman Ebrahimi, Jinjun Xiong, Rakesh Nagi, Wen-Mei Hwu

    Abstract: This paper presents GPU performance optimization and scaling results for inference models of the Sparse Deep Neural Network Challenge 2020. Demands for network quality have increased rapidly, pushing the size and thus the memory requirements of many neural networks beyond the capacity of available accelerators. Sparse deep neural networks (SpDNN) have shown promise for reining in the memory footpr…

    Submitted 2 September, 2020; v1 submitted 28 July, 2020; originally announced July 2020.

    Comments: 7 pages

    Journal ref: High Performance Extreme Computing (2020)

  31. arXiv:2006.06890

    cs.DC cs.DB

    EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal In GPUs

    Authors: Seung Won Min, Vikram Sharma Mailthody, Zaid Qureshi, Jinjun Xiong, Eiman Ebrahimi, Wen-mei Hwu

    Abstract: Modern analytics and recommendation systems are increasingly based on graph data that capture the relations between entities being analyzed. Practical graphs come in huge sizes, offer massive parallelism, and are stored in sparse-matrix formats such as CSR. To exploit the massive parallelism, developers are increasingly interested in using GPUs for graph traversal. However, due to their sizes, gra…

    Submitted 14 January, 2021; v1 submitted 11 June, 2020; originally announced June 2020.

  32. arXiv:2005.02563

    cs.LG

    EDD: Efficient Differentiable DNN Architecture and Implementation Co-search for Embedded AI Solutions

    Authors: Yuhong Li, Cong Hao, Xiaofan Zhang, Xinheng Liu, Yao Chen, Jinjun Xiong, Wen-mei Hwu, Deming Chen

    Abstract: High quality AI solutions require joint optimization of AI algorithms and their hardware implementations. In this work, we are the first to propose a fully simultaneous, efficient differentiable DNN architecture and implementation co-search (EDD) methodology. We formulate the co-search problem by fusing DNN search variables and hardware implementation variables into one solution space, and maximiz…

    Submitted 5 May, 2020; originally announced May 2020.

    Comments: Accepted by Design Automation Conference (DAC'2020)

  33. arXiv:2004.00794

    cs.CV cs.LG

    Alleviating Semantic-level Shift: A Semi-supervised Domain Adaptation Method for Semantic Segmentation

    Authors: Zhonghao Wang, Yunchao Wei, Rogerio Feris, Jinjun Xiong, Wen-Mei Hwu, Thomas S. Huang, Humphrey Shi

    Abstract: Learning segmentation from synthetic data and adapting to real data can significantly relieve human efforts in labelling pixel-level masks. A key challenge of this task is how to alleviate the data distribution discrepancy between the source and target domains, i.e. reducing domain shift. The common approach to this problem is to minimize the discrepancy between feature distributions from differen…

    Submitted 9 June, 2020; v1 submitted 1 April, 2020; originally announced April 2020.

    Comments: CVPRW 2020

  34. arXiv:2003.08040

    cs.CV cs.LG eess.IV

    Differential Treatment for Stuff and Things: A Simple Unsupervised Domain Adaptation Method for Semantic Segmentation

    Authors: Zhonghao Wang, Mo Yu, Yunchao Wei, Rogerio Feris, Jinjun Xiong, Wen-mei Hwu, Thomas S. Huang, Humphrey Shi

    Abstract: We consider the problem of unsupervised domain adaptation for semantic segmentation by easing the domain shift between the source domain (synthetic data) and the target domain (real data) in this work. State-of-the-art approaches prove that performing semantic-level alignment is helpful in tackling the domain shift issue. Based on the observation that stuff categories usually share similar appeara…

    Submitted 9 June, 2020; v1 submitted 18 March, 2020; originally announced March 2020.

    Comments: CVPR 2020

  35. arXiv:2002.11262

    cs.LG cs.AI

    DLSpec: A Deep Learning Task Exchange Specification

    Authors: Abdul Dakkak, Cheng Li, Jinjun Xiong, Wen-Mei Hwu

    Abstract: Deep Learning (DL) innovations are being introduced at a rapid pace. However, the current lack of standard specification of DL tasks makes sharing, running, reproducing, and comparing these innovations difficult. To address this problem, we propose DLSpec, a model-, dataset-, software-, and hardware-agnostic DL specification that captures the different aspects of DL tasks. DLSpec has been tested b…

    Submitted 25 February, 2020; originally announced February 2020.

  36. arXiv:2002.08295

    cs.DC cs.LG stat.ML

    MLModelScope: A Distributed Platform for Model Evaluation and Benchmarking at Scale

    Authors: Abdul Dakkak, Cheng Li, Jinjun Xiong, Wen-mei Hwu

    Abstract: Machine Learning (ML) and Deep Learning (DL) innovations are being introduced at such a rapid pace that researchers are hard-pressed to analyze and study them. The complicated procedures for evaluating innovations, along with the lack of standard and efficient ways of specifying and provisioning ML/DL evaluation, is a major "pain point" for the community. This paper proposes MLModelScope, an open-…

    Submitted 19 February, 2020; originally announced February 2020.

  37. arXiv:1912.11516

    cs.DC cs.AR cs.ET eess.SP

    PANTHER: A Programmable Architecture for Neural Network Training Harnessing Energy-efficient ReRAM

    Authors: Aayush Ankit, Izzat El Hajj, Sai Rahul Chalamalasetti, Sapan Agarwal, Matthew Marinella, Martin Foltin, John Paul Strachan, Dejan Milojicic, Wen-mei Hwu, Kaushik Roy

    Abstract: The wide adoption of deep neural networks has been accompanied by ever-increasing energy and performance demands due to the expensive nature of training them. Numerous special-purpose architectures have been proposed to accelerate training: both digital and hybrid digital-analog using resistive RAM (ReRAM) crossbars. ReRAM-based accelerators have demonstrated the effectiveness of ReRAM crossbars a…

    Submitted 24 December, 2019; originally announced December 2019.

    Comments: 13 pages, 15 figures

  38. arXiv:1911.08031

    cs.DC cs.LG cs.PF stat.ML

    The Design and Implementation of a Scalable DL Benchmarking Platform

    Authors: Cheng Li, Abdul Dakkak, Jinjun Xiong, Wen-mei Hwu

    Abstract: The current Deep Learning (DL) landscape is fast-paced and is rife with non-uniform models, hardware/software (HW/SW) stacks, but lacks a DL benchmarking platform to facilitate evaluation and comparison of DL innovations, be it models, frameworks, libraries, or hardware. Due to the lack of a benchmarking platform, the current practice of evaluating the benefits of proposed DL innovations is both a…

    Submitted 18 November, 2019; originally announced November 2019.

    Journal ref: 2020 IEEE 13th International Conference on Cloud Computing (CLOUD), 414-425

  39. arXiv:1911.07967

    cs.LG cs.PF cs.SE stat.ML

    DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs (Extended)

    Authors: Cheng Li, Abdul Dakkak, Jinjun Xiong, Wen-mei Hwu

    Abstract: The past few years have seen a surge of applying Deep Learning (DL) models for a wide array of tasks such as image classification, object detection, machine translation, etc. While DL models provide an opportunity to solve otherwise intractable tasks, their adoption relies on them being optimized to meet latency and resource requirements. Benchmarking is a key step in this process but has been ham…

    Submitted 11 March, 2020; v1 submitted 18 November, 2019; originally announced November 2019.

  40. arXiv:1911.07446

    cs.LG cs.CV cs.NE stat.ML

    NAIS: Neural Architecture and Implementation Search and its Applications in Autonomous Driving

    Authors: Cong Hao, Yao Chen, Xinheng Liu, Atif Sarwari, Daryl Sew, Ashutosh Dhar, Bryan Wu, Dongdong Fu, Jinjun Xiong, Wen-mei Hwu, Junli Gu, Deming Chen

    Abstract: The rapidly growing demands for powerful AI algorithms in many application domains have motivated massive investment in both high-quality deep neural network (DNN) models and high-efficiency implementations. In this position paper, we argue that a simultaneous DNN/implementation co-design methodology, named Neural Architecture and Implementation Search (NAIS), deserves more research attention to b…

    Submitted 18 November, 2019; originally announced November 2019.

    Comments: 8 pages, ICCAD 2019

  41. arXiv:1911.06922  [pdf, other]

    cs.LG cs.DC cs.PF stat.ML

    Benanza: Automatic $μ$Benchmark Generation to Compute "Lower-bound" Latency and Inform Optimizations of Deep Learning Models on GPUs

    Authors: Cheng Li, Abdul Dakkak, Jinjun Xiong, Wen-mei Hwu

    Abstract: As Deep Learning (DL) models have been increasingly used in latency-sensitive applications, there has been a growing interest in improving their response time. An important venue for such improvement is to profile the execution of these models and characterize their performance to identify possible optimization opportunities. However, the current profiling tools lack the highly desired abilities t…

    Submitted 19 February, 2020; v1 submitted 15 November, 2019; originally announced November 2019.

  42. arXiv:1909.11258  [pdf, ps, other]

    cs.CL cs.IR cs.LG

    PaRe: A Paper-Reviewer Matching Approach Using a Common Topic Space

    Authors: Omer Anjum, Hongyu Gong, Suma Bhat, Wen-Mei Hwu, Jinjun Xiong

    Abstract: Finding the right reviewers to assess the quality of conference submissions is a time consuming process for conference organizers. Given the importance of this step, various automated reviewer-paper matching solutions have been proposed to alleviate the burden. Prior approaches, including bag-of-words models and probabilistic topic models have been inadequate to deal with the vocabulary mismatch a…

    Submitted 24 September, 2019; originally announced September 2019.

  43. arXiv:1909.09709  [pdf, other]

    cs.CV

    SkyNet: a Hardware-Efficient Method for Object Detection and Tracking on Embedded Systems

    Authors: Xiaofan Zhang, Haoming Lu, Cong Hao, Jiachen Li, Bowen Cheng, Yuhong Li, Kyle Rupnow, Jinjun Xiong, Thomas Huang, Honghui Shi, Wen-mei Hwu, Deming Chen

    Abstract: Object detection and tracking are challenging tasks for resource-constrained embedded systems. While these tasks are among the most compute-intensive tasks from the artificial intelligence domain, they are only allowed to use limited computation and memory resources on embedded devices. In the meanwhile, such resource-constrained implementations are often required to satisfy additional demanding r…

    Submitted 29 February, 2020; v1 submitted 20 September, 2019; originally announced September 2019.

    Comments: Published as a conference paper at Conference on Machine Learning and Systems (MLSys) 2020

  44. arXiv:1908.09798  [pdf, other]

    cs.CV

    SPGNet: Semantic Prediction Guidance for Scene Parsing

    Authors: Bowen Cheng, Liang-Chieh Chen, Yunchao Wei, Yukun Zhu, Zilong Huang, Jinjun Xiong, Thomas Huang, Wen-Mei Hwu, Honghui Shi

    Abstract: Multi-scale context module and single-stage encoder-decoder structure are commonly employed for semantic segmentation. The multi-scale context module refers to the operations to aggregate feature responses from a large spatial extent, while the single-stage encoder-decoder structure encodes the high-level semantic information in the encoder path and recovers the boundary information in the decoder…

    Submitted 26 August, 2019; originally announced August 2019.

    Comments: ICCV 2019

  45. arXiv:1908.06869  [pdf, other]

    cs.LG cs.AR cs.PF stat.ML

    XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs

    Authors: Cheng Li, Abdul Dakkak, Jinjun Xiong, Wei Wei, Lingjie Xu, Wen-mei Hwu

    Abstract: There has been a rapid proliferation of machine learning/deep learning (ML) models and wide adoption of them in many application domains. This has made profiling and characterization of ML model performance an increasingly pressing task for both hardware designers and system providers, as they would like to offer the best possible system to serve ML models with the target latency, throughput, cost…

    Submitted 2 June, 2020; v1 submitted 19 August, 2019; originally announced August 2019.

  46. arXiv:1908.01261  [pdf, other]

    cs.AR

    Analysis and Optimization of I/O Cache Coherency Strategies for SoC-FPGA Device

    Authors: Seung Won Min, Sitao Huang, Mohamed El-Hadedy, Jinjun Xiong, Deming Chen, Wen-mei Hwu

    Abstract: Unlike traditional PCIe-based FPGA accelerators, heterogeneous SoC-FPGA devices provide tighter integrations between software running on CPUs and hardware accelerators. Modern heterogeneous SoC-FPGA platforms support multiple I/O cache coherence options between CPUs and FPGAs, but these options can have inadvertent effects on the achieved bandwidths depending on applications and data access patter…

    Submitted 3 August, 2019; originally announced August 2019.

  47. arXiv:1906.10327  [pdf, other]

    cs.CV

    SkyNet: A Champion Model for DAC-SDC on Low Power Object Detection

    Authors: Xiaofan Zhang, Cong Hao, Haoming Lu, Jiachen Li, Yuhong Li, Yuchen Fan, Kyle Rupnow, Jinjun Xiong, Thomas Huang, Honghui Shi, Wen-mei Hwu, Deming Chen

    Abstract: Developing artificial intelligence (AI) at the edge is always challenging, since edge devices have limited computation capability and memory resources but need to meet demanding requirements, such as real-time processing, high throughput performance, and high inference accuracy. To overcome these challenges, we propose SkyNet, an extremely lightweight DNN with 12 convolutional (Conv) layers and on…

    Submitted 9 July, 2019; v1 submitted 25 June, 2019; originally announced June 2019.

  48. arXiv:1906.09380  [pdf, other]

    cs.AR cs.DL cs.IR

    A Retrospective Recount of Computer Architecture Research with a Data-Driven Study of Over Four Decades of ISCA Publications

    Authors: Omer Anjum, Wen-Mei Hwu, Jinjun Xiong

    Abstract: This study began with a research project, called DISCvR, conducted at the IBM-ILLINOIS Center for Cognitive Computing Systems Research. The goal of DISCvR was to build a practical NLP based AI pipeline for document understanding which will help us better understand the computation patterns and requirements of modern computing systems. While building such a prototype, an early use case came to us th…

    Submitted 21 June, 2019; originally announced June 2019.

  49. arXiv:1905.08369  [pdf, other]

    cs.CV

    A Bi-Directional Co-Design Approach to Enable Deep Learning on IoT Devices

    Authors: Xiaofan Zhang, Cong Hao, Yuhong Li, Yao Chen, Jinjun Xiong, Wen-mei Hwu, Deming Chen

    Abstract: Developing deep learning models for resource-constrained Internet-of-Things (IoT) devices is challenging, as it is difficult to achieve both good quality of results (QoR), such as DNN model inference accuracy, and quality of service (QoS), such as inference latency, throughput, and power consumption. Existing approaches typically separate the DNN model development step from its deployment on IoT d…

    Submitted 20 May, 2019; originally announced May 2019.

    Comments: Accepted by the ICML 2019 Workshop on On-Device Machine Learning & Compact Deep Neural Network Representations (ODML-CDNNR)

  50. arXiv:1904.12437  [pdf, other]

    cs.LG cs.AI cs.SE

    Challenges and Pitfalls of Machine Learning Evaluation and Benchmarking

    Authors: Cheng Li, Abdul Dakkak, Jinjun Xiong, Wen-mei Hwu

    Abstract: An increasingly complex and diverse collection of Machine Learning (ML) models as well as hardware/software stacks, collectively referred to as "ML artifacts", are being proposed - leading to a diverse landscape of ML. These ML innovations proposed have outpaced researchers' ability to analyze, study and adapt them. This is exacerbated by the complicated and sometimes non-reproducible procedures f…

    Submitted 25 June, 2019; v1 submitted 28 April, 2019; originally announced April 2019.