Showing 1–12 of 12 results for author: Laudon, J

  1. arXiv:2306.00008  [pdf, other]

    cs.LG cs.CL

    Brainformers: Trading Simplicity for Efficiency

    Authors: Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew Dai, Yifeng Lu, Zhifeng Chen, Quoc Le, Claire Cui, James Laudon, Jeff Dean

    Abstract: Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks that have different permutations of layer primitives can be more efficient. Using this in…

    Submitted 25 April, 2024; v1 submitted 29 May, 2023; originally announced June 2023.

  2. arXiv:2305.14562  [pdf, other]

    cs.LG eess.SY

    GiPH: Generalizable Placement Learning for Adaptive Heterogeneous Computing

    Authors: Yi Hu, Chaoran Zhang, Edward Andert, Harshul Singh, Aviral Shrivastava, James Laudon, Yanqi Zhou, Bob Iannucci, Carlee Joe-Wong

    Abstract: Careful placement of a computational application within a target device cluster is critical for achieving low application completion time. The problem is challenging due to its NP-hardness and combinatorial nature. In recent years, learning-based approaches have been proposed to learn a placement policy that can be applied to unseen applications, motivated by the problem of placing a neural networ…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: to be published in Proceedings of Machine Learning and Systems 5 (MLSys 2023)

  3. arXiv:2305.12281  [pdf, other]

    cs.CL cs.LG

    Lifelong Language Pretraining with Distribution-Specialized Experts

    Authors: Wuyang Chen, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, Claire Cui

    Abstract: Pretraining on a large-scale corpus has become a standard method to build general language models (LMs). Adapting a model to new data distributions targeting different downstream tasks poses significant challenges. Naive fine-tuning may incur catastrophic forgetting when the over-parameterized LMs overfit the new data but fail to preserve the pretrained features. Lifelong learning (LLL) aims to en…

    Submitted 20 May, 2023; originally announced May 2023.

    Comments: ICML 2023

  4. arXiv:2202.09368  [pdf, other]

    cs.LG cs.AI

    Mixture-of-Experts with Expert Choice Routing

    Authors: Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew Dai, Zhifeng Chen, Quoc Le, James Laudon

    Abstract: Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g. one resulting in load imbalance) can cause certain experts to be under-trained, leading to an expert being under or over-specialized. Prior work allocates a fixed nu…

    Submitted 13 October, 2022; v1 submitted 18 February, 2022; originally announced February 2022.

  5. arXiv:2112.04041  [pdf, other]

    cs.LG cs.AR

    A Transferable Approach for Partitioning Machine Learning Models on Multi-Chip-Modules

    Authors: Xinfeng Xie, Prakash Prabhu, Ulysse Beaugnon, Phitchaya Mangpo Phothilimthana, Sudip Roy, Azalia Mirhoseini, Eugene Brevdo, James Laudon, Yanqi Zhou

    Abstract: Multi-Chip-Modules (MCMs) reduce the design and fabrication cost of machine learning (ML) accelerators while delivering performance and energy efficiency on par with a monolithic large chip. However, ML compilers targeting MCMs need to solve complex optimization problems optimally and efficiently to achieve this high performance. One such problem is the multi-chip partitioning problem where compil…

    Submitted 7 December, 2021; originally announced December 2021.

  6. arXiv:2102.10423  [pdf, other]

    cs.LG cs.AR

    An Evaluation of Edge TPU Accelerators for Convolutional Neural Networks

    Authors: Kiran Seshadri, Berkin Akin, James Laudon, Ravi Narayanaswami, Amir Yazdanbakhsh

    Abstract: Edge TPUs are a domain of accelerators for low-power, edge devices and are widely used in various Google products such as Coral and Pixel devices. In this paper, we first discuss the major microarchitectural details of Edge TPUs. Then, we extensively evaluate three classes of Edge TPUs, covering different computing ecosystems, that are either currently deployed in Google products or are the produc…

    Submitted 11 October, 2022; v1 submitted 20 February, 2021; originally announced February 2021.

    Comments: 13 pages, 15 figures, 8 tables, published in IISWC 2022

  7. arXiv:2102.08619  [pdf, other]

    cs.LG cs.AR

    Rethinking Co-design of Neural Architectures and Hardware Accelerators

    Authors: Yanqi Zhou, Xuanyi Dong, Berkin Akin, Mingxing Tan, Daiyi Peng, Tianjian Meng, Amir Yazdanbakhsh, Da Huang, Ravi Narayanaswami, James Laudon

    Abstract: Neural architectures and hardware accelerators have been two driving forces for the progress in deep learning. Previous works typically attempt to optimize hardware given a fixed model architecture or model architecture given fixed hardware. And the dominant hardware architecture explored in this prior work is FPGAs. In our work, we target the optimization of hardware and software configurations o…

    Submitted 17 February, 2021; originally announced February 2021.

  8. arXiv:2102.01723  [pdf, other]

    cs.LG cs.AR

    Apollo: Transferable Architecture Exploration

    Authors: Amir Yazdanbakhsh, Christof Angermueller, Berkin Akin, Yanqi Zhou, Albin Jones, Milad Hashemi, Kevin Swersky, Satrajit Chatterjee, Ravi Narayanaswami, James Laudon

    Abstract: The looming end of Moore's Law and ascending use of deep learning drives the design of custom accelerators that are optimized for specific neural architectures. Architecture exploration for such accelerators forms a challenging constrained optimization problem over a complex, high-dimensional, and structured input space with a costly to evaluate objective function. Existing approaches for accelera…

    Submitted 2 February, 2021; originally announced February 2021.

    Comments: 10 pages, 5 figures, Accepted to Workshop on ML for Systems at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020)

  9. arXiv:2010.12438  [pdf, other]

    cs.LG cs.DC

    Transferable Graph Optimizers for ML Compilers

    Authors: Yanqi Zhou, Sudip Roy, Amirali Abdolrashidi, Daniel Wong, Peter Ma, Qiumin Xu, Hanxiao Liu, Phitchaya Mangpo Phothilimthana, Shen Wang, Anna Goldie, Azalia Mirhoseini, James Laudon

    Abstract: Most compilers for machine learning (ML) frameworks need to solve many correlated optimization problems to generate efficient machine code. Current ML compilers rely on heuristics based algorithms to solve these optimization problems one at a time. However, this approach is not only hard to maintain but often leads to sub-optimal solutions especially for newer model architectures. Existing learnin…

    Submitted 19 February, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

    Comments: arXiv admin note: text overlap with arXiv:1910.01578

    Journal ref: NeurIPS 2020

  10. arXiv:2004.10746  [pdf, other]

    cs.LG cs.AI

    Chip Placement with Deep Reinforcement Learning

    Authors: Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Sungmin Bae, Azade Nazi, Jiwoo Pak, Andy Tong, Kavya Srinivasa, William Hang, Emre Tuncer, Anand Babu, Quoc V. Le, James Laudon, Richard Ho, Roger Carpenter, Jeff Dean

    Abstract: In this work, we present a learning-based approach to chip placement, one of the most complex and time-consuming stages of the chip design process. Unlike prior methods, our approach has the ability to learn from past experience and improve over time. In particular, as we train over a greater number of chip blocks, our method becomes better at rapidly generating optimized placements for previously…

    Submitted 22 April, 2020; originally announced April 2020.

  11. arXiv:1910.01578  [pdf, other]

    cs.LG stat.ML

    GDP: Generalized Device Placement for Dataflow Graphs

    Authors: Yanqi Zhou, Sudip Roy, Amirali Abdolrashidi, Daniel Wong, Peter C. Ma, Qiumin Xu, Ming Zhong, Hanxiao Liu, Anna Goldie, Azalia Mirhoseini, James Laudon

    Abstract: Runtime and scalability of large neural networks can be significantly affected by the placement of operations in their dataflow graphs on suitable devices. With increasingly complex neural network architectures and heterogeneous device characteristics, finding a reasonable placement is extremely challenging even for domain experts. Most existing automated device placement approaches are impractica…

    Submitted 28 September, 2019; originally announced October 2019.

  12. arXiv:1704.04760  [pdf]

    cs.AR cs.LG cs.NE

    In-Datacenter Performance Analysis of a Tensor Processing Unit

    Authors: Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, et al. (50 additional authors not shown)

    Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOp…

    Submitted 16 April, 2017; originally announced April 2017.

    Comments: 17 pages, 11 figures, 8 tables. To appear at the 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, June 24-28, 2017