-
TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings
Authors:
Norman P. Jouppi,
George Kurian,
Sheng Li,
Peter Ma,
Rahul Nagarajan,
Lifeng Nai,
Nishant Patil,
Suvinay Subramanian,
Andy Swing,
Brian Towles,
Cliff Young,
Xiang Zhou,
Zongwei Zhou,
David Patterson
Abstract:
In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired. Much cheaper, lower power, and faster than Infiniband, OCSes and underlying optical components are <5% of system cost and <3% of system power. Each TPU v4 includes SparseCores, dataflow processors that accelerate models that rely on embeddings by 5x-7x yet use only 5% of die area and power. Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves performance/Watt by 2.7x. The TPU v4 supercomputer is 4x larger at 4096 chips and thus ~10x faster overall, which along with OCS flexibility helps large language models. For similar sized systems, it is ~4.3x-4.5x faster than the Graphcore IPU Bow and is 1.2x-1.7x faster and uses 1.3x-1.9x less power than the Nvidia A100. TPU v4s inside the energy-optimized warehouse scale computers of Google Cloud use ~3x less energy and produce ~20x less CO2e than contemporary DSAs in a typical on-premise data center.
Submitted 20 April, 2023; v1 submitted 3 April, 2023;
originally announced April 2023.
-
The Specialized High-Performance Network on Anton 3
Authors:
Keun Sup Shim,
Brian Greskamp,
Brian Towles,
Bruce Edwards,
J. P. Grossman,
David E. Shaw
Abstract:
Molecular dynamics (MD) simulation, a computationally intensive method that provides invaluable insights into the behavior of biomolecules, typically requires large-scale parallelization. Implementation of fast parallel MD simulation demands both high bandwidth and low latency for inter-node communication, but in current semiconductor technology, neither of these properties is scaling as quickly as intra-node computational capacity. This disparity in scaling necessitates architectural innovations to maximize the utilization of computational units. For Anton 3, the latest in a family of highly successful special-purpose supercomputers designed for MD simulations, we thus designed and built a completely new specialized network as part of our ASIC. Tightly integrating this network with specialized computation pipelines enables Anton 3 to perform simulations orders of magnitude faster than any general-purpose supercomputer, and to outperform its predecessor, Anton 2 (the state of the art prior to Anton 3), by an order of magnitude. In this paper, we present the three key features of the network that contribute to the high performance of Anton 3. First, through architectural optimizations, the network achieves very low end-to-end inter-node communication latency for fine-grained messages, allowing for better overlap of computation and communication. Second, novel application-specific compression techniques reduce the size of most messages sent between nodes, thereby increasing effective inter-node bandwidth. Lastly, a new hardware synchronization primitive, called a network fence, supports fast fine-grained synchronization tailored to the data flow within a parallel MD application. These application-driven specializations to the network are critical for Anton 3's MD simulation performance advantage over all other machines.
Submitted 20 January, 2022;
originally announced January 2022.
-
The $\textit{u}$-series: A separable decomposition for electrostatics computation with improved accuracy
Authors:
Cristian Predescu,
Adam K. Lerer,
Ross A. Lippert,
Brian Towles,
J. P. Grossman,
Robert M. Dirks,
David E. Shaw
Abstract:
The evaluation of electrostatic energy for a set of point charges in a periodic lattice is a computationally expensive part of molecular dynamics simulations (and other applications) because of the long-range nature of the Coulomb interaction. A standard approach is to decompose the Coulomb potential into a near part, typically evaluated by direct summation up to a cutoff radius, and a far part, typically evaluated in Fourier space. In practice, all decomposition approaches involve approximations, such as cutting off the near-part direct sum, but it may be possible to find new decompositions with improved tradeoffs between accuracy and performance. Here we present the $\textit{u}$-series, a new decomposition of the Coulomb potential that is more accurate than the standard (Ewald) decomposition for a given amount of computational effort, and achieves the same accuracy as the Ewald decomposition with approximately half the computational effort. These improvements, which we demonstrate numerically using a lipid membrane system, arise because the $\textit{u}$-series is smooth on the entire real axis and exact up to the cutoff radius. Additional performance improvements over the Ewald decomposition may be possible in certain situations because the far part of the $\textit{u}$-series is a sum of Gaussians, and can thus be evaluated using algorithms that require a separable convolution kernel; we describe one such algorithm that reduces communication latency at the expense of communication bandwidth and computation, a tradeoff that may be advantageous on modern massively parallel supercomputers.
Submitted 4 November, 2019;
originally announced November 2019.
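For readers unfamiliar with the near/far decomposition that the abstract above uses as its baseline, the following is a minimal sketch of the standard Ewald split, 1/r = erfc(αr)/r + erf(αr)/r. It does not implement the u-series itself. The function names, the splitting parameter `alpha`, and the cutoff value are illustrative assumptions, not values taken from the paper.

```python
import math


def ewald_split(r, alpha):
    """Split the Coulomb kernel 1/r into a short-range and a long-range part.

    near = erfc(alpha * r) / r  decays quickly, so it is summed directly.
    far  = erf(alpha * r) / r   is smooth, so it suits Fourier-space evaluation.
    `alpha` is an illustrative splitting parameter (assumption).
    """
    near = math.erfc(alpha * r) / r
    far = math.erf(alpha * r) / r
    return near, far


def near_truncation_error(cutoff, alpha):
    """Size of the near part that is dropped when the direct sum stops at `cutoff`."""
    return math.erfc(alpha * cutoff) / cutoff


if __name__ == "__main__":
    alpha = 1.0  # illustrative value
    for r in (0.5, 1.0, 2.0, 4.0):
        near, far = ewald_split(r, alpha)
        print(f"r={r:4.1f}  near={near:.3e}  far={far:.3e}  sum={near + far:.6f}")
    print(f"dropped at cutoff 4.0: {near_truncation_error(4.0, alpha):.3e}")
```

The nonzero value returned by `near_truncation_error` is the kind of cutoff approximation the abstract refers to; the abstract states that the u-series is instead exact up to the cutoff radius.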
-
Concentration and Length Dependence of DNA Looping in Transcriptional Regulation
Authors:
Lin Han,
Hernan G. Garcia,
Seth Blumberg,
Kevin B. Towles,
John F. Beausang,
Philip C. Nelson,
Rob Phillips
Abstract:
In many cases, transcriptional regulation involves the binding of transcription factors at sites on the DNA that are not immediately adjacent to the promoter of interest. This action at a distance is often mediated by the formation of DNA loops: Binding at two or more sites on the DNA results in the formation of a loop, which can bring the transcription factor into the immediate neighborhood of the relevant promoter. Though there have been a variety of insights into the combinatorial aspects of transcriptional control, the mechanism of DNA looping as an agent of combinatorial control in both prokaryotes and eukaryotes remains unclear. We use single-molecule techniques to dissect DNA looping in the lac operon. In particular, we measure the propensity for DNA looping by the Lac repressor as a function of the concentration of repressor protein and as a function of the distance between repressor binding sites. As with earlier single-molecule studies, we find (at least) two distinct looped states and demonstrate that the presence of these two states depends both upon the concentration of repressor protein and the distance between the two repressor binding sites. We find that loops form even at interoperator spacings considerably shorter than the DNA persistence length, without the intervention of any other proteins to prebend the DNA. The concentration measurements also permit us to use a simple statistical mechanical model of DNA loop formation to determine the free energy of DNA looping, or equivalently, the J-factor for looping.
Submitted 11 June, 2008;
originally announced June 2008.
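The abstract above mentions a simple statistical mechanical model that links repressor concentration to looping probability and the J-factor. The sketch below is a generic titration model of that kind, not the paper's own expression: the five states, their statistical weights, the omission of any symmetry factors, and all parameter values are assumptions made for illustration. The free-energy convention J = 1 M · exp(-ΔF/kT) is likewise one common choice assumed here.

```python
import math


def looping_probability(c, k1, k2, j):
    """Probability of the looped state in a generic five-state model (assumption).

    c      : repressor concentration (molar)
    k1, k2 : dissociation constants of the two operator sites (molar)
    j      : looping J-factor, an effective concentration (molar)
    States: empty, site 1 bound, site 2 bound, both bound by separate
    repressors, and one repressor bridging both sites (the loop).
    """
    single = c / k1 + c / k2
    double = c * c / (k1 * k2)
    looped = c * j / (k1 * k2)
    return looped / (1.0 + single + double + looped)


def looping_free_energy_kT(j):
    """Looping free energy in units of kT, assuming J = 1 M * exp(-dF / kT)."""
    return -math.log(j / 1.0)


if __name__ == "__main__":
    k1, k2, j = 1e-11, 1e-10, 1e-9  # illustrative values, not measured data
    for c in (1e-13, 1e-12, 1e-11, 1e-10, 1e-9, 1e-8):
        print(f"c={c:.0e} M  p_loop={looping_probability(c, k1, k2, j):.3f}")
    print(f"dF_loop = {looping_free_energy_kT(j):.2f} kT")
```

In this sketch the looping probability rises and then falls with concentration, peaking at c = sqrt(k1 · k2), because at high concentration each operator tends to be occupied by a separate repressor.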
-
First-principles calculation of DNA looping in tethered particle experiments
Authors:
Kevin B. Towles,
John F. Beausang,
Hernan G. Garcia,
Rob Phillips,
Philip C. Nelson
Abstract:
We calculate the probability of DNA loop formation mediated by regulatory proteins such as Lac repressor (LacI), using a mathematical model of DNA elasticity. Our model is adapted to calculating quantities directly observable in Tethered Particle Motion (TPM) experiments, and it accounts for all the entropic forces present in such experiments. Our model has no free parameters; it characterizes DNA elasticity using information obtained in other kinds of experiments. [...] We show how to compute both the "looping J factor" (or equivalently, the looping free energy) for various DNA construct geometries and LacI concentrations, as well as the detailed probability density function of bead excursions. We also show how to extract the same quantities from recent experimental data on tethered particle motion, and then compare to our model's predictions. [...] Our model successfully reproduces the detailed distributions of bead excursion, including their surprising three-peak structure, without any fit parameters and without invoking any alternative conformation of the LacI tetramer. Indeed, the model qualitatively reproduces the observed dependence of these distributions on tether length (e.g., phasing) and on LacI concentration (titration). However, for short DNA loops (around 95 basepairs) the experiments show more looping than is predicted by the harmonic-elasticity model, echoing other recent experimental results. Because the experiments we study are done in vitro, this anomalously high looping cannot be rationalized as resulting from the presence of DNA-bending proteins or other cellular machinery. We also show that it is unlikely to be the result of a hypothetical "open" conformation of the LacI tetramer.
Submitted 13 October, 2008; v1 submitted 9 June, 2008;
originally announced June 2008.