-
Observation of gigantic spin conversion anisotropy in bismuth
Authors:
Naoki Fukumoto,
Ryo Ohshima,
Motomi Aoki,
Yuki Fuseya,
Masayuki Matsushima,
Ei Shigematsu,
Teruya Shinjo,
Yuichiro Ando,
Shoya Sakamoto,
Masanobu Shiga,
Shinji Miwa,
Masashi Shiraishi
Abstract:
Whilst the g-factor can be anisotropic due to the spin-orbit interaction (SOI), the existence of such anisotropy in solids cannot be simply asserted from a band structure, which hinders progress on studies from this viewpoint. The g-factor in bismuth (Bi) is largely anisotropic; especially for holes at the T-point, the g-factor perpendicular to the trigonal axis is negligibly small (< 0.112), whereas the g-factor along the trigonal axis is very large (62.7). We clarified in this work, from experimental and theoretical approaches, that the large g-factor anisotropy gives rise to the gigantic spin conversion anisotropy in Bi. Spin-torque ferromagnetic resonance was applied to estimate the spin conversion efficiency in rhombohedral (110) Bi to be 17%, which is unlike the negligibly small efficiency in Bi(111). Harmonic Hall measurements support the large spin conversion efficiency in Bi(110). This is the first observation of gigantic spin conversion anisotropy as a clear manifestation of the g-factor anisotropy. Beyond the emblematic case of Bi, our study unveiled the significance of the g-factor anisotropy in condensed-matter physics and can pave a pathway toward establishing novel spin physics under g-factor control.
Submitted 15 August, 2022; v1 submitted 31 July, 2022;
originally announced August 2022.
-
mpiQulacs: A Distributed Quantum Computer Simulator for A64FX-based Cluster Systems
Authors:
Satoshi Imamura,
Masafumi Yamazaki,
Takumi Honda,
Akihiko Kasagi,
Akihiro Tabuchi,
Hiroshi Nakao,
Naoto Fukumoto,
Kohta Nakashima
Abstract:
Quantum computer simulators running on classical computers are essential for developing real quantum computers and emerging quantum applications. In particular, state vector simulators, which store a full state vector in memory and update it in every quantum operation, are available to simulate an arbitrary form of quantum circuits, debug quantum applications, and validate future quantum computers. However, the time and space complexity grows exponentially with the number of qubits and easily exceeds the capability of a single machine.
Therefore, we develop a distributed state vector simulator, $mpiQulacs$, that is optimized for large-scale simulation on A64FX-based cluster systems. A64FX is an ARM-based CPU that is also used in the world's top Fugaku supercomputer. We evaluate weak and strong scaling of mpiQulacs with up to 36 qubits on a new 64-node A64FX-based cluster system named $Todoroki$. By comparing mpiQulacs with existing distributed state vector simulators, we show that mpiQulacs achieves the highest performance for large-scale simulation on tens of nodes while sustaining nearly ideal scalability. In addition, we define a new metric, $quantum B/F ratio$, and use it to demonstrate that mpiQulacs running on Todoroki fits the requirements of distributed state vector simulation better than the existing simulators running on general-purpose CPU-based or GPU-based cluster systems.
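The exponential space cost mentioned in the abstract can be illustrated with a short calculation. This is a minimal sketch, not taken from the paper: it assumes each amplitude is stored as a double-precision complex number (16 bytes) and that the state vector is split evenly across nodes. The function names are hypothetical.

```python
# Memory needed to hold a full n-qubit state vector.
# Assumption (not stated in the abstract): each of the 2**n amplitudes
# is a double-precision complex number, i.e. 16 bytes.
BYTES_PER_AMPLITUDE = 16


def state_vector_bytes(num_qubits: int) -> int:
    """Total bytes for a full state vector of num_qubits qubits."""
    return (2 ** num_qubits) * BYTES_PER_AMPLITUDE


def bytes_per_node(num_qubits: int, num_nodes: int) -> int:
    """Bytes held by each node, assuming an even split across nodes."""
    return state_vector_bytes(num_qubits) // num_nodes


if __name__ == "__main__":
    gib = 2 ** 30
    for n in (30, 33, 36):
        total = state_vector_bytes(n)
        per_node = bytes_per_node(n, 64)
        print(f"{n} qubits: {total / gib:.0f} GiB total, "
              f"{per_node / gib:.2f} GiB per node on 64 nodes")
```

Under these assumptions, each added qubit doubles the memory, and a 36-qubit state vector occupies 1 TiB, or 16 GiB per node when spread over 64 nodes.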
Submitted 30 March, 2022;
originally announced March 2022.
-
MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems
Authors:
Steven Farrell,
Murali Emani,
Jacob Balma,
Lukas Drescher,
Aleksandr Drozd,
Andreas Fink,
Geoffrey Fox,
David Kanter,
Thorsten Kurth,
Peter Mattson,
Dawei Mu,
Amit Ruhela,
Kento Sato,
Koichi Shirahata,
Tsuguchika Tabaru,
Aristeidis Tsaris,
Jan Balewski,
Ben Cumming,
Takumi Danjo,
Jens Domke,
Takaaki Fukai,
Naoto Fukumoto,
Tatsuya Fukushi,
Balazs Gerofi,
Takumi Honda
, et al. (18 additional authors not shown)
Abstract:
Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications driven by the MLCommons Association. We present the results from the first submission round, including a diverse set of some of the world's largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence, and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems such as staging and on-node loading of data, compute-unit utilization, and communication scheduling, enabling overall $>10 \times$ (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system's memory hierarchy, and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O, and network behavior to parameterize extended roofline performance models in future rounds.
Submitted 26 October, 2021; v1 submitted 21 October, 2021;
originally announced October 2021.
-
Yet Another Accelerated SGD: ResNet-50 Training on ImageNet in 74.7 seconds
Authors:
Masafumi Yamazaki,
Akihiko Kasagi,
Akihiro Tabuchi,
Takumi Honda,
Masahiro Miwa,
Naoto Fukumoto,
Tsuguchika Tabaru,
Atsushi Ike,
Kohta Nakashima
Abstract:
There has been a strong demand for algorithms that can execute machine learning as fast as possible, and the speed of deep learning has increased 30-fold in the past two years alone. Distributed deep learning using large mini-batches is a key technology to address this demand, and it is a great challenge because it is difficult to achieve high scalability on large clusters without compromising accuracy. In this paper, we introduce the optimization methods that we applied to this challenge. Applying these methods, we achieved a training time of 74.7 seconds using 2,048 GPUs on the ABCI cluster. The training throughput is over 1.73 million images/sec and the top-1 validation accuracy is 75.08%.
Submitted 29 March, 2019;
originally announced March 2019.