Search | arXiv e-print repository

Distributed astrophysics simulations using Octo-Tiger with RISC-V CPUs using HPX and Kokkos

Authors: Patrick Diehl, Gregor Daiß, Steven R. Brandt, Alireza Kheirkhahan, Srinivas Yadav Singanaboina, Dominic Marcello, Chris Taylor, John Leidel, Hartmut Kaiser

Abstract: In recent years, interest in RISC-V computing architectures has moved from academic to mainstream, especially in the field of High Performance Computing where energy limitations are increasingly a point of concern. The results presented in this paper are part of a longer-term evaluation of RISC-V's viability for HPC applications. In this work, we use the Octo-Tiger multi-physics, multi-scale, 3D adaptive mesh refinement astrophysics application as the basis for our analysis. We report on our experience in porting this modern C++ code (which is built upon several open-source libraries such as HPX and Kokkos) to RISC-V. We also compare the application's performance, scalability, and power consumption on RISC-V to an A64FX system.

Submitted 10 May, 2024; originally announced July 2024.

arXiv:2405.13101 [pdf, other]

Evaluating AI-generated code for C++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust

Authors: Patrick Diehl, Noujoud Nader, Steve Brandt, Hartmut Kaiser

Abstract: This study evaluates the capabilities of ChatGPT versions 3.5 and 4 in generating code across a diverse range of programming languages. Our objective is to assess the effectiveness of these AI models for generating scientific programs. To this end, we asked ChatGPT to generate three distinct codes: a simple numerical integration, a conjugate gradient solver, and a parallel 1D stencil-based heat equation solver. The focus of our analysis was on the compilation, runtime performance, and accuracy of the codes. While both versions of ChatGPT successfully created codes that compiled and ran (with some help), some languages were easier for the AI to use than others (possibly because of the size of the training sets used). Parallel codes -- even the simple example we chose to study here -- were also difficult for the AI to generate correctly.
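For orientation, the second of the three benchmark codes named in this abstract, a conjugate gradient solver, can be sketched in a few lines. This is a minimal illustration written for this listing, not code from the paper or from ChatGPT; the function names (`dot`, `matvec`, `conjugate_gradient`), the tolerance, and the dense list-of-lists matrix layout are all assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(A, v):
    """Dense matrix-vector product; A is a list of rows."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A (assumed, not checked)."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break        # residual norm is small enough
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

For example, `conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])` approximates the solution of a small 2x2 system whose exact answer is (1/11, 7/11).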

Submitted 5 July, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

Comments: 9 pages, 3 figures

arXiv:2405.00016 [pdf, ps, other]

doi 10.1007/978-3-031-61763-8_17

HPX with Spack and Singularity Containers: Evaluating Overheads for HPX/Kokkos using an astrophysics application

Authors: Patrick Diehl, Steven R. Brandt, Gregor Daiß, Hartmut Kaiser

Abstract: Cloud computing for high performance computing resources is an emerging topic. This service is of interest to researchers who care about reproducible computing, for software packages with complex installations, and for companies or researchers who need the compute resources only occasionally or do not want to run and maintain a supercomputer on their own. The connection between HPC and containers is exemplified by the fact that Microsoft Azure's Eagle cloud service machine is number three on the November 2023 Top 500 list. For cloud services, the HPC application and dependencies are installed in containers, e.g. Docker, Singularity, or something else, and these containers are executed on the physical hardware. Although containerization leverages the existing Linux kernel and should not impose overheads on the computation, there is the possibility that machine-specific optimizations might be lost, particularly machine-specific installs of commonly used packages. In this paper, we will use an astrophysics application using HPX-Kokkos and measure overheads on homogeneous resources, e.g. Supercomputer Fugaku, using CPUs only and on heterogeneous resources, e.g. LSU's hybrid CPU and GPU system. We will report on challenges in compiling, running, and using the containers as well as performance differences.

Submitted 7 May, 2024; v1 submitted 11 February, 2024; originally announced May 2024.

arXiv:2404.06864 [pdf, other]

Hydrodynamic simulations of WD-WD mergers and the origin of RCB stars

Authors: Sagiv Shiber, Orsola De Marco, Patrick M. Motl, Bradley Munson, Dominic C. Marcello, Juhan Frank, Patrick Diehl, Geoffrey C. Clayton, Bennett N. Skinner, Hartmut Kaiser, Gregor Daiss, Dirk Pfluger, Jan E. Staff

Abstract: We study the properties of double white dwarf (DWD) mergers by performing hydrodynamic simulations using the new and improved adaptive mesh refinement code Octo-Tiger. We follow the orbital evolution of DWD systems of mass ratio q=0.7 for tens of orbits until and after the merger to investigate them as a possible origin for R Coronae Borealis (RCB) type stars. We reproduce previous results, finding that during the merger, the Helium WD donor star is tidally disrupted within 20-80 minutes since the beginning of the simulation onto the accretor Carbon-Oxygen WD, creating a high temperature shell around the accretor. We investigate the possible Helium burning in this shell and the merged object's general structure. Specifically, we are interested in the amount of Oxygen-16 dredged-up to the hot shell and the amount of Oxygen-18 produced. This is critical as the discovery of very low Oxygen-16 to Oxygen-18 ratios in RCB stars pointed out the merger scenario as a favorable explanation for their origin. A small amount of hydrogen in the donor may help keep the Oxygen-16 to Oxygen-18 ratios within observational bounds, even if moderate dredge-up from the accretor occurs. In addition, we perform a resolution study to reconcile the difference found in the amount of Oxygen-16 dredge-up between smoothed-particle hydrodynamics and grid-based simulations.

Submitted 10 April, 2024; originally announced April 2024.

Comments: 27 pages, Submitted to MNRAS. Comments are welcome

arXiv:2403.04818 [pdf, other]

Storm Surge Modeling in the AI ERA: Using LSTM-based Machine Learning for Enhancing Forecasting Accuracy

Authors: Stefanos Giaremis, Noujoud Nader, Clint Dawson, Hartmut Kaiser, Carola Kaiser, Efstratios Nikidis

Abstract: Physics simulation results of natural processes usually do not fully capture the real world. This is caused, for instance, by limits in what physical processes are simulated and to what accuracy. In this work we propose and analyze the use of an LSTM-based deep learning network, a machine learning (ML) architecture, for capturing and predicting the behavior of the systemic error for storm surge forecast models with respect to real-world water height observations from gauge stations during hurricane events. The overall goal of this work is to predict the systemic error of the physics model and use it to improve the accuracy of the simulation results post factum. We trained our proposed ML model on a dataset of 61 historical storms in the coastal regions of the U.S. and we tested its performance in bias correcting modeled water level data predictions from Hurricane Ian (2022). We show that our model can consistently improve the forecasting accuracy for Hurricane Ian -- unknown to the ML model -- at all gauge station coordinates used for the initial data. Moreover, by examining the impact of using different subsets of the initial training dataset, containing a number of relatively similar or different hurricanes in terms of hurricane track, we found that we can obtain similar quality of bias correction by only using a subset of six hurricanes. This is an important result that implies the possibility of applying a pre-trained ML model to real-time hurricane forecasting results with the goal of bias correcting and improving the produced simulation accuracy.
The presented work is an important first step in creating a bias correction system for real-time storm surge forecasting applicable to the full simulation area. It also presents a highly transferable and operationally applicable methodology for improving the accuracy in a wide range of physics simulation scenarios beyond storm surge forecasting.

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2401.03353 [pdf, other]

HPX -- An open source C++ Standard Library for Parallelism and Concurrency

Authors: Thomas Heller, Patrick Diehl, Zachary Byerly, John Biddiscombe, Hartmut Kaiser

Abstract: To achieve scalability with today's heterogeneous HPC resources, we need a dramatic shift in our thinking; MPI+X is not enough. Asynchronous Many Task (AMT) runtime systems break down the global barriers imposed by the Bulk Synchronous Programming model. HPX is an open-source, C++ Standards compliant AMT runtime system that is developed by a diverse international community of collaborators called The Ste||ar Group. HPX provides features which allow application developers to naturally use key design patterns, such as overlapping communication and computation, decentralizing of control flow, oversubscribing execution resources and sending work to data instead of data to work. The Ste||ar Group comprises physicists, engineers, and computer scientists; men and women from many different institutions and affiliations, and over a dozen different countries. We are committed to advancing the development of scalable parallel applications by providing a platform for collaborating and exchanging ideas. In this paper, we give a detailed description of the features HPX provides and how they help achieve scalability and programmability, a list of applications of HPX including two large NSF funded collaborations (STORM, for storm surge forecasting; and STAR (OctoTiger) an astro-physics project which runs at 96.8% parallel efficiency on 643,280 cores), and we end with a description of how HPX and the Ste||ar Group fit into the open source community.

Submitted 11 August, 2023; originally announced January 2024.

Journal ref: Proceedings of OpenSuCo 2017, Denver, Colorado USA, November 2017 (OpenSuCo 17)

arXiv:2309.06530 [pdf, other]

doi 10.1145/3624062.3624230

Evaluating HPX and Kokkos on RISC-V using an Astrophysics Application Octo-Tiger

Authors: Patrick Diehl, Gregor Daiss, Steven R. Brandt, Alireza Kheirkhahan, Hartmut Kaiser, Christopher Taylor, John Leidel

Abstract: In recent years, computers based on the RISC-V architecture have raised broad interest in the high-performance computing (HPC) community. As the RISC-V community develops the core instruction set architecture (ISA) along with ISA extensions, the HPC community has been actively ensuring HPC applications and environments are supported. In this context, assessing the performance of asynchronous many-task runtime systems (AMT) is essential. In this paper, we describe our experience with porting a full 3D adaptive mesh-refinement, multi-scale, multi-model, and multi-physics application, Octo-Tiger, that is based on the HPX AMT, and we explore its performance characteristics on different RISC-V systems. Considering the (limited) capabilities of the RISC-V test systems we used, Octo-Tiger already shows promising results and good scaling. We, however, expect that exceptional hardware support based on dedicated ISA extensions (such as single-cycle context switches, extended atomic operations, and direct support for HPX's global address space) would allow for even better performance results.

Submitted 17 August, 2023; originally announced September 2023.

arXiv:2307.01117 [pdf, other]

doi 10.1007/978-3-031-48803-0_11

Benchmarking the Parallel 1D Heat Equation Solver in Chapel, Charm++, C++, HPX, Go, Julia, Python, Rust, Swift, and Java

Authors: Patrick Diehl, Steven R. Brandt, Max Morris, Nikunj Gupta, Hartmut Kaiser

Abstract: Many scientific high performance codes that simulate e.g. black holes, coastal waves, climate and weather, etc. rely on block-structured meshes and use finite differencing methods to iteratively solve the appropriate systems of differential equations. In this paper we investigate implementations of an extremely simple simulation of this type using various programming systems and languages. We focus on a shared memory, parallelized algorithm that simulates a 1D heat diffusion using asynchronous queues for the ghost zone exchange. We discuss the advantages of the various platforms and explore the performance of this model code on different computing architectures: Intel, AMD, and ARM A64FX. In our results, Python was the slowest of the set we compared. Java, Go, Swift, and Julia were the intermediate performers. The higher performing platforms were C++, Rust, Chapel, Charm++, and HPX.
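The algorithm described in this abstract (a block-structured 1D heat diffusion whose blocks trade ghost zones through queues) can be sketched as follows. This is an illustrative sketch only, not any of the implementations benchmarked in the paper: it assumes periodic boundaries, an explicit finite-difference update with a hypothetical diffusion number `r`, one Python thread per block, and a grid length divisible by the number of blocks. The names `serial_heat` and `parallel_heat` are made up for this example.

```python
import queue
import threading


def serial_heat(u, steps, r):
    """Reference: explicit finite-difference update with periodic boundaries."""
    n = len(u)
    for _ in range(steps):
        u = [u[i] + r * (u[(i - 1) % n] - 2 * u[i] + u[(i + 1) % n])
             for i in range(n)]
    return u


def parallel_heat(u, steps, r, n_blocks):
    """Same update, with the grid split into blocks that exchange ghost cells."""
    size = len(u) // n_blocks  # assumption: len(u) is divisible by n_blocks
    blocks = [u[b * size:(b + 1) * size] for b in range(n_blocks)]
    # from_left[b] receives the ghost cell sent by block b's left neighbour,
    # from_right[b] the one sent by its right neighbour.
    from_left = [queue.Queue() for _ in range(n_blocks)]
    from_right = [queue.Queue() for _ in range(n_blocks)]

    def worker(b):
        local = blocks[b]
        left, right = (b - 1) % n_blocks, (b + 1) % n_blocks
        for _ in range(steps):
            # Publish our boundary cells, then wait for the neighbours' cells.
            from_right[left].put(local[0])
            from_left[right].put(local[-1])
            padded = [from_left[b].get()] + local + [from_right[b].get()]
            local = [padded[i] + r * (padded[i - 1] - 2 * padded[i] + padded[i + 1])
                     for i in range(1, size + 1)]
        blocks[b] = local  # each thread writes only its own slot

    threads = [threading.Thread(target=worker, args=(b,)) for b in range(n_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [x for block in blocks for x in block]
```

Because each queue has one producer and one consumer and preserves order, a block that runs one step ahead cannot mix up ghost values from different time steps, so no global barrier is needed. The explicit scheme is only stable for `r` up to 0.5.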

Submitted 10 July, 2023; v1 submitted 18 May, 2023; originally announced July 2023.

arXiv:2304.11002 [pdf, other]

doi 10.1109/IPDPSW59300.2023.00116

Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku

Authors: Patrick Diehl, Gregor Daiß, Kevin Huck, Dominic Marcello, Sagiv Shiber, Hartmut Kaiser, Dirk Pflüger

Abstract: The increasing availability of machines relying on non-GPU architectures, such as ARM A64FX in high-performance computing, provides a set of interesting challenges to application developers. In addition to requiring code portability across different parallelization schemes, programs targeting these architectures have to be highly adaptable in terms of compute kernel sizes to accommodate different execution characteristics for various heterogeneous workloads. In this paper, we demonstrate an approach to code and performance portability that is based entirely on established standards in the industry. In addition to applying Kokkos as an abstraction over the execution of compute kernels on different heterogeneous execution environments, we show that the use of standard C++ constructs as exposed by the HPX runtime system enables superb portability in terms of code and performance based on the real-world Octo-Tiger astrophysics application. We report our experience with porting Octo-Tiger to the ARM A64FX architecture provided by Stony Brook's Ookami and Riken's Supercomputer Fugaku and compare the resulting performance with that achieved on well-established GPU-oriented HPC machines such as ORNL's Summit, NERSC's Perlmutter and CSCS's Piz Daint systems. Octo-Tiger scaled well on Supercomputer Fugaku without any major code changes due to the abstraction levels provided by HPX and Kokkos. Adding vectorization support for ARM's SVE to Octo-Tiger was trivial thanks to using standard C++.

Submitted 15 March, 2023; originally announced April 2023.

arXiv:2303.08058 [pdf, other]

doi 10.1145/3585341.3585354

Stellar Mergers with HPX-Kokkos and SYCL: Methods of using an Asynchronous Many-Task Runtime System with SYCL

Authors: Gregor Daiß, Patrick Diehl, Hartmut Kaiser, Dirk Pflüger

Abstract: Ranging from NVIDIA GPUs to AMD GPUs and Intel GPUs: Given the heterogeneity of available accelerator cards within current supercomputers, portability is a key aspect for modern HPC applications. In Octo-Tiger, we rely on Kokkos and its various execution spaces for portable compute kernels. In turn, we use HPX to coordinate kernel launches, CPU tasks, and communication. This combination allows us to have a fine interleaving between portable CPU/GPU computations and communication, enabling scalability on various supercomputers. However, for HPX and Kokkos to work together optimally, we need to be able to treat Kokkos kernels as HPX tasks. Otherwise, instead of integrating asynchronous Kokkos kernel launches into HPX's task graph, we would have to actively wait for them with fence commands, which wastes CPU time better spent otherwise. Using an integration layer called HPX-Kokkos, treating Kokkos kernels as tasks already works for some Kokkos execution spaces (like the CUDA one), but not for others (like the SYCL one). In this work, we started making Octo-Tiger and HPX itself compatible with SYCL. To do so, we introduce numerous software changes, most notably an HPX-SYCL integration. This integration allows us to treat SYCL events as HPX tasks, which in turn allows us to better integrate Kokkos by extending the support of HPX-Kokkos to also fully support Kokkos' SYCL execution space. We show two ways to implement this HPX-SYCL integration and test them using Octo-Tiger and its Kokkos kernels, on both an NVIDIA A100 and an AMD MI100.
We find modest, yet noticeable, speedups by enabling this integration, even when just running simple single-node scenarios with Octo-Tiger where communication and CPU utilization are not yet an issue.

Submitted 8 May, 2023; v1 submitted 4 March, 2023; originally announced March 2023.

arXiv:2302.07191 [pdf, ps, other]

doi 10.1007/978-3-031-32316-4_3

Shared memory parallelism in Modern C++ and HPX

Authors: Patrick Diehl, Steven R. Brandt, Hartmut Kaiser

Abstract: Parallel programming remains a daunting challenge, from the struggle to express a parallel algorithm without cluttering the underlying synchronous logic, to describing which devices to employ in a calculation, to correctness. Over the years, numerous solutions have arisen, many of them requiring new programming languages, extensions to programming languages, or the addition of pragmas. Support for these various tools and extensions is available to a varying degree. In recent years, the C++ standards committee has worked to refine the language features and libraries needed to support parallel programming on a single computational node. Eventually, all major vendors and compilers will provide robust and performant implementations of these standards. Until then, the HPX library and runtime provide cutting edge implementations of the standards, as well as proposed standards and extensions. Because of these advances, it is now possible to write high performance parallel code without custom extensions to C++. We provide an overview of modern parallel programming in C++, describing the language and library features, and providing brief examples of how to use them.

Submitted 9 August, 2023; v1 submitted 16 January, 2023; originally announced February 2023.

Comments: Extended paper for the special issue

arXiv:2210.06439 [pdf, other]

doi 10.1109/ESPM256814.2022.00007

From Merging Frameworks to Merging Stars: Experiences using HPX, Kokkos and SIMD Types

Authors: Gregor Daiß, Srinivas Yadav Singanaboina, Patrick Diehl, Hartmut Kaiser, Dirk Pflüger

Abstract: Octo-Tiger, a large-scale 3D AMR code for the merger of stars, uses a combination of HPX, Kokkos and explicit SIMD types, aiming to achieve performance-portability for a broad range of heterogeneous hardware. However, on A64FX CPUs, we encountered several missing pieces, hindering performance by causing problems with the SIMD vectorization. Therefore, we add std::experimental::simd as an option to use in Octo-Tiger's Kokkos kernels alongside Kokkos SIMD, and further add a new SVE (Scalable Vector Extensions) SIMD backend. Additionally, we amend missing SIMD implementations in the Kokkos kernels within Octo-Tiger's hydro solver. We test our changes by running Octo-Tiger on three different CPUs: an A64FX, an Intel Icelake and an AMD EPYC CPU, evaluating SIMD speedup and node-level performance. We get a good SIMD speedup on the A64FX CPU, as well as noticeable speedups on the other two CPU platforms. However, we also experience a scaling issue on the EPYC CPU.

Submitted 8 May, 2023; v1 submitted 26 September, 2022; originally announced October 2022.

arXiv:2210.06438 [pdf, other]

doi 10.1109/P3HPC56579.2022.00014

From Task-Based GPU Work Aggregation to Stellar Mergers: Turning Fine-Grained CPU Tasks into Portable GPU Kernels

Authors: Gregor Daiß, Patrick Diehl, Dominic Marcello, Alireza Kheirkhahan, Hartmut Kaiser, Dirk Pflüger

Abstract: Meeting both scalability and performance portability requirements is a challenge for any HPC application, especially for adaptively refined ones. In Octo-Tiger, an astrophysics application for the simulation of stellar mergers, we approach this with existing solutions: We employ HPX to obtain fine-grained tasks to easily distribute work and finely overlap communication and computation. For the computations themselves, we use Kokkos to turn these tasks into compute kernels capable of running on hardware ranging from a few CPU cores to powerful accelerators. There is a missing link, however: while the fine-grained parallelism exposed by HPX is useful for scalability, it can hinder GPU performance when the tasks become too small to saturate the device, causing low resource utilization. To bridge this gap, we investigate multiple different GPU work aggregation strategies within Octo-Tiger, adding one new strategy, and evaluate the node-level performance impact on recent AMD and NVIDIA GPUs, achieving noticeable speedups.

Submitted 4 March, 2023; v1 submitted 26 September, 2022; originally announced October 2022.

arXiv:2210.06437 [pdf, other]

Distributed, combined CPU and GPU profiling within HPX using APEX

Authors: Patrick Diehl, Gregor Daiss, Kevin Huck, Dominic Marcello, Sagiv Shiber, Hartmut Kaiser, Juhan Frank, Geoffrey C. Clayton, Dirk Pflueger

Abstract: Benchmarking and comparing performance of a scientific simulation across hardware platforms is a complex task. When the simulation in question is constructed with an asynchronous, many-task (AMT) runtime offloading work to GPUs, the task becomes even more complex. In this paper, we discuss the use of a uniquely suited performance measurement library, APEX, to capture the performance behavior of a simulation built on HPX, a highly scalable, distributed AMT runtime. We examine the performance of the astrophysics simulation carried out by Octo-Tiger on two different supercomputing architectures. We analyze the results of scaling and measurement overheads. In addition, we look in-depth at two similarly configured executions on the two systems to study how architectural differences affect performance and identify opportunities for optimization. As one such opportunity, we optimize the communication for the hydro solver and investigate its performance impact.

Submitted 21 September, 2022; originally announced October 2022.

arXiv:2208.00109 [pdf, other]

Traveler: Navigating Task Parallel Traces for Performance Analysis

Authors: Sayef Azad Sakin, Alex Bigelow, R. Tohid, Connor Scully-Allison, Carlos Scheidegger, Steven R. Brandt, Christopher Taylor, Kevin A. Huck, Hartmut Kaiser, Katherine E. Isaacs

Abstract: Understanding the behavior of software in execution is a key step in identifying and fixing performance issues. This is especially important in high performance computing contexts where even minor performance tweaks can translate into large savings in terms of computational resource use. To aid performance analysis, developers may collect an execution trace - a chronological log of program activity during execution. As traces represent the full history, developers can discover a wide array of possibly previously unknown performance issues, making them an important artifact for exploratory performance analysis. However, interactive trace visualization is difficult due to issues of data size and complexity of meaning. Traces represent nanosecond-level events across many parallel processes, meaning the collected data is often large and difficult to explore. The rise of asynchronous task parallel programming paradigms complicates the relation between events and their probable cause. To address these challenges, we conduct a continuing design study in collaboration with high performance computing researchers. We develop diverse and hierarchical ways to navigate and represent execution trace data in support of their trace analysis tasks. Through an iterative design process, we developed Traveler, an integrated visualization platform for task parallel traces. Traveler provides multiple linked interfaces to help navigate trace data from multiple contexts.
We evaluate the utility of Traveler through feedback from users and a case study, finding that integrating multiple modes of navigation in our design supported performance analysis tasks and led to the discovery of previously unknown behavior in a distributed array library.

Submitted 3 September, 2022; v1 submitted 29 July, 2022; originally announced August 2022.

Comments: IEEE VIS 2022

arXiv:2207.12127 [pdf, other]

doi 10.1007/978-3-031-31209-0_1

Quantifying Overheads in Charm++ and HPX using Task Bench

Authors: Nanmiao Wu, Ioannis Gonidelis, Simeng Liu, Zane Fink, Nikunj Gupta, Karame Mohammadiporshokooh, Patrick Diehl, Hartmut Kaiser, Laxmikant V. Kale

Abstract: Asynchronous Many-Task (AMT) runtime systems take advantage of multi-core architectures with light-weight threads, asynchronous executions, and smart scheduling. In this paper, we present the comparison of the AMT systems Charm++ and HPX with the mainstream MPI, OpenMP, and MPI+OpenMP libraries using the Task Bench benchmarks. Charm++ is a parallel programming language based on C++, supporting stackless tasks as well as light-weight threads asynchronously along with an adaptive runtime system. HPX is a C++ library for concurrency and parallelism, exposing a C++ standards-conforming API. First, we analyze the commonalities, differences, and advantageous scenarios of Charm++ and HPX in detail. Further, to investigate the potential overheads introduced by the tasking systems of Charm++ and HPX, we utilize an existing parameterized benchmark, Task Bench, wherein 15 different programming systems were implemented, e.g., MPI, OpenMP, MPI + OpenMP, and extend Task Bench by adding HPX implementations. We quantify the overheads of Charm++, HPX, and the mainstream libraries in different scenarios where a single task and multi-task are assigned to each core, respectively. We also investigate each system's scalability and the ability to hide the communication latency.

Submitted 21 July, 2022; originally announced July 2022.

arXiv:2206.06302 [pdf, other]

doi 10.1007/978-3-319-46079-6_2

Closing the Performance Gap with Modern C++

Authors: Thomas Heller, Hartmut Kaiser, Patrick Diehl, Dietmar Fey, Marc Alexander Schweitzer

Abstract: On the way to Exascale, programmers face the increasing challenge of having to support multiple hardware architectures from the same code base. At the same time, portability of code and performance are increasingly difficult to achieve as hardware architectures are becoming more and more diverse. Today's heterogeneous systems often include two or more completely distinct and incompatible hardware execution models, such as GPGPUs, SIMD vector units, and general purpose cores which conventionally have to be programmed using separate tool chains representing non-overlapping programming models. The recent revival of interest in the industry and the wider community for the C++ language has spurred a remarkable amount of standardization proposals and technical specifications in the arena of concurrency and parallelism. This recently includes an increasing amount of discussion around the need for a uniform, higher-level abstraction and programming model for parallelism in the C++ standard targeting heterogeneous and distributed computing. Such an abstraction should perfectly blend with existing, already standardized language and library features, but should also be generic enough to support future hardware developments. In this paper, we present the results from developing such a higher-level programming abstraction for parallelism in C++ which aims at enabling code and performance portability over a wide range of architectures and for various types of parallelism. We present and compare performance data obtained from running the well-known STREAM benchmark ported to our higher level C++ abstraction with the corresponding results from running it natively. We show that our abstractions enable performance at least as good as the comparable baseline benchmarks while providing a uniform programming API on all compared target architectures.

Submitted 30 May, 2022; originally announced June 2022.

arXiv:2107.10987 [pdf, other]

doi 10.1109/Cluster48925.2021.00059

Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit

Authors: Patrick Diehl, Gregor Daiß, Dominic Marcello, Kevin Huck, Sagiv Shiber, Hartmut Kaiser, Juhan Frank, Dirk Pflüger

Abstract: Octo-Tiger is a code for modeling three-dimensional self-gravitating astrophysical fluids. It was particularly designed for the study of dynamical mass transfer between interacting binary stars. Octo-Tiger is parallelized for distributed systems using the asynchronous many-task runtime system, the C++ standard library for parallelism and concurrency (HPX) and utilizes CUDA for its gravity solver. Recently, we have remodeled Octo-Tiger's hydro solver to use a three-dimensional reconstruction scheme. In addition, we have ported the hydro solver to GPU using CUDA kernels. We present scaling results for the new hydro kernels on ORNL's Summit machine using a Sedov-Taylor blast wave problem. We also compare Octo-Tiger's new hydro scheme with its old hydro scheme, using a rotating star as a test problem.

Submitted 26 July, 2021; v1 submitted 22 July, 2021; originally announced July 2021.

Comments: Accepted to IEEE Cluster

arXiv:2105.00027 [pdf, other]

Memory Reduction using a Ring Abstraction over GPU RDMA for Distributed Quantum Monte Carlo Solver

Authors: Weile Wei, Eduardo D'Azevedo, Kevin Huck, Arghya Chatterjee, Oscar Hernandez, Hartmut Kaiser

Abstract: Scientific applications that run on leadership computing facilities often face the challenge of being unable to fit leading science cases onto accelerator devices due to memory constraints (memory-bound applications). In this work, the authors studied one such US Department of Energy mission-critical condensed matter physics application, Dynamical Cluster Approximation (DCA++), and this paper discusses how device memory-bound challenges were successfully reduced by proposing an effective "all-to-all" communication method -- a ring communication algorithm. This implementation takes advantage of acceleration on GPUs and remote direct memory access (RDMA) for fast data exchange between GPUs. Additionally, the ring algorithm was optimized with sub-ring communicators and multi-threaded support to further reduce communication overhead and expose more concurrency, respectively. The computation and communication were also analyzed by using the Autonomic Performance Environment for Exascale (APEX) profiling tool, and this paper further discusses the performance trade-off for the ring algorithm implementation. The memory analysis on the ring algorithm shows that the allocation size for the authors' most memory-intensive data structure per GPU is now reduced to 1/p of the original size, where p is the number of GPUs in the ring communicator. The communication analysis suggests that the distributed Quantum Monte Carlo execution time grows linearly as sub-ring size increases, and the cost of messages passing through the network interface connector could be a limiting factor.

Submitted 13 May, 2021; v1 submitted 30 April, 2021; originally announced May 2021.

arXiv:2102.00223 [pdf, other]

doi 10.1109/MCSE.2021.3073626

Performance Measurements within Asynchronous Task-based Runtime Systems: A Double White Dwarf Merger as an Application

Authors: Patrick Diehl, Dominic Marcello, Parsa Amini, Hartmut Kaiser, Sagiv Shiber, Geoffrey C. Clayton, Juhan Frank, Gregor Daiß, Dirk Pflüger, David Eder, Alice Koniges, Kevin Huck

Abstract: Analyzing performance within asynchronous many-task-based runtime systems is challenging because millions of tasks are launched concurrently. Especially for long-term runs the amount of data collected becomes overwhelming. We study HPX and its performance-counter framework and APEX to collect performance data and energy consumption. We added HPX application-specific performance counters to the Octo-Tiger full 3D AMR astrophysics application. This enables the combined visualization of physical and performance data to highlight bottlenecks with respect to different solvers. We examine the overhead introduced by these measurements, which is around 1%, with respect to the overall application runtime. We perform a convergence study for four different levels of refinement and analyze the application's performance with respect to adaptive grid refinement. The measurements' overheads are small, enabling the combined use of performance data and physical properties with the goal of improving the code's performance. All of these measurements were obtained on NERSC's Cori, Louisiana Optical Network Infrastructure's QueenBee2, and Indiana University's Big Red 3.

Submitted 9 June, 2021; v1 submitted 30 January, 2021; originally announced February 2021.

arXiv:2101.08226 [pdf, other]

doi 10.1093/mnras/stab937

Octo-Tiger: A New, 3D Hydrodynamic Code for Stellar Mergers that uses HPX Parallelisation

Authors: Dominic C. Marcello, Sagiv Shiber, Orsola De Marco, Juhan Frank, Geoffrey C. Clayton, Patrick M. Motl, Patrick Diehl, Hartmut Kaiser

Abstract: OCTO-TIGER is an astrophysics code to simulate the evolution of self-gravitating and rotating systems of arbitrary geometry based on the fast multipole method, using adaptive mesh refinement. OCTO-TIGER is currently optimised to simulate the merger of well-resolved stars that can be approximated by barotropic structures, such as white dwarfs or main sequence stars. The gravity solver conserves angular momentum to machine precision, thanks to a correction algorithm. This code uses HPX parallelization, allowing the overlap of work and communication and leading to excellent scaling properties, allowing for the computation of large problems in reasonable wall-clock times. In this paper, we investigate the code performance and precision by running benchmarking tests. These include simple problems, such as the Sod shock tube, as well as sophisticated, full, white-dwarf binary simulations. Results are compared to analytic solutions, when known, and to other grid based codes such as FLASH. We also compute the interaction between two white dwarfs from the early mass transfer through to the merger and compare with past simulations of similar systems. We measure OCTO-TIGER's scaling properties up to a core count of 80,000, showing excellent performance for large problems. Finally, we outline the current and planned areas of development aimed at tackling a number of physical phenomena connected to observations of transients.

Submitted 10 August, 2021; v1 submitted 20 January, 2021; originally announced January 2021.

Comments: 38 pages, 24 figures, Co-Lead Authors: Dominic C. Marcello and Sagiv Shiber

arXiv:2010.10930 [pdf, ps, other]

Towards Distributed Software Resilience in Asynchronous Many-Task Programming Models

Authors: Nikunj Gupta, Jackson R. Mayo, Adrian S. Lemoine, Hartmut Kaiser

Abstract: Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will likely increase. Therefore, designing our applications to be resilient is a critical concern in order to retain the reliability of results while meeting the constraints on power budgets. In this paper, we discuss software resilience in AMTs at both local and distributed scale. We choose HPX to prototype our resiliency designs. We implement two resiliency APIs that we expose to the application developers, namely task replication and task replay. Task replication repeats a task n times and executes the copies asynchronously. Task replay reschedules a task up to n times until a valid output is returned. Furthermore, we expose algorithm based fault tolerance (ABFT) using user provided predicates (e.g., checksums) to validate the returned results. We benchmark the resiliency scheme for both synthetic and real world applications at local and distributed scale and show that most of the added execution time arises from the replay, replication or data movement of the tasks and not the boilerplate code added to achieve resilience.

Submitted 19 October, 2020; originally announced October 2020.

Comments: arXiv admin note: text overlap with arXiv:2004.07203

Report number: SAND2020-11278 C

arXiv:2010.07098 [pdf, other]

Performance Analysis of a Quantum Monte Carlo Application on Multiple Hardware Architectures Using the HPX Runtime

Authors: Weile Wei, Arghya Chatterjee, Kevin Huck, Oscar Hernandez, Hartmut Kaiser

Abstract: This paper describes how we successfully used the HPX programming model to port the DCA++ application to multiple architectures that include POWER9, x86, ARM v8, and NVIDIA GPUs. We describe the lessons we can learn from this experience as well as the benefits of enabling HPX in the application to improve the CPU threading part of the code, which led to an overall 21% improvement across architectures. We also describe how we used HPX-APEX to raise the level of abstraction to understand performance issues and to identify tasking optimization opportunities in the code, and how these relate to CPU/GPU utilization counters, device memory allocation over time, and CPU kernel-level context switches on a given architecture.

Submitted 19 October, 2020; v1 submitted 14 October, 2020; originally announced October 2020.

arXiv:2010.04106 [pdf, other]

doi 10.1109/ESPM251964.2020.00007

Deploying a Task-based Runtime System on Raspberry Pi Clusters

Authors: Nikunj Gupta, Steve R. Brandt, Bibek Wagle, Nanmiao, Alireza Kheirkhahan, Patrick Diehl, Hartmut Kaiser, Felix W. Baumann

Abstract: Arm technology is becoming increasingly important in HPC. Recently, Fugaku, an Arm-based system, was awarded the number one place in the Top500 list. Raspberry Pis provide an inexpensive platform to become familiar with this architecture. However, Pis can also be useful on their own. Here we describe our efforts to configure and benchmark the use of a Raspberry Pi cluster with the HPX/Phylanx platform (normally intended for use with HPC applications) and document the lessons we learned. First, we highlight the required changes in the configuration of the Pi to gain performance. Second, we explore how limited memory bandwidth limits the use of all cores in our shared memory benchmarks. Third, we evaluate whether low network bandwidth affects distributed performance. Fourth, we discuss the power consumption and the resulting trade-off in cost of operation and performance.

Submitted 9 April, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

arXiv:2010.03012 [pdf, other]

doi 10.1109/DLS51937.2020.00008

Towards a Scalable and Distributed Infrastructure for Deep Learning Applications

Authors: Bita Hasheminezhad, Shahrzad Shirzad, Nanmiao Wu, Patrick Diehl, Hannes Schulz, Hartmut Kaiser

Abstract: Although recent scaling up approaches to training deep neural networks have proven to be effective, the computational intensity of large and complex models, as well as the availability of large-scale datasets, require deep learning frameworks to utilize scaling out techniques. Parallelization approaches and distribution requirements are not considered in the preliminary designs of most available distributed deep learning frameworks, and most of them are still not able to perform effective and efficient fine-grained inter-node communication. We present Phylanx, which has the potential to alleviate these shortcomings. Phylanx offers a productivity-oriented frontend where user Python code is translated to a futurized execution tree that can be executed efficiently on multiple nodes using the C++ standard library for parallelism and concurrency (HPX), leveraging fine-grained threading and an active messaging task-based runtime system.

Submitted 19 April, 2021; v1 submitted 6 October, 2020; originally announced October 2020.

arXiv:2004.07203 [pdf, other]

Implementing Software Resiliency in HPX for Extreme Scale Computing

Authors: Nikunj Gupta, Jackson R. Mayo, Adrian S. Lemoine, Hartmut Kaiser

Abstract: Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will invariably increase. Therefore, designing our applications to be resilient is a critical concern in order to retain the reliability of results while meeting the constraints on power budgets. In this paper, we implement software resilience in HPX, an Asynchronous Many-Task Runtime system. We implement two resiliency APIs that we expose to the application developers, namely task replication and task replay. Task replication repeats a task n times and executes the copies asynchronously. Task replay will reschedule a task up to n times until a valid output is returned. Furthermore, we introduce an API that allows the application to verify the returned result with a user provided predicate. We test the APIs with both artificial workloads and a dataflow based stencil application. We demonstrate that only minor overheads are incurred when utilizing these resiliency features for workloads where the task size is greater than 200 μs. We also show that most of the added execution time arises from the replay or replication of the tasks themselves and not from the implementation of the APIs.

Submitted 15 April, 2020; originally announced April 2020.

Comments: 7 pages, 5 figures

Report number: SAND2020-3975 R
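
The abstract above describes two resiliency patterns: task replay (rerun a task up to n times until a user-provided predicate accepts the result) and task replication (launch n copies and keep a validated result). As a rough illustration only, the following Python sketch mimics both patterns with the standard-library thread pool. It is not the HPX API; the names `replay`, `replicate`, and `make_flaky` are hypothetical and chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor
import itertools


def replay(executor, task, validate, n):
    """Return a future that runs `task` up to n times until `validate` accepts a result."""
    def attempt():
        last_error = None
        for _ in range(n):
            try:
                result = task()
            except Exception as exc:  # a failed run counts as one attempt
                last_error = exc
                continue
            if validate(result):
                return result
        raise RuntimeError(f"no valid result after {n} attempts") from last_error

    return executor.submit(attempt)


def replicate(executor, task, validate, n):
    """Launch n copies of `task` concurrently; return the first valid result
    in submission order. This call blocks until a result is chosen."""
    futures = [executor.submit(task) for _ in range(n)]
    for future in futures:
        try:
            result = future.result()
        except Exception:
            continue
        if validate(result):
            return result
    raise RuntimeError(f"none of the {n} replicas produced a valid result")


def make_flaky(bad_runs):
    """Hypothetical test task: returns -1 for the first `bad_runs` calls, then 42."""
    calls = itertools.count()
    return lambda: -1 if next(calls) < bad_runs else 42


if __name__ == "__main__":
    is_valid = lambda r: r >= 0  # stands in for a user-provided predicate such as a checksum
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(replay(pool, make_flaky(2), is_valid, 5).result())
        print(replicate(pool, make_flaky(2), is_valid, 3))
```

In this sketch replay costs extra time only when a run fails, while replication always pays for n runs up front. That is consistent with the abstract's observation that the added execution time comes mainly from rerunning or duplicating the tasks themselves.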

arXiv:2002.07970 [pdf, other]

Supporting OpenMP 5.0 Tasks in hpxMP -- A study of an OpenMP implementation within Task Based Runtime Systems

Authors: Tianyi Zhang, Shahrzad Shirzad, Bibek Wagle, Adrian S. Lemoine, Patrick Diehl, Hartmut Kaiser

Abstract: OpenMP has been the de facto standard for single node parallelism for more than a decade. Recently, asynchronous many-task runtime (AMT) systems have increased in popularity as a new programming paradigm for high performance computing applications. One of the major challenges of this new paradigm is the incompatibility of the OpenMP thread model and other AMTs. Highly optimized OpenMP-based libraries do not perform well when coupled with AMTs because the threading of both libraries will compete for resources. This paper is a follow-up paper on the fundamental implementation of hpxMP, an implementation of the OpenMP standard which utilizes the C++ standard library for Parallelism and Concurrency (HPX) to schedule and manage tasks. In this paper, we present the implementation of task features, e.g., taskgroup, task depend, and task_reduction, of the OpenMP 5.0 standard and optimization of the #pragma omp parallel for pragma. We use the daxpy benchmark, the Barcelona OpenMP Tasks Suite, Parallel research kernels, and OpenBLAS benchmarks to compare the different OpenMP implementations: hpxMP, llvm-OpenMP, and GOMP.

Submitted 18 February, 2020; originally announced February 2020.

arXiv:1909.03947 [pdf, other]

doi 10.1109/MLHPC49564.2019.00009

Scheduling optimization of parallel linear algebra algorithms using Supervised Learning

Authors: G. Laberge, S. Shirzad, P. Diehl, H. Kaiser, S. Prudhomme, A. Lemoine

Abstract: Linear algebra algorithms are used widely in a variety of domains, e.g., machine learning, numerical physics and video game graphics. For all these applications, loop-level parallelism is required to achieve high performance. However, finding the optimal way to schedule the workload between threads is a non-trivial problem because it depends on the structure of the algorithm being parallelized and the hardware the executable is run on. In the realm of Asynchronous Many Task runtime systems, a key aspect of the scheduling problem is predicting the proper chunk-size, where the chunk-size is defined as the number of iterations of a for-loop assigned to a thread as one task. In this paper, we study the applications of supervised learning models to predict the chunk-size which yields maximum performance on multiple parallel linear algebra operations using the HPX backend of Blaze's linear algebra library. More precisely, we generate our training and test sets by measuring performance of the application with different chunk-sizes for multiple linear algebra operations: vector-addition, matrix-vector-multiplication, matrix-matrix addition and matrix-matrix-multiplication. We compare the use of logistic regression, neural networks and decision trees with a newly developed decision tree based model in order to predict the optimal value for chunk-size. Our results show that classical decision trees and our custom decision tree model are able to forecast a chunk-size which results in good performance for the linear algebra operations.

Submitted 25 September, 2019; v1 submitted 9 September, 2019; originally announced September 2019.

Comments: Accepted at HPCML19
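
The abstract above defines chunk-size as the number of for-loop iterations assigned to a thread as one task. The short Python sketch below illustrates only that decomposition, using vector addition as the loop body. The function names `chunk_ranges` and `chunked_vector_add` are hypothetical, and the code is unrelated to the Blaze or HPX implementations. Because of CPython's global interpreter lock, it shows how the work is split rather than any speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(n_iterations, chunk_size):
    """Split the iteration space [0, n_iterations) into consecutive chunks
    of at most `chunk_size` iterations each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        range(start, min(start + chunk_size, n_iterations))
        for start in range(0, n_iterations, chunk_size)
    ]


def chunked_vector_add(a, b, chunk_size, max_workers=4):
    """Add two equally long vectors, scheduling one task per chunk."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    out = [0] * len(a)

    def run_chunk(indices):
        # Each task handles exactly one chunk of loop iterations.
        for i in indices:
            out[i] = a[i] + b[i]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list(...) forces all tasks to finish and surfaces any exception.
        list(pool.map(run_chunk, chunk_ranges(len(a), chunk_size)))
    return out


if __name__ == "__main__":
    print(len(chunk_ranges(10, 4)))
    print(chunked_vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], chunk_size=2))
```

A small chunk-size creates many short tasks and more scheduling overhead, while a large one creates few tasks and may leave threads idle. This is the trade-off that makes the best value depend on both the operation and the hardware.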

arXiv:1908.03121 [pdf, other]

doi 10.1145/3295500.3356221

From Piz Daint to the Stars: Simulation of Stellar Mergers using High-Level Abstractions

Authors: Gregor Daiß, Parsa Amini, John Biddiscombe, Patrick Diehl, Juhan Frank, Kevin Huck, Hartmut Kaiser, Dominic Marcello, David Pfander, Dirk Pflüger

Abstract: We study the simulation of stellar mergers, which requires complex simulations with high computational demands. We have developed Octo-Tiger, a finite volume grid-based hydrodynamics simulation code with Adaptive Mesh Refinement which is unique in conserving both linear and angular momentum to machine precision. To face the challenge of increasingly complex, diverse, and heterogeneous HPC systems, Octo-Tiger relies on high-level programming abstractions. We use HPX with its futurization capabilities to ensure scalability both between nodes and within, and present first results replacing MPI with libfabric achieving up to a 2.8x speedup. We extend Octo-Tiger to heterogeneous GPU-accelerated supercomputers, demonstrating node-level performance and portability. We show scalability up to full system runs on Piz Daint. For the scenario's maximum resolution, the compute-critical parts (hydrodynamics and gravity) achieve 68.1% parallel efficiency at 2048 nodes.

Submitted 9 August, 2019; v1 submitted 8 August, 2019; originally announced August 2019.

Comments: Accepted at SC19

arXiv:1903.03023 [pdf, other]

doi 10.1145/3318170.3318191

An Introduction to hpxMP: A Modern OpenMP Implementation Leveraging HPX, An Asynchronous Many-Task System

Authors: Tianyi Zhang, Shahrzad Shirzad, Patrick Diehl, R. Tohid, Weile Wei, Hartmut Kaiser

Abstract: Asynchronous Many-task (AMT) runtime systems have gained increasing acceptance in the HPC community due to the performance improvements offered by fine-grained tasking runtime systems. At the same time, C++ standardization efforts are focused on creating higher-level interfaces able to replace OpenMP or OpenACC in modern C++ codes. These higher level functions have been adopted in standards conforming runtime systems such as HPX, giving users the ability to simply utilize fork-join parallelism in their own codes. Despite innovations in runtime systems and standardization efforts, users face enormous challenges porting legacy applications. Not only must users port their own codes, but often users rely on highly optimized libraries such as BLAS and LAPACK which use OpenMP for parallelization. Current efforts to create smooth migration paths have struggled with these challenges, especially as the threading systems of AMT libraries often compete with the threading system of OpenMP. To overcome these issues, our team has developed hpxMP, an implementation of the OpenMP standard, which utilizes the underlying AMT system to schedule and manage tasks. This approach leverages the C++ interfaces exposed by HPX and allows users to execute their applications on an AMT system without changing their code. In this work, we compare hpxMP with Clang's OpenMP library with four linear algebra benchmarks of the Blaze C++ library. While hpxMP is often not able to reach the same performance, we demonstrate its viability for providing a smooth migration for applications, but it has to be extended to benefit from a more general task based programming model.

Submitted 5 July, 2019; v1 submitted 7 March, 2019; originally announced March 2019.

arXiv:1810.11482 [pdf, other]

doi 10.1109/ESPM2.2018.00006

Integration of CUDA Processing within the C++ library for parallelism and concurrency (HPX)

Authors: Patrick Diehl, Madhavan Seshadri, Thomas Heller, Hartmut Kaiser

Abstract: Experience shows that on today's high performance systems the utilization of different acceleration cards in conjunction with a high utilization of all other parts of the system is difficult. Future architectures, like exascale clusters, are expected to aggravate this issue as the number of cores is expected to increase and memory hierarchies are expected to become deeper. One big aspect for distributed applications is to guarantee high utilization of all available resources, including local or remote acceleration cards on a cluster while fully using all the available CPU resources and the integration of the GPU work into the overall programming model. For the integration of CUDA code we extended HPX, a general purpose C++ run time system for parallel and distributed applications of any scale, and enabled asynchronous data transfers from and to the GPU device and the asynchronous invocation of CUDA kernels on this data. Both operations are well integrated into the general programming model of HPX, which allows any GPU operation to be seamlessly overlapped with work on the main cores. Any user defined CUDA kernel can be launched on any (local or remote) GPU device available to the distributed application. We present asynchronous implementations for the data transfers and kernel launches for CUDA code as part of an HPX asynchronous execution graph. Using this approach we can combine all remotely and locally available acceleration cards on a cluster to utilize its full performance capabilities. Overhead measurements show that the integration of the asynchronous operations (data transfer + launches of the kernels) as part of the HPX execution graph imposes no additional computational overhead and significantly eases orchestrating coordinated and concurrent work on the main cores and the used GPU devices.

Submitted 26 October, 2018; originally announced October 2018.

arXiv:1810.07591 [pdf, other]

doi 10.1109/ESPM2.2018.00009

Asynchronous Execution of Python Code on Task Based Runtime Systems

Authors: R. Tohid, Bibek Wagle, Shahrzad Shirzad, Patrick Diehl, Adrian Serio, Alireza Kheirkhahan, Parsa Amini, Katy Williams, Kate Isaacs, Kevin Huck, Steven Brandt, Hartmut Kaiser

Abstract: Despite advancements in the areas of parallel and distributed computing, the complexity of programming on High Performance Computing (HPC) resources has deterred many domain experts, especially in the areas of machine learning and artificial intelligence (AI), from utilizing performance benefits of such systems. Researchers and scientists favor high-productivity languages to avoid the inconvenience of programming in low-level languages and costs of acquiring the necessary skills required for programming at this level. In recent years, Python, with the support of linear algebra libraries like NumPy, has gained popularity despite facing limitations which prevent this code from distributed runs. Here we present a solution which maintains both high level programming abstractions as well as parallel and distributed efficiency. Phylanx is an asynchronous array processing toolkit which transforms Python and NumPy operations into code which can be executed in parallel on HPC resources by mapping Python and NumPy functions and variables into a dependency tree executed by HPX, a general purpose, parallel, task-based runtime system written in C++. Phylanx additionally provides introspection and visualization capabilities for debugging and performance analysis. We have tested the foundations of our approach by comparing our implementation of widely used machine learning algorithms to accepted NumPy standards.

Submitted 22 October, 2018; v1 submitted 17 October, 2018; originally announced October 2018.

arXiv:1806.06917 [pdf, other]

doi 10.1007/s42452-020-03784-x

An asynchronous and task-based implementation of Peridynamics utilizing HPX -- the C++ standard library for parallelism and concurrency

Authors: Patrick Diehl, Prashant K. Jha, Hartmut Kaiser, Robert Lipton, Martin Levesque

Abstract: On modern supercomputers, asynchronous many task systems are emerging to address the new architecture of computational nodes. With this shift toward more cores per node, a new programming model focused on handling the fine-grain parallelism of this increasing number of cores per computational node is needed. Asynchronous Many Task (AMT) run time systems represent an emerging paradigm for addressing fine-grain parallelism since they handle the increasing number of threads per node and concurrency. HPX, an open source C++ standard library for parallelism and concurrency, is one AMT which conforms to the C++ standard. This means that HPX's Application Programming Interface (API) conforms to its definition by the C++ standard committee. For example, for the concept of futurization, hpx::future can be replaced by std::future without breaking the API. Peridynamics is a non-local generalization of continuum mechanics tailored to address discontinuous displacement fields arising in fracture mechanics. Like many non-local approaches, peridynamics requires considerable computing resources to solve practical problems. This paper investigates the implementation of a peridynamics EMU nodal discretization in an asynchronous task-based fashion. The scalability of the asynchronous task-based implementation is shown to be in agreement with theoretical estimations. In addition to the scalability, the code is convergent for implicit time integration and recovers theoretical solutions. For explicit time integration, convergence results are presented to showcase the agreement of results with theoretical claims in previous works.

Submitted 28 October, 2020; v1 submitted 18 June, 2018; originally announced June 2018.

arXiv:1711.01519 [pdf, other]

doi 10.1145/3152041.3152084

HPX Smart Executors

Authors: Zahra Khatami, Lukas Troska, Hartmut Kaiser, J. Ramanujam, Adrian Serio

Abstract: The performance of many parallel applications depends on loop-level parallelism. However, manually parallelizing all loops may degrade parallel performance, as some of them cannot scale desirably to a large number of threads. In addition, the overheads of manually tuning loop parameters might prevent an application from reaching its maximum parallel performance. We illustrate how machine learning techniques can be applied to address these challenges. In this research, we develop a framework that is able to automatically capture the static and dynamic information of a loop. Moreover, we advocate a novel method by introducing HPX smart executors for determining the execution policy, chunk size, and prefetching distance of an HPX loop to achieve the highest possible performance by feeding static information captured during compilation and runtime-based dynamic information to our learning model. Our evaluated execution results show that using these smart executors can speed up the HPX execution process by around 12%-35% for the Matrix Multiplication, Stream and 2D Stencil benchmarks compared to setting their HPX loop's execution policy/parameters manually or using HPX auto-parallelization techniques.

Submitted 4 November, 2017; originally announced November 2017.

Comments: In Proceedings of ESPM2'17: Third International Workshop on Extreme Scale Programming Models and Middleware, Denver, CO, USA, November 12-17, 2017 (ESPM2'17), 8 pages

arXiv:1703.09264 [pdf, other]

Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques

Authors: Zahra Khatami, Hartmut Kaiser, J. Ramanujam

Abstract: Maximizing the level of parallelism in applications can be achieved by minimizing overheads due to load imbalances and waiting time due to memory latencies. Compiler optimization is one of the most effective solutions to tackle this problem. The compiler is able to detect the data dependencies in an application and is able to analyze the specific sections of code for parallelization potential. However, all of these techniques provided with a compiler are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and producing desired application scalability. One solution to address this challenge is the use of runtime methods. This strategy can be implemented by deferring a certain amount of code analysis to runtime. In this research, we improve the performance of parallel applications generated by the OP2 compiler by leveraging HPX, a C++ runtime system, to provide runtime optimizations. These optimizations include asynchronous tasking, loop interleaving, dynamic chunk sizing, and data prefetching. The results of the research were evaluated using an Airfoil application which showed a 40-50% improvement in parallel performance.

Submitted 27 March, 2017; originally announced March 2017.

Comments: 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017)

arXiv:1611.00463 [pdf, other]

A Load-Balanced Parallel and Distributed Sorting Algorithm Implemented with PGX.D

Authors: Zahra Khatami, Sungpack Hong, Jinsoo Lee, Siegfried Depner, Hassan Chafi, J. Ramanujam, Hartmut Kaiser

Abstract: Sorting is one of the most challenging and widely studied problems in scientific research. Although many techniques and algorithms have been proposed for efficient parallel sorting, achieving the desired performance on different types of architectures with large numbers of processors is still a challenging issue. Maximizing the level of parallelism in applications can be achieved by minimizing overheads due to load imbalance and waiting time due to memory latencies. In this paper, we present a distributed sorting algorithm implemented in PGX.D, a fast distributed graph processing system, which outperforms Spark's distributed sorting implementation by around 2x-3x by hiding communication latencies and minimizing unnecessary overheads. Furthermore, we show that the proposed PGX.D sorting method efficiently handles datasets containing many duplicated data entries and always keeps workloads balanced for different input data distribution types.

Submitted 14 January, 2017; v1 submitted 1 November, 2016; originally announced November 2016.

Comments: 8 pages, 12 figures

arXiv:1510.05804 [pdf, other]

doi 10.1073/pnas.1612964114

Mermin-Wagner fluctuations in 2D amorphous solids

Authors: Bernd Illing, Sebastian Frischi, Herbert Kaiser, Christian Klix, Georg Maret, Peter Keim

Abstract: In a recent comment, M. Kosterlitz described how the discrepancy about the lack of broken translational symmetry in two dimensions - doubting the existence of 2D crystals - and the first computer simulations foretelling 2D crystals at least in tiny systems, motivated him and D. Thouless to investigate melting and superfluidity in two dimensions [Jour. of Phys. Cond. Matt. 28, 481001 (2016)]. The lack of broken symmetries proposed by D. Mermin and H. Wagner is caused by long wavelength density fluctuations. Those fluctuations have not only a structural impact but also a dynamical one: They cause the Lindemann criterion to fail in 2D and the mean squared displacement not to be limited. Comparing experimental data from 3D and 2D amorphous solids with 2D crystals we disentangle Mermin-Wagner fluctuations from glassy structural relaxations. Furthermore we can demonstrate with computer simulations the logarithmic increase of displacements predicted by Mermin and Wagner: periodicity is not a requirement for Mermin-Wagner fluctuations which conserve the homogeneity of space on long scales.

Submitted 30 October, 2016; v1 submitted 20 October, 2015; originally announced October 2015.

Comments: 7 pages, 4 figures

Journal ref: Proc. Natl. Acad. Sci. 114, 2440 (2017)

arXiv:1205.5055 [pdf, other]

Neutron Star Evolutions using Tabulated Equations of State with a New Execution Model

Authors: Matthew Anderson, Maciej Brodowicz, Hartmut Kaiser, Bryce Adelstein-Lelbach, Thomas Sterling

Abstract: The addition of nuclear and neutrino physics to general relativistic fluid codes allows for a more realistic description of hot nuclear matter in neutron star and black hole systems. This additional microphysics requires that each processor have access to large tables of data, such as equations of state, and in large simulations the memory required to store these tables locally can become excessive unless an alternative execution model is used. In this work we present relativistic fluid evolutions of a neutron star obtained using a message driven multi-threaded execution model known as ParalleX. These neutron star simulations would require substantial memory overhead dedicated entirely to the equation of state table if using a more traditional execution model. We introduce a ParalleX component based on Futures for accessing large tables of data, including out-of-core sized tables, which does not require substantial memory overhead and effectively hides any increased network latency.

Submitted 22 May, 2012; originally announced May 2012.

Comments: 9 pages, 8 figures. arXiv admin note: substantial text overlap with arXiv:1110.1131

arXiv:1110.1131 [pdf, other]

Adaptive Mesh Refinement for Astrophysics Applications with ParalleX

Authors: Matthew Anderson, Maciej Brodowicz, Hartmut Kaiser, Bryce Adelstein-Lelbach, Thomas Sterling

Abstract: Several applications in astrophysics require adequately resolving many physical and temporal scales which vary over several orders of magnitude. Adaptive mesh refinement techniques address this problem effectively but often result in constrained strong scaling performance. The ParalleX execution model is an experimental execution model that aims to expose new forms of program parallelism and eliminate any global barriers present in a scaling-impaired application such as adaptive mesh refinement. We present two astrophysics applications using the ParalleX execution model: a tabulated equation of state component for neutron star evolutions and a cosmology model evolution. Performance and strong scaling results from both simulations are presented. The tabulated equation of state data are distributed with transparent access over the nodes of the cluster. This allows seamless overlapping of computation with the latencies introduced by the remote access to the table. Because of the expected size increases to the equation of state table, this type of table partitioning for neutron star simulations is essential while the implementation is greatly simplified by ParalleX semantics.

Submitted 5 October, 2011; originally announced October 2011.

arXiv:1109.5201 [pdf, other]

An Application Driven Analysis of the ParalleX Execution Model

Authors: Matthew Anderson, Maciej Brodowicz, Hartmut Kaiser, Thomas Sterling

Abstract: Exascale systems, expected to emerge by the end of the next decade, will require the exploitation of billion-way parallelism at multiple hierarchical levels in order to achieve the desired sustained performance. The task of assessing future machine performance is approached by identifying the factors which currently challenge the scalability of parallel applications. It is suggested that the root cause of these challenges is the incoherent coupling between the current enabling technologies, such as Non-Uniform Memory Access of present multicore nodes equipped with optional hardware accelerators and the decades older execution model, i.e., the Communicating Sequential Processes (CSP) model best exemplified by the message passing interface (MPI) application programming interface. A new execution model, ParalleX, is introduced as an alternative to the CSP model. In this paper, an overview of the ParalleX execution model is presented along with details about a ParalleX-compliant runtime system implementation called High Performance ParalleX (HPX). Scaling and performance results for an adaptive mesh refinement numerical relativity application developed using HPX are discussed. The performance results of this HPX-based application are compared with a counterpart MPI-based mesh refinement code. The overheads associated with HPX are explored and hardware solutions are introduced for accelerating the runtime system.

Submitted 23 September, 2011; originally announced September 2011.

Comments: 9 Figures

arXiv:1109.5190 [pdf, ps, other]

doi 10.1177/1094342012440585

Improving the scalability of parallel N-body applications with an event driven constraint based execution model

Authors: Chirag Dekate, Matthew Anderson, Maciej Brodowicz, Hartmut Kaiser, Bryce Adelstein-Lelbach, Thomas Sterling

Abstract: The scalability and efficiency of graph applications are significantly constrained by conventional systems and their supporting programming models. Technology trends like multicore, manycore, and heterogeneous system architectures are introducing further challenges and possibilities for emerging application domains such as graph applications. This paper explores the space of effective parallel execution of ephemeral graphs that are dynamically generated using the Barnes-Hut algorithm to exemplify dynamic workloads. The workloads are expressed using the semantics of an Exascale computing execution model called ParalleX. For comparison, results using conventional execution model semantics are also presented. We find improved load balancing during runtime and automatic parallelism discovery improving efficiency using the advanced semantics for Exascale computing.

Submitted 23 September, 2011; originally announced September 2011.

Comments: 11 figures

Journal ref: International Journal of High Performance Computing Applications, April 11, 2012

arXiv:0803.4170 [pdf, ps, other]

doi 10.1016/j.nima.2007.12.044

Neutronic Design and Measured Performance of the Low Energy Neutron Source (LENS) Target Moderator Reflector Assembly

Authors: C. M. Lavelle, D. V. Baxter, A. Bogdanov, V. P. Derenchuk, H. Kaiser, M. B. Leuschner, M. A. Lone, W. Lozowski, H. Nann, B. v. Przewoski, N. Remmes, T. Rinckel, Y. Shin, W. M. Snow, P. E. Sokol

Abstract: The Low Energy Neutron Source (LENS) is an accelerator-based pulsed cold neutron facility under construction at the Indiana University Cyclotron Facility (IUCF). The idea behind LENS is to produce pulsed cold neutron beams starting with ~MeV neutrons from (p,n) reactions in Be which are moderated to meV energies and extracted from a small solid angle for use in neutron instruments which can operate efficiently with relatively broad (~1 msec) neutron pulse widths. Although the combination of the features and operating parameters of this source is unique at present, the neutronic design possesses several features similar to those envisioned for future neutron facilities such as long-pulsed spallation sources (LPSS) and very cold neutron (VCN) sources. We describe the underlying ideas and design details of the target/moderator/reflector system (TMR) and compare measurements of its brightness, energy spectrum, and emission time distribution under different moderator configurations with MCNP simulations. Brightness measurements using an ambient temperature water moderator agree with MCNP simulations within the 20% accuracy of the measurement. The measured neutron emission time distribution from a solid methane moderator is in agreement with simulation and the cold neutron flux is sufficient for neutron scattering studies of materials. We describe some possible modifications to the existing design which would increase the cold neutron brightness with negligible effect on the emission time distribution.

Submitted 28 March, 2008; originally announced March 2008.

Comments: This is a preprint version of an article which has been published in Nuclear Instruments and Methods in Physics Research A 587 (2008) 324-341. http://dx.doi.org/10.1016/j.nima.2007.12.044

Journal ref: Nucl.Instrum.Meth.A587:324-341,2008

arXiv:math/0701132 [pdf, ps, other]

Classical solutions of drift-diffusion equations for semiconductor devices: the 2d case

Authors: Hans-Christoph Kaiser, Hagen Neidhardt, Joachim Rehberg

Abstract: We regard drift-diffusion equations for semiconductor devices in Lebesgue spaces. To that end we reformulate the (generalized) van Roosbroeck system as an evolution equation for the potentials to the driving forces of the currents of electrons and holes. This evolution equation falls into a class of quasi-linear parabolic systems which allow unique, local in time solutions in certain Lebesgue spaces. In particular, it turns out that the divergence of the electron and hole current is an integrable function. Hence, Gauss' theorem applies, and gives the foundation for space discretization of the equations by means of finite volume schemes. Moreover, the strong differentiability of the electron and hole density in time is constitutive for the implicit time discretization scheme. Finite volume discretization of space and implicit time discretization are accepted custom in engineering and scientific computing. This investigation puts special emphasis on non-smooth spatial domains, mixed boundary conditions, and heterogeneous material compositions, as required in electronic device simulation.

Submitted 4 January, 2007; originally announced January 2007.

Report number: WIAS Preprint No. 1189 (2006)

MSC Class: 35K45; 35K50; 35K55; 35K57; 78A35

arXiv:nucl-ex/0509018 [pdf, ps, other]

doi 10.1016/j.physb.2006.05.187

Measuring the Neutron's Mean Square Charge Radius Using Neutron Interferometry

Authors: F. E. Wietfeldt, M. Huber, T. C. Black, H. Kaiser, M. Arif, D. L. Jacobson, S. A. Werner

Abstract: The neutron is electrically neutral, but its substructure consists of charged quarks so it may have an internal charge distribution. In fact it is known to have a negative mean square charge radius (MSCR), the second moment of the radial charge density. In other words the neutron has a positive core and negative skin. In the first Born approximation the neutron MSCR can be simply related to the neutron-electron scattering length b_ne. In the past this important quantity has been extracted from the energy dependence of the total transmission cross-section of neutrons on high-Z targets, a very difficult and complicated process. A few years ago S.A. Werner proposed a novel approach to measuring b_ne from the neutron's dynamical phase shift in a perfect crystal close to the Bragg condition. We are conducting an experiment based on this method at the NIST neutron interferometer which may lead to a five-fold improvement in precision of b_ne and hence the neutron MSCR.

Submitted 14 September, 2005; originally announced September 2005.

Comments: 5 pages, 2 figures

arXiv:quant-ph/0508182 [pdf, ps, other]

doi 10.1016/j.physb.2006.05.207

Inertia of Intrinsic Spin

Authors: Bahram Mashhoon, Helmut Kaiser

Abstract: The state of a particle in space and time is characterized by its mass and spin, which therefore determine the inertial properties of the particle. The coupling of intrinsic spin with rotation is examined and the corresponding inertial effects of intrinsic spin are studied. An experiment to measure directly the spin-rotation coupling via neutron interferometry is analyzed in detail.

Submitted 18 November, 2005; v1 submitted 24 August, 2005; originally announced August 2005.

Comments: 3 pages, 1 figure, contribution to Festschrift honoring Samuel A. Werner; v2: slightly expanded version accepted for publication in Proc. Int. Conf. Neutron Scattering 2005 (scheduled for publication in the regular edition of Physica B, July 2006)

Journal ref: Physica B385 (2006) 1381-1383

arXiv:nucl-ex/0306012 [pdf, ps, other]

doi 10.1103/PhysRevC.67.044005

Precision neutron interferometric measurements and updated evaluations of the n-p and n-d coherent neutron scattering lengths

Authors: K. Schoen, D. L. Jacobson, M. Arif, P. R. Huffman, T. C. Black, W. M. Snow, S. K. Lamoreaux, H. Kaiser, S. A. Werner

Abstract: We have performed high-precision measurements of the coherent neutron scattering lengths of gas phase molecular hydrogen and deuterium using neutron interferometry. After correcting for molecular binding and multiple scattering from the molecule, we find b_{np} = (-3.7384 +/- 0.0020) fm and b_{nd} = (6.6649 +/- 0.0040) fm. Our results are in agreement with the world average of previous measurements, b_{np} = (-3.7410 +/- 0.0010) fm and b_{nd} = (6.6727 +/- 0.0045) fm. The new world averages for the n-p and n-d coherent scattering lengths, including our new results, are b_{np} = (-3.7405 +/- 0.0009) fm and b_{nd} = (6.6683 +/- 0.0030) fm. We compare b_{nd} with the calculations of the doublet and quartet scattering lengths of several nucleon-nucleon potential models and show that almost all known calculations are in disagreement with the precisely measured linear combination corresponding to the coherent scattering length. Combining the world data on b_{nd} with the modern high-precision theoretical calculations of the quartet n-d scattering lengths recently summarized by Friar et al., we deduce a new value for the doublet scattering length of ^{2}a_{nd} = [0.645 +/- 0.003(expt) +/- 0.007(theory)] fm. This value is a factor of 4 more precise than the previously accepted value of ^{2}a_{nd} = [0.65 +/- 0.04(expt)] fm. The current state of knowledge of scattering lengths in the related p-d system, ideas for improving by a factor of 5 the accuracy of the b_{np} and b_{nd} measurements using neutron interferometry, and possibilities for further improvement of our knowledge of the coherent neutron scattering lengths of 3H, 3He, and 4He are discussed.

Submitted 10 June, 2003; originally announced June 2003.

Comments: 22 pages, 19 figures

Journal ref: Physical Review C 67, 044005 (2003)

arXiv:nucl-ex/0305019 [pdf, ps, other]

doi 10.1103/PhysRevLett.90.192502

Precision neutron interferometric measurement of the nd coherent neutron scattering length and consequences for models of three-nucleon forces

Authors: T. C. Black, P. R. Huffman, D. L. Jacobson, W. M. Snow, K. Schoen, M. Arif, H. Kaiser, S. K. Lamoreaux, S. A. Werner

Abstract: We have performed the first high precision measurement of the coherent neutron scattering length of deuterium in a pure sample using neutron interferometry. We find b_nd = (6.665 +/- 0.004) fm in agreement with the world average of previous measurements using different techniques, b_nd = (6.6730 +/- 0.0045) fm. We compare the new world average for the nd coherent scattering length b_nd = (6.669 +/- 0.003) fm to calculations of the doublet and quartet scattering lengths from several modern nucleon-nucleon potential models with three-nucleon force (3NF) additions and show that almost all theories are in serious disagreement with experiment. This comparison is a more stringent test of the models than past comparisons with the less precisely-determined nuclear doublet scattering length of a_nd = (0.65 +/- 0.04) fm.

Submitted 20 May, 2003; originally announced May 2003.

Comments: 4 pages, 4 figures

Journal ref: Phys.Rev.Lett. 90 (2003) 192502

arXiv:cond-mat/0008244 [pdf, ps, other]

doi 10.1103/PhysRevB.62.9784

Orientation of Vortices in a Superconducting Thin-Film: Quantitative Comparison of Spin-Polarized Neutron Reflectivity and Magnetization

Authors: S. -W. Han, J. Farmer, H. Kaiser, P. F. Miceli, I. V. Roshchin, L. H. Greene

Abstract: We present a quantitative comparison of the magnetization measured by spin-polarized neutron reflectivity (SPNR) and DC magnetometry on a 1370 Å-thick Nb superconducting film. As a function of magnetic field applied in the film plane, SPNR exhibits reversible behavior whereas the DC magnetization shows substantial hysteresis. The difference between these measurements is attributed to a rotation of vortex magnetic field out of the film plane as the applied field is reduced. Since SPNR measures only the magnetization parallel to the film plane whereas DC magnetization is strongly influenced by the perpendicular component of magnetization when there is a slight sample tilt, combining the two techniques allows one to distinguish two components of magnetization in a thin film.

Submitted 17 August, 2000; v1 submitted 16 August, 2000; originally announced August 2000.

Comments: 12 pages, 8 figures, It will be printed in PRB, Oct. 2000

Journal ref: Phys. Rev. B 62, 9784 (2000)

Showing 1–48 of 48 results for author: Kaiser, H