Search | arXiv e-print repository

Distributed astrophysics simulations using Octo-Tiger with RISC-V CPUs using HPX and Kokkos

Authors: Patrick Diehl, Gregor Daiß, Steven R. Brandt, Alireza Kheirkhahan, Srinivas Yadav Singanaboina, Dominic Marcello, Chris Taylor, John Leidel, Hartmut Kaiser

Abstract: In recent years, interest in RISC-V computing architectures has moved from academic to mainstream, especially in the field of High Performance Computing where energy limitations are increasingly a point of concern. The results presented in this paper are part of a longer-term evaluation of RISC-V's viability for HPC applications. In this work, we use the Octo-Tiger multi-physics, multi-scale, 3D adaptive mesh refinement astrophysics application as the basis for our analysis. We report on our experience in porting this modern C++ code (which is built upon several open-source libraries such as HPX and Kokkos) to RISC-V. We also compare the application's performance, scalability, and power consumption on RISC-V to an A64FX system.

Submitted 10 May, 2024; originally announced July 2024.

arXiv:2405.00016 [pdf, ps, other]

doi 10.1007/978-3-031-61763-8_17

HPX with Spack and Singularity Containers: Evaluating Overheads for HPX/Kokkos using an astrophysics application

Authors: Patrick Diehl, Steven R. Brandt, Gregor Daiß, Hartmut Kaiser

Abstract: Cloud computing for high performance computing resources is an emerging topic. This service is of interest to researchers who care about reproducible computing, for software packages with complex installations, and for companies or researchers who need the compute resources only occasionally or do not want to run and maintain a supercomputer on their own. The connection between HPC and containers is exemplified by the fact that Microsoft Azure's Eagle cloud service machine is number three on the November 2023 Top 500 list. For cloud services, the HPC application and dependencies are installed in containers, e.g. Docker, Singularity, or something else, and these containers are executed on the physical hardware. Although containerization leverages the existing Linux kernel and should not impose overheads on the computation, there is the possibility that machine-specific optimizations might be lost, particularly machine-specific installs of commonly used packages. In this paper, we will use an astrophysics application using HPX-Kokkos and measure overheads on homogeneous resources, e.g. Supercomputer Fugaku, using CPUs only and on heterogeneous resources, e.g. LSU's hybrid CPU and GPU system. We will report on challenges in compiling, running, and using the containers as well as performance differences.

Submitted 7 May, 2024; v1 submitted 11 February, 2024; originally announced May 2024.

arXiv:2309.06530 [pdf, other]

doi 10.1145/3624062.3624230

Evaluating HPX and Kokkos on RISC-V using an Astrophysics Application Octo-Tiger

Authors: Patrick Diehl, Gregor Daiss, Steven R. Brandt, Alireza Kheirkhahan, Hartmut Kaiser, Christopher Taylor, John Leidel

Abstract: In recent years, computers based on the RISC-V architecture have raised broad interest in the high-performance computing (HPC) community. As the RISC-V community develops the core instruction set architecture (ISA) along with ISA extensions, the HPC community has been actively ensuring HPC applications and environments are supported. In this context, assessing the performance of asynchronous many-task runtime systems (AMT) is essential. In this paper, we describe our experience with porting of a full 3D adaptive mesh-refinement, multi-scale, multi-model, and multi-physics application, Octo-Tiger, that is based on the HPX AMT, and we explore its performance characteristics on different RISC-V systems. Considering the (limited) capabilities of the RISC-V test systems we used, Octo-Tiger already shows promising results and good scaling. We, however, expect that exceptional hardware support based on dedicated ISA extensions (such as single-cycle context switches, extended atomic operations, and direct support for HPX's global address space) would allow for even better performance results.

Submitted 17 August, 2023; originally announced September 2023.

arXiv:2308.03161 [pdf, other]

Precise Benchmarking of Explainable AI Attribution Methods

Authors: Rafaël Brandt, Daan Raatjens, Georgi Gaydadjiev

Abstract: The rationale behind a deep learning model's output is often difficult for humans to understand. EXplainable AI (XAI) aims at solving this by developing methods that improve interpretability and explainability of machine learning models. Reliable evaluation metrics are needed to assess and compare different XAI methods. We propose a novel evaluation approach for benchmarking state-of-the-art XAI attribution methods. Our proposal consists of a synthetic classification model accompanied by its derived ground truth explanations allowing high precision representation of input nodes' contributions. We also propose new high-fidelity metrics to quantify the difference between explanations of the investigated XAI method and those derived from the synthetic model. Our metrics allow assessment of explanations in terms of precision and recall separately. Also, we propose metrics to independently evaluate negative or positive contributions of inputs. Our proposal provides deeper insights into XAI methods' output. We investigate our proposal by constructing a synthetic convolutional image classification model and benchmarking several widely used XAI attribution methods using our evaluation approach. We compare our results with established prior XAI evaluation metrics. By deriving the ground truth directly from the constructed model in our method, we ensure the absence of bias, e.g., subjective bias or bias stemming from the training set. Our experimental results provide novel insights into the performance of Guided-Backprop and Smoothgrad XAI methods that are widely in use. Both have good precision and recall scores among positively contributing pixels (0.7, 0.76 and 0.7, 0.77, respectively), but poor precision scores among negatively contributing pixels (0.44, 0.61 and 0.47, 0.75, resp.). The recall scores in the latter case remain close. We show that our metrics are among the fastest in terms of execution time.

Submitted 6 August, 2023; originally announced August 2023.
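
Note: the abstract above reports precision and recall separately for positively and negatively contributing pixels, but it does not give the metric definitions. As a rough illustration only, the sketch below uses plain set-based precision and recall over the sign of each attribution value. This is an assumption, not the paper's metric, and the function name `sign_precision_recall` is hypothetical.

```python
def sign_precision_recall(truth, predicted, sign=+1):
    """Set-based precision/recall over inputs whose attribution has the given sign.

    truth, predicted: equal-length sequences of attribution values.
    sign: +1 to score positive contributions, -1 to score negative ones.
    NOTE: illustrative definition only; not taken from the paper.
    """
    if len(truth) != len(predicted):
        raise ValueError("attribution maps must have the same length")
    # Indices that the ground truth / the XAI method mark with the requested sign.
    truth_set = {i for i, v in enumerate(truth) if v * sign > 0}
    pred_set = {i for i, v in enumerate(predicted) if v * sign > 0}
    hits = len(truth_set & pred_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(truth_set) if truth_set else 0.0
    return precision, recall


# Toy flattened attribution maps (made-up values for illustration).
ground_truth = [1.0, 0.5, 0.0, -0.3, -0.8, 0.0]
method_output = [0.9, 0.0, 0.2, -0.1, 0.0, 0.0]

positive_scores = sign_precision_recall(ground_truth, method_output, sign=+1)
negative_scores = sign_precision_recall(ground_truth, method_output, sign=-1)
```

Scoring the two signs independently makes it possible for a method to look good on positive contributions while doing poorly on negative ones, which is the kind of asymmetry the abstract reports.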

arXiv:2307.01117 [pdf, other]

doi 10.1007/978-3-031-48803-0_11

Benchmarking the Parallel 1D Heat Equation Solver in Chapel, Charm++, C++, HPX, Go, Julia, Python, Rust, Swift, and Java

Authors: Patrick Diehl, Steven R. Brandt, Max Morris, Nikunj Gupta, Hartmut Kaiser

Abstract: Many scientific high performance codes that simulate e.g. black holes, coastal waves, climate and weather, etc. rely on block-structured meshes and use finite differencing methods to iteratively solve the appropriate systems of differential equations. In this paper we investigate implementations of an extremely simple simulation of this type using various programming systems and languages. We focus on a shared memory, parallelized algorithm that simulates a 1D heat diffusion using asynchronous queues for the ghost zone exchange. We discuss the advantages of the various platforms and explore the performance of this model code on different computing architectures: Intel, AMD, and ARM64FX. As a result, Python was the slowest of the set we compared. Java, Go, Swift, and Julia were the intermediate performers. The higher performing platforms were C++, Rust, Chapel, Charm++, and HPX.

Submitted 10 July, 2023; v1 submitted 18 May, 2023; originally announced July 2023.
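
Note: the abstract above describes a shared-memory 1D heat diffusion solver that exchanges ghost zones through asynchronous queues. The sketch below shows one way such a scheme could look in Python. It is not the authors' benchmark code; the function names, the periodic boundary, and the explicit update with `r = alpha * dt / dx**2` are assumptions made for illustration.

```python
import threading
from queue import Queue


def serial_step(u, r):
    """One explicit finite-difference step on a periodic 1D grid (reference)."""
    n = len(u)
    return [u[i] + r * (u[i - 1] - 2 * u[i] + u[(i + 1) % n]) for i in range(n)]


def _worker(block, r, steps, send_left, send_right, recv_left, recv_right, out, idx):
    """Advance one block, exchanging boundary cells with neighbours via queues."""
    for _ in range(steps):
        # Publish this block's edge cells; unbounded queues never block on put().
        send_left.put(block[0])
        send_right.put(block[-1])
        # Wait for the neighbours' edge cells, which become our ghost zones.
        ghost_left = recv_left.get()
        ghost_right = recv_right.get()
        padded = [ghost_left] + block + [ghost_right]
        block = [
            padded[i] + r * (padded[i - 1] - 2 * padded[i] + padded[i + 1])
            for i in range(1, len(padded) - 1)
        ]
    out[idx] = block


def parallel_heat(u, r, steps, nblocks):
    """Run `steps` updates with one thread per block; r <= 0.5 for stability."""
    if len(u) % nblocks != 0:
        raise ValueError("grid size must be divisible by the number of blocks")
    size = len(u) // nblocks
    # to_left[k] carries block k's first cell to block k-1;
    # to_right[k] carries block k's last cell to block k+1 (periodic wrap).
    to_left = [Queue() for _ in range(nblocks)]
    to_right = [Queue() for _ in range(nblocks)]
    out = [None] * nblocks
    threads = []
    for k in range(nblocks):
        t = threading.Thread(
            target=_worker,
            args=(
                u[k * size:(k + 1) * size], r, steps,
                to_left[k], to_right[k],
                to_right[(k - 1) % nblocks], to_left[(k + 1) % nblocks],
                out, k,
            ),
        )
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return [x for block in out for x in block]


# Usage: a single hot cell diffusing on a 16-cell periodic grid.
u0 = [0.0] * 16
u0[8] = 1.0
result = parallel_heat(u0, 0.25, 10, 4)
```

Because each queue is FIFO and carries exactly one value per step, a block can run at most one step ahead of its neighbours without any global barrier.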

arXiv:2302.07191 [pdf, ps, other]

doi 10.1007/978-3-031-32316-4_3

Shared memory parallelism in Modern C++ and HPX

Authors: Patrick Diehl, Steven R. Brandt, Hartmut Kaiser

Abstract: Parallel programming remains a daunting challenge, from the struggle to express a parallel algorithm without cluttering the underlying synchronous logic, to describing which devices to employ in a calculation, to correctness. Over the years, numerous solutions have arisen, many of them requiring new programming languages, extensions to programming languages, or the addition of pragmas. Support for these various tools and extensions is available to a varying degree. In recent years, the C++ standards committee has worked to refine the language features and libraries needed to support parallel programming on a single computational node. Eventually, all major vendors and compilers will provide robust and performant implementations of these standards. Until then, the HPX library and runtime provides cutting edge implementations of the standards, as well as proposed standards and extensions. Because of these advances, it is now possible to write high performance parallel code without custom extensions to C++. We provide an overview of modern parallel programming in C++, describing the language and library features, and providing brief examples of how to use them.

Submitted 9 August, 2023; v1 submitted 16 January, 2023; originally announced February 2023.

Comments: Extended paper for the special issue

arXiv:2208.00109 [pdf, other]

Traveler: Navigating Task Parallel Traces for Performance Analysis

Authors: Sayef Azad Sakin, Alex Bigelow, R. Tohid, Connor Scully-Allison, Carlos Scheidegger, Steven R. Brandt, Christopher Taylor, Kevin A. Huck, Hartmut Kaiser, Katherine E. Isaacs

Abstract: Understanding the behavior of software in execution is a key step in identifying and fixing performance issues. This is especially important in high performance computing contexts where even minor performance tweaks can translate into large savings in terms of computational resource use. To aid performance analysis, developers may collect an execution trace - a chronological log of program activity during execution. As traces represent the full history, developers can discover a wide array of possibly previously unknown performance issues, making them an important artifact for exploratory performance analysis. However, interactive trace visualization is difficult due to issues of data size and complexity of meaning. Traces represent nanosecond-level events across many parallel processes, meaning the collected data is often large and difficult to explore. The rise of asynchronous task parallel programming paradigms complicates the relation between events and their probable cause. To address these challenges, we conduct a continuing design study in collaboration with high performance computing researchers. We develop diverse and hierarchical ways to navigate and represent execution trace data in support of their trace analysis tasks. Through an iterative design process, we developed Traveler, an integrated visualization platform for task parallel traces. Traveler provides multiple linked interfaces to help navigate trace data from multiple contexts. We evaluate the utility of Traveler through feedback from users and a case study, finding that integrating multiple modes of navigation in our design supported performance analysis tasks and led to the discovery of previously unknown behavior in a distributed array library.

Submitted 3 September, 2022; v1 submitted 29 July, 2022; originally announced August 2022.

Comments: IEEE VIS 2022

arXiv:2010.04106 [pdf, other]

doi 10.1109/ESPM251964.2020.00007

Deploying a Task-based Runtime System on Raspberry Pi Clusters

Authors: Nikunj Gupta, Steve R. Brandt, Bibek Wagle, Nanmiao, Alireza Kheirkhahan, Patrick Diehl, Hartmut Kaiser, Felix W. Baumann

Abstract: Arm technology is becoming increasingly important in HPC. Recently, Fugaku, an Arm-based system, was awarded the number one place in the Top500 list. Raspberry Pis provide an inexpensive platform to become familiar with this architecture. However, Pis can also be useful on their own. Here we describe our efforts to configure and benchmark the use of a Raspberry Pi cluster with the HPX/Phylanx platform (normally intended for use with HPC applications) and document the lessons we learned. First, we highlight the required changes in the configuration of the Pi to gain performance. Second, we explore how limited memory bandwidth limits the use of all cores in our shared memory benchmarks. Third, we evaluate whether low network bandwidth affects distributed performance. Fourth, we discuss the power consumption and the resulting trade-off in cost of operation and performance.

Submitted 9 April, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

arXiv:2006.15373 [pdf, other]

MTStereo 2.0: improved accuracy of stereo depth estimation with Max-trees

Authors: Rafael Brandt, Nicola Strisciuglio, Nicolai Petkov

Abstract: Efficient yet accurate extraction of depth from stereo image pairs is required by systems with low power resources, such as robotics and embedded systems. State-of-the-art stereo matching methods based on convolutional neural networks require intensive computations on GPUs and are difficult to deploy on embedded systems. In this paper, we propose a stereo matching method, called MTStereo 2.0, for limited-resource systems that require efficient and accurate depth estimation. It is based on a Max-tree hierarchical representation of image pairs, which we use to identify matching regions along image scan-lines. The method includes a cost function that considers similarity of region contextual information based on the Max-trees and a disparity border preserving cost aggregation approach. MTStereo 2.0 improves on its predecessor MTStereo 1.0 as it a) deploys a more robust cost function, b) performs more thorough detection of incorrect matches, c) computes disparity maps with pixel-level rather than node-level precision. MTStereo provides accurate sparse and semi-dense depth estimation and does not require intensive GPU computations like methods based on CNNs. Thus it can run on embedded and robotics devices with low-power requirements. We tested the proposed approach on several benchmark data sets, namely KITTI 2015, Driving, FlyingThings3D, Middlebury 2014, Monkaa and the TrimBot2020 garden data sets, and achieved competitive accuracy and efficiency. The code is available at https://github.com/rbrandt1/MaxTreeS.

Submitted 27 June, 2020; originally announced June 2020.

arXiv:1910.09902 [pdf]

Theory-Software Translation: Research Challenges and Future Directions

Authors: Caroline Jay, Robert Haines, Daniel S. Katz, Jeffrey Carver, James C. Phillips, Anshu Dubey, Sandra Gesing, Matthew Turk, Hui Wan, Hubertus van Dam, James Howison, Vitali Morozov, Steven R. Brandt

Abstract: The Theory-Software Translation Workshop, held in New Orleans in February 2019, explored in depth the process of both instantiating theory in software - for example, implementing a mathematical model in code as part of a simulation - and using the outputs of software - such as the behavior of a simulation - to advance knowledge. As computation within research is now ubiquitous, the workshop provided a timely opportunity to reflect on the particular challenges of research software engineering - the process of developing and maintaining software for scientific discovery. In addition to the general challenges common to all software development projects, research software additionally must represent, manipulate, and provide data for complex theoretical constructs. Ensuring this process is robust is essential to maintaining the integrity of the science resulting from it, and the workshop highlighted a number of areas where the current approach to research software engineering would benefit from an evidence base that could be used to inform best practice. The workshop brought together expert research software engineers and academics to discuss the challenges of Theory-Software Translation over a two-day period. This report provides an overview of the workshop activities, and a synthesis of the discussion that was recorded. The body of the report presents a thematic analysis of the challenges of Theory-Software Translation as identified by workshop participants, summarises these into a set of research areas, and provides recommendations for the future direction of this work.

Submitted 22 October, 2019; originally announced October 2019.

arXiv:1904.08500 [pdf, other]

Machine Vision for Natural Gas Methane Emissions Detection Using an Infrared Camera

Authors: Jingfan Wang, Lyne P. Tchapmi, Arvind P. Ravikumara, Mike McGuire, Clay S. Bell, Daniel Zimmerle, Silvio Savarese, Adam R. Brandt

Abstract: It is crucial to reduce natural gas methane emissions, which can potentially offset the climate benefits of replacing coal with gas. Optical gas imaging (OGI) is a widely-used method to detect methane leaks, but is labor-intensive and cannot provide leak detection results without operators' judgment. In this paper, we develop a computer vision approach to OGI-based leak detection using convolutional neural networks (CNN) trained on methane leak images to enable automatic detection. First, we collect ~1 M frames of labeled video of methane leaks from different leaking equipment for building the CNN model, covering a wide range of leak sizes (5.3-2051.6 gCH4/h) and imaging distances (4.6-15.6 m). Second, we examine different background subtraction methods to extract the methane plume in the foreground. Third, we then test three CNN model variants, collectively called GasNet, to detect plumes in videos taken at other pieces of leaking equipment. We assess the ability of GasNet to perform leak detection by comparing it to a baseline method that uses an optical-flow based change detection algorithm. We explore the sensitivity of results to the CNN structure, with a moderate-complexity variant performing best across distances. We find that the detection accuracy can reach as high as 99%, and the overall detection accuracy can exceed 95% for a case across all leak sizes and imaging distances. Binary detection accuracy exceeds 97% for large leaks (~710 gCH4/h) imaged closely (~5-7 m). At closer imaging distances (~5-10 m), CNN-based models have greater than 94% accuracy across all leak sizes. At farthest distances (~13-16 m), performance degrades rapidly, but it can achieve above 95% accuracy to detect large leaks (>950 gCH4/h). The GasNet-based computer vision approach could be deployed in OGI surveys to allow automatic vigilance of methane leak detection with high detection accuracy in the real world.

Submitted 1 April, 2019; originally announced April 2019.

Comments: This paper was submitted to Applied Energy

arXiv:1604.05550 [pdf, other]

Joint Coordinated Precoding and Discrete Rate Selection in Multicell MIMO Networks

Authors: Rasmus Brandt, Mats Bengtsson

Abstract: Many practical wireless communications systems select their transmit rate from a finite set of modulation and coding schemes, which correspond to a set of discrete rates. In this paper, we therefore formulate a joint coordinated precoding and discrete rate selection problem for multiple-input multiple-output (MIMO) multicell networks. Compared to the common assumption of using the continuous Shannon rates as the user utilities, explicitly accounting for the discrete rates more accurately models practical wireless communication systems. The optimization problem that we formulate is combinatorial and non-convex, however, and is thus hard to solve. We therefore rewrite the problem using a discontinuous rate function, which we then bound using its concave envelope in some domain. Based on block coordinate descent, we provide a convergent resource allocation algorithm which can be implemented in a semi-distributed fashion. Numerical performance evaluation shows performance gains when the discrete rates are optimized using our model, as compared to the traditional methods which use the continuous Shannon rates as the user utilities.

Submitted 19 April, 2016; originally announced April 2016.

Comments: Submitted to IEEE Signal Processing Letters

arXiv:1602.08273 [pdf, other]

doi 10.1109/LSP.2016.2536159

Globally Optimal Base Station Clustering in Interference Alignment-Based Multicell Networks

Authors: Rasmus Brandt, Rami Mochaourab, Mats Bengtsson

Abstract: Coordinated precoding based on interference alignment is a promising technique for improving the throughputs in future wireless multicell networks. In small networks, all base stations can typically jointly coordinate their precoding. In large networks however, base station clustering is necessary due to the otherwise overwhelmingly high channel state information (CSI) acquisition overhead. In this work, we provide a branch and bound algorithm for finding the globally optimal base station clustering. The algorithm is mainly intended for benchmarking existing suboptimal clustering schemes. We propose a general model for the user throughputs, which only depends on the long-term CSI statistics. The model assumes intracluster interference alignment and is able to account for the CSI acquisition overhead. By enumerating a search tree using a best-first search and pruning sub-trees in which the optimal solution provably cannot be, the proposed method converges to the optimal solution. The pruning is done using specifically derived bounds, which exploit some assumed structure in the throughput model. It is empirically shown that the proposed method has an average complexity which is orders of magnitude lower than that of exhaustive search.

Submitted 26 February, 2016; originally announced February 2016.

Comments: Accepted in IEEE Signal Processing Letters. (c) 2016 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE

Journal ref: IEEE Signal Processing Letters, Year: 2016, Volume: 23, Issue: 4, Pages: 512 - 516

arXiv:1602.07859 [pdf, other]

Distributed Long-Term Base Station Clustering in Cellular Networks using Coalition Formation

Authors: Rasmus Brandt, Rami Mochaourab, Mats Bengtsson

Abstract: Interference alignment (IA) is a promising technique for interference mitigation in multicell networks due to its ability to completely cancel the intercell interference through linear precoding and receive filtering. In small networks, the amount of required channel state information (CSI) is modest and IA is therefore typically applied jointly over all base stations. In large networks, however, where the channel coherence time is short in comparison to the time needed to obtain the required CSI, base station clustering must be applied. We model such clustered multicell networks as a set of coalitions, where CSI acquisition and IA precoding are performed independently within each coalition. We develop a long-term throughput model which includes both CSI acquisition overhead and the level of interference mitigation ability as a function of the coalition structure. Given the throughput model, we formulate a coalitional game where the involved base stations are the rational players. Allowing for individual deviations by the players, we formulate a distributed coalition formation algorithm with low complexity and low communication overhead that leads to an individually stable coalition structure. The dynamic clustering is performed using only long-term CSI, but we also provide a robust short-term precoding algorithm which accounts for the intercoalition interference when spectrum sharing is applied between coalitions. Numerical simulations show that the distributed coalition formation is generally able to reach long-term sum throughputs within 10% of the global optimum.

Submitted 25 February, 2016; originally announced February 2016.

Comments: Submitted to IEEE Transactions on Signal and Information Processing over Networks

arXiv:1602.02296 [pdf, other]

doi 10.5334/jors.118

Report on the Third Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE3)

Authors: Daniel S. Katz, Sou-Cheng T. Choi, Kyle E. Niemeyer, James Hetherington, Frank Löffler, Dan Gunter, Ray Idaszak, Steven R. Brandt, Mark A. Miller, Sandra Gesing, Nick D. Jones, Nic Weber, Suresh Marru, Gabrielle Allen, Birgit Penzenstadler, Colin C. Venters, Ethan Davis, Lorraine Hwang, Ilian Todorov, Abani Patra, Miguel de Val-Borro

Abstract: This report records and discusses the Third Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE3). The report includes a description of the keynote presentation of the workshop, which served as an overview of sustainable scientific software. It also summarizes a set of lightning talks in which speakers highlighted to-the-point lessons and challenges pertaining to sustaining scientific software. The final and main contribution of the report is a summary of the discussions, future steps, and future organization for a set of self-organized working groups on topics including developing pathways to funding scientific software; constructing useful common metrics for crediting software stakeholders; identifying principles for sustainable software engineering design; reaching out to research software organizations around the world; and building communities for software sustainability. For each group, we include a point of contact and a landing page that can be used by those who want to join that group's future activities. The main challenge left by the workshop is to see if the groups will execute these activities that they have scheduled, and how the WSSSPE community can encourage this to happen.

Submitted 6 February, 2016; originally announced February 2016.

arXiv:1511.04126 [pdf, other]

doi 10.1109/ACSSC.2015.7421307

Interference Alignment-Aided Base Station Clustering using Coalition Formation

Authors: Rasmus Brandt, Rami Mochaourab, Mats Bengtsson

Abstract: Base station clustering is necessary in large interference networks, where the channel state information (CSI) acquisition overhead otherwise would be overwhelming. In this paper, we propose a novel long-term throughput model for the clustered users which addresses the balance between interference mitigation capability and CSI acquisition overhead. The model only depends on statistical CSI, thus enabling long-term clustering. Based on notions from coalitional game theory, we propose a low-complexity distributed clustering method. The algorithm converges in a couple of iterations, and only requires limited communication between base stations. Numerical simulations show the viability of the proposed approach.

Submitted 12 November, 2015; originally announced November 2015.

arXiv:1504.06794 [pdf, other]

Overhead-Aware Distributed CSI Selection in the MIMO Interference Channel

Authors: Rami Mochaourab, Rasmus Brandt, Hadi Ghauch, Mats Bengtsson

Abstract: We consider a MIMO interference channel in which the transmitters and receivers operate in frequency-division duplex mode. In this setting, interference management through coordinated transceiver design necessitates channel state information at the transmitters (CSI-T). The acquisition of CSI-T is done through feedback from the receivers, which entails a loss in degrees of freedom, due to training and feedback. This loss increases with the amount of CSI-T. In this work, after formulating an overhead model for CSI acquisition at the transmitters, we propose a distributed mechanism to find for each transmitter a subset of the complete CSI, which is used to perform interference management. The mechanism is based on many-to-many stable matching. We prove the existence of a stable matching and exploit an algorithm to reach it. Simulation results show performance improvement compared to full and minimal CSI-T.

Submitted 6 July, 2015; v1 submitted 26 April, 2015; originally announced April 2015.

Comments: 5 pages, 2 figures. To appear at EUSIPCO 2015, Special Session on Algorithms for Distributed Coordination and Learning

arXiv:1410.1764 [pdf, other]

Chemora: A PDE Solving Framework for Modern HPC Architectures

Authors: Erik Schnetter, Marek Blazewicz, Steven R. Brandt, David M. Koppelman, Frank Löffler

Abstract: Modern HPC architectures consist of heterogeneous multi-core, many-node systems with deep memory hierarchies. Modern applications employ ever more advanced discretisation methods to study multi-physics problems. Developing such applications that explore cutting-edge physics on cutting-edge HPC systems has become a complex task that requires significant HPC knowledge and experience. Unfortunately, this combined knowledge is currently out of reach for all but a few groups of application developers. Chemora is a framework for solving systems of Partial Differential Equations (PDEs) that targets modern HPC architectures. Chemora is based on Cactus, which sees prominent usage in the computational relativistic astrophysics community. In Chemora, PDEs are expressed either in a high-level LaTeX-like language or in Mathematica. Discretisation stencils are defined separately from equations, and can include Finite Differences, Discontinuous Galerkin Finite Elements (DGFE), Adaptive Mesh Refinement (AMR), and multi-block systems. We use Chemora in the Einstein Toolkit to implement the Einstein Equations on CPUs and on accelerators, and study astrophysical systems such as black hole binaries, neutron stars, and core-collapse supernovae.

Submitted 3 October, 2014; originally announced October 2014.

arXiv:1406.7756 [pdf, ps, other]

Optimal Scheduling for Interference Mitigation by Range Information

Authors: Vijaya Yajnanarayana, Klas E. G. Magnusson, Rasmus Brandt, Satyam Dwivedi, Peter Händel

Abstract: Multiple access scheduling decides how the channel is shared among the nodes in the network. Typical scheduling algorithms aim at increasing the channel utilization and thereby the throughput of the network. This paper describes several algorithms for generating an optimal schedule in terms of channel utilization for multiple access by utilizing range information in a fully connected network. We also provide a detailed analysis of the proposed algorithms' performance in terms of their complexity, convergence, and the effect of non-idealities in the network. The performance of the proposed schemes is compared with non-aided methods to quantify the benefits of using the range information in the communication. The proposed methods have several favorable properties for scalable systems. We show that the proposed techniques yield better channel utilization and throughput as the number of nodes in the network increases. We provide simulation results in support of this claim. The proposed methods indicate that the throughput can be increased on average by 3-10 times for typical network configurations.

Submitted 1 September, 2016; v1 submitted 30 June, 2014; originally announced June 2014.

arXiv:1309.1812 [pdf, other]

Cactus: Issues for Sustainable Simulation Software

Authors: Frank Löffler, Steven R. Brandt, Gabrielle Allen, Erik Schnetter

Abstract: The Cactus Framework is an open-source, modular, portable programming environment for the collaborative development and deployment of scientific applications using high-performance computing. Its roots reach back to 1996 at the National Center for Supercomputing Applications and the Albert Einstein Institute in Germany, where its development jumpstarted. Since then, the Cactus framework has witnessed major changes in hardware infrastructure as well as its own community. This paper describes its endurance through these past changes and, drawing upon lessons from its past, also discusses future

Submitted 15 September, 2013; v1 submitted 6 September, 2013; originally announced September 2013.

Comments: submitted to the Workshop on Sustainable Software for Science: Practice and Experiences 2013

arXiv:1307.6488 [pdf, other]

doi 10.3233/SPR-130360

From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

Authors: Marek Blazewicz, Ian Hinder, David M. Koppelman, Steven R. Brandt, Milosz Ciznicki, Michal Kierzynka, Frank Löffler, Erik Schnetter, Jian Tao

Abstract: Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient manner for complex applications, without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.

Submitted 24 July, 2013; originally announced July 2013.

Comments: 18 pages, 4 figures, accepted for publication in Scientific Programming

Report number: AEI-2013-227

arXiv:1201.2118 [pdf, other]

A Massive Data Parallel Computational Framework for Petascale/Exascale Hybrid Computer Systems

Authors: Marek Blazewicz, Steven R. Brandt, Peter Diener, David M. Koppelman, Krzysztof Kurowski, Frank Löffler, Erik Schnetter, Jian Tao

Abstract: Heterogeneous systems are becoming more common on High Performance Computing (HPC) systems. Even using tools like CUDA and OpenCL it is a non-trivial task to obtain optimal performance on the GPU. Approaches to simplifying this task include Merge (a library based framework for heterogeneous multi-core systems), Zippy (a framework for parallel execution of codes on multiple GPUs), BSGP (a new programming language for general purpose computation on the GPU) and CUDA-lite (an enhancement to CUDA that transforms code based on annotations). In addition, efforts are underway to improve compiler tools for automatic parallelization and optimization of affine loop nests for GPUs and for automatic translation of OpenMP parallelized codes to CUDA. In this paper we present an alternative approach: a new computational framework for the development of massively data parallel scientific codes applications suitable for use on such petascale/exascale hybrid systems built upon the highly scalable Cactus framework. As the first non-trivial demonstration of its usefulness, we successfully developed a new 3D CFD code that achieves improved performance.

Submitted 10 January, 2012; originally announced January 2012.

Comments: Parallel Computing 2011 (ParCo2011), 30 August -- 2 September 2011, Ghent, Belgium

Showing 1–22 of 22 results for author: Brandt, R