-
A Newcomer In The PGAS World -- UPC++ vs UPC: A Comparative Study
Authors:
Jérémie Lagravière,
Johannes Langguth,
Martina Prugger,
Phuong H. Ha,
Xing Cai
Abstract:
A newcomer in the Partitioned Global Address Space (PGAS) 'world' has arrived in its version 1.0: Unified Parallel C++ (UPC++). UPC++ targets distributed data structures where communication is irregular or fine-grained. The key abstractions are global pointers, asynchronous programming via RPC, futures and promises. UPC++ API for moving non-contiguous data and handling memories with different opti…
▽ More
A newcomer in the Partitioned Global Address Space (PGAS) 'world' has arrived in its version 1.0: Unified Parallel C++ (UPC++). UPC++ targets distributed data structures where communication is irregular or fine-grained. The key abstractions are global pointers, asynchronous programming via RPC, futures and promises. UPC++ API for moving non-contiguous data and handling memories with different optimal access methods resemble those used in modern C++. In this study we provide two kernels implemented in UPC++: a sparse-matrix vector multiplication (SpMV) as part of a Partial-Differential Equation solver, and an implementation of the Heat Equation on a 2D-domain. Code listings of these two kernels are available in the article in order to show the differences in programming style between UPC and UPC++. We provide a performance comparison between UPC and UPC++ using single-node, multi-node hardware and many-core hardware (Intel Xeon Phi Knight's Landing).
△ Less
Submitted 6 February, 2021;
originally announced February 2021.
-
Performance optimization and modeling of fine-grained irregular communication in UPC
Authors:
Jérémie Lagravière,
Johannes Langguth,
Martina Prugger,
Lukas Einkemmer,
Phuong H. Ha,
Xing Cai
Abstract:
The UPC programming language offers parallelism via logically partitioned shared memory, which typically spans physically disjoint memory sub-systems. One convenient feature of UPC is its ability to automatically execute between-thread data movement, such that the entire content of a shared data array appears to be freely accessible by all the threads. The programmer friendliness, however, can com…
▽ More
The UPC programming language offers parallelism via logically partitioned shared memory, which typically spans physically disjoint memory sub-systems. One convenient feature of UPC is its ability to automatically execute between-thread data movement, such that the entire content of a shared data array appears to be freely accessible by all the threads. The programmer friendliness, however, can come at the cost of substantial performance penalties. This is especially true when indirectly indexing the elements of a shared array, for which the induced between-thread data communication can be irregular and have a fine-grained pattern. In this paper we study performance enhancement strategies specifically targeting such fine-grained irregular communication in UPC. Starting from explicit thread privatization, continuing with block-wise communication, and arriving at message condensing and consolidation, we obtained considerable performance improvement of UPC programs that originally require fine-grained irregular communication. Besides the performance enhancement strategies, the main contribution of the present paper is to propose performance models for the different scenarios, in form of quantifiable formulas that hinge on the actual volumes of various data movements plus a small number of easily obtainable hardware characteristic parameters. These performance models help to verify the enhancements obtained, while also providing insightful predictions of similar parallel implementations, not limited to UPC, that also involve between-thread or between-process irregular communication. As a further validation, we also apply our performance modeling methodology and hardware characteristic parameters to an existing UPC code for solving a 2D heat equation on a uniform mesh.
△ Less
Submitted 29 December, 2019;
originally announced December 2019.
-
Evaluation of the Partitioned Global Address Space (PGAS) model for an inviscid Euler solver
Authors:
Martina Prugger,
Lukas Einkemmer,
Alexander Ostermann
Abstract:
In this paper we evaluate the performance of Unified Parallel C (which implements the partitioned global address space programming model) using a numerical method that is widely used in fluid dynamics. In order to evaluate the incremental approach to parallelization (which is possible with UPC) and its performance characteristics, we implement different levels of optimization of the UPC code and c…
▽ More
In this paper we evaluate the performance of Unified Parallel C (which implements the partitioned global address space programming model) using a numerical method that is widely used in fluid dynamics. In order to evaluate the incremental approach to parallelization (which is possible with UPC) and its performance characteristics, we implement different levels of optimization of the UPC code and compare it with an MPI parallelization on four different clusters of the Austrian HPC infrastructure (LEO3, LEO3E, VSC2, VSC3) and on an Intel Xeon Phi. We find that UPC is significantly easier to develop in compared to MPI and that the performance achieved is comparable to MPI in most situations. The obtained results show worse performance (on VSC2), competitive performance (on LEO3, LEO3E and VSC3), and superior performance (on the Intel Xeon Phi).
△ Less
Submitted 12 November, 2016; v1 submitted 14 January, 2016;
originally announced January 2016.