
Lessons Learned on MPI+Threads Communication

Authors: Rohit Zambre, Aparna Chandramowlishwaran

Abstract: Hybrid MPI+threads programming is gaining prominence, but, in practice, applications perform slower with it compared to the MPI everywhere model. The most critical challenge to the parallel efficiency of MPI+threads applications is slow MPI_THREAD_MULTIPLE performance. MPI libraries have recently made significant strides on this front, but to exploit their capabilities, users must expose the communication parallelism in their MPI+threads applications. Recent studies show that MPI 4.0 provides users with new performance-oriented options to do so, but our evaluation of these new mechanisms shows that they pose several challenges. An alternative design is MPI Endpoints. In this paper, we present a comparison of the different designs from the perspective of MPI's end-users: domain scientists and application developers. We evaluate the mechanisms on metrics beyond performance such as usability, scope, and portability. Based on the lessons learned, we make a case for a future direction.

Submitted 28 June, 2022; originally announced June 2022.

Comments: In Proceedings of the International Conference for High-Performance Computing, Networking, Storage and Analysis (SC), Dallas, TX, USA, November 2022

ACM Class: C.2.4

arXiv:2005.00263

doi 10.1145/3392717.3392773

How I Learned to Stop Worrying About User-Visible Endpoints and Love MPI

Authors: Rohit Zambre, Aparna Chandramowlishwaran, Pavan Balaji

Abstract: MPI+threads is gaining prominence as an alternative to the traditional MPI everywhere model in order to better handle the disproportionate increase in the number of cores compared with other on-node resources. However, the communication performance of MPI+threads can be 100x slower than that of MPI everywhere. Both MPI users and developers are to blame for this slowdown. Typically, MPI users do not expose logical communication parallelism. Consequently, MPI libraries use conservative approaches, such as a global critical section, to maintain MPI's ordering constraints for MPI+threads, thus serializing access to parallel network resources and hurting performance. To enhance MPI+threads' communication performance, researchers have proposed MPI Endpoints as a user-visible extension to MPI-3.1. MPI Endpoints allows a single process to create multiple MPI ranks within a communicator. This could allow each thread to have a dedicated communication path to the network and improve performance. The onus of mapping threads to endpoints, however, would then be on domain scientists. We play the role of devil's advocate and question the need for user-visible endpoints. We certainly agree that dedicated communication channels are critical. To what extent, however, can we hide these channels inside the MPI library without modifying the MPI standard and thus unburden the user? More important, what functionality would we lose through such abstraction? This paper answers these questions through a new MPI-3.1 implementation that uses virtual communication interfaces (VCIs). VCIs abstract underlying network contexts. When users expose parallelism through existing MPI mechanisms, the MPI library maps that parallelism to the VCIs, relieving domain scientists from endpoints. We identify cases where VCIs perform as well as user-visible endpoints, as well as cases where such abstraction hurts performance.

Submitted 1 May, 2020; originally announced May 2020.

Comments: In Proceedings of the 34th ACM International Conference on Supercomputing (ICS), Barcelona, Spain, June 2020

ACM Class: C.2.4

arXiv:2002.03850

doi 10.1109/HiPC.2016.013

Parallel Performance-Energy Predictive Modeling of Browsers: Case Study of Servo

Authors: Rohit Zambre, Lars Bergstrom, Laleh Aghababaie Beni, Aparna Chandramowlishwaran

Abstract: Mozilla Research is developing Servo, a parallel web browser engine, to exploit the benefits of parallelism and concurrency in the web rendering pipeline. Parallelization results in improved performance for pinterest.com but not for google.com. This is because the workload of a browser is dependent on the web page it is rendering. In many cases, the overhead of creating, deleting, and coordinating parallel work outweighs any of its benefits. In this paper, we model the relationship between web page primitives and a web browser's parallel performance using supervised learning. We discover a feature space that is representative of the parallelism available in a web page and characterize it using seven key features. Additionally, we consider energy usage trade-offs for different levels of performance improvements using automated labeling algorithms. Such a model allows us to predict the degree of parallelism available in a web page and decide whether or not to render a web page in parallel. This modeling is critical for improving the browser's performance and minimizing its energy usage. We evaluate our model by using Servo's layout stage as a case study. Experiments on a quad-core Intel Ivy Bridge (i7-3615QM) laptop show that we can improve performance and energy usage by up to 94.52% and 46.32% respectively on the 535 web pages considered in this study. Looking forward, we identify opportunities to apply this model to other stages of a browser's architecture as well as other performance- and energy-critical devices.

Submitted 6 February, 2020; originally announced February 2020.

Comments: In Proceedings of the 23rd IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), Hyderabad, India, December 2016

arXiv:2002.02563

doi 10.1145/3337821

Breaking Band: A Breakdown of High-performance Communication

Authors: Rohit Zambre, Megan Grodowitz, Aparna Chandramowlishwaran, Pavel Shamis

Abstract: The critical path of internode communication on large-scale systems is composed of multiple components. When a supercomputing application initiates the transfer of a message using a high-level communication routine such as an MPI_Send, the payload of the message traverses multiple software stacks, the I/O subsystem on both the host and target nodes, and network components such as the switch. In this paper, we analyze where, why, and how much time is spent on the critical path of communication by modeling the overall injection overhead and end-to-end latency of a system. We focus our analysis on the performance of small messages since fine-grained communication is becoming increasingly important with the growing trend of an increasing number of cores per node. The analytical models present an accurate and detailed breakdown of time spent in internode communication. We validate the models on Arm ThunderX2-based servers connected with Mellanox InfiniBand. This is the first work of this kind on Arm. Alongside our breakdown, we describe the methodology to measure the time spent in each component so that readers with access to precise CPU timers and a PCIe analyzer can measure breakdowns on systems of their interest. Such a breakdown is crucial for software developers, system architects, and researchers to guide their optimization efforts. As researchers ourselves, we use the breakdown to simulate the impacts and discuss the likelihoods of a set of optimizations that target the bottlenecks in today's high-performance communication.

Submitted 6 February, 2020; originally announced February 2020.

Comments: In Proceedings of the 48th ACM International Conference on Parallel Processing (ICPP), Kyoto, Japan, August 2019

ACM Class: C.2.4

Journal ref: In Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. 2019

arXiv:2002.02509

doi 10.1109/PADSW.2018.8645059

Scalable Communication Endpoints for MPI+Threads Applications

Authors: Rohit Zambre, Aparna Chandramowlishwaran, Pavan Balaji

Abstract: Hybrid MPI+threads programming is gaining prominence as an alternative to the traditional "MPI everywhere" model to better handle the disproportionate increase in the number of cores compared with other on-node resources. Current implementations of these two models represent the two extreme cases of communication resource sharing in modern MPI implementations. In the MPI-everywhere model, each MPI process has a dedicated set of communication resources (also known as endpoints), which is ideal for performance but is resource wasteful. With MPI+threads, current MPI implementations share a single communication endpoint for all threads, which is ideal for resource usage but is hurtful for performance. In this paper, we explore the tradeoff space between performance and communication resource usage in MPI+threads environments. We first demonstrate the two extreme cases: one where all threads share a single communication endpoint and another where each thread gets its own dedicated communication endpoint (similar to the MPI-everywhere model), and showcase the inefficiencies in both these cases. Next, we perform a thorough analysis of the different levels of resource sharing in the context of Mellanox InfiniBand. Using the lessons learned from this analysis, we design an improved resource-sharing model to produce scalable communication endpoints that can achieve the same performance as with dedicated communication resources per thread but using just a third of the resources.

Submitted 6 February, 2020; originally announced February 2020.

Comments: In Proceedings of the 24th IEEE International Conference on Parallel and Distributed Systems (ICPADS), Sentosa, Singapore, December 2018. Best Poster Award

Journal ref: In 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS), pp. 803-812. IEEE, 2018
