Search | arXiv e-print repository

Tightening I/O Lower Bounds through the Hourglass Dependency Pattern

Authors: Lionel Eyraud-Dubois, Guillaume Iooss, Julien Langou, Fabrice Rastello

Abstract: When designing an algorithm, one cares about arithmetic/computational complexity, but data movement (I/O) complexity plays an increasingly important role that highly impacts performance and energy consumption. For a given algorithm and a given I/O model, scheduling strategies such as loop tiling can reduce the required I/O down to a limit, called the I/O complexity, inherent to the algorithm itsel… ▽ More When designing an algorithm, one cares about arithmetic/computational complexity, but data movement (I/O) complexity plays an increasingly important role that highly impacts performance and energy consumption. For a given algorithm and a given I/O model, scheduling strategies such as loop tiling can reduce the required I/O down to a limit, called the I/O complexity, inherent to the algorithm itself. The objective of I/O complexity analysis is to compute, for a given program, its minimal I/O requirement among all valid schedules. We consider a sequential execution model with two memories, an infinite one, and a small one of size S on which the computations retrieve and produce data. The I/O is the number of reads and writes between the two memories. We identify a common "hourglass pattern" in the dependency graphs of several common linear algebra kernels. Using the properties of this pattern, we mathematically prove tighter lower bounds on their I/O complexity, which improves the previous state-of-the-art bound by a parametric ratio. This proof was integrated inside the IOLB automatic lower bound derivation tool. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Journal ref: 36th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '24), Jun 2024, Nantes, France

arXiv:2202.10435 [pdf, ps, other]

Survey on Large Scale Neural Network Training

Authors: Julia Gusak, Daria Cherniuk, Alena Shilova, Alexander Katrutsa, Daniel Bershatsky, Xunyi Zhao, Lionel Eyraud-Dubois, Oleg Shlyazhko, Denis Dimitrov, Ivan Oseledets, Olivier Beaumont

Abstract: Modern Deep Neural Networks (DNNs) require significant memory to store weight, activations, and other intermediate tensors during training. Hence, many models do not fit one GPU device or can be trained using only a small per-GPU batch size. This survey provides a systematic overview of the approaches that enable more efficient DNNs training. We analyze techniques that save memory and make good us… ▽ More Modern Deep Neural Networks (DNNs) require significant memory to store weight, activations, and other intermediate tensors during training. Hence, many models do not fit one GPU device or can be trained using only a small per-GPU batch size. This survey provides a systematic overview of the approaches that enable more efficient DNNs training. We analyze techniques that save memory and make good use of computation and communication resources on architectures with a single or several GPUs. We summarize the main categories of strategies and compare strategies within and across categories. Along with approaches proposed in the literature, we discuss available implementations. △ Less

Submitted 21 February, 2022; originally announced February 2022.

arXiv:2202.10217 [pdf, ps, other]

I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels

Authors: Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Vérité, Julien Langou

Abstract: In this paper, we consider two fundamental symmetric kernels in linear algebra: the Cholesky factorization and the symmetric rank-$k$ update (SYRK), with the classical three nested loops algorithms for these kernels. In addition, we consider a machine model with a fast memory of size $S$ and an unbounded slow memory. In this model, all computations must be performed on operands in fast memory, and… ▽ More In this paper, we consider two fundamental symmetric kernels in linear algebra: the Cholesky factorization and the symmetric rank-$k$ update (SYRK), with the classical three nested loops algorithms for these kernels. In addition, we consider a machine model with a fast memory of size $S$ and an unbounded slow memory. In this model, all computations must be performed on operands in fast memory, and the goal is to minimize the amount of communication between slow and fast memories. As the set of computations is fixed by the choice of the algorithm, only the ordering of the computations (the schedule) directly influences the volume of communications.We prove lower bounds of $\frac{1}{3\sqrt{2}}\frac{N^3}{\sqrt{S}}$ for the communication volume of the Cholesky factorization of an $N\times N$ symmetric positive definite matrix, and of $\frac{1}{\sqrt{2}}\frac{N^2M}{\sqrt{S}}$ for the SYRK computation of $\mat{A}\cdot\transpose{\mat{A}}$, where $\mathbf{A}$ is an $N\times M$ matrix. Both bounds improve the best known lower bounds from the literature by a factor $\sqrt{2}$.In addition, we present two out-of-core, sequential algorithms with matching communication volume: \TBS for SYRK, with a volume of $\frac{1}{\sqrt{2}}\frac{N^2M}{\sqrt{S}} + \bigo{NM\log N}$, and \LBC for Cholesky, with a volume of $\frac{1}{3\sqrt{2}}\frac{N^3}{\sqrt{S}} + \bigo{N^{5/2}}$. Both algorithms improve over the best known algorithms from the literature by a factor $\sqrt{2}$, and prove that the leading terms in our lower bounds cannot be improved further. This work shows that the operational intensity of symmetric kernels like SYRK or Cholesky is intrinsically higher (by a factor $\sqrt{2}$) than that of corresponding non-symmetric kernels (GEMM and LU factorization). △ Less

Submitted 21 February, 2022; originally announced February 2022.

arXiv:1911.13214 [pdf, other]

Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory

Authors: Julien Herrmann, Olivier Beaumont, Lionel Eyraud-Dubois, Julien Hermann, Alexis Joly, Alena Shilova

Abstract: This paper introduces a new activation checkpointing method which allows to significantly decrease memory usage when training Deep Neural Networks with the back-propagation algorithm. Similarly to checkpoint-ing techniques coming from the literature on Automatic Differentiation, it consists in dynamically selecting the forward activations that are saved during the training phase, and then automati… ▽ More This paper introduces a new activation checkpointing method which allows to significantly decrease memory usage when training Deep Neural Networks with the back-propagation algorithm. Similarly to checkpoint-ing techniques coming from the literature on Automatic Differentiation, it consists in dynamically selecting the forward activations that are saved during the training phase, and then automatically recomputing missing activations from those previously recorded. We propose an original computation model that combines two types of activation savings: either only storing the layer inputs, or recording the complete history of operations that produced the outputs (this uses more memory, but requires fewer recomputations in the backward phase), and we provide an algorithm to compute the optimal computation sequence for this model. This paper also describes a PyTorch implementation that processes the entire chain, dealing with any sequential DNN whose internal layers may be arbitrarily complex and automatically executing it according to the optimal checkpointing strategy computed given a memory limit. Through extensive experiments, we show that our implementation consistently outperforms existing checkpoint-ing approaches for a large class of networks, image sizes and batch sizes. △ Less

Submitted 27 November, 2019; originally announced November 2019.

arXiv:1909.11365 [pdf, other]

doi 10.1145/3387110

Scheduling on Two Types of Resources: a Survey

Authors: Olivier Beaumont, Louis-claude Canon, Lionel Eyraud-Dubois, Giorgio Lucarelli, Loris Marchal, Clément Mommessin, Bertrand Simon, Denis Trystram

Abstract: The evolution in the design of modern parallel platforms leads to revisit the scheduling jobs on distributed heterogeneous resources. The goal of this survey is to present the main existing algorithms, to classify them based on their underlying principles and to propose unified implementations to enable their fair comparison, both in terms of running time and quality of schedules, on a large set o… ▽ More The evolution in the design of modern parallel platforms leads to revisit the scheduling jobs on distributed heterogeneous resources. The goal of this survey is to present the main existing algorithms, to classify them based on their underlying principles and to propose unified implementations to enable their fair comparison, both in terms of running time and quality of schedules, on a large set of common benchmarks that we made available for the community. Beyond this comparison, our goal is also to understand the main difficulties that heterogeneity conveys and the shared principles that guide the design of efficient algorithms. △ Less

Submitted 30 July, 2020; v1 submitted 25 September, 2019; originally announced September 2019.

Journal ref: ACM Computing Survey, Vol. 53, No. 3, 2020

arXiv:1904.06825 [pdf, other]

doi 10.1145/3337821.3337921

Performance Models for Data Transfers: A Case Study with Molecular Chemistry Kernels

Authors: Suraj Kumar, Lionel Eyraud-Dubois, Sriram Krishnamoorthy

Abstract: With increasing complexity of hardwares, systems with different memory nodes are ubiquitous in High Performance Computing (HPC). It is paramount to develop strategies to overlap the data transfers between memory nodes with computations in order to exploit the full potential of these systems. In this article, we consider the problem of deciding the order of data transfers between two memory nodes f… ▽ More With increasing complexity of hardwares, systems with different memory nodes are ubiquitous in High Performance Computing (HPC). It is paramount to develop strategies to overlap the data transfers between memory nodes with computations in order to exploit the full potential of these systems. In this article, we consider the problem of deciding the order of data transfers between two memory nodes for a set of independent tasks with the objective to minimize the makespan. We prove that with limited memory capacity, obtaining the optimal order of data transfers is a NP-complete problem. We propose several heuristics for this problem and provide details about their favorable situations. We present an analysis of our heuristics on traces, obtained by running 2 molecular chemistry kernels, namely, Hartree-Fock (HF) and Coupled Cluster Single Double (CCSD) on 10 nodes of an HPC system. Our results show that some of our heuristics achieve significant overlap for moderate memory capacities and are very close to the lower bound of makespan. △ Less

Submitted 14 April, 2019; originally announced April 2019.

Journal ref: https://www.hpcs.cs.tsukuba.ac.jp/icpp2019/

arXiv:1410.0329 [pdf, other]

Parallel scheduling of task trees with limited memory

Authors: Lionel Eyraud-Dubois, Loris Marchal, Oliver Sinnen, Frédéric Vivien

Abstract: This paper investigates the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents some large data. A task can only be executed if all input and output data fit into memory, and a data can only be removed from memory after the completion of the task that uses it as an input data. Such trees arise, for instance, in the multifrontal method of sparse matri… ▽ More This paper investigates the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents some large data. A task can only be executed if all input and output data fit into memory, and a data can only be removed from memory after the completion of the task that uses it as an input data. Such trees arise, for instance, in the multifrontal method of sparse matrix factorization. The peak memory needed for the processing of the entire tree depends on the execution order of the tasks. With one processor the objective of the tree traversal is to minimize the required memory. This problem was well studied and optimal polynomial algorithms were proposed. Here, we extend the problem by considering multiple processors, which is of obvious interest in the application area of matrix factorization. With multiple processors comes the additional objective to minimize the time needed to traverse the tree, i.e., to minimize the makespan. Not surprisingly, this problem proves to be much harder than the sequential one. We study the computational complexity of this problem and provide inapproximability results even for unit weight trees. We design a series of practical heuristics achieving different trade-offs between the minimization of peak memory usage and makespan. Some of these heuristics are able to process a tree while kee** the memory usage under a given memory limit. The different heuristics are evaluated in an extensive experimental evaluation using realistic trees. △ Less

Submitted 1 October, 2014; originally announced October 2014.

Comments: arXiv admin note: substantial text overlap with arXiv:1210.2580

Report number: RR-8606

Journal ref: N° RR-8606 (2014)

arXiv:1310.5255 [pdf, other]

Efficient and Robust Allocation Algorithms in Clouds under Memory Constraints

Authors: Olivier Beaumont, Lionel Eyraud-Dubois, Paul Renaud-Goud

Abstract: We consider robust resource allocation of services in Clouds. More specifically, we consider the case of a large public or private Cloud platform that runs a relatively small set of large and independent services. These services are characterized by their demand along several dimensions (CPU, memory,...) and by their quality of service requirements, that have been defined through an SLA in the cas… ▽ More We consider robust resource allocation of services in Clouds. More specifically, we consider the case of a large public or private Cloud platform that runs a relatively small set of large and independent services. These services are characterized by their demand along several dimensions (CPU, memory,...) and by their quality of service requirements, that have been defined through an SLA in the case of a public Cloud or fixed by the administrator in the case of a private Cloud. This quality of service defines the required robustness of the service, by setting an upper limit on the probability that the provider fails to allocate the required quantity of resources. This maximum probability of failure can be transparently turned into a pair (price,penalty). Failures can indeed hit the platform, and resilience is provided through service replication. Our contribution is two-fold. First, we propose a resource allocation strategy whose complexity is logarithmic in the number of resources, what makes it very efficient for large platforms. Second, we propose an efficient algorithm based on rare events detection techniques in order to estimate the robustness of an allocation, a problem that has been proven to be P-complete. Finally, we provide an analysis of the proposed strategy through an extensive set of simulations, both in terms of the overall number of allocated resources and in terms of time necessary to compute the allocation. △ Less

Submitted 19 October, 2013; originally announced October 2013.

arXiv:cs/0702076 [pdf, ps, other]

A First Step Towards Automatically Building Network Representations

Authors: Lionel Eyraud-Dubois, Arnaud Legrand, Martin Quinson, Frédéric Vivien

Abstract: To fully harness Grids, users or middlewares must have some knowledge on the topology of the platform interconnection network. As such knowledge is usually not available, one must uses tools which automatically build a topological network model through some measurements. In this article, we define a methodology to assess the quality of these network model building tools, and we apply this method… ▽ More To fully harness Grids, users or middlewares must have some knowledge on the topology of the platform interconnection network. As such knowledge is usually not available, one must uses tools which automatically build a topological network model through some measurements. In this article, we define a methodology to assess the quality of these network model building tools, and we apply this methodology to representatives of the main classes of model builders and to two new algorithms. We show that none of the main existing techniques build models that enable to accurately predict the running time of simple application kernels for actual platforms. However some of the new algorithms we propose give excellent results in a wide range of situations. △ Less

Submitted 28 June, 2007; v1 submitted 13 February, 2007; originally announced February 2007.

Showing 1–9 of 9 results for author: Eyraud-Dubois, L