Search | arXiv e-print repository

Scheduling Trees of Malleable Tasks for Sparse Linear Algebra

Authors: Abdou Guermouche, Loris Marchal, Bertrand Simon, Frédéric Vivien

Abstract: Scientific workloads are often described as directed acyclic task graphs. In this paper, we focus on the multifrontal factorization of sparse matrices, whose task graph is structured as a tree of parallel tasks. Among the existing models for parallel tasks, the concept of malleable tasks is especially powerful as it allows each task to be processed on a time-varying number of processors. Following the model advocated by Prasanna and Musicus for matrix computations, we consider malleable tasks whose speedup is $p^\alpha$, where $p$ is the fractional share of processors on which a task executes, and $\alpha$ ($0 < \alpha \leq 1$) is a parameter which does not depend on the task. We first motivate the relevance of this model for our application with actual experiments on multicore platforms. Then, we study the optimal allocation proposed by Prasanna and Musicus for makespan minimization using optimal control theory. We largely simplify their proofs by resorting only to pure scheduling arguments. Building on the insight gained thanks to these new proofs, we extend the study to distributed multicore platforms. There, a task cannot be distributed among several distributed nodes. In such a distributed setting (homogeneous or heterogeneous), we prove the NP-completeness of the corresponding scheduling problem, and propose some approximation algorithms. We finally assess the relevance of our approach by simulations on realistic trees. We show that the average performance gain of our allocations with respect to existing solutions (that are thus unaware of the actual speedup functions) is up to 16% for $\alpha=0.9$ (the value observed in the real experiments).

Submitted 4 June, 2015; v1 submitted 27 October, 2014; originally announced October 2014.

Comments: Paper accepted for publication at EuroPar 2015
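
A minimal sketch of the $p^\alpha$ speedup model described in the abstract above. It assumes that a task of sequential length `work` running on a constant share `share` of the processors takes `work / share**alpha` time, and it computes the constant shares that make a set of independent tasks finish at the same instant. The function names are hypothetical, and this is an illustration of the model only, not the allocation algorithm of the paper.

```python
def processing_time(work, share, alpha):
    """Time to run a task of sequential length `work` on a share of
    processors `share`, under the assumed speedup share**alpha."""
    return work / share ** alpha


def equal_finish_shares(works, total, alpha):
    """Split `total` processors among independent tasks so that all of
    them finish simultaneously (shares proportional to work**(1/alpha))."""
    weights = [w ** (1.0 / alpha) for w in works]
    scale = total / sum(weights)
    return [scale * x for x in weights]


if __name__ == "__main__":
    works = [1.0, 2.0]
    shares = equal_finish_shares(works, total=5.0, alpha=0.5)
    for w, s in zip(works, shares):
        print(w, s, processing_time(w, s, 0.5))
```

Under these assumptions, the common finish time works out to $(\sum_i L_i^{1/\alpha})^\alpha / p^\alpha$ for task lengths $L_i$ and $p$ processors, which follows directly from setting all processing times equal.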

arXiv:1410.0329 [pdf, other]

Parallel scheduling of task trees with limited memory

Authors: Lionel Eyraud-Dubois, Loris Marchal, Oliver Sinnen, Frédéric Vivien

Abstract: This paper investigates the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents some large data. A task can only be executed if all input and output data fit into memory, and a data can only be removed from memory after the completion of the task that uses it as an input data. Such trees arise, for instance, in the multifrontal method of sparse matrix factorization. The peak memory needed for the processing of the entire tree depends on the execution order of the tasks. With one processor the objective of the tree traversal is to minimize the required memory. This problem was well studied and optimal polynomial algorithms were proposed. Here, we extend the problem by considering multiple processors, which is of obvious interest in the application area of matrix factorization. With multiple processors comes the additional objective to minimize the time needed to traverse the tree, i.e., to minimize the makespan. Not surprisingly, this problem proves to be much harder than the sequential one. We study the computational complexity of this problem and provide inapproximability results even for unit weight trees. We design a series of practical heuristics achieving different trade-offs between the minimization of peak memory usage and makespan. Some of these heuristics are able to process a tree while keeping the memory usage under a given memory limit. The different heuristics are evaluated in an extensive experimental evaluation using realistic trees.

Submitted 1 October, 2014; originally announced October 2014.

Comments: arXiv admin note: substantial text overlap with arXiv:1210.2580

Report number: RR-8606

Journal ref: N° RR-8606 (2014)

arXiv:1310.8486 [pdf, other]

On the Combination of Silent Error Detection and Checkpointing

Authors: Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, Frédéric Vivien, Dounia Zaidouni

Abstract: In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Contrary to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delays following a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time where nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.

Submitted 31 October, 2013; originally announced October 2013.

Comments: This work was accepted to be published in PRDC'13. Work supported by ANR Rescue

Report number: INRIA RR-8319

arXiv:1302.4558 [pdf, other]

Checkpointing strategies with prediction windows

Authors: Guillaume Aupy, Yves Robert, Frédéric Vivien, Dounia Zaidouni

Abstract: This paper deals with the impact of fault prediction techniques on checkpointing strategies. We suppose that the fault-prediction system provides prediction windows instead of exact predictions, which dramatically complicates the analysis of the checkpointing strategies. We propose a new approach based upon two periodic modes, a regular mode outside prediction windows, and a proactive mode inside prediction windows, whenever the size of these windows is large enough. We are able to compute the best period for any size of the prediction windows, thereby deriving the scheduling strategy that minimizes platform waste. In addition, the results of this analytical evaluation are nicely corroborated by a comprehensive set of simulations, which demonstrate the validity of the model and the accuracy of the approach.

Submitted 19 February, 2013; originally announced February 2013.

Comments: 35 pages, work supported by ANR Rescue. arXiv admin note: substantial text overlap with arXiv:1207.6936, arXiv:1302.3752

Report number: INRIA RR-8239

arXiv:1302.3752 [pdf, other]

doi 10.1016/j.jpdc.2013.10.010

Checkpointing algorithms and fault prediction

Authors: Guillaume Aupy, Yves Robert, Frédéric Vivien, Dounia Zaidouni

Abstract: This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical first-order analysis of Young and Daly in the presence of a fault prediction system, characterized by its recall and its precision. In this framework, we provide an optimal algorithm to decide when to take predictions into account, and we derive the optimal value of the checkpointing period. These results make it possible to analytically assess the key parameters that impact the performance of fault predictors at very large scale.

Submitted 3 December, 2013; v1 submitted 15 February, 2013; originally announced February 2013.

Comments: Supported in part by ANR Rescue. Published in Journal of Parallel and Distributed Computing. arXiv admin note: text overlap with arXiv:1207.6936

Report number: INRIA RR-8237

Journal ref: Journal of Parallel and Distributed Computing, Available online 7 November 2013, ISSN 0743-7315
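
For orientation, the classical first-order analysis mentioned in the abstract above is commonly summarized by Young's approximation of the checkpointing period, $T \approx \sqrt{2 C \mu}$, where $C$ is the checkpoint cost and $\mu$ the platform mean time between failures. The sketch below assumes that formula together with a simplified first-order waste $C/T + T/(2\mu)$ that ignores downtime and recovery costs. The function names are hypothetical, and the code does not model the prediction-aware results of the paper.

```python
import math


def young_period(checkpoint_cost, mtbf):
    """Young's first-order approximation of the checkpointing period."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period, checkpoint_cost, mtbf):
    """Simplified first-order waste: checkpoint overhead plus expected
    re-execution after a failure (downtime and recovery assumed zero)."""
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    c, mu = 50.0, 10000.0  # seconds; illustrative values only
    t = young_period(c, mu)
    print(t, first_order_waste(t, c, mu))
```

With this simplified waste, the period returned by `young_period` is the minimizer, since the derivative $-C/T^2 + 1/(2\mu)$ vanishes at $T = \sqrt{2 C \mu}$.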

arXiv:1210.2580 [pdf, ps, other]

Scheduling tree-shaped task graphs to minimize memory and makespan

Authors: Loris Marchal, Oliver Sinnen, Frédéric Vivien

Abstract: This paper investigates the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents a large IO file. A task can only be executed if all input and output files fit into memory, and a file can only be removed from memory after it has been consumed. Such trees arise, for instance, in the multifrontal method of sparse matrix factorization. The maximum amount of memory needed depends on the execution order of the tasks. With one processor the objective of the tree traversal is to minimize the required memory. This problem was well studied and optimal polynomial algorithms were proposed. Here, we extend the problem by considering multiple processors, which is of obvious interest in the application area of matrix factorization. With the multiple processors comes the additional objective to minimize the time needed to traverse the tree, i.e., to minimize the makespan. Not surprisingly, this problem proves to be much harder than the sequential one. We study the computational complexity of this problem and provide an inapproximability result even for unit weight trees. Several heuristics are proposed, each with a different optimization focus, and they are analyzed in an extensive experimental evaluation using realistic trees.

Submitted 9 October, 2012; originally announced October 2012.

arXiv:1207.6936 [pdf, other]

Impact of fault prediction on checkpointing strategies

Authors: Guillaume Aupy, Yves Robert, Frédéric Vivien, Dounia Zaidouni

Abstract: This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical analysis of Young and Daly in the presence of a fault prediction system, which is characterized by its recall and its precision, and which provides either exact or window-based time predictions. We succeed in deriving the optimal value of the checkpointing period (thereby minimizing the waste of resource usage due to checkpoint overhead) in all scenarios. These results make it possible to analytically assess the key parameters that impact the performance of fault predictors at very large scale. In addition, the results of this analytical evaluation are nicely corroborated by a comprehensive set of simulations, thereby demonstrating the validity of the model and the accuracy of the results.

Submitted 9 October, 2012; v1 submitted 30 July, 2012; originally announced July 2012.

Comments: 20 pages

Report number: INRIA Report 8023

arXiv:1106.4985 [pdf, other]

Dynamic Fractional Resource Scheduling vs. Batch Scheduling

Authors: Henri Casanova, Mark Stillwell, Frédéric Vivien

Abstract: We propose a novel job scheduling approach for homogeneous cluster computing platforms. Its key feature is the use of virtual machine technology to share fractional node resources in a precise and controlled manner. Other VM-based scheduling approaches have focused primarily on technical issues or on extensions to existing batch scheduling systems, while we take a more aggressive approach and seek to find heuristics that maximize an objective metric correlated with job performance. We derive absolute performance bounds and develop algorithms for the online, non-clairvoyant version of our scheduling problem. We further evaluate these algorithms in simulation against both synthetic and real-world HPC workloads and compare our algorithms to standard batch scheduling approaches. We find that our approach improves over batch scheduling by orders of magnitude in terms of job stretch, while leading to comparable or better resource utilization. Our results demonstrate that virtualization technology coupled with lightweight online scheduling strategies can afford dramatic improvements in performance for executing HPC workloads.

Submitted 24 June, 2011; originally announced June 2011.

Comments: N° RR-7659 (2011)

Report number: RR-7659

arXiv:1006.5376 [pdf, other]

Resource Allocation using Virtual Clusters

Authors: Mark Stillwell, David Schanzenbach, Frédéric Vivien, Henri Casanova

Abstract: In this report we demonstrate the potential utility of resource allocation management systems that use virtual machine technology for sharing parallel computing resources among competing jobs. We formalize the resource allocation problem with a number of underlying assumptions, determine its complexity, propose several heuristic algorithms to find near-optimal solutions, and evaluate these algorithms in simulation. We find that among our algorithms one is very efficient and also leads to the best resource allocations. We then describe how our approach can be made more general by removing several of the underlying assumptions.

Submitted 28 June, 2010; originally announced June 2010.

Comments: University of Hawai'i at Mānoa Department of Information and Computer Sciences Technical Report

Report number: ICS2008-09-01

arXiv:0706.4038 [pdf, ps, other]

Scheduling multiple divisible loads on a linear processor network

Authors: Matthieu Gallet, Yves Robert, Frédéric Vivien

Abstract: Min, Veeravalli, and Barlas have recently proposed strategies to minimize the overall execution time of one or several divisible loads on a heterogeneous linear network, using one or more installments. We show on a very simple example that their approach does not always produce a solution and that, when it does, the solution is often suboptimal. We also show how to find an optimal schedule for any instance, once the number of installments per load is given. Then, we formally state that any optimal schedule has an infinite number of installments under a linear cost model such as the one assumed in the original papers. Therefore, such a cost model cannot be used to design practical multi-installment strategies. Finally, through extensive simulations we confirmed that the best solution is always produced by the linear programming approach, while solutions of the original papers can be far away from the optimal.

Submitted 28 June, 2007; v1 submitted 27 June, 2007; originally announced June 2007.

arXiv:cs/0702076 [pdf, ps, other]

A First Step Towards Automatically Building Network Representations

Authors: Lionel Eyraud-Dubois, Arnaud Legrand, Martin Quinson, Frédéric Vivien

Abstract: To fully harness Grids, users or middlewares must have some knowledge of the topology of the platform interconnection network. As such knowledge is usually not available, one must use tools which automatically build a topological network model through some measurements. In this article, we define a methodology to assess the quality of these network model building tools, and we apply this methodology to representatives of the main classes of model builders and to two new algorithms. We show that none of the main existing techniques build models that make it possible to accurately predict the running time of simple application kernels for actual platforms. However, some of the new algorithms we propose give excellent results in a wide range of situations.

Submitted 28 June, 2007; v1 submitted 13 February, 2007; originally announced February 2007.

arXiv:cs/0702066 [pdf, ps, other]

Comments on "Design and performance evaluation of load distribution strategies for multiple loads on heterogeneous linear daisy chain networks"

Authors: Matthieu Gallet, Yves Robert, Frédéric Vivien

Abstract: Min, Veeravalli, and Barlas proposed strategies to minimize the overall execution time of one or several divisible loads on a heterogeneous linear network, using one or more installments. We show on a very simple example that the proposed approach does not always produce a solution and that, when it does, the solution is often suboptimal. We also show how to find an optimal schedule for any instance, once the number of installments per load is given. Finally, we formally prove that under a linear cost model, as in the original paper, an optimal schedule has an infinite number of installments. Such a cost model therefore cannot be used to design practical multi-installment strategies.

Submitted 10 February, 2007; originally announced February 2007.

arXiv:cs/0612036 [pdf, ps, other]

Revisiting Matrix Product on Master-Worker Platforms

Authors: Jack Dongarra, Jean-Francois Pineau, Yves Robert, Zhiao Shi, Frederic Vivien

Abstract: This paper is aimed at designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While matrix-product is well-understood for homogeneous 2D-arrays of processors (e.g., Cannon algorithm and ScaLAPACK outer product algorithm), there are three key hypotheses that render our work original and innovative: (1) Centralized data. We assume that all matrix files originate from, and must be returned to, the master. (2) Heterogeneous star-shaped platforms. We target fully heterogeneous platforms, where computational resources have different computing powers. (3) Limited memory. Because we investigate the parallelization of large problems, we cannot assume that full matrix panels can be stored in the worker memories and re-used for subsequent updates (as in ScaLAPACK). We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages), and we report a set of numerical experiments on various platforms at Ecole Normale Superieure de Lyon and the University of Tennessee. However, we point out that in this first version of the report, experiments are limited to homogeneous platforms.

Submitted 6 December, 2006; originally announced December 2006.

ACM Class: F.2.2

arXiv:cs/0610131 [pdf, ps, other]

Scheduling and data redistribution strategies on star platforms

Authors: Loris Marchal, Veronika Rehn, Yves Robert, Frédéric Vivien

Abstract: In this work we are interested in the problem of scheduling and redistributing data on master-slave platforms. We consider the case where the workers possess initial loads, some of which have to be redistributed in order to balance their completion times. We examine two different scenarios. The first model assumes that the data consists of independent and identical tasks. We prove the NP-completeness in the strong sense for the general case, and we present two optimal algorithms for special platform types. Furthermore, we propose three heuristics for the general case. Simulations consolidate the theoretical results. The second data model is based on Divisible Load Theory. This problem can be solved in polynomial time by a combination of linear programming and simple analytical manipulations.

Submitted 23 October, 2006; originally announced October 2006.

Showing 1–14 of 14 results for author: Vivien, F