Search | arXiv e-print repository

arXiv:2010.10930 [pdf, ps, other]

Towards Distributed Software Resilience in Asynchronous Many-Task Programming Models

Authors: Nikunj Gupta, Jackson R. Mayo, Adrian S. Lemoine, Hartmut Kaiser

Abstract: Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will likely increase. Therefore, designing our applications to be resilient is a critical concern in order to retain the reliability of results while meeting the constraints on power budgets. In this paper, we discuss software resilience in AMTs at both local and distributed scale. We choose HPX to prototype our resiliency designs. We implement two resiliency APIs that we expose to the application developers, namely task replication and task replay. Task replication repeats a task n-times and executes them asynchronously. Task replay reschedules a task up to n-times until a valid output is returned. Furthermore, we expose algorithm based fault tolerance (ABFT) using user provided predicates (e.g., checksums) to validate the returned results. We benchmark the resiliency scheme for both synthetic and real world applications at local and distributed scale and show that most of the added execution time arises from the replay, replication or data movement of the tasks and not the boilerplate code added to achieve resilience.

Submitted 19 October, 2020; originally announced October 2020.

Comments: arXiv admin note: text overlap with arXiv:2004.07203

Report number: SAND2020-11278 C

arXiv:2005.05910 [pdf, other]

doi 10.1016/j.parco.2018.07.006

DMR API: Improving cluster productivity by turning applications into malleable

Authors: Sergio Iserte, Rafael Mayo, Enrique S. Quintana-Orti, Vicenc Beltran, Antonio J. Peña

Abstract: Adaptive workloads can change on-the-fly the configuration of their jobs, in terms of number of processes. In order to carry out these job reconfigurations, we have designed a methodology which enables a job to communicate with the resource manager and, through the runtime, to change its number of MPI ranks. The collaboration between both the workload manager (aware of the queue of jobs and the resource allocation) and the parallel runtime (able to transparently handle the processes and the program data) is crucial for our throughput-aware malleability methodology. Hence, when a job triggers a reconfiguration, the resource manager will check the cluster status and return an action: an expansion, if there are spare resources; a shrink, if queued jobs can be initiated; or none, if no change can improve the global productivity. In this paper, we describe the internals of our framework and how it is capable of reducing the global workload completion time along with providing a smarter usage of the underlying resources. For this purpose, we present a thorough study of the adaptive workloads processing by showing the detailed behavior of our framework in representative experiments and the low overhead that our reconfiguration involves.

Submitted 28 May, 2020; v1 submitted 12 May, 2020; originally announced May 2020.

Journal ref: S. Iserte, R. Mayo, E. S. Quintana-Orti, V. Beltran, and A. J. Peña, "DMR API: Improving cluster productivity by turning applications into malleable", Parallel Computing, Elsevier, vol. 78, pp. 54-66, Oct. 2018

arXiv:2004.07203 [pdf, other]

Implementing Software Resiliency in HPX for Extreme Scale Computing

Authors: Nikunj Gupta, Jackson R. Mayo, Adrian S. Lemoine, Hartmut Kaiser

Abstract: Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will invariably increase. Therefore, designing our applications to be resilient is a critical concern in order to retain the reliability of results while meeting the constraints on power budgets. In this paper, we implement software resilience in HPX, an Asynchronous Many-Task Runtime system. We implement two resiliency APIs that we expose to the application developers, namely task replication and task replay. Task replication repeats a task n-times and executes them asynchronously. Task replay will reschedule a task up to n-times until a valid output is returned. Furthermore, we introduce an API that allows the application to verify the returned result with a user provided predicate. We test the APIs with both artificial workloads and a dataflow based stencil application. We demonstrate that only minor overheads are incurred when utilizing these resiliency features for work loads where the task size is greater than 200 μs. We also show that most of the added execution time arises from the replay or replication of the tasks themselves and not by the implementation of the APIs.

Submitted 15 April, 2020; originally announced April 2020.

Comments: 7 pages, 5 figures

Report number: SAND2020-3975 R

arXiv:1507.05129 [pdf, ps, other]

Performance and Energy Optimization of Matrix Multiplication on Asymmetric big.LITTLE Processors

Authors: Sandra Catalán, Francisco D. Igual, Rafael Mayo, Luis Piñuel, Enrique S. Quintana-Ortí, Rafael Rodríguez-Sánchez

Abstract: Asymmetric processors have emerged as an appealing technology for severely energy-constrained environments, especially in the mobile market where heterogeneity in applications is mainstream. In addition, given the growing interest on ultra low-power architectures for high performance computing, this type of platforms are also being investigated in the road towards the implementation of energy-efficient high-performance scientific applications. In this paper, we propose a first step towards a complete implementation of the BLAS interface adapted to asymmetric ARM big.LITTLE processors, analyzing the trade-offs between performance and energy efficiency when compared to existing homogeneous (symmetric) multi-threaded BLAS implementations. Our experimental results reveal important gains in performance while maintaining the energy efficiency of homogeneous solutions by efficiently exploiting all the resources of the asymmetric processor.

Submitted 17 July, 2015; originally announced July 2015.

Comments: Presented at HiPEAC 2015, Amsterdam. Foundation of the Asymmetric BLIS implementation

arXiv:1506.08988 [pdf, other]

Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

Authors: Sandra Catalán, Francisco D. Igual, Rafael Mayo, Rafael Rodríguez-Sánchez, Enrique S. Quintana-Ortí

Abstract: Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest for low-power high performance computing, this type of architectures is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric-static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, expose that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to its architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency.

Submitted 30 June, 2015; originally announced June 2015.

arXiv:1502.06564 [pdf]

Challenges and characterization of a Biological system on Grid by means of the PhyloGrid application

Authors: Raul Isea, Esther Montes, Antonio J. Rubio-Montero, Rafael Mayo

Abstract: In this work we present a new application that is being developed. PhyloGrid is able to perform large-scale phylogenetic calculations as those that have been made for estimating the phylogeny of all the sequences already stored in the public NCBI database. The further analysis has been focused on checking the origin of the HIV-1 disease by means of a huge number of sequences that sum up to 2900 taxa. Such a study has been able to be done by the implementation of a workflow in Taverna.

Submitted 5 December, 2014; originally announced February 2015.

Comments: 8 pages, 3 figures, appears in Proceedings of the First EELA-2 Conference, 2009

arXiv:1107.3792 [pdf, other]

doi 10.1103/PhysRevLett.107.108701

Influence and Dynamic Behavior in Random Boolean Networks

Authors: C. Seshadhri, Yevgeniy Vorobeychik, Jackson R. Mayo, Robert C. Armstrong, Joseph R. Ruthruff

Abstract: We present a rigorous mathematical framework for analyzing dynamics of a broad class of Boolean network models. We use this framework to provide the first formal proof of many of the standard critical transition results in Boolean network analysis, and offer analogous characterizations for novel classes of random Boolean networks. We precisely connect the short-run dynamic behavior of a Boolean network to the average influence of the transfer functions. We show that some of the assumptions traditionally made in the more common mean-field analysis of Boolean networks do not hold in general. For example, we offer some evidence that imbalance, or expected internal inhomogeneity, of transfer functions is a crucial feature that tends to drive quiescent behavior far more strongly than previously observed.

Submitted 19 July, 2011; originally announced July 2011.

Comments: To appear as a Letter in Physical Review Letters. 8 pages, 4 figures

Journal ref: Phys. Rev. Lett. 107, 108701 (2011)

arXiv:1012.3956 [pdf]

Advances in the Biomedical Applications of the EELA Project

Authors: Vicente Hernández, Ignacio Blanquer, Gabriel Aparicio, Raul Isea, Juan Luis Chavés, Álvaro Hernández, Henry Ricardo Mora, Manuel Fernández, Alicia Acero, Esther Montes, Rafael Mayo

Abstract: In the last years an increasing demand for Grid Infrastructures has resulted in several international collaborations. This is the case of the EELA Project, which has brought together collaborating groups of Latin America and Europe. One year ago we presented this e-infrastructure used, among others, by the Biomedical groups for the studies of oncological analysis, neglected diseases, sequence alignments and computation phylogenetics. After this period, the achieved advances are summarised in this paper.

Submitted 17 December, 2010; originally announced December 2010.

Comments: 5 pages

Journal ref: Proceedings of the NETTAB Conference (2007). Vol. 7, pp. 145-156

arXiv:1012.3953 [pdf]

PhyloGrid: a development for a workflow in Phylogeny

Authors: Esther Montes, Raul Isea, Rafael Mayo

Abstract: In this work we present the development of a workflow based on Taverna which is going to be implemented for calculations in Phylogeny by means of the MrBayes tool. It has a friendly interface developed with the Gridsphere framework. The user is able to define the parameters for doing the Bayesian calculation, determine the model of evolution, check the accuracy of the results in the intermediate stages as well as do a multiple alignment of the sequences previously to the final result. To do this, no knowledge from his/her side about the computational procedure is required.

Submitted 17 December, 2010; originally announced December 2010.

Comments: 6 pages, ISBN: 978-84-9745-288-5

Journal ref: Iberian Grid Infrastructure Conf. Proceeding (2008) Vol. 2, pp. 378-387

Showing 1–9 of 9 results for author: Mayo, R