-
pPython Performance Study
Authors:
Chansup Byun,
William Arcand,
David Bestor,
Bill Bergeron,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Hayden Jananthan,
Michael Jones,
Anna Klein,
Peter Michaleas,
Lauren Milechin,
Guillermo Morales,
Julie Mullen,
Andrew Prout,
Albert Reuther,
Antonio Rosa,
Siddharth Samsi,
Charles Yee,
Jeremy Kepner
Abstract:
pPython seeks to provide a parallel capability that provides good speed-up without sacrificing the ease of programming in Python by implementing partitioned global array semantics (PGAS) on top of a simple file-based messaging library (PythonMPI) in pure Python. pPython follows a SPMD (single program multiple data) model of computation. pPython runs on a single-node (e.g., a laptop) running Window…
▽ More
pPython seeks to provide a parallel capability that provides good speed-up without sacrificing the ease of programming in Python by implementing partitioned global array semantics (PGAS) on top of a simple file-based messaging library (PythonMPI) in pure Python. pPython follows a SPMD (single program multiple data) model of computation. pPython runs on a single-node (e.g., a laptop) running Windows, Linux, or MacOS operating systems or on any combination of heterogeneous systems that support Python, including on a cluster through a Slurm scheduler interface so that pPython can be executed in a massively parallel computing environment. It is interesting to see what performance pPython can achieve compared to the traditional socket-based MPI communication because of its unique file-based messaging implementation. In this paper, we present the point-to-point and collective communication performances of pPython and compare them with those obtained by using mpi4py with OpenMPI. For large messages, pPython demonstrates comparable performance as compared to mpi4py.
△ Less
Submitted 7 September, 2023;
originally announced September 2023.
-
pPython for Parallel Python Programming
Authors:
Chansup Byun,
William Arcand,
David Bestor,
Bill Bergeron,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Hayden Jananthan,
Michael Jones,
Kurt Keville,
Anna Klein,
Peter Michaleas,
Lauren Milechin,
Guillermo Morales,
Julie Mullen,
Andrew Prout,
Albert Reuther,
Antonio Rosa,
Siddharth Samsi,
Charles Yee,
Jeremy Kepner
Abstract:
pPython seeks to provide a parallel capability that provides good speed-up without sacrificing the ease of programming in Python by implementing partitioned global array semantics (PGAS) on top of a simple file-based messaging library (PythonMPI) in pure Python. The core data structure in pPython is a distributed numerical array whose distribution onto multiple processors is specified with a map c…
▽ More
pPython seeks to provide a parallel capability that provides good speed-up without sacrificing the ease of programming in Python by implementing partitioned global array semantics (PGAS) on top of a simple file-based messaging library (PythonMPI) in pure Python. The core data structure in pPython is a distributed numerical array whose distribution onto multiple processors is specified with a map construct. Communication operations between distributed arrays are abstracted away from the user and pPython transparently supports redistribution between any block-cyclic-overlapped distributions in up to four dimensions. pPython follows a SPMD (single program multiple data) model of computation. pPython runs on any combination of heterogeneous systems that support Python, including Windows, Linux, and MacOS operating systems. In addition to running transparently on single-node (e.g., a laptop), pPython provides a scheduler interface, so that pPython can be executed in a massively parallel computing environment. The initial implementation uses the Slurm scheduler. Performance of pPython on the HPC Challenge benchmark suite demonstrates both ease of programming and scalability.
△ Less
Submitted 31 August, 2022;
originally announced August 2022.
-
3D Real-Time Supercomputer Monitoring
Authors:
Bill Bergeron,
Matthew Hubbell,
Dylan Sequeira,
Winter Williams,
William Arcand,
David Bestor,
Chansup,
Byun,
Vijay Gadepally,
Michael Houle,
Michael Jones,
Anna Klien,
Peter Michaleas,
Lauren Milechin,
Julie Mullen Andrew Prout,
Albert Reuther,
Antonio Rosa,
Siddharth Samsi,
Charles Yee,
Jeremy Kepner
Abstract:
Supercomputers are complex systems producing vast quantities of performance data from multiple sources and of varying types. Performance data from each of the thousands of nodes in a supercomputer tracks multiple forms of storage, memory, networks, processors, and accelerators. Optimization of application performance is critical for cost effective usage of a supercomputer and requires efficient me…
▽ More
Supercomputers are complex systems producing vast quantities of performance data from multiple sources and of varying types. Performance data from each of the thousands of nodes in a supercomputer tracks multiple forms of storage, memory, networks, processors, and accelerators. Optimization of application performance is critical for cost effective usage of a supercomputer and requires efficient methods for effectively viewing performance data. The combination of supercomputing analytics and 3D gaming visualization enables real-time processing and visual data display of massive amounts of information that humans can process quickly with little training. Our system fully utilizes the capabilities of modern 3D gaming environments to create novel representations of computing hardware which intuitively represent the physical attributes of the supercomputer while displaying real-time alerts and component utilization. This system allows operators to quickly assess how the supercomputer is being used, gives users visibility into the resources they are consuming, and provides instructors new ways to interactively teach the computing architecture concepts necessary for efficient computing
△ Less
Submitted 9 September, 2021;
originally announced September 2021.
-
Node-Based Job Scheduling for Large Scale Simulations of Short Running Jobs
Authors:
Chansup Byun,
William Arcand,
David Bestor,
Bill Bergeron,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Michael Jones,
Anna Klein,
Peter Michaleas,
Lauren Milechin,
Julie Mullen,
Andrew Prout,
Albert Reuther,
Antonio Rosa,
Siddharth Samsi,
Charles Yee,
Jeremy Kepner
Abstract:
Diverse workloads such as interactive supercomputing, big data analysis, and large-scale AI algorithm development, requires a high-performance scheduler. This paper presents a novel node-based scheduling approach for large scale simulations of short running jobs on MIT SuperCloud systems, that allows the resources to be fully utilized for both long running batch jobs while simultaneously providing…
▽ More
Diverse workloads such as interactive supercomputing, big data analysis, and large-scale AI algorithm development, requires a high-performance scheduler. This paper presents a novel node-based scheduling approach for large scale simulations of short running jobs on MIT SuperCloud systems, that allows the resources to be fully utilized for both long running batch jobs while simultaneously providing fast launch and release of large-scale short running jobs. The node-based scheduling approach has demonstrated up to 100 times faster scheduler performance that other state-of-the-art systems.
△ Less
Submitted 25 August, 2021;
originally announced August 2021.
-
Accuracy and Performance Comparison of Video Action Recognition Approaches
Authors:
Matthew Hutchinson,
Siddharth Samsi,
William Arcand,
David Bestor,
Bill Bergeron,
Chansup Byun,
Micheal Houle,
Matthew Hubbell,
Micheal Jones,
Jeremy Kepner,
Andrew Kirby,
Peter Michaleas,
Lauren Milechin,
Julie Mullen,
Andrew Prout,
Antonio Rosa,
Albert Reuther,
Charles Yee,
Vijay Gadepally
Abstract:
Over the past few years, there has been significant interest in video action recognition systems and models. However, direct comparison of accuracy and computational performance results remain clouded by differing training environments, hardware specifications, hyperparameters, pipelines, and inference methods. This article provides a direct comparison between fourteen off-the-shelf and state-of-t…
▽ More
Over the past few years, there has been significant interest in video action recognition systems and models. However, direct comparison of accuracy and computational performance results remain clouded by differing training environments, hardware specifications, hyperparameters, pipelines, and inference methods. This article provides a direct comparison between fourteen off-the-shelf and state-of-the-art models by ensuring consistency in these training characteristics in order to provide readers with a meaningful comparison across different types of video action recognition algorithms. Accuracy of the models is evaluated using standard Top-1 and Top-5 accuracy metrics in addition to a proposed new accuracy metric. Additionally, we compare computational performance of distributed training from two to sixty-four GPUs on a state-of-the-art HPC system.
△ Less
Submitted 20 August, 2020;
originally announced August 2020.
-
Benchmarking network fabrics for data distributed training of deep neural networks
Authors:
Siddharth Samsi,
Andrew Prout,
Michael Jones,
Andrew Kirby,
Bill Arcand,
Bill Bergeron,
David Bestor,
Chansup Byun,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Anna Klein,
Peter Michaleas,
Lauren Milechin,
Julie Mullen,
Antonio Rosa,
Charles Yee,
Albert Reuther,
Jeremy Kepner
Abstract:
Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is the data parallel approach, where the training data is distributed across multiple compute nodes. This approach is simp…
▽ More
Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is the data parallel approach, where the training data is distributed across multiple compute nodes. This approach is simple to implement and supported by most of the commonly used machine learning frameworks. The data parallel approach leverages MPI for communicating gradients across all nodes. In this paper, we examine the effects of using different physical hardware interconnects and network-related software primitives for enabling data distributed deep learning. We compare the effect of using GPUDirect and NCCL on Ethernet and OmniPath fabrics. Our results show that using Ethernet-based networking in shared HPC systems does not have a significant effect on the training times for commonly used deep neural network architectures or traditional HPC applications such as Computational Fluid Dynamics.
△ Less
Submitted 18 August, 2020;
originally announced August 2020.
-
Best of Both Worlds: High Performance Interactive and Batch Launching
Authors:
Chansup Byun,
Jeremy Kepner,
William Arcand,
David Bestor,
Bill Bergeron,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Michael Jones,
Andrew Kirby,
Anna Klein,
Peter Michaleas,
Lauren Milechin,
Julie Mullen,
Andrew Prout,
Antonio Rosa,
Siddharth Samsi,
Charles Yee,
Albert Reuther
Abstract:
Rapid launch of thousands of jobs is essential for effective interactive supercomputing, big data analysis, and AI algorithm development. Achieving thousands of launches per second has required hardware to be available to receive these jobs. This paper presents a novel preemptive approach to implement spot jobs on MIT SuperCloud systems allowing the resources to be fully utilized for both long run…
▽ More
Rapid launch of thousands of jobs is essential for effective interactive supercomputing, big data analysis, and AI algorithm development. Achieving thousands of launches per second has required hardware to be available to receive these jobs. This paper presents a novel preemptive approach to implement spot jobs on MIT SuperCloud systems allowing the resources to be fully utilized for both long running batch jobs while still providing fast launch for interactive jobs. The new approach separates the job preemption and scheduling operations and can achieve 100 times faster performance in the scheduling of a job with preemption when compared to using the standard scheduler-provided automatic preemption-based capability. The results demonstrate that the new approach can schedule interactive jobs preemptively at a performance comparable to when the required computing resources are idle and available. The spot job capability can be deployed without disrupting the interactive user experience while increasing the overall system utilization.
△ Less
Submitted 5 August, 2020;
originally announced August 2020.
-
Large Scale Parallelization Using File-Based Communications
Authors:
Chansup Byun,
Jeremy Kepner,
William Arcand,
David Bestor,
Bill Bergeron,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Michael Jones,
Anna Klein,
Peter Michaleas,
Julie Mullen,
Andrew Prout,
Antonio Rosa,
Siddharth Samsi,
Charles Yee,
Albert Reuther
Abstract:
In this paper, we present a novel and new file-based communication architecture using the local filesystem for large scale parallelization. This new approach eliminates the issues with filesystem overload and resource contention when using the central filesystem for large parallel jobs. The new approach incurs additional overhead due to inter-node message file transfers when both the sending and r…
▽ More
In this paper, we present a novel and new file-based communication architecture using the local filesystem for large scale parallelization. This new approach eliminates the issues with filesystem overload and resource contention when using the central filesystem for large parallel jobs. The new approach incurs additional overhead due to inter-node message file transfers when both the sending and receiving processes are not on the same node. However, even with this additional overhead cost, its benefits are far greater for the overall cluster operation in addition to the performance enhancement in message communications for large scale parallel jobs. For example, when running a 2048-process parallel job, it achieved about 34 times better performance with MPI_Bcast() when using the local filesystem. Furthermore, since the security for transferring message files is handled entirely by using the secure copy protocol (scp) and the file system permissions, no additional security measures or ports are required other than those that are typically required on an HPC system.
△ Less
Submitted 3 September, 2019;
originally announced September 2019.
-
Securing HPC using Federated Authentication
Authors:
Andrew Prout,
William Arcand,
David Bestor,
Bill Bergeron,
Chansup Byun,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Michael Jones,
Anna Klein,
Peter Michaleas,
Lauren Milechin,
Julie Mullen,
Antonio Rosa,
Siddharth Samsi,
Charles Yee,
Albert Reuther,
Jeremy Kepner
Abstract:
Federated authentication can drastically reduce the overhead of basic account maintenance while simultaneously improving overall system security. Integrating with the user's more frequently used account at their primary organization both provides a better experience to the end user and makes account compromise or changes in affiliation more likely to be noticed and acted upon. Additionally, with m…
▽ More
Federated authentication can drastically reduce the overhead of basic account maintenance while simultaneously improving overall system security. Integrating with the user's more frequently used account at their primary organization both provides a better experience to the end user and makes account compromise or changes in affiliation more likely to be noticed and acted upon. Additionally, with many organizations transitioning to multi-factor authentication for all account access, the ability to leverage external federated identity management systems provides the benefit of their efforts without the additional overhead of separately implementing a distinct multi-factor authentication process. This paper describes our experiences and the lessons we learned by enabling federated authentication with the U.S. Government PKI and InCommon Federation, scaling it up to the user base of a production HPC system, and the motivations behind those choices. We have received only positive feedback from our users.
△ Less
Submitted 20 August, 2019;
originally announced August 2019.
-
Lessons Learned from a Decade of Providing Interactive, On-Demand High Performance Computing to Scientists and Engineers
Authors:
Julia Mullen,
Albert Reuther,
William Arcand,
Bill Bergeron,
David Bestor,
Chansup Byun,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Michael Jones,
Anna Klein,
Peter Michaleas,
Lauren Milechin,
Andrew Prout,
Antonio Rosa,
Siddharth Samsi,
Charles Yee,
Jeremy Kepner
Abstract:
For decades, the use of HPC systems was limited to those in the physical sciences who had mastered their domain in conjunction with a deep understanding of HPC architectures and algorithms. During these same decades, consumer computing device advances produced tablets and smartphones that allow millions of children to interactively develop and share code projects across the globe. As the HPC commu…
▽ More
For decades, the use of HPC systems was limited to those in the physical sciences who had mastered their domain in conjunction with a deep understanding of HPC architectures and algorithms. During these same decades, consumer computing device advances produced tablets and smartphones that allow millions of children to interactively develop and share code projects across the globe. As the HPC community faces the challenges associated with guiding researchers from disciplines using high productivity interactive tools to effective use of HPC systems, it seems appropriate to revisit the assumptions surrounding the necessary skills required for access to large computational systems. For over a decade, MIT Lincoln Laboratory has been supporting interactive, on-demand high performance computing by seamlessly integrating familiar high productivity tools to provide users with an increased number of design turns, rapid prototy** capability, and faster time to insight. In this paper, we discuss the lessons learned while supporting interactive, on-demand high performance computing from the perspectives of the users and the team supporting the users and the system. Building on these lessons, we present an overview of current needs and the technical solutions we are building to lower the barrier to entry for new users from the humanities, social, and biological sciences.
△ Less
Submitted 5 March, 2019;
originally announced March 2019.
-
Hyperscaling Internet Graph Analysis with D4M on the MIT SuperCloud
Authors:
Vijay Gadepally,
Jeremy Kepner,
Lauren Milechin,
William Arcand,
David Bestor,
Bill Bergeron,
Chansup Byun,
Matthew Hubbell,
Micheal Houle,
Micheal Jones,
Peter Michaleas,
Julie Mullen,
Andrew Prout,
Antonio Rosa,
Charles Yee,
Siddharth Samsi,
Albert Reuther
Abstract:
Detecting anomalous behavior in network traffic is a major challenge due to the volume and velocity of network traffic. For example, a 10 Gigabit Ethernet connection can generate over 50 MB/s of packet headers. For global network providers, this challenge can be amplified by many orders of magnitude. Development of novel computer network traffic analytics requires: high level programming environme…
▽ More
Detecting anomalous behavior in network traffic is a major challenge due to the volume and velocity of network traffic. For example, a 10 Gigabit Ethernet connection can generate over 50 MB/s of packet headers. For global network providers, this challenge can be amplified by many orders of magnitude. Development of novel computer network traffic analytics requires: high level programming environments, massive amount of packet capture (PCAP) data, and diverse data products for "at scale" algorithm pipeline development. D4M (Dynamic Distributed Dimensional Data Model) combines the power of sparse linear algebra, associative arrays, parallel processing, and distributed databases (such as SciDB and Apache Accumulo) to provide a scalable data and computation system that addresses the big data problems associated with network analytics development. Combining D4M with the MIT SuperCloud manycore processors and parallel storage system enables network analysts to interactively process massive amounts of data in minutes. To demonstrate these capabilities, we have implemented a representative analytics pipeline in D4M and benchmarked it on 96 hours of Gigabit PCAP data with MIT SuperCloud. The entire pipeline from uncompressing the raw files to database ingest was implemented in 135 lines of D4M code and achieved speedups of over 20,000.
△ Less
Submitted 25 August, 2018;
originally announced August 2018.
-
Interactive Launch of 16,000 Microsoft Windows Instances on a Supercomputer
Authors:
Michael Jones,
Jeremy Kepner,
Bradley Orchard,
Albert Reuther,
William Arcand,
David Bestor,
Bill Bergeron,
Chansup Byun,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Anna Klein,
Lauren Milechin,
Julia Mullen,
Andrew Prout,
Antonio Rosa,
Siddharth Samsi,
Charles Yee,
Peter Michaleas
Abstract:
Simulation, machine learning, and data analysis require a wide range of software which can be dependent upon specific operating systems, such as Microsoft Windows. Running this software interactively on massively parallel supercomputers can present many challenges. Traditional methods of scaling Microsoft Windows applications to run on thousands of processors have typically relied on heavyweight v…
▽ More
Simulation, machine learning, and data analysis require a wide range of software which can be dependent upon specific operating systems, such as Microsoft Windows. Running this software interactively on massively parallel supercomputers can present many challenges. Traditional methods of scaling Microsoft Windows applications to run on thousands of processors have typically relied on heavyweight virtual machines that can be inefficient and slow to launch on modern manycore processors. This paper describes a unique approach using the Lincoln Laboratory LLMapReduce technology in combination with the Wine Windows compatibility layer to rapidly and simultaneously launch and run Microsoft Windows applications on thousands of cores on a supercomputer. Specifically, this work demonstrates launching 16,000 Microsoft Windows applications in 5 minutes running on 16,000 processor cores. This capability significantly broadens the range of applications that can be run at large scale on a supercomputer.
△ Less
Submitted 13 August, 2018;
originally announced August 2018.
-
Measuring the Impact of Spectre and Meltdown
Authors:
Andrew Prout,
William Arcand,
David Bestor,
Bill Bergeron,
Chansup Byun,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Michael Jones,
Anna Klein,
Peter Michaleas,
Lauren Milechin,
Julie Mullen,
Antonio Rosa,
Siddharth Samsi,
Charles Yee,
Albert Reuther,
Jeremy Kepner
Abstract:
The Spectre and Meltdown flaws in modern microprocessors represent a new class of attacks that have been difficult to mitigate. The mitigations that have been proposed have known performance impacts. The reported magnitude of these impacts varies depending on the industry sector and expected workload characteristics. In this paper, we measure the performance impact on several workloads relevant to…
▽ More
The Spectre and Meltdown flaws in modern microprocessors represent a new class of attacks that have been difficult to mitigate. The mitigations that have been proposed have known performance impacts. The reported magnitude of these impacts varies depending on the industry sector and expected workload characteristics. In this paper, we measure the performance impact on several workloads relevant to HPC systems. We show that the impact can be significant on both synthetic and realistic workloads. We also show that the performance penalties are difficult to avoid even in dedicated systems where security is a lesser concern.
△ Less
Submitted 23 July, 2018;
originally announced July 2018.
-
Interactive Supercomputing on 40,000 Cores for Machine Learning and Data Analysis
Authors:
Albert Reuther,
Jeremy Kepner,
Chansup Byun,
Siddharth Samsi,
William Arcand,
David Bestor,
Bill Bergeron,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Michael Jones,
Anna Klein,
Lauren Milechin,
Julia Mullen,
Andrew Prout,
Antonio Rosa,
Charles Yee,
Peter Michaleas
Abstract:
Interactive massively parallel computations are critical for machine learning and data analysis. These computations are a staple of the MIT Lincoln Laboratory Supercomputing Center (LLSC) and has required the LLSC to develop unique interactive supercomputing capabilities. Scaling interactive machine learning frameworks, such as TensorFlow, and data analysis environments, such as MATLAB/Octave, to…
▽ More
Interactive massively parallel computations are critical for machine learning and data analysis. These computations are a staple of the MIT Lincoln Laboratory Supercomputing Center (LLSC) and has required the LLSC to develop unique interactive supercomputing capabilities. Scaling interactive machine learning frameworks, such as TensorFlow, and data analysis environments, such as MATLAB/Octave, to tens of thousands of cores presents many technical challenges - in particular, rapidly dispatching many tasks through a scheduler, such as Slurm, and starting many instances of applications with thousands of dependencies. Careful tuning of launches and prepositioning of applications overcome these challenges and allow the launching of thousands of tasks in seconds on a 40,000-core supercomputer. Specifically, this work demonstrates launching 32,000 TensorFlow processes in 4 seconds and launching 262,000 Octave processes in 40 seconds. These capabilities allow researchers to rapidly explore novel machine learning architecture and data analysis algorithms.
△ Less
Submitted 20 July, 2018;
originally announced July 2018.
-
Design, Generation, and Validation of Extreme Scale Power-Law Graphs
Authors:
Jeremy Kepner,
Siddharth Samsi,
William Arcand,
David Bestor,
Bill Bergeron,
Tim Davis,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Hayden Jananthan,
Michael Jones,
Anna Klein,
Peter Michaleas,
Roger Pearce,
Lauren Milechin,
Julie Mullen,
Andrew Prout,
Antonio Rosa,
Geoff Sanders,
Charles Yee,
Albert Reuther
Abstract:
Massive power-law graphs drive many fields: metagenomics, brain map**, Internet-of-things, cybersecurity, and sparse machine learning. The development of novel algorithms and systems to process these data requires the design, generation, and validation of enormous graphs with exactly known properties. Such graphs accelerate the proper testing of new algorithms and systems and are a prerequisite…
▽ More
Massive power-law graphs drive many fields: metagenomics, brain map**, Internet-of-things, cybersecurity, and sparse machine learning. The development of novel algorithms and systems to process these data requires the design, generation, and validation of enormous graphs with exactly known properties. Such graphs accelerate the proper testing of new algorithms and systems and are a prerequisite for success on real applications. Many random graph generators currently exist that require realizing a graph in order to know its exact properties: number of vertices, number of edges, degree distribution, and number of triangles. Designing graphs using these random graph generators is a time-consuming trial-and-error process. This paper presents a novel approach that uses Kronecker products to allow the exact computation of graph properties prior to graph generation. In addition, when a real graph is desired, it can be generated quickly in memory on a parallel computer with no-interprocessor communication. To test this approach, graphs with $10^{12}$ edges are generated on a 40,000+ core supercomputer in 1 second and exactly agree with those predicted by the theory. In addition, to demonstrate the extensibility of this approach, decetta-scale graphs with up to $10^{30}$ edges are simulated in a few minutes on a laptop.
△ Less
Submitted 3 March, 2018;
originally announced March 2018.
-
Performance Measurements of Supercomputing and Cloud Storage Solutions
Authors:
Michael Jones,
Jeremy Kepner,
William Arcand,
David Bestor,
Bill Bergeron,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Peter Michaleas,
Andrew Prout,
Albert Reuther,
Siddharth Samsi,
Paul Monticiollo
Abstract:
Increasing amounts of data from varied sources, particularly in the fields of machine learning and graph analytics, are causing storage requirements to grow rapidly. A variety of technologies exist for storing and sharing these data, ranging from parallel file systems used by supercomputers to distributed block storage systems found in clouds. Relatively few comparative measurements exist to infor…
▽ More
Increasing amounts of data from varied sources, particularly in the fields of machine learning and graph analytics, are causing storage requirements to grow rapidly. A variety of technologies exist for storing and sharing these data, ranging from parallel file systems used by supercomputers to distributed block storage systems found in clouds. Relatively few comparative measurements exist to inform decisions about which storage systems are best suited for particular tasks. This work provides these measurements for two of the most popular storage technologies: Lustre and Amazon S3. Lustre is an open-source, high performance, parallel file system used by many of the largest supercomputers in the world. Amazon's Simple Storage Service, or S3, is part of the Amazon Web Services offering, and offers a scalable, distributed option to store and retrieve data from anywhere on the Internet. Parallel processing is essential for achieving high performance on modern storage systems. The performance tests used span the gamut of parallel I/O scenarios, ranging from single-client, single-node Amazon S3 and Lustre performance to a large-scale, multi-client test designed to demonstrate the capabilities of a modern storage appliance under heavy load. These results show that, when parallel I/O is used correctly (i.e., many simultaneous read or write processes), full network bandwidth performance is achievable and ranged from 10 gigabits/s over a 10 GigE S3 connection to 0.35 terabits/s using Lustre on a 1200 port 10 GigE switch. These results demonstrate that S3 is well-suited to sharing vast quantities of data over the Internet, while Lustre is well-suited to processing large quantities of data locally.
△ Less
Submitted 1 August, 2017;
originally announced August 2017.
-
MIT SuperCloud Portal Workspace: Enabling HPC Web Application Deployment
Authors:
Andrew Prout,
William Arcand,
David Bestor,
Bill Bergeron,
Chansup Byun,
Vijay Gadepally,
Matthew Hubbell,
Michael Houle,
Michael Jones,
Peter Michaleas,
Lauren Milechin,
Julie Mullen,
Antonio Rosa,
Siddharth Samsi,
Albert Reuther,
Jeremy Kepner
Abstract:
The MIT SuperCloud Portal Workspace enables the secure exposure of web services running on high performance computing (HPC) systems. The portal allows users to run any web application as an HPC job and access it from their workstation while providing authentication, encryption, and access control at the system level to prevent unintended access. This capability permits users to seamlessly utilize…
▽ More
The MIT SuperCloud Portal Workspace enables the secure exposure of web services running on high performance computing (HPC) systems. The portal allows users to run any web application as an HPC job and access it from their workstation while providing authentication, encryption, and access control at the system level to prevent unintended access. This capability permits users to seamlessly utilize existing and emerging tools that present their user interface as a website on an HPC system creating a portal workspace. Performance measurements indicate that the MIT SuperCloud Portal Workspace incurs marginal overhead when compared to a direct connection of the same service.
△ Less
Submitted 18 July, 2017;
originally announced July 2017.
-
Benchmarking Data Analysis and Machine Learning Applications on the Intel KNL Many-Core Processor
Authors:
Chansup Byun,
Jeremy Kepner,
William Arcand,
David Bestor,
Bill Bergeron,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Michael Jones,
Anna Klein,
Peter Michaleas,
Lauren Milechin,
Julie Mullen,
Andrew Prout,
Antonio Rosa,
Siddharth Samsi,
Charles Yee,
Albert Reuther
Abstract:
Knights Landing (KNL) is the code name for the second-generation Intel Xeon Phi product family. KNL has generated significant interest in the data analysis and machine learning communities because its new many-core architecture targets both of these workloads. The KNL many-core vector processor design enables it to exploit much higher levels of parallelism. At the Lincoln Laboratory Supercomputing…
▽ More
Knights Landing (KNL) is the code name for the second-generation Intel Xeon Phi product family. KNL has generated significant interest in the data analysis and machine learning communities because its new many-core architecture targets both of these workloads. The KNL many-core vector processor design enables it to exploit much higher levels of parallelism. At the Lincoln Laboratory Supercomputing Center (LLSC), the majority of users are running data analysis applications such as MATLAB and Octave. More recently, machine learning applications, such as the UC Berkeley Caffe deep learning framework, have become increasingly important to LLSC users. Thus, the performance of these applications on KNL systems is of high interest to LLSC users and the broader data analysis and machine learning communities. Our data analysis benchmarks of these application on the Intel KNL processor indicate that single-core double-precision generalized matrix multiply (DGEMM) performance on KNL systems has improved by ~3.5x compared to prior Intel Xeon technologies. Our data analysis applications also achieved ~60% of the theoretical peak performance. Also a performance comparison of a machine learning application, Caffe, between the two different Intel CPUs, Xeon E5 v3 and Xeon Phi 7210, demonstrated a 2.7x improvement on a KNL node.
△ Less
Submitted 11 July, 2017;
originally announced July 2017.
-
Scalable System Scheduling for HPC and Big Data
Authors:
Albert Reuther,
Chansup Byun,
William Arcand,
David Bestor,
Bill Bergeron,
Matthew Hubbell,
Michael Jones,
Peter Michaleas,
Andrew Prout,
Antonio Rosa,
Jeremy Kepner
Abstract:
In the rapidly expanding field of parallel processing, job schedulers are the "operating systems" of modern big data architectures and supercomputing systems. Job schedulers allocate computing resources and control the execution of processes on those resources. Historically, job schedulers were the domain of supercomputers, and job schedulers were designed to run massive, long-running computations…
▽ More
In the rapidly expanding field of parallel processing, job schedulers are the "operating systems" of modern big data architectures and supercomputing systems. Job schedulers allocate computing resources and control the execution of processes on those resources. Historically, job schedulers were the domain of supercomputers, and job schedulers were designed to run massive, long-running computations over days and weeks. More recently, big data workloads have created a need for a new class of computations consisting of many short computations taking seconds or minutes that process enormous quantities of data. For both supercomputers and big data systems, the efficiency of the job scheduler represents a fundamental limit on the efficiency of the system. Detailed measurement and modeling of the performance of schedulers are critical for maximizing the performance of a large-scale computing system. This paper presents a detailed feature analysis of 15 supercomputing and big data schedulers. For big data workloads, the scheduler latency is the most important performance characteristic of the scheduler. A theoretical model of the latency of these schedulers is developed and used to design experiments targeted at measuring scheduler latency. Detailed benchmarking of four of the most popular schedulers (Slurm, Son of Grid Engine, Mesos, and Hadoop YARN) are conducted. The theoretical model is compared with data and demonstrates that scheduler performance can be characterized by two key parameters: the marginal latency of the scheduler $t_s$ and a nonlinear exponent $α_s$. For all four schedulers, the utilization of the computing system decreases to < 10\% for computations lasting only a few seconds. Multilevel schedulers that transparently aggregate short computations can improve utilization for these short computations to > 90\% for all four of the schedulers that were tested.
△ Less
Submitted 8 May, 2017;
originally announced May 2017.
-
Benchmarking SciDB Data Import on HPC Systems
Authors:
Siddharth Samsi,
Laura Brattain,
William Arcand,
David Bestor,
Bill Bergeron,
Chansup Byun,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Michael Jones,
Anna Klein,
Peter Michaleas,
Lauren Milechin,
Julie Mullen,
Andrew Prout,
Antonio Rosa,
Charles Yee,
Jeremy Kepner,
Albert Reuther
Abstract:
SciDB is a scalable, computational database management system that uses an array model for data storage. The array data model of SciDB makes it ideally suited for storing and managing large amounts of imaging data. SciDB is designed to support advanced analytics in database, thus reducing the need for extracting data for analysis. It is designed to be massively parallel and can run on commodity ha…
▽ More
SciDB is a scalable, computational database management system that uses an array model for data storage. The array data model of SciDB makes it ideally suited for storing and managing large amounts of imaging data. SciDB is designed to support advanced analytics in database, thus reducing the need for extracting data for analysis. It is designed to be massively parallel and can run on commodity hardware in a high performance computing (HPC) environment. In this paper, we present the performance of SciDB using simulated image data. The Dynamic Distributed Dimensional Data Model (D4M) software is used to implement the benchmark on a cluster running the MIT SuperCloud software stack. A peak performance of 2.2M database inserts per second was achieved on a single node of this system. We also show that SciDB and the D4M toolbox provide more efficient ways to access random sub-volumes of massive datasets compared to the traditional approaches of reading volumetric data from individual files. This work describes the D4M and SciDB tools we developed and presents the initial performance results. This performance was achieved by using parallel inserts, a in-database merging of arrays as well as supercomputing techniques, such as distributed arrays and single-program-multiple-data programming.
△ Less
Submitted 23 September, 2016;
originally announced September 2016.
-
Scheduler Technologies in Support of High Performance Data Analysis
Authors:
Albert Reuther,
Chansup Byun,
William Arcand,
David Bestor,
Bill Bergeron,
Matthew Hubbell,
Michael Jones,
Peter Michaleas,
Andrew Prout,
Antonio Rosa,
Jeremy Kepner
Abstract:
Job schedulers are a key component of scalable computing infrastructures. They orchestrate all of the work executed on the computing infrastructure and directly impact the effectiveness of the system. Recently, job workloads have diversified from long-running, synchronously-parallel simulations to include short-duration, independently parallel high performance data analysis (HPDA) jobs. Each of th…
▽ More
Job schedulers are a key component of scalable computing infrastructures. They orchestrate all of the work executed on the computing infrastructure and directly impact the effectiveness of the system. Recently, job workloads have diversified from long-running, synchronously-parallel simulations to include short-duration, independently parallel high performance data analysis (HPDA) jobs. Each of these job types requires different features and scheduler tuning to run efficiently. A number of schedulers have been developed to address both job workload and computing system heterogeneity. High performance computing (HPC) schedulers were designed to schedule large-scale scientific modeling and simulations on supercomputers. Big Data schedulers were designed to schedule data processing and analytic jobs on clusters. This paper compares and contrasts the features of HPC and Big Data schedulers with a focus on accommodating both scientific computing and high performance data analytic workloads. Job latency is critical for the efficient utilization of scalable computing infrastructures, and this paper presents the results of job launch benchmarking of several current schedulers: Slurm, Son of Grid Engine, Mesos, and Yarn. We find that all of these schedulers have low utilization for short-running jobs. Furthermore, employing multilevel scheduling significantly improves the utilization across all schedulers.
△ Less
Submitted 21 July, 2016;
originally announced July 2016.
-
LLMapReduce: Multi-Level Map-Reduce for High Performance Data Analysis
Authors:
Chansup Byun,
Jeremy Kepner,
William Arcand,
David Bestor,
Bill Bergeron,
Vijay Gadepally,
Matthew Hubbell,
Peter Michaleas,
Julie Mullen,
Andrew Prout,
Antonio Rosa,
Charles Yee,
Albert Reuther
Abstract:
The map-reduce parallel programming model has become extremely popular in the big data community. Many big data workloads can benefit from the enhanced performance offered by supercomputers. LLMapReduce provides the familiar map-reduce parallel programming model to big data users running on a supercomputer. LLMapReduce dramatically simplifies map-reduce programming by providing simple parallel pro…
▽ More
The map-reduce parallel programming model has become extremely popular in the big data community. Many big data workloads can benefit from the enhanced performance offered by supercomputers. LLMapReduce provides the familiar map-reduce parallel programming model to big data users running on a supercomputer. LLMapReduce dramatically simplifies map-reduce programming by providing simple parallel programming capability in one line of code. LLMapReduce supports all programming languages and many schedulers. LLMapReduce can work with any application without the need to modify the application. Furthermore, LLMapReduce can overcome scaling limits in the map-reduce parallel programming model via options that allow the user to switch to the more efficient single-program-multiple-data (SPMD) parallel programming model. These features allow users to reduce the computational overhead by more than 10x compared to standard map-reduce for certain applications. LLMapReduce is widely used by hundreds of users at MIT. Currently LLMapReduce works with several schedulers such as SLURM, Grid Engine and LSF.
△ Less
Submitted 21 July, 2016;
originally announced July 2016.
-
Enhancing HPC Security with a User-Based Firewall
Authors:
Andrew Prout,
William Arcand,
David Bestor,
Bill Bergeron,
Chansup Byun,
Vijay Gadepally,
Matthew Hubbell,
Michael Houle,
Michael Jones,
Peter Michaleas,
Lauren Milechin,
Julie Mullen,
Antonio Rosa,
Siddharth Samsi,
Albert Reuther,
Jeremy Kepner
Abstract:
HPC systems traditionally allow their users unrestricted use of their internal network. While this network is normally controlled enough to guarantee privacy without the need for encryption, it does not provide a method to authenticate peer connections. Protocols built upon this internal network must provide their own authentication. Many methods have been employed to perform this authentication.…
▽ More
HPC systems traditionally allow their users unrestricted use of their internal network. While this network is normally controlled enough to guarantee privacy without the need for encryption, it does not provide a method to authenticate peer connections. Protocols built upon this internal network must provide their own authentication. Many methods have been employed to perform this authentication. However, support for all of these methods requires the HPC application developer to include support and the user to configure and enable these services. The user-based firewall capability we have prototyped enables a set of rules governing connections across the HPC internal network to be put into place using Linux netfilter. By using an operating system-level capability, the system is not reliant on any developer or user actions to enable security. The rules we have chosen and implemented are crafted to not impact the vast majority of users and be completely invisible to them.
△ Less
Submitted 11 July, 2016;
originally announced July 2016.
-
Scalability of VM Provisioning Systems
Authors:
Mike Jones,
Bill Arcand,
Bill Bergeron,
David Bestor,
Chansup Byun,
Lauren Milechin,
Vijay Gadepally,
Matt Hubbell,
Jeremy Kepner,
Pete Michaleas,
Julie Mullen,
Andy Prout,
Tony Rosa,
Siddharth Samsi,
Charles Yee,
Albert Reuther
Abstract:
Virtual machines and virtualized hardware have been around for over half a century. The commoditization of the x86 platform and its rapidly growing hardware capabilities have led to recent exponential growth in the use of virtualization both in the enterprise and high performance computing (HPC). The startup time of a virtualized environment is a key performance metric for high performance computi…
▽ More
Virtual machines and virtualized hardware have been around for over half a century. The commoditization of the x86 platform and its rapidly growing hardware capabilities have led to recent exponential growth in the use of virtualization both in the enterprise and high performance computing (HPC). The startup time of a virtualized environment is a key performance metric for high performance computing in which the runtime of any individual task is typically much shorter than the lifetime of a virtualized service in an enterprise context. In this paper, a methodology for accurately measuring the startup performance on an HPC system is described. The startup performance overhead of three of the most mature, widely deployed cloud management frameworks (OpenStack, OpenNebula, and Eucalyptus) is measured to determine their suitability for workloads typically seen in an HPC environment. A 10x performance difference is observed between the fastest (Eucalyptus) and the slowest (OpenNebula) framework. This time difference is primarily due to delays in waiting on networking in the cloud-init portion of the startup. The methodology and measurements presented should facilitate the optimization of startup across a variety of virtualization environments.
△ Less
Submitted 18 June, 2016;
originally announced June 2016.
-
D4M: Bringing Associative Arrays to Database Engines
Authors:
Vijay Gadepally,
Jeremy Kepner,
William Arcand,
David Bestor,
Bill Bergeron,
Chansup Byun,
Lauren Edwards,
Matthew Hubbell,
Peter Michaleas,
Julie Mullen,
Andrew Prout,
Antonio Rosa,
Charles Yee,
Albert Reuther
Abstract:
The ability to collect and analyze large amounts of data is a growing problem within the scientific community. The growing gap between data and users calls for innovative tools that address the challenges faced by big data volume, velocity and variety. Numerous tools exist that allow users to store, query and index these massive quantities of data. Each storage or database engine comes with the pr…
▽ More
The ability to collect and analyze large amounts of data is a growing problem within the scientific community. The growing gap between data and users calls for innovative tools that address the challenges faced by big data volume, velocity and variety. Numerous tools exist that allow users to store, query and index these massive quantities of data. Each storage or database engine comes with the promise of dealing with complex data. Scientists and engineers who wish to use these systems often quickly find that there is no single technology that offers a panacea to the complexity of information. When using multiple technologies, however, there is significant trouble in designing the movement of information between storage and database engines to support an end-to-end application along with a steep learning curve associated with learning the nuances of each underlying technology. In this article, we present the Dynamic Distributed Dimensional Data Model (D4M) as a potential tool to unify database and storage engine operations. Previous articles on D4M have showcased the ability of D4M to interact with the popular NoSQL Accumulo database. Recently however, D4M now operates on a variety of backend storage or database engines while providing a federated look to the end user through the use of associative arrays. In order to showcase how new databases may be supported by D4M, we describe the process of building the D4M-SciDB connector and present performance of this connection.
△ Less
Submitted 28 August, 2015;
originally announced August 2015.
-
Lustre, Hadoop, Accumulo
Authors:
Jeremy Kepner,
William Arcand,
David Bestor,
Bill Bergeron,
Chansup Byun,
Lauren Edwards,
Vijay Gadepally,
Matthew Hubbell,
Peter Michaleas,
Julie Mullen,
Andrew Prout,
Antonio Rosa,
Charles Yee,
Albert Reuther
Abstract:
Data processing systems impose multiple views on data as it is processed by the system. These views include spreadsheets, databases, matrices, and graphs. There are a wide variety of technologies that can be used to store and process data through these different steps. The Lustre parallel file system, the Hadoop distributed file system, and the Accumulo database are all designed to address the lar…
▽ More
Data processing systems impose multiple views on data as it is processed by the system. These views include spreadsheets, databases, matrices, and graphs. There are a wide variety of technologies that can be used to store and process data through these different steps. The Lustre parallel file system, the Hadoop distributed file system, and the Accumulo database are all designed to address the largest and the most challenging data storage problems. There have been many ad-hoc comparisons of these technologies. This paper describes the foundational principles of each technology, provides simple models for assessing their capabilities, and compares the various technologies on a hypothetical common cluster. These comparisons indicate that Lustre provides 2x more storage capacity, is less likely to loose data during 3 simultaneous drive failures, and provides higher bandwidth on general purpose workloads. Hadoop can provide 4x greater read bandwidth on special purpose workloads. Accumulo provides 10,000x lower latency on random lookups than either Lustre or Hadoop but Accumulo's bulk bandwidth is 10x less. Significant recent work has been done to enable mix-and-match solutions that allow Lustre, Hadoop, and Accumulo to be combined in different ways.
△ Less
Submitted 8 July, 2015;
originally announced July 2015.
-
Enabling On-Demand Database Computing with MIT SuperCloud Database Management System
Authors:
Andrew Prout,
Jeremy Kepner,
Peter Michaleas,
William Arcand,
David Bestor,
Bill Bergeron,
Chansup Byun,
Lauren Edwards,
Vijay Gadepally,
Matthew Hubbell,
Julie Mullen,
Antonio Rosa,
Charles Yee,
Albert Reuther
Abstract:
The MIT SuperCloud database management system allows for rapid creation and flexible execution of a variety of the latest scientific databases, including Apache Accumulo and SciDB. It is designed to permit these databases to run on a High Performance Computing Cluster (HPCC) platform as seamlessly as any other HPCC job. It ensures the seamless migration of the databases to the resources assigned b…
▽ More
The MIT SuperCloud database management system allows for rapid creation and flexible execution of a variety of the latest scientific databases, including Apache Accumulo and SciDB. It is designed to permit these databases to run on a High Performance Computing Cluster (HPCC) platform as seamlessly as any other HPCC job. It ensures the seamless migration of the databases to the resources assigned by the HPCC scheduler and centralized storage of the database files when not running. It also permits snapshotting of databases to allow researchers to experiment and push the limits of the technology without concerns for data or productivity loss if the database becomes unstable.
△ Less
Submitted 29 June, 2015;
originally announced June 2015.
-
Big Data Strategies for Data Center Infrastructure Management Using a 3D Gaming Platform
Authors:
Matthew Hubbell,
Andrew Moran,
William Arcand,
David Bestor,
Bill Bergeron,
Chansup Byun,
Vijay Gadepally,
Peter Michaleas,
Julie Mullen,
Andrew Prout,
Albert Reuther,
Antonio Rosa,
Charles Yee,
Jeremy Kepner
Abstract:
High Performance Computing (HPC) is intrinsically linked to effective Data Center Infrastructure Management (DCIM). Cloud services and HPC have become key components in Department of Defense and corporate Information Technology competitive strategies in the global and commercial spaces. As a result, the reliance on consistent, reliable Data Center space is more critical than ever. The costs and co…
▽ More
High Performance Computing (HPC) is intrinsically linked to effective Data Center Infrastructure Management (DCIM). Cloud services and HPC have become key components in Department of Defense and corporate Information Technology competitive strategies in the global and commercial spaces. As a result, the reliance on consistent, reliable Data Center space is more critical than ever. The costs and complexity of providing quality DCIM are constantly being tested and evaluated by the United States Government and companies such as Google, Microsoft and Facebook. This paper will demonstrate a system where Big Data strategies and 3D gaming technology is leveraged to successfully monitor and analyze multiple HPC systems and a lights-out modular HP EcoPOD 240a Data Center on a singular platform. Big Data technology and a 3D gaming platform enables the relative real time monitoring of 5000 environmental sensors, more than 3500 IT data points and display visual analytics of the overall operating condition of the Data Center from a command center over 100 miles away. In addition, the Big Data model allows for in depth analysis of historical trends and conditions to optimize operations achieving even greater efficiencies and reliability.
△ Less
Submitted 29 June, 2015;
originally announced June 2015.
-
D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database
Authors:
Jeremy Kepner,
Christian Anderson,
William Arcand,
David Bestor,
Bill Bergeron,
Chansup Byun,
Matthew Hubbell,
Peter Michaleas,
Julie Mullen,
David O'Gwynn,
Andrew Prout,
Albert Reuther,
Antonio Rosa,
Charles Yee
Abstract:
Non-traditional, relaxed consistency, triple store databases are the backbone of many web companies (e.g., Google Big Table, Amazon Dynamo, and Facebook Cassandra). The Apache Accumulo database is a high performance open source relaxed consistency database that is widely used for government applications. Obtaining the full benefits of Accumulo requires using novel schemas. The Dynamic Distributed…
▽ More
Non-traditional, relaxed consistency, triple store databases are the backbone of many web companies (e.g., Google Big Table, Amazon Dynamo, and Facebook Cassandra). The Apache Accumulo database is a high performance open source relaxed consistency database that is widely used for government applications. Obtaining the full benefits of Accumulo requires using novel schemas. The Dynamic Distributed Dimensional Data Model (D4M)[http://d4m.mit.edu] provides a uniform mathematical framework based on associative arrays that encompasses both traditional (i.e., SQL) and non-traditional databases. For non-traditional databases D4M naturally leads to a general purpose schema that can be used to fully index and rapidly query every unique string in a dataset. The D4M 2.0 Schema has been applied with little or no customization to cyber, bioinformatics, scientific citation, free text, and social media data. The D4M 2.0 Schema is simple, requires minimal parsing, and achieves the highest published Accumulo ingest rates. The benefits of the D4M 2.0 Schema are independent of the D4M interface. Any interface to Accumulo can achieve these benefits by using the D4M 2.0 Schema
△ Less
Submitted 14 July, 2014;
originally announced July 2014.
-
Achieving 100,000,000 database inserts per second using Accumulo and D4M
Authors:
Jeremy Kepner,
William Arcand,
David Bestor,
Bill Bergeron,
Chansup Byun,
Vijay Gadepally,
Matthew Hubbell,
Peter Michaleas,
Julie Mullen,
Andrew Prout,
Albert Reuther,
Antonio Rosa,
Charles Yee
Abstract:
The Apache Accumulo database is an open source relaxed consistency database that is widely used for government applications. Accumulo is designed to deliver high performance on unstructured data such as graphs of network data. This paper tests the performance of Accumulo using data from the Graph500 benchmark. The Dynamic Distributed Dimensional Data Model (D4M) software is used to implement the b…
▽ More
The Apache Accumulo database is an open source relaxed consistency database that is widely used for government applications. Accumulo is designed to deliver high performance on unstructured data such as graphs of network data. This paper tests the performance of Accumulo using data from the Graph500 benchmark. The Dynamic Distributed Dimensional Data Model (D4M) software is used to implement the benchmark on a 216-node cluster running the MIT SuperCloud software stack. A peak performance of over 100,000,000 database inserts per second was achieved which is 100x larger than the highest previously published value for any other database. The performance scales linearly with the number of ingest clients, number of database servers, and data size. The performance was achieved by adapting several supercomputing techniques to this application: distributed arrays, domain decomposition, adaptive load balancing, and single-program-multiple-data programming.
△ Less
Submitted 18 June, 2014;
originally announced June 2014.