-
Imaginary Machines: A Serverless Model for Cloud Applications
Authors:
Michael Wawrzoniak,
Rodrigo Bruno,
Ana Klimovic,
Gustavo Alonso
Abstract:
Serverless Function-as-a-Service (FaaS) platforms provide applications with resources that are highly elastic, quick to instantiate, accounted at fine granularity, and without the need for explicit runtime resource orchestration. This combination of the core properties underpins the success and popularity of the serverless FaaS paradigm. However, these benefits are not available to most cloud appl…
▽ More
Serverless Function-as-a-Service (FaaS) platforms provide applications with resources that are highly elastic, quick to instantiate, accounted at fine granularity, and without the need for explicit runtime resource orchestration. This combination of the core properties underpins the success and popularity of the serverless FaaS paradigm. However, these benefits are not available to most cloud applications because they are designed for networked virtual machines/containers environments. Since such cloud applications cannot take advantage of the highly elastic resources of serverless and require run-time orchestration systems to operate, they suffer from lower resource utilization, additional management complexity, and costs relative to their FaaS serverless counterparts.
We propose Imaginary Machines, a new serverless model for cloud applications. This model (1.) exposes the highly elastic resources of serverless platforms as the traditional network-of-hosts model that cloud applications expect, and (2.) it eliminates the need for explicit run-time orchestration by transparently managing application resources based on signals generated during cloud application executions. With the Imaginary Machines model, unmodified cloud applications become serverless applications. While still based on the network-of-host model, they benefit from the highly elastic resources and do not require runtime orchestration, just like their specialized serverless FaaS counterparts, promising increased resource utilization while reducing management costs.
△ Less
Submitted 30 June, 2024;
originally announced July 2024.
-
Boxer: FaaSt Ephemeral Elasticity for Off-the-Shelf Cloud Applications
Authors:
Michael Wawrzoniak,
Rodrigo Bruno,
Ana Klimovic,
Gustavo Alonso
Abstract:
Elasticity is a key property of cloud computing. However, elasticity is offered today at the granularity of virtual machines, which take tens of seconds to start. This is insufficient to react to load spikes and sudden failures in latency sensitive applications, leading users to resort to expensive overprovisioning. Function-as-a-Service (FaaS) provides significantly higher elasticity than VMs, bu…
▽ More
Elasticity is a key property of cloud computing. However, elasticity is offered today at the granularity of virtual machines, which take tens of seconds to start. This is insufficient to react to load spikes and sudden failures in latency sensitive applications, leading users to resort to expensive overprovisioning. Function-as-a-Service (FaaS) provides significantly higher elasticity than VMs, but comes coupled with an event-triggered programming model and a constrained execution environment that makes them unsuitable for off-the-shelf applications. Previous work tries to overcome these obstacles but often requires re-architecting the applications. In this paper, we show how off-the-shelf applications can transparently benefit from ephemeral elasticity with FaaS. We built Boxer, an interposition layer spanning VMs and AWS Lambda, that intercepts application execution and emulates the network-of-hosts environment that applications expect when deployed in a conventional VM/container environment. The ephemeral elasticity of Boxer enables significant performance and cost savings for off-the-shelf applications with, e.g., recovery times over 5x faster than EC2 instances and absorbing load spikes comparable to overprovisioned EC2 VM instances.
△ Less
Submitted 30 June, 2024;
originally announced July 2024.
-
Dirigent: Lightweight Serverless Orchestration
Authors:
Lazar Cvetković,
François Costa,
Mihajlo Djokic,
Michal Friedman,
Ana Klimovic
Abstract:
While Function as a Service (FaaS) platforms can initialize function sandboxes on worker nodes in 10-100s of milliseconds, the latency to schedule functions in real FaaS clusters can be orders of magnitude higher. We find that the current approach of building FaaS cluster managers on top of legacy orchestration systems like Kubernetes leads to high scheduling delay at high sandbox churn, which is…
▽ More
While Function as a Service (FaaS) platforms can initialize function sandboxes on worker nodes in 10-100s of milliseconds, the latency to schedule functions in real FaaS clusters can be orders of magnitude higher. We find that the current approach of building FaaS cluster managers on top of legacy orchestration systems like Kubernetes leads to high scheduling delay at high sandbox churn, which is typical in FaaS clusters. While generic cluster managers use hierarchical abstractions and multiple internal components to manage and reconcile state with frequent persistent updates, this becomes a bottleneck for FaaS, where cluster state frequently changes as sandboxes are created on the critical path of requests. Based on our root cause analysis of performance issues in existing FaaS cluster managers, we propose Dirigent, a clean-slate system architecture for FaaS orchestration with three key principles. First, Dirigent optimizes internal cluster manager abstractions to simplify state management. Second, it eliminates persistent state updates on the critical path of function invocations, leveraging the fact that FaaS abstracts sandboxes from users to relax exact state reconstruction guarantees. Finally, Dirigent runs monolithic control and data planes to minimize internal communication overheads and maximize throughput. We compare Dirigent to state-of-the-art FaaS platforms and show that Dirigent reduces 99th percentile per-function scheduling latency for a production workload by 2.79x compared to AWS Lambda and can spin up 2500 sandboxes per second at low latency, which is 1250x more than with Knative.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
Authors:
Foteini Strati,
Sara Mcallister,
Amar Phanishayee,
Jakub Tarnawski,
Ana Klimovic
Abstract:
Distributed LLM serving is costly and often underutilizes hardware accelerators due to three key challenges: bubbles in pipeline-parallel deployments caused by the bimodal latency of prompt and token processing, GPU memory overprovisioning, and long recovery times in case of failures. In this paper, we propose DéjàVu, a system to address all these challenges using a versatile and efficient KV cach…
▽ More
Distributed LLM serving is costly and often underutilizes hardware accelerators due to three key challenges: bubbles in pipeline-parallel deployments caused by the bimodal latency of prompt and token processing, GPU memory overprovisioning, and long recovery times in case of failures. In this paper, we propose DéjàVu, a system to address all these challenges using a versatile and efficient KV cache streaming library (DéjàVuLib). Using DéjàVuLib, we propose and implement efficient prompt-token disaggregation to reduce pipeline bubbles, microbatch swap** for efficient GPU memory management, and state replication for fault-tolerance. We highlight the efficacy of these solutions on a range of large models across cloud deployments.
△ Less
Submitted 4 March, 2024;
originally announced March 2024.
-
On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions
Authors:
Maximilian Böther,
Abraham Sebastian,
Pranjal Awasthi,
Ana Klimovic,
Srikumar Ramalingam
Abstract:
Many learning problems hinge on the fundamental problem of subset selection, i.e., identifying a subset of important and representative points. For example, selecting the most significant samples in ML training cannot only reduce training costs but also enhance model quality. Submodularity, a discrete analogue of convexity, is commonly used for solving subset selection problems. However, existing…
▽ More
Many learning problems hinge on the fundamental problem of subset selection, i.e., identifying a subset of important and representative points. For example, selecting the most significant samples in ML training cannot only reduce training costs but also enhance model quality. Submodularity, a discrete analogue of convexity, is commonly used for solving subset selection problems. However, existing algorithms for optimizing submodular functions are sequential, and the prior distributed methods require at least one central machine to fit the target subset. In this paper, we relax the requirement of having a central machine for the target subset by proposing a novel distributed bounding algorithm with provable approximation guarantees. The algorithm iteratively bounds the minimum and maximum utility values to select high quality points and discard the unimportant ones. When bounding does not find the complete subset, we use a multi-round, partition-based distributed greedy algorithm to identify the remaining subset. We show that these algorithms find high quality subsets on CIFAR-100 and ImageNet with marginal or no loss in quality compared to centralized methods, and scale to a dataset with 13 billion points.
△ Less
Submitted 26 February, 2024;
originally announced February 2024.
-
Modyn: A Platform for Model Training on Dynamic Datasets With Sample-Level Data Selection
Authors:
Maximilian Böther,
Viktor Gsteiger,
Ties Robroek,
Ana Klimovic
Abstract:
Machine learning training data is often dynamic in real-world use cases, i.e., data is added or removed and may experience distribution shifts over time. Models must incorporate this evolving training data to improve generalization, adapt to potential distribution shifts, and adhere to privacy regulations. However, the cost of model (re)training is proportional to how often the model trains and on…
▽ More
Machine learning training data is often dynamic in real-world use cases, i.e., data is added or removed and may experience distribution shifts over time. Models must incorporate this evolving training data to improve generalization, adapt to potential distribution shifts, and adhere to privacy regulations. However, the cost of model (re)training is proportional to how often the model trains and on how much data it trains on. While ML research explores these topics in isolation, there is no end-to-end open-source platform to facilitate the exploration of model retraining and data selection policies and the deployment these algorithms efficiently at scale.
We present Modyn, a platform for model training on dynamic datasets that enables sample-level data selection and triggering policies. Modyn orchestrates continuous training pipelines while optimizing the underlying system infrastructure to support fast access to arbitrary data samples for efficient data selection. Modyn's extensible architecture allows users to run training pipelines without modifying the platform code, and enables researchers to effortlessly extend the system. We evaluate Modyn's training throughput, showing that even in memory-bound recommendation systems workloads, Modyn is able to reach 80 to 100 % of the throughput compared to loading big chunks of data locally without sample-level data selection. Additionally, we showcase Modyn's functionality with three different data selection policies.
△ Less
Submitted 11 December, 2023;
originally announced December 2023.
-
DeltaZip: Multi-Tenant Language Model Serving via Delta Compression
Authors:
Xiaozhe Yao,
Ana Klimovic
Abstract:
Fine-tuning large language models (LLMs) for downstream tasks can greatly improve model quality, however serving many different fine-tuned LLMs concurrently for users in multi-tenant environments is challenging. Dedicating GPU memory for each model is prohibitively expensive and naively swap** large model weights in and out of GPU memory is slow. Our key insight is that fine-tuned models can be…
▽ More
Fine-tuning large language models (LLMs) for downstream tasks can greatly improve model quality, however serving many different fine-tuned LLMs concurrently for users in multi-tenant environments is challenging. Dedicating GPU memory for each model is prohibitively expensive and naively swap** large model weights in and out of GPU memory is slow. Our key insight is that fine-tuned models can be quickly swapped in and out of GPU memory by extracting and compressing the delta between each model and its pre-trained base model. We propose DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently by aggressively compressing model deltas by a factor of $6\times$ to $8\times$ while maintaining high model quality. DeltaZip increases serving throughput by $1.5\times$ to $3\times$ and improves SLO attainment compared to a vanilla HuggingFace serving system.
△ Less
Submitted 8 December, 2023;
originally announced December 2023.
-
tf.data service: A Case for Disaggregating ML Input Data Processing
Authors:
Andrew Audibert,
Yang Chen,
Dan Graur,
Ana Klimovic,
Jiri Simsa,
Chandramohan A. Thekkath
Abstract:
Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. To avoid data stalls, the hos…
▽ More
Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. To avoid data stalls, the host CPU and RAM required for input data processing per accelerator core used for ML computations varies across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to either under-utilizing the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system.
We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can horizontally scale-out to right-size CPU/RAM host resources for data processing in each job, saving 32x training time and 26x cost, on average. Second, the service can share ephemeral preprocessed data results across jobs, to optimize CPU usage and reduce redundant computations. Finally, the service supports coordinated reads, a technique that avoids stragglers due to different input sizes in distributed training, reducing training time by 2.2x, on average. Our design is inspired by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy.
△ Less
Submitted 2 January, 2024; v1 submitted 26 October, 2022;
originally announced October 2022.
-
An Elastic Ephemeral Datastore using Cheap, Transient Cloud Resources
Authors:
Malte Brodmann,
Nikolas Ioannou,
Bernard Metzler,
Jonas Pfefferle,
Ana Klimovic
Abstract:
Spot instances are virtual machines offered at 60-90% lower cost that can be reclaimed at any time, with only a short warning period. Spot instances have already been used to significantly reduce the cost of processing workloads in the cloud. However, leveraging spot instances to reduce the cost of stateful cloud applications is much more challenging, as the sudden preemptions lead to data loss. I…
▽ More
Spot instances are virtual machines offered at 60-90% lower cost that can be reclaimed at any time, with only a short warning period. Spot instances have already been used to significantly reduce the cost of processing workloads in the cloud. However, leveraging spot instances to reduce the cost of stateful cloud applications is much more challenging, as the sudden preemptions lead to data loss. In this work, we propose leveraging spot instances to decrease the cost of ephemeral data management in distributed data analytics applications. We specifically target ephemeral data as this large class of data in modern analytics workloads has low durability requirements; if lost, the data can be regenerated by re-executing compute tasks. We design an elastic, distributed ephemeral datastore that handles node preemptions transparently to user applications and minimizes data loss by redistributing data during node preemption warning periods. We implement our elastic datastore on top of the Apache Crail datastore and evaluate the system with various workloads and VM types. By leveraging spot instances, we show that we can run TPC-DS queries with 60\% lower cost compared to using on-demand VMs for the datastore, while only increasing end-to-end execution time by 2.1%.
△ Less
Submitted 23 May, 2022;
originally announced May 2022.
-
SHiFT: An Efficient, Flexible Search Engine for Transfer Learning
Authors:
Cedric Renggli,
Xiaozhe Yao,
Luka Kolar,
Luka Rimanic,
Ana Klimovic,
Ce Zhang
Abstract:
Transfer learning can be seen as a data- and compute-efficient alternative to training models from scratch. The emergence of rich model repositories, such as TensorFlow Hub, enables practitioners and researchers to unleash the potential of these models across a wide range of downstream tasks. As these repositories keep growing exponentially, efficiently selecting a good model for the task at hand…
▽ More
Transfer learning can be seen as a data- and compute-efficient alternative to training models from scratch. The emergence of rich model repositories, such as TensorFlow Hub, enables practitioners and researchers to unleash the potential of these models across a wide range of downstream tasks. As these repositories keep growing exponentially, efficiently selecting a good model for the task at hand becomes paramount. By carefully comparing various selection and search strategies, we realize that no single method outperforms the others, and hybrid or mixed strategies can be beneficial. Therefore, we propose SHiFT, the first downstream task-aware, flexible, and efficient model search engine for transfer learning. These properties are enabled by a custom query language SHiFT-QL together with a cost-based decision maker, which we empirically validate. Motivated by the iterative nature of machine learning development, we further support efficient incremental executions of our queries, which requires a careful implementation when jointly used with our optimizations.
△ Less
Submitted 28 September, 2022; v1 submitted 4 April, 2022;
originally announced April 2022.
-
Short-lived Datacenter
Authors:
Michael Wawrzoniak,
Ingo Müller,
Rodrigo Bruno,
Ana Klimovic,
Gustavo Alonso
Abstract:
Serverless platforms have attracted attention due to their promise of elasticity, low cost, and fast deployment. Instead of using a fixed virtual machine (VM) infrastructure, which can incur considerable costs to operate and run, serverless platforms support short computations, triggered on demand, with cost proportional to fine-grain function execution time. However, serverless platforms offer a…
▽ More
Serverless platforms have attracted attention due to their promise of elasticity, low cost, and fast deployment. Instead of using a fixed virtual machine (VM) infrastructure, which can incur considerable costs to operate and run, serverless platforms support short computations, triggered on demand, with cost proportional to fine-grain function execution time. However, serverless platforms offer a restricted execution environment. For example, functions have limited execution times, limited resources, and no support for networking between functions. In this paper, we explore what it takes to treat serverless platforms as short-lived, general purpose data-centers which can execute unmodified existing applications. As a first step in this quest, we have developed Boxer, a system providing an execution environment on top of existing functions-as-a-service platforms that allows users to seamlessly migrate conventional VM-based cloud services to serverless platforms. Boxer allows generic applications to benefit from the fine-grain elasticity of serverless platforms without having to modify applications to adopt a restrictive event-triggered programming model or orchestrate auxiliary systems for data communication. We implement Boxer on top of AWS Lambda and extend it to transparently provide standard network interfaces. We describe its implementation and demonstrate how it can be used to run off-the-shelf cloud applications with a degree of fine-grained elasticity not available on traditional VM-based platforms.
△ Less
Submitted 14 February, 2022;
originally announced February 2022.
-
How to use Persistent Memory in your Database
Authors:
Dimitrios Koutsoukos,
Raghav Bhartia,
Ana Klimovic,
Gustavo Alonso
Abstract:
Persistent or Non Volatile Memory (PMEM or NVM) has recently become commercially available under several configurations with different purposes and goals. Despite the attention to the topic, we are not aware of a comprehensive empirical analysis of existing relational database engines under different PMEM configurations. Such a study is important to understand the performance implications of the v…
▽ More
Persistent or Non Volatile Memory (PMEM or NVM) has recently become commercially available under several configurations with different purposes and goals. Despite the attention to the topic, we are not aware of a comprehensive empirical analysis of existing relational database engines under different PMEM configurations. Such a study is important to understand the performance implications of the various hardware configurations and how different DB engines can benefit from them. To this end, we analyze three different engines (PostgreSQL, MySQL, and SQLServer) under common workloads (TPC-C and TPC-H) with all possible PMEM configurations supported by Intel's Optane NVM devices (PMEM as persistent memory in AppDirect mode and PMEM as volatile memory in Memory mode). Our results paint a complex picture and are not always intuitive due to the many factors involved. Based on our findings, we provide insights on how the different engines behave with PMEM and which configurations and queries perform best. Our results show that using PMEM as persistent storage usually speeds up query execution, but with some caveats as the I/O path is not fully optimized. Additionally, using PMEM in Memory mode does not offer any performance advantage despite the larger volatile memory capacity. Through the extensive coverage of engines and parameters, we provide an important starting point for exploiting PMEM in databases and tuning relational engines to take advantage of this new technology.
△ Less
Submitted 1 December, 2021;
originally announced December 2021.
-
Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines
Authors:
Michael Kuchnik,
Ana Klimovic,
Jiri Simsa,
Virginia Smith,
George Amvrosiadis
Abstract:
Input pipelines, which ingest and transform input data, are an essential part of training Machine Learning (ML) models. However, it is challenging to implement efficient input pipelines, as it requires reasoning about parallelism, asynchrony, and variability in fine-grained profiling information. Our analysis of over two million ML jobs in Google datacenters reveals that a significant fraction of…
▽ More
Input pipelines, which ingest and transform input data, are an essential part of training Machine Learning (ML) models. However, it is challenging to implement efficient input pipelines, as it requires reasoning about parallelism, asynchrony, and variability in fine-grained profiling information. Our analysis of over two million ML jobs in Google datacenters reveals that a significant fraction of model training jobs could benefit from faster input data pipelines. At the same time, our analysis indicates that most jobs do not saturate host hardware, pointing in the direction of software-based bottlenecks. Motivated by these findings, we propose Plumber, a tool for finding bottlenecks in ML input pipelines. Plumber uses an extensible and interpretable operational analysis analytical model to automatically tune parallelism, prefetching, and caching under host resource constraints. Across five representative ML pipelines, Plumber obtains speedups of up to 47x for misconfigured pipelines. By automating caching, Plumber obtains end-to-end speedups of over 50% compared to state-of-the-art tuners.
△ Less
Submitted 21 March, 2022; v1 submitted 7 November, 2021;
originally announced November 2021.
-
Towards Demystifying Serverless Machine Learning Training
Authors:
Jiawei Jiang,
Shaoduo Gan,
Yue Liu,
Fanlin Wang,
Gustavo Alonso,
Ana Klimovic,
Ankit Singla,
Wentao Wu,
Ce Zhang
Abstract:
The appeal of serverless (FaaS) has triggered a growing interest on how to use it in data-intensive applications such as ETL, query processing, or machine learning (ML). Several systems exist for training large-scale ML models on top of serverless infrastructures (e.g., AWS Lambda) but with inconclusive results in terms of their performance and relative advantage over "serverful" infrastructures (…
▽ More
The appeal of serverless (FaaS) has triggered a growing interest on how to use it in data-intensive applications such as ETL, query processing, or machine learning (ML). Several systems exist for training large-scale ML models on top of serverless infrastructures (e.g., AWS Lambda) but with inconclusive results in terms of their performance and relative advantage over "serverful" infrastructures (IaaS). In this paper we present a systematic, comparative study of distributed ML training over FaaS and IaaS. We present a design space covering design choices such as optimization algorithms and synchronization protocols, and implement a platform, LambdaML, that enables a fair comparison between FaaS and IaaS. We present experimental results using LambdaML, and further develop an analytic model to capture cost/performance tradeoffs that must be considered when opting for a serverless infrastructure. Our results indicate that ML training pays off in serverless only for models with efficient (i.e., reduced) communication and that quickly converge. In general, FaaS can be much faster but it is never significantly cheaper than IaaS.
△ Less
Submitted 17 May, 2021;
originally announced May 2021.
-
tf.data: A Machine Learning Data Processing Framework
Authors:
Derek G. Murray,
Jiri Simsa,
Ana Klimovic,
Ihor Indyk
Abstract:
Training machine learning models requires feeding input data for models to ingest. Input pipelines for machine learning jobs are often challenging to implement efficiently as they require reading large volumes of data, applying complex transformations, and transferring data to hardware accelerators while overlap** computation and communication to achieve optimal performance. We present tf.data,…
▽ More
Training machine learning models requires feeding input data for models to ingest. Input pipelines for machine learning jobs are often challenging to implement efficiently as they require reading large volumes of data, applying complex transformations, and transferring data to hardware accelerators while overlap** computation and communication to achieve optimal performance. We present tf.data, a framework for building and executing efficient input pipelines for machine learning jobs. The tf.data API provides operators which can be parameterized with user-defined computation, composed, and reused across different machine learning domains. These abstractions allow users to focus on the application logic of data processing, while tf.data's runtime ensures that pipelines run efficiently.
We demonstrate that input pipeline performance is critical to the end-to-end training time of state-of-the-art machine learning models. tf.data delivers the high performance required, while avoiding the need for manual tuning of performance knobs. We show that tf.data features, such as parallelism, caching, static optimizations, and non-deterministic execution are essential for high performance. Finally, we characterize machine learning input pipelines for millions of jobs that ran in Google's fleet, showing that input data processing is highly diverse and consumes a significant fraction of job resources. Our analysis motivates future research directions, such as sharing computation across jobs and pushing data projection to the storage layer.
△ Less
Submitted 23 February, 2021; v1 submitted 28 January, 2021;
originally announced January 2021.
-
Modularis: Modular Relational Analytics over Heterogeneous Distributed Platforms
Authors:
Dimitrios Koutsoukos,
Ingo Müller,
Renato Marroquín,
Ana Klimovic,
Gustavo Alonso
Abstract:
The enormous quantity of data produced every day together with advances in data analytics has led to a proliferation of data management and analysis systems. Typically, these systems are built around highly specialized monolithic operators optimized for the underlying hardware. While effective in the short term, such an approach makes the operators cumbersome to port and adapt, which is increasing…
▽ More
The enormous quantity of data produced every day together with advances in data analytics has led to a proliferation of data management and analysis systems. Typically, these systems are built around highly specialized monolithic operators optimized for the underlying hardware. While effective in the short term, such an approach makes the operators cumbersome to port and adapt, which is increasingly required due to the speed at which algorithms and hardware evolve. To address this limitation, we present Modularis, an execution layer for data analytics based on sub-operators, i.e.,composable building blocks resembling traditional database operators but at a finer granularity. To demonstrate the advantages of our approach, we use Modularis to build a distributed query processing system supporting relational queries running on an RDMA cluster, a serverless cloud platform, and a smart storage engine. Modularis requires minimal code changes to execute queries across these three diverse hardware platforms, showing that the sub-operator approach reduces the amount and complexity of the code. In fact, changes in the platform affect only sub-operators that depend on the underlying hardware. We show the end-to-end performance of Modularis by comparing it with a framework for SQL processing (Presto), a commercial cluster database (SingleStore), as well as Query-as-a-Service systems (Athena, BigQuery). Modularis outperforms all these systems, proving that the design and architectural advantages of a modular design can be achieved without degrading performance. We also compare Modularis with a hand-optimized implementation of a join for RDMA clusters. We show that Modularis has the advantage of being easily extensible to a wider range of join variants and group by queries, all of which are not supported in the hand-tuned join.
△ Less
Submitted 29 September, 2021; v1 submitted 7 April, 2020;
originally announced April 2020.