-
Towards Building Autonomous Data Services on Azure
Authors:
Yiwen Zhu,
Yuanyuan Tian,
Joyce Cahoon,
Subru Krishnan,
Ankita Agarwal,
Rana Alotaibi,
Jesús Camacho-Rodríguez,
Bibin Chundatt,
Andrew Chung,
Niharika Dutta,
Andrew Fogarty,
Anja Gruenheid,
Brandon Haynes,
Matteo Interlandi,
Minu Iyer,
Nick Jurgens,
Sumeet Khushalani,
Brian Kroth,
Manoj Kumar,
Jyoti Leeka,
Sergiy Matusevych,
Minni Mittal,
Andreas Mueller,
Kartheek Muthyala,
Harsha Nagulapalli
, et al. (13 additional authors not shown)
Abstract:
Modern cloud has turned data services into easily accessible commodities. With just a few clicks, users are now able to access a catalog of data processing systems for a wide range of tasks. However, the cloud brings in both complexity and opportunity. While cloud users can quickly start an application by using various data services, it can be difficult to configure and optimize these services to…
▽ More
Modern cloud has turned data services into easily accessible commodities. With just a few clicks, users are now able to access a catalog of data processing systems for a wide range of tasks. However, the cloud brings in both complexity and opportunity. While cloud users can quickly start an application by using various data services, it can be difficult to configure and optimize these services to gain the most value from them. For cloud providers, managing every aspect of an ever-increasing set of data services, while meeting customer SLAs and minimizing operational cost is becoming more challenging. Cloud technology enables the collection of significant amounts of workload traces and system telemetry. With the progress in data science (DS) and machine learning (ML), it is feasible and desirable to utilize a data-driven, ML-based approach to automate various aspects of data services, resulting in the creation of autonomous data services. This paper presents our perspectives and insights on creating autonomous data services on Azure. It also covers the future endeavors we plan to undertake and unresolved issues that still need attention.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
End-to-end Optimization of Machine Learning Prediction Queries
Authors:
Kwanghyun Park,
Karla Saur,
Dalitso Banda,
Rathijit Sen,
Matteo Interlandi,
Konstantinos Karanasos
Abstract:
Prediction queries are widely used across industries to perform advanced analytics and draw insights from data. They include a data processing part (e.g., for joining, filtering, cleaning, featurizing the datasets) and a machine learning (ML) part invoking one or more trained models to perform predictions. These parts have so far been optimized in isolation, leaving significant opportunities for o…
▽ More
Prediction queries are widely used across industries to perform advanced analytics and draw insights from data. They include a data processing part (e.g., for joining, filtering, cleaning, featurizing the datasets) and a machine learning (ML) part invoking one or more trained models to perform predictions. These parts have so far been optimized in isolation, leaving significant opportunities for optimization unexplored. We present Raven, a production-ready system for optimizing prediction queries. Raven follows the enterprise architectural trend of collocating data and ML runtimes. It relies on a unified intermediate representation that captures both data and ML operators in a single graph structure to unlock two families of optimizations. First, it employs logical optimizations that pass information between the data part (and the properties of the underlying data) and the ML part to optimize each other. Second, it introduces logical-to-physical transformations that allow operators to be executed on different runtimes (relational, ML, and DNN) and hardware (CPU, GPU). Novel data-driven optimizations determine the runtime to be used for each part of the query to achieve optimal performance. Our evaluation shows that Raven improves performance of prediction queries on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems.
△ Less
Submitted 31 May, 2022;
originally announced June 2022.
-
Query Processing on Tensor Computation Runtimes
Authors:
Dong He,
Supun Nakandala,
Dalitso Banda,
Rathijit Sen,
Karla Saur,
Kwanghyun Park,
Carlo Curino,
Jesús Camacho-Rodríguez,
Konstantinos Karanasos,
Matteo Interlandi
Abstract:
The huge demand for computation in artificial intelligence (AI) is driving unparalleled investments in hardware and software systems for AI. This leads to an explosion in the number of specialized hardware devices, which are now offered by major cloud vendors. By hiding the low-level complexity through a tensor-based interface, tensor computation runtimes (TCRs) such as PyTorch allow data scientis…
▽ More
The huge demand for computation in artificial intelligence (AI) is driving unparalleled investments in hardware and software systems for AI. This leads to an explosion in the number of specialized hardware devices, which are now offered by major cloud vendors. By hiding the low-level complexity through a tensor-based interface, tensor computation runtimes (TCRs) such as PyTorch allow data scientists to efficiently exploit the exciting capabilities offered by the new hardware. In this paper, we explore how database management systems can ride the wave of innovation happening in the AI space.
We design, build, and evaluate Tensor Query Processor (TQP): TQP transforms SQL queries into tensor programs and executes them on TCRs. TQP is able to run the full TPC-H benchmark by implementing novel algorithms for relational operators on the tensor routines. At the same time, TQP can support various hardware while only requiring a fraction of the usual development effort. Experiments show that TQP can improve query execution time by up to 10$\times$ over specialized CPU- and GPU-only systems. Finally, TQP can accelerate queries mixing ML predictions and SQL end-to-end, and deliver up to 9$\times$ speedup over CPU baselines.
△ Less
Submitted 9 February, 2023; v1 submitted 3 March, 2022;
originally announced March 2022.
-
A Tensor Compiler for Unified Machine Learning Prediction Serving
Authors:
Supun Nakandala,
Karla Saur,
Gyeong-In Yu,
Konstantinos Karanasos,
Carlo Curino,
Markus Weimer,
Matteo Interlandi
Abstract:
Machine Learning (ML) adoption in the enterprise requires simpler and more efficient software infrastructure---the bespoke solutions typical in large web companies are simply untenable. Model scoring, the process of obtaining predictions from a trained model over new data, is a primary contributor to infrastructure complexity and cost as models are trained once but used many times. In this paper w…
▽ More
Machine Learning (ML) adoption in the enterprise requires simpler and more efficient software infrastructure---the bespoke solutions typical in large web companies are simply untenable. Model scoring, the process of obtaining predictions from a trained model over new data, is a primary contributor to infrastructure complexity and cost as models are trained once but used many times. In this paper we propose HUMMINGBIRD, a novel approach to model scoring, which compiles featurization operators and traditional ML models (e.g., decision trees) into a small set of tensor operations. This approach inherently reduces infrastructure complexity and directly leverages existing investments in Neural Network compilers and runtimes to generate efficient computations for both CPU and hardware accelerators. Our performance results are intriguing: despite replacing imperative computations (e.g., tree traversals) with tensor computation abstractions, HUMMINGBIRD is competitive and often outperforms hand-crafted kernels on micro-benchmarks on both CPU and GPU, while enabling seamless end-to-end acceleration of ML pipelines. We have released HUMMINGBIRD as open source.
△ Less
Submitted 19 October, 2020; v1 submitted 9 October, 2020;
originally announced October 2020.
-
Cloudy with high chance of DBMS: A 10-year prediction for Enterprise-Grade ML
Authors:
Ashvin Agrawal,
Rony Chatterjee,
Carlo Curino,
Avrilia Floratou,
Neha Gowdal,
Matteo Interlandi,
Alekh **dal,
Kostantinos Karanasos,
Subru Krishnan,
Brian Kroth,
Jyoti Leeka,
Kwanghyun Park,
Hiren Patel,
Olga Poppe,
Fotis Psallidas,
Raghu Ramakrishnan,
Abhishek Roy,
Karla Saur,
Rathijit Sen,
Markus Weimer,
Travis Wright,
Yiwen Zhu
Abstract:
Machine learning (ML) has proven itself in high-value web applications such as search ranking and is emerging as a powerful tool in a much broader range of enterprise scenarios including voice recognition and conversational understanding for customer support, autotuning for videoconferencing, intelligent feedback loops in large-scale sysops, manufacturing and autonomous vehicle management, complex…
▽ More
Machine learning (ML) has proven itself in high-value web applications such as search ranking and is emerging as a powerful tool in a much broader range of enterprise scenarios including voice recognition and conversational understanding for customer support, autotuning for videoconferencing, intelligent feedback loops in large-scale sysops, manufacturing and autonomous vehicle management, complex financial predictions, just to name a few. Meanwhile, as the value of data is increasingly recognized and monetized, concerns about securing valuable data and risks to individual privacy have been growing. Consequently, rigorous data management has emerged as a key requirement in enterprise settings. How will these trends (ML growing popularity, and stricter data governance) intersect? What are the unmet requirements for applying ML in enterprise settings? What are the technical challenges for the DB community to solve? In this paper, we present our vision of how ML and database systems are likely to come together, and early steps we take towards making this vision a reality.
△ Less
Submitted 27 December, 2019; v1 submitted 30 August, 2019;
originally announced September 2019.
-
Morpheus: Safe and Flexible Dynamic Updates for SDNs
Authors:
Karla Saur,
Joseph Collard,
Nate Foster,
Arjun Guha,
Laurent Vanbever,
Michael Hicks
Abstract:
SDN controllers must be periodically modified to add features, improve performance, and fix bugs, but current techniques for implementing dynamic updates are inadequate. Simply halting old controllers and bringing up new ones can cause state to be lost, which often leads to incorrect behavior-e.g., if the state represents hosts blacklisted by a firewall, then traffic that should be blocked may be…
▽ More
SDN controllers must be periodically modified to add features, improve performance, and fix bugs, but current techniques for implementing dynamic updates are inadequate. Simply halting old controllers and bringing up new ones can cause state to be lost, which often leads to incorrect behavior-e.g., if the state represents hosts blacklisted by a firewall, then traffic that should be blocked may be allowed to pass through. Techniques based on record and replay can reconstruct state automatically, but they are expensive to deploy and can lead to incorrect behavior. Problematic scenarios are especially likely to arise in distributed controllers and with semantics-altering updates.
This paper presents a new approach to implementing dynamic controller updates based on explicit state transfer. Instead of attempting to infer state changes automatically-an approach that is expensive and fundamentally incomplete-our framework gives programmers effective tools for implementing correct updates that avoid major disruptions. We develop primitives that enable programmers to directly (and easily, in most cases) initialize the new controller's state as a function of old state and we design protocols that ensure consistent behavior during the transition. We also present a prototype implementation called Morpheus, and evaluate its effectiveness on representative case studies.
△ Less
Submitted 11 September, 2015;
originally announced September 2015.
-
Evolving NoSQL Databases Without Downtime
Authors:
Karla Saur,
Tudor Dumitraş,
Michael Hicks
Abstract:
NoSQL databases like Redis, Cassandra, and MongoDB are increasingly popular because they are flexible, lightweight, and easy to work with. Applications that use these databases will evolve over time, sometimes necessitating (or preferring) a change to the format or organization of the data. The problem we address in this paper is: How can we support the evolution of high-availability applications…
▽ More
NoSQL databases like Redis, Cassandra, and MongoDB are increasingly popular because they are flexible, lightweight, and easy to work with. Applications that use these databases will evolve over time, sometimes necessitating (or preferring) a change to the format or organization of the data. The problem we address in this paper is: How can we support the evolution of high-availability applications and their NoSQL data online, without excessive delays or interruptions, even in the presence of backward-incompatible data format changes?
We present KVolve, an extension to the popular Redis NoSQL database, as a solution to this problem. KVolve permits a developer to submit an upgrade specification that defines how to transform existing data to the newest version. This transformation is applied lazily as applications interact with the database, thus avoiding long pause times. We demonstrate that KVolve is expressive enough to support substantial practical updates, including format changes to RedisFS, a Redis-backed file system, while imposing essentially no overhead in general use and minimal pause times during updates.
△ Less
Submitted 24 April, 2016; v1 submitted 29 June, 2015;
originally announced June 2015.