-
Groundhog: Efficient Request Isolation in FaaS
Authors:
Mohamed Alzayat,
Jonathan Mace,
Peter Druschel,
Deepak Garg
Abstract:
Security is a core responsibility for Function-as-a-Service (FaaS) providers. The prevailing approach has each function execute in its own container to isolate concurrent executions of different functions. However, successive invocations of the same function commonly reuse the runtime state of a previous invocation in order to avoid container cold-start delays when invoking a function. Although ef…
▽ More
Security is a core responsibility for Function-as-a-Service (FaaS) providers. The prevailing approach has each function execute in its own container to isolate concurrent executions of different functions. However, successive invocations of the same function commonly reuse the runtime state of a previous invocation in order to avoid container cold-start delays when invoking a function. Although efficient, this container reuse has security implications for functions that are invoked on behalf of differently privileged users or administrative domains: bugs in a function's implementation, third-party library, or the language runtime may leak private data from one invocation of the function to subsequent invocations of the same function.
Groundhog isolates sequential invocations of a function by efficiently reverting to a clean state, free from any private data, after each invocation. The system exploits two properties of typical FaaS platforms: each container executes at most one function at a time and legitimate functions do not retain state across invocations. This enables Groundhog to efficiently snapshot and restore function state between invocations in a manner that is independent of the programming language/runtime and does not require any changes to existing functions, libraries, language runtimes, or OS kernels. We describe the design of Groundhog and its implementation in OpenWhisk, a popular production-grade open-source FaaS framework. On three existing benchmark suites, Groundhog isolates sequential invocations with modest overhead on end-to-end latency (median: 1.5%, 95p: 7%) and throughput (median: 2.5%, 95p: 49.6%), relative to an insecure baseline that reuses the container and runtime state.
△ Less
Submitted 23 May, 2022;
originally announced May 2022.
-
ACT now: Aggregate Comparison of Traces for Incident Localization
Authors:
Kamala Ramasubramanian,
Ashutosh Raina,
Jonathan Mace,
Peter Alvaro
Abstract:
Incidents in production systems are common and downtime is expensive. Applying an appropriate mitigating action quickly, such as changing a specific firewall rule, reverting a change, or diverting traffic to a different availability zone, saves money. Incident localization is time-consuming since a single failure can have many effects, extending far from the site of failure. Knowing how different…
▽ More
Incidents in production systems are common and downtime is expensive. Applying an appropriate mitigating action quickly, such as changing a specific firewall rule, reverting a change, or diverting traffic to a different availability zone, saves money. Incident localization is time-consuming since a single failure can have many effects, extending far from the site of failure. Knowing how different system events relate to each other is necessary to quickly identify \emph{where} to mitigate. Our approach, Aggregate Comparison of Traces (ACT), localizes incidents by comparing sets of traces (which capture events and their relationships for individual requests) sampled from the most recent steady-state operation and during an incident. In our quantitative experiments, we show that ACT is able to effectively localize more than 99% of incidents.
△ Less
Submitted 13 May, 2022;
originally announced May 2022.
-
The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems
Authors:
Lei Zhang,
Vaastav Anand,
Zhiqiang Xie,
Ymir Vigfusson,
Jonathan Mace
Abstract:
Today's distributed tracing frameworks are ill-equipped to troubleshoot rare edge-case requests. The crux of the problem is a trade-off between specificity and overhead. On the one hand, frameworks can indiscriminately select requests to trace when they enter the system (head sampling), but this is unlikely to capture a relevant edge-case trace because the framework cannot know which requests will…
▽ More
Today's distributed tracing frameworks are ill-equipped to troubleshoot rare edge-case requests. The crux of the problem is a trade-off between specificity and overhead. On the one hand, frameworks can indiscriminately select requests to trace when they enter the system (head sampling), but this is unlikely to capture a relevant edge-case trace because the framework cannot know which requests will be problematic until after-the-fact. On the other hand, frameworks can trace everything and later keep only the interesting edge-case traces (tail sampling), but this has high overheads on the traced application and enormous data ingestion costs.
In this paper we circumvent this trade-off for any edge-case with symptoms that can be programmatically detected, such as high tail latency, errors, and bottlenecked queues. We propose a lightweight and always-on distributed tracing system, Hindsight, which implements a retroactive sampling abstraction: instead of eagerly ingesting and processing traces, Hindsight lazily retrieves trace data only after symptoms of a problem are detected. Hindsight is analogous to a car dash-cam that, upon detecting a sudden jolt in momentum, persists the last hour of footage. Developers using Hindsight receive the exact edge-case traces they desire without undue overhead or dependence on luck. Our evaluation shows that Hindsight scales to millions of requests per second, adds nanosecond-level overhead to generate trace data, handles GB/s of data per node, transparently integrates with existing distributed tracing systems, and successfully persists full, detailed traces in real-world use cases when edge-case problems are detected.
△ Less
Submitted 26 April, 2022; v1 submitted 11 February, 2022;
originally announced February 2022.
-
Aggregate-Driven Trace Visualizations for Performance Debugging
Authors:
Vaastav Anand,
Matheus Stolet,
Thomas Davidson,
Ivan Beschastnikh,
Tamara Munzner,
Jonathan Mace
Abstract:
Performance issues in cloud systems are hard to debug. Distributed tracing is a widely adopted approach that gives engineers visibility into cloud systems. Existing trace analysis approaches focus on debugging single request correctness issues but not debugging single request performance issues. Diagnosing a performance issue in a given request requires comparing the performance of the offending r…
▽ More
Performance issues in cloud systems are hard to debug. Distributed tracing is a widely adopted approach that gives engineers visibility into cloud systems. Existing trace analysis approaches focus on debugging single request correctness issues but not debugging single request performance issues. Diagnosing a performance issue in a given request requires comparing the performance of the offending request with the aggregate performance of typical requests. Effective and efficient debugging of such issues faces three challenges: (i) identifying the correct aggregate data for diagnosis; (ii) visualizing the aggregated data; and (iii) efficiently collecting, storing, and processing trace data.
We present TraVista, a tool designed for debugging performance issues in a single trace that addresses these challenges. TraVista extends the popular single trace Gantt chart visualization with three types of aggregate data - metric, temporal, and structure data, to contextualize the performance of the offending trace across all traces.
△ Less
Submitted 26 October, 2020;
originally announced October 2020.
-
Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
Authors:
Arpan Gujarati,
Reza Karimi,
Safya Alzayat,
Wei Hao,
Antoine Kaufmann,
Ymir Vigfusson,
Jonathan Mace
Abstract:
Machine learning inference is becoming a core building block for interactive web applications. As a result, the underlying model serving systems on which these applications depend must consistently meet low latency targets. Existing model serving architectures use well-known reactive techniques to alleviate common-case sources of latency, but cannot effectively curtail tail latency caused by unpre…
▽ More
Machine learning inference is becoming a core building block for interactive web applications. As a result, the underlying model serving systems on which these applications depend must consistently meet low latency targets. Existing model serving architectures use well-known reactive techniques to alleviate common-case sources of latency, but cannot effectively curtail tail latency caused by unpredictable execution times. Yet the underlying execution times are not fundamentally unpredictable - on the contrary we observe that inference using Deep Neural Network (DNN) models has deterministic performance. Here, starting with the predictable execution times of individual DNN inferences, we adopt a principled design methodology to successively build a fully distributed model serving system that achieves predictable end-to-end performance. We evaluate our implementation, Clockwork, using production trace workloads, and show that Clockwork can support thousands of models while simultaneously meeting 100ms latency targets for 99.9999% of requests. We further demonstrate that Clockwork exploits predictable execution times to achieve tight request-level service-level objectives (SLOs) as well as a high degree of request-level performance isolation.
△ Less
Submitted 26 October, 2020; v1 submitted 3 June, 2020;
originally announced June 2020.
-
Cyclic Bayesian Attack Graphs: A Systematic Computational Approach
Authors:
Isaac Matthews,
John Mace,
Sadegh Soudjani,
Aad van Moorsel
Abstract:
Attack graphs are commonly used to analyse the security of medium-sized to large networks. Based on a scan of the network and likelihood information of vulnerabilities, attack graphs can be transformed into Bayesian Attack Graphs (BAGs). These BAGs are used to evaluate how security controls affect a network and how changes in topology affect security. A challenge with these automatically generated…
▽ More
Attack graphs are commonly used to analyse the security of medium-sized to large networks. Based on a scan of the network and likelihood information of vulnerabilities, attack graphs can be transformed into Bayesian Attack Graphs (BAGs). These BAGs are used to evaluate how security controls affect a network and how changes in topology affect security. A challenge with these automatically generated BAGs is that cycles arise naturally, which make it impossible to use Bayesian network theory to calculate state probabilities. In this paper we provide a systematic approach to analyse and perform computations over cyclic Bayesian attack graphs. %thus providing a generic approach to handle cycles as well as unifying the theory of Bayesian attack graphs. Our approach first formally introduces two commonly used versions of Bayesian attack graphs and compares their expressiveness. We then present an interpretation of Bayesian attack graphs based on combinational logic circuits, which facilitates an intuitively attractive systematic treatment of cycles. We prove properties of the associated logic circuit and present an algorithm that computes state probabilities without altering the attack graphs (e.g., remove an arc to remove a cycle). Moreover, our algorithm deals seamlessly with all cycles without the need to identify their types. A set of experiments using synthetically created networks demonstrates the scalability of the algorithm on computer networks with hundreds of machines, each with multiple vulnerabilities.
△ Less
Submitted 13 May, 2020;
originally announced May 2020.
-
No DNN Left Behind: Improving Inference in the Cloud with Multi-Tenancy
Authors:
Amit Samanta,
Suhas Shrinivasan,
Antoine Kaufmann,
Jonathan Mace
Abstract:
With the rise of machine learning, inference on deep neural networks (DNNs) has become a core building block on the critical path for many cloud applications. Applications today rely on isolated ad-hoc deployments that force users to compromise on consistent latency, elasticity, or cost-efficiency, depending on workload characteristics. We propose to elevate DNN inference to be a first class cloud…
▽ More
With the rise of machine learning, inference on deep neural networks (DNNs) has become a core building block on the critical path for many cloud applications. Applications today rely on isolated ad-hoc deployments that force users to compromise on consistent latency, elasticity, or cost-efficiency, depending on workload characteristics. We propose to elevate DNN inference to be a first class cloud primitive provided by a shared multi-tenant system, akin to cloud storage, and cloud databases. A shared system enables cost-efficient operation with consistent performance across the full spectrum of workloads. We argue that DNN inference is an ideal candidate for a multi-tenant system because of its narrow and well-defined interface and predictable resource requirements.
△ Less
Submitted 23 January, 2019; v1 submitted 21 January, 2019;
originally announced January 2019.