HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

  • failed: filecontents
  • failed: graphviz
  • failed: xpatch

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: arXiv.org perpetual non-exclusive license
arXiv:2303.08028v3 [cs.DB] 22 Feb 2024

EdgeServe: A Streaming System for Decentralized Model Serving

Ted Shaowang
The University of Chicago
   Sanjay Krishnan
The University of Chicago
Abstract

The relevant features for a machine learning task may arrive as one or more continuous streams of data. Serving machine learning models over streams of data creates a number of interesting systems challenges in managing data routing, time-synchronization, and rate control. This paper presents EdgeServe, a distributed streaming system that can serve predictions from machine learning models in real time. We evaluate EdgeServe on three streaming prediction tasks: (1) human activity recognition, (2) autonomous driving, and (3) network intrusion detection.

1 Introduction

The broader computing community has long understood the importance of telemetry in both physical and digital systems. The growing maturity of AI has created new opportunities for such data, where models can be built to predict future behavior and/or automatically react to current trends – to “close the loop”. This paper presents EdgeServe, a new system that allows for low-latency feedback systems over distributed streams of data.

Efficiently serving predictions from machine learning models is already a crucial part of modern software applications ranging from automatic fraud detection to predictive medicine [3]. Accordingly, a number of model serving frameworks have been developed, including Clipper [18], TensorFlow Serving [46], and InferLine [17]. These frameworks simplify the deployment and interfacing of trained machine-learning models with a service-oriented interface. Typically, they provide a RESTful API that accepts features as inputs (i.e., a prediction “request”), and responds to these requests with predicted labels (i.e., a prediction “response”). These frameworks provide a number of crucial optimizations such as containerizing inference code [18], autoscaling [46], and model ensembling [17].

Existing model serving frameworks were envisioned as components in cloud-based deployments. Implicit to this design, there are several key assumptions: (1) prediction requests arrive asynchronously through the RESTful interface, (2) the request is self-contained with all of the features necessary to issue a prediction, (3) the design prioritizes scalability over the latency of an individual request, (4) and the response is delivered back to the requester. We find that streaming settings challenge this design paradigm. Consider a simple example of a model where the time-ordering of predictions matters (e.g., sensor fusion or forecasting). If such a model is served with a RESTful model-serving framework, there is no inherent message ordering guarantee which is crucial for accurate forecasting. The data processor needs to block processing until a prediction is returned by the framework, and this negates any pipelining or scale-out optimizations present in these frameworks. For such use cases, it is more convenient for developers to think of a machine learning model as an operator applied to one or more continuous streams of data with synchronization, rate limit, and freshness constraints.

To the best of our knowledge, the academic literature on this topic is relatively sparse with most existing work in video analytics [63, 29, 11, 25, 21, 59]. There is also a significant amount of work in real-time systems [6, 10, 45, 8], but few systems focus on model serving. In particular, significant technical challenges arise when the relevant features for a machine learning model are generated on different network nodes than where the model is served. The data has to get to “the right place at the right time” before any prediction can be made, and this communication quickly becomes the primary bottleneck. The problem is further complicated where there are multiple data streams: the data streams have to be time-synchronized and integrated before any predictions can take place. Prior work has shown that placement and synchronization decisions affect both performance and accuracy in nuanced ways [55, 56]. Thus, for low-latency model-serving over distributed streams of data, one has to jointly optimize for communication, rate control, and the fact that minor time alignment deviations typically have a negligible impact on accuracy in model-serving scenarios.

To better understand the tradeoffs, consider the following running example.

Example 1.

In network intrusion detection, machine learning models applied to packet capture data are used to infer anomalous or malicious traffic patterns. Most organizations have geo-distributed private networks spanning multiple clouds and regions. The relevant features for a particular intrusion detection model may be sourced from different packet capture streams at different points in the network. These streams will have to be synchronized and integrated to make any global prediction.

With existing tools, building such applications requires significant developer effort in the design of (1) communication between nodes collecting the streams, (2) the time-alignment strategy for the streams, and (3) the rate control of incoming data. Challenge (1,2,3) create a complex tradeoff space that leads to bespoke solutions [16, 3, 59, 37]. This paper describes a first step towards such a system, called EdgeServe, that addresses this need. Instead of a RESTful service that handles each prediction request asynchronously, EdgeServe routes synchronized streams of data to models that are flexibly placed anywhere in a network. We call such an architecture synchronized prediction to differentiate it from classical model serving, where a collection of model-serving nodes work together to serve predictions over one or more data streams in a temporally coherent way.

Practically, EdgeServe provides a lightweight inference service that can be installed on every node of the network. EdgeServe employs a message broker to route data around different nodes, allowing multiple producers and consumers to operate on the same message queue simultaneously. Users can define data movements and model placements by pointing models to named streams of data rather than their physical locations. Furthermore, the user can program her model and featurization as if there was all-to-all communication in the network, and the actual data routing over the actual network topology is handled seamlessly by EdgeServe. These data streams can be time-synchronized so that inferences that need to look at a particular snapshot in time can appropriately construct features that join data from different sources. Furthermore, the data can be derived from primary sources (e.g., sensors, user data streams, etc.) or can be results of computation (e.g., features/predictions computed from pre-trained models). This flexibility allows users to build complex but robust predictive applications in networks with heterogeneous and disaggregated resources.

While EdgeServe resembles other streaming and data flow systems [2, 5, 10, 41, 45, 8, 34, 1, 47, 6], there are three key novel architectural features due to the model-serving focus.

  • (How to trigger computations?) Data-Triggered Stream Joins. EdgeServe employs a novel temporal join strategy for combining multiple streams of data based on data arrivals (§4).

  • (What are the communication primitives?) Lazy Data Routing. For large data payloads (e.g., high-dimensional data streams), EdgeServe applies an innovative communication protocol called “lazy data routing” where only references to data are sent through the message broker. (§5)

  • (How to ensure reliable behavior?) Prediction Rate Control. EdgeServe presents strategies that can ensure that timely decisions still get made even in the presence of dropped, delayed messages or overloaded models. (§6)

2 Background and Existing Model Serving Frameworks

This section motivates EdgeServe and describes the performance of existing model serving frameworks.

Example 2.

To understand how existing cloud-based systems work, we construct a simplified scenario where a single stream of data is fed into a model-serving framework. Each data item is a 134-dimensional feature vector, the model-serving framework must issue a prediction for each item. The items are streamed into a message broker and dequeued in timestamp order. The goal of this experiment is to illustrate that queueing and communication far outweigh the actual model inference time for typical sensing workloads.

Based on blog posts and tutorials that describe best practices, we developed a few different models serving pipelines on AWS and GCE [23]. We experimented with two different models, a Random Forest and a 3-layer MLP. We used roughly comparable inference hardware on both cloud providers (on AWS SageMaker EC2 P3 and GCE a VertexAI 2.10 Container), and note that this inference hardware is GPU-accelerated.

  • AWS. This model-serving pipeline uses AWS SQS to queue messages and AWS SageMaker to perform the inferences over each queued message.

  • GCE. This model-serving pipeline uses GCE Pub/Sub to queue messages and GCE VertexAI to perform the inferences over each queued message.

  • Inf Only. We run both the AWS and GCE pipelines above in an inference-only mode which only measures the latency of AWS SageMaker and GCE VertexAI repsectively.

We evaluate these two baselines in terms of their end-to-end latency, which is the elapsed time since the execution of the “publish” message to the message queue and the delivered prediction.

Random Forest MLP
Med. P99 Med. P99
AWS 141ms 400ms 116ms 391ms
AWS (inf only) 20ms 25ms 18ms 20ms
GCE 88ms 94ms 74ms 82ms
GCE (inf only) 18ms 22ms 13ms 15ms
Table 1: End-to-end latency for model inference over a 134-dimension sensor stream

In terms of end-to-end latency, existing frameworks are not satisfactory for emerging “real-time” machine learning applications, where the typical latency tolerance is measured in tens of milliseconds. With default cloud tools, one can expect hundreds of milliseconds of latency. Even worse, this is often highly unpredictable. Interestingly enough, the primary source of end-to-end latency is not the model inference itself, but delays in message queuing. Streaming the data to the model becomes a bottleneck, incurring copying costs, queuing costs, and checkpointing/replication costs at the message broker. Cloud-based messaging services were designed to be highly available and reliable, but not particularly aimed for low-latency or time-synchronized applications. These queuing overheads can be made arbitrarily more significant if multiple streams of data are required. Then, there is additional waiting time in the system to align observations across streams.

2.1 What is Streaming Inference?

These numbers indicate the need for a new model serving framework that tightly integrates streaming with model inference. We can build a serving framework that is more suited for streaming data, and we can build a streaming system that is more suited for the typical workloads seen in machine learning serving.

Inference over a Single Stream. Consider a supervised learning inference task. Let x𝑥xitalic_x be a feature vector in psuperscript𝑝\mathcal{R}^{p}caligraphic_R start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT and fθsubscript𝑓𝜃f_{\theta}italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT be a model with parameters θ𝜃\thetaitalic_θ. fθsubscript𝑓𝜃f_{\theta}italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT evaluates at x𝑥xitalic_x and returns a corresponding prediction y𝑦yitalic_y, which is in the set of labels 𝒴𝒴\mathcal{Y}caligraphic_Y. A prediction over a stream of such feature vectors can be thus summarized as:

yt=fθ(xt)subscript𝑦𝑡subscript𝑓𝜃subscript𝑥𝑡y_{t}=f_{\theta}(x_{t})italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )

where t𝑡titalic_t denotes a timestamp for the feature vector. In such a prediction problem, the user must ensure that the featurized data is at “the right place at the right time”: fθsubscript𝑓𝜃f_{\theta}italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT has to be hosted somewhere in a network and xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT has to be appropriately generated and sent to fθsubscript𝑓𝜃f_{\theta}italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT.

Inference over Multiple Streams. Now, let’s imagine that xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is constructed from multiple different streams of data. Each xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT (the original features) can be treated as a concatenation of d𝑑ditalic_d individual streams:

xt=[xt(1)xt(i)xt(d)]subscript𝑥𝑡matrixsubscriptsuperscript𝑥1𝑡subscriptsuperscript𝑥𝑖𝑡subscriptsuperscript𝑥𝑑𝑡x_{t}=\begin{bmatrix}x^{(1)}_{t}&...&x^{(i)}_{t}&...&x^{(d)}_{t}\end{bmatrix}\ italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = [ start_ARG start_ROW start_CELL italic_x start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_CELL start_CELL … end_CELL start_CELL italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_CELL start_CELL … end_CELL start_CELL italic_x start_POSTSUPERSCRIPT ( italic_d ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ]

Each of these streams of data x1(i),,xt(i)subscriptsuperscript𝑥𝑖1subscriptsuperscript𝑥𝑖𝑡x^{(i)}_{1},...,x^{(i)}_{t}italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT might be produced on a different node in a network. Consider the network intrusion detection example (Example 1). Each x(i)superscript𝑥𝑖x^{(i)}italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT corresponds to one of the streams of data (packets from node 1, packets from node 2, packets from node 3). In this case, we have different streams of data x(1),x(2),superscript𝑥1superscript𝑥2x^{(1)},x^{(2)},...italic_x start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , italic_x start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT , … coming in, and we need to aggregate them so that the final prediction arrives in our desired destination node.

If the streams of sub-features are collected independently, they will likely not be time-synchronized. This means, at any given instant, the data at the prediction node comes from a slightly different timestamp:

xt=[xt+ϵ1(1)xt+ϵi(1)xt+ϵd(d)]subscript𝑥𝑡matrixsubscriptsuperscript𝑥1𝑡subscriptitalic-ϵ1subscriptsuperscript𝑥1𝑡subscriptitalic-ϵ𝑖subscriptsuperscript𝑥𝑑𝑡subscriptitalic-ϵ𝑑x_{t}=\begin{bmatrix}x^{(1)}_{t+\epsilon_{1}}&...&x^{(1)}_{t+\epsilon_{i}}&...% &x^{(d)}_{t+\epsilon_{d}}\end{bmatrix}\ italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = [ start_ARG start_ROW start_CELL italic_x start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + italic_ϵ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL … end_CELL start_CELL italic_x start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL … end_CELL start_CELL italic_x start_POSTSUPERSCRIPT ( italic_d ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + italic_ϵ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ]

Each ϵisubscriptitalic-ϵ𝑖\epsilon_{i}italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes a positive or negative offset. The overall time-skew of the prediction problem is ϵ=maxiϵiminjϵjitalic-ϵsubscriptmax𝑖subscriptitalic-ϵ𝑖subscriptmin𝑗subscriptitalic-ϵ𝑗\epsilon=\text{max}_{i}\epsilon_{i}-\text{min}_{j}\epsilon_{j}italic_ϵ = max start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - min start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_ϵ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. In other words, to issue a perfectly synchronized prediction at time t𝑡titalic_t, the earliest stream has to wait for ϵitalic-ϵ\epsilonitalic_ϵ steps to ensure all features are available. This can be even more complicated if different data streams are collected at different frequencies. EdgeServe provides an API for controlling synchronization errors in decentralized prediction deployments (§6).

EdgeServe: Our Contribution. Today’s model serving systems lack the support for flexibly deploying models (or partial models) across a network and routing data and predictions to/from them. Users with such problems today have to design bespoke solutions, which can result in brittle design decisions that are not robust to changes in the network or data. While it is true that prior work has considered decomposing models across a network to optimize throughput [43], this work does not consider latency-sensitive applications nor does it consider disaggregated input data streams. EdgeServe significantly reduces latency in queuing and communication leading to large improvements in end-to-end responsiveness. To motivate the contributions, if we run the same experiment on comparable hardware in AWS as above in EdgeServe, we get the following results:

Random Forest MLP
Med. P99 Med. P99
EdgeServe 21ms 31ms 20ms 29ms
Table 2: End-to-end latency For model inference over a 134-dimension sensor stream using EdgeServe

3 EdgeServe Architecture and API

EdgeServe is a system that facilitates prediction applications on streaming data.

3.1 EdgeServe Overall Workflow

The key difference between EdgeServe and model serving systems is that EdgeServe takes streams as the unit of operations. Existing model serving services all follow the request-response API design, which requires that the prediction results be sent back to the caller. This might not be suitable for streaming use cases, as users likely prefer results elsewhere for decision-making purposes. EdgeServe simply routes results to another message topic that could be consumed by any node.

Existing model serving systems use RESTful APIs to handle inputs and outputs: they take each input data item as an HTTP request and issue a prediction as the HTTP response back to the caller. Requests and responses always appear in pairs. This 1:1 relationship makes it impossible to issue one multi-modal prediction based on multiple data inputs. The only workaround would be to join those data manually and send the joined result as the HTTP request. EdgeServe, on the other hand, uses message queues to route data around. Each data modality forms its own message queue and they can be joined in real time as tuples before passing to the model. The prediction result, depending on how many outputs it consists of, is routed to one or more message queues for downstream operators to consume.

3.2 Execution Layer API

Refer to caption
Figure 1: EdgeServe’s execution layer. Data from data streams 1 and 2 (D1, D2) are paired together since both streams are grouped into topic A.

EdgeServe runs as a process on every node in the network. Every node running EdgeServe can potentially create and consume data streams, and run model inference. One of the nodes is designated as the leader node running our message broker backend. EdgeServe extends Apache Pulsar [9] to build a low-latency message broker backend to transfer messages between nodes. This is the node that coordinates message routing and maintains a canonical clock for the network. This leader can be selected through a leader election algorithm (e.g.,  [39]), or can simply be selected by the user. The leader is also responsible for dispatching user-written code to the other nodes on the network. EdgeServe assumes that these nodes are connected via a standard TCP/IP network and every node can directly communicate with the leader. While EdgeServe does not require all-to-all communication, having this capability can be advantageous, especially with large payloads. This is because an optimization technique known as lazy data routing5) utilizes all-to-all communication.

Data Streams API: Any node on the network, including the leader, can register globally-visible data streams to the network. All data in EdgeServe are represented as infinite streams of data. These streams can be of any serializable data type and leverage Python iterator syntax. To invoke EdgeServe, the user simply needs to wrap each data stream as a Python generator and register the stream with the leader. Other nodes on the network can read from this stream of data by accessing an iterator-like interface.

Streams are further grouped into “topics” representing joint predictive tasks, as illustrated in Figure 1. For example, the streams from “packet capture 1” and “packet capture 2” could be combined for a particular model. Grou** streams into topics gives the system information on which streams have to be synchronized and joined together. Each message contains details about its originating data stream and associated topic.

Models API: Over these streams of data, we would like to compute different machine learning inferences. A “model” object encapsulates such computation. A model consumes one or more input data streams from the same topic, and outputs one or more data streams. We take a general view of what a model is: a model is simply a unit of computation that is performed synchronously over a stream of data. In EdgeServe, a “model” is just an operator that produces predictions triggered by the input streams. This stream of predictions can be further combined into topics that other models consume. The same model object can represent a sub-model (e.g., one member of an ensemble), or a featurizer (e.g., a function that computes a set of features).

Models that process these streams take multiple data streams as input, e.g., a multi-modal model. These data streams need to be temporally synchronized before going to the model. This logic is encapsulated in a component called a “joiner”. A joiner fills the gap between data streams and the multimodal model. It consumes data from multiple streams and produces a single iterator interface for models. We discuss more details on the joiner in §4.

Our model API is specifically designed to simplify decentralized deployments where the output of one set of models is consumed by others (e.g., an ensemble). We treat ensembling just like another model, which takes other models’ predictions as inputs, and our system is able to combine them together in a time-synchronized way. Users only need to focus on the actual ensembling algorithm and leave communication and placement details to our system.

4 Highly-Responsive Data Stream Joins

In typical time-series databases, band-joins are used to integrate such series [19, 31], where all items within a certain time-delta are grouped together. True band-joins are challenging in streaming systems where data may arrive out-of-order or in a bursty way leading to potentially unbounded buffering, so existing streaming systems offer an approximation using tumbling time windows [6, 8, 45, 10, 4, 27]. All items that fall within the same tumbling time window are grouped together. We refer to this type of streaming join as a time-triggered join, i.e, the join condition is triggered by a clock tick.

This type of join can cause delays that affect the timeliness of predictions. A multimodal model’s response time to new data is limited by the width of the tumbling window. An alternative to time-triggered joins is to join eagerly as soon as a new item is published to the stream, which we call a data-triggered join. The novelty of this approach is that it is responsive to new data, while having a bounded buffer size. We will discuss both join methods in more detail below.

4.1 Time-triggered Join

As the name suggests, a time-triggered join buffers incoming messages from all streams over a time window, and triggers a join result at the end of the time window. Within each time window, only the latest message from each stream is kept as that stream’s input. The definition of ‘latest’ here can be either event time or processing time. These latest messages are combined as a tuple before sending downstream.

Figure 2: Time-triggered join.
Refer to caption
Refer to caption
Figure 2: Time-triggered join.
Figure 3: Data-triggered join.

Figure 3 is an example of a time-triggered join. In this example, we assume event time and processing time are the same for simplicity. The join results are as follows: join 1 (A1, B1, C1, D1); join 2 (A1, B2, C1, D2); join 3 (A1, B3, C2, D2); join 4 (A2, B3, C2, D2). Intuitively, waiting for a time-triggered join resembles waiting for a bus. Since B2 arrives immediately after the join at t=T𝑡𝑇t=Titalic_t = italic_T was issued, it will have to wait until t=2T𝑡2𝑇t=2Titalic_t = 2 italic_T to get processed, resulting in a longer waiting time. On the other hand, time-triggered joins are beneficial when joins are desired at fixed frequencies, as they smooth out the burstiness of incoming data.

The state management for a time-triggered join is rather straightforward. If the join is based on processing time, only the latest messages from each stream need to be buffered. If the join is based on event time, all messages within a fixed time window need to be additionally buffered, in case they arrive out of order in terms of event time.

4.2 Data-triggered Join

An alternative to a time-triggered join is to perform a join whenever a new piece of data from any stream arrives. Intuitively, the latest known data from all streams are buffered in order to join with the new data item (underlined in the following example). We will show how this works precisely in the two-way case, and it should be clear how to extend this to a multi-way join.

Given two streams StreamA and StreamB, the algorithm tracks the latest known item from each stream and its timestamp. Each time the stream publishes a new data item, it is joined with the latest known item from the other stream. As before, the timestamp can refer to either event time or processing time. However, the order of joined tuples is not guaranteed in terms of event time, since the joining process depends on when the data is actually received by the joiner.

Data-Triggered Join Algorithm Given: StreamA, StreamB Set: (a,at)(,)normal-←𝑎subscript𝑎𝑡(a,a_{t})\leftarrow(\emptyset,-\infty)( italic_a , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ← ( ∅ , - ∞ ), (b,bt)(,)normal-←𝑏subscript𝑏𝑡(b,b_{t})\leftarrow(\emptyset,-\infty)( italic_b , italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ← ( ∅ , - ∞ ) onStreamA(x: data, t: timestamp): 1. If b is not \emptyset, yield (x,b)𝑥𝑏(x,b)( italic_x , italic_b ) 2. If t>at𝑡subscript𝑎𝑡t>a_{t}italic_t > italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, (a,at)(x,t)𝑎subscript𝑎𝑡𝑥𝑡(a,a_{t})\leftarrow(x,t)( italic_a , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ← ( italic_x , italic_t ) onStreamB(x: data, t: timestamp): 1. If a is not \emptyset, yield (a,x)𝑎𝑥(a,x)( italic_a , italic_x ) 2. If t>bt𝑡subscript𝑏𝑡t>b_{t}italic_t > italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, (b,bt)(x,t)𝑏subscript𝑏𝑡𝑥𝑡(b,b_{t})\leftarrow(x,t)( italic_b , italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ← ( italic_x , italic_t )

Figure 3 is an example of a data-triggered join. Again for simplicity, we assume event time and processing time are the same. The join results are as follows: join 1 (A1, B1, C1, D1); join 2 (A1, B2, C1, D1); join 3 (A1, B2, C1, D2); join 4 (A1, B3, C1, D2); join 5 (A1, B3, C2, D2); join 6 (A2, B3, C2, D2). In this way, we ensure that the system reacts immediately to new data at the expense of more frequent joins. Data-triggered join is preferred when at least one data stream is bursty, as it is difficult to set a good time window with bursty data involved.

4.3 When Is Data-triggered Join Better

Data-triggered joins are more suitable for event-based streams whereas time windows are good aggregators for continuous data streams (e.g. sensor data). In certain cases such as activity recognition, there is no activity of interest for most of the time. For example, a Nest Cam only emits data when it finds people, vehicles, or animals in sight. Data-triggered joins can capture these events as soon as they happen. It is possible to combine time-triggered join and data-triggered join to get the best of two worlds. §6.2 describes a hybrid method that primarily operates on a data-triggered basis, while strategically integrating time-triggered elements as a rate limiter. Data-triggered joins further provide a completeness guarantee to the downstream data consumer. Every message is guaranteed to be present in at least one join tuple.

In both time-triggered and data-triggered joins, we see repeated data appearing in multiple join results due to the lower frequency of some streams. Ideally, we want to send at most one copy of the same data over the network to avoid unnecessary bandwidth usage, especially if the data payload is large. Section 5 proposes a novel technique called lazy data routing to address this problem.

5 Lazy Data Routing

The message broker system consists of a leader that orchestrates the entire message flow and multiple producers/consumers as message endpoints. Data streams as producers publish data to the leader, and models as consumers consume data from the leader. With this architecture, the leader can quickly become a point of contention since it has to process all the messages from/to all the different nodes. Furthermore, large message payloads (e.g., images) can lead to a crucial networking bottleneck at the leader, as message broker systems are not designed to handle large messages.

EdgeServe uses a novel messaging protocol to efficiently transfer data between nodes without placing an undue burden on the leader. A message sent to the leader only contains message headers: a timestamp and a global source path. The actual message payloads are not transferred; instead, they are kept and indexed on the node that collected the data. A model subscribes to the topic and reads the headers as they come in. If it wants a particular data payload, it retrieves that data lazily from the source node.

Figure 4 illustrates this protocol. When collecting data, every data item added to a DataStream is annotated with a header (Figure 4-1). We can think of this as a stream of (h,d)𝑑(h,d)( italic_h , italic_d ) tuples (header and data, respectively). After the tuple is created, the node locally writes the data to a time-indexed log (Figure 4-2). After this data is durably written, the header is published to the message broker on the leader (Figure 4-3). Nodes on the network can subscribe to streams of these headers. Model inference requires the data payload, and that can be requested from the headers (Figure 4-4). This data is transferred in a peer-to-peer fashion, and inferences happen over these streams (Figure 4-5).

Refer to caption
Figure 4: A figure illustrating the order of operations in the lazy data routing system used by EdgeServe.

Lazy retrieval has a number of essential benefits for typical model-serving tasks. In general, these benefits are analogous to that of lazy computation. First, many models predict at rates much slower than the rate of data collection. For example, a model that takes 30ms to evaluate can only process one example every 30ms. If the data collection rate is significantly faster than that, the model often has to downsample the input data. Lazy data routing allows us to avoid transferring the data payload to the leader in these cases. Next, this strategy reduces the size of the messages processed by the message broker reducing overheads in checkpointing and serialization/de-serialization. As a result, we also find that it can enable increased parallelism as well. Both of these benefits can be tied back to the traits of the machine learning setting mentioned above. (See §7.5.2 for relevant experiments)

In certain cases, we allow users to force EdgeServe to have eager message passing. Small messages, such as 1D arrays, can be transferred from data source nodes to worker nodes via the leader node. Essentially, this embeds the payload in the message headers. In some networks, peer-to-peer communication is not available or not efficient. We can default to eager message passing when needed to support these cases.

6 Prediction Rate Control

Next, we show how to ensure this execution layer can meet particular model-serving service level objectives (SLOs). We leverage statistical approximations that exploit temporal similarity in typical data streams. Every model in EdgeServe is annotated with three timing parameters: (1) if it consumes multiple streams, a maximum tolerable skew, (2) a target prediction frequency, which is an output rate limiter, and (3) a freshness threshold, designed to discard stale messages originating from an earlier time.

6.1 Message Skew

EdgeServe gives the programmer an illusion of stream alignment, namely, streams associated with the same topic can be thought of as synchronized from the perspective of machine learning modeling. The consuming models receive tuples of headers corresponding to data from each of the sources.

Under the hood, EdgeServe has to buffer streams locally to keep up this illusion. The different data streams will arrive at different rates and have different system delays that cause misalignment. We use a time interval-based interface for specifying alignment criteria. Every topic has a maximum allowed time-skew (§2.1) between headers that can be produced. Locally, the buffer retains a header until it receives matching header messages from other streams or the time-skew expires. Thus, we can enforce a bounded-skew synchronization on the model side. It is up to the user to set a reasonable time-skew limit for her specific task. If the allowed time-skew is overly long, there is a risk of her encountering messages that lack proper synchronization. Conversely, setting the time-skew limit too restrictively may result in the loss of some actually synchronized predictions owing to this stringent threshold.

6.2 Hybrid Time- and Data-triggered Join

The hybrid join in EdgeServe is an innovative approach that combines the principles of both data-triggered and time-triggered joins. This method is designed to efficiently handle the challenges posed by high-velocity data streams and the processing capacity of models.

In essence, the hybrid join operates on two fundamental principles: rapid response to new data (inherited from the data-triggered join) and effective data management to avoid overloading the downstream model (inspired by the time-triggered join and backpressure mechanism [58]). When new data arrives from any stream, the hybrid join promptly triggers a joining operation, similar to a data-triggered join. This ensures that the system remains responsive to incoming data, allowing for timely processing and analysis.

However, to address the issue of data arriving faster than the downstream model can handle, the hybrid join incorporates a critical feature from the time-triggered approach: setting a minimum interval between consecutive processing instances, which we call target prediction frequency. This interval acts as a throttle, ensuring that the downstream model is not overwhelmed by a continuous influx of data. If data arrives more rapidly than this set interval, the hybrid join mechanism will simply drop earlier data in the queue. This decision is based on the understanding that data delayed excessively in the queue may no longer be accurate or relevant for real-time decision-making.

By integrating these two approaches, the hybrid join offers a balanced solution that maximizes responsiveness to new data while maintaining a manageable processing load for the downstream model.

6.3 Freshness Threshold

As a streaming system, EdgeServe prioritizes the timeliness of incoming data. Stale data is essentially inaccurate data for latency-sensitive tasks. Making use of such outdated information in real-time decision-making can lead to disastrous outcomes. Our freshness threshold ensures the recency of data that the model can take. It also acts as a rate limit when data arrives faster than the model inference rate.

7 Experiments

7.1 Experimental Setup

All of our experiments are performed on a private “edge cluster”. Our hardware setup consists of 5 NVIDIA Jetson Nano Developer Kits, 4 Intel Skylake NUC computers, and a desktop PC. Each NUC is equipped with an Intel Core i3-6100U CPU, 16 GB RAM, and M.2 SSD. The desktop PC features an Intel Xeon CPU E5-2603 v4 CPU, NVIDIA Quadro P6000 GPU, 64 GB RAM, and HDD. Direct peer-to-peer connection is available between all nodes via 1Gbps Ethernet. Throughout these experiments, we vary the network topology to test various scenarios. In some experiments, only partial nodes are used. These variations will be explained in respective sections, but one NUC is always used as the leader node.

As a primary baseline, we have configured PyTorch distributed [35] on our edge cluster, with Gloo as the distributed communication backend. We have also implemented an eager data routing architecture similar to ROS [49] within our framework to understand the key design decisions. ROS is widely used in the sensor and robotics communities and provides a centralized message broker service. However, ROS does not support lazy data routing, distributed stream synchronization, and adaptive rate control. Additionally, we implemented a time-triggered join strategy similar to Apache Flink [6]. We also set up a local NTP server to make sure all nodes share a global wall clock time.

We borrow evaluation metrics from the streaming literature and a detailed description of these metrics is in Appendix B.

7.2 Application: Human Activity Recognition

Description. We use the Opportunity dataset for human activity recognition [52, 15] as an example. Data from multiple motion sensors were collected about every 33ms while users were executing typical daily activities. For each subject, there are five activity of daily living (ADL) runs, and each run lasts 15-30 minutes. We take the first subject’s first four ADL runs as the training set and the last ADL run as the test set. When played at 2x speed, the last ADL run takes 8 minutes and 22 seconds. We partition the first 134 columns vertically into four disjoint subsets, each placed on one of four nodes (3 NUCs and 1 Jetson Nano) as data sources. The subsets are distributed as follows: columns 1-37 (accelerometers), 38-76 (IMU back and right arm), 77-102 (IMU left arm), and 103-134 (IMU shoes). We train an aggregated random forest model with scikit-learn [48] for all 134 features as an early fusion baseline, and also four separate random forest models for each subset of features to evaluate an ensemble-based late fusion method. We primarily evaluate EdgeServe with the late fusion deployment: one RF model at each data source node and we ensemble local predictions at another node. This simulates a scenario where there is a small amount of compute on each wearable sensor and that compute is used to reduce the data communicated to make a global prediction across those sensors. Due to space limits, we defer the early vs. late fusion discussion to Appendix C.

In our best-effort PyTorch implementation, we use the gather() API to aggregate data from multiple data source nodes. PyTorch distributed requires all tensors to be the same size to be gathered, so we have to pad each local tensor to the maximum size with zeros. The individual streams are fast enough that misalignments can occur due to queueing delays. However, PyTorch enforces that multiple data sources are always perfectly synchronized, as it does not begin the actual computation until data from all data sources has been gathered, and it only gathers new data after finishing previous predictions. Such strict requirement does not exist in EdgeServe, as we are able to set a reasonable skew (§6.1) in EdgeServe.

Queueing in EdgeServe is better suited for real-time applications.

Figure 5: Measure of backlog in the activity recognition task. More frequent predictions are on the left side.
Refer to caption
Refer to caption
Figure 5: Measure of backlog in the activity recognition task. More frequent predictions are on the left side.
Figure 6: Standard deviation of actual prediction frequency, where EdgeServe maintains a lower variability.

First, we evaluate the ability of the system to even issue real-time predictions by measuring the backlog in the system, or the accumulated queuing time as defined in Appendix B.2. Unlike PyTorch, EdgeServe deployments have a prediction frequency target and can use this target to automatically downsample data to meet real-time requirements. We illustrate the improvements in Figure 6. The x-axis is the target prediction frequency (§6.2) designated by the end-user, where a larger number means a lower frequency; the y-axis is the backlog for each of the serving systems over this dataset. The compute part of the task itself takes about 23ms to complete, and a near-zero number in backlog means the inference is processed in real time. EdgeServe offers a no-backlog queue for a wider range of prediction frequency targets (27absent27\geq 27≥ 27ms/pred). However, a long queue of unprocessed examples is quickly developed without proper rate control (e.g. when target prediction frequency 26absent26\leq 26≤ 26ms per prediction). Since PyTorch lacks a message queue and rate control, it has to process each example individually and trigger joins in a strictly synchronous manner, leading to an unsatisfactory backlog.

Figure 7: Overall real-time accuracy for human activity recognition task measured in F-1 score.
Refer to caption
Refer to caption
Figure 7: Overall real-time accuracy for human activity recognition task measured in F-1 score.
Figure 8: CDF of end-to-end latency for eager data routing vs. lazy data routing.

Even if PyTorch could meet real-time prediction targets, we find that the variability in prediction latencies is quite high. In Figure 6, we see a much higher variability in actual prediction frequencies for PyTorch than EdgeServe across all user-defined rates. This is because PyTorch communicates in a synchronous fashion, and has to account for the variability of all 4 nodes making local predictions with local data streams.

Queueing Delays Reduce “Real-Time” Accuracy. In real-time serving scenarios, the timeliness of predictions becomes a key concern. For latency-sensitive tasks, a delayed prediction equates to an incorrect one. To evaluate the timeliness of predictions, we introduce real-time accuracy as a measure, which evaluates the accuracy of predictions against the most recent label at the time of prediction. For instance, if a prediction is made between two consecutive labels at times t1subscript𝑡1t_{1}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and t2subscript𝑡2t_{2}italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT (t1t2subscript𝑡1subscript𝑡2t_{1}\leq t_{2}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ≤ italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT), its accuracy is compared with the label at t1subscript𝑡1t_{1}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT. Since we assume adjacent examples are likely similar, we expect roughly correct prediction results when the examples arrive slightly late. However, if the examples arrive significantly late, they are likely outdated and yield incorrect predictions.

Figure 8 shows the real-time accuracy of EdgeServe and PyTorch under various target prediction frequencies. PyTorch distributed is not able to issue accurate predictions because data is communicated in a synchronous manner. It is unable to downsample the input stream even if the node is overloaded, making most of its predictions outdated. In contrast, EdgeServe, at a target prediction frequency of 25ms, experiences a greater backlog compared to PyTorch but achieves superior real-time accuracy. This advantage is primarily due to the experiment setup of 3 NUCs and 1 Jetson Nano for local model inference. The NUCs process the CPU model more efficiently than the Jetson Nano, resulting in a significant portion of the backlog being attributed to the Jetson Nano, as it completes local inference later than the NUCs. This situation leads to a notable message skew. To mitigate this, EdgeServe selectively skips data that exceeds the maximum tolerable skew (§6.1). This strategy of skip** mostly inaccurate data significantly boosts EdgeServe’s real-time accuracy. Furthermore, when the target prediction frequency is set above 26ms/pred, EdgeServe sees less backlog and achieves even higher real-time accuracy. This improvement results from EdgeServe’s capability to instantly process fresher data that, while not perfectly synchronized, falls within an acceptable time skew.

7.3 Application: Autonomous Driving

We use a subset of the nuScenes self-driving dataset for autonomous driving [13] consisting of 6 cameras and a lidar sensor. All cameras generate 10 frames per second and the lidar sensor emits at 2 Hz. Each camera is connected to a separate NVIDIA Jetson Nano running pre-trained YOLOv5n model [28] on GPU. The lidar sensor is connected to a NUC node, which preprocesses the data and then transfers it to our desktop PC equipped with NVIDIA Quadro P6000 GPU running pre-trained CenterPoint model [61]. Communication incurs considerable cost here as preprocessed lidar data is very large. All predictions are sent to another NUC node, which triggers the join and yields synchronized predictions.

Queueing time (ms) Time-triggered (Flink-like) Data-triggered (Ours) Speedup
Lidar Med. 463.60 16.56 28.00x
P95 963.20 48.35 19.92x
Cam 1 Med. 51.04 5.64 9.06x
P95 126.06 13.56 9.30x
Cam 2 Med. 64.05 9.59 6.68x
P95 124.08 18.27 6.79x
Cam 3 Med. 56.62 9.76 5.80x
P95 143.45 17.84 8.04x
Cam 4 Med. 74.30 5.23 14.19x
P95 117.08 13.08 8.95x
Cam 5 Med. 45.32 7.63 5.94x
P95 88.87 15.90 5.59x
Cam 6 Med. 57.33 5.38 10.66x
P95 109.61 13.20 8.31x
Table 3: Queueing time for time- vs. data-triggered joins.

First, we compare the end-to-end latency between eager and lazy data routing, applying data-triggered join in both scenarios. In the lazy data routing approach, we implemented a freshness threshold SLO (§6.3) that discards data older than 500ms. As depicted in Figure 8, the CDF of end-to-end latency demonstrates that lazy data routing significantly reduces latency by only pulling data with recent timestamps. This reduction is particularly notable since communication is the primary bottleneck in this task. Notably, lazy data routing led to the skip** of 72.5% of predictions that failed to meet our freshness threshold SLO compared to eager data routing.

Second, we compare the queueing time (as defined in B) between time-triggered and data-triggered joins, applying lazy data routing in both scenarios. For time-triggered join, we set the time interval of joins to be every 1 second. For data-triggered join, we issue a join as soon as a new example that meets our freshness threshold SLO comes in. Table 3 shows the median and 95th percentile of queueing time for each data source. Data-triggered join reduces the queueing time by up to 28x as it does not have to wait for fixed intervals.

7.4 Application: Network Intrusion Detection

EdgeServe natively allows multiple producers and multiple consumers to operate on the shared message queue at the same time, which is an essential communication paradigm in decentralized prediction but not currently supported by PyTorch or TensorFlow. We use a public Network Intrusion Detection dataset from Canadian Institute for Cybersecurity (CIC-IDS2017) [57] and an existing model [33] to differentiate malicious traffic from benign network traffic. Specifically, we partition the data horizontally into four disjoint subsets by “Source IP” for our four data source nodes. The underlying assumption is that network traffic from different source IP addresses may be collected separately.

If a web attack is detected, the related network packet needs to be sent to a specific destination node, but the actual computation can be done anywhere. We show that EdgeServe can support three deployment strategies: (Early fusion, topology 1) transfer all data to the prediction node that does all computations in a centralized way; (Early fusion with parallelism, topology 2) transfer all data from data source nodes to an intermediate shared queue, where four prediction nodes can pull data from when they become available, and they need to inform the destination node if an attack is detected; or (Late fusion, topology 3) data source nodes do computations locally and only transfer data to the destination node if an attack is detected.

In an early fusion setting, PyTorch distributed is able to process 41.94 examples per second, while EdgeServe can process 47.58 examples per second. This is the baseline setting of both systems, and the performances of both systems are comparable. In an early fusion with parallelism setting that is only supported by EdgeServe, thanks to its queuing design, 182.57 examples are processed per second, which is almost a linear (3.84x) speedup compared to a centralized setting given that we now have 4 prediction nodes. In a late fusion setting, we make all 4 data source nodes also local prediction nodes, and PyTorch achieves 181.33 examples per second while EdgeServe takes 197.30 examples per second. For both systems, superlinear speedup (4.32x and 4.15x compared to centralized, respectively) is achieved by making the most of local computational resources and communicating only local prediction results instead of the entire dataset. Since we use the same model/sub-models for both EdgeServe and PyTorch without synchronization issues, the accuracies of predictions between both systems are the same.

7.5 Micro-benchmarks

7.5.1 Data-triggered Joins Are More Responsive

Join strategy Reaction time
Median P95
Data-triggered 9.02ms 10.28ms
Time-triggered (time window: 1s) 0.5s 0.9s
Time-triggered (time window: 5s) 2.5s 4.7s
Table 4: Reaction time for data- and time-triggered joins.

This micro-benchmark evaluates the responsiveness between time-triggered join and data-triggered join. We use two NUC nodes as data sources, one NUC node as the message broker, and one more NUC node performing the join. We show how data-triggered joins significantly reduce reaction time for latency-sensitive applications. We have a steady stream that arrives every 5 seconds (5 MB each) and a bursty stream that arrives at 10 Hz (1 Byte each) for a minute. Real-time decision-making requires the latest information from both streams. The reaction time (as defined in Appendix B.1) for both join strategies during that one minute is shown in Table 4. Data-triggered join achieves a much better reaction time without the difficulty of setting a reasonable time window.

7.5.2 Benefits of Lazy Data Routing

In §5, we described our lazy data routing model as an alternative to a ROS-like system that eagerly transfers raw data through a centralized broker. Now, we evaluate the pros and cons of the lazy data routing model. We employ one NUC node as the data source and one NUC node as the receiver in this subsection, except for the parallelism experiment where the number of receiver nodes is varied.

Refer to caption
Figure 9: Lazy data routing reduces latency on producer side but has fixed overhead on consumer side. Both axes are log-scaled.
Refer to caption
Figure 10: Lazy data routing reduces latency on producer side but has fixed overhead on consumer side. Both axes are log-scaled.

First, we send a series of messages of different sizes from a data source node to a receiver node, through the leader node. No actual computation is performed. We compare the communication latency between eager and lazy data routing.

Lazy Data Routing Reduces Latency on Producer but Has Fixed Overhead on Consumer. Since we only need to transfer the headers instead of raw data in our lazy data routing model, the latency on producer side remains a negligible number even if the message is huge. As shown in Figure 9a and 10a, a ROS-like eager data routing model could result in very high latency when sending large messages, which itself could force subsequent messages to queue up and become outdated when they arrive. In contrast, our lazy data routing model makes sure that message headers are sent in milliseconds, which never blocks the rest of messages. The consumers may choose to downsample some data and only fetch necessary data.

Whenever the consumer needs to fetch raw data, there is a fixed overhead to establish P2P connections even if the actual data is just a few bytes. This fixed overhead can be amortized when the actual data is larger, as depicted in Figure 9b and 10b.

In summary, lazy data routing is more performant when the messages transferred are larger in size. As shown in Figure 10c, eager data routing actually has a lower total communication latency when the messages are smaller than 512KB in size, and lazy data routing performs better when the messages are larger than 512KB in size.

Figure 11: Lazy data routing scales out well while eager data routing does not.
Refer to caption
Refer to caption
Figure 11: Lazy data routing scales out well while eager data routing does not.
Figure 12: Lazy data routing saves communication when some data is skipped.

Lazy Data Routing Naturally Supports Parallelism. It is very common for multiple consumers to fetch data from one or more producers at the same time. In our lazy data routing model, since messages are transferred in a peer-to-peer fashion, the leader node only has a very light workload to process tiny headers simultaneously, saving precious bandwidth at the leader node. However, in the eager data routing model, the leader node can be blocked when a piece of large message is going through the leader node from a producer to a consumer. As a result, other producers and/or consumers running in parallel cannot send or receive messages at the same time.

To compare the scalability of communication between eager and lazy data routing models, we have one producer continuously sending the same 512KB message for a total of 100 times to a shared queue. We gradually increase the number of consumer nodes from 1 to 4 and see how it scales out. While no actual computation is done, we measure the total working duration and use the single-node setup for both the eager and lazy data routing as the 1.0x baseline. Figure 12 shows how the eager data routing model fails to scale out with more consumer nodes, while our lazy data routing model achieves reasonable speedup. The line shadows represent the lower and upper bounds of repeated experiments.

Lazy Data Routing Performs Better with Network Contention. Our lazy data routing model is especially beneficial when the leader node is busy with network requests. We specifically construct a task where the message payload is large: real-time inference over video streams. In this experiment, two webcams capture the same moving QR code from different positions. Both videos are 150 frames long at 1920x1080 resolution. Each camera is connected to a unique data source node on the network. For multi-camera tracking, the QR code has to be detected in both streams and corresponded in time-aligned frames from both cameras. So these two data streams need to be joined at the prediction node. We simulate a congestion scenario where the network bandwidth at the leader node is limited. Note that the rest of the network retains its full speed; the only congestion is at the leader node.

Table 5 shows the results. With no congestion, the system can process roughly 0.8 frames per second in both lazy and eager data routing. With congestion, the story is very different. Our lazy data routing is tolerant, while transferring raw frames in a ROS-style eager communication pattern can be extremely slow when the network is congested. The total working duration increases by a factor of 7 simply due to congestion. Without care, distributed, multi-sensor deployments can easily lose real-time processing capabilities if the broker becomes a point of contention. These experiments illustrate the value of EdgeServe in a controlled scenario, where we can isolate performance differences.

Data routing strategy Rate limit (up/down) Time
Lazy (ours) No limit 3m 10s
Lazy (ours) 1 Mbps / 1 Mbps 3m 12s
Eager (similar to ROS) No limit 3m 16s
Eager (similar to ROS) 20 Mbps / 20 Mbps 21m 32s
Table 5: Total working duration with network bandwidth limits.

Lazy Data Routing Performs Better with Data Skip**. Apart from network congestion, lazy data routing is also valuable when data skip** is employed to ensure the timeliness of prediction results. We take one of the two 150-frame videos mentioned earlier and transfer these frames from one node to another, via the leader node. Each frame is about 6 MB in its uncompressed form. No actual computation is performed as we are only interested in the communication cost. In Figure 12, we illustrate how much communication cost can be saved by lazy data routing. On the x-axis, we have a variable percentage of frames skipped due to adaptive rate control described in §6; on the y-axis, we measure the total working duration defined in §B.1. Even when no frames are skipped, our lazy data routing model performs better than the eager data routing model due to the eliminated overhead of transferring a large amount of data through the leader node. When more frames are skipped, our lazy data routing saves communication time almost linearly to the number of frames skipped by the downstream node. On the other hand, the ROS-style eager routing pattern spends roughly the same time on communication even if most of the frames are skipped by the downstream model, because it would transfer the entire data payload upfront anyway.

7.6 Comparison with Ray Serve

We conduct an object detection task with another serving system, Ray Serve [40], with a sample of nuScenes [13] camera data and pre-trained YOLOv5n model [28] on a single NVIDIA Jetson Nano.

Single-node performance between Ray Serve, EdgeServe, and an ideal case where the job runs locally without any communication is presented in Table 6. First, we enforce the freshness threshold SLO (§6.3) of 1 second and see how many examples must be skipped in order to hit the SLO. Since the data comes faster than the model’s inference speed, 19.0% of incoming data has to be skipped even in an ideal case. EdgeServe skips a bit more examples than ideal, but the overhead is reasonably small. Ray Serve, however, skips 89.4% of incoming data, which means the system consumes more computational resources than the task itself. Second, we drop the SLO requirement and see how long it takes for each system to complete the task without downsampling. In an ideal scenario, the model runs for 25 seconds to finish the dataset. EdgeServe spends 26 seconds, which presents negligible system overhead. Ray Serve, on the other hand, spends 2m40s finishing the task, which is 6.4x slower compared to ideal due to its complex design. In addition, we compare key design decisions between Ray Serve and EdgeServe in Table 7.

Ray Serve EdgeServe Ideal
% examples skipped (w/ SLO enforced) 89.4% 22.2% 19.0%
Total working duration (w/o SLO enforced) 2m40s 26s 25s
Table 6: Single-node performance of Ray Serve, EdgeServe, and an ideal case.

8 Related Work

Model serving. Current machine learning model serving systems, including Clipper [18], TensorFlow Serving [46], and InferLine [17], all assume that the user has manually programmed all necessary data movement. Recent systems have begun to realize the underappreciated problem of data movement and communication-intensive aspects of modern AI applications. For example, Hoplite [64] generates data transfer schedules specifically for asynchronous collective communication operations (e.g., broadcast, reduce) in a task-based framework, such as Ray [40] and Dask [51]. However, they have yet to address the trade-offs in time-synchronization between different data sources when they do not arrive at the same time.

Ray Serve EdgeServe (ours)
Communication HTTP POST Message queue
Data routing Eager Eager / Lazy
Downsampling Not supported Supported
Result delivery Back to caller Can route to any topic
In-out relationship 1:1 request-response 1:n/n:1 with joins
Table 7: Design decisions: Ray Serve vs. EdgeServe (ours)

Edge computing. On the other hand, there has been a steady trend towards moving model serving to resources closer to the point of data collection, or the “edge”. The primary focus of model serving on the edge has been to design reduced-size models that can efficiently be deployed on lower-powered devices [24, 38, 26, 62]. Simply reducing the computational footprint of each prediction served is only part of the problem, and these tools do not support data routing when the relevant features might be generated on different edge nodes.

Distributed communication. The closest existing tools are those designed for distributed training of ML models. TensorFlow Distributed [1], for example, allows both all-reduce (synchronous) and parameter server (asynchronous) strategies to train a model with multiple compute nodes. Another popular framework, PyTorch distributed [35], supports additional collective communication operations such as gather and all-gather with Gloo, MPI, and NCCL backends. One might ask, can we perform distributed inference using these existing distributed training frameworks? Technically it is possible, but as we have shown in §7.2, the performance is unsatisfactory because such frameworks are optimized for maximum throughput but not end-to-end latency.

Message queues. An alternative to distributed communication is message queues. Existing message queues typically expose a publish-subscribe model that moves messages between services. Compared with transient queues like RabbitMQ [50] and direct TCP connections as used in Naiad [41], log-based queues, such as Kafka [7] and Pulsar [9], ensure the order of messages and are persistent in nature. Users can always replay the logs for debugging purposes. Modern message services offered by cloud providers, such as Google PubSub and Amazon SQS, generally have higher latencies in the hundreds of milliseconds due to synchronous data replication across multiple zones.

Dataflow systems. Existing batch processing and stream processing systems support dataflow computations over a dataflow graph [2, 5, 10, 41, 45, 8, 34, 1, 47, 6]. These systems are optimized for windowed operations over unbounded data, because a fixed frequency is preferred in typical data analysis workloads. In contrast, model-serving applications have a much finer granularity of data input and more sensitivity in decision making. Fixed time-windowing is usually not suitable for bursty data. Decision making is based on up-to-date predictions as every single new event arrives.

Temporal synchronization. Similarly, this problem is more than just a stream processing problem. Traditional relational stream processing systems, e.g., [14], have stringent requirements for temporal synchronization where they model such an operation as a temporal join. These systems will buffer data, indefinitely if needed, to ensure that corresponding observations are properly linked. While desirable for relation query processing, this approach is excessive in machine learning applications which have to tolerate some level of inaccuracy anyway. In addition, existing streaming join algorithms look for values from different streams with the same key [22, 42, 60, 53]. Multi-modal machine learning inference, however, pays more attention to the time skew between streams than the join predicate, as data sources might come at different rates. In this setting, a looser level of synchronization would benefit the system and improve performance.

In the context of sensing, ROS (Robot Operating System) [49] is an open-source framework designed for robotics research. It incorporates an algorithm called ApproximateTime that tries to join messages coming on different topics at different timestamps. This algorithm can drop messages on certain topics if they arrive too frequently, but does not use any message more than once. In other words, if one sensor sends data very infrequently, the algorithm will have to wait and drop messages from all other sensors until it sees a new message from the low-frequency sensor to issue a join. The frequency of combined prediction is thus upper-bounded by the most infrequent sensor. Such a wait can harm both end-to-end latency and accuracy due to the loss of high-frequency information.

References

  • [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, pages 265–283, Berkeley, CA, USA, 2016. USENIX Association.
  • [2] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. VLDB Endow., 8(12):1792–1803, August 2015.
  • [3] Alphabet Inc. Tensorflow case studies. https://www.tensorflow.org/about/case-studies (visited on October 5, 2022), 2022.
  • [4] Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, Ashish Gupta, Haifeng Jiang, Tianhao Qiu, Alexey Reznichenko, Deomid Ryabkov, Manpreet Singh, and Shivakumar Venkataraman. Photon: fault-tolerant and scalable joining of continuous data streams. In Kenneth A. Ross, Divesh Srivastava, and Dimitris Papadias, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013, pages 577–588. ACM, 2013.
  • [5] Apache. Apache spark. https://spark.apache.org/ (visited on July 27, 2021), 2016.
  • [6] Apache. Apache flink. stateful computations over data streams. https://flink.apache.org/ (visited on July 27, 2021), 2021.
  • [7] Apache. Apache kafka. an open-source distributed event streaming platform. https://kafka.apache.org/ (visited on July 27, 2021), 2021.
  • [8] Apache. Apache storm. an open source distributed realtime computation system. https://storm.apache.org/ (visited on July 27, 2021), 2021.
  • [9] Apache. Apache pulsar. https://pulsar.apache.org/ (visited on March 27, 2022), 2022.
  • [10] Michael Armbrust, Tathagata Das, Joseph Torres, Burak Yavuz, Shixiong Zhu, Reynold Xin, Ali Ghodsi, Ion Stoica, and Matei Zaharia. Structured streaming: A declarative API for real-time applications in apache spark. In Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein, editors, Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pages 601–613. ACM, 2018.
  • [11] Joy Arulraj. Accelerating video analytics. ACM SIGMOD Record, 50(4):39–40, 2022.
  • [12] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design. In A. Talwalkar, V. Smith, and M. Zaharia, editors, Proceedings of Machine Learning and Systems, volume 1, pages 374–388, 2019.
  • [13] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
  • [14] Sirish Chandrasekaran and Michael J Franklin. Psoup: a system for streaming queries over streaming data. The VLDB Journal, 12(2):140–156, 2003.
  • [15] Ricardo Chavarriaga, Hesam Sagha, Alberto Calatroni, Sundara Tejaswi Digumarti, Gerhard Tröster, José del R. Millán, and Daniel Roggen. The opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recognition Letters, 34(15):2033–2042, 2013. Smart Approaches for Human Action Recognition.
  • [16] Intel Corporation. Audi’s automated factory moves closer to industry 4.0. https://www.intel.com/content/dam/www/public/us/en/documents/case-studies/audis-automated-factory-closer-to-industry-case-study.pdf (visited on July 27, 2021), 2020.
  • [17] Daniel Crankshaw, Gur-Eyal Sela, Xiangxi Mo, Corey Zumar, Ion Stoica, Joseph Gonzalez, and Alexey Tumanov. Inferline: latency-aware provisioning and scaling for prediction serving pipelines. In Rodrigo Fonseca, Christina Delimitrou, and Beng Chin Ooi, editors, SoCC ’20: ACM Symposium on Cloud Computing, Virtual Event, USA, October 19-21, 2020, pages 477–491. ACM, 2020.
  • [18] Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, and Ion Stoica. Clipper: A low-latency online prediction serving system. In Aditya Akella and Jon Howell, editors, 14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, Boston, MA, USA, March 27-29, 2017, pages 613–627. USENIX Association, 2017.
  • [19] David J DeWitt, Jeffrey F Naughton, and Donovan A Schneider. An evaluation of non-equijoin algorithms. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 1991.
  • [20] David Eigen, Marc’Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314, 2013.
  • [21] Apache Flink. Flinkml documentation. https://nightlies.apache.org/flink/flink-ml-docs-master/ (visited on October 5, 2022), 2022.
  • [22] Bugra Gedik, Rajesh Bordawekar, and Philip S. Yu. Celljoin: a parallel stream join operator for the cell processor. VLDB J., 18(2):501–519, 2009.
  • [23] Googl. Dataflow and vertex ai: Scalable and efficient model serving. https://cloud.google.com/blog/products/ai-machine-learning/streaming-prediction-with-dataflow-and-vertex (visited on Jan, 14, 2024), 2024.
  • [24] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
  • [25] Sonia Horchidan, Emmanouil Kritharakis, Vasiliki Kalavri, and Paris Carbone. Evaluating model serving strategies over streaming data. In Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning, pages 1–5, 2022.
  • [26] Chien-Chun Hung, Ganesh Ananthanarayanan, Peter Bodik, Leana Golubchik, Minlan Yu, Paramvir Bahl, and Matthai Philipose. Videoedge: Processing camera streams using hierarchical clusters. In 2018 IEEE/ACM Symposium on Edge Computing (SEC), pages 115–131. IEEE, 2018.
  • [27] Gabriela Jacques-Silva, Ran Lei, Luwei Cheng, Guoqiang Jerry Chen, Kuen Ching, Tanji Hu, Yuan Mei, Kevin Wilfong, Rithin Shetty, Serhat Yilmaz, Anirban Banerjee, Benjamin Heintz, Shridar Iyer, and Anshul Jaiswal. Providing streaming joins as a service at facebook. Proc. VLDB Endow., 11(12):1809–1821, 2018.
  • [28] Glenn Jocher. YOLOv5 by Ultralytics, May 2020.
  • [29] Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. NoScope: optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11):1586–1597, 2017.
  • [30] Jeyhun Karimov, Tilmann Rabl, Asterios Katsifodimos, Roman Samarev, Henri Heiskanen, and Volker Markl. Benchmarking distributed stream data processing systems. In 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018, pages 1507–1518. IEEE Computer Society, 2018.
  • [31] Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, and Panos Kalnis. Lightning fast and space efficient inequality joins. 2015.
  • [32] Jakub Konečný, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter. CoRR, abs/1511.03575, 2015.
  • [33] Kahraman Kostas. Anomaly Detection in Networks Using Machine Learning. Master’s thesis, University of Essex, Colchester, UK, 2018.
  • [34] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. Twitter heron: Stream processing at scale. In Timos K. Sellis, Susan B. Davidson, and Zachary G. Ives, editors, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 239–250. ACM, 2015.
  • [35] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. Pytorch distributed: Experiences on accelerating data parallel training. Proc. VLDB Endow., 13(12):3005–3018, 2020.
  • [36] Shinan Liu, Tarun Mangla, Ted Shaowang, **** Zhao, John Paparrizos, Sanjay Krishnan, and Nick Feamster. Amir: Active multimodal interaction recognition from video and network traffic in connected environments. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 7(1), mar 2023.
  • [37] Lyft. Powering millions of real-time decisions with lyftlearn serving. https://lft.to/3O3EJVc (visited on Jan, 7, 2024), 2024.
  • [38] Sumit Maheshwari, Wuyang Zhang, Ivan Seskar, Yanyong Zhang, and Dipankar Raychaudhuri. Edgedrive: Supporting advanced driver assistance systems using mobile edge clouds networks. In IEEE INFOCOM 2019-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pages 1–6. IEEE, 2019.
  • [39] Navneet Malpani, Jennifer L Welch, and Nitin Vaidya. Leader election algorithms for mobile ad hoc networks. In Proceedings of the 4th international workshop on Discrete algorithms and methods for mobile computing and communications, pages 96–103, 2000.
  • [40] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. Ray: A distributed framework for emerging ai applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577, 2018.
  • [41] Derek Gordon Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. Naiad: a timely dataflow system. In Michael Kaminsky and Mike Dahlin, editors, ACM SIGOPS 24th Symposium on Operating Systems Principles, SOSP ’13, Farmington, PA, USA, November 3-6, 2013, pages 439–455. ACM, 2013.
  • [42] Mohammadreza Najafi, Mohammad Sadoghi, and Hans-Arno Jacobsen. Splitjoin: A scalable, low-latency stream join architecture with adjustable ordering precision. In Ajay Gulati and Hakim Weatherspoon, editors, 2016 USENIX Annual Technical Conference, USENIX ATC 2016, Denver, CO, USA, June 22-24, 2016, pages 493–505. USENIX Association, 2016.
  • [43] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019.
  • [44] Nigamaa Nayakanti, Rami Al-Rfou, Aurick Zhou, Kratarth Goel, Khaled S. Refaat, and Benjamin Sapp. Wayformer: Motion forecasting via simple & efficient attention networks. In IEEE International Conference on Robotics and Automation, ICRA 2023, London, UK, May 29 - June 2, 2023, pages 2980–2987. IEEE, 2023.
  • [45] Shadi A. Noghabi, Kartik Paramasivam, Yi Pan, Navina Ramesh, Jon Bringhurst, Indranil Gupta, and Roy H. Campbell. Stateful scalable stream processing at linkedin. Proc. VLDB Endow., 10(12):1634–1645, 2017.
  • [46] Christopher Olston, Fangwei Li, Jeremiah Harmsen, Jordan Soyke, Kiril Gorovoy, Li Lao, Noah Fiedel, Sukriti Ramesh, and Vinu Rajashekhar. Tensorflow-serving: Flexible, high-performance ml serving. In Workshop on ML Systems at NIPS 2017, 2017.
  • [47] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • [48] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [49] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, Andrew Y Ng, et al. Ros: an open-source robot operating system. In ICRA workshop on open source software, volume 3, page 5. Kobe, Japan, 2009.
  • [50] RabbitMQ. An open-source message-broker software. https://www.rabbitmq.com/ (visited on July 27, 2021), 2021.
  • [51] Matthew Rocklin. Dask: Parallel computation with blocked algorithms and task scheduling. In Proceedings of the 14th python in science conference, volume 130, page 136. Citeseer, 2015.
  • [52] Daniel Roggen, Alberto Calatroni, Mirco Rossi, Thomas Holleczek, Kilian Förster, Gerhard Tröster, Paul Lukowicz, David Bannach, Gerald Pirkl, Alois Ferscha, Jakob Doppler, Clemens Holzmann, Marc Kurz, Gerald Holl, Ricardo Chavarriaga, Hesam Sagha, Hamidreza Bayati, Marco Creatura, and José del R. Millàn. Collecting complex activity datasets in highly rich networked sensor environments. In 2010 Seventh International Conference on Networked Sensing Systems (INSS), pages 233–240, 2010.
  • [53] Pratanu Roy, Jens Teubner, and Rainer Gemulla. Low-latency handshake join. Proc. VLDB Endow., 7(9):709–720, 2014.
  • [54] Omer Sagi and Lior Rokach. Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1249, 2018.
  • [55] Ted Shaowang, Nilesh Jain, Dennis D Matthews, and Sanjay Krishnan. Declarative data serving: the future of machine learning inference on the edge. Proceedings of the VLDB Endowment, 14(11):2555–2562, 2021.
  • [56] Ted Shaowang, Xi Liang, and Sanjay Krishnan. Sensor fusion on the edge: Initial experiments in the edgeserve system. In Proceedings of The International Workshop on Big Data in Emergent Distributed Environments, BiDEDE ’22, New York, NY, USA, 2022. Association for Computing Machinery.
  • [57] Iman Sharafaldin, Arash Habibi Lashkari, and Ali A. Ghorbani. Toward generating a new intrusion detection dataset and intrusion traffic characterization. In Paolo Mori, Steven Furnell, and Olivier Camp, editors, Proceedings of the 4th International Conference on Information Systems Security and Privacy, ICISSP 2018, Funchal, Madeira - Portugal, January 22-24, 2018, pages 108–116. SciTePress, 2018.
  • [58] Leandros Tassiulas and Anthony Ephremides. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. In 29th IEEE Conference on Decision and Control, pages 2130–2132. IEEE, 1990.
  • [59] Tensorflow. Robust machine learning on streaming data using kafka and tensorflow-io. https://www.tensorflow.org/io/tutorials/kafka (visited on October 5, 2022), 2022.
  • [60] Jens Teubner and René Müller. How soccer players would do stream joins. In Timos K. Sellis, Renée J. Miller, Anastasios Kementsietsidis, and Yannis Velegrakis, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011, pages 625–636. ACM, 2011.
  • [61] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Center-based 3d object detection and tracking. CVPR, 2021.
  • [62] Xiao Zeng, Biyi Fang, Haichen Shen, and Mi Zhang. Distream: scaling live video analytics with workload-adaptive distributed edge intelligence. In Proceedings of the 18th Conference on Embedded Networked Sensor Systems, pages 409–421, 2020.
  • [63] Yucheng Zhao, Chong Luo, Chuanxin Tang, Dongdong Chen, Noel Codella, and Zheng-Jun Zha. Streaming video model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14602–14612, 2023.
  • [64] Siyuan Zhuang, Zhuohan Li, Danyang Zhuo, Stephanie Wang, Eric Liang, Robert Nishihara, Philipp Moritz, and Ion Stoica. Hoplite: efficient and fault-tolerant collective communication for task-based distributed systems. In Proceedings of the 2021 ACM SIGCOMM 2021 Conference, pages 641–656, 2021.

Appendix

Refer to caption
(a) Topology 1
(Early fusion)
Refer to caption
(b) Topology 2
(Early fusion with parallelism)
Refer to caption
(c) Topology 3
(Late fusion)
Figure 13: Network topologies used in our experimental edge cluster. Topology 2 takes advantage of multiple prediction nodes consuming a shared message queue at the same time, and is only used in “Parallel” experiments.

A Model Decomposition

Since models are the unit of placement and computation in EdgeServe, the goal of model decomposition is to increase opportunities for optimizing placement. The idea is to approximate a single model with an ensemble or mixture of smaller local models. Obviously, not all models can be decomposed into smaller parts. However, many real-world models can be partitioned.

Strategy 1. Ensemble Models Ensemble machine learning models are techniques that combine multiple models to improve the accuracy and robustness of predictions. Let’s imagine that we have p𝑝pitalic_p features and n𝑛nitalic_n examples with an example matrix X𝑋Xitalic_X and a label vector Y𝑌Yitalic_Y. Different subsets of these features are constructed on m𝑚mitalic_m data sources on the network. Each source generates a partition of features fisubscript𝑓𝑖f_{i}italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, i.e., X[:,fi]𝑋:subscript𝑓𝑖X[:,f_{i}]italic_X [ : , italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] is the source-specific projection of training data. Stacking is an ensembling technique where multiple models are trained, and their predictions are used as inputs to a final model. The final model learns to weigh the predictions of each model and make a final prediction based on the weighted inputs. This helps capture the strengths of each individual model and produce a more accurate prediction.

We can train the following models. For each feature subset fisubscript𝑓𝑖f_{i}italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, we train a model (from any model family) that uses only the subset of features to predict the label.

gi𝗍𝗋𝖺𝗂𝗇(X[:,fi],Y)subscript𝑔𝑖𝗍𝗋𝖺𝗂𝗇𝑋:subscript𝑓𝑖𝑌g_{i}\leftarrow\textsf{train}(X[:,f_{i}],Y)italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ← train ( italic_X [ : , italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] , italic_Y )

After training each of these models over the m𝑚mitalic_m subset, we train a stacking model that combines the prediction. This is a learned function of the outputs of each gisubscript𝑔𝑖g_{i}italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT that predicts a single final label:

h𝗍𝗋𝖺𝗂𝗇([g1,,gm],Y)𝗍𝗋𝖺𝗂𝗇subscript𝑔1subscript𝑔𝑚𝑌h\leftarrow\textsf{train}([g_{1},...,g_{m}],Y)italic_h ← train ( [ italic_g start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_g start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ] , italic_Y )

Stacking models are well-studied in literature and are not new [54]. For multi-modal prediction tasks, prior work has found that such models do not sacrifice accuracy and sometimes actually improve accuracy [36].

Strategy 2. Mixture of Experts Models Similarly, there are neural network architectures that can be trained end-to-end to take advantage of EdgeServe. Mixture of Experts (MoE) is a deep learning architecture that combines multiple models or “experts’ to make predictions on a given task. The basic idea of the MoE architecture is to divide the input space into regions and assign an expert to each region. The gating network takes the input, decides which region it belongs to, and then selects the corresponding expert to make the prediction. The gating network then weights the output of each expert, and the final prediction is the weighted sum of the expert predictions. MoE architectures have been applied to a wide range of tasks, including language modeling, image classification, and speech recognition [20]. After training, each expert can be placed independently once trained.

B Evaluation Metrics

We borrow the following common metrics used in streaming systems [30] to measure the system performance of EdgeServe.

B.1 Types of Latency

Producer Sending Latency. We define producer sending latency to be the interval between the time the producer begins sending an example to the leader node and the time the producer finishes sending the same example to the leader node.

Consumer Receiving Latency. We define consumer receiving latency to be the interval between the time the producer finishes sending an example to the leader node and the time the consumer finishes receiving the same example from the leader node. That means, if there is any queuing backlog at the leader node, it is counted as part of consumer receiving latency.

Total Communication Latency. We define total communication latency to be the sum of producer sending latency and consumer receiving latency. It means the interval between the time the producer begins sending an example to the leader node and the time the consumer finishes receiving the same example from the leader node.

Reaction Time. We define reaction time to be the interval between the time the producer begins sending the latest example involved in a join and the time at which the joined data tuple arrives at the consumer node. It measures how timely the system reacts to the newest information available.

Processing Latency. We define processing latency to be the interval between the time a prediction node starts processing an example (or a joined set of examples) and the time it finishes processing the same example. The processing latency is used to measure the actual computation time of an example (or a joined set of examples).

End-to-end Latency. We measure the interval between the time an example is collected by EdgeServe and the time the last prediction node finishes processing the same example as end-to-end latency. The end-to-end latency includes but not limited to total communication latency and processing latency.

Queueing Time. We define queueing time to be the interval between the time a node finishes its local inference of an example and the time such local prediction is joined with other local outputs. It measures how long it has to wait for other local predictions to be joined together.

Total Working Duration. We define total working duration to be the interval between the time the producer begins sending the first example of a task to the leader node and the time the last prediction node finishes processing the last example of the task. This is a task-level measurement rather than a per-example measurement, and it includes both communication time and computation time (if any).

B.2 Backlog

We define the end-to-end latency of the last example of a task as backlog. Backlog is an important metric because all kinds of delays can easily accumulate, which causes outdated predictions for later examples in a real-time inference scenario. The lower bound of the backlog is near-zero, when there is no delay along the path from the data source to prediction nodes. Ideally, such lower bound is achievable if data arrive slower than the rate our computational power can serve, or we might have to skip some data points to keep the predictions in time.

C Benefits of Late Fusion Models

There has been recent work discussing the latency vs. accuracy tradeoff between early fusion and late fusion models [44]. Early fusion models, while potentially capturing more cross-modal correlations, tend to require more communication as they combine raw data from various sources. In contrast, late fusion models reduce communication by combining locally inferred results rather than raw data. Late fusion models also naturally support parallelism since most inference is performed locally. For latency-sensitive tasks, the communication efficiencies offered by late fusion models are extremely valuable. We conduct experiments to demonstrate the benefits of late fusion models in EdgeServe.

C.1 Network Topology Setup

Figure 13 shows the network topologies of our experiment setup. Network topology 1 in Figure 12(a) uses only one prediction node, which is supported by both EdgeServe and PyTorch. Network topology 2 in Figure 12(b) has 3 additional prediction nodes, and all these 4 prediction nodes consume a shared queue at the same time in parallel experiments thanks to EdgeServe. This cannot be done in PyTorch due to the lack of a shared queue. Network topology 3 in Figure 12(c) takes advantage of model decomposition described in §A and uses local data source nodes as local prediction nodes, too. The node that was making prediction in topology 1 and 2 now only has to gather local predictions and take a majority vote. All topologies have 4 data source nodes, from which we use 3 NUCs (in yellow) and 1 Jetson Nano (in black) to reflect the heterogeneity of real-world edge devices. The other NUC is set up as the leader node, or the “master” node in terms of PyTorch distributed. The rest of Jetson Nanos are prediction nodes where the actual computation is done, and one of them is designated as the destination node where final results should be sent.

C.2 Late Fusion Models Reduce Backlog

Figure 14 is an extension to Figure 6, showing the backlog for all network topologies. For both EdgeServe and PyTorch, we see a lower backlog for late fusion models (Topology 3) because we are able to make the most of local data source nodes and save communication costs. EdgeServe early fusion with parallelism (Topology 2) also helps reduce the backlog over early fusion (Topology 1), when the model is not able to catch up with incoming data rate.

Refer to caption
Figure 14: Measure of backlog in the activity recognition task for all network topologies.

C.3 Late Fusion Models Are More Tolerant To Delays

Real-time accuracy (F-1 score) No delay 25ms delay
EdgeServe Early Fusion 0.90 0.55
EdgeServe Early Fusion w/ Parallelism 0.90 0.55
EdgeServe Late Fusion 0.91 0.85
Table 8: Real-time accuracy measured in F-1 score when one of the four data streams has a constant delay. Late fusion models are more accurate even if there is a delay from one data source.

We next evaluate the fail-soft benefits of EdgeServe, by measuring real-time accuracy of predictions when one of the four data streams have a constant 25ms delay. Table 8 shows that, while early fusion models (with or without parallelism) suffer from considerable degradation in accuracy due to the delay, late fusion models achieves a much higher accuracy even if there is a constant delay. This is because local predictions from other data streams are unaffected and the “unpopular vote” – local prediction of the delayed data stream is likely dropped by the ensemble method.

Refer to caption
Figure 15: Number of excess examples processed for different strategies and target prediction frequencies.

C.4 Late Fusion Models Reduce Excess Work.

Now, we look at the number of excess examples that are processed to better understand how EdgeServe automatically downsamples the incoming data stream in response to the target prediction frequency. As can be seen from Figure 15, PyTorch (either early or late fusion), as a baseline, is marked as zero on the y-axis because it always processes a fixed number of examples equal to the input size. In early fusion (with or without parallelism) settings, EdgeServe is very sensitive to target prediction frequency because it can downsample incoming data stream when such target is relaxed. Therefore, we see a rapidly decreasing excess work from left to right as the target prediction frequency becomes less frequent. On the other hand, in a late fusion setting, EdgeServe is not as sensitive to such change in target prediction frequency for the same reason why EdgeServe late fusion maintains a high real-time accuracy discussed in §7.2. After faster NUCs finish local predictions, the ensemble model simply skips further local predictions made by the Jetson Nano as they fall outside the acceptable time skew range. As a result, late fusion models only process a small number of examples, even when the target prediction frequency is high enough.

D Comparison with Federated learning.

Our work on decentralized prediction might seem similar to federated learning [32, 12], but there are several key differences. First, our goal is not to collaboratively train a shared model, but to make combined predictions based on multiple streams of data. Second, we optimize for millisecond-level end-to-end timeliness from the point of data collection to the point where prediction is delivered. Federated learning tasks usually assume a much longer end-to-end latency, and they have other optimization goals, such as communication cost. Third, we have to take care of time-synchronization between data streams, while federated learning systems usually treat those data as the same batch.