-
Evolving the Digital Industrial Infrastructure for Production: Steps Taken and the Road Ahead
Authors:
Jan Pennekamp,
Anastasiia Belova,
Thomas Bergs,
Matthias Bodenbenner,
Andreas Bührig-Polaczek,
Markus Dahlmanns,
Ike Kunze,
Moritz Kröger,
Sandra Geisler,
Martin Henze,
Daniel Lütticke,
Benjamin Montavon,
Philipp Niemietz,
Lucia Ortjohann,
Maximilian Rudack,
Robert H. Schmitt,
Uwe Vroomen,
Klaus Wehrle,
Michael Zeng
Abstract:
The Internet of Production (IoP) leverages concepts such as digital shadows, data lakes, and a World Wide Lab (WWL) to advance today's production. Consequently, it requires a technical infrastructure that can support the agile deployment of these concepts and corresponding high-level applications, which, e.g., demand the processing of massive data in motion and at rest. As such, key research aspec…
▽ More
The Internet of Production (IoP) leverages concepts such as digital shadows, data lakes, and a World Wide Lab (WWL) to advance today's production. Consequently, it requires a technical infrastructure that can support the agile deployment of these concepts and corresponding high-level applications, which, e.g., demand the processing of massive data in motion and at rest. As such, key research aspects are the support for low-latency control loops, concepts on scalable data stream processing, deployable information security, and semantically rich and efficient long-term storage. In particular, such an infrastructure cannot continue to be limited to machines and sensors, but additionally needs to encompass networked environments: production cells, edge computing, and location-independent cloud infrastructures. Finally, in light of the envisioned WWL, i.e., the interconnection of production sites, the technical infrastructure must be advanced to support secure and privacy-preserving industrial collaboration. To evolve today's production sites and lay the infrastructural foundation for the IoP, we identify five broad streams of research: (1) adapting data and stream processing to heterogeneous data from distributed sources, (2) ensuring data interoperability between systems and production sites, (3) exchanging and sharing data with different stakeholders, (4) network security approaches addressing the risks of increasing interconnectivity, and (5) security architectures to enable secure and privacy-preserving industrial collaboration. With our research, we evolve the underlying infrastructure from isolated, sparsely networked production sites toward an architecture that supports high-level applications and sophisticated digital shadows while facilitating the transition toward a WWL.
△ Less
Submitted 17 May, 2023;
originally announced May 2023.
-
Large-scale text processing pipeline with Apache Spark
Authors:
Alexey Svyatkovskiy,
Kosuke Imai,
Mary Kroeger,
Yuki Shiraito
Abstract:
In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills due to computational intensity. As a substitute, scholars have studied single topic areas.
We provide…
▽ More
In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills due to computational intensity. As a substitute, scholars have studied single topic areas.
We provide an implementation of this analysis workflow as a distributed text processing pipeline with Spark dataframes and Scala application programming interface. We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.
△ Less
Submitted 1 December, 2019;
originally announced December 2019.
-
Virtually the Same: Comparing Physical and Virtual Testbeds
Authors:
Jonathan Crussell,
Thomas M Kroeger,
Aaron Brown,
Cynthia Phillips
Abstract:
Network designers, planners, and security professionals increasingly rely on large-scale virtual testbeds to emulate networks and make decisions about real-world deployments. However, there has been limited research on how well these virtual testbeds match their physical counterparts. Specifically, does the virtualization that these testbeds depend on actually capture real-world behaviors sufficie…
▽ More
Network designers, planners, and security professionals increasingly rely on large-scale virtual testbeds to emulate networks and make decisions about real-world deployments. However, there has been limited research on how well these virtual testbeds match their physical counterparts. Specifically, does the virtualization that these testbeds depend on actually capture real-world behaviors sufficiently well to support decisions? As a first step, we perform simple experiments on both physical and virtual testbeds to begin to understand where and how the testbeds differ. We set up a web service on one host and run ApacheBench against this service from a different host, instrumenting each system during these tests. We define an initial repeatable methodology to quantitatively compare testbeds. Specifically, we compare the testbeds at three levels of abstraction: application, operating system (OS) and network. For the application level, we use the ApacheBench results. For OS behavior, we compare patterns of system call orderings using Markov chains. This provides a unique visual representation of the workload and OS behavior in our testbeds. We also drill down into read-system-call behaviors and show how at one level both systems are deterministic and identical, but as we move up in abstractions that consistency declines. Finally, we use packet captures to compare network behaviors and performance. We reconstruct flows and compare per-flow and per-experiment statistics. We find that the behavior of the workload in the testbeds is similar but that the underlying processes to support it do vary. The low-level network behavior can vary widely in packetization depending on the virtual network driver. While these differences can be important, and knowing about them will help experiment designers, the core application and OS behaviors still represent similar processes.
△ Less
Submitted 21 January, 2019;
originally announced February 2019.
-
The Online Event-Detection Problem
Authors:
Michael A. Bender,
Jonathan W. Berry,
Martin Farach-Colton,
Rob Johnson,
Thomas M. Kroeger,
Prashant Pandey,
Cynthia A. Phillips,
Shikha Singh
Abstract:
Given a stream $S = (s_1, s_2, ..., s_N)$, a $φ$-heavy hitter is an item $s_i$ that occurs at least $φN$ times in $S$. The problem of finding heavy-hitters has been extensively studied in the database literature. In this paper, we study a related problem. We say that there is a $φ$-event at time $t$ if $s_t$ occurs exactly $φN$ times in $(s_1, s_2, ..., s_t)$. Thus, for each $φ$-heavy hitter there…
▽ More
Given a stream $S = (s_1, s_2, ..., s_N)$, a $φ$-heavy hitter is an item $s_i$ that occurs at least $φN$ times in $S$. The problem of finding heavy-hitters has been extensively studied in the database literature. In this paper, we study a related problem. We say that there is a $φ$-event at time $t$ if $s_t$ occurs exactly $φN$ times in $(s_1, s_2, ..., s_t)$. Thus, for each $φ$-heavy hitter there is a single $φ$-event which occurs when its count reaches the reporting threshold $φN$. We define the online event-detection problem (OEDP) as: given $φ$ and a stream $S$, report all $φ$-events as soon as they occur.
Many real-world monitoring systems demand event detection where all events must be reported (no false negatives), in a timely manner, with no non-events reported (no false positives), and a low reporting threshold. As a result, the OEDP requires a large amount of space (Omega(N) words) and is not solvable in the streaming model or via standard sampling-based approaches.
Since OEDP requires large space, we focus on cache-efficient algorithms in the external-memory model.
We provide algorithms for the OEDP that are within a log factor of optimal. Our algorithms are tunable: its parameters can be set to allow for a bounded false-positives and a bounded delay in reporting. None of our relaxations allow false negatives since reporting all events is a strict requirement of our applications. Finally, we show improved results when the count of items in the input stream follows a power-law distribution.
△ Less
Submitted 23 December, 2018;
originally announced December 2018.