-
S-RASTER: Contraction Clustering for Evolving Data Streams
Authors:
Gregor Ulm,
Simon Smith,
Adrian Nilsson,
Emil Gustavsson,
Mats Jirstrand
Abstract:
Contraction Clustering (RASTER) is a single-pass algorithm for density-based clustering of 2D data. It can process arbitrary amounts of data in linear time and in constant memory, quickly identifying approximate clusters. It also exhibits good scalability in the presence of multiple CPU cores. RASTER exhibits very competitive performance compared to standard clustering algorithms, but at the cost of decreased precision. Yet, RASTER is limited to batch processing and unable to identify clusters that only exist temporarily. In contrast, S-RASTER is an adaptation of RASTER to the stream processing paradigm that is able to identify clusters in evolving data streams. This algorithm retains the main benefits of its parent algorithm, i.e. single-pass linear time cost and constant memory requirements for each discrete time step within a sliding window. The sliding window is efficiently pruned, and clustering is still performed in linear time. Like RASTER, S-RASTER trades off an often negligible amount of precision for speed. Our evaluation shows that competing algorithms are at least 50% slower. Furthermore, S-RASTER shows good qualitative results, based on standard metrics. It is very well suited to real-world scenarios where clustering does not happen continually but only periodically.
Submitted 16 September, 2020; v1 submitted 21 November, 2019;
originally announced November 2019.
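The sliding-window idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `SlidingTileWindow`, the parameters `window`, `precision`, and `threshold`, and the tile projection by scaling and flooring coordinates are all assumptions made for illustration. It keeps one tile counter per discrete time step, prunes expired steps as the window advances, and reports the tiles that are currently dense enough to be clustered.

```python
from collections import Counter, deque
import math


class SlidingTileWindow:
    """Per-period tile counts over a sliding window (illustrative only)."""

    def __init__(self, window=3, precision=1, threshold=2):
        self.window = window            # number of time steps kept (assumed)
        self.scale = 10 ** precision    # tile resolution (assumed)
        self.threshold = threshold      # minimum points per dense tile (assumed)
        self.periods = deque()          # (period, Counter of tiles) pairs
        self.totals = Counter()         # tile counts over the whole window

    def add(self, period, x, y):
        """Add one 2D point; periods are assumed to arrive in order."""
        if not self.periods or self.periods[-1][0] != period:
            self.periods.append((period, Counter()))
            self._prune(period)
        tile = (math.floor(x * self.scale), math.floor(y * self.scale))
        self.periods[-1][1][tile] += 1
        self.totals[tile] += 1

    def _prune(self, current):
        """Drop time steps that have fallen out of the window."""
        while self.periods and self.periods[0][0] <= current - self.window:
            _, expired = self.periods.popleft()
            for tile, n in expired.items():
                self.totals[tile] -= n
                if self.totals[tile] == 0:
                    del self.totals[tile]

    def significant_tiles(self):
        """Tiles dense enough to take part in a later clustering step."""
        return {t for t, n in self.totals.items() if n >= self.threshold}
```

In this sketch, clustering would be run on `significant_tiles()` only when a result is requested, which matches the periodic use case the abstract mentions.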
-
Active-Code Replacement in the OODIDA Data Analytics Platform
Authors:
Gregor Ulm,
Emil Gustavsson,
Mats Jirstrand
Abstract:
OODIDA (On-board/Off-board Distributed Data Analytics) is a platform for distributing and executing concurrent data analytics tasks. It targets fleets of reference vehicles in the automotive industry and has a particular focus on rapid prototyping. Its underlying message-passing infrastructure has been implemented in Erlang/OTP. External Python applications perform data analytics tasks. Most work is performed by clients (on-board). A central cloud server performs supplementary tasks (off-board). OODIDA can be automatically packaged and deployed, which necessitates restarting parts of the system, or all of it. This is potentially disruptive. To address this issue, we added the ability to execute user-defined Python modules on clients as well as the server. These modules can be replaced without restarting any part of the system and they can even be replaced between iterations of an ongoing assignment. This facilitates use cases such as iterative A/B testing of machine learning algorithms or modifying experimental algorithms on-the-fly.
Submitted 15 June, 2020; v1 submitted 3 October, 2019;
originally announced October 2019.
-
Contraction Clustering (RASTER): A Very Fast Big Data Algorithm for Sequential and Parallel Density-Based Clustering in Linear Time, Constant Memory, and a Single Pass
Authors:
Gregor Ulm,
Simon Smith,
Adrian Nilsson,
Emil Gustavsson,
Mats Jirstrand
Abstract:
Clustering is an essential data mining tool for analyzing and grouping similar objects. In big data applications, however, many clustering algorithms are infeasible due to their high memory requirements and/or unfavorable runtime complexity. In contrast, Contraction Clustering (RASTER) is a single-pass algorithm for identifying density-based clusters with linear time complexity. Due to its favorable runtime and the fact that its memory requirements are constant, this algorithm is highly suitable for big data applications where the amount of data to be processed is huge. It consists of two steps: (1) a contraction step which projects objects onto tiles and (2) an agglomeration step which groups tiles into clusters. This algorithm is extremely fast in both sequential and parallel execution. Our quantitative evaluation shows that a sequential implementation of RASTER performs significantly better than various standard clustering algorithms. Furthermore, the parallel speedup is significant: on a contemporary workstation, an implementation in Rust processes a batch of 500 million points with 1 million clusters in less than 50 seconds on one core. With 8 cores, the algorithm is about four times faster.
Submitted 29 January, 2020; v1 submitted 8 July, 2019;
originally announced July 2019.
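The two steps named in the abstract above, contraction and agglomeration, can be sketched as follows. This is a simplified illustration under stated assumptions, not the published Rust implementation: the function names, the parameters `precision`, `threshold`, and `min_size`, the projection by scaling and flooring coordinates, and the use of 8-connected neighbors are all choices made here for the example.

```python
from collections import Counter
import math


def contract(points, precision=1, threshold=2):
    """Contraction: project each 2D point onto a tile, keep dense tiles."""
    scale = 10 ** precision  # assumed tile resolution
    counts = Counter(
        (math.floor(x * scale), math.floor(y * scale)) for x, y in points
    )
    return {tile for tile, n in counts.items() if n >= threshold}


def agglomerate(tiles, min_size=1):
    """Agglomeration: group neighboring tiles into clusters."""
    remaining = set(tiles)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            tx, ty = frontier.pop()
            # Visit the 8 surrounding tiles (assumed neighborhood).
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbor = (tx + dx, ty + dy)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        cluster.add(neighbor)
                        frontier.append(neighbor)
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters
```

Each point is touched once during contraction, and only the tile counts are retained afterwards, which is where the single-pass behavior comes from in this sketch. Clusters are returned as sets of tiles rather than sets of points, reflecting the approximate nature of the result.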
-
Facilitating Rapid Prototyping in the OODIDA Data Analytics Platform via Active-Code Replacement
Authors:
Gregor Ulm,
Simon Smith,
Adrian Nilsson,
Emil Gustavsson,
Mats Jirstrand
Abstract:
OODIDA (On-board/Off-board Distributed Data Analytics) is a platform for distributed real-time analytics, targeting fleets of reference vehicles in the automotive industry. Its users are data analysts. The bulk of the data analytics tasks are performed by clients (on-board), while a central cloud server performs supplementary tasks (off-board). OODIDA can be automatically packaged and deployed, which necessitates restarting parts of the system, or all of it. As this is potentially disruptive, we added the ability to execute user-defined Python modules on clients as well as the server. These modules can be replaced without restarting any part of the system; they can even be replaced between iterations of an ongoing assignment. This feature is referred to as active-code replacement. It facilitates use cases such as iterative A/B testing of machine learning algorithms or modifying experimental algorithms on-the-fly. Consistency of results is achieved by majority vote, which prevents tainted state. Active-code replacement can be done in less than a second in an idealized setting whereas a standard deployment takes many orders of magnitude more time. The main contribution of this paper is the description of a relatively straightforward approach to active-code replacement that is very user-friendly. It enables a data analyst to quickly execute custom code on the cloud server as well as on client devices. Sensible safeguards and design decisions ensure that this feature can be used by non-specialists who are not familiar with the implementation of OODIDA in general or this feature in particular. As a consequence of adding the active-code replacement feature, OODIDA is now very well-suited for rapid prototyping.
Submitted 30 December, 2020; v1 submitted 22 March, 2019;
originally announced March 2019.
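Replacing a user-defined Python module between iterations, as the abstract above describes, can be illustrated with a small sketch. It does not reflect OODIDA's actual mechanism: the class `CodeSlot`, the required entry point `custom_code(values)`, and the validation step are hypothetical, and the majority-vote consistency check mentioned in the abstract is not modeled.

```python
import types


class CodeSlot:
    """Holds the currently active user-defined module (illustrative only)."""

    def __init__(self):
        self._module = None
        self.version = 0

    def replace(self, source):
        """Load new source; swap it in only if it passes a basic check."""
        module = types.ModuleType(f"custom_code_v{self.version + 1}")
        exec(compile(source, module.__name__, "exec"), module.__dict__)
        # Hypothetical safeguard: require a callable entry point.
        if not callable(getattr(module, "custom_code", None)):
            raise ValueError("module must define custom_code(values)")
        self._module = module
        self.version += 1

    def run(self, values):
        """Run one iteration with whichever module is active right now."""
        return self._module.custom_code(values)
```

Because each iteration looks up the active module at call time, a replacement takes effect on the next iteration without restarting the process. A rejected replacement leaves the previous module in place.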
-
OODIDA: On-board/Off-board Distributed Real-Time Data Analytics for Connected Vehicles
Authors:
Gregor Ulm,
Simon Smith,
Adrian Nilsson,
Emil Gustavsson,
Mats Jirstrand
Abstract:
A fleet of connected vehicles easily produces many gigabytes of data per hour, making centralized (off-board) data processing impractical. In addition, there is the issue of distributing tasks to on-board units in vehicles and processing them efficiently. Our solution to this problem is OODIDA (On-board/Off-board Distributed Data Analytics), which is a platform that tackles both task distribution to connected vehicles as well as concurrent execution of tasks on arbitrary subsets of edge clients. Its message-passing infrastructure has been implemented in Erlang/OTP, while the end points use a language-independent JSON interface. Computations can be carried out in arbitrary programming languages. The message-passing infrastructure of OODIDA is highly scalable, facilitating the execution of large numbers of concurrent tasks.
Submitted 31 January, 2021; v1 submitted 1 February, 2019;
originally announced February 2019.
-
Functional Federated Learning in Erlang (ffl-erl)
Authors:
Gregor Ulm,
Emil Gustavsson,
Mats Jirstrand
Abstract:
The functional programming language Erlang is well-suited for concurrent and distributed applications. Numerical computing, however, is not seen as one of its strengths. The recent introduction of Federated Learning, a concept according to which client devices are leveraged for decentralized machine learning tasks, while a central server updates and distributes a global model, provided the motivation for exploring how well Erlang is suited to that problem. We present ffl-erl, a framework for Federated Learning, written in Erlang, and explore how well it performs in two scenarios: one in which the entire system has been written in Erlang, and another in which Erlang is relegated to coordinating client processes that rely on performing numerical computations in the programming language C. There is a concurrent as well as a distributed implementation of each case. Erlang incurs a performance penalty, but for certain use cases this may not be detrimental, considering the trade-off between conciseness of the language and speed of development (Erlang) versus performance (C). Thus, Erlang may be a viable alternative to C for some practical machine learning tasks.
Submitted 15 March, 2019; v1 submitted 24 August, 2018;
originally announced August 2018.
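The Federated Learning scheme summarized in the abstract above, where clients train locally and a central server updates and distributes a global model, can be sketched in a few lines. This is a generic weighted-averaging example in Python, not code from ffl-erl (which is written in Erlang and C): the one-parameter model `y = w * x`, the function names, and the learning rate are assumptions chosen to keep the example small.

```python
def local_update(weight, data, lr=0.05):
    """Client side: one gradient step on local (x, y) pairs for y = w * x."""
    grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
    return weight - lr * grad


def federated_round(global_weight, client_datasets, lr=0.05):
    """Server side: average client results, weighted by local sample count."""
    total = sum(len(d) for d in client_datasets)
    return sum(
        len(d) * local_update(global_weight, d, lr) for d in client_datasets
    ) / total


def train(client_datasets, rounds=30, lr=0.05):
    """Repeat: distribute the global weight, collect updates, average."""
    weight = 0.0
    for _ in range(rounds):
        weight = federated_round(weight, client_datasets, lr)
    return weight
```

Only model weights cross the client boundary in this sketch; the raw `(x, y)` pairs stay with each client, which is the property that makes the approach decentralized.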
-
Non-thermal response of YBCO thin films to picosecond THz pulses
Authors:
P. Probst,
A. Semenov,
M. Ries,
A. Hoehl,
P. Rieger,
A. Scheuring,
V. Judin,
S. Wünsch,
K. Il'in,
N. Smale,
Y. -L. Mathis,
R. Müller,
G. Ulm,
G. Wüstefeld,
H. -W. Hübers,
J. Hänisch,
B. Holzapfel,
M. Siegel,
A. -S. Müller
Abstract:
The photoresponse of YBa2Cu3O7-d thin film microbridges with thicknesses between 15 and 50 nm was studied in the optical and terahertz frequency range. The voltage transients in response to short radiation pulses were recorded in real time with a resolution of a few tens of picoseconds. The bridges were excited by either femtosecond pulses at a wavelength of 0.8 μm or broadband (0.1 - 1.5 THz) picosecond pulses of coherent synchrotron radiation. The transients in response to optical radiation are qualitatively well explained in the framework of the two-temperature model with a fast component in the picosecond range and a bolometric nanosecond component whose decay time depends on the film thickness. The transients in the THz regime showed no bolometric component and had amplitudes up to three orders of magnitude larger than the two-temperature model predicts. Additionally THz-field dependent transients in the absence of DC bias were observed. We attribute the response in the THz regime to a rearrangement of vortices caused by high-frequency currents.
Submitted 12 April, 2012;
originally announced April 2012.
-
Double Counting in LDA+DMFT - The Example of NiO
Authors:
M. Karolak,
G. Ulm,
T. O. Wehling,
V. Mazurenko,
A. Poteryaev,
A. I. Lichtenstein
Abstract:
An intrinsic issue of the LDA+DMFT approach is the so called double counting of interaction terms. How to choose the double-counting potential in a manner that is both physically sound and consistent is unknown. We have conducted an extensive study of the charge transfer system NiO in the LDA+DMFT framework using quantum Monte Carlo and exact diagonalization as impurity solvers. By explicitly treating the double-counting correction as an adjustable parameter we systematically investigated the effects of different choices for the double counting on the spectral function. Different methods for fixing the double counting can drive the result from Mott insulating to almost metallic. We propose a reasonable scheme for the determination of double-counting corrections for insulating systems.
Submitted 27 April, 2010; v1 submitted 26 April, 2010;
originally announced April 2010.
-
Characterising a Si(Li) detector element for the SIXA X-ray spectrometer
Authors:
T. Tikkanen,
S. Kraft,
F. Scholze,
R. Thornagel,
G. Ulm
Abstract:
The detection efficiency and response function of a Si(Li) detector element for the SIXA spectrometer have been determined in the 500 eV to 5 keV energy range using synchrotron radiation emitted at a bending magnet of the electron storage ring BESSY, which is a primary radiation standard. The agreement between the measured spectrum and the model calculation is better than 2%.
PACS: 95.55.Ka; 07.85.Nc; 29.40.Wk; 85.30.De
Keywords: Si(Li) detectors, X-ray spectrometers, detector calibration, X-ray response, spectral lineshape
Submitted 4 March, 1997;
originally announced March 1997.