-
Solving Data Quality Problems with Desbordante: a Demo
Authors:
George Chernishev,
Michael Polyntsov,
Anton Chizhov,
Kirill Stupakov,
Ilya Shchuckin,
Alexander Smirnov,
Maxim Strutovsky,
Alexey Shlyonskikh,
Mikhail Firsov,
Stepan Manannikov,
Nikita Bobrov,
Daniil Goncharov,
Ilia Barutkin,
Vladislav Shalnev,
Kirill Muraviev,
Anna Rakhmukova,
Dmitriy Shcheka,
Anton Chernikov,
Mikhail Vyrodov,
Yaroslav Kurbatov,
Maxim Fofanov,
Sergei Belokonnyi,
Pavel Anosov,
Arthur Saliou,
Eduard Gaisin,
et al. (1 additional author not shown)
Abstract:
Data profiling is an essential process in modern data-driven industries. One of its critical components is the discovery and validation of complex statistics, including functional dependencies, data constraints, association rules, and others.
However, most existing data profiling systems that focus on complex statistics do not provide proper integration with the tools used by contemporary data scientists. This creates a significant barrier to the adoption of these tools in the industry. Moreover, existing systems were not created with industrial-grade workloads in mind. Finally, they do not aim to provide descriptive explanations, i.e. why a given pattern is not found. It is a significant issue as it is essential to understand the underlying reasons for a specific pattern's absence to make informed decisions based on the data.
Because of that, these patterns are effectively left hanging in thin air: their application scope is rather limited, and they are rarely used by the broader public. At the same time, as we are going to demonstrate in this presentation, complex statistics can be efficiently used to solve many classic data quality problems.
Desbordante is an open-source data profiler that aims to close this gap. It is built with emphasis on industrial application: it is efficient, scalable, resilient to crashes, and provides explanations. Furthermore, it provides seamless Python integration by offloading various costly operations to the C++ core, not only mining.
In this demonstration, we show several scenarios that allow end users to solve different data quality problems. Namely, we showcase typo detection, data deduplication, and data anomaly detection scenarios.
Submitted 28 July, 2023; v1 submitted 27 July, 2023;
originally announced July 2023.
-
Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)
Authors:
George Chernishev,
Michael Polyntsov,
Anton Chizhov,
Kirill Stupakov,
Ilya Shchuckin,
Alexander Smirnov,
Maxim Strutovsky,
Alexey Shlyonskikh,
Mikhail Firsov,
Stepan Manannikov,
Nikita Bobrov,
Daniil Goncharov,
Ilia Barutkin,
Vladislav Shalnev,
Kirill Muraviev,
Anna Rakhmukova,
Dmitriy Shcheka,
Anton Chernikov,
Dmitrii Mandelshtam,
Mikhail Vyrodov,
Arthur Saliou,
Eduard Gaisin,
Kirill Smirnov
Abstract:
Pioneering data profiling systems such as Metanome and OpenClean brought public attention to science-intensive data profiling. This type of profiling aims to extract complex patterns (primitives) such as functional dependencies, data constraints, association rules, and others. However, these tools are research prototypes rather than production-ready systems.
The following work presents Desbordante - a high-performance science-intensive data profiler with open source code. Unlike similar systems, it is built with emphasis on industrial application in a multi-user environment. It is efficient, resilient to crashes, and scalable. Its efficiency is ensured by implementing discovery algorithms in C++, resilience is achieved by extensive use of containerization, and scalability is based on replication of containers.
Desbordante aims to open industrial-grade primitive discovery to a broader public, focusing on domain experts who are not IT professionals. Aside from the discovery of various primitives, Desbordante offers primitive validation, which not only reports whether a given primitive instance holds, but also points out what prevents it from holding via the use of special screens. Next, Desbordante supports pipelines - ready-to-use functionality implemented using the discovered primitives, for example, typo detection. We provide built-in pipelines, and users can construct their own via the provided Python bindings. Unlike other profilers, Desbordante works not only with tabular data, but with graph and transactional data as well.
In this paper, we present Desbordante, the vision behind it and its use-cases. To provide a more in-depth perspective, we discuss its current state, architecture, and design decisions it is built on. Additionally, we outline our future plans.
Submitted 14 January, 2023;
originally announced January 2023.
-
EVOPS Benchmark: Evaluation of Plane Segmentation from RGBD and LiDAR Data
Authors:
Anastasiia Kornilova,
Dmitrii Iarosh,
Denis Kukushkin,
Nikolai Goncharov,
Pavel Mokeev,
Arthur Saliou,
Gonzalo Ferrer
Abstract:
This paper provides the EVOPS dataset for plane segmentation from 3D data, both from RGBD images and LiDAR point clouds. We have designed two annotation methodologies (RGBD and LiDAR) running on well-known and widely used datasets for SLAM evaluation, and we have provided a complete set of benchmarking tools including point, plane and segmentation metrics. The data includes a total of 10k RGBD and 7k LiDAR frames over different selected scenes, which consist of high-quality segmented planes. The experiments report the quality of SOTA methods for RGBD plane segmentation on our annotated data. We have also provided a learnable baseline for plane segmentation in LiDAR point clouds. All labeled data and benchmark tools used have been made publicly available at https://evops.netlify.app/.
Submitted 24 August, 2022; v1 submitted 12 April, 2022;
originally announced April 2022.
-
Interatomic machine learning potentials for aluminium: application to solidification phenomena
Authors:
Noel Jakse,
Johannes Sandberg,
Leon F. Granz,
Anthony Saliou,
Philippe Jarry,
Emilie Devijver,
Thomas Voigtmann,
Jürgen Horbach,
Andreas Meyer
Abstract:
In studying the solidification process by simulations on the atomic scale, the modeling of crystal nucleation or amorphisation requires the construction of interatomic interactions that are able to reproduce the properties of both the solid and the liquid states. Taking into account rare nucleation events or structural relaxation under deep undercooling conditions requires much larger length scales and longer time scales than those achievable by ab initio molecular dynamics (AIMD). This problem is addressed by means of classical MD simulations using a well-established high-dimensional neural network potential trained on a relevant set of configurations generated by AIMD. Our dataset contains various crystalline structures and liquid states at different pressures, including their time fluctuations in a wide range of temperatures, considering only their energy labels. Applied to elemental aluminium, the resulting potential is shown to efficiently reproduce the basic structural, dynamic and thermodynamic quantities in the liquid and undercooled states without the need to include either the forces explicitly or all kinds of configurations in the training procedure. The early stage of crystallization is further investigated on a much larger scale with one million atoms, allowing us to unravel features of the homogeneous nucleation mechanisms in the fcc phase at ambient pressure as well as in the bcc phase at high pressure with unprecedented accuracy close to the ab initio one. In both cases, a single-step nucleation process is observed.
Submitted 5 August, 2022; v1 submitted 4 January, 2022;
originally announced January 2022.
-
On the Excess Entropy Scaling Law: a Potential Energy Landscape View
Authors:
Anthony Saliou,
Philippe Jarry,
Noel Jakse
Abstract:
The relationship between excess entropy and diffusion is revisited by means of large-scale computer simulation combined with a supervised learning approach to determine the excess entropy for the Lennard-Jones potential. Results reveal that it finds its roots in the properties of the potential energy landscape (PEL). In particular, the exponential law holding in the liquid is seen to be correlated with the landscape-influenced regime of the PEL, while the fluid-like power law corresponds to the free diffusion regime.
Submitted 1 June, 2021; v1 submitted 27 May, 2021;
originally announced May 2021.