-
Provenance for Lattice QCD workflows
Authors:
Tanja Auge,
Gunnar Bali,
Meike Klettke,
Bertram Ludäscher,
Wolfgang Söldner,
Simon Weishäupl,
Tilo Wettig
Abstract:
We present a provenance model for the generic workflow of numerical Lattice Quantum Chromodynamics (QCD) calculations, which constitute an important component of particle physics research. These calculations are carried out on the largest supercomputers worldwide with data in the multi-PetaByte range being generated and analyzed. In the Lattice QCD community, a custom metadata standard (QCDml) that includes certain provenance information already exists for one part of the workflow, the so-called generation of configurations.
In this paper, we follow the W3C PROV standard and formulate a provenance model that includes both the generation part and the so-called measurement part of the Lattice QCD workflow. We demonstrate the applicability of this model and show how the model can be used to answer some provenance-related research questions. However, many important provenance questions in the Lattice QCD community require extensions of this provenance model. To this end, we propose a multi-layered provenance approach that combines prospective and retrospective elements.
Submitted 22 March, 2023;
originally announced March 2023.
-
Enhanced Inversion of Schema Evolution with Provenance
Authors:
Tanja Auge,
Andreas Heuer
Abstract:
Long-term data-driven studies have become indispensable in many areas of science. Often, the formats, structures and semantics of data change over time as the data sets evolve. Therefore, studies spanning several decades in particular have to consider changing database schemas. The evolution of these databases leads at some point to a large number of schemas, which have to be stored and managed, a costly and time-consuming task. However, for research data to be reproducible, each database version must be reconstructable with little effort, so that a previously published result can be validated and reproduced at any time.
Nevertheless, in many cases, such an evolution cannot be fully reconstructed. This article classifies the 15 most frequently used schema modification operators and defines the associated inverse for each operation. To avoid information loss, it furthermore defines which additional provenance information has to be stored. We define four classes dealing with dangling tuples, duplicates and provenance-invariant operators. Each class is presented by one representative.
By using and extending the theory of schema mappings and their inverses for queries, data analysis, why-provenance, and schema evolution, we are able to combine data analysis applications with provenance under evolving database structures, in order to enable the reproducibility of scientific results over longer periods of time. While most of the inverses of schema mappings used for analysis or evolution are not exact, but only quasi-inverses, adding provenance information enables us to reconstruct a sub-database of research data that is sufficient to guarantee reproducibility.
Submitted 24 November, 2022;
originally announced November 2022.
-
ChaTEAU: A Universal Toolkit for Applying the Chase
Authors:
Tanja Auge,
Nic Scharlau,
Andreas Görres,
Jakob Zimmer,
Andreas Heuer
Abstract:
What do applications like semantic optimization, data exchange and integration, answering queries under dependencies, query reformulation with constraints, and data cleaning have in common? All these applications can be processed by the Chase, a family of algorithms for reasoning with constraints. While the theory of the Chase is well understood, existing implementations are confined to specific use cases and application scenarios, making it difficult to reuse them in other settings. ChaTEAU overcomes this limitation: It takes the logical core of the Chase, generalizes it, and provides a software library for different Chase applications in a single toolkit.
Submitted 3 June, 2022;
originally announced June 2022.
-
Privacy Aspects of Provenance Queries
Authors:
Tanja Auge,
Nic Scharlau,
Andreas Heuer
Abstract:
Given a query result of a big database, why-provenance can be used to calculate the necessary part of this database, consisting of so-called witnesses. If this database consists of personal data, privacy protection has to prevent the publication of these witnesses. This implies a natural conflict of interest between publishing original data (provenance) and protecting these data (privacy).
In this paper, privacy goes beyond the concept of personal data protection. The paper gives an extended definition of privacy as intellectual property protection. If the provenance information is not sufficient to reconstruct a query result, additional data such as witnesses or provenance polynomials have to be published to guarantee traceability. Nevertheless, publishing this provenance information might be a problem if (significantly) more tuples than necessary can be derived from the original database. At this point, it is already possible to violate privacy policies, provided that quasi-identifiers are included in this provenance information. With this poster, we point out fundamental problems and discuss initial proposals for solutions.
Submitted 12 January, 2021;
originally announced January 2021.