
Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation

Authors: Yilin Xia, Shawn Bowers, Lan Li, Bertram Ludäscher

Abstract: We propose a new approach for modeling and reconciling conflicting data cleaning actions. Such conflicts arise naturally in collaborative data curation settings where multiple experts work independently and then aim to put their efforts together to improve and accelerate data cleaning. The key idea of our approach is to model conflicting updates as a formal \emph{argumentation framework} (AF). Such argumentation frameworks can be automatically analyzed and solved by translating them to a logic program $P_{AF}$ whose declarative semantics yield a transparent solution with many desirable properties, e.g., uncontroversial updates are accepted, unjustified ones are rejected, and the remaining ambiguities are exposed and presented to users for further analysis. After motivating the problem, we introduce our approach and illustrate it with a detailed running example introducing both well-founded and stable semantics to help understand the AF solutions. We have begun to develop open source tools and Jupyter notebooks that demonstrate the practicality of our approach. In future work we plan to develop a toolkit for conflict resolution that can be used in conjunction with OpenRefine, a popular interactive data cleaning tool.

Submitted 13 March, 2024; originally announced March 2024.

Comments: Accepted to IDCC 2024. Source code is available at https://github.com/idaks/Games-and-Argumentation/tree/idcc

arXiv:2310.05649 [pdf, other]

Context, Composition, Automation, and Communication -- The C2AC Roadmap for Modeling and Simulation

Authors: Adelinde Uhrmacher, Peter Frazier, Reiner Hähnle, Franziska Klügl, Fabian Lorig, Bertram Ludäscher, Laura Nenzi, Cristina Ruiz-Martin, Bernhard Rumpe, Claudia Szabo, Gabriel A. Wainer, Pia Wilsdorf

Abstract: Simulation has become, in many application areas, a sine-qua-non. Most recently, COVID-19 has underlined the importance of simulation studies and limitations in current practices and methods. We identify four goals of methodological work for addressing these limitations. The first is to provide better support for capturing, representing, and evaluating the context of simulation studies, including research questions, assumptions, requirements, and activities contributing to a simulation study. In addition, the composition of simulation models and other simulation studies' products must be supported beyond syntactical coherence, including aspects of semantics and purpose, enabling their effective reuse. A higher degree of automating simulation studies will contribute to more systematic, standardized simulation studies and their efficiency. Finally, it is essential to invest increased effort into effectively communicating results and the processes involved in simulation studies to enable their use in research and decision-making. These goals are not pursued independently of each other, but they will benefit from and sometimes even rely on advances in other subfields. In the present paper, we explore the basis and interdependencies evident in current research and practice and delineate future research directions based on these considerations.

Submitted 27 March, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

ACM Class: I.6

arXiv:2309.06620 [pdf, other]

Games and Argumentation: Time for a Family Reunion!

Authors: Bertram Ludäscher, Yilin Xia

Abstract: The rule "defeated(X) $\leftarrow$ attacks(Y,X), $\neg$ defeated(Y)" states that an argument is defeated if it is attacked by an argument that is not defeated. The rule "win(X) $\leftarrow$ move(X,Y), $\neg$ win(Y)" states that in a game a position is won if there is a move to a position that is not won. Both logic rules can be seen as close relatives (even identical twins) and both rules have been at the center of attention at various times in different communities: The first rule lies at the core of argumentation frameworks and has spawned a large family of models and semantics of abstract argumentation. The second rule has played a key role in the quest to find the "right" semantics for logic programs with recursion through negation, and has given rise to the stable and well-founded semantics. Both semantics have been widely studied by the logic programming and nonmonotonic reasoning community. The second rule has also received much attention by the database and finite model theory community, e.g., when studying the expressive power of query languages and fixpoint logics. Although close connections between argumentation frameworks, logic programming, and dialogue games have been known for a long time, the overlap and cross-fertilization between the communities appears to be smaller than one might expect. To this end, we recall some of the key results from database theory in which the win-move query has played a central role, e.g., on normal forms and expressive power of query languages.
We introduce some notions that naturally emerge from games and that may provide new perspectives and research opportunities for argumentation frameworks. We discuss how solved query evaluation games reveal how- and why-not provenance of query answers. These techniques can be used to explain how results were derived via the given query, game, or argumentation framework.
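
As a rough illustration of the win-move rule quoted in this abstract, the following Python sketch labels the positions of a finite game as won, lost, or drawn, which corresponds to the true, false, and undefined atoms of the rule's well-founded model. It is not taken from the paper or its source code; the function name `solve_game` and the dictionary-based input format are assumptions made for this example.

```python
def solve_game(moves):
    """Label positions for the rule  win(X) <- move(X,Y), not win(Y).

    `moves` maps a position to the positions reachable in one move
    (hypothetical input format chosen for this sketch).
    A position is 'won' if some move leads to a 'lost' position,
    'lost' if every move leads to a 'won' position (including the
    case of no moves at all), and 'drawn' otherwise.
    """
    positions = set(moves) | {q for targets in moves.values() for q in targets}
    succ = {p: set(moves.get(p, ())) for p in positions}
    label = {}
    changed = True
    while changed:  # iterate until no new position can be labeled
        changed = False
        for p in positions:
            if p in label:
                continue
            if any(label.get(q) == "lost" for q in succ[p]):
                label[p] = "won"
                changed = True
            elif all(label.get(q) == "won" for q in succ[p]):
                label[p] = "lost"
                changed = True
    # Positions never labeled are neither provably won nor lost.
    return {p: label.get(p, "drawn") for p in positions}


example = {"a": ["b"], "b": ["c"], "d": ["e"], "e": ["d"]}
result = solve_game(example)
```

Here `c` has no moves and is lost, `b` is won, `a` is lost, and the cycle between `d` and `e` leaves both drawn. Reading `move(X,Y)` as `attacks(Y,X)` turns the same computation into one for the `defeated` rule, with "won" playing the role of "defeated".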

Submitted 12 September, 2023; originally announced September 2023.

Comments: Fourth Workshop on Explainable Logic-Based Knowledge Representation (XLoKR), Sept 2, 2023. Rhodes, Greece

arXiv:2303.12640 [pdf, other]

doi 10.1145/3543873.3587559

Provenance for Lattice QCD workflows

Authors: Tanja Auge, Gunnar Bali, Meike Klettke, Bertram Ludäscher, Wolfgang Söldner, Simon Weishäupl, Tilo Wettig

Abstract: We present a provenance model for the generic workflow of numerical Lattice Quantum Chromodynamics (QCD) calculations, which constitute an important component of particle physics research. These calculations are carried out on the largest supercomputers worldwide with data in the multi-PetaByte range being generated and analyzed. In the Lattice QCD community, a custom metadata standard (QCDml) that includes certain provenance information already exists for one part of the workflow, the so-called generation of configurations. In this paper, we follow the W3C PROV standard and formulate a provenance model that includes both the generation part and the so-called measurement part of the Lattice QCD workflow. We demonstrate the applicability of this model and show how the model can be used to answer some provenance-related research questions. However, many important provenance questions in the Lattice QCD community require extensions of this provenance model. To this end, we propose a multi-layered provenance approach that combines prospective and retrospective elements.

Submitted 22 March, 2023; originally announced March 2023.

arXiv:2301.04770 [pdf, other]

KAER: A Knowledge Augmented Pre-Trained Language Model for Entity Resolution

Authors: Liri Fang, Lan Li, Yiren Liu, Vetle I. Torvik, Bertram Ludäscher

Abstract: Entity resolution has been an essential and well-studied task in data cleaning research for decades. Existing work has discussed the feasibility of utilizing pre-trained language models to perform entity resolution and achieved promising results. However, few works have discussed injecting domain knowledge to improve the performance of pre-trained language models on entity resolution tasks. In this study, we propose Knowledge Augmented Entity Resolution (KAER), a novel framework for augmenting pre-trained language models with external knowledge for entity resolution. We discuss the results of utilizing different knowledge augmentation and prompting methods to improve entity resolution performance. Our model improves on Ditto, the existing state-of-the-art entity resolution method. In particular, 1) KAER performs more robustly and achieves better results on "dirty data"; 2) with more general knowledge injection, KAER outperforms the existing baseline models on the textual dataset and on the dataset from the online product domain; and 3) KAER achieves competitive results on highly domain-specific datasets, such as citation datasets, which will require the injection of expert knowledge in future work.

Submitted 11 January, 2023; originally announced January 2023.

arXiv:2112.08259 [pdf, other]

or2yw: Modeling and Visualizing OpenRefine Histories as YesWorkflow Diagrams

Authors: Nikolaus Nova Parulian, Lan Li, Bertram Ludaescher

Abstract: OpenRefine is a popular open-source data cleaning tool. It allows users to export a previously executed data cleaning workflow in a JSON format for possible reuse on other datasets. We have developed or2yw, a novel tool that maps a JSON-formatted OpenRefine operation history to a YesWorkflow (YW) model, which then can be visualized and queried using the YW tool. The latter was originally developed to allow researchers a simple way to annotate their program scripts in order to reveal the workflow steps and dataflow dependencies implicit in those scripts. With or2yw the user can automatically generate YW models from OpenRefine operation histories, thus providing a 'workflow view' on a previously executed sequence of data cleaning operations. The or2yw tool can generate different types of YesWorkflow models, e.g., a linear model which mirrors the sequential execution order of operations in OpenRefine, and a \emph{parallel model} which reveals independent workflow branches, based on a simple analysis of dependencies between steps: if two operations are independent of each other (e.g., when the columns they read and write do not overlap) then these can be viewed as parallel steps in the data cleaning workflow. The resulting YW models can be understood as a form of prospective provenance, i.e., knowledge artifacts that can be queried and visualized (i) to help authors document their own data cleaning workflows, thereby increasing transparency, and (ii) to help other users, who might want to reuse such workflows, to understand them better.
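
The column-overlap test mentioned in this abstract can be sketched as follows. This is an illustrative assumption about how such a dependency analysis might look, not the or2yw implementation: the `Op` class, its `reads`/`writes` fields, and the helper names are all hypothetical and do not reflect the OpenRefine JSON history format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical stand-in for one data cleaning operation."""
    name: str
    reads: frozenset
    writes: frozenset


def independent(a, b):
    """True if neither operation writes a column the other reads or writes."""
    return not (a.writes & (b.reads | b.writes)) and not (b.writes & a.reads)


def dependencies(ops):
    """Return (earlier, later) pairs for operations that must stay ordered.

    `ops` is given in execution order; pairs that are absent from the
    result could be drawn as parallel branches.
    """
    edges = []
    for j, later in enumerate(ops):
        for earlier in ops[:j]:
            if not independent(earlier, later):
                edges.append((earlier.name, later.name))
    return edges


history = [
    Op("trim_city", frozenset({"city"}), frozenset({"city"})),
    Op("upper_name", frozenset({"name"}), frozenset({"name"})),
    Op("concat", frozenset({"city", "name"}), frozenset({"full"})),
]
edges = dependencies(history)
```

In this made-up history, `trim_city` and `upper_name` touch disjoint columns and so have no edge between them, while `concat` depends on both.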

Submitted 15 December, 2021; originally announced December 2021.

arXiv:2106.05177 [pdf, other]

doi 10.5281/zenodo.4915801

Workflows Community Summit: Advancing the State-of-the-art of Scientific Workflows Management Systems Research and Development

Authors: Rafael Ferreira da Silva, Henri Casanova, Kyle Chard, Tainã Coleman, Dan Laney, Dong Ahn, Shantenu Jha, Dorran Howell, Stian Soiland-Reys, Ilkay Altintas, Douglas Thain, Rosa Filgueira, Yadu Babuji, Rosa M. Badia, Bartosz Balis, Silvina Caino-Lores, Scott Callaghan, Frederik Coppens, Michael R. Crusoe, Kaushik De, Frank Di Natale, Tu M. A. Do, Bjoern Enders, Thomas Fahringer, Anne Fouilloux, et al. (33 additional authors not shown)

Abstract: Scientific workflows are a cornerstone of modern scientific computing, and they have underpinned some of the most significant discoveries of the last decade. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale HPC platforms. Workflows will play a crucial role in the data-oriented and post-Moore's computing landscape as they democratize the application of cutting-edge research techniques, computationally intensive methods, and use of new computing platforms. As workflows continue to be adopted by scientific projects and user communities, they are becoming more complex. Workflows are increasingly composed of tasks that perform computations such as short machine learning inference, multi-node simulations, long-running machine learning model training, amongst others, and thus increasingly rely on heterogeneous architectures that include CPUs but also GPUs and accelerators. The workflow management system (WMS) technology landscape is currently segmented and presents significant barriers to entry due to the hundreds of seemingly comparable, yet incompatible, systems that exist. Another fundamental problem is that there are conflicting theoretical bases and abstractions for a WMS. Systems that use the same underlying abstractions can likely be translated between, which is not the case for systems that use different abstractions. More information: https://workflowsri.org/summits/technical

Submitted 9 June, 2021; originally announced June 2021.

arXiv:2103.09181 [pdf, other]

doi 10.5281/zenodo.4606958

Workflows Community Summit: Bringing the Scientific Workflows Community Together

Authors: Rafael Ferreira da Silva, Henri Casanova, Kyle Chard, Dan Laney, Dong Ahn, Shantenu Jha, Carole Goble, Lavanya Ramakrishnan, Luc Peterson, Bjoern Enders, Douglas Thain, Ilkay Altintas, Yadu Babuji, Rosa M. Badia, Vivien Bonazzi, Taina Coleman, Michael Crusoe, Ewa Deelman, Frank Di Natale, Paolo Di Tommaso, Thomas Fahringer, Rosa Filgueira, Grigori Fursin, Alex Ganose, Bjorn Gruning, et al. (20 additional authors not shown)

Abstract: Scientific workflows have been used almost universally across scientific domains, and have underpinned some of the most significant discoveries of the past several decades. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale high-performance computing (HPC) platforms. These executions must be managed using some software infrastructure. Due to the popularity of workflows, workflow management systems (WMSs) have been developed to provide abstractions for creating and executing workflows conveniently, efficiently, and portably. While these efforts are all worthwhile, there are now hundreds of independent WMSs, many of which are moribund. As a result, the WMS landscape is segmented and presents significant barriers to entry due to the hundreds of seemingly comparable, yet incompatible, systems that exist. As a result, many teams, small and large, still elect to build their own custom workflow solution rather than adopt, or build upon, existing WMSs. This current state of the WMS landscape negatively impacts workflow users, developers, and researchers. The "Workflows Community Summit" was held online on January 13, 2021. The overarching goal of the summit was to develop a view of the state of the art and identify crucial research challenges in the workflow community.
Prior to the summit, a survey sent to stakeholders in the workflow community (including both developers of WMSs and users of workflows) helped to identify key challenges in this community that were translated into 6 broad themes for the summit, each of them being the object of a focused discussion led by a volunteer member of the community. This report documents and organizes the wealth of information provided by the participants before, during, and after the summit.

Submitted 16 March, 2021; originally announced March 2021.

arXiv:2005.06087 [pdf, other]

doi 10.3233/APC200107

Toward Enabling Reproducibility for Data-Intensive Research using the Whole Tale Platform

Authors: Kyle Chard, Niall Gaffney, Mihael Hategan, Kacper Kowalik, Bertram Ludaescher, Timothy McPhillips, Jarek Nabrzyski, Victoria Stodden, Ian Taylor, Thomas Thelen, Matthew J. Turk, Craig Willis

Abstract: Whole Tale http://wholetale.org is a web-based, open-source platform for reproducible research supporting the creation, sharing, execution, and verification of "Tales" for the scientific research community. Tales are executable research objects that capture the code, data, and environment along with narrative and workflow information needed to re-create computational results from scientific studies. Creating reproducible research objects that enable reproducibility, transparency, and re-execution for computational experiments requiring significant compute resources or utilizing massive data is an especially challenging open problem. We describe opportunities, challenges, and solutions to facilitating reproducibility for data- and compute-intensive research, that we call "Tales at Scale," using the Whole Tale computing platform. We highlight challenges and solutions in frontend responsiveness needs, gaps in current middleware design and implementation, network restrictions, containerization, and data access. Finally, we discuss challenges in packaging computational experiment implementations for portable data-intensive Tales and outline future work.

Submitted 12 May, 2020; originally announced May 2020.

Journal ref: Advances in Parallel Computing 2020

arXiv:2002.00084 [pdf, other]

Approximate Summaries for Why and Why-not Provenance (Extended Version)

Authors: Seokki Lee, Bertram Ludaescher, Boris Glavic

Abstract: Why and why-not provenance have been studied extensively in recent years. However, why-not provenance, and to a lesser degree why provenance, can be very large resulting in severe scalability and usability challenges. In this paper, we introduce a novel approximate summarization technique for provenance which overcomes these challenges. Our approach uses patterns to encode (why-not) provenance concisely. We develop techniques for efficiently computing provenance summaries balancing informativeness, conciseness, and completeness. To achieve scalability, we integrate sampling techniques into provenance capture and summarization. Our approach is the first to scale to large datasets and to generate comprehensive and meaningful summaries.

Submitted 27 April, 2020; v1 submitted 31 January, 2020; originally announced February 2020.

arXiv:1808.05752 [pdf, other]

PUG: A Framework and Practical Implementation for Why & Why-Not Provenance (extended version)

Authors: Seokki Lee, Bertram Ludaescher, Boris Glavic

Abstract: Explaining why an answer is (or is not) returned by a query is important for many applications including auditing, debugging data and queries, and answering hypothetical questions about data. In this work, we present the first practical approach for answering such questions for queries with negation (first-order queries). Specifically, we introduce a graph-based provenance model that, while syntactic in nature, supports reverse reasoning and is proven to encode a wide range of provenance models from the literature. The implementation of this model in our PUG (Provenance Unification through Graphs) system takes a provenance question and Datalog query as an input and generates a Datalog program that computes an explanation, i.e., the part of the provenance that is relevant to answer the question. Furthermore, we demonstrate how a desirable factorization of provenance can be achieved by rewriting an input query. We experimentally evaluate our approach demonstrating its efficiency.

Submitted 15 August, 2018; originally announced August 2018.

Comments: Extended version of VLDB journal article of the same name. arXiv admin note: text overlap with arXiv:1701.05699

Report number: IIT/CS-DB-2018-02

arXiv:1807.09899 [pdf, other]

Validation and Inference of Schema-Level Workflow Data-Dependency Annotations

Authors: Shawn Bowers, Timothy McPhillips, Bertram Ludäscher

Abstract: An advantage of scientific workflow systems is their ability to collect runtime provenance information as an execution trace. Traces include the computation steps invoked as part of the workflow run along with the corresponding data consumed and produced by each workflow step. The information captured by a trace is used to infer "lineage" relationships among data items, which can help answer provenance queries to find workflow inputs that were involved in producing specific workflow outputs. Determining lineage relationships, however, requires an understanding of the dependency patterns that exist between each workflow step's inputs and outputs, and this information is often under-specified or generally assumed by workflow systems. For instance, most approaches assume all outputs depend on all inputs, which can lead to lineage "false positives". In prior work, we defined annotations for specifying detailed dependency relationships between inputs and outputs of computation steps. These annotations are used to define corresponding rules for inferring fine-grained data dependencies from a trace. In this paper, we extend our previous work by considering the impact of dependency annotations on workflow specifications. In particular, we provide a reasoning framework to ensure the set of dependency annotations on a workflow specification is consistent. The framework can also infer a complete set of annotations given a partially annotated workflow. Finally, we describe an implementation of the reasoning framework using answer-set programming.

Submitted 25 July, 2018; originally announced July 2018.

Comments: To appear in: Provenance and Annotation of Data and Processes - 7th International Provenance and Annotation Workshop, IPAW 2018, King's College London, UK, July 9-10, 2018, Proceedings

arXiv:1805.00400 [pdf, other]

Computing Environments for Reproducibility: Capturing the "Whole Tale"

Authors: Adam Brinckman, Kyle Chard, Niall Gaffney, Mihael Hategan, Matthew B. Jones, Kacper Kowalik, Sivakumar Kulasekaran, Bertram Ludäscher, Bryce D. Mecum, Jarek Nabrzyski, Victoria Stodden, Ian J. Taylor, Matthew J. Turk, Kandace Turner

Abstract: The act of sharing scientific knowledge is rapidly evolving away from traditional articles and presentations to the delivery of executable objects that integrate the data and computational details (e.g., scripts and workflows) upon which the findings rely. This envisioned coupling of data and process is essential to advancing science but faces technical and institutional barriers. The Whole Tale project aims to address these barriers by connecting computational, data-intensive research efforts with the larger research process--transforming the knowledge discovery and dissemination process into one where data products are united with research articles to create "living publications" or "tales". The Whole Tale focuses on the full spectrum of science, empowering users in the long tail of science, and power users with demands for access to big data and compute resources. We report here on the design, architecture, and implementation of the Whole Tale environment.

Submitted 1 May, 2018; originally announced May 2018.

Comments: Future Generation Computer Systems, 2018

arXiv:1701.05699 [pdf, other]

Efficiently Computing Provenance Graphs for Queries with Negation

Authors: Seokki Lee, Sven Koehler, Bertram Ludaescher, Boris Glavic

Abstract: Explaining why an answer is in the result of a query or why it is missing from the result is important for many applications including auditing, debugging data and queries, and answering hypothetical questions about data. Both types of questions, i.e., why and why-not provenance, have been studied extensively. In this work, we present the first practical approach for answering such questions for queries with negation (first-order queries). Our approach is based on a rewriting of Datalog rules (called firing rules) that captures successful rule derivations within the context of a Datalog query. We extend this rewriting to support negation and to capture failed derivations that explain missing answers. Given a (why or why-not) provenance question, we compute an explanation, i.e., the part of the provenance that is relevant to answer the question. We introduce optimizations that prune parts of a provenance graph early on if we can determine that they will not be part of the explanation for a given question. We present an implementation that runs on top of a relational database using SQL to compute explanations. Our experiments demonstrate that our approach scales to large instances and significantly outperforms an earlier approach which instantiates the full provenance to compute explanations.

Submitted 20 January, 2017; originally announced January 2017.

Comments: Illinois Institute of Technology, IIT/CS-DB-2016-03

arXiv:1610.09958 [pdf]

Capturing the "Whole Tale" of Computational Research: Reproducibility in Computing Environments

Authors: Bertram Ludaescher, Kyle Chard, Niall Gaffney, Matthew B. Jones, Jaroslaw Nabrzyski, Victoria Stodden, Matthew Turk

Abstract: We present an overview of the recently funded "Merging Science and Cyberinfrastructure Pathways: The Whole Tale" project (NSF award #1541450). Our approach has two nested goals: 1) deliver an environment that enables researchers to create a complete narrative of the research process including exposure of the data-to-publication lifecycle, and 2) systematically and persistently link research publications to their associated digital scholarly objects such as the data, code, and workflows. To enable this, Whole Tale will create an environment where researchers can collaborate on data, workspaces, and workflows and then publish them for future adoption or modification. Published data and applications will be consumed either directly by users using the Whole Tale environment or can be integrated into existing or future domain Science Gateways.

Submitted 28 October, 2016; originally announced October 2016.

Report number: Gateways2016 paper 30

arXiv:1502.02403 [pdf, other]

YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts

Authors: Timothy McPhillips, Tianhong Song, Tyler Kolisnik, Steve Aulenbach, Khalid Belhajjame, Kyle Bocinsky, Yang Cao, Fernando Chirigati, Saumen Dey, Juliana Freire, Deborah Huntzinger, Christopher Jones, David Koop, Paolo Missier, Mark Schildhauer, Christopher Schwalm, Yaxing Wei, James Cheney, Mark Bieda, Bertram Ludaescher

Abstract: Scientific workflow management systems offer features for composing complex computational pipelines from modular building blocks, for executing the resulting automated workflows, and for recording the provenance of data products resulting from workflow runs. Despite the advantages such features provide, many automated workflows continue to be implemented and executed outside of scientific workflow systems due to the convenience and familiarity of scripting languages (such as Perl, Python, R, and MATLAB), and to the high productivity many scientists experience when using these languages. YesWorkflow is a set of software tools that aim to provide such users of scripting languages with many of the benefits of scientific workflow systems. YesWorkflow requires neither the use of a workflow engine nor the overhead of adapting code to run effectively in such a system. Instead, YesWorkflow enables scientists to annotate existing scripts with special comments that reveal the computational modules and dataflows otherwise implicit in these scripts. YesWorkflow tools extract and analyze these comments, represent the scripts in terms of entities based on the typical scientific workflow model, and provide graphical renderings of this workflow-like view of the scripts. Future versions of YesWorkflow also will allow the prospective provenance of the data products of these scripts to be queried in ways similar to those available to users of scientific workflow systems.

Submitted 9 February, 2015; originally announced February 2015.

arXiv:1312.2919 [pdf, ps, other]

doi 10.1145/2274576.2274588

Win-Move is Coordination-Free (Sometimes)

Authors: Daniel Zinn, Todd J Green, Bertram Ludäscher

Abstract: In a recent paper by Hellerstein [15], a tight relationship was conjectured between the number of strata of a Datalog${}^\neg$ program and the number of "coordination stages" required for its distributed computation. Indeed, Ameloot et al. [9] showed that a query can be computed by a coordination-free relational transducer network iff it is monotone, thus answering in the affirmative a variant of Hellerstein's CALM conjecture, based on a particular definition of coordination-free computation. In this paper, we present three additional models for declarative networking. In these variants, relational transducers have limited access to the way data is distributed. This variation allows transducer networks to compute more queries in a coordination-free manner: e.g., a transducer can check whether a ground atom $A$ over the input schema is in the "scope" of the local node, and then send either $A$ or $\neg A$ to other nodes. We show the surprising result that the query given by the well-founded semantics of the unstratifiable win-move program is coordination-free in some of the models we consider. We also show that the original transducer network model [9] and our variants form a strict hierarchy of classes of coordination-free queries. Finally, we identify different syntactic fragments of Datalog${}^{\neg\neg}_{\forall}$, called semi-monotone programs, which can be used as declarative network programming languages, whose distributed computation is guaranteed to be eventually consistent and coordination-free.

Submitted 10 December, 2013; originally announced December 2013.

Comments: Proceedings of the 15th International Conference on Database Theory. Pages 99-113. March 26-30, 2012, Berlin, Germany

ACM Class: H.2.4

arXiv:1311.4610 [pdf, other]

doi 10.1007/s13222-012-0100-z

Scientific Workflows and Provenance: Introduction and Research Opportunities

Authors: Víctor Cuevas-Vicenttín, Saumen Dey, Sven Köhler, Sean Riddle, Bertram Ludäscher

Abstract: Scientific workflows are becoming increasingly popular for compute-intensive and data-intensive scientific applications. The vision and promise of scientific workflows includes rapid, easy workflow design, reuse, scalable execution, and other advantages, e.g., to facilitate "reproducible science" through provenance (e.g., data lineage) support. However, as described in the paper, important research challenges remain. While the database community has studied (business) workflow technologies extensively in the past, most current work in scientific workflows seems to be done outside of the database community, e.g., by practitioners and researchers in the computational sciences and eScience. We provide a brief introduction to scientific workflows and provenance, and identify areas and problems that suggest new opportunities for database research.

Submitted 23 November, 2013; v1 submitted 18 November, 2013; originally announced November 2013.

Comments: 12 pages, 2 figures

Journal ref: Datenbank-Spektrum, November 2012, Volume 12, Issue 3, pp 193-203

arXiv:1309.2655 [pdf, other]

doi 10.1007/978-3-642-41660-6_20

First-Order Provenance Games

Authors: Sven Köhler, Bertram Ludäscher, Daniel Zinn

Abstract: We propose a new model of provenance, based on a game-theoretic approach to query evaluation. First, we study games G in their own right, and ask how to explain that a position x in G is won, lost, or drawn. The resulting notion of game provenance is closely related to winning strategies, and excludes from provenance all "bad moves", i.e., those which unnecessarily allow the opponent to improve the outcome of a play. In this way, the value of a position is determined by its game provenance. We then define provenance games by viewing the evaluation of a first-order query as a game between two players who argue whether a tuple is in the query answer. For RA+ queries, we show that game provenance is equivalent to the most general semiring of provenance polynomials N[X]. Variants of our game yield other known semirings. However, unlike semiring provenance, game provenance also provides a "built-in" way to handle negation and thus to answer why-not questions: In (provenance) games, the reason why x is not won, is the same as why x is lost or drawn (the latter is possible for games with draws). Since first-order provenance games are draw-free, they yield a new provenance model that combines how- and why-not provenance.

Submitted 10 September, 2013; originally announced September 2013.

Journal ref: Peter Buneman Festschrift, LNCS 8000, 2013
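
Illustrative note (not part of the arXiv record): the abstract above asks how a position x in a game G comes to be won, lost, or drawn. A minimal sketch of that three-way labeling follows, assuming the standard win-move rule that a position is won if some move leads to a lost position, lost if every move leads to a won position (including when no move exists), and drawn otherwise. The function name `solve_game` and the example move graph are hypothetical, and the sketch does not compute the paper's game provenance itself.

```python
def solve_game(moves):
    """Label each position of a win-move game as 'won', 'lost', or 'drawn'.

    `moves` is an iterable of (x, y) pairs meaning "a player at x may move to y".
    This is an illustrative fixpoint iteration, not the paper's implementation.
    """
    moves = list(moves)

    # Collect all positions and their successor sets.
    positions = set()
    for x, y in moves:
        positions.update((x, y))
    succ = {p: set() for p in positions}
    for x, y in moves:
        succ[x].add(y)

    label = {}
    changed = True
    while changed:
        changed = False
        for p in positions:
            if p in label:
                continue
            if any(label.get(q) == "lost" for q in succ[p]):
                # Some move hands the opponent a lost position.
                label[p] = "won"
                changed = True
            elif all(label.get(q) == "won" for q in succ[p]):
                # Every move (possibly none) hands the opponent a won position.
                label[p] = "lost"
                changed = True

    # Positions never labeled by the iteration are drawn.
    return {p: label.get(p, "drawn") for p in positions}


# Hypothetical example: a -> b -> c is a finite chain, d <-> e is a cycle.
example = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "d")]
result = solve_game(example)
```

In this example `c` has no moves and is lost, so `b` is won and `a` is lost, while `d` and `e` can move back and forth forever and remain drawn.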

Showing 1–19 of 19 results for author: Ludaescher, B