-
MATILDA: Inclusive Data Science Pipelines Design through Computational Creativity
Authors:
Genoveva Vargas-Solar,
Santiago Negrete-Yankelevich,
Javier A. Espinosa-Oviedo,
Khalid Belhajjame,
José-Luis Zechinelli-Martini
Abstract:
We argue for the need for a new generation of data science solutions that can democratize recent advances in data engineering and artificial intelligence for non-technical users from various disciplines, enabling them to unlock the full potential of these solutions. To do so, we adopt an approach whereby computational creativity and conversational computing are combined to guide non-specialists intuitively to explore and extract knowledge from data collections. The paper introduces MATILDA, a creativity-based data science design platform, showing how it can support the design process of data science pipelines guided by human and computational creativity.
Submitted 17 November, 2023;
originally announced November 2023.
-
A framework for mining process models from emails logs
Authors:
Diana Jlailaty,
Daniela Grigori,
Khalid Belhajjame
Abstract:
Due to its wide use in personal and, most importantly, professional contexts, email represents a valuable source of information that can be harvested for understanding, reengineering and repurposing undocumented business processes of companies and institutions. Towards this aim, a few researchers have investigated the problem of extracting process-oriented information from email logs in order to take advantage of the many available process mining techniques and tools. In this paper we go further in this direction by proposing a new method for mining process models from email logs that leverages unsupervised machine learning techniques with little human involvement. Moreover, our method allows emails to be semi-automatically labeled with activity names, which can be used for activity recognition in new incoming emails. A use case demonstrates the usefulness of the proposed solution using a modest-sized, yet real-world, dataset containing emails that belong to two different process models.
Submitted 20 September, 2016;
originally announced September 2016.
-
Automatic vs Manual Provenance Abstractions: Mind the Gap
Authors:
Pinar Alper,
Khalid Belhajjame,
Carole A. Goble
Abstract:
In recent years the need to simplify or to hide sensitive information in provenance has given rise to research on provenance abstraction. In the context of scientific workflows, existing research provides techniques to semi-automatically create abstractions of a given workflow description, which are in turn used as filters over the workflow's provenance traces. An alternative approach that is commonly adopted by scientists is to build workflows with abstractions embedded into the workflow's design, such as using sub-workflows. This paper reports on the comparison of manual versus semi-automated approaches in a context where result abstractions are used to filter report-worthy results of computational scientific analyses. Specifically, we take a real-world workflow containing user-created design abstractions and compare these with abstractions created by the ZOOM UserViews and Workflow Summaries systems. Our comparison shows that semi-automatic and manual approaches largely overlap from a process perspective; however, there is a dramatic mismatch in terms of the data artefacts retained in an abstracted account of derivation. We discuss reasons and suggest future research directions.
Submitted 21 May, 2016;
originally announced May 2016.
-
YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts
Authors:
Timothy McPhillips,
Tianhong Song,
Tyler Kolisnik,
Steve Aulenbach,
Khalid Belhajjame,
Kyle Bocinsky,
Yang Cao,
Fernando Chirigati,
Saumen Dey,
Juliana Freire,
Deborah Huntzinger,
Christopher Jones,
David Koop,
Paolo Missier,
Mark Schildhauer,
Christopher Schwalm,
Yaxing Wei,
James Cheney,
Mark Bieda,
Bertram Ludaescher
Abstract:
Scientific workflow management systems offer features for composing complex computational pipelines from modular building blocks, for executing the resulting automated workflows, and for recording the provenance of data products resulting from workflow runs. Despite the advantages such features provide, many automated workflows continue to be implemented and executed outside of scientific workflow systems due to the convenience and familiarity of scripting languages (such as Perl, Python, R, and MATLAB), and to the high productivity many scientists experience when using these languages. YesWorkflow is a set of software tools that aim to provide such users of scripting languages with many of the benefits of scientific workflow systems. YesWorkflow requires neither the use of a workflow engine nor the overhead of adapting code to run effectively in such a system. Instead, YesWorkflow enables scientists to annotate existing scripts with special comments that reveal the computational modules and dataflows otherwise implicit in these scripts. YesWorkflow tools extract and analyze these comments, represent the scripts in terms of entities based on the typical scientific workflow model, and provide graphical renderings of this workflow-like view of the scripts. Future versions of YesWorkflow also will allow the prospective provenance of the data products of these scripts to be queried in ways similar to those available to users of scientific workflow systems.
Submitted 9 February, 2015;
originally announced February 2015.
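The annotation idea described in this abstract can be illustrated with a minimal sketch. It assumes comment tags of the form `@begin`, `@end`, `@in` and `@out`; the toy script, the function names `extract_blocks` and `dataflow_edges`, and the regular expression are all hypothetical and are not the YesWorkflow implementation. The sketch ignores nested blocks and only shows how comments alone can expose modules and dataflows.

```python
import re

# A toy script annotated with YesWorkflow-style comments (hypothetical example).
ANNOTATED_SCRIPT = '''
# @begin clean_temperatures
# @in raw_readings
# @out cleaned_readings
def clean(raw_readings):
    return [r for r in raw_readings if r is not None]
# @end clean_temperatures

# @begin compute_mean
# @in cleaned_readings
# @out mean_temperature
def mean(cleaned_readings):
    return sum(cleaned_readings) / len(cleaned_readings)
# @end compute_mean
'''

# Matches a comment carrying one of the four tags followed by a name.
TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")


def extract_blocks(source):
    """Collect each annotated block with its declared inputs and outputs."""
    blocks = []
    current = None
    for line in source.splitlines():
        match = TAG.search(line)
        if not match:
            continue
        tag, name = match.groups()
        if tag == "begin":
            current = {"name": name, "in": [], "out": []}
        elif tag == "end":
            if current is not None:
                blocks.append(current)
            current = None
        elif current is not None:
            current[tag].append(name)
    return blocks


def dataflow_edges(blocks):
    """Link a block that outputs a name to every block that inputs that name."""
    edges = []
    for producer in blocks:
        for consumer in blocks:
            for name in producer["out"]:
                if name in consumer["in"]:
                    edges.append((producer["name"], name, consumer["name"]))
    return edges


blocks = extract_blocks(ANNOTATED_SCRIPT)
edges = dataflow_edges(blocks)
```

Here the only inferred edge connects `clean_temperatures` to `compute_mean` through `cleaned_readings`, which is the kind of workflow-like view the abstract describes, recovered without running the script.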
-
The Research Object Suite of Ontologies: Sharing and Exchanging Research Data and Methods on the Open Web
Authors:
Khalid Belhajjame,
Jun Zhao,
Daniel Garijo,
Kristina Hettne,
Raul Palma,
Óscar Corcho,
José-Manuel Gómez-Pérez,
Sean Bechhofer,
Graham Klyne,
Carole Goble
Abstract:
Research in life sciences is increasingly being conducted in a digital and online environment. In particular, life scientists have been pioneers in embracing new computational tools to conduct their investigations. To support the sharing of digital objects produced during such research investigations, we have witnessed in the last few years the emergence of specialized repositories, e.g., DataVerse and FigShare. Such repositories provide users with the means to share and publish datasets that were used or generated in research investigations. While these repositories have proven their usefulness, interpreting and reusing evidence for most research results is a challenging task. Additional contextual descriptions are needed to understand how those results were generated and/or the circumstances under which they were concluded. Because of this, scientists are calling for models that go beyond the publication of datasets to systematically capture the life cycle of scientific investigations and provide a single entry point to access the information about the hypothesis investigated, the datasets used, the experiments carried out, the results of the experiments, the people involved in the research, etc. In this paper we present the Research Object (RO) suite of ontologies, which provide a structured container to encapsulate research data and methods along with essential metadata descriptions. Research Objects are portable units that enable the sharing, preservation, interpretation and reuse of research investigation results. The ontologies we present have been designed in the light of requirements that we gathered from life scientists. They have been built upon existing popular vocabularies to facilitate interoperability. Furthermore, we have developed tools to support the creation and sharing of Research Objects, thereby promoting and facilitating their adoption.
Submitted 3 February, 2014; v1 submitted 17 January, 2014;
originally announced January 2014.
-
Structuring research methods and data with the Research Object model: genomics workflows as a case study
Authors:
Kristina M. Hettne,
Harish Dharuri,
Jun Zhao,
Katherine Wolstencroft,
Khalid Belhajjame,
Stian Soiland-Reyes,
Eleni Mina,
Mark Thompson,
Don Cruickshank,
Lourdes Verdes-Montenegro,
Julian Garrido,
David de Roure,
Oscar Corcho,
Graham Klyne,
Reinout van Schouwen,
Peter A. C. 't Hoen,
Sean Bechhofer,
Carole Goble,
Marco Roos
Abstract:
One of the main challenges for biomedical research lies in the computer-assisted integrative study of large and increasingly complex combinations of data in order to understand molecular mechanisms. The preservation of the materials and methods of such computational experiments with clear annotations is essential for understanding an experiment, and this is increasingly recognized in the bioinformatics community. Our assumption is that offering means of digital, structured aggregation and annotation of the objects of an experiment will provide necessary meta-data for a scientist to understand and recreate the results of an experiment. To support this we explored a model for the semantic description of a workflow-centric Research Object (RO), where an RO is defined as a resource that aggregates other resources, e.g., datasets, software, spreadsheets, text, etc. We applied this model to a case study where we analysed human metabolite variation by workflows.
Submitted 19 September, 2014; v1 submitted 12 November, 2013;
originally announced November 2013.
-
PAV ontology: Provenance, Authoring and Versioning
Authors:
Paolo Ciccarese,
Stian Soiland-Reyes,
Khalid Belhajjame,
Alasdair J G Gray,
Carole Goble,
Tim Clark
Abstract:
Provenance is a critical ingredient for establishing trust in published scientific content. This is true whether we are considering a data set, a computational workflow, a peer-reviewed publication or a simple scientific claim with supportive evidence. Existing vocabularies such as DC Terms and the W3C PROV-O are domain-independent and general-purpose, and they allow and encourage extensions to cover more specific needs. We identify the specific need to distinguish between the various roles assumed by agents manipulating digital artifacts, such as author, contributor and curator.
We present the Provenance, Authoring and Versioning ontology (PAV): a lightweight ontology for capturing just enough descriptions essential for tracking the provenance, authoring and versioning of web resources. We argue that such descriptions are essential for digital scientific content. PAV distinguishes between contributors, authors and curators of content and creators of representations in addition to the provenance of originating resources that have been accessed, transformed and consumed. We explore five projects (and communities) that have adopted PAV illustrating their usage through concrete examples. Moreover, we present mappings that show how PAV extends the PROV-O ontology to support broader interoperability.
The authors strived to keep PAV lightweight and compact by including only those terms that have proven pragmatically useful in existing applications, and by recommending terms from existing ontologies when plausible.
We analyze and compare PAV with related approaches, namely Provenance Vocabulary, DC Terms and BIBFRAME. We identify similarities and analyze their differences with PAV, outlining strengths and weaknesses of our proposed model. We specify SKOS mappings that align PAV with DC Terms.
Submitted 6 December, 2013; v1 submitted 26 April, 2013;
originally announced April 2013.
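The role distinction that this abstract describes can be sketched as a few RDF statements. The snippet below is an illustration only: it assumes the PAV namespace `http://purl.org/pav/` and the properties `authoredBy`, `curatedBy`, `contributedBy` and `createdBy`, while the helper `pav_triples` and every `example.org` URI are hypothetical. It writes N-Triples as plain strings, so no RDF library is needed.

```python
# Assumed PAV namespace; the resource and agent URIs below are made up.
PAV = "http://purl.org/pav/"


def pav_triples(resource, roles):
    """Return one N-Triples line per (role property, agent) pair for a resource."""
    lines = []
    for prop, agents in roles.items():
        for agent in agents:
            lines.append(f"<{resource}> <{PAV}{prop}> <{agent}> .")
    return lines


# Each agent plays a different role with respect to the same resource.
roles = {
    "authoredBy": ["https://example.org/people/alice"],     # wrote the content
    "curatedBy": ["https://example.org/people/bob"],        # maintains and corrects it
    "contributedBy": ["https://example.org/people/carol"],  # supplied part of the content
    "createdBy": ["https://example.org/people/dave"],       # made this representation
}

triples = pav_triples("https://example.org/dataset/1", roles)
for line in triples:
    print(line)
```

Keeping the four roles as separate properties is what lets a consumer ask, for example, who curated a dataset without confusing that person with whoever authored its content.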