Search | arXiv e-print repository

MATILDA: Inclusive Data Science Pipelines Design through Computational Creativity

Authors: Genoveva Vargas-Solar, Santiago Negrete-Yankelevich, Javier A. Espinosa-Oviedo, Khalid Belhajjame, José-Luis Zechinelli-Martini

Abstract: We argue for the need for a new generation of data science solutions that can democratize recent advances in data engineering and artificial intelligence for non-technical users from various disciplines, enabling them to unlock the full potential of these solutions. To do so, we adopt an approach whereby computational creativity and conversational computing are combined to guide non-specialists in… ▽ More We argue for the need for a new generation of data science solutions that can democratize recent advances in data engineering and artificial intelligence for non-technical users from various disciplines, enabling them to unlock the full potential of these solutions. To do so, we adopt an approach whereby computational creativity and conversational computing are combined to guide non-specialists intuitively to explore and extract knowledge from data collections. The paper introduces MATILDA, a creativity-based data science design platform, showing how it can support the design process of data science pipelines guided by human and computational creativity. △ Less

Submitted 17 November, 2023; originally announced November 2023.

arXiv:1609.06127 [pdf, other]

A framework for mining process models from emails logs

Authors: Diana Jlailaty, Daniela Grigori, Khalid Belhajjame

Abstract: Due to its wide use in personal, but most importantly, professional contexts, email represents a valuable source of information that can be harvested for understanding, reengineering and repurposing undocumented business processes of companies and institutions. Towards this aim, a few researchers investigated the problem of extracting process oriented information from email logs in order to take b… ▽ More Due to its wide use in personal, but most importantly, professional contexts, email represents a valuable source of information that can be harvested for understanding, reengineering and repurposing undocumented business processes of companies and institutions. Towards this aim, a few researchers investigated the problem of extracting process oriented information from email logs in order to take benefit of the many available process mining techniques and tools. In this paper we go further in this direction, by proposing a new method for mining process models from email logs that leverage unsupervised machine learning techniques with little human involvement. Moreover, our method allows to semi-automatically label emails with activity names, that can be used for activity recognition in new incoming emails. A use case demonstrates the usefulness of the proposed solution using a modest in size, yet real-world, dataset containing emails that belong to two different process models. △ Less

Submitted 20 September, 2016; originally announced September 2016.

Comments: 18 pages, 6 figures

arXiv:1605.06669 [pdf, other]

Automatic vs Manual Provenance Abstractions: Mind the Gap

Authors: Pinar Alper, Khalid Belhajjame, Carole A. Goble

Abstract: In recent years the need to simplify or to hide sensitive information in provenance has given way to research on provenance abstraction. In the context of scientific workflows, existing research provides techniques to semi automatically create abstractions of a given workflow description, which is in turn used as filters over the workflow's provenance traces. An alternative approach that is common… ▽ More In recent years the need to simplify or to hide sensitive information in provenance has given way to research on provenance abstraction. In the context of scientific workflows, existing research provides techniques to semi automatically create abstractions of a given workflow description, which is in turn used as filters over the workflow's provenance traces. An alternative approach that is commonly adopted by scientists is to build workflows with abstractions embedded into the workflow's design, such as using sub-workflows. This paper reports on the comparison of manual versus semi-automated approaches in a context where result abstractions are used to filter report-worthy results of computational scientific analyses. Specifically; we take a real-world workflow containing user-created design abstractions and compare these with abstractions created by ZOOM UserViews and Workflow Summaries systems. Our comparison shows that semi-automatic and manual approaches largely overlap from a process perspective, meanwhile, there is a dramatic mismatch in terms of data artefacts retained in an abstracted account of derivation. We discuss reasons and suggest future research directions. △ Less

Submitted 21 May, 2016; originally announced May 2016.

Comments: Preprint accepted to the 2016 workshop on the Theory and Applications of Provenance, TAPP 2016

arXiv:1502.02403 [pdf, other]

YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts

Authors: Timothy McPhillips, Tianhong Song, Tyler Kolisnik, Steve Aulenbach, Khalid Belhajjame, Kyle Bocinsky, Yang Cao, Fernando Chirigati, Saumen Dey, Juliana Freire, Deborah Huntzinger, Christopher Jones, David Koop, Paolo Missier, Mark Schildhauer, Christopher Schwalm, Yaxing Wei, James Cheney, Mark Bieda, Bertram Ludaescher

Abstract: Scientific workflow management systems offer features for composing complex computational pipelines from modular building blocks, for executing the resulting automated workflows, and for recording the provenance of data products resulting from workflow runs. Despite the advantages such features provide, many automated workflows continue to be implemented and executed outside of scientific workflow… ▽ More Scientific workflow management systems offer features for composing complex computational pipelines from modular building blocks, for executing the resulting automated workflows, and for recording the provenance of data products resulting from workflow runs. Despite the advantages such features provide, many automated workflows continue to be implemented and executed outside of scientific workflow systems due to the convenience and familiarity of scripting languages (such as Perl, Python, R, and MATLAB), and to the high productivity many scientists experience when using these languages. YesWorkflow is a set of software tools that aim to provide such users of scripting languages with many of the benefits of scientific workflow systems. YesWorkflow requires neither the use of a workflow engine nor the overhead of adapting code to run effectively in such a system. Instead, YesWorkflow enables scientists to annotate existing scripts with special comments that reveal the computational modules and dataflows otherwise implicit in these scripts. YesWorkflow tools extract and analyze these comments, represent the scripts in terms of entities based on the typical scientific workflow model, and provide graphical renderings of this workflow-like view of the scripts. Future versions of YesWorkflow also will allow the prospective provenance of the data products of these scripts to be queried in ways similar to those available to users of scientific workflow systems. △ Less

Submitted 9 February, 2015; originally announced February 2015.

arXiv:1401.4307 [pdf, other]

The Research Object Suite of Ontologies: Sharing and Exchanging Research Data and Methods on the Open Web

Authors: Khalid Belhajjame, Jun Zhao, Daniel Garijo, Kristina Hettne, Raul Palma, Óscar Corcho, José-Manuel Gómez-Pérez, Sean Bechhofer, Graham Klyne, Carole Goble

Abstract: Research in life sciences is increasingly being conducted in a digital and online environment. In particular, life scientists have been pioneers in embracing new computational tools to conduct their investigations. To support the sharing of digital objects produced during such research investigations, we have witnessed in the last few years the emergence of specialized repositories, e.g., DataVers… ▽ More Research in life sciences is increasingly being conducted in a digital and online environment. In particular, life scientists have been pioneers in embracing new computational tools to conduct their investigations. To support the sharing of digital objects produced during such research investigations, we have witnessed in the last few years the emergence of specialized repositories, e.g., DataVerse and FigShare. Such repositories provide users with the means to share and publish datasets that were used or generated in research investigations. While these repositories have proven their usefulness, interpreting and reusing evidence for most research results is a challenging task. Additional contextual descriptions are needed to understand how those results were generated and/or the circumstances under which they were concluded. Because of this, scientists are calling for models that go beyond the publication of datasets to systematically capture the life cycle of scientific investigations and provide a single entry point to access the information about the hypothesis investigated, the datasets used, the experiments carried out, the results of the experiments, the people involved in the research, etc. In this paper we present the Research Object (RO) suite of ontologies, which provide a structured container to encapsulate research data and methods along with essential metadata descriptions. Research Objects are portable units that enable the sharing, preservation, interpretation and reuse of research investigation results. The ontologies we present have been designed in the light of requirements that we gathered from life scientists. They have been built upon existing popular vocabularies to facilitate interoperability. Furthermore, we have developed tools to support the creation and sharing of Research Objects, thereby promoting and facilitating their adoption. △ Less

Submitted 3 February, 2014; v1 submitted 17 January, 2014; originally announced January 2014.

Comments: 20 pages

arXiv:1311.2789 [pdf]

doi 10.1186/2041-1480-5-41

Structuring research methods and data with the Research Object model: genomics workflows as a case study

Authors: Kristina M. Hettne, Harish Dharuri, Jun Zhao, Katherine Wolstencroft, Khalid Belhajjame, Stian Soiland-Reyes, Eleni Mina, Mark Thompson, Don Cruickshank, Lourdes Verdes-Montenegro, Julian Garrido, David de Roure, Oscar Corcho, Graham Klyne, Reinout van Schouwen, Peter A. C. 't Hoen, Sean Bechhofer, Carole Goble, Marco Roos

Abstract: One of the main challenges for biomedical research lies in the computer-assisted integrative study of large and increasingly complex combinations of data in order to understand molecular mechanisms. The preservation of the materials and methods of such computational experiments with clear annotations is essential for understanding an experiment, and this is increasingly recognized in the bioinform… ▽ More One of the main challenges for biomedical research lies in the computer-assisted integrative study of large and increasingly complex combinations of data in order to understand molecular mechanisms. The preservation of the materials and methods of such computational experiments with clear annotations is essential for understanding an experiment, and this is increasingly recognized in the bioinformatics community. Our assumption is that offering means of digital, structured aggregation and annotation of the objects of an experiment will provide necessary meta-data for a scientist to understand and recreate the results of an experiment. To support this we explored a model for the semantic description of a workflow-centric Research Object (RO), where an RO is defined as a resource that aggregates other resources, e.g., datasets, software, spreadsheets, text, etc. We applied this model to a case study where we analysed human metabolite variation by workflows. △ Less

Submitted 19 September, 2014; v1 submitted 12 November, 2013; originally announced November 2013.

Comments: 35 pages, 10 figures, 1 table. Submitted to Journal of Biomedical Semantics on 2013-05-13, resubmitted after reviews 2013-11-09, 2014-06-27. Accepted in principle 2014-07-29. Published: 2014-09-18 http://www.jbiomedsem.com/content/5/1/41. Research Object homepage: http://www.researchobject.org/

Report number: uk-ac-man-scw:212837 ACM Class: J.3; I.7.4; H.3.7

arXiv:1304.7224 [pdf]

doi 10.1186/2041-1480-4-37

PAV ontology: Provenance, Authoring and Versioning

Authors: Paolo Ciccarese, Stian Soiland-Reyes, Khalid Belhajjame, Alasdair J G Gray, Carole Goble, Tim Clark

Abstract: Provenance is a critical ingredient for establishing trust of published scientific content. This is true whether we are considering a data set, a computational workflow, a peer-reviewed publication or a simple scientific claim with supportive evidence. Existing vocabularies such as DC Terms and the W3C PROV-O are domain-independent and general-purpose and they allow and encourage for extensions to… ▽ More Provenance is a critical ingredient for establishing trust of published scientific content. This is true whether we are considering a data set, a computational workflow, a peer-reviewed publication or a simple scientific claim with supportive evidence. Existing vocabularies such as DC Terms and the W3C PROV-O are domain-independent and general-purpose and they allow and encourage for extensions to cover more specific needs. We identify the specific need for identifying or distinguishing between the various roles assumed by agents manipulating digital artifacts, such as author, contributor and curator. We present the Provenance, Authoring and Versioning ontology (PAV): a lightweight ontology for capturing just enough descriptions essential for tracking the provenance, authoring and versioning of web resources. We argue that such descriptions are essential for digital scientific content. PAV distinguishes between contributors, authors and curators of content and creators of representations in addition to the provenance of originating resources that have been accessed, transformed and consumed. We explore five projects (and communities) that have adopted PAV illustrating their usage through concrete examples. Moreover, we present map**s that show how PAV extends the PROV-O ontology to support broader interoperability. The authors strived to keep PAV lightweight and compact by including only those terms that have demonstrated to be pragmatically useful in existing applications, and by recommending terms from existing ontologies when plausible. We analyze and compare PAV with related approaches, namely Provenance Vocabulary, DC Terms and BIBFRAME. We identify similarities and analyze their differences with PAV, outlining strengths and weaknesses of our proposed model. We specify SKOS map**s that align PAV with DC Terms. △ Less

Submitted 6 December, 2013; v1 submitted 26 April, 2013; originally announced April 2013.

Comments: 22 pages (incl 5 tables and 19 figures). Submitted to Journal of Biomedical Semantics 2013-04-26 (#1858276535979415). Revised article submitted 2013-08-30. Second revised article submitted 2013-10-06. Accepted 2013-10-07. Author proofs sent 2013-10-09 and 2013-10-16. Published 2013-11-22. Final version 2013-12-06. http://www.jbiomedsem.com/content/4/1/37

Report number: University of Manchester eScholar: uk-ac-man-scw:193385 ACM Class: I.2.4; H.2.1; H.3.7; I.7.4

Journal ref: Journal of Biomedical Semantics 2013, 4:37

Showing 1–7 of 7 results for author: Belhajjame, K