Search | arXiv e-print repository

arXiv:2407.02018 [pdf]

A Proposal for a FAIR Management of 3D Data in Cultural Heritage: The Aldrovandi Digital Twin Case

Authors: Sebastian Barzaghi, Alice Bordignon, Bianca Gualandi, Ivan Heibi, Arcangelo Massari, Arianna Moretti, Silvio Peroni, Giulia Renda

Abstract: In this article we analyse 3D models of cultural heritage with the aim of answering three main questions: what processes can be put in place to create a FAIR-by-design digital twin of a temporary exhibition? What are the main challenges in applying FAIR principles to 3D data in cultural heritage studies and how are they different from other types of data (e.g. images) from a data management perspective? We begin with a comprehensive literature review touching on: FAIR principles applied to cultural heritage data; representation models; both Object Provenance Information (OPI) and Metadata Record Provenance Information (MRPI), respectively meant as, on the one hand, the detailed history and origin of an object, and - on the other hand - the detailed history and origin of the metadata itself, which describes the primary object (whether physical or digital); 3D models as cultural heritage research data and their creation, selection, publication, archival and preservation. We then describe the process of creating the Aldrovandi Digital Twin, by collecting, storing and modelling data about cultural heritage objects and processes. We detail the many steps from the acquisition of the Digital Cultural Heritage Objects (DCHO), through to the upload of the optimised DCHO onto a web-based framework (ATON), with a focus on open technologies and standards for interoperability and preservation.
Using the FAIR Principles for Heritage Library, Archive and Museum Collections as a framework, we look in detail at how the Digital Twin implements FAIR principles at the object and metadata level. We then describe the main challenges we encountered and we summarise what seem to be the peculiarities of 3D cultural heritage data and the possible directions for further research in this field.

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2405.02113 [pdf]

A Workflow for GLAM Metadata Crosswalk

Authors: Arianna Moretti, Ivan Heibi, Silvio Peroni

Abstract: The acquisition of physical artifacts not only involves transferring existing information into the digital ecosystem but also generates information as a process itself, underscoring the importance of meticulous management of FAIR data and metadata. In addition, the diversity of objects within the cultural heritage domain is reflected in a multitude of descriptive models. The digitization process expands the opportunities for exchange and joint utilization, provided that the descriptive schemas are made interoperable in advance. To achieve this goal, we propose a replicable workflow for metadata schema crosswalks that facilitates the preservation and accessibility of cultural heritage in the digital ecosystem. This work presents a methodology for metadata generation and management in the case study of the digital twin of the temporary exhibition "The Other Renaissance - Ulisse Aldrovandi and the Wonders of the World". The workflow delineates a systematic, step-by-step transformation of tabular data into RDF format, to enhance Linked Open Data. The methodology adopts the RDF Mapping Language (RML) technology for converting data to RDF with human involvement. This last aspect entails an interaction between digital humanists and domain experts through surveys leading to the abstraction and reformulation of domain-specific knowledge, to be exploited in the process of formalizing and converting information.

Submitted 3 May, 2024; originally announced May 2024.

Comments: Submitted to AIUCD conference 2024, 1 figure, 8 pages

arXiv:2404.12069 [pdf, other]

Developing Application Profiles for Enhancing Data and Workflows in Cultural Heritage Digitisation Processes

Authors: Sebastian Barzaghi, Ivan Heibi, Arianna Moretti, Silvio Peroni

Abstract: As a result of the proliferation of 3D digitisation in the context of cultural heritage projects, digital assets and digitisation processes - being considered as proper research objects - must prioritise adherence to FAIR principles. Existing standards and ontologies, such as CIDOC CRM, play a crucial role in this regard, but they are often over-engineered for the need of a particular application context, thus making their understanding and adoption difficult. Application profiles of a given standard - defined as sets of ontological entities drawn from one or more semantic artefacts for a particular context or application - are usually proposed as tools for promoting interoperability and reuse while being tied entirely to the particular application context they refer to. In this paper, we present an adaptation and application of an ontology development methodology, i.e. SAMOD, to guide the creation of robust, semantically sound application profiles of large standard models. Using an existing pilot study we have developed in a project dedicated to leveraging virtual technologies to preserve and valorise cultural heritage, we introduce an application profile named CHAD-AP, that we have developed following our customised version of SAMOD. We reflect on the use of SAMOD and similar ontology development methodologies for this purpose, highlighting its strengths and current limitations, future developments, and possible adoption in other similar projects.

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2402.12000 [pdf]

Thinking Outside the Black Box: Insights from a Digital Exhibition in the Humanities

Authors: Sebastian Barzaghi, Alice Bordignon, Bianca Gualandi, Silvio Peroni

Abstract: One of the main goals of Open Science is to make research more reproducible. There is no consensus, however, on what exactly "reproducibility" is, as opposed for example to "replicability", and how it applies to different research fields. After a short review of the literature on reproducibility/replicability with a focus on the humanities, we describe how the creation of the digital twin of the temporary exhibition "The Other Renaissance" has been documented throughout, with different methods, but with constant attention to research transparency, openness and accountability. A careful documentation of the study design, data collection and analysis techniques helps reflect and make all possible influencing factors explicit, and is a fundamental tool for reliability and rigour and for opening the "black box" of research.

Submitted 10 April, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

Comments: Accepted to the AIUCD2024 Conference: https://aiucd2024.unict.it/ - will be published in conference proceedings

arXiv:2402.00477 [pdf]

HERITRACE: Tracing Evolution and Bridging Data for Streamlined Curatorial Work in the GLAM Domain

Authors: Arcangelo Massari, Silvio Peroni

Abstract: HERITRACE is a semantic data management system tailored for the GLAM sector. It is engineered to streamline data curation for non-technical users while also offering an efficient administrative interface for technical staff. The paper compares HERITRACE with other established platforms such as OmekaS, Semantic MediaWiki, Research Space, and CLEF, emphasizing its advantages in user friendliness, provenance management, change tracking, customization capabilities, and data integration. The system leverages SHACL for data modeling and employs the OpenCitations Data Model (OCDM) for provenance and change tracking, ensuring a harmonious blend of advanced technical features and user accessibility. Future developments include the integration of a robust authentication system and the expansion of data compatibility via the RDF Mapping Language (RML), enhancing HERITRACE's utility in digital heritage management.

Submitted 24 April, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

Comments: 5 pages, 1 figure, submitted to AIUCD 2024

arXiv:2312.16523 [pdf]

Mapping bibliographic metadata collections: the case of OpenCitations Meta and OpenAlex

Authors: Elia Rizzetto, Silvio Peroni

Abstract: This study describes the methodology and analyses the results of the process of mapping entities between two large open bibliographic metadata collections, OpenCitations Meta and OpenAlex. The primary objective of this mapping is to integrate OpenAlex internal identifiers into the existing metadata of bibliographic resources in OpenCitations Meta, thereby interlinking and aligning these collections. Furthermore, analysing the output of the mapping provides a unique perspective on the consistency and accuracy of bibliographic metadata, offering a valuable tool for identifying potential inconsistencies in the processed data.

Submitted 27 December, 2023; originally announced December 2023.

arXiv:2308.15920 [pdf]

doi 10.1016/j.daach.2023.e00309

Saving temporary exhibitions in virtual environments: the Digital Renaissance of Ulisse Aldrovandi -- acquisition and digitisation of cultural heritage objects

Authors: Roberto Balzani, Sebastian Barzaghi, Gabriele Bitelli, Federica Bonifazi, Alice Bordignon, Luca Cipriani, Simona Colitti, Federica Collina, Marilena Daquino, Francesca Fabbri, Bruno Fanini, Filippo Fantini, Daniele Ferdani, Giulia Fiorini, Elena Formia, Anna Forte, Federica Giacomini, Valentina Alena Girelli, Bianca Gualandi, Ivan Heibi, Alessandro Iannucci, Rachele Manganelli Del Fà, Arcangelo Massari, Arianna Moretti, Silvio Peroni, et al. (8 additional authors not shown)

Abstract: As per the objectives of Project CHANGES, particularly its thematic sub-project on the use of virtual technologies for museums and art collections, our goal was to obtain a digital twin of the temporary exhibition on Ulisse Aldrovandi called "The Other Renaissance", and make it accessible to users online. After a preliminary study of the exhibition, focussing on acquisition constraints and related solutions, we proceeded with the digital twin creation by acquiring, processing, modelling, optimising, exporting, and metadating the exhibition. We made hybrid use of two acquisition techniques to create new digital cultural heritage objects and environments, and we used open technologies, formats, and protocols to make available the final digital product. Here, we describe the process of collecting and curating bibliographical exhibition (meta)data and the beginning of the digital twin creation to foster its findability, accessibility, interoperability, and reusability. The creation of the digital twin is currently ongoing.

Submitted 27 December, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

arXiv:2308.13573 [pdf]

Retractions in Arts and Humanities: an Analysis of the Retraction Notices

Authors: Ivan Heibi, Silvio Peroni

Abstract: The aim of this work is to understand the retraction phenomenon in the arts and humanities domain through an analysis of the retraction notices: formal documents stating and describing the retraction of a particular publication. The retractions and the corresponding notices are identified using the data provided by Retraction Watch. Our methodology for the analysis combines a metadata analysis and a content analysis (mainly performed using a topic modeling process) of the retraction notices. Considering 343 cases of retraction, we found that many retraction notices are neither identifiable nor findable. In addition, these were not always separated from the original papers, introducing ambiguity in understanding how these notices were perceived by the community (i.e., cited). Also, we noticed that there is no systematic way to write a retraction notice. Indeed, some retraction notices presented a complete discussion of the reasons for retraction, while others tended to be more direct and succinct. We have also reported many notices having similar text while addressing different retractions. We think a further study with a larger collection should be done using the same methodology to confirm and investigate our findings further.

Submitted 25 August, 2023; originally announced August 2023.

arXiv:2307.01718 [pdf]

A Prototype for a Controlled and Valid RDF Data Production Using SHACL

Authors: Elia Rizzetto, Arcangelo Massari, Ivan Heibi, Silvio Peroni

Abstract: The paper introduces a tool prototype that combines SHACL's capabilities with ad-hoc validation functions to create a controlled and user-friendly form interface for producing valid RDF data. The proposed tool is developed within the context of the OpenCitations Data Model (OCDM) use case. The paper discusses the current status of the tool, outlines the future steps required for achieving full functionality, and explores the potential applications and benefits of the tool.

Submitted 4 July, 2023; originally announced July 2023.

arXiv:2306.16191 [pdf]

OpenCitations Meta

Authors: Arcangelo Massari, Fabio Mariani, Ivan Heibi, Silvio Peroni, David Shotton

Abstract: OpenCitations Meta is a new database that contains bibliographic metadata of scholarly publications involved in citations indexed by the OpenCitations infrastructure. It adheres to Open Science principles and provides data under a CC0 license for maximum reuse. The data can be accessed through a SPARQL endpoint, REST APIs, and dumps. OpenCitations Meta serves three important purposes. Firstly, it enables disambiguation of citations between publications described using different identifiers from various sources. For example, it can link publications identified by DOIs in Crossref and PMIDs in PubMed. Secondly, it assigns new globally persistent identifiers (PIDs), known as OpenCitations Meta Identifiers (OMIDs), to bibliographic resources without existing external persistent identifiers like DOIs. Lastly, by hosting the bibliographic metadata internally, OpenCitations Meta improves the speed of metadata retrieval for citing and cited documents. The database is populated through automated data curation, including deduplication, error correction, and metadata enrichment. The data is stored in RDF format following the OpenCitations Data Model, and changes and provenance information are tracked. OpenCitations Meta and its production. OpenCitations Meta currently incorporates data from Crossref, DataCite, and the NIH Open Citation Collection. In terms of semantic publishing datasets, it is currently the first in data volume.

Submitted 28 June, 2023; originally announced June 2023.

Comments: 26 pages, 7 figures

arXiv:2305.08477 [pdf]

Representing provenance and track changes of cultural heritage metadata in RDF: a survey of existing approaches

Authors: Arcangelo Massari, Silvio Peroni, Francesca Tomasi, Ivan Heibi

Abstract: The data within collections from all Digital Humanities fields must be trustworthy. To this end, both provenance and change-tracking systems are needed. This contribution offers a systematic review of the metadata representation models for provenance in RDF, focusing on the problem of modelling conjectures in humanistic data.

Submitted 15 May, 2023; originally announced May 2023.

Comments: 10 pages, 2 figures, submitted to the ADHO Digital Humanities Conference 2023

arXiv:2305.06746 [pdf, ps, other]

doi 10.1038/s41597-024-03185-4

A maturity model for catalogues of semantic artefacts

Authors: Oscar Corcho, Fajar J. Ekaputra, Ivan Heibi, Clement Jonquet, Andras Micsik, Silvio Peroni, Emanuele Storti

Abstract: This work presents a maturity model for assessing catalogues of semantic artefacts, one of the keystones that permit semantic interoperability of systems. We defined the dimensions and related features to include in the maturity model by analysing the current literature and existing catalogues of semantic artefacts provided by experts. In addition, we assessed 26 different catalogues to demonstrate the effectiveness of the maturity model, which includes 12 different dimensions (Metadata, Openness, Quality, Availability, Statistics, PID, Governance, Community, Sustainability, Technology, Transparency, and Assessment) and 43 related features (or sub-criteria) associated with these dimensions. Such a maturity model is one of the first attempts to provide recommendations for governance and processes for preserving and maintaining semantic artefacts and helps assess/address interoperability challenges.

Submitted 24 March, 2024; v1 submitted 11 May, 2023; originally announced May 2023.

Journal ref: Scientific Data, 11, 479

arXiv:2210.02534 [pdf]

Performing live time-traversal queries via SPARQL on RDF datasets

Authors: Arcangelo Massari, Silvio Peroni

Abstract: This article introduces a methodology to perform live time-traversal SPARQL queries on RDF datasets and software based on this methodology that offers a solution to manage the provenance and change-tracking of entities described using RDF. These are crucial factors in ensuring verifiability and trust. Nevertheless, some of the most prominent knowledge bases - including DBpedia, Wikidata, Yago, and the Dynamic Linked Data Observatory - do not support time-agnostic queries, i.e., queries across different snapshots together with provenance information. The OpenCitations Data Model (OCDM) describes one possible way to track provenance and entities' changes in RDF datasets, and it allows restoring an entity to a specific status in time (i.e., a snapshot) by applying SPARQL update queries. The methodology and library presented in this article are based on the rationale introduced in the OCDM. We also developed benchmarks proving that such a procedure is efficient for specific queries and less efficient for others. To the best of our knowledge, our library is the only one to support all the time-related retrieval functionalities live, i.e., enabling real-time searches and updates. Moreover, since OCDM complies with standard RDF, queries are expressed via standard SPARQL.

Submitted 12 October, 2022; v1 submitted 5 October, 2022; originally announced October 2022.

Comments: 26 pages, 10 figures, 3 tables, submitted to the Journal of the Association for Information Science and Technology (JASIST)

arXiv:2209.06091 [pdf]

Approaching Digital Humanities at the University: a Cultural Challenge

Authors: Silvio Peroni, Francesca Tomasi

Abstract: The University of Bologna has a long tradition in Digital Humanities, both at the level of research and teaching. In this article, we want to introduce some experiences in developing new educational models based on the idea of transversal learning, collaborative approaches and projects-oriented outputs, together with the definition of research fields within this vast domain, accompanied by practical examples. The creation of an international master's degree (DHDK), a PhD (CHeDE) and a research centre (/DH.arc) are the results of refining our notion of Digital Humanities in a new bidirectional way: to reflect on computational methodologies and models in the cultural sphere and to suggest a cultural approach to Informatics.

Submitted 27 November, 2022; v1 submitted 13 September, 2022; originally announced September 2022.

arXiv:2206.07476 [pdf]

OpenCitations, an open e-infrastructure to foster maximum reuse of citation data

Authors: Chiara Di Giambattista, Ivan Heibi, Silvio Peroni, David Shotton

Abstract: OpenCitations is an independent not-for-profit infrastructure organization for open scholarship dedicated to the publication of open bibliographic and citation data by the use of Semantic Web (Linked Data) technologies. OpenCitations collaborates with projects that are part of the Open Science ecosystem and complies with the UNESCO founding principles of Open Science, the I4OC recommendations, and the FAIR data principles that data should be Findable, Accessible, Interoperable and Reusable. Since its data satisfies all the Reuse guidelines provided by FAIR in terms of richness, provenance, usage licenses and domain-relevant community standards, OpenCitations provides an example of a successful open e-infrastructure in which the reusability of data is integral to its mission.

Submitted 15 June, 2022; originally announced June 2022.

arXiv:2206.03926 [pdf, ps, other]

doi 10.1007/978-3-031-16802-4_36

Enabling Portability and Reusability of Open Science Infrastructures

Authors: Giuseppe Grieco, Ivan Heibi, Arcangelo Massari, Arianna Moretti, Silvio Peroni

Abstract: This paper presents a methodology for designing a containerized and distributed open science infrastructure to simplify its reusability, replicability, and portability in different environments. The methodology is depicted in a step-by-step schema based on four main phases: (1) Analysis, (2) Design, (3) Definition, and (4) Managing and provisioning. We accompany the description of each step with existing technologies and concrete examples of application.

Submitted 28 July, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

Comments: 8 pages, 1 PostScript figure, submitted to TPDL 2022

Journal ref: Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham

arXiv:2205.14677 [pdf]

Structured references from PDF articles: assessing the tools for bibliographic reference extraction and parsing

Authors: Alessia Cioffi, Silvio Peroni

Abstract: Many solutions have been provided to extract bibliographic references from PDF papers. Machine learning, rule-based and regular expressions approaches were among the most used methods adopted in tools for addressing this task. This work aims to identify and evaluate all and only the tools which, given a full-text paper in PDF format, can recognise, extract and parse bibliographic references. We identified seven tools: Anystyle, Cermine, ExCite, Grobid, Pdfssa4met, Scholarcy and Science Parse. We compared and evaluated them against a corpus of 56 PDF articles published in 27 subject areas. Indeed, Anystyle obtained the best overall score, followed by Cermine. However, in some subject areas, other tools had better results for specific tasks.

Submitted 6 September, 2022; v1 submitted 29 May, 2022; originally announced May 2022.

arXiv:2205.13419 [pdf]

The way we cite: common metadata used across disciplines for defining bibliographic references

Authors: Erika Alves dos Santos, Silvio Peroni, Marcos Luiz Mucheroni

Abstract: Current citation practices observed in articles are very noisy, confusing, and not standardised, making identifying the cited works problematic for humans and any reference extraction software. In this work, we want to investigate such citation practices for referencing different types of entities and, in particular, to understand the most used metadata in bibliographic references. We identified 36 types of cited entities (the most cited ones were articles, books, and proceeding papers) within the 34,140 bibliographic references extracted from a vast set of journal articles on 27 different subject areas. The analysis of such bibliographic references, grouped by the particular type of cited entities, enabled us to highlight the most used metadata for defining bibliographic references across the subject areas. However, we also noticed that, in some cases, bibliographic references did not provide the essential elements to identify the work they refer to easily.

Submitted 21 July, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

arXiv:2205.06764 [pdf]

doi 10.1108/JD-07-2022-0146

What do we mean by "data"? A proposed classification of data types in the arts and humanities

Authors: Bianca Gualandi, Luca Pareschi, Silvio Peroni

Abstract: Purpose: This article describes the interviews we conducted in late 2021 with 19 researchers at the Department of Classical Philology and Italian Studies at the University of Bologna. The main purpose was to shed light on the definition of the word "data" in the humanities domain, as far as FAIR data management practices are concerned, and on what researchers think of the term. Methodology: We invited one researcher for each of the official disciplinary areas represented within the department and all 19 agreed to participate in the study. Participants were then divided into 5 main research areas: philology and literary criticism, language and linguistics, history of art, computer science, archival studies. The interviews were transcribed and analysed using a grounded theory approach. Findings: A list of 13 research data types has been compiled thanks to the information collected from participants. The term "data" does not emerge as especially problematic, although a good deal of confusion remains. Looking at current research management practices, methodologies and teamwork appear more central than previously reported. Originality: Our findings confirm that "data" within the FAIR framework should include all types of inputs and outputs that humanities research works with, including publications.
Also, the participants in this study appear ready for a discussion around making their research data FAIR: they do not find the terminology particularly problematic, while they rely on precise and recognised methodologies, as well as on sharing and collaboration with colleagues.

Submitted 8 November, 2022; v1 submitted 13 May, 2022; originally announced May 2022.

arXiv:2202.08469 [pdf]

doi 10.1108/JD-10-2022-0234

An analysis of citing and referencing habits across all scholarly disciplines: approaches and trends in bibliographic referencing and citing practices

Authors: Erika Alves dos Santos, Silvio Peroni, Marcos Luiz Mucheroni

Abstract: Purpose. In this study, we want to identify current possible causes for citing and referencing errors in scholarly literature to compare if something changed from the snapshot provided by Sweetland in his 1989 paper. Design/methodology/approach. We analysed reference elements, i.e. bibliographic references, mentions, quotations, and respective in-text reference pointers, from 729 articles published in 147 journals across the 27 subject areas. Findings. The outcomes of our analysis pointed out that bibliographic errors have been perpetuated for decades and that their possible causes have increased, despite the encouraged use of technological facilities, i.e., the reference managers. Originality. As far as we know, our study is the best recent available analysis of errors in referencing and citing practices in the literature since Sweetland (1989).

Submitted 10 June, 2023; v1 submitted 17 February, 2022; originally announced February 2022.

arXiv:2201.09555 [pdf, other]

A Knowledge Graph Embeddings based Approach for Author Name Disambiguation using Literals

Authors: Cristian Santini, Genet Asefa Gesese, Silvio Peroni, Aldo Gangemi, Harald Sack, Mehwish Alam

Abstract: Scholarly data is growing continuously containing information about the articles from a plethora of venues including conferences, journals, etc. Many initiatives have been taken to make scholarly data available as Knowledge Graphs (KGs). These efforts to standardize these data and make them accessible have also led to many challenges such as exploration of scholarly articles, ambiguous authors, etc. This study more specifically targets the problem of Author Name Disambiguation (AND) on Scholarly KGs and presents a novel framework, Literally Author Name Disambiguation (LAND), which utilizes Knowledge Graph Embeddings (KGEs) using multimodal literal information generated from these KGs. This framework is based on three components: 1) Multimodal KGEs, 2) A blocking procedure, and finally, 3) Hierarchical Agglomerative Clustering. Extensive experiments have been conducted on two newly created KGs: (i) KG containing information from Scientometrics Journal from 1978 onwards (OC-782K), and (ii) a KG extracted from a well-known benchmark for AND provided by AMiner (AMiner-534K). The results show that our proposed architecture outperforms our baselines by 8-14% in terms of the F1 score and shows competitive performances on a challenging benchmark such as AMiner. The code and the datasets are publicly available through Github: https://github.com/sntcristian/and-kge and Zenodo: https://doi.org/10.5281/zenodo.6309855 respectively.

Submitted 1 June, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

arXiv:2111.11263 [pdf]

doi 10.1007/s11192-022-04367-w

Identifying and correcting invalid citations due to DOI errors in Crossref data

Authors: Alessia Cioffi, Sara Coppini, Arcangelo Massari, Arianna Moretti, Silvio Peroni, Cristian Santini, Nooshin Shahidzadeh Asadi

Abstract: This work aims to identify classes of DOI mistakes by analysing the open bibliographic metadata available in Crossref, highlighting which publishers were responsible for such mistakes and how many of these incorrect DOIs could be corrected through automatic processes. By using a list of invalid cited DOIs gathered by OpenCitations while processing the OpenCitations Index of Crossref open DOI-to-DOI citations (COCI) in the past two years, we retrieved the citations in the January 2021 Crossref dump to such invalid DOIs. We processed these citations by keeping track of their validity and the publishers responsible for uploading the related citation data in Crossref. Finally, we identified patterns of factual errors in the invalid DOIs and the regular expressions needed to catch and correct them. The outcomes of this research show that only a few publishers were responsible for and/or affected by the majority of invalid citations. We extended the taxonomy of DOI name errors proposed in past studies and defined more elaborated regular expressions that can clean a higher number of mistakes in invalid DOIs than prior approaches. The data gathered in our study can enable investigating possible reasons for DOI mistakes from a qualitative point of view, helping publishers identify the problems underlying their production of invalid citation data. Also, the DOI cleaning mechanism we present could be integrated into the existing process (e.g. in COCI) to add citations by automatically correcting a wrong DOI. This study was run strictly following Open Science principles, and, as such, our research outcomes are fully reproducible.
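
The regular-expression-based DOI cleaning mentioned in this abstract can be illustrated with a minimal Python sketch. The three patterns below are hypothetical examples of common DOI noise (a resolver or "doi:" prefix, trailing punctuation, a trailing path segment such as /abstract); they are assumptions made for illustration and are not the regular expressions or the error taxonomy defined in the paper. The function name clean_doi and the choice to lowercase the result are likewise assumptions.

```python
import re
from typing import Optional

# Hypothetical noise patterns, for illustration only (not taken from the paper).
PREFIX_NOISE = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)
SUFFIX_NOISE = re.compile(r"(?:[.,;:]+|/(?:abstract|full|pdf|epdf))$", re.IGNORECASE)
# Loose shape check: "10.", a 4-9 digit registrant code, "/", a non-empty suffix.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def clean_doi(raw: str) -> Optional[str]:
    """Return a cleaned, lowercased DOI, or None if the result is not DOI-shaped."""
    doi = PREFIX_NOISE.sub("", raw.strip())
    # Strip trailing noise repeatedly, since several kinds can be stacked.
    previous = None
    while previous != doi:
        previous = doi
        doi = SUFFIX_NOISE.sub("", doi)
    doi = doi.lower()  # DOIs are case-insensitive; lowercase is one common normal form
    return doi if DOI_SHAPE.match(doi) else None


if __name__ == "__main__":
    for candidate in (
        "https://doi.org/10.1007/s11192-022-04367-w.",
        "doi: 10.1108/JD-07-2022-0146/abstract",
        "not-a-doi",
    ):
        print(repr(candidate), "->", clean_doi(candidate))
```

A cleaned string that passes the shape check is only a candidate: it would still need to be resolved against a DOI registry before being treated as a valid correction.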

Submitted 7 March, 2022; v1 submitted 22 November, 2021; originally announced November 2021.

Journal ref: Scientometrics 127, 3593-3612 (2022)

arXiv:2111.05223 [pdf]

A quantitative and qualitative open citation analysis of retracted articles in the humanities

Authors: Ivan Heibi, Silvio Peroni

Abstract: In this article, we show and discuss the results of a quantitative and qualitative analysis of open citations to retracted publications in the humanities domain. Our study was conducted by selecting retracted papers in the humanities domain and marking their main characteristics (e.g., retraction reason). Then, we gathered the citing entities and annotated their basic metadata (e.g., title, venue, subject, etc.) and the characteristics of their in-text citations (e.g., intent, sentiment, etc.). Using these data, we performed a quantitative and qualitative study of retractions in the humanities, presenting descriptive statistics and a topic modeling analysis of the citing entities' abstracts and the in-text citation contexts. As part of our main findings, we noticed that there was no drop in the overall number of citations after the year of retraction, with few entities which have either mentioned the retraction or expressed a negative sentiment toward the cited publication. In addition, on several occasions, we noticed a higher concern/awareness when it was about citing a retracted publication, by the citing entities belonging to the health sciences domain, if compared to the humanities and the social science domains. Philosophy, arts, and history are the humanities areas that showed the higher concern toward the retraction.

Submitted 10 October, 2022; v1 submitted 9 November, 2021; originally announced November 2021.

arXiv:2110.02111 [pdf]

Open bibliographic data and the Italian National Scientific Qualification: measuring coverage of academic fields

Authors: Federica Bologna, Angelo Di Iorio, Silvio Peroni, Francesco Poggi

Abstract: The importance of open bibliographic repositories is widely accepted by the scientific community. For evaluation processes, however, there is still some skepticism: even if large repositories of open access articles and free publication indexes exist and are continuously growing, assessment procedures still rely on proprietary databases, mainly due to the richness of the data available in these proprietary databases and the services provided by the companies they are offered by. This paper investigates the status of open bibliographic data of three of the most used open resources, namely Microsoft Academic Graph, Crossref and OpenAIRE, evaluating their potentialities as substitutes of proprietary databases for academic evaluation processes. We focused on the Italian National Scientific Qualification (NSQ), the Italian process for University Professor qualification, which uses data from commercial indexes, and investigated similarities and differences between research areas, disciplines and application roles. The main conclusion is that open datasets are ready to be used for some disciplines, among which mathematics, natural sciences, economics and statistics, even if there is still room for improvement; but there is still a large gap to fill in others - like history, philosophy, pedagogy and psychology - and a stronger effort is required from researchers and institutions.

Submitted 13 May, 2022; v1 submitted 5 October, 2021; originally announced October 2021.

arXiv:2110.00307 [pdf, other]

The case for the Humanities Citation Index (HuCI): a citation index by the humanities, for the humanities

Authors: Giovanni Colavizza, Silvio Peroni, Matteo Romanello

Abstract: Citation indexes are by now part of the research infrastructure in use by most scientists: a necessary tool in order to cope with the increasing amounts of scientific literature being published. Commercial citation indexes are designed for the sciences and have uneven coverage and unsatisfactory characteristics for humanities scholars, while no comprehensive citation index is published by a public organization. We argue that an open citation index for the humanities is desirable, for four reasons: it would greatly improve and accelerate the retrieval of sources, it would offer a way to interlink collections across repositories (such as archives and libraries), it would foster the adoption of metadata standards and best practices by all stakeholders (including publishers) and it would contribute research data to fields such as bibliometrics and science studies. We also suggest that the citation index should be informed by a set of requirements relevant to the humanities. We discuss four such requirements: source coverage must be comprehensive, including books and citations to primary sources; there needs to be chronological depth, as scholarship in the humanities remains relevant over time; the index should be collection-driven, leveraging the accumulated thematic collections of specialized research libraries; and it should be rich in context in order to allow for the qualification of each citation, for example by providing citation excerpts. We detail the fit-for-purpose research infrastructure which can make the Humanities Citation Index a reality. Ultimately, we argue that a citation index for the humanities can be created by humanists, via a collaborative, distributed and open effort.

Submitted 14 May, 2022; v1 submitted 1 October, 2021; originally announced October 2021.

arXiv:2108.12190 [pdf]

A map of Digital Humanities research across bibliographic data sources

Authors: Gianmarco Spinaci, Giovanni Colavizza, Silvio Peroni

Abstract: Purpose. This study presents the results of an experiment we performed to measure the coverage of Digital Humanities (DH) publications in mainstream open and proprietary bibliographic data sources, by further highlighting the relations among DH and other disciplines. Methodology. We created a list of DH journals based on manual curation and bibliometric data. We used that list to identify DH publications in the bibliographic data sources under consideration. We used the ERIH-PLUS list of journals to identify Social Sciences and Humanities (SSH) publications. We analysed the citation links they included to understand the relationship between DH publications and SSH and non-SSH fields. Findings. Crossref emerges as the database containing the highest number of DH publications. Citations from and to DH publications show strong connections between DH and research in Computer Science, Linguistics, Psychology, and Pedagogical & Educational Research. Computer Science is responsible for a large part of incoming and outgoing citations to and from DH research, which suggests a reciprocal interest between the two disciplines. Value. This is the first bibliometric study of DH research involving several bibliographic data sources, including open and proprietary databases. Research limitations. The list of DH journals we created might be only partially representative of broader DH research. In addition, some DH publications could have been cut off from the study since we did not consider books and other publications published in proceedings of DH conferences and workshops. Finally, we used a specific time coverage (2000-2018) that could have prevented the inclusion of additional DH publications.

Submitted 1 March, 2022; v1 submitted 27 August, 2021; originally announced August 2021.

arXiv:2106.12320 [pdf, ps, other]

doi 10.1145/3447548.346

BiblioDAP: The 1st Workshop on Bibliographic Data Analysis and Processing

Authors: Zeyd Boukhers, Philipp Mayr, Silvio Peroni

Abstract: Automatic processing of bibliographic data becomes very important in digital libraries, data science and machine learning due to its importance in keeping pace with the significant increase of published papers every year from one side and to the inherent challenges from the other side. This processing has several aspects including but not limited to I) Automatic extraction of references from PDF documents, II) Building an accurate citation graph, III) Author name disambiguation, etc. Bibliographic data is heterogeneous by nature and occurs in both structured (e.g. citation graph) and unstructured (e.g. publications) formats. Therefore, it requires data science and machine learning techniques to be processed and analysed. Here we introduce BiblioDAP'21: The 1st Workshop on Bibliographic Data Analysis and Processing.

Submitted 23 June, 2021; originally announced June 2021.

Comments: This workshop will be held in conjunction with KDD' 2021

arXiv:2106.05725 [pdf]

Academics evaluating academics: a methodology to inform the review process on top of open citations

Authors: Federica Bologna, Angelo Di Iorio, Silvio Peroni, Francesco Poggi

Abstract: In the past, several works have investigated ways for combining quantitative and qualitative methods in research assessment exercises. In this work, we aim at introducing a methodology to explore whether citation-based metrics, calculated only considering open bibliographic and citation data, can yield insights on how human peer-review of research assessment exercises is conducted. To understand if and what metrics provide relevant information, we propose to use a series of machine learning models to replicate the decisions of the committees of the research assessment exercises.

Submitted 10 June, 2021; originally announced June 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2103.07942

arXiv:2106.01781 [pdf]

doi 10.1371/journal.pone.0270872

A protocol to gather, characterize and analyze incoming citations of retracted articles

Authors: Ivan Heibi, Silvio Peroni

Abstract: In this article, we present a methodology which takes as input a collection of retracted articles, gathers the entities citing them, characterizes such entities according to multiple dimensions (disciplines, year of publication, sentiment, etc.), and applies a quantitative and qualitative analysis on the collected values. The methodology is composed of four phases: (1) identifying, retrieving, and extracting basic metadata of the entities which have cited a retracted article, (2) extracting and labeling additional features based on the textual content of the citing entities, (3) building a descriptive statistical summary based on the collected data, and finally (4) running a topic modeling analysis. The goal of the methodology is to generate data and visualizations that help understanding possible behaviors related to retraction cases. We present the methodology in a structured step-by-step form following its four phases, discuss its limits and possible workarounds, and list the planned future improvements.

Submitted 3 June, 2021; originally announced June 2021.

arXiv:2105.08599 [pdf]

Can we assess research using open scientific knowledge graphs? A case study within the Italian National Scientific Qualification

Authors: Federica Bologna, Angelo Di Iorio, Silvio Peroni, Francesco Poggi

Abstract: The need for open scientific knowledge graphs is ever increasing. While there are large repositories of open access articles and free publication indexes, there are still few free knowledge graphs exposing citation networks, and often their coverage is partial. Consequently, most evaluation processes based on citation counts rely on commercial citation databases. Things are changing thanks to the Initiative for Open Citations (I4OC, https://i4oc.org) and the Initiative for Open Abstracts (I4OA, https://i4oa.org), whose goal is to campaign for scholarly publishers to open the reference lists and the other metadata of their articles. This paper investigates the growth of the open bibliographic metadata and open citations in two scientific knowledge graphs, OpenCitations' COCI and Crossref, with an experiment on the Italian National Scientific Qualification (NSQ), the National process for University Professor qualification which uses data from commercial indexes. We simulated the procedure by only using such open data and explored similarities and differences with the official results. The outcomes of the experiment show that the amount of open bibliographic metadata and open citation data currently available in the two scientific knowledge graphs adopted is not yet enough for obtaining results similar to those provided using commercial databases.

Submitted 18 May, 2021; originally announced May 2021.

arXiv:2103.07942 [pdf]

doi 10.1007/s11192-022-04581-6

Do open citations give insights on the qualitative peer-review evaluation in research assessments? An analysis of the Italian National Scientific Qualification

Authors: Federica Bologna, Angelo Di Iorio, Silvio Peroni, Francesco Poggi

Abstract: In the past, several works have investigated ways for combining quantitative and qualitative methods in research assessment exercises. Indeed, the Italian National Scientific Qualification (NSQ), i.e. the national assessment exercise which aims at deciding whether a scholar can apply to professorial academic positions as Associate Professor and Full Professor, adopts a quantitative and qualitative evaluation process: it makes use of bibliometrics followed by a peer-review process of candidates' CVs. The NSQ divides academic disciplines into two categories, i.e. citation-based disciplines (CDs) and non-citation-based disciplines (NDs), a division that affects the metrics used for assessing the candidates of that discipline in the first part of the process, which is based on bibliometrics. In this work, we aim at exploring whether citation-based metrics, calculated only considering open bibliographic and citation data, can support the human peer-review of NDs and yield insights on how it is conducted. To understand if and what citation-based (and, possibly, other) metrics provide relevant information, we created a series of machine learning models to replicate the decisions of the NSQ committees. As one of the main outcomes of our study, we noticed that the strength of the citational relationship between the candidate and the commission in charge of assessing his/her CV seems to play a role in the peer-review phase of the NSQ of NDs.

Submitted 23 October, 2022; v1 submitted 14 March, 2021; originally announced March 2021.

arXiv:2012.11475 [pdf]

A qualitative and quantitative analysis of open citations to retracted articles: the Wakefield et al.'s case

Authors: Ivan Heibi, Silvio Peroni

Abstract: In this article, we show the results of a quantitative and qualitative analysis of open citations on a popular and highly cited retracted paper: "Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children" by Wakefield et al., published in 1998. The main purpose of our study is to understand the behavior of the publications citing retracted articles and the characteristics of the citations the retracted articles accumulated over time. Our analysis is based on a methodology which illustrates how we gathered the data, extracted the topics of the citing articles, and visualized the results. The data and services used are all open and free to foster the reproducibility of the analysis. The outcomes concerned the analysis of the entities citing Wakefield et al.'s article and their related in-text citations. We observed a constantly increasing number of citations in the last 20 years, accompanied by a constant increment in the percentage of those acknowledging its retraction. Citing articles started either discussing or dealing with the retraction of Wakefield et al.'s article even before its full retraction, which happened in 2010. Articles in the social sciences domain citing Wakefield et al.'s article were among those that have mostly discussed its retraction. In addition, when observing the in-text citations, we noticed that a large part of the citations received by Wakefield et al.'s article has focused on general discussions without recalling strictly medical details, especially after the full retraction. Medical studies did not hesitate in acknowledging the retraction and often provided strong negative statements on it.

Submitted 24 May, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

arXiv:2011.13886 [pdf]

MITAO: a tool for enabling scholars in the Humanities to use Topic Modelling in their studies

Authors: Ivan Heibi, Silvio Peroni, Luca Pareschi, Paolo Ferri

Abstract: Automatic text analysis methods, such as Topic Modelling, are gaining much attention in Humanities. However, scholars need to have extensive coding skills to use such methods appropriately. The need of having this technical expertise prevents the broad adoption of these methods in Humanities research. In this paper, to help scholars in the Humanities to use Topic Modelling having no or limited coding skills, we introduce MITAO, a web-based tool that allows the definition of a visual workflow which embeds various automatic text analysis operations and allows one to store and share both the workflow and the results of its execution with other researchers, which enables the reproducibility of the analysis. We present an example of an application of Topic Modelling with MITAO using a collection of English abstracts of the articles published in "Umanistica Digitale". The results returned by MITAO are shown with dynamic web-based visualizations, which allowed us to have preliminary insights about the evolution of the topics treated over time in the articles published in "Umanistica Digitale". All the results along with the defined workflows are published and accessible for further studies.

Submitted 27 November, 2020; originally announced November 2020.

arXiv:2011.12599 [pdf, other]

doi 10.3233/SSW200033

The Landscape of Ontology Reuse Approaches

Authors: Valentina Anita Carriero, Marilena Daquino, Aldo Gangemi, Andrea Giovanni Nuzzolese, Silvio Peroni, Valentina Presutti, Francesca Tomasi

Abstract: Ontology reuse aims to foster interoperability and facilitate knowledge reuse. Several approaches are typically evaluated by ontology engineers when bootstrapping a new project. However, current practices are often motivated by subjective, case-by-case decisions, which hamper the definition of a recommended behaviour. In this chapter we argue that to date there are no effective solutions for supporting developers' decision-making process when deciding on an ontology reuse strategy. The objective is twofold: (i) to survey current approaches to ontology reuse, presenting motivations, strategies, benefits and limits, and (ii) to analyse two representative approaches and discuss their merits.

Submitted 25 November, 2020; originally announced November 2020.

arXiv:2009.05588 [pdf]

doi 10.1108/JD-08-2020-0144

Citing and referencing habits in Medicine and Social Sciences journals in 2019

Authors: Erika Alves dos Santos, Silvio Peroni, Marcos Luiz Mucheroni

Abstract: This article explores citing and referencing systems in Social Sciences and Medicine articles from different theoretical and practical perspectives, considering bibliographic references as a facet of descriptive representation. The analysis of citing and referencing elements (i.e. bibliographic references, mentions, quotations, and respective in-text reference pointers) identified citing and referencing habits within disciplines under consideration and errors occurring over the long term as stated by previous studies now expanded. Future expected trends of information retrieval from bibliographic metadata were gathered by approaching these referencing elements from the FRBR Entities concepts. Reference styles do not fully accomplish their role of guiding authors and publishers on providing concise and well-structured bibliographic metadata within bibliographic references. Trends on representative description revision suggest a predicted distancing on the ways information is approached by bibliographic references and bibliographic catalogs adopting FRBR concepts, including the description levels adopted by each of them under the perspective of the FRBR Entities concept. This study was based on a subset of Medicine and Social Sciences articles published in 2019 and, therefore, it may not be taken as a final and broad coverage. Future studies expanding these approaches to other disciplines and chronological periods are encouraged. By approaching citing and referencing issues as facets of descriptive representation, the findings of this study may encourage further studies that will support Information Science and Computer Science in providing tools to make bibliographic metadata description simpler, better structured and more efficient in view of the revision of descriptive representation currently in progress.

Submitted 20 January, 2021; v1 submitted 11 September, 2020; originally announced September 2020.

Comments: Accepted for publication on 18 January 2021 in Journal of Documentation

arXiv:2007.16079 [pdf]

Creating RESTful APIs over SPARQL endpoints using RAMOSE

Authors: Marilena Daquino, Ivan Heibi, Silvio Peroni, David Shotton

Abstract: Semantic Web technologies are widely used for storing RDF data and making them available on the Web through SPARQL endpoints, queryable using the SPARQL query language. While the use of SPARQL endpoints is strongly supported by Semantic Web experts, it hinders broader use of RDF data by common Web users, engineers and developers unfamiliar with Semantic Web technologies, who normally rely on Web RESTful APIs for querying Web-available data and creating applications over them. To solve this problem, we have developed RAMOSE, a generic tool developed in Python to create REST APIs over SPARQL endpoints. Through the creation of source-specific textual configuration files, RAMOSE enables the querying of SPARQL endpoints via simple Web RESTful API calls that return either JSON or CSV-formatted data, thus hiding all the intrinsic complexities of SPARQL and RDF from common Web users. We provide evidence that the use of RAMOSE to provide REST API access to RDF data within OpenCitations triplestores is beneficial in terms of the number of queries made by external users to such RDF data using the RAMOSE API compared with the direct access via the SPARQL endpoint. Our findings show the importance for suppliers of RDF data of having an alternative API access service, which enables its use by those with no (or little) experience in Semantic Web technologies and the SPARQL query language.
RAMOSE can be used both to query any SPARQL endpoint and to query any other Web API, and thus it represents an easy generic technical solution for service providers who wish to create an API service to access Linked Data stored as RDF in a conventional triplestore.

Submitted 30 May, 2021; v1 submitted 31 July, 2020; originally announced July 2020.

arXiv:2005.11981 [pdf, other]

The OpenCitations Data Model

Authors: Marilena Daquino, Silvio Peroni, David Shotton, Giovanni Colavizza, Behnam Ghavimi, Anne Lauscher, Philipp Mayr, Matteo Romanello, Philipp Zumstein

Abstract: A variety of schemas and ontologies are currently used for the machine-readable description of bibliographic entities and citations. This diversity, and the reuse of the same ontology terms with different nuances, generates inconsistencies in data. Adoption of a single data model would facilitate data integration tasks regardless of the data supplier or context application. In this paper we present the OpenCitations Data Model (OCDM), a generic data model for describing bibliographic entities and citations, developed using Semantic Web technologies. We also evaluate the effective reusability of OCDM according to ontology evaluation practices, mention existing users of OCDM, and discuss the use and impact of OCDM in the wider open science community.

Submitted 24 August, 2020; v1 submitted 25 May, 2020; originally announced May 2020.

Comments: ISWC 2020 Conference proceedings

arXiv:1906.11964 [pdf]

doi 10.1162/qss_a_00023

OpenCitations, an infrastructure organization for open scholarship

Authors: Silvio Peroni, David Shotton

Abstract: OpenCitations is an infrastructure organization for open scholarship dedicated to the publication of open citation data as Linked Open Data using Semantic Web technologies, thereby providing a disruptive alternative to traditional proprietary citation indexes. Open citation data are valuable for bibliometric analysis, increasing the reproducibility of large-scale analyses by enabling publication of the source data. Following brief introductions to the development and benefits of open scholarship and to Semantic Web technologies, this paper describes OpenCitations and its datasets, tools, services and activities. These include the OpenCitations Data Model; the SPAR (Semantic Publishing and Referencing) Ontologies; OpenCitations' open software of generic applicability for searching, browsing and providing REST APIs over RDF triplestores; Open Citation Identifiers (OCIs) and the OpenCitations OCI Resolution Service; the OpenCitations Corpus (OCC), a database of open downloadable bibliographic and citation data made available in RDF under a Creative Commons public domain dedication; and the OpenCitations Indexes of open citation data, of which the first and largest is COCI, the OpenCitations Index of Crossref Open DOI-to-DOI Citations, which currently contains over 445 million bibliographic citations and is receiving considerable usage by the scholarly community.

Submitted 9 December, 2019; v1 submitted 27 June, 2019; originally announced June 2019.

arXiv:1906.06039 [pdf]

doi 10.1007/s11192-019-03311-9

Nine Million Book Items and Eleven Million Citations: A Study of Book-Based Scholarly Communication Using OpenCitations

Authors: Yongjun Zhu, Erjia Yan, Silvio Peroni, Chao Che

Abstract: Books have been widely used to share information and contribute to human knowledge. However, the quantitative use of books as a method of scholarly communication is relatively unexamined compared to journal articles and conference papers. This study uses the COCI dataset (a comprehensive open citation dataset provided by OpenCitations) to explore books' roles in scholarly communication. The COCI data we analyzed includes 445,826,118 citations from 46,534,705 bibliographic entities. By analyzing such a large amount of data, we provide a thorough, multifaceted understanding of books. Among the investigated factors are 1) temporal changes to book citations; 2) book citation distributions; 3) years to citation peak; 4) citation half-life; and 5) characteristics of the most-cited books. Results show that books have received less than 4% of total citations, and have been cited mainly by journal articles. Moreover, 97.96% of books have been cited fewer than ten times. Books take longer than other bibliographic materials to reach peak citation levels, yet are cited for the same duration as journal articles. Most-cited books tend to cover general (yet essential) topics, theories, and technological concepts in mathematics and statistics.

Submitted 6 December, 2019; v1 submitted 14 June, 2019; originally announced June 2019.

arXiv:1904.06052 [pdf]

doi 10.1007/s11192-019-03217-6

COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations

Authors: Ivan Heibi, Silvio Peroni, David Shotton

Abstract: In this paper, we present COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations (http://opencitations.net/index/coci). COCI is the first open citation index created by OpenCitations, in which we have applied the concept of citations as first-class data entities, and it contains more than 445 million DOI-to-DOI citation links derived from the data available in Crossref. These citations are described in RDF by means of the newly extended version of the OpenCitations Data Model (OCDM). We introduce the workflow we have developed for creating these data, and also show the additional services that facilitate the access to and querying of these data via different access points: a SPARQL endpoint, a REST API, bulk downloads, Web interfaces, and direct access to the citations via HTTP content negotiation. Finally, we present statistics regarding the use of COCI citation data, and we introduce several projects that have already started to use COCI data for different purposes.

Submitted 26 July, 2019; v1 submitted 12 April, 2019; originally announced April 2019.

Comments: Submitted to Scientometrics (https://link.springer.com/journal/11192)

arXiv:1903.06142 [pdf]

doi 10.1007/s11192-020-03397-6

The practice of self-citations: a longitudinal study

Authors: Silvio Peroni, Paolo Ciancarini, Aldo Gangemi, Andrea Giovanni Nuzzolese, Francesco Poggi, Valentina Presutti

Abstract: In this article, we discuss the outcomes of an experiment where we analysed whether and to what extent the introduction, in 2012, of the new research assessment exercise in Italy (a.k.a. Italian Scientific Habilitation) affected self-citation behaviours in the Italian research community. The Italian Scientific Habilitation attests to the scientific maturity of researchers and in Italy, as in many other countries, is a requirement for accessing to a professorship. To this end, we obtained from ScienceDirect 35,673 articles published from 1957 and 2016 by the participants to the 2012 Italian Scientific Habilitation, that resulted in the extraction of 1,379,050 citations retrieved through Semantic Publishing technologies. Our analysis showed an overall increment in author self-citations (i.e. where the citing article and the cited article share at least one author) in several of the 24 academic disciplines considered. However, we depicted a stronger causal relation between such increment and the rules introduced by the 2012 Italian Scientific Habilitation in 10 out of 24 disciplines analysed.

Submitted 19 February, 2020; v1 submitted 14 March, 2019; originally announced March 2019.

arXiv:1902.03287 [pdf]

Open data to evaluate academic researchers: an experiment with the Italian Scientific Habilitation

Authors: Angelo Di Iorio, Silvio Peroni, Francesco Poggi

Abstract: The need for scholarly open data is ever increasing. While there are large repositories of open access articles and free publication indexes, there are still a few examples of free citation networks and their coverage is partial. One of the results is that most of the evaluation processes based on citation counts rely on commercial citation databases. Things are changing under the pressure of the Initiative for Open Citations (I4OC), whose goal is to campaign for scholarly publishers to make their citations as totally open. This paper investigates the growth of open citations with an experiment on the Italian Scientific Habilitation, the National process for University Professor qualification which instead uses data from commercial indexes. We simulated the procedure by only using open data and explored similarities and differences with the official results. The outcomes of the experiment show that the amount of open citation data currently available is not yet enough for obtaining similar results.

Submitted 8 February, 2019; originally announced February 2019.

Comments: 12 pages, 1 figure, 6 tables, submitted to the 17th International Conference on Scientometrics and Informetrics (ISSI 2019)

arXiv:1902.02534 [pdf]

Crowdsourcing open citations with CROCI -- An analysis of the current status of open citations, and a proposal

Authors: Ivan Heibi, Silvio Peroni, David Shotton

Abstract: In this paper, we analyse the current availability of open citations data in one particular dataset, namely COCI (the OpenCitations Index of Crossref open DOI-to-DOI citations; http://opencitations.net/index/coci) provided by OpenCitations. The results of these analyses show a persistent gap in the coverage of the currently available open citation data. In order to address this specific issue, we propose a strategy whereby the community (e.g. scholars and publishers) can directly involve themselves in crowdsourcing open citations, by uploading their citation data via the OpenCitations infrastructure into our new index, CROCI, the Crowdsourced Open Citations Index.

Submitted 21 June, 2019; v1 submitted 7 February, 2019; originally announced February 2019.

Comments: 7 pages, 3 figures, accepted to ISSI 2019 (https://www.issi2019.org/)

arXiv:1812.11813 [pdf, other]

doi 10.1007/s11192-018-2988-z

Do altmetrics work for assessing research quality?

Authors: Andrea Giovanni Nuzzolese, Paolo Ciancarini, Aldo Gangemi, Silvio Peroni, Francesco Poggi, Valentina Presutti

Abstract: Alternative metrics (aka altmetrics) are gaining increasing interest in the scientometrics community as they can capture both the volume and quality of attention that a research work receives online. Nevertheless, there is limited knowledge about their effectiveness as a mean for measuring the impact of research if compared to traditional citation-based indicators. This work aims at rigorously investigating if any correlation exists among indicators, either traditional (i.e. citation count and h-index) or alternative (i.e. altmetrics) and which of them may be effective for evaluating scholars. The study is based on the analysis of real data coming from the National Scientific Qualification procedure held in Italy by committees of peers on behalf of the Italian Ministry of Education, Universities and Research.

Submitted 31 December, 2018; originally announced December 2018.

arXiv:1605.01188 [pdf]

doi 10.1145/3051487

Enhancing semantic expressivity in the cultural heritage domain: exposing the Zeri Photo Archive as Linked Open Data

Authors: Marilena Daquino, Francesca Mambelli, Silvio Peroni, Francesca Tomasi, Fabio Vitali

Abstract: Describing cultural heritage objects from the perspective of Linked Open Data (LOD) is not a trivial task. The process often requires not only choosing pertinent ontologies, but also developing new models that preserve the most information and express the semantic power of cultural heritage data. Indeed, data managed in archives, libraries and museums are complex objects themselves, which require a deep reflection on even non-conventional conceptual models. Starting from these considerations, this paper describes a research project: to expose the vastness of one of the most important collections of European cultural heritage, the Zeri Photo Archive, as Linked Open Data. We describe here the steps we undertook to this end: firstly, we developed two ad hoc ontologies for describing all the issues not completely covered by existent models (the F Entry and the OA Entry Ontology); then we mapped into RDF the descriptive elements used in the current Zeri Photo Archive catalog, converting into CIDOC-CRM and into the two new aforementioned models the source data based on the Italian content standards Scheda F (Photography Entry, in English) and Scheda OA (Work of Art Entry, in English); and finally, we created an RDF dataset of the output of the mapping that could show a result capable of demonstrating the complexity of our scenario.

Submitted 12 February, 2017; v1 submitted 4 May, 2016; originally announced May 2016.

Comments: 25 pages, 4 figures, journal article

Showing 1–45 of 45 results for author: Peroni, S