-
OEKG: The Open Event Knowledge Graph
Authors:
Simon Gottschalk,
Endri Kacupaj,
Sara Abdollahi,
Diego Alves,
Gabriel Amaral,
Elisavet Koutsiana,
Tin Kuculo,
Daniela Major,
Caio Mello,
Gullal S. Cheema,
Abdul Sittar,
Swati,
Golsa Tahmasebzadeh,
Gaurish Thakkar
Abstract:
Accessing and understanding contemporary and historical events of global impact such as the US elections and the Olympic Games is a major prerequisite for cross-lingual event analytics that investigate event causes, perception and consequences across country borders. In this paper, we present the Open Event Knowledge Graph (OEKG), a multilingual, event-centric, temporal knowledge graph composed of seven different data sets from multiple application domains, including question answering, entity recommendation and named entity recognition. These data sets are all integrated through an easy-to-use and robust pipeline and by linking to the event-centric knowledge graph EventKG. We describe their common schema and demonstrate the use of the OEKG through three use cases: type-specific image retrieval, hybrid question answering over knowledge graphs and news articles, and language-specific event recommendation. The OEKG and its query endpoint are publicly available.
Submitted 28 February, 2023;
originally announced February 2023.
-
Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia
Authors:
Diego Alves,
Gaurish Thakkar,
Gabriel Amaral,
Tin Kuculo,
Marko Tadić
Abstract:
With the ever-growing popularity of the field of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, in this paper, we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
Submitted 14 December, 2022;
originally announced December 2022.
-
ProVe: A Pipeline for Automated Provenance Verification of Knowledge Graphs against Textual Sources
Authors:
Gabriel Amaral,
Odinaldo Rodrigues,
Elena Simperl
Abstract:
Knowledge Graphs are repositories of information that gather data from a multitude of domains and sources in the form of semantic triples, serving as a source of structured data for various crucial applications in the modern web landscape, from Wikipedia infoboxes to search engines. Such graphs mainly serve as secondary sources of information and depend on well-documented and verifiable provenance to ensure their trustworthiness and usability. However, their ability to systematically assess and assure the quality of this provenance, most crucially whether it properly supports the graph's information, relies mainly on manual processes that do not scale with size. ProVe aims at remedying this, consisting of a pipelined approach that automatically verifies whether a Knowledge Graph triple is supported by text extracted from its documented provenance. ProVe is intended to assist information curators and consists of four main steps involving rule-based methods and machine learning models: text extraction, triple verbalisation, sentence selection, and claim verification. ProVe is evaluated on a Wikidata dataset, achieving promising results overall and excellent performance on the binary classification task of detecting support from provenance, with 87.5% accuracy and 82.9% F1-macro on text-rich sources. The evaluation data and scripts used in this paper are available on GitHub and Figshare.
Submitted 26 October, 2022;
originally announced October 2022.
-
Statistical and Neural Methods for Cross-lingual Entity Label Mapping in Knowledge Graphs
Authors:
Gabriel Amaral,
Mārcis Pinnis,
Inguna Skadiņa,
Odinaldo Rodrigues,
Elena Simperl
Abstract:
Knowledge bases such as Wikidata amass vast amounts of named entity information, such as multilingual labels, which can be extremely useful for various multilingual and cross-lingual applications. However, such labels are not guaranteed to match across languages from an information consistency standpoint, greatly compromising their usefulness for fields such as machine translation. In this work, we investigate the application of word and sentence alignment techniques coupled with a matching algorithm to align cross-lingual entity labels extracted from Wikidata in 10 languages. Our results indicate that mapping between Wikidata's main labels stands to be considerably improved (up to $20$ points in F1-score) by any of the employed methods. We show how methods relying on sentence embeddings outperform all others, even across different scripts. We believe the application of such techniques to measure the similarity of label pairs, coupled with a knowledge base rich in high-quality entity labels, to be an excellent asset to machine translation.
Submitted 17 June, 2022;
originally announced June 2022.
-
WDV: A Broad Data Verbalisation Dataset Built from Wikidata
Authors:
Gabriel Amaral,
Odinaldo Rodrigues,
Elena Simperl
Abstract:
Data verbalisation is a task of great importance in the current field of natural language processing, as there is great benefit in the transformation of our abundant structured and semi-structured data into human-readable formats. Verbalising Knowledge Graph (KG) data focuses on converting interconnected triple-based claims, formed of subject, predicate, and object, into text. Although KG verbalisation datasets exist for some KGs, there are still gaps in their fitness for use in many scenarios. This is especially true for Wikidata, where available datasets either loosely couple claim sets with textual information or heavily focus on predicates around biographies, cities, and countries. To address these gaps, we propose WDV, a large KG claim verbalisation dataset built from Wikidata, with a tight coupling between triples and text, covering a wide variety of entities and predicates. We also evaluate the quality of our verbalisations through a reusable workflow for measuring human-centred fluency and adequacy scores. Our data and code are openly available in the hopes of furthering research towards KG verbalisation.
Submitted 5 May, 2022;
originally announced May 2022.
-
Assessing the quality of sources in Wikidata across languages: a hybrid approach
Authors:
Gabriel Amaral,
Alessandro Piscopo,
Lucie-Aimée Kaffee,
Odinaldo Rodrigues,
Elena Simperl
Abstract:
Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important as Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Nevertheless, despite this essential link between content and references, Wikidata's ability to systematically assess and assure the quality of its references remains limited. To this end, we carry out a mixed-methods study to determine the relevance, ease of access, and authoritativeness of Wikidata references, at scale and in different languages, using online crowdsourcing, descriptive statistics, and machine learning. Building on previous work of ours, we run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages. We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata. The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web. We also discuss ongoing editorial practices, which could encourage the use of higher-quality references in a more immediate way. All data and code used in the study are available on GitHub for feedback and further improvement and deployment by the research community.
Submitted 20 September, 2021;
originally announced September 2021.
-
Knowledge Graphs Evolution and Preservation -- A Technical Report from ISWS 2019
Authors:
Nacira Abbas,
Kholoud Alghamdi,
Mortaza Alinam,
Francesca Alloatti,
Glenda Amaral,
Claudia d'Amato,
Luigi Asprino,
Martin Beno,
Felix Bensmann,
Russa Biswas,
Ling Cai,
Riley Capshaw,
Valentina Anita Carriero,
Irene Celino,
Amine Dadoun,
Stefano De Giorgis,
Harm Delva,
John Domingue,
Michel Dumontier,
Vincent Emonet,
Marieke van Erp,
Paola Espinoza Arias,
Omaima Fallatah,
Sebastián Ferrada,
Marc Gallofré Ocaña
, et al. (49 additional authors not shown)
Abstract:
One of the grand challenges discussed during the Dagstuhl Seminar "Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web" and described in its report is that of a: "Public FAIR Knowledge Graph of Everything: We increasingly see the creation of knowledge graphs that capture information about the entirety of a class of entities. [...] This grand challenge extends this further by asking if we can create a knowledge graph of "everything" ranging from common sense concepts to location based entities. This knowledge graph should be "open to the public" in a FAIR manner democratizing this mass amount of knowledge." Although linked open data (LOD) is one knowledge graph, it is the closest realisation (and probably the only one) to a public FAIR Knowledge Graph (KG) of everything. Surely, LOD provides a unique testbed for experimenting and evaluating research hypotheses on open and FAIR KG. One of the most neglected FAIR issues about KGs is their ongoing evolution and long term preservation. We want to investigate this problem, that is to understand what preserving and supporting the evolution of KGs means and how these problems can be addressed. Clearly, the problem can be approached from different perspectives and may require the development of different approaches, including new theories, ontologies, metrics, strategies, procedures, etc. This document reports a collaborative effort performed by 9 teams of students, each guided by a senior researcher as their mentor, attending the International Semantic Web Research School (ISWS 2019). Each team provides a different perspective to the problem of knowledge graph evolution substantiated by a set of research questions as the main subject of their investigation. In addition, they provide their working definition for KG preservation and evolution.
Submitted 22 December, 2020;
originally announced December 2020.
-
UNER: Universal Named-Entity Recognition Framework
Authors:
Diego Alves,
Tin Kuculo,
Gabriel Amaral,
Gaurish Thakkar,
Marko Tadic
Abstract:
We introduce the Universal Named-Entity Recognition (UNER) framework, a 4-level classification hierarchy, and the methodology that is being adopted to create the first multilingual UNER corpus: the SETimes parallel corpus annotated for named-entities. First, the English SETimes corpus will be annotated using existing tools and knowledge bases. After evaluating the resulting annotations through crowdsourcing campaigns, they will be propagated automatically to other languages within the SETimes corpora. Finally, as an extrinsic evaluation, the UNER multilingual dataset will be used to train and test available NER tools. As part of future research directions, we aim to increase the number of languages in the UNER corpus and to investigate possible ways of integrating UNER with available knowledge graphs to improve named-entity recognition.
Submitted 23 October, 2020;
originally announced October 2020.
-
Visual-Quality-Driven Learning for Underwater Vision Enhancement
Authors:
Walysson Vital Barbosa,
Henrique Grandinetti Barbosa Amaral,
Thiago Lages Rocha,
Erickson Rangel Nascimento
Abstract:
The image processing community has witnessed remarkable advances in enhancing and restoring images. Nevertheless, restoring the visual quality of underwater images remains a great challenge. End-to-end frameworks might fail to enhance the visual quality of underwater images, since in several scenarios it is not feasible to provide the ground truth of the scene radiance. In this work, we propose a CNN-based approach that does not require ground truth data, since it uses a set of image quality metrics to guide the restoration learning process. The experiments showed that our method improved the visual quality of underwater images while preserving their edges, and also performed well on the UCIQE metric.
Submitted 12 September, 2018;
originally announced September 2018.
-
An Optimal Polarization Tracking Algorithm for Lithium-Niobate-based Polarization Controllers
Authors:
Joaquim D. Garcia,
Gustavo C. Amaral
Abstract:
We present an optimal algorithm for the three-stage arbitrary polarization tracking using Lithium-Niobate-based Polarization Controllers: device calibration, polarization state rotation, and stabilization. The theoretical model representing the lithium-niobate-based polarization controller is derived and the methodology is successfully applied. Results are numerically simulated in the MATLAB environment.
Submitted 11 March, 2016;
originally announced March 2016.
-
Linear-Optic Heralded Photon Source
Authors:
Thiago Ferreira da Silva,
Gustavo C. Amaral,
Guilherme P. Temporão,
Jean Pierre von der Weid
Abstract:
We present a Heralded Photon Source based only on linear optics and weak coherent states. By time-tuning a Hong-Ou-Mandel interferometer fed with frequency-displaced coherent states, the output photons can be synchronously heralded following sub-Poisson statistics, which is indicated by the second-order correlation function ($g^2\left(0\right)=0.556$). The absence of phase-matching restrictions makes the source widely tunable, with 100-nm spectral tunability on the telecom bands. The technique presents yield comparable to state-of-the-art spontaneous parametric down-conversion-based sources, with high coherence and fiber-optic quantum communication compatibility.
Submitted 22 November, 2015; v1 submitted 7 July, 2015;
originally announced July 2015.