Search | arXiv e-print repository

News Without Borders: Domain Adaptation of Multilingual Sentence Embeddings for Cross-lingual News Recommendation

Authors: Andreea Iana, Fabian David Schmidt, Goran Glavaš, Heiko Paulheim

Abstract: Rapidly growing numbers of multilingual news consumers pose an increasing challenge to news recommender systems in terms of providing customized recommendations. First, existing neural news recommenders, even when powered by multilingual language models (LMs), suffer substantial performance losses in zero-shot cross-lingual transfer (ZS-XLT). Second, the current paradigm of fine-tuning the backbon… ▽ More Rapidly growing numbers of multilingual news consumers pose an increasing challenge to news recommender systems in terms of providing customized recommendations. First, existing neural news recommenders, even when powered by multilingual language models (LMs), suffer substantial performance losses in zero-shot cross-lingual transfer (ZS-XLT). Second, the current paradigm of fine-tuning the backbone LM of a neural recommender on task-specific data is computationally expensive and infeasible in few-shot recommendation and cold-start setups, where data is scarce or completely unavailable. In this work, we propose a news-adapted sentence encoder (NaSE), domain-specialized from a pretrained massively multilingual sentence encoder (SE). To this end, we construct and leverage PolyNews and PolyNewsParallel, two multilingual news-specific corpora. With the news-adapted multilingual SE in place, we test the effectiveness of (i.e., question the need for) supervised fine-tuning for news recommendation, and propose a simple and strong baseline based on (i) frozen NaSE embeddings and (ii) late click-behavior fusion. We show that NaSE achieves state-of-the-art performance in ZS-XLT in true cold-start and few-shot news recommendation. △ Less

Submitted 18 June, 2024; originally announced June 2024.

ACM Class: I.2.7; H.3.3

arXiv:2406.02650 [pdf, other]

By Fair Means or Foul: Quantifying Collusion in a Market Simulation with Deep Reinforcement Learning

Authors: Michael Schlechtinger, Damaris Kosack, Franz Krause, Heiko Paulheim

Abstract: In the rapidly evolving landscape of eCommerce, Artificial Intelligence (AI) based pricing algorithms, particularly those utilizing Reinforcement Learning (RL), are becoming increasingly prevalent. This rise has led to an inextricable pricing situation with the potential for market collusion. Our research employs an experimental oligopoly model of repeated price competition, systematically varying… ▽ More In the rapidly evolving landscape of eCommerce, Artificial Intelligence (AI) based pricing algorithms, particularly those utilizing Reinforcement Learning (RL), are becoming increasingly prevalent. This rise has led to an inextricable pricing situation with the potential for market collusion. Our research employs an experimental oligopoly model of repeated price competition, systematically varying the environment to cover scenarios from basic economic theory to subjective consumer demand preferences. We also introduce a novel demand framework that enables the implementation of various demand models, allowing for a weighted blending of different models. In contrast to existing research in this domain, we aim to investigate the strategies and emerging pricing patterns developed by the agents, which may lead to a collusive outcome. Furthermore, we investigate a scenario where agents cannot observe their competitors' prices. Finally, we provide a comprehensive legal analysis across all scenarios. Our findings indicate that RL-based AI agents converge to a collusive state characterized by the charging of supracompetitive prices, without necessarily requiring inter-agent communication. Implementing alternative RL algorithms, altering the number of agents or simulation settings, and restricting the scope of the agents' observation space does not significantly impact the collusive market outcome behavior. △ Less

Submitted 4 June, 2024; originally announced June 2024.

Comments: Preprint for IJCAI 2024

arXiv:2405.15375 [pdf, other]

A Planet Scale Spatial-Temporal Knowledge Graph Based On OpenStreetMap And H3 Grid

Authors: Martin Böckling, Heiko Paulheim, Sarah Detzler

Abstract: Geospatial data plays a central role in modeling our world, for which OpenStreetMap (OSM) provides a rich source of such data. While often spatial data is represented in a tabular format, a graph based representation provides the possibility to interconnect entities which would have been separated in a tabular representation. We propose in our paper a framework which supports a planet scale transf… ▽ More Geospatial data plays a central role in modeling our world, for which OpenStreetMap (OSM) provides a rich source of such data. While often spatial data is represented in a tabular format, a graph based representation provides the possibility to interconnect entities which would have been separated in a tabular representation. We propose in our paper a framework which supports a planet scale transformation of OpenStreetMap data into a Spatial Temporal Knowledge Graph. In addition to OpenStreetMap data, we align the different OpenStreetMap geometries on individual h3 grid cells. We compare our constructed spatial knowledge graph to other spatial knowledge graphs and outline our contribution in this paper. As a basis for our computation, we use Apache Sedona as a computational framework for our Spatial Temporal Knowledge Graph construction △ Less

Submitted 24 May, 2024; originally announced May 2024.

Comments: 12 pages, 2 figures, GeoLD2024: 6th Geospatial Linked Data Workshop, May 26, 2024, Hersonissos, Greece

arXiv:2404.14970 [pdf, other]

Integrating Heterogeneous Gene Expression Data through Knowledge Graphs for Improving Diabetes Prediction

Authors: Rita T. Sousa, Heiko Paulheim

Abstract: Diabetes is a worldwide health issue affecting millions of people. Machine learning methods have shown promising results in improving diabetes prediction, particularly through the analysis of diverse data types, namely gene expression data. While gene expression data can provide valuable insights, challenges arise from the fact that the sample sizes in expression datasets are usually limited, and… ▽ More Diabetes is a worldwide health issue affecting millions of people. Machine learning methods have shown promising results in improving diabetes prediction, particularly through the analysis of diverse data types, namely gene expression data. While gene expression data can provide valuable insights, challenges arise from the fact that the sample sizes in expression datasets are usually limited, and the data from different datasets with different gene expressions cannot be easily combined. This work proposes a novel approach to address these challenges by integrating multiple gene expression datasets and domain-specific knowledge using knowledge graphs, a unique tool for biomedical data integration. KG embedding methods are then employed to generate vector representations, serving as inputs for a classifier. Experiments demonstrated the efficacy of our approach, revealing improvements in diabetes prediction when integrating multiple gene expression datasets and domain-specific knowledge about protein functions and interactions. △ Less

Submitted 23 April, 2024; originally announced April 2024.

Comments: 11 pages, 4 figures, 7th Workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics at ESWC2024

ACM Class: J.3

arXiv:2403.17876 [pdf, other]

MIND Your Language: A Multilingual Dataset for Cross-lingual News Recommendation

Authors: Andreea Iana, Goran Glavaš, Heiko Paulheim

Abstract: Digital news platforms use news recommenders as the main instrument to cater to the individual information needs of readers. Despite an increasingly language-diverse online community, in which many Internet users consume news in multiple languages, the majority of news recommendation focuses on major, resource-rich languages, and English in particular. Moreover, nearly all news recommendation effo… ▽ More Digital news platforms use news recommenders as the main instrument to cater to the individual information needs of readers. Despite an increasingly language-diverse online community, in which many Internet users consume news in multiple languages, the majority of news recommendation focuses on major, resource-rich languages, and English in particular. Moreover, nearly all news recommendation efforts assume monolingual news consumption, whereas more and more users tend to consume information in at least two languages. Accordingly, the existing body of work on news recommendation suffers from a lack of publicly available multilingual benchmarks that would catalyze development of news recommenders effective in multilingual settings and for low-resource languages. Aiming to fill this gap, we introduce xMIND, an open, multilingual news recommendation dataset derived from the English MIND dataset using machine translation, covering a set of 14 linguistically and geographically diverse languages, with digital footprints of varying sizes. Using xMIND, we systematically benchmark several state-of-the-art content-based neural news recommenders (NNRs) in both zero-shot (ZS-XLT) and few-shot (FS-XLT) cross-lingual transfer scenarios, considering both monolingual and bilingual news consumption patterns. Our findings reveal that (i) current NNRs, even when based on a multilingual language model, suffer from substantial performance losses under ZS-XLT and that (ii) inclusion of target-language data in FS-XLT training has limited benefits, particularly when combined with a bilingual news consumption. Our findings thus warrant a broader research effort in multilingual and cross-lingual news recommendation. The xMIND dataset is available at https://github.com/andreeaiana/xMIND. △ Less

Submitted 26 March, 2024; originally announced March 2024.

Comments: Accepted at the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024)

ACM Class: H.3.3

arXiv:2312.10370 [pdf, other]

Do Similar Entities have Similar Embeddings?

Authors: Nicolas Hubert, Heiko Paulheim, Armelle Brun, Davy Monticolo

Abstract: Knowledge graph embedding models (KGEMs) developed for link prediction learn vector representations for entities in a knowledge graph, known as embeddings. A common tacit assumption is the KGE entity similarity assumption, which states that these KGEMs retain the graph's structure within their embedding space, \textit{i.e.}, position similar entities within the graph close to one another. This des… ▽ More Knowledge graph embedding models (KGEMs) developed for link prediction learn vector representations for entities in a knowledge graph, known as embeddings. A common tacit assumption is the KGE entity similarity assumption, which states that these KGEMs retain the graph's structure within their embedding space, \textit{i.e.}, position similar entities within the graph close to one another. This desirable property make KGEMs widely used in downstream tasks such as recommender systems or drug repurposing. Yet, the relation of entity similarity and similarity in the embedding space has rarely been formally evaluated. Typically, KGEMs are assessed based on their sole link prediction capabilities, using ranked-based metrics such as Hits@K or Mean Rank. This paper challenges the prevailing assumption that entity similarity in the graph is inherently mirrored in the embedding space. Therefore, we conduct extensive experiments to measure the capability of KGEMs to cluster similar entities together, and investigate the nature of the underlying factors. Moreover, we study if different KGEMs expose a different notion of similarity. Datasets, pre-trained embeddings and code are available at: https://github.com/nicolas-hbt/similar-embeddings/. △ Less

Submitted 28 March, 2024; v1 submitted 16 December, 2023; originally announced December 2023.

Comments: Accepted at ESWC 2024

arXiv:2312.04997 [pdf, other]

Beyond Transduction: A Survey on Inductive, Few Shot, and Zero Shot Link Prediction in Knowledge Graphs

Authors: Nicolas Hubert, Pierre Monnin, Heiko Paulheim

Abstract: Knowledge graphs (KGs) comprise entities interconnected by relations of different semantic meanings. KGs are being used in a wide range of applications. However, they inherently suffer from incompleteness, i.e. entities or facts about entities are missing. Consequently, a larger body of works focuses on the completion of missing information in KGs, which is commonly referred to as link prediction… ▽ More Knowledge graphs (KGs) comprise entities interconnected by relations of different semantic meanings. KGs are being used in a wide range of applications. However, they inherently suffer from incompleteness, i.e. entities or facts about entities are missing. Consequently, a larger body of works focuses on the completion of missing information in KGs, which is commonly referred to as link prediction (LP). This task has traditionally and extensively been studied in the transductive setting, where all entities and relations in the testing set are observed during training. Recently, several works have tackled the LP task under more challenging settings, where entities and relations in the test set may be unobserved during training, or appear in only a few facts. These works are known as inductive, few-shot, and zero-shot link prediction. In this work, we conduct a systematic review of existing works in this area. A thorough analysis leads us to point out the undesirable existence of diverging terminologies and task definitions for the aforementioned settings, which further limits the possibility of comparison between recent works. We consequently aim at dissecting each setting thoroughly, attempting to reveal its intrinsic characteristics. A unifying nomenclature is ultimately proposed to refer to each of them in a simple and consistent manner. △ Less

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2311.03837 [pdf, other]

doi 10.1145/3587259.3627571

OLaLa: Ontology Matching with Large Language Models

Authors: Sven Hertling, Heiko Paulheim

Abstract: Ontology (and more generally: Knowledge Graph) Matching is a challenging task where information in natural language is one of the most important signals to process. With the rise of Large Language Models, it is possible to incorporate this knowledge in a better way into the matching pipeline. A number of decisions still need to be taken, e.g., how to generate a prompt that is useful to the model,… ▽ More Ontology (and more generally: Knowledge Graph) Matching is a challenging task where information in natural language is one of the most important signals to process. With the rise of Large Language Models, it is possible to incorporate this knowledge in a better way into the matching pipeline. A number of decisions still need to be taken, e.g., how to generate a prompt that is useful to the model, how information in the KG can be formulated in prompts, which Large Language Model to choose, how to provide existing correspondences to the model, how to generate candidates, etc. In this paper, we present a prototype that explores these questions by applying zero-shot and few-shot prompting with multiple open Large Language Models to different tasks of the Ontology Alignment Evaluation Initiative (OAEI). We show that with only a handful of examples and a well-designed prompt, it is possible to achieve results that are en par with supervised matching systems which use a much larger portion of the ground truth. △ Less

Submitted 7 November, 2023; originally announced November 2023.

Comments: Accepted at K-CAP 2023 conference

arXiv:2310.01146 [pdf, other]

NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation

Authors: Andreea Iana, Goran Glavaš, Heiko Paulheim

Abstract: NewsRecLib is an open-source library based on Pytorch-Lightning and Hydra developed for training and evaluating neural news recommendation models. The foremost goals of NewsRecLib are to promote reproducible research and rigorous experimental evaluation by (i) providing a unified and highly configurable framework for exhaustive experimental studies and (ii) enabling a thorough analysis of the perf… ▽ More NewsRecLib is an open-source library based on Pytorch-Lightning and Hydra developed for training and evaluating neural news recommendation models. The foremost goals of NewsRecLib are to promote reproducible research and rigorous experimental evaluation by (i) providing a unified and highly configurable framework for exhaustive experimental studies and (ii) enabling a thorough analysis of the performance contribution of different model architecture components and training regimes. NewsRecLib is highly modular, allows specifying experiments in a single configuration file, and includes extensive logging facilities. Moreover, NewsRecLib provides out-of-the-box implementations of several prominent neural models, training methods, standard evaluation benchmarks, and evaluation metrics for news recommendation. △ Less

Submitted 2 October, 2023; originally announced October 2023.

Comments: Accepted at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)

ACM Class: H.3.3; D.2.13

arXiv:2309.13939 [pdf, other]

The Time Traveler's Guide to Semantic Web Research: Analyzing Fictitious Research Themes in the ESWC "Next 20 Years" Track

Authors: Irene Celino, Heiko Paulheim

Abstract: What will Semantic Web research focus on in 20 years from now? We asked this question to the community and collected their visions in the "Next 20 years" track of ESWC 2023. We challenged the participants to submit "future" research papers, as if they were submitting to the 2043 edition of the conference. The submissions - entirely fictitious - were expected to be full scientific papers, with rese… ▽ More What will Semantic Web research focus on in 20 years from now? We asked this question to the community and collected their visions in the "Next 20 years" track of ESWC 2023. We challenged the participants to submit "future" research papers, as if they were submitting to the 2043 edition of the conference. The submissions - entirely fictitious - were expected to be full scientific papers, with research questions, state of the art references, experimental results and future work, with the goal to get an idea of the research agenda for the late 2040s and early 2050s. We received ten submissions, eight of which were accepted for presentation at the conference, that mixed serious ideas of potential future research themes and discussion topics with some fun and irony. In this paper, we intend to provide a survey of those "science fiction" papers, considering the emerging research themes and topics, analysing the research methods applied by the authors in these very special submissions, and investigating also the most fictitious parts (e.g., neologisms, fabricated references). Our goal is twofold: on the one hand, we investigate what this special track tells us about the Semantic Web community and, on the other hand, we aim at getting some insights on future research practices and directions. △ Less

Submitted 25 September, 2023; originally announced September 2023.

Comments: 13 pages, 8 figures, 2 tables

arXiv:2309.03023 [pdf, other]

Universal Preprocessing Operators for Embedding Knowledge Graphs with Literals

Authors: Patryk Preisner, Heiko Paulheim

Abstract: Knowledge graph embeddings are dense numerical representations of entities in a knowledge graph (KG). While the majority of approaches concentrate only on relational information, i.e., relations between entities, fewer approaches exist which also take information about literal values (e.g., textual descriptions or numerical information) into account. Those which exist are typically tailored toward… ▽ More Knowledge graph embeddings are dense numerical representations of entities in a knowledge graph (KG). While the majority of approaches concentrate only on relational information, i.e., relations between entities, fewer approaches exist which also take information about literal values (e.g., textual descriptions or numerical information) into account. Those which exist are typically tailored towards a particular modality of literal and a particular embedding method. In this paper, we propose a set of universal preprocessing operators which can be used to transform KGs with literals for numerical, temporal, textual, and image information, so that the transformed KGs can be embedded with any method. The results on the kgbench dataset with three different embedding methods show promising results. △ Less

Submitted 6 September, 2023; originally announced September 2023.

Comments: Accepted for DL4KG Workshop at ISWC 2023

arXiv:2309.00550 [pdf, other]

NeMig -- A Bilingual News Collection and Knowledge Graph about Migration

Authors: Andreea Iana, Mehwish Alam, Alexander Grote, Nevena Nikolajevic, Katharina Ludwig, Philipp Müller, Christof Weinhardt, Heiko Paulheim

Abstract: News recommendation plays a critical role in sha** the public's worldviews through the way in which it filters and disseminates information about different topics. Given the crucial impact that media plays in opinion formation, especially for sensitive topics, understanding the effects of personalized recommendation beyond accuracy has become essential in today's digital society. In this work, w… ▽ More News recommendation plays a critical role in sha** the public's worldviews through the way in which it filters and disseminates information about different topics. Given the crucial impact that media plays in opinion formation, especially for sensitive topics, understanding the effects of personalized recommendation beyond accuracy has become essential in today's digital society. In this work, we present NeMig, a bilingual news collection on the topic of migration, and corresponding rich user data. In comparison to existing news recommendation datasets, which comprise a large variety of monolingual news, NeMig covers articles on a single controversial topic, published in both Germany and the US. We annotate the sentiment polarization of the articles and the political leanings of the media outlets, in addition to extracting subtopics and named entities disambiguated through Wikidata. These features can be used to analyze the effects of algorithmic news curation beyond accuracy-based performance, such as recommender biases and the creation of filter bubbles. We construct domain-specific knowledge graphs from the news text and metadata, thus encoding knowledge-level connections between articles. Importantly, while existing datasets include only click behavior, we collect user socio-demographic and political information in addition to explicit click feedback. We demonstrate the utility of NeMig through experiments on the tasks of news recommenders benchmarking, analysis of biases in recommenders, and news trends analysis. NeMig aims to provide a useful resource for the news recommendation community and to foster interdisciplinary research into the multidimensional effects of algorithmic news curation. △ Less

Submitted 1 September, 2023; originally announced September 2023.

Comments: Accepted at the 11th International Workshop on News Recommendation and Analytics (INRA 2023) in conjunction with ACM RecSys 2023

arXiv:2308.10537 [pdf, other]

KGrEaT: A Framework to Evaluate Knowledge Graphs via Downstream Tasks

Authors: Nicolas Heist, Sven Hertling, Heiko Paulheim

Abstract: In recent years, countless research papers have addressed the topics of knowledge graph creation, extension, or completion in order to create knowledge graphs that are larger, more correct, or more diverse. This research is typically motivated by the argumentation that using such enhanced knowledge graphs to solve downstream tasks will improve performance. Nonetheless, this is hardly ever evaluate… ▽ More In recent years, countless research papers have addressed the topics of knowledge graph creation, extension, or completion in order to create knowledge graphs that are larger, more correct, or more diverse. This research is typically motivated by the argumentation that using such enhanced knowledge graphs to solve downstream tasks will improve performance. Nonetheless, this is hardly ever evaluated. Instead, the predominant evaluation metrics - aiming at correctness and completeness - are undoubtedly valuable but fail to capture the complete picture, i.e., how useful the created or enhanced knowledge graph actually is. Further, the accessibility of such a knowledge graph is rarely considered (e.g., whether it contains expressive labels, descriptions, and sufficient context information to link textual mentions to the entities of the knowledge graph). To better judge how well knowledge graphs perform on actual tasks, we present KGrEaT - a framework to estimate the quality of knowledge graphs via actual downstream tasks like classification, clustering, or recommendation. Instead of comparing different methods of processing knowledge graphs with respect to a single task, the purpose of KGrEaT is to compare various knowledge graphs as such by evaluating them on a fixed task setup. The framework takes a knowledge graph as input, automatically maps it to the datasets to be evaluated on, and computes performance metrics for the defined tasks. It is built in a modular way to be easily extendable with additional tasks and datasets. △ Less

Submitted 21 August, 2023; originally announced August 2023.

Comments: Accepted for the Short Paper track of CIKM'23, October 21-25, 2023, Birmingham, United Kingdom

arXiv:2308.03447 [pdf, other]

Biomedical Knowledge Graph Embeddings with Negative Statements

Authors: Rita T. Sousa, Sara Silva, Heiko Paulheim, Catia Pesquita

Abstract: A knowledge graph is a powerful representation of real-world entities and their relations. The vast majority of these relations are defined as positive statements, but the importance of negative statements is increasingly recognized, especially under an Open World Assumption. Explicitly considering negative statements has been shown to improve performance on tasks such as entity summarization and… ▽ More A knowledge graph is a powerful representation of real-world entities and their relations. The vast majority of these relations are defined as positive statements, but the importance of negative statements is increasingly recognized, especially under an Open World Assumption. Explicitly considering negative statements has been shown to improve performance on tasks such as entity summarization and question answering or domain-specific tasks such as protein function prediction. However, no attention has been given to the exploration of negative statements by knowledge graph embedding approaches despite the potential of negative statements to produce more accurate representations of entities in a knowledge graph. We propose a novel approach, TrueWalks, to incorporate negative statements into the knowledge graph representation learning process. In particular, we present a novel walk-generation method that is able to not only differentiate between positive and negative statements but also take into account the semantic implications of negation in ontology-rich knowledge graphs. This is of particular importance for applications in the biomedical domain, where the inadequacy of embedding approaches regarding negative statements at the ontology level has been identified as a crucial limitation. We evaluate TrueWalks in ontology-rich biomedical knowledge graphs in two different predictive tasks based on KG embeddings: protein-protein interaction prediction and gene-disease association prediction. We conduct an extensive analysis over established benchmarks and demonstrate that our method is able to improve the performance of knowledge graph embeddings on all tasks. △ Less

Submitted 7 August, 2023; originally announced August 2023.

Comments: 19 pages, 4 figures

arXiv:2307.16089 [pdf, other]

Train Once, Use Flexibly: A Modular Framework for Multi-Aspect Neural News Recommendation

Authors: Andreea Iana, Goran Glavaš, Heiko Paulheim

Abstract: Recent neural news recommenders (NNRs) extend content-based recommendation (1) by aligning additional aspects (e.g., topic, sentiment) between candidate news and user history or (2) by diversifying recommendations w.r.t. these aspects. This customization is achieved by "hardcoding" additional constraints into the NNR's architecture and/or training objectives: any change in the desired recommendati… ▽ More Recent neural news recommenders (NNRs) extend content-based recommendation (1) by aligning additional aspects (e.g., topic, sentiment) between candidate news and user history or (2) by diversifying recommendations w.r.t. these aspects. This customization is achieved by "hardcoding" additional constraints into the NNR's architecture and/or training objectives: any change in the desired recommendation behavior thus requires retraining the model with a modified objective. This impedes widespread adoption of multi-aspect news recommenders. In this work, we introduce MANNeR, a modular framework for multi-aspect neural news recommendation that supports on-the-fly customization over individual aspects at inference time. With metric-based learning as its backbone, MANNeR learns aspect-specialized news encoders and then flexibly and linearly combines the resulting aspect-specific similarity scores into different ranking functions, alleviating the need for ranking function-specific retraining of the model. Extensive experimental results show that MANNeR consistently outperforms state-of-the-art NNRs on both standard content-based recommendation and single- and multi-aspect customization. Lastly, we validate that MANNeR's aspect-customization module is robust to language and domain transfer. △ Less

Submitted 16 April, 2024; v1 submitted 29 July, 2023; originally announced July 2023.

ACM Class: H.3.3; I.2.7

arXiv:2306.03659 [pdf, other]

Schema First! Learn Versatile Knowledge Graph Embeddings by Capturing Semantics with MASCHInE

Authors: Nicolas Hubert, Heiko Paulheim, Pierre Monnin, Armelle Brun, Davy Monticolo

Abstract: Knowledge graph embedding models (KGEMs) have gained considerable traction in recent years. These models learn a vector representation of knowledge graph entities and relations, a.k.a. knowledge graph embeddings (KGEs). Learning versatile KGEs is desirable as it makes them useful for a broad range of tasks. However, KGEMs are usually trained for a specific task, which makes their embeddings task-d… ▽ More Knowledge graph embedding models (KGEMs) have gained considerable traction in recent years. These models learn a vector representation of knowledge graph entities and relations, a.k.a. knowledge graph embeddings (KGEs). Learning versatile KGEs is desirable as it makes them useful for a broad range of tasks. However, KGEMs are usually trained for a specific task, which makes their embeddings task-dependent. In parallel, the widespread assumption that KGEMs actually create a semantic representation of the underlying entities and relations (e.g., project similar entities closer than dissimilar ones) has been challenged. In this work, we design heuristics for generating protographs -- small, modified versions of a KG that leverage RDF/S information. The learnt protograph-based embeddings are meant to encapsulate the semantics of a KG, and can be leveraged in learning KGEs that, in turn, also better capture semantics. Extensive experiments on various evaluation benchmarks demonstrate the soundness of this approach, which we call Modular and Agnostic SCHema-based Integration of protograph Embeddings (MASCHInE). In particular, MASCHInE helps produce more versatile KGEs that yield substantially better performance for entity clustering and node classification tasks. For link prediction, using MASCHinE substantially increases the number of semantically valid predictions with equivalent rank-based performance. △ Less

Submitted 19 October, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

arXiv:2305.02966 [pdf, other]

ExeKGLib: Knowledge Graphs-Empowered Machine Learning Analytics

Authors: Antonis Klironomos, Baifan Zhou, Zhipeng Tan, Zhuoxun Zheng, Gad-Elrab Mohamed, Heiko Paulheim, Evgeny Kharlamov

Abstract: Many machine learning (ML) libraries are accessible online for ML practitioners. Typical ML pipelines are complex and consist of a series of steps, each of them invoking several ML libraries. In this demo paper, we present ExeKGLib, a Python library that allows users with coding skills and minimal ML knowledge to build ML pipelines. ExeKGLib relies on knowledge graphs to improve the transparency a… ▽ More Many machine learning (ML) libraries are accessible online for ML practitioners. Typical ML pipelines are complex and consist of a series of steps, each of them invoking several ML libraries. In this demo paper, we present ExeKGLib, a Python library that allows users with coding skills and minimal ML knowledge to build ML pipelines. ExeKGLib relies on knowledge graphs to improve the transparency and reusability of the built ML workflows, and to ensure that they are executable. We demonstrate the usage of ExeKGLib and compare it with conventional ML code to show its benefits. △ Less

Submitted 4 May, 2023; originally announced May 2023.

Comments: This paper has been accepted as a Demo paper at ESWC 2023

arXiv:2304.03112 [pdf, other]

Simplifying Content-Based Neural News Recommendation: On User Modeling and Training Objectives

Authors: Andreea Iana, Goran Glavaš, Heiko Paulheim

Abstract: The advent of personalized news recommendation has given rise to increasingly complex recommender architectures. Most neural news recommenders rely on user click behavior and typically introduce dedicated user encoders that aggregate the content of clicked news into user embeddings (early fusion). These models are predominantly trained with standard point-wise classification objectives. The existi… ▽ More The advent of personalized news recommendation has given rise to increasingly complex recommender architectures. Most neural news recommenders rely on user click behavior and typically introduce dedicated user encoders that aggregate the content of clicked news into user embeddings (early fusion). These models are predominantly trained with standard point-wise classification objectives. The existing body of work exhibits two main shortcomings: (1) despite general design homogeneity, direct comparisons between models are hindered by varying evaluation datasets and protocols; (2) it leaves alternative model designs and training objectives vastly unexplored. In this work, we present a unified framework for news recommendation, allowing for a systematic and fair comparison of news recommenders across several crucial design dimensions: (i) candidate-awareness in user modeling, (ii) click behavior fusion, and (iii) training objectives. Our findings challenge the status quo in neural news recommendation. We show that replacing sizable user encoders with parameter-efficient dot products between candidate and clicked news embeddings (late fusion) often yields substantial performance gains. Moreover, our results render contrastive training a viable alternative to point-wise classification objectives. △ Less

Submitted 6 April, 2023; originally announced April 2023.

Comments: Accepted at the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023)

ACM Class: H.3.3

arXiv:2303.15113 [pdf, other]

Describing and Organizing Semantic Web and Machine Learning Systems in the SWeMLS-KG

Authors: Fajar J. Ekaputra, Majlinda Llugiqi, Marta Sabou, Andreas Ekelhart, Heiko Paulheim, Anna Breit, Artem Revenko, Laura Waltersdorfer, Kheir Eddine Farfar, Sören Auer

Abstract: In line with the general trend in artificial intelligence research to create intelligent systems that combine learning and symbolic components, a new sub-area has emerged that focuses on combining machine learning (ML) components with techniques developed by the Semantic Web (SW) community - Semantic Web Machine Learning (SWeML for short). Due to its rapid growth and impact on several communities… ▽ More In line with the general trend in artificial intelligence research to create intelligent systems that combine learning and symbolic components, a new sub-area has emerged that focuses on combining machine learning (ML) components with techniques developed by the Semantic Web (SW) community - Semantic Web Machine Learning (SWeML for short). Due to its rapid growth and impact on several communities in the last two decades, there is a need to better understand the space of these SWeML Systems, their characteristics, and trends. Yet, surveys that adopt principled and unbiased approaches are missing. To fill this gap, we performed a systematic study and analyzed nearly 500 papers published in the last decade in this area, where we focused on evaluating architectural, and application-specific features. Our analysis identified a rapidly growing interest in SWeML Systems, with a high impact on several application domains and tasks. Catalysts for this rapid growth are the increased application of deep learning and knowledge graph technologies. By leveraging the in-depth understanding of this area acquired through this study, a further key contribution of this paper is a classification system for SWeML Systems which we publish as ontology. △ Less

Submitted 27 March, 2023; originally announced March 2023.

Comments: Preprint of a paper in the resource track of the 20th Extended Semantic Web Conference (ESWC'23)

arXiv:2303.04426 [pdf, other]

NASTyLinker: NIL-Aware Scalable Transformer-based Entity Linker

Authors: Nicolas Heist, Heiko Paulheim

Abstract: Entity Linking (EL) is the task of detecting mentions of entities in text and disambiguating them to a reference knowledge base. Most prevalent EL approaches assume that the reference knowledge base is complete. In practice, however, it is necessary to deal with the case of linking to an entity that is not contained in the knowledge base (NIL entity). Recent works have shown that, instead of focus… ▽ More Entity Linking (EL) is the task of detecting mentions of entities in text and disambiguating them to a reference knowledge base. Most prevalent EL approaches assume that the reference knowledge base is complete. In practice, however, it is necessary to deal with the case of linking to an entity that is not contained in the knowledge base (NIL entity). Recent works have shown that, instead of focusing only on affinities between mentions and entities, considering inter-mention affinities can be used to represent NIL entities by producing clusters of mentions. At the same time, inter-mention affinities can help to substantially improve linking performance for known entities. With NASTyLinker, we introduce an EL approach that is aware of NIL entities and produces corresponding mention clusters while maintaining high linking performance for known entities. The approach clusters mentions and entities based on dense representations from Transformers and resolves conflicts (if more than one entity is assigned to a cluster) by computing transitive mention-entity affinities. We show the effectiveness and scalability of NASTyLinker on NILK, a dataset that is explicitly constructed to evaluate EL with respect to NIL entities. Further, we apply the presented approach to an actual EL task, namely to knowledge graph population by linking entities in Wikipedia listings, and provide an analysis of the outcome. △ Less

Submitted 13 March, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

Comments: Preprint of a paper in the research track of the 20th Extended Semantic Web Conference (ESWC'23)

arXiv:2210.02864 [pdf, other]

DBkWik++ -- Multi Source Matching of Knowledge Graphs

Authors: Sven Hertling, Heiko Paulheim

Abstract: Large knowledge graphs like DBpedia and YAGO are always based on the same source, i.e., Wikipedia. But there are more wikis that contain information about long-tail entities such as wiki hosting platforms like Fandom. In this paper, we present the approach and analysis of DBkWik++, a fused Knowledge Graph from thousands of wikis. A modified version of the DBpedia framework is applied to each wiki… ▽ More Large knowledge graphs like DBpedia and YAGO are always based on the same source, i.e., Wikipedia. But there are more wikis that contain information about long-tail entities such as wiki hosting platforms like Fandom. In this paper, we present the approach and analysis of DBkWik++, a fused Knowledge Graph from thousands of wikis. A modified version of the DBpedia framework is applied to each wiki which results in many isolated Knowledge Graphs. With an incremental merge based approach, we reuse one-to-one matching systems to solve the multi source KG matching task. Based on this alignment we create a consolidated knowledge graph with more than 15 million instances. △ Less

Submitted 6 October, 2022; originally announced October 2022.

Comments: Published at KGSWC 2022

arXiv:2210.01482 [pdf, other]

Transformer-based Subject Entity Detection in Wikipedia Listings

Authors: Nicolas Heist, Heiko Paulheim

Abstract: In tasks like question answering or text summarisation, it is essential to have background knowledge about the relevant entities. The information about entities - in particular, about long-tail or emerging entities - in publicly available knowledge graphs like DBpedia or CaLiGraph is far from complete. In this paper, we present an approach that exploits the semi-structured nature of listings (like… ▽ More In tasks like question answering or text summarisation, it is essential to have background knowledge about the relevant entities. The information about entities - in particular, about long-tail or emerging entities - in publicly available knowledge graphs like DBpedia or CaLiGraph is far from complete. In this paper, we present an approach that exploits the semi-structured nature of listings (like enumerations and tables) to identify the main entities of the listing items (i.e., of entries and rows). These entities, which we call subject entities, can be used to increase the coverage of knowledge graphs. Our approach uses a transformer network to identify subject entities at the token-level and surpasses an existing approach in terms of performance while being bound by fewer limitations. Due to a flexible input format, it is applicable to any kind of listing and is, unlike prior work, not dependent on entity boundaries as input. We demonstrate our approach by applying it to the complete Wikipedia corpus and extracting 40 million mentions of subject entities with an estimated precision of 71% and recall of 77%. The results are incorporated in the most recent version of CaLiGraph. △ Less

Submitted 4 October, 2022; originally announced October 2022.

Comments: Published at Deep Learning for Knowledge Graphs workshop (DL4KG) at International Semantic Web Conference 2022 (ISWC 2022)

arXiv:2209.07479 [pdf, other]

Gollum: A Gold Standard for Large Scale Multi Source Knowledge Graph Matching

Authors: Sven Hertling, Heiko Paulheim

Abstract: The number of Knowledge Graphs (KGs) generated with automatic and manual approaches is constantly growing. For an integrated view and usage, an alignment between these KGs is necessary on the schema as well as instance level. While there are approaches that try to tackle this multi source knowledge graph matching problem, large gold standards are missing to evaluate their effectiveness and scalabi… ▽ More The number of Knowledge Graphs (KGs) generated with automatic and manual approaches is constantly growing. For an integrated view and usage, an alignment between these KGs is necessary on the schema as well as instance level. While there are approaches that try to tackle this multi source knowledge graph matching problem, large gold standards are missing to evaluate their effectiveness and scalability. We close this gap by presenting Gollum -- a gold standard for large-scale multi source knowledge graph matching with over 275,000 correspondences between 4,149 different KGs. They originate from knowledge graphs derived by applying the DBpedia extraction framework to a large wiki farm. Three variations of the gold standard are made available: (1) a version with all correspondences for evaluating unsupervised matching approaches, and two versions for evaluating supervised matching: (2) one where each KG is contained both in the train and test set, and (3) one where each KG is exclusively contained in the train or the test set. △ Less

Submitted 16 September, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

Comments: accepted at AKBC 2022

arXiv:2207.14094 [pdf, other]

Entity Type Prediction Leveraging Graph Walks and Entity Descriptions

Authors: Russa Biswas, Jan Portisch, Heiko Paulheim, Harald Sack, Mehwish Alam

Abstract: The entity type information in Knowledge Graphs (KGs) such as DBpedia, Freebase, etc. is often incomplete due to automated generation or human curation. Entity ty** is the task of assigning or inferring the semantic type of an entity in a KG. This paper presents \textit{GRAND}, a novel approach for entity ty** leveraging different graph walk strategies in RDF2vec together with textual entity d… ▽ More The entity type information in Knowledge Graphs (KGs) such as DBpedia, Freebase, etc. is often incomplete due to automated generation or human curation. Entity ty** is the task of assigning or inferring the semantic type of an entity in a KG. This paper presents \textit{GRAND}, a novel approach for entity ty** leveraging different graph walk strategies in RDF2vec together with textual entity descriptions. RDF2vec first generates graph walks and then uses a language model to obtain embeddings for each node in the graph. This study shows that the walk generation strategy and the embedding model have a significant effect on the performance of the entity ty** task. The proposed approach outperforms the baseline approaches on the benchmark datasets DBpedia and FIGER for entity ty** in KGs for both fine-grained and coarse-grained classes. The results show that the combination of order-aware RDF2vec variants together with the contextual embeddings of the textual entity descriptions achieve the best results. △ Less

Submitted 29 July, 2022; v1 submitted 28 July, 2022; originally announced July 2022.

arXiv:2207.09964 [pdf, other]

On a Generalized Framework for Time-Aware Knowledge Graphs

Authors: Franz Krause, Tobias Weller, Heiko Paulheim

Abstract: Knowledge graphs have emerged as an effective tool for managing and standardizing semistructured domain knowledge in a human- and machine-interpretable way. In terms of graph-based domain applications, such as embeddings and graph neural networks, current research is increasingly taking into account the time-related evolution of the information encoded within a graph. Algorithms and models for sta… ▽ More Knowledge graphs have emerged as an effective tool for managing and standardizing semistructured domain knowledge in a human- and machine-interpretable way. In terms of graph-based domain applications, such as embeddings and graph neural networks, current research is increasingly taking into account the time-related evolution of the information encoded within a graph. Algorithms and models for stationary and static knowledge graphs are extended to make them accessible for time-aware domains, where time-awareness can be interpreted in different ways. In particular, a distinction needs to be made between the validity period and the traceability of facts as objectives of time-related knowledge graph extensions. In this context, terms and definitions such as dynamic and temporal are often used inconsistently or interchangeably in the literature. Therefore, with this paper we aim to provide a short but well-defined overview of time-aware knowledge graph extensions and thus faciliate future research in this field as well. △ Less

Submitted 20 July, 2022; originally announced July 2022.

Comments: Accepted for publication at Semantics 2022

arXiv:2207.06014 [pdf, other]

The DLCC Node Classification Benchmark for Analyzing Knowledge Graph Embeddings

Authors: Jan Portisch, Heiko Paulheim

Abstract: Knowledge graph embedding is a representation learning technique that projects entities and relations in a knowledge graph to continuous vector spaces. Embeddings have gained a lot of uptake and have been heavily used in link prediction and other downstream prediction tasks. Most approaches are evaluated on a single task or a single group of tasks to determine their overall performance. The evalua… ▽ More Knowledge graph embedding is a representation learning technique that projects entities and relations in a knowledge graph to continuous vector spaces. Embeddings have gained a lot of uptake and have been heavily used in link prediction and other downstream prediction tasks. Most approaches are evaluated on a single task or a single group of tasks to determine their overall performance. The evaluation is then assessed in terms of how well the embedding approach performs on the task at hand. Still, it is hardly evaluated (and often not even deeply understood) what information the embedding approaches are actually learning to represent. To fill this gap, we present the DLCC (Description Logic Class Constructors) benchmark, a resource to analyze embedding approaches in terms of which kinds of classes they can represent. Two gold standards are presented, one based on the real-world knowledge graph DBpedia and one synthetic gold standard. In addition, an evaluation framework is provided that implements an experiment protocol so that researchers can directly use the gold standard. To demonstrate the use of DLCC, we compare multiple embedding approaches using the gold standards. We find that many DL constructors on DBpedia are actually learned by recognizing different correlated patterns than those defined in the gold standard and that specific DL constructors, such as cardinality constraints, are particularly hard to be learned for most embedding approaches. △ Less

Submitted 13 July, 2022; originally announced July 2022.

Comments: Accepted at International Semantic Web Conference (ISWC) 2022

arXiv:2204.13931 [pdf, ps, other]

KERMIT -- A Transformer-Based Approach for Knowledge Graph Matching

Authors: Sven Hertling, Jan Portisch, Heiko Paulheim

Abstract: One of the strongest signals for automated matching of knowledge graphs and ontologies are textual concept descriptions. With the rise of transformer-based language models, text comparison based on meaning (rather than lexical features) is available to researchers. However, performing pairwise comparisons of all textual descriptions of concepts in two knowledge graphs is expensive and scales quadr… ▽ More One of the strongest signals for automated matching of knowledge graphs and ontologies are textual concept descriptions. With the rise of transformer-based language models, text comparison based on meaning (rather than lexical features) is available to researchers. However, performing pairwise comparisons of all textual descriptions of concepts in two knowledge graphs is expensive and scales quadratically (or even worse if concepts have more than one description). To overcome this problem, we follow a two-step approach: we first generate matching candidates using a pre-trained sentence transformer (so called bi-encoder). In a second step, we use fine-tuned transformer cross-encoders to generate the best candidates. We evaluate our approach on multiple datasets and show that it is feasible and produces competitive results. △ Less

Submitted 29 April, 2022; originally announced April 2022.

Comments: accepted at the DeepOntoNLP Workshop at the ESWC 2022

arXiv:2204.13329 [pdf, other]

Refining Diagnosis Paths for Medical Diagnosis based on an Augmented Knowledge Graph

Authors: Niclas Heilig, Jan Kirchhoff, Florian Stumpe, Joan Plepi, Lucie Flek, Heiko Paulheim

Abstract: Medical diagnosis is the process of making a prediction of the disease a patient is likely to have, given a set of symptoms and observations. This requires extensive expert knowledge, in particular when covering a large variety of diseases. Such knowledge can be coded in a knowledge graph -- encompassing diseases, symptoms, and diagnosis paths. Since both the knowledge itself and its encoding can… ▽ More Medical diagnosis is the process of making a prediction of the disease a patient is likely to have, given a set of symptoms and observations. This requires extensive expert knowledge, in particular when covering a large variety of diseases. Such knowledge can be coded in a knowledge graph -- encompassing diseases, symptoms, and diagnosis paths. Since both the knowledge itself and its encoding can be incomplete, refining the knowledge graph with additional information helps physicians making better predictions. At the same time, for deployment in a hospital, the diagnosis must be explainable and transparent. In this paper, we present an approach using diagnosis paths in a medical knowledge graph. We show that those graphs can be refined using latent representations with RDF2vec, while the final diagnosis is still made in an explainable way. Using both an intrinsic as well as an expert-based evaluation, we show that the embedding-based prediction approach is beneficial for refining the graph with additional valid conditions. △ Less

Submitted 28 April, 2022; originally announced April 2022.

Comments: Accepted at the 5th Workshop on Semantic Web solutions for large-scale biomedical data analytics

arXiv:2204.04040 [pdf, other]

Ontology Matching Through Absolute Orientation of Embedding Spaces

Authors: Jan Portisch, Guilherme Costa, Karolin Stefani, Katharina Kreplin, Michael Hladik, Heiko Paulheim

Abstract: Ontology matching is a core task when creating interoperable and linked open datasets. In this paper, we explore a novel structure-based map** approach which is based on knowledge graph embeddings: The ontologies to be matched are embedded, and an approach known as absolute orientation is used to align the two embedding spaces. Next to the approach, the paper presents a first, preliminary evalua… ▽ More Ontology matching is a core task when creating interoperable and linked open datasets. In this paper, we explore a novel structure-based map** approach which is based on knowledge graph embeddings: The ontologies to be matched are embedded, and an approach known as absolute orientation is used to align the two embedding spaces. Next to the approach, the paper presents a first, preliminary evaluation using synthetic and real-world datasets. We find in experiments with synthetic data, that the approach works very well on similarly structured graphs; it handles alignment noise better than size and structural differences in the ontologies. △ Less

Submitted 8 April, 2022; originally announced April 2022.

Comments: accepted at the ESWC Posters and Demos Track

arXiv:2204.02777 [pdf, other]

Walk this Way! Entity Walks and Property Walks for RDF2vec

Authors: Jan Portisch, Heiko Paulheim

Abstract: RDF2vec is a knowledge graph embedding mechanism which first extracts sequences from knowledge graphs by performing random walks, then feeds those into the word embedding algorithm word2vec for computing vector representations for entities. In this poster, we introduce two new flavors of walk extraction coined e-walks and p-walks, which put an emphasis on the structure or the neighborhood of an en… ▽ More RDF2vec is a knowledge graph embedding mechanism which first extracts sequences from knowledge graphs by performing random walks, then feeds those into the word embedding algorithm word2vec for computing vector representations for entities. In this poster, we introduce two new flavors of walk extraction coined e-walks and p-walks, which put an emphasis on the structure or the neighborhood of an entity respectively, and thereby allow for creating embeddings which focus on similarity or relatedness. By combining the walk strategies with order-aware and classic RDF2vec, as well as CBOW and skip-gram word2vec embeddings, we conduct a preliminary evaluation with a total of 12 RDF2vec variants. △ Less

Submitted 5 April, 2022; originally announced April 2022.

Comments: accepted at the ESWC Posters and Demos Track

arXiv:2203.05824 [pdf, ps, other]

doi 10.1145/3487553.3524674

Towards Analyzing the Bias of News Recommender Systems Using Sentiment and Stance Detection

Authors: Mehwish Alam, Andreea Iana, Alexander Grote, Katharina Ludwig, Philipp Müller, Heiko Paulheim

Abstract: News recommender systems are used by online news providers to alleviate information overload and to provide personalized content to users. However, algorithmic news curation has been hypothesized to create filter bubbles and to intensify users' selective exposure, potentially increasing their vulnerability to polarized opinions and fake news. In this paper, we show how information on news items' s… ▽ More News recommender systems are used by online news providers to alleviate information overload and to provide personalized content to users. However, algorithmic news curation has been hypothesized to create filter bubbles and to intensify users' selective exposure, potentially increasing their vulnerability to polarized opinions and fake news. In this paper, we show how information on news items' stance and sentiment can be utilized to analyze and quantify the extent to which recommender systems suffer from biases. To that end, we have annotated a German news corpus on the topic of migration using stance detection and sentiment analysis. In an experimental evaluation with four different recommender systems, our results show a slight tendency of all four models for recommending articles with negative sentiments and stances against the topic of refugees and migration. Moreover, we observed a positive correlation between the sentiment and stance bias of the text-based recommenders and the preexisting user bias, which indicates that these systems amplify users' opinions and decrease the diversity of recommended news. The knowledge-aware model appears to be the least prone to such biases, at the cost of predictive accuracy. △ Less

Submitted 11 March, 2022; originally announced March 2022.

Comments: Accepted at the 2nd International Workshop on Knowledge Graphs for Online Discourse Analysis (KnOD 2022) collocated with The Web Conference 2022 (WWW'22), 25-29 April 2022, Lyon, France

arXiv:2111.02239 [pdf, other]

doi 10.1145/3460210.3493556

Order Matters: Matching Multiple Knowledge Graphs

Authors: Sven Hertling, Heiko Paulheim

Abstract: Knowledge graphs (KGs) provide information in machine interpretable form. In cases where multiple KGs are used in the same system, that information needs to be integrated. This is usually done by automated matching systems. Most of those systems consider only 1:1 (binary) matching tasks. Thus, matching a larger number of knowledge graphs with such systems would lead to quadratic efforts. In this p… ▽ More Knowledge graphs (KGs) provide information in machine interpretable form. In cases where multiple KGs are used in the same system, that information needs to be integrated. This is usually done by automated matching systems. Most of those systems consider only 1:1 (binary) matching tasks. Thus, matching a larger number of knowledge graphs with such systems would lead to quadratic efforts. In this paper, we empirically analyze different approaches to reduce the task of multi-source matching to a linear number of executions of binary matching systems. We show that the matching order of KGs and the multi-source strategy actually matter and that near-optimal results can be achieved with linear efforts. △ Less

Submitted 3 November, 2021; originally announced November 2021.

arXiv:2110.05028 [pdf, other]

The CaLiGraph Ontology as a Challenge for OWL Reasoners

Authors: Nicolas Heist, Heiko Paulheim

Abstract: CaLiGraph is a large-scale cross-domain knowledge graph generated from Wikipedia by exploiting the category system, list pages, and other list structures in Wikipedia, containing more than 15 million typed entities and around 10 million relation assertions. Other than knowledge graphs such as DBpedia and YAGO, whose ontologies are comparably simplistic, CaLiGraph also has a rich ontology, comprisi… ▽ More CaLiGraph is a large-scale cross-domain knowledge graph generated from Wikipedia by exploiting the category system, list pages, and other list structures in Wikipedia, containing more than 15 million typed entities and around 10 million relation assertions. Other than knowledge graphs such as DBpedia and YAGO, whose ontologies are comparably simplistic, CaLiGraph also has a rich ontology, comprising more than 200,000 class restrictions. Those two properties - a large A-box and a rich ontology - make it an interesting challenge for benchmarking reasoners. In this paper, we show that a reasoning task which is particularly relevant for CaLiGraph, i.e., the materialization of owl:hasValue constraints into assertions between individuals and between individuals and literals, is insufficiently supported by available reasoning systems. We provide differently sized benchmark subsets of CaLiGraph, which can be used for performance analysis of reasoning systems. △ Less

Submitted 4 January, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

Comments: Winner of the Dataset Track of the Semantic Reasoning Evaluation Challenge at the International Semantic Web Conference (ISWC), 2021

arXiv:2109.07401 [pdf, other]

Matching with Transformers in MELT

Authors: Sven Hertling, Jan Portisch, Heiko Paulheim

Abstract: One of the strongest signals for automated matching of ontologies and knowledge graphs are the textual descriptions of the concepts. The methods that are typically applied (such as character- or token-based comparisons) are relatively simple, and therefore do not capture the actual meaning of the texts. With the rise of transformer-based language models, text comparison based on meaning (rather th… ▽ More One of the strongest signals for automated matching of ontologies and knowledge graphs are the textual descriptions of the concepts. The methods that are typically applied (such as character- or token-based comparisons) are relatively simple, and therefore do not capture the actual meaning of the texts. With the rise of transformer-based language models, text comparison based on meaning (rather than lexical features) is possible. In this paper, we model the ontology matching task as classification problem and present approaches based on transformer models. We further provide an easy to use implementation in the MELT framework which is suited for ontology and knowledge graph matching. We show that a transformer-based filter helps to choose the correct correspondences given a high-recall alignment and already achieves a good result with simple alignment post-processing methods. △ Less

Submitted 15 September, 2021; originally announced September 2021.

Comments: accepted at the Ontology Matching Workshop at the International Semantic Web Conference (ISWC 2021)

arXiv:2108.05280 [pdf, other]

Putting RDF2vec in Order

Authors: Jan Portisch, Heiko Paulheim

Abstract: The RDF2vec method for creating node embeddings on knowledge graphs is based on word2vec, which, in turn, is agnostic towards the position of context words. In this paper, we argue that this might be a shortcoming when training RDF2vec, and show that using a word2vec variant which respects order yields considerable performance gains especially on tasks where entities of different classes are invol… ▽ More The RDF2vec method for creating node embeddings on knowledge graphs is based on word2vec, which, in turn, is agnostic towards the position of context words. In this paper, we argue that this might be a shortcoming when training RDF2vec, and show that using a word2vec variant which respects order yields considerable performance gains especially on tasks where entities of different classes are involved. △ Less

Submitted 11 August, 2021; originally announced August 2021.

Comments: Accepted at the ISWC 2021 posters and demos track

arXiv:2107.01856 [pdf, other]

Winning at Any Cost -- Infringing the Cartel Prohibition With Reinforcement Learning

Authors: Michael Schlechtinger, Damaris Kosack, Heiko Paulheim, Thomas Fetzer

Abstract: Pricing decisions are increasingly made by AI. Thanks to their ability to train with live market data while making decisions on the fly, deep reinforcement learning algorithms are especially effective in taking such pricing decisions. In e-commerce scenarios, multiple reinforcement learning agents can set prices based on their competitor's prices. Therefore, research states that agents might end u… ▽ More Pricing decisions are increasingly made by AI. Thanks to their ability to train with live market data while making decisions on the fly, deep reinforcement learning algorithms are especially effective in taking such pricing decisions. In e-commerce scenarios, multiple reinforcement learning agents can set prices based on their competitor's prices. Therefore, research states that agents might end up in a state of collusion in the long run. To further analyze this issue, we build a scenario that is based on a modified version of a prisoner's dilemma where three agents play the game of rock paper scissors. Our results indicate that the action selection can be dissected into specific stages, establishing the possibility to develop collusion prevention systems that are able to recognize situations which might lead to a collusion between competitors. We furthermore provide evidence for a situation where agents are capable of performing a tacit cooperation strategy without being explicitly trained to do so. △ Less

Submitted 5 July, 2021; originally announced July 2021.

Comments: accepted at the 19th International Conference on Practical Applications of Agents and Multi-Agent Systems (PAAMS 2021)

arXiv:2107.00873 [pdf, other]

On-Demand and Lightweight Knowledge Graph Generation -- a Demonstration with DBpedia

Authors: Malte Brockmeier, Yawen Liu, Sunita Pateer, Sven Hertling, Heiko Paulheim

Abstract: Modern large-scale knowledge graphs, such as DBpedia, are datasets which require large computational resources to serve and process. Moreover, they often have longer release cycles, which leads to outdated information in those graphs. In this paper, we present DBpedia on Demand -- a system which serves DBpedia resources on demand without the need to materialize and store the entire graph, and whic… ▽ More Modern large-scale knowledge graphs, such as DBpedia, are datasets which require large computational resources to serve and process. Moreover, they often have longer release cycles, which leads to outdated information in those graphs. In this paper, we present DBpedia on Demand -- a system which serves DBpedia resources on demand without the need to materialize and store the entire graph, and which even provides limited querying functionality. △ Less

Submitted 2 July, 2021; originally announced July 2021.

Comments: Accepted at Semantics 2021

arXiv:2107.00001 [pdf, other]

Background Knowledge in Schema Matching: Strategy vs. Data

Authors: Jan Portisch, Michael Hladik, Heiko Paulheim

Abstract: The use of external background knowledge can be beneficial for the task of matching schemas or ontologies automatically. In this paper, we exploit six general-purpose knowledge graphs as sources of background knowledge for the matching task. The background sources are evaluated by applying three different exploitation strategies. We find that explicit strategies still outperform latent ones and th… ▽ More The use of external background knowledge can be beneficial for the task of matching schemas or ontologies automatically. In this paper, we exploit six general-purpose knowledge graphs as sources of background knowledge for the matching task. The background sources are evaluated by applying three different exploitation strategies. We find that explicit strategies still outperform latent ones and that the choice of the strategy has a greater impact on the final alignment than the actual background dataset on which the strategy is applied. While we could not identify a universally superior resource, BabelNet achieved consistently good results. Our best matcher configuration with BabelNet performs very competitively when compared to other matching systems even though no dataset-specific optimizations were made. △ Less

Submitted 29 June, 2021; originally announced July 2021.

Comments: accepted at the International Semantic Web Conference '21 (ISWC 2021)

arXiv:2106.12340 [pdf, other]

doi 10.1109/JCDL52503.2021.00021

GraphConfRec: A Graph Neural Network-Based Conference Recommender System

Authors: Andreea Iana, Heiko Paulheim

Abstract: In today's academic publishing model, especially in Computer Science, conferences commonly constitute the main platforms for releasing the latest peer-reviewed advancements in their respective fields. However, choosing a suitable academic venue for publishing one's research can represent a challenging task considering the plethora of available conferences, particularly for those at the start of th… ▽ More In today's academic publishing model, especially in Computer Science, conferences commonly constitute the main platforms for releasing the latest peer-reviewed advancements in their respective fields. However, choosing a suitable academic venue for publishing one's research can represent a challenging task considering the plethora of available conferences, particularly for those at the start of their academic careers, or for those seeking to publish outside of their usual domain. In this paper, we propose GraphConfRec, a conference recommender system which combines SciGraph and graph neural networks, to infer suggestions based not only on title and abstract, but also on co-authorship and citation relationships. GraphConfRec achieves a recall@10 of up to 0.580 and a MAP of up to 0.336 with a graph attention network-based recommendation model. A user study with 25 subjects supports the positive results. △ Less

Submitted 23 June, 2021; originally announced June 2021.

Comments: Accepted at the Joint Conference on Digital Libraries (JCDL 2021)

Journal ref: 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2021, pp. 90-99

arXiv:2105.01305 [pdf, other]

Large-scale Taxonomy Induction Using Entity and Word Embeddings

Authors: Petar Ristoski, Stefano Faralli, Simone Paolo Ponzetto, Heiko Paulheim

Abstract: Taxonomies are an important ingredient of knowledge organization, and serve as a backbone for more sophisticated knowledge representations in intelligent systems, such as formal ontologies. However, building taxonomies manually is a costly endeavor, and hence, automatic methods for taxonomy induction are a good alternative to build large-scale taxonomies. In this paper, we propose TIEmb, an approa… ▽ More Taxonomies are an important ingredient of knowledge organization, and serve as a backbone for more sophisticated knowledge representations in intelligent systems, such as formal ontologies. However, building taxonomies manually is a costly endeavor, and hence, automatic methods for taxonomy induction are a good alternative to build large-scale taxonomies. In this paper, we propose TIEmb, an approach for automatic unsupervised class subsumption axiom extraction from knowledge bases using entity and text embeddings. We apply the approach on the WebIsA database, a database of subsumption relations extracted from the large portion of the World Wide Web, to extract class hierarchies in the Person and Place domain. △ Less

Submitted 4 May, 2021; originally announced May 2021.

Comments: Published at IEEE/WIC/ACM International Conference on Web Intelligence 2017 (WI'17)

arXiv:2105.00674 [pdf, other]

Bias in Knowledge Graphs -- an Empirical Study with Movie Recommendation and Different Language Editions of DBpedia

Authors: Michael Matthias Voit, Heiko Paulheim

Abstract: Public knowledge graphs such as DBpedia and Wikidata have been recognized as interesting sources of background knowledge to build content-based recommender systems. They can be used to add information about the items to be recommended and links between those. While quite a few approaches for exploiting knowledge graphs have been proposed, most of them aim at optimizing the recommendation strategy… ▽ More Public knowledge graphs such as DBpedia and Wikidata have been recognized as interesting sources of background knowledge to build content-based recommender systems. They can be used to add information about the items to be recommended and links between those. While quite a few approaches for exploiting knowledge graphs have been proposed, most of them aim at optimizing the recommendation strategy while using a fixed knowledge graph. In this paper, we take a different approach, i.e., we fix the recommendation strategy and observe changes when using different underlying knowledge graphs. Particularly, we use different language editions of DBpedia. We show that the usage of different knowledge graphs does not only lead to differently biased recommender systems, but also to recommender systems that differ in performance for particular fields of recommendations. △ Less

Submitted 3 May, 2021; originally announced May 2021.

Comments: Accepted for publication at 3rd Conference on Language, Data and Knowledge (LDK 2021)

arXiv:2103.05110 [pdf, other]

Web Table Classification based on Visual Features

Authors: Babette Bühler, Heiko Paulheim

Abstract: Tables on the web constitute a valuable data source for many applications, like factual search and knowledge base augmentation. However, as genuine tables containing relational knowledge only account for a small proportion of tables on the web, reliable genuine web table classification is a crucial first step of table extraction. Previous works usually rely on explicit feature construction from th… ▽ More Tables on the web constitute a valuable data source for many applications, like factual search and knowledge base augmentation. However, as genuine tables containing relational knowledge only account for a small proportion of tables on the web, reliable genuine web table classification is a crucial first step of table extraction. Previous works usually rely on explicit feature construction from the HTML code. In contrast, we propose an approach for web table classification by exploiting the full visual appearance of a table, which works purely by applying a convolutional neural network on the rendered image of the web table. Since these visual features can be extracted automatically, our approach circumvents the need for explicit feature construction. A new hand labeled gold standard dataset containing HTML source code and images for 13,112 tables was generated for this task. Transfer learning techniques are applied to well known VGG16 and ResNet50 architectures. The evaluation of CNN image classification with fine tuned ResNet50 (F1 93.29%) shows that this approach achieves results comparable to previous solutions using explicitly defined HTML code based features. By combining visual and explicit features, an F-measure of 93.70% can be achieved by Random Forest classification, which beats current state of the art methods. △ Less

Submitted 25 February, 2021; originally announced March 2021.

Comments: Accepted at International Conference of Web Engineering (ICWE 2021)

arXiv:2103.01576 [pdf, other]

FinMatcher at FinSim-2: Hypernym Detection in the Financial Services Domain using Knowledge Graphs

Authors: Jan Portisch, Michael Hladik, Heiko Paulheim

Abstract: This paper presents the FinMatcher system and its results for the FinSim 2021 shared task which is co-located with the Workshop on Financial Technology on the Web (FinWeb) in conjunction with The Web Conference. The FinSim-2 shared task consists of a set of concept labels from the financial services domain. The goal is to find the most relevant top-level concept from a given set of concepts. The F… ▽ More This paper presents the FinMatcher system and its results for the FinSim 2021 shared task which is co-located with the Workshop on Financial Technology on the Web (FinWeb) in conjunction with The Web Conference. The FinSim-2 shared task consists of a set of concept labels from the financial services domain. The goal is to find the most relevant top-level concept from a given set of concepts. The FinMatcher system exploits three publicly available knowledge graphs, namely WordNet, Wikidata, and WebIsALOD. The graphs are used to generate explicit features as well as latent features which are fed into a neural classifier to predict the closest hypernym. △ Less

Submitted 2 March, 2021; originally announced March 2021.

arXiv:2102.05444 [pdf, other]

doi 10.1145/3442381.3449836

Information Extraction From Co-Occurring Similar Entities

Authors: Nicolas Heist, Heiko Paulheim

Abstract: Knowledge about entities and their interrelations is a crucial factor of success for tasks like question answering or text summarization. Publicly available knowledge graphs like Wikidata or DBpedia are, however, far from being complete. In this paper, we explore how information extracted from similar entities that co-occur in structures like tables or lists can help to increase the coverage of su… ▽ More Knowledge about entities and their interrelations is a crucial factor of success for tasks like question answering or text summarization. Publicly available knowledge graphs like Wikidata or DBpedia are, however, far from being complete. In this paper, we explore how information extracted from similar entities that co-occur in structures like tables or lists can help to increase the coverage of such knowledge graphs. In contrast to existing approaches, we do not focus on relationships within a listing (e.g., between two entities in a table row) but on the relationship between a listing's subject entities and the context of the listing. To that end, we propose a descriptive rule mining approach that uses distant supervision to derive rules for these relationships based on a listing's context. Extracted from a suitable data corpus, the rules can be used to extend a knowledge graph with novel entities and assertions. In our experiments we demonstrate that the approach is able to extract up to 3M novel entities and 30M additional assertions from listings in Wikipedia. We find that the extracted information is of high quality and thus suitable to extend Wikipedia-based knowledge graphs like DBpedia, YAGO, and CaLiGraph. For the case of DBpedia, this would result in an increase of covered entities by roughly 50%. △ Less

Submitted 15 February, 2021; v1 submitted 10 February, 2021; originally announced February 2021.

Comments: Preprint of a paper accepted for the research track of the Web Conference (WWW'21), April 19-23, 2021, Ljubljana, Slovenia

arXiv:2012.11936 [pdf, other]

Knowledge Graphs Evolution and Preservation -- A Technical Report from ISWS 2019

Authors: Nacira Abbas, Kholoud Alghamdi, Mortaza Alinam, Francesca Alloatti, Glenda Amaral, Claudia d'Amato, Luigi Asprino, Martin Beno, Felix Bensmann, Russa Biswas, Ling Cai, Riley Capshaw, Valentina Anita Carriero, Irene Celino, Amine Dadoun, Stefano De Giorgis, Harm Delva, John Domingue, Michel Dumontier, Vincent Emonet, Marieke van Erp, Paola Espinoza Arias, Omaima Fallatah, Sebastián Ferrada, Marc Gallofré Ocaña , et al. (49 additional authors not shown)

Abstract: One of the grand challenges discussed during the Dagstuhl Seminar "Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web" and described in its report is that of a: "Public FAIR Knowledge Graph of Everything: We increasingly see the creation of knowledge graphs that capture information about the entirety of a class of entities. [...] This grand challenge extends this fur… ▽ More One of the grand challenges discussed during the Dagstuhl Seminar "Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web" and described in its report is that of a: "Public FAIR Knowledge Graph of Everything: We increasingly see the creation of knowledge graphs that capture information about the entirety of a class of entities. [...] This grand challenge extends this further by asking if we can create a knowledge graph of "everything" ranging from common sense concepts to location based entities. This knowledge graph should be "open to the public" in a FAIR manner democratizing this mass amount of knowledge." Although linked open data (LOD) is one knowledge graph, it is the closest realisation (and probably the only one) to a public FAIR Knowledge Graph (KG) of everything. Surely, LOD provides a unique testbed for experimenting and evaluating research hypotheses on open and FAIR KG. One of the most neglected FAIR issues about KGs is their ongoing evolution and long term preservation. We want to investigate this problem, that is to understand what preserving and supporting the evolution of KGs means and how these problems can be addressed. Clearly, the problem can be approached from different perspectives and may require the development of different approaches, including new theories, ontologies, metrics, strategies, procedures, etc. This document reports a collaborative effort performed by 9 teams of students, each guided by a senior researcher as their mentor, attending the International Semantic Web Research School (ISWS 2019). Each team provides a different perspective to the problem of knowledge graph evolution substantiated by a set of research questions as the main subject of their investigation. In addition, they provide their working definition for KG preservation and evolution. △ Less

Submitted 22 December, 2020; originally announced December 2020.

arXiv:2009.11102 [pdf, other]

Supervised Ontology and Instance Matching with MELT

Authors: Sven Hertling, Jan Portisch, Heiko Paulheim

Abstract: In this paper, we present MELT-ML, a machine learning extension to the Matching and EvaLuation Toolkit (MELT) which facilitates the application of supervised learning for ontology and instance matching. Our contributions are twofold: We present an open source machine learning extension to the matching toolkit as well as two supervised learning use cases demonstrating the capabilities of the new ex… ▽ More In this paper, we present MELT-ML, a machine learning extension to the Matching and EvaLuation Toolkit (MELT) which facilitates the application of supervised learning for ontology and instance matching. Our contributions are twofold: We present an open source machine learning extension to the matching toolkit as well as two supervised learning use cases demonstrating the capabilities of the new extension. △ Less

Submitted 20 September, 2020; originally announced September 2020.

Comments: accepted at the the Fifteenth International Workshop on Ontology Matching collocated with the 19th International Semantic Web Conference ISWC-2020

arXiv:2009.07659 [pdf, other]

RDF2Vec Light -- A Lightweight Approach for Knowledge Graph Embeddings

Authors: Jan Portisch, Michael Hladik, Heiko Paulheim

Abstract: Knowledge graph embedding approaches represent nodes and edges of graphs as mathematical vectors. Current approaches focus on embedding complete knowledge graphs, i.e. all nodes and edges. This leads to very high computational requirements on large graphs such as DBpedia or Wikidata. However, for most downstream application scenarios, only a small subset of concepts is of actual interest. In this… ▽ More Knowledge graph embedding approaches represent nodes and edges of graphs as mathematical vectors. Current approaches focus on embedding complete knowledge graphs, i.e. all nodes and edges. This leads to very high computational requirements on large graphs such as DBpedia or Wikidata. However, for most downstream application scenarios, only a small subset of concepts is of actual interest. In this paper, we present RDF2Vec Light, a lightweight embedding approach based on RDF2Vec which generates vectors for only a subset of entities. To that end, RDF2Vec Light only traverses and processes a subgraph of the knowledge graph. Our method allows the application of embeddings of very large knowledge graphs in scenarios where such embeddings were not possible before due to a significantly lower runtime and significantly reduced hardware requirements. △ Less

Submitted 17 September, 2020; v1 submitted 16 September, 2020; originally announced September 2020.

Comments: Accepted at International Semantic Web Conference (ISWC) 2020, Posters and Demonstrations Track

arXiv:2009.04404 [pdf, ps, other]

Walk Extraction Strategies for Node Embeddings with RDF2Vec in Knowledge Graphs

Authors: Gilles Vandewiele, Bram Steenwinckel, Pieter Bonte, Michael Weyns, Heiko Paulheim, Petar Ristoski, Filip De Turck, Femke Ongenae

Abstract: As KGs are symbolic constructs, specialized techniques have to be applied in order to make them compatible with data mining techniques. RDF2Vec is an unsupervised technique that can create task-agnostic numerical representations of the nodes in a KG by extending successful language modelling techniques. The original work proposed the Weisfeiler-Lehman (WL) kernel to improve the quality of the repr… ▽ More As KGs are symbolic constructs, specialized techniques have to be applied in order to make them compatible with data mining techniques. RDF2Vec is an unsupervised technique that can create task-agnostic numerical representations of the nodes in a KG by extending successful language modelling techniques. The original work proposed the Weisfeiler-Lehman (WL) kernel to improve the quality of the representations. However, in this work, we show both formally and empirically that the WL kernel does little to improve walk embeddings in the context of a single KG. As an alternative to the WL kernel, we propose five different strategies to extract information complementary to basic random walks. We compare these walks on several benchmark datasets to show that the \emph{n-gram} strategy performs best on average on node classification tasks and that tuning the walk strategy can result in improved predictive performances. △ Less

Submitted 9 September, 2020; originally announced September 2020.

arXiv:2009.00318 [pdf, other]

More is not Always Better: The Negative Impact of A-box Materialization on RDF2vec Knowledge Graph Embeddings

Authors: Andreea Iana, Heiko Paulheim

Abstract: RDF2vec is an embedding technique for representing knowledge graph entities in a continuous vector space. In this paper, we investigate the effect of materializing implicit A-box axioms induced by subproperties, as well as symmetric and transitive properties. While it might be a reasonable assumption that such a materialization before computing embeddings might lead to better embeddings, we conduc… ▽ More RDF2vec is an embedding technique for representing knowledge graph entities in a continuous vector space. In this paper, we investigate the effect of materializing implicit A-box axioms induced by subproperties, as well as symmetric and transitive properties. While it might be a reasonable assumption that such a materialization before computing embeddings might lead to better embeddings, we conduct a set of experiments on DBpedia which demonstrate that the materialization actually has a negative effect on the performance of RDF2vec. In our analysis, we argue that despite the huge body of work devoted on completing missing information in knowledge graphs, such missing implicit information is actually a signal, not a defect, and we show examples illustrating that assumption. △ Less

Submitted 1 September, 2020; originally announced September 2020.

Comments: Accepted at the Workshop on Combining Symbolic and Sub-symbolic methods and their Applications (CSSA 2020)

arXiv:2008.05239 [pdf, other]

A Knowledge Graph for Assessing Aggressive Tax Planning Strategies

Authors: Niklas Lüdemann, Ageda Shiba, Nikolaos Thymianis, Nicolas Heist, Christopher Ludwig, Heiko Paulheim

Abstract: The taxation of multi-national companies is a complex field, since it is influenced by the legislation of several states. Laws in different states may have unforeseen interaction effects, which can be exploited by allowing multinational companies to minimize taxes, a concept known as tax planning. In this paper, we present a knowledge graph of multinational companies and their relationships, compr… ▽ More The taxation of multi-national companies is a complex field, since it is influenced by the legislation of several states. Laws in different states may have unforeseen interaction effects, which can be exploited by allowing multinational companies to minimize taxes, a concept known as tax planning. In this paper, we present a knowledge graph of multinational companies and their relationships, comprising almost 1.5M business entities. We show that commonly known tax planning strategies can be formulated as subgraph queries to that graph, which allows for identifying companies using certain strategies. Moreover, we demonstrate that we can identify anomalies in the graph which hint at potential tax planning strategies, and we show how to enhance those analyses by incorporating information from Wikidata using federated queries. △ Less

Submitted 16 October, 2020; v1 submitted 12 August, 2020; originally announced August 2020.

Comments: Accepted to International Semantic Web Conference 2020

Showing 1–50 of 58 results for author: Paulheim, H