Search | arXiv e-print repository

GNN2R: Weakly-Supervised Rationale-Providing Question Answering over Knowledge Graphs

Authors: Ruijie Wang, Luca Rossetto, Michael Cochez, Abraham Bernstein

Abstract: Most current methods for multi-hop question answering (QA) over knowledge graphs (KGs) only provide final conclusive answers without explanations, such as a set of KG entities that is difficult for normal users to review and comprehend. This issue severely limits the application of KG-based QA in real-world scenarios. However, it is non-trivial to solve due to two challenges: First, annotations of reasoning chains of multi-hop questions, which could serve as supervision for explanation generation, are usually lacking. Second, it is difficult to maintain high efficiency when explicit KG triples need to be retrieved to generate explanations. In this paper, we propose a novel Graph Neural Network-based Two-Step Reasoning model (GNN2R) to solve this issue. GNN2R can provide both final answers and reasoning subgraphs as a rationale behind final answers efficiently with only weak supervision that is available through question-final answer pairs. We extensively evaluated GNN2R with detailed analyses in experiments. The results demonstrate that, in terms of effectiveness, efficiency, and quality of generated explanations, GNN2R outperforms existing state-of-the-art methods that are applicable to this task. Our code and pre-trained models are available at https://github.com/ruijie-wang-uzh/GNN2R.

Submitted 20 January, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

arXiv:2311.05778 [pdf, other]

DONUT-hole: DONUT Sparsification by Harnessing Knowledge and Optimizing Learning Efficiency

Authors: Azhar Shaikh, Michael Cochez, Denis Diachkov, Michiel de Rijcke, Sahar Yousefi

Abstract: This paper introduces DONUT-hole, a sparse OCR-free visual document understanding (VDU) model that addresses the limitations of its predecessor model, dubbed DONUT. The DONUT model, leveraging a transformer architecture, overcomes the challenges of separate optical character recognition (OCR) and visual semantic understanding (VSU) components. However, its deployment in production environments and edge devices is hindered by high memory and computational demands, particularly in large-scale request services. To overcome these challenges, we propose an optimization strategy based on knowledge distillation and model pruning. Our paradigm for producing DONUT-hole reduces the model density by 54% while preserving performance. We also achieve a global representational similarity index between DONUT and DONUT-hole, based on the centered kernel alignment (CKA) metric, of 0.79. Moreover, we evaluate the effectiveness of DONUT-hole in the document image key information extraction (KIE) task, highlighting its potential for developing more efficient VDU systems for logistics companies.

Submitted 9 November, 2023; originally announced November 2023.

arXiv:2310.04598 [pdf, other]

A Neuro-Symbolic Framework for Answering Graph Pattern Queries in Knowledge Graphs

Authors: Tamara Cucumides, Daniel Daza, Pablo Barceló, Michael Cochez, Floris Geerts, Juan L Reutter, Miguel Romero

Abstract: The challenge of answering graph queries over incomplete knowledge graphs is gaining significant attention in the machine learning community. Neuro-symbolic models have emerged as a promising approach, combining good performance with high interpretability. These models utilize trained architectures to execute atomic queries and integrate modules that mimic symbolic query operators. However, most neuro-symbolic query processors are constrained to tree-like graph pattern queries. These queries admit a bottom-up execution with constant values or anchors at the leaves and the target variable at the root. While expressive, tree-like queries fail to capture critical properties in knowledge graphs, such as the existence of multiple edges between entities or the presence of triangles. We introduce a framework for answering arbitrary graph pattern queries over incomplete knowledge graphs, encompassing both cyclic queries and tree-like queries with existentially quantified leaves. These classes of queries are vital for practical applications but are beyond the scope of most current neuro-symbolic models. Our approach employs an approximation scheme that facilitates acyclic traversals for cyclic patterns, thereby embedding additional symbolic bias into the query execution process. Our experimental evaluation demonstrates that our framework performs competitively on three datasets, effectively handling cyclic queries through our approximation strategy. Additionally, it maintains the performance of existing neuro-symbolic models on anchored tree-like queries and extends their capabilities to queries with existentially quantified variables.

Submitted 5 June, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

arXiv:2308.06585 [pdf, other]

Approximate Answering of Graph Queries

Authors: Michael Cochez, Dimitrios Alivanistos, Erik Arakelyan, Max Berrendorf, Daniel Daza, Mikhail Galkin, Pasquale Minervini, Mathias Niepert, Hongyu Ren

Abstract: Knowledge graphs (KGs) are inherently incomplete because of incomplete world knowledge and bias in what is input to the KG. Additionally, world knowledge constantly expands and evolves, making existing facts deprecated or introducing new ones. However, we would still want to be able to answer queries as if the graph were complete. In this chapter, we will give an overview of several methods which have been proposed to answer queries in such a setting. We will first provide an overview of the different query types which can be supported by these methods and datasets typically used for evaluation, as well as an insight into their limitations. Then, we give an overview of the different approaches and describe them in terms of expressiveness, supported graph types, and inference capabilities.

Submitted 12 August, 2023; originally announced August 2023.

Comments: Preprint of Ch. 17 "Approximate Answering of Graph Queries" in "Compendium of Neurosymbolic Artificial Intelligence", https://ebooks.iospress.nl/ISBN/978-1-64368-406-2

arXiv:2306.03606 [pdf, other]

BioBLP: A Modular Framework for Learning on Multimodal Biomedical Knowledge Graphs

Authors: Daniel Daza, Dimitrios Alivanistos, Payal Mitra, Thom Pijnenburg, Michael Cochez, Paul Groth

Abstract: Knowledge graphs (KGs) are an important tool for representing complex relationships between entities in the biomedical domain. Several methods have been proposed for learning embeddings that can be used to predict new links in such graphs. Some methods ignore valuable attribute data associated with entities in biomedical KGs, such as protein sequences or molecular graphs. Other works incorporate such data, but assume that entities can be represented with the same data modality. This is not always the case for biomedical KGs, where entities exhibit heterogeneous modalities that are central to their representation in the subject domain. We propose a modular framework for learning embeddings in KGs with entity attributes that allows encoding attribute data of different modalities while also supporting entities with missing attributes. We additionally propose an efficient pretraining strategy for reducing the required training runtime. We train models using a biomedical KG containing approximately 2 million triples, and evaluate the performance of the resulting entity embeddings on the tasks of link prediction and drug-protein interaction prediction, comparing against methods that do not take attribute data into account. In the standard link prediction evaluation, the proposed method results in competitive, yet lower performance than baselines that do not use attribute data. When evaluated in the task of drug-protein interaction prediction, the method compares favorably with the baselines. We find settings involving low-degree entities, which make up a substantial share of the entities in the KG, where our method outperforms the baselines. Our proposed pretraining strategy yields significantly higher performance while reducing the required training runtime. Our implementation is available at https://github.com/elsevier-AI-Lab/BioBLP.

Submitted 6 June, 2023; originally announced June 2023.

arXiv:2304.11949 [pdf, other]

Geometric Relational Embeddings: A Survey

Authors: Bo Xiong, Mojtaba Nayyeri, Ming Jin, Yunjie He, Michael Cochez, Shirui Pan, Steffen Staab

Abstract: Geometric relational embeddings map relational data as geometric objects that combine vector information suitable for machine learning and structured/relational information for structured/relational reasoning, typically in low dimensions. Their preservation of relational structures and their appealing properties and interpretability have led to their uptake for tasks such as knowledge graph completion, ontology and hierarchy reasoning, logical query answering, and hierarchical multi-label classification. We survey methods that underlie geometric relational embeddings and categorize them based on (i) the embedding geometries that are used to represent the data; and (ii) the relational reasoning tasks that they aim to improve. We identify the desired properties (i.e., inductive biases) of each kind of embedding and discuss some potential future work.

Submitted 24 April, 2023; originally announced April 2023.

Comments: Work in progress

arXiv:2303.14617 [pdf, other]

Neural Graph Reasoning: Complex Logical Query Answering Meets Graph Databases

Authors: Hongyu Ren, Mikhail Galkin, Michael Cochez, Zhaocheng Zhu, Jure Leskovec

Abstract: Complex logical query answering (CLQA) is a recently emerged task of graph machine learning that goes beyond simple one-hop link prediction and solves a far more complex task of multi-hop logical reasoning over massive, potentially incomplete graphs in a latent space. The task has received significant traction in the community; numerous works expanded the field along theoretical and practical axes to tackle different types of complex queries and graph modalities with efficient systems. In this paper, we provide a holistic survey of CLQA with a detailed taxonomy studying the field from multiple angles, including graph types (modality, reasoning domain, background semantics), modeling aspects (encoder, processor, decoder), supported queries (operators, patterns, projected variables), datasets, evaluation metrics, and applications. Refining the CLQA task, we introduce the concept of Neural Graph Databases (NGDBs). Extending the idea of graph databases (graph DBs), NGDB consists of a Neural Graph Storage and a Neural Graph Engine. Inside Neural Graph Storage, we design a graph store, a feature store, and further embed information in a latent embedding store using an encoder. Given a query, Neural Query Engine learns how to perform query planning and execution in order to efficiently retrieve the correct results by interacting with the Neural Graph Storage. Compared with traditional graph DBs, NGDBs allow for a flexible and unified modeling of features in diverse modalities using the embedding store. Moreover, when the graph is incomplete, they can provide robust retrieval of answers which a normal graph DB cannot recover. Finally, we point out promising directions, unsolved problems and applications of NGDB for future research.

Submitted 26 March, 2023; originally announced March 2023.

arXiv:2301.12313 [pdf, other]

Adapting Neural Link Predictors for Data-Efficient Complex Query Answering

Authors: Erik Arakelyan, Pasquale Minervini, Daniel Daza, Michael Cochez, Isabelle Augenstein

Abstract: Answering complex queries on incomplete knowledge graphs is a challenging task where a model needs to answer complex logical queries in the presence of missing knowledge. Prior work in the literature has proposed to address this problem by designing architectures trained end-to-end for the complex query answering task with a reasoning process that is hard to interpret while requiring data- and resource-intensive training. Other lines of research have proposed re-using simple neural link predictors to answer complex queries, reducing the amount of training data by orders of magnitude while providing interpretable answers. The neural link predictor used in such approaches is not explicitly optimised for the complex query answering task, implying that its scores are not calibrated to interact together. We propose to address these problems via CQD$^{\mathcal{A}}$, a parameter-efficient score adaptation model optimised to re-calibrate neural link prediction scores for the complex query answering task. While the neural link predictor is frozen, the adaptation component -- which only increases the number of model parameters by $0.03\%$ -- is trained on the downstream complex query answering task. Furthermore, the calibration component enables us to support reasoning over queries that include atomic negations, which was previously impossible with link predictors. In our experiments, CQD$^{\mathcal{A}}$ produces significantly more accurate results than current state-of-the-art methods, improving from $34.4$ to $35.1$ Mean Reciprocal Rank values averaged across all datasets and query types while using $\leq 30\%$ of the available training query types. We further show that CQD$^{\mathcal{A}}$ is data-efficient, achieving competitive results with only $1\%$ of the training complex queries, and robust in out-of-domain evaluations.

Submitted 11 July, 2023; v1 submitted 28 January, 2023; originally announced January 2023.

arXiv:2212.13261 [pdf, other]

Explainable AI for Bioinformatics: Methods, Tools, and Applications

Authors: Md. Rezaul Karim, Tanhim Islam, Oya Beyan, Christoph Lange, Michael Cochez, Dietrich Rebholz-Schuhmann, Stefan Decker

Abstract: Artificial intelligence (AI) systems utilizing deep neural networks (DNNs) and machine learning (ML) algorithms are widely used for solving important problems in bioinformatics, biomedical informatics, and precision medicine. However, complex DNNs or ML models, which are often perceived as opaque and black-box, can make it difficult to understand the reasoning behind their decisions. This lack of transparency can be a challenge for both end-users and decision-makers, as well as AI developers. Additionally, in sensitive areas like healthcare, explainability and accountability are not only desirable but also legally required for AI systems that can have a significant impact on human lives. Fairness is another growing concern, as algorithmic decisions should not show bias or discrimination towards certain groups or individuals based on sensitive attributes. Explainable artificial intelligence (XAI) aims to overcome the opaqueness of black-box models and provide transparency in how AI systems make decisions. Interpretable ML models can explain how they make predictions and the factors that influence their outcomes. However, most state-of-the-art interpretable ML methods are domain-agnostic and evolved from fields like computer vision, automated reasoning, or statistics, making direct application to bioinformatics problems challenging without customization and domain-specific adaptation. In this paper, we discuss the importance of explainability in the context of bioinformatics, provide an overview of model-specific and model-agnostic interpretable ML methods and tools, and outline their potential caveats and drawbacks. In addition, we discuss how to customize existing interpretable ML methods for bioinformatics problems. Finally, we demonstrate how XAI methods can improve transparency through case studies in bioimaging, cancer genomics, and text mining.

Submitted 23 February, 2023; v1 submitted 25 December, 2022; originally announced December 2022.

arXiv:2212.02098 [pdf, other]

doi 10.1609/aaai.v37i1.25075

A Machine with Short-Term, Episodic, and Semantic Memory Systems

Authors: Taewoon Kim, Michael Cochez, Vincent François-Lavet, Mark Neerincx, Piek Vossen

Abstract: Inspired by the cognitive science theory of the explicit human memory systems, we have modeled an agent with short-term, episodic, and semantic memory systems, each of which is modeled with a knowledge graph. To evaluate this system and analyze the behavior of this agent, we designed and released our own reinforcement learning agent environment, "the Room", where an agent has to learn how to encode, store, and retrieve memories to maximize its return by answering questions. We show that our deep Q-learning based agent successfully learns whether a short-term memory should be forgotten, or rather be stored in the episodic or semantic memory systems. Our experiments indicate that an agent with human-like memory systems can outperform an agent without this memory structure in the environment.

Submitted 8 July, 2023; v1 submitted 5 December, 2022; originally announced December 2022.

Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence (2023), 37(1), 48-56

arXiv:2208.11057 [pdf, other]

Prompting as Probing: Using Language Models for Knowledge Base Construction

Authors: Dimitrios Alivanistos, Selene Báez Santamaría, Michael Cochez, Jan-Christoph Kalo, Emile van Krieken, Thiviyan Thanapalasingam

Abstract: Language Models (LMs) have proven to be useful in various downstream applications, such as summarisation, translation, question answering and text classification. LMs are becoming increasingly important tools in Artificial Intelligence, because of the vast quantity of information they can store. In this work, we present ProP (Prompting as Probing), which utilizes GPT-3, a large Language Model originally proposed by OpenAI in 2020, to perform the task of Knowledge Base Construction (KBC). ProP implements a multi-step approach that combines a variety of prompting techniques to achieve this. Our results show that manual prompt curation is essential, that the LM must be encouraged to give answer sets of variable lengths, in particular including empty answer sets, that true/false questions are a useful device to increase precision on suggestions generated by the LM, that the size of the LM is a crucial factor, and that a dictionary of entity aliases improves the LM score. Our evaluation study indicates that these proposed techniques can substantially enhance the quality of the final predictions: ProP won track 2 of the LM-KBC competition, outperforming the baseline by 36.4 percentage points. Our implementation is available at https://github.com/HEmile/iswc-challenge.

Submitted 19 June, 2023; v1 submitted 23 August, 2022; originally announced August 2022.

Comments: Published in LM-KBC 22: Knowledge Base Construction from Pre-trained Language Models, Challenge at ISWC 2022. 12+12 pages

arXiv:2206.01818 [pdf, other]

QAGCN: Answering Multi-Relation Questions via Single-Step Implicit Reasoning over Knowledge Graphs

Authors: Ruijie Wang, Luca Rossetto, Michael Cochez, Abraham Bernstein

Abstract: Multi-relation question answering (QA) is a challenging task, where given questions usually require long reasoning chains in KGs that consist of multiple relations. Recently, methods with explicit multi-step reasoning over KGs have been prominently used in this task and have demonstrated promising performance. Examples include methods that perform stepwise label propagation through KG triples and methods that navigate over KG triples based on reinforcement learning. A main weakness of these methods is that their reasoning mechanisms are usually complex and difficult to implement or train. In this paper, we argue that multi-relation QA can be achieved via end-to-end single-step implicit reasoning, which is simpler, more efficient, and easier to adopt. We propose QAGCN -- a Question-Aware Graph Convolutional Network (GCN)-based method that includes a novel GCN architecture with controlled question-dependent message propagation for the implicit reasoning. Extensive experiments have been conducted, where QAGCN achieved competitive and even superior performance compared to state-of-the-art explicit-reasoning methods. Our code and pre-trained models are available in the repository: https://github.com/ruijie-wang-uzh/QAGCN

Submitted 28 March, 2024; v1 submitted 3 June, 2022; originally announced June 2022.

arXiv:2204.01611 [pdf, other]

A Machine With Human-Like Memory Systems

Authors: Taewoon Kim, Michael Cochez, Vincent Francois-Lavet, Mark Neerincx, Piek Vossen

Abstract: Inspired by cognitive science theory, we explicitly model an agent with both semantic and episodic memory systems, and show that it is better than having just one of the two memory systems. In order to show this, we have designed and released our own challenging environment, "the Room", compatible with OpenAI Gym, where an agent has to properly learn how to encode, store, and retrieve memories to maximize its rewards. The Room environment allows for a hybrid intelligence setup where machines and humans can collaborate. We show that two agents collaborating with each other results in better performance than one agent acting alone. We have open-sourced our code and models at https://github.com/tae898/explicit-memory.

Submitted 4 April, 2022; originally announced April 2022.

Comments: Submitted to Human-Centered Design of Symbiotic Hybrid Intelligence 2022 (https://ii.tudelft.nl/humancenteredsymbioticHI/)

arXiv:2203.02622 [pdf, other]

doi 10.1145/3487553.3524719

Scaling R-GCN Training with Graph Summarization

Authors: Alessandro Generale, Till Blume, Michael Cochez

Abstract: Training of Relational Graph Convolutional Networks (R-GCN) is a memory-intensive task. The amount of gradient information that needs to be stored during training for real-world graphs is often too large for the amount of memory available on most GPUs. In this work, we experiment with the use of graph summarization techniques to compress the graph and hence reduce the amount of memory needed. After training the R-GCN on the graph summary, we transfer the weights back to the original graph and attempt to perform inference on it. We obtain reasonable results on the AIFB, MUTAG and AM datasets. Our experiments show that training on the graph summary can yield an accuracy comparable to or higher than training on the original graphs. Furthermore, if we take the time to compute the summary out of the equation, we observe that the smaller graph representations obtained with graph summarization methods reduce the computational overhead. However, further experiments are needed to evaluate additional graph summary models and whether our findings also hold true for very large graphs.

Submitted 20 March, 2022; v1 submitted 4 March, 2022; originally announced March 2022.

Comments: Companion Proceedings of the Web Conference 2022, April 25-29, 2022

arXiv:2109.10896 [pdf, other]

Updating Embeddings for Dynamic Knowledge Graphs

Authors: Christopher Wewer, Florian Lemmerich, Michael Cochez

Abstract: Data in Knowledge Graphs often represents part of the current state of the real world. Thus, to stay up to date, the graph data needs to be updated frequently. To utilize information from Knowledge Graphs, many state-of-the-art machine learning approaches use embedding techniques. These techniques typically compute an embedding, i.e., vector representations of the nodes as input for the main machine learning algorithm. If a graph update occurs later on -- specifically when nodes are added or removed -- the training has to be done all over again. This is undesirable, because of the time it takes and also because downstream models which were trained with these embeddings have to be retrained if they change significantly. In this paper, we investigate embedding updates that do not require full retraining and evaluate them in combination with various embedding models on real dynamic Knowledge Graphs covering multiple use cases. We study approaches that place newly appearing nodes optimally according to local information, but notice that this does not work well. However, we find that if we continue the training of the old embedding, interleaved with epochs during which we only optimize for the added and removed parts, we obtain good results in terms of typical metrics used in link prediction. This performance is obtained much faster than with a complete retraining and hence makes it possible to maintain embeddings for dynamic Knowledge Graphs.

Submitted 22 September, 2021; originally announced September 2021.

arXiv:2106.08166 [pdf, other]

Query Embedding on Hyper-relational Knowledge Graphs

Authors: Dimitrios Alivanistos, Max Berrendorf, Michael Cochez, Mikhail Galkin

Abstract: Multi-hop logical reasoning is an established problem in the field of representation learning on knowledge graphs (KGs). It subsumes both one-hop link prediction as well as other more complex types of logical queries. Existing algorithms operate only on classical, triple-based graphs, whereas modern KGs often employ a hyper-relational modeling paradigm. In this paradigm, typed edges may have several key-value pairs known as qualifiers that provide fine-grained context for facts. In queries, this context modifies the meaning of relations, and usually reduces the answer set. Hyper-relational queries are often observed in real-world KG applications, and existing approaches for approximate query answering cannot make use of qualifier pairs. In this work, we bridge this gap and extend the multi-hop reasoning problem to hyper-relational KGs, making it possible to tackle this new type of complex query. Building upon recent advancements in Graph Neural Networks and query embedding techniques, we study how to embed and answer hyper-relational conjunctive queries. Besides that, we propose a method to answer such queries and demonstrate in our experiments that qualifiers improve query answering on a diverse set of query patterns.

Submitted 6 September, 2022; v1 submitted 15 June, 2021; originally announced June 2021.

Comments: Presented at ICLR2022. https://openreview.net/forum?id=4rLw09TgRw9

arXiv:2103.00082 [pdf, other]

Secure Evaluation of Knowledge Graph Merging Gain

Authors: Leandro Eichenberger, Michael Cochez, Benjamin Heitmann, Stefan Decker

Abstract: Finding out the differences and commonalities between the knowledge of two parties is an important task. Such a comparison becomes necessary when one party wants to determine how much it is worth to acquire the knowledge of the second party, or similarly when two parties try to determine whether a collaboration could be beneficial. When these two parties cannot trust each other (for example, due to them being competitors), performing such a comparison is challenging as neither of them would be willing to share any of their assets. This paper addresses this problem for knowledge graphs, without a need for non-disclosure agreements or a third party during the protocol. During the protocol, the intersection between the two knowledge graphs is determined in a privacy-preserving fashion. This is followed by the computation of various metrics, which give an indication of the potential gain from obtaining the other party's knowledge graph, while still keeping the actual knowledge graph contents secret. The protocol makes use of blind signatures and (counting) Bloom filters to reduce the amount of leaked information. Finally, the party who wants to obtain the other's knowledge graph can get a part of it in a way that neither party is able to know beforehand which parts of the graph are obtained (i.e., they cannot choose to only get or share the good parts). After inspection of the quality of this part, the Buyer can decide to proceed with the transaction.
The analysis of the protocol indicates that the developed protocol is secure against malicious participants. Further experimental analysis shows that the resource consumption scales linearly with the number of statements in the knowledge graph.

Submitted 26 February, 2021; originally announced March 2021.

arXiv:2102.11389 [pdf, other]

Approximate Knowledge Graph Query Answering: From Ranking to Binary Classification

Authors: Ruud van Bakel, Teodor Aleksiev, Daniel Daza, Dimitrios Alivanistos, Michael Cochez

Abstract: Large, heterogeneous datasets are characterized by missing or even erroneous information. This is more evident when they are the product of community effort or automatic fact extraction methods from external sources, such as text. A special case of the aforementioned phenomenon can be seen in knowledge graphs, where this mostly appears in the form of missing or incorrect edges and nodes. Structured querying on such incomplete graphs will result in incomplete sets of answers, even if the correct entities exist in the graph, since one or more edges needed to match the pattern are missing. To overcome this problem, several algorithms for approximate structured query answering have been proposed. Inspired by modern Information Retrieval metrics, these algorithms produce a ranking of all entities in the graph, and their performance is further evaluated based on how high in this ranking the correct answers appear. In this work we take a critical look at this way of evaluation. We argue that performing a ranking-based evaluation is not sufficient to assess methods for complex query answering. To solve this, we introduce Message Passing Query Boxes (MPQB), which takes binary classification metrics back into use and shows the effect this has on the recently proposed query embedding method MPQE.

Submitted 22 February, 2021; originally announced February 2021.

Comments: To be published in Lecture Notes in Artificial Intelligence (Springer)

arXiv:2011.03459 [pdf, other]

Complex Query Answering with Neural Link Predictors

Authors: Erik Arakelyan, Daniel Daza, Pasquale Minervini, Michael Cochez

Abstract: Neural link predictors are immensely useful for identifying missing edges in large scale Knowledge Graphs. However, it is still not clear how to use these models for answering more complex queries that arise in a number of domains, such as queries using logical conjunctions ($\land$), disjunctions ($\lor$) and existential quantifiers ($\exists$), while accounting for missing edges. In this work, we propose a framework for efficiently answering complex queries on incomplete Knowledge Graphs. We translate each query into an end-to-end differentiable objective, where the truth value of each atom is computed by a pre-trained neural link predictor. We then analyse two solutions to the optimisation problem, including gradient-based and combinatorial search. In our experiments, the proposed approach produces more accurate results than state-of-the-art methods -- black-box neural models trained on millions of generated queries -- without the need of training on a large and diverse set of complex queries. Using orders of magnitude less training data, we obtain relative improvements ranging from 8% up to 40% in Hits@3 across different knowledge graphs containing factual information. Finally, we demonstrate that it is possible to explain the outcome of our model in terms of the intermediate solutions identified for each of the complex query atoms. All our source code and datasets are available online, at https://github.com/uclnlp/cqd.

Submitted 18 March, 2021; v1 submitted 6 November, 2020; originally announced November 2020.

Comments: Proceedings of the Ninth International Conference on Learning Representations (ICLR 2021, oral presentation)

arXiv:2010.03496 [pdf, other]

doi 10.1145/3442381.3450141

Inductive Entity Representations from Text via Link Prediction

Authors: Daniel Daza, Michael Cochez, Paul Groth

Abstract: Knowledge Graphs (KG) are of vital importance for multiple applications on the web, including information retrieval, recommender systems, and metadata annotation. Regardless of whether they are built manually by domain experts or with automatic pipelines, KGs are often incomplete. Recent work has begun to explore the use of textual descriptions available in knowledge graphs to learn vector representations of entities in order to perform link prediction. However, the extent to which these representations learned for link prediction generalize to other tasks is unclear. This is important given the cost of learning such representations. Ideally, we would prefer representations that do not need to be trained again when transferring to a different task, while retaining reasonable performance. In this work, we propose a holistic evaluation protocol for entity representations learned via a link prediction objective. We consider the inductive link prediction and entity classification tasks, which involve entities not seen during training. We also consider an information retrieval task for entity-oriented search. We evaluate an architecture based on a pretrained language model that exhibits strong generalization to entities not observed during training, and outperforms related state-of-the-art methods (22% MRR improvement in link prediction on average). We further provide evidence that the learned representations transfer well to other tasks without fine-tuning.
In the entity classification task we obtain an average improvement of 16% in accuracy compared with baselines that also employ pre-trained models. In the information retrieval task, we obtain significant improvements of up to 8.8% in NDCG@10 for natural language queries. We thus show that the learned representations are not limited to KG-specific tasks, and have greater generalization properties than evaluated in previous work.

Submitted 14 April, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

arXiv:2004.07807 [pdf, other]

Classification Benchmarks for Under-resourced Bengali Language based on Multichannel Convolutional-LSTM Network

Authors: Md. Rezaul Karim, Bharathi Raja Chakravarthi, John P. McCrae, Michael Cochez

Abstract: The exponential growth of social media and micro-blogging sites not only provides platforms for empowering freedom of expression and individual voices but also enables people to express anti-social behaviour like online harassment, cyberbullying, and hate speech. Numerous works have been proposed to utilize these data for social and anti-social behaviour analysis, document characterization, and sentiment analysis by predicting the contexts, mostly for highly resourced languages such as English. However, there are languages that are under-resourced, e.g., South Asian languages like Bengali, Tamil, Assamese, and Telugu, that lack computational resources for NLP tasks. In this paper, we provide several classification benchmarks for Bengali, an under-resourced language. We prepared three datasets of expressing hate, commonly used topics, and opinions for hate speech detection, document classification, and sentiment analysis, respectively. We built the largest Bengali word embedding models to date based on 250 million articles, which we call BengFastText. We perform three different experiments, covering document classification, sentiment analysis, and hate speech detection. We incorporate word embeddings into a Multichannel Convolutional-LSTM (MConv-LSTM) network for predicting different types of hate speech, document classification, and sentiment analysis. Experiments demonstrate that BengFastText can capture the semantics of words from respective contexts correctly.
Evaluations against several baseline embedding models, e.g., Word2Vec and GloVe, yield up to 92.30%, 82.25%, and 90.45% F1-scores in the case of document classification, sentiment analysis, and hate speech detection, respectively, during 5-fold cross-validation tests.

Submitted 19 April, 2020; v1 submitted 11 April, 2020; originally announced April 2020.

Comments: This paper is under review in the Journal of Natural Language Engineering

arXiv:2004.04582 [pdf, other]

DeepCOVIDExplainer: Explainable COVID-19 Diagnosis Based on Chest X-ray Images

Authors: Md. Rezaul Karim, Till Döhmen, Dietrich Rebholz-Schuhmann, Stefan Decker, Michael Cochez, Oya Beyan

Abstract: Amid the coronavirus disease (COVID-19) pandemic, humanity experiences a rapid increase in infection numbers across the world. A challenge hospitals are faced with, in the fight against the virus, is the effective screening of incoming patients. One methodology is the assessment of chest radiography (CXR) images, which usually requires an expert radiologist's knowledge. In this paper, we propose an explainable deep neural network (DNN)-based method for automatic detection of COVID-19 symptoms from CXR images, which we call DeepCOVIDExplainer. We used 15,959 CXR images of 15,854 patients, covering normal, pneumonia, and COVID-19 cases. CXR images are first comprehensively preprocessed, before being augmented and classified with a neural ensemble method, followed by highlighting class-discriminating regions using gradient-guided class activation maps (Grad-CAM++) and layer-wise relevance propagation (LRP). Further, we provide human-interpretable explanations of the predictions. Evaluation results based on hold-out data show that our approach can identify COVID-19 confidently with a positive predictive value (PPV) of 91.6%, 92.45%, and 96.12%; precision, recall, and F1 score of 94.6%, 94.3%, and 94.6%, respectively for normal, pneumonia, and COVID-19 cases, respectively, which is comparable to or better than recent approaches. We hope that our findings will be a useful contribution to the fight against COVID-19 and, more generally, towards an increasing acceptance and adoption of AI-assisted applications in clinical practice.

Submitted 6 June, 2020; v1 submitted 9 April, 2020; originally announced April 2020.

arXiv:2003.02320 [pdf, other]

doi 10.1145/3447772

Knowledge Graphs

Authors: Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d'Amato, Gerard de Melo, Claudio Gutierrez, José Emilio Labra Gayo, Sabrina Kirrane, Sebastian Neumaier, Axel Polleres, Roberto Navigli, Axel-Cyrille Ngonga Ngomo, Sabbir M. Rashid, Anisa Rula, Lukas Schmelzeisen, Juan Sequeda, Steffen Staab, Antoine Zimmermann

Abstract: In this paper we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After some opening remarks, we motivate and contrast various graph-based data models and query languages that are used for knowledge graphs. We discuss the roles of schema, identity, and context in knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We summarise methods for the creation, enrichment, quality assessment, refinement, and publication of knowledge graphs. We provide an overview of prominent open knowledge graphs and enterprise knowledge graphs, their applications, and how they use the aforementioned techniques. We conclude with high-level future research directions for knowledge graphs.

Submitted 11 September, 2021; v1 submitted 4 March, 2020; originally announced March 2020.

Comments: Revision from v5: Correcting errata from previous version for entailment/models, and some other minor typos

Journal ref: ACM Comput. Surv. 54(4): 71:1-71:37 (2021)

arXiv:2002.02406 [pdf, other]

Message Passing Query Embedding

Authors: Daniel Daza, Michael Cochez

Abstract: Recent works on representation learning for Knowledge Graphs have moved beyond the problem of link prediction, to answering queries of an arbitrary structure. Existing methods are based on ad-hoc mechanisms that require training with a diverse set of query structures. We propose a more general architecture that employs a graph neural network to encode a graph representation of the query, where nodes correspond to entities and variables. The generality of our method allows it to encode a more diverse set of query types in comparison to previous work. Our method shows competitive performance against previous models for complex queries, and in contrast with these models, it can answer complex queries when trained for link prediction only. We show that the model learns entity embeddings that capture the notion of entity type without explicit supervision.

Submitted 24 June, 2020; v1 submitted 6 February, 2020; originally announced February 2020.

Comments: Presented at ICML 2020 - GRL+ Workshop

arXiv:1912.10979 [pdf, other]

Privacy Attacks on Network Embeddings

Authors: Michael Ellers, Michael Cochez, Tobias Schumacher, Markus Strohmaier, Florian Lemmerich

Abstract: Data ownership and data protection are increasingly important topics with ethical and legal implications, e.g., with the right to erasure established in the European General Data Protection Regulation (GDPR). In this light, we investigate network embeddings, i.e., the representation of network nodes as low-dimensional vectors. We consider a typical social network scenario with nodes representing users and edges relationships between them. We assume that a network embedding of the nodes has been trained. After that, a user demands the removal of his data, requiring the full deletion of the corresponding network information, in particular the corresponding node and incident edges. In that setting, we analyze whether after the removal of the node from the network and the deletion of the vector representation of the respective node in the embedding significant information about the link structure of the removed node is still encoded in the embedding vectors of the remaining nodes. This would require a (potentially computationally expensive) retraining of the embedding. For that purpose, we deploy an attack that leverages information from the remaining network and embedding to recover information about the neighbors of the removed node. The attack is based on (i) measuring distance changes in network embeddings and (ii) a machine learning classifier that is trained on networks that are constructed by removing additional nodes.
Our experiments demonstrate that substantial information about the edges of a removed node/user can be retrieved across many different datasets. This implies that to fully protect the privacy of users, node deletion requires complete retraining - or at least a significant modification - of original network embeddings. Our results suggest that deleting the corresponding vector representation from network embeddings alone is not sufficient from a privacy perspective.

Submitted 23 December, 2019; originally announced December 2019.

arXiv:1909.04169 [pdf, other]

OncoNetExplainer: Explainable Predictions of Cancer Types Based on Gene Expression Data

Authors: Md. Rezaul Karim, Michael Cochez, Oya Beyan, Stefan Decker, Christoph Lange

Abstract: The discovery of important biomarkers is a significant step towards understanding the molecular mechanisms of carcinogenesis, enabling accurate diagnosis for, and prognosis of, a certain cancer type. Before recommending any diagnosis, genomics data such as gene expressions (GE) and clinical outcomes need to be analyzed. However, the complex nature, high dimensionality, and heterogeneity of genomics data make the overall analysis challenging. Convolutional neural networks (CNN) have shown tremendous success in solving such problems. However, neural network models are perceived mostly as 'black box' methods because their internal functioning is not well understood. Yet interpretability is important to provide insights on why a given cancer case has a certain type. Besides, finding the most important biomarkers can help in recommending more accurate treatments and drug repositioning. In this paper, we propose a new approach called OncoNetExplainer to make explainable predictions of cancer types based on GE data. We used genomics data about 9,074 cancer patients covering 33 different cancer types from the Pan-Cancer Atlas on which we trained CNN and VGG16 networks using guided-gradient class activation maps++ (GradCAM++). Further, we generate class-specific heat maps to identify significant biomarkers and compute feature importance in terms of mean absolute impact to rank top genes across all the cancer types.
Quantitative and qualitative analyses show that both models exhibit high confidence at predicting the cancer types correctly, giving an average precision of 96.25%. To provide comparisons with the baselines, we identified top genes and cancer-specific driver genes using gradient boosted trees and SHapley Additive exPlanations (SHAP). Finally, our findings were validated with the annotations provided by the TumorPortal.

Submitted 9 September, 2019; originally announced September 2019.

Comments: In proc. of 19th IEEE International Conference on Bioinformatics and Bioengineering(IEEE BIBE 2019)

Journal ref: IEEE International Conference on Bioinformatics and Bioengineering(IEEE BIBE 2019)

arXiv:1909.02930 [pdf, other]

doi 10.1007/s10115-019-01401-x

Structured Query Construction via Knowledge Graph Embedding

Authors: Ruijie Wang, Meng Wang, Jun Liu, Michael Cochez, Stefan Decker

Abstract: In order to facilitate access to knowledge graphs for general users, an increasing effort is being exerted to construct graph-structured queries of given natural language questions. At the core of the construction is to deduce the structure of the target query and determine the vertices/edges which constitute the query. Existing query construction methods rely on question understanding and conventional graph-based algorithms, which leads to inefficiency and degraded performance when facing complex natural language questions over large-scale knowledge graphs. In this paper, we focus on this problem and propose a novel framework standing on recent knowledge graph embedding techniques. Our framework first encodes the underlying knowledge graph into a low-dimensional embedding space by leveraging generalized local knowledge graphs. Given a natural language question, the learned embedding representations of the knowledge graph are utilized to compute the query structure and assemble vertices/edges into the target query. Extensive experiments were conducted on the benchmark dataset, and the results demonstrate that our framework outperforms state-of-the-art baseline models regarding effectiveness and efficiency.

Submitted 6 September, 2019; originally announced September 2019.

Journal ref: Knowledge and Information Systems 62 (2020): 1819-1846

arXiv:1908.06917 [pdf, other]

Message Passing for Complex Question Answering over Knowledge Graphs

Authors: Svitlana Vakulenko, Javier David Fernandez Garcia, Axel Polleres, Maarten de Rijke, Michael Cochez

Abstract: Question answering over knowledge graphs (KGQA) has evolved from simple single-fact questions to complex questions that require graph traversal and aggregation. We propose a novel approach for complex KGQA that uses unsupervised message passing, which propagates confidence scores obtained by parsing an input question and matching terms in the knowledge graph to a set of possible answers. First, we identify entity, relationship, and class names mentioned in a natural language question, and map these to their counterparts in the graph. Then, the confidence scores of these mappings propagate through the graph structure to locate the answer entities. Finally, these are aggregated depending on the identified question type. This approach can be efficiently implemented as a series of sparse matrix multiplications mimicking joins over small local subgraphs. Our evaluation results show that the proposed approach outperforms the state-of-the-art on the LC-QuAD benchmark. Moreover, we show that the performance of the approach depends only on the quality of the question interpretation results, i.e., given a correct relevance score distribution, our approach always produces a correct answer ranking. Our error analysis reveals correct answers missing from the benchmark dataset and inconsistencies in the DBpedia knowledge graph. Finally, we provide a comprehensive evaluation of the proposed approach accompanied with an ablation study and an error analysis, which showcase the pitfalls for each of the question answering components in more detail.

Submitted 19 August, 2019; originally announced August 2019.

Comments: Accepted in CIKM 2019

arXiv:1908.03629 [pdf, other]

Transferring knowledge from monitored to unmonitored areas for forecasting parking spaces

Authors: Andrei Ionita, André Pomp, Michael Cochez, Tobias Meisen, Stefan Decker

Abstract: Smart cities around the world have begun monitoring parking areas in order to estimate available parking spots and help drivers looking for parking. The current results are promising, indeed. However, existing approaches are limited by the high cost of sensors that need to be installed throughout the city in order to achieve an accurate estimation. This work investigates the extension of estimating parking information from areas equipped with sensors to areas where they are missing. To this end, the similarity between city neighborhoods is determined based on background data, i.e., from geographic information systems. Using the derived similarity values, we analyze the adaptation of occupancy rates from monitored to unmonitored parking areas.

Submitted 7 August, 2019; originally announced August 2019.

Comments: Preprint of an article to be published in Int J. on Artificial Intelligence Tools (IJAIT)

arXiv:1908.01288 [pdf, other]

Drug-Drug Interaction Prediction Based on Knowledge Graph Embeddings and Convolutional-LSTM Network

Authors: Md. Rezaul Karim, Michael Cochez, Joao Bosco Jares, Mamtaz Uddin, Oya Beyan, Stefan Decker

Abstract: Interference between pharmacological substances can cause serious medical injuries. Correctly predicting so-called drug-drug interactions (DDI) not only reduces these cases but can also result in a reduction of drug development cost. Presently, most drug-related knowledge is the result of clinical evaluations and post-marketing surveillance, resulting in a limited amount of information. Existing data-driven prediction approaches for DDIs typically rely on a single source of information, while using information from multiple sources would help improve predictions. Machine learning (ML) techniques are used, but the techniques are often unable to deal with skewness in the data. Hence, we propose a new ML approach for predicting DDIs based on multiple data sources. For this task, we use 12,000 drug features from DrugBank, PharmGKB, and KEGG drugs, which are integrated using Knowledge Graphs (KGs). To train our prediction model, we first embed the nodes in the graph using various embedding approaches. We found that the best performing combination was a ComplEx embedding method created using PyTorch-BigGraph (PBG) with a Convolutional-LSTM network and classic machine learning-based prediction models. The model averaging ensemble method of three best classifiers yields up to 0.94, 0.92, 0.80 for AUPR, F1-score, and MCC, respectively, during 5-fold cross-validation tests.

Submitted 4 August, 2019; originally announced August 2019.

arXiv:1903.12554 [pdf, other]

Linked Open Data Validity -- A Technical Report from ISWS 2018

Authors: Tayeb Abderrahmani Ghor, Esha Agrawal, Mehwish Alam, Omar Alqawasmeh, Claudia D'amato, Amina Annane, Amr Azzam, Andrew Berezovskyi, Russa Biswas, Mathias Bonduel, Quentin Brabant, Cristina-iulia Bucur, Elena Camossi, Valentina Anita Carriero, Shruthi Chari, David Chaves Fraga, Fiorela Ciroku, Michael Cochez, Hubert Curien, Vincenzo Cutrona, Rahma Dandan, Danilo Dess, Valerio Di Carlo, Ahmed El Amine Djebri, Marieke Van Erp, et al. (46 additional authors not shown)

Abstract: Linked Open Data (LOD) is the publicly available RDF data on the Web. Each LOD entity is identified by a URI and accessible via HTTP. LOD encodes global-scale knowledge potentially available to any human as well as artificial intelligence that may want to benefit from it as background knowledge for supporting their tasks. LOD has emerged as the backbone of applications in diverse fields such as Natural Language Processing, Information Retrieval, Computer Vision, Speech Recognition, and many more. Nevertheless, regardless of the specific tasks that LOD-based tools aim to address, the reuse of such knowledge may be challenging for diverse reasons, e.g. semantic heterogeneity, provenance, and data quality. As aptly stated by Heath et al., "Linked Data might be outdated, imprecise, or simply wrong": there arises a necessity to investigate the problem of linked data validity. This work reports a collaborative effort performed by nine teams of students, guided by an equal number of senior researchers, attending the International Semantic Web Research School (ISWS 2018), towards addressing such investigation from different perspectives coupled with different approaches to tackle the issue.

Submitted 26 March, 2019; originally announced March 2019.

arXiv:1808.00434 [pdf, ps, other]

Leveraging Knowledge Graph Embedding Techniques for Industry 4.0 Use Cases

Authors: Martina Garofalo, Maria Angela Pellegrino, Abdulrahman Altabba, Michael Cochez

Abstract: Industry is evolving towards Industry 4.0, which holds the promise of increased flexibility in manufacturing, better quality and improved productivity. A core factor in this growth is the use of sensors, which must capture data that can be used in unforeseen ways to achieve a performance not achievable without them. However, the complexity of this improved setting is much greater than what is currently used in practice. Hence, it is imperative that management is not performed by human labor alone, but that part of it is done by automated algorithms. A natural way to represent the data generated by this large number of sensors, which are not measuring independent variables, and the interaction of the different devices is to use a graph data model. Then, machine learning could be used to aid the Industry 4.0 system to, for example, perform predictive maintenance. However, machine learning directly on graphs needs feature engineering and has scalability issues. In this paper we discuss methods to convert (embed) the graph in a vector space, such that it becomes feasible to use traditional machine learning methods for Industry 4.0 settings.

Submitted 31 July, 2018; originally announced August 2018.

Comments: Accepted for publication in NATO Science Series. arXiv admin note: text overlap with arXiv:1709.07604 by other authors

arXiv:1807.11761 [pdf, other]

A First Experiment on Including Text Literals in KGloVe

Authors: Michael Cochez, Martina Garofalo, Jérôme Lenßen, Maria Angela Pellegrino

Abstract: Graph embedding models produce embedding vectors for entities and relations in Knowledge Graphs, often without taking literal properties into account. We show an initial idea based on the combination of global graph structure with additional information provided by textual information in properties. Our initial experiment shows that this approach might be useful, but does not clearly outperform earlier approaches when evaluated on machine learning tasks.

Submitted 31 July, 2018; originally announced July 2018.

Comments: Presented at the 4th Workshop on Semantic Deep Learning (SemDeep-4)

arXiv:1806.06411 [pdf, other]

Measuring Semantic Coherence of a Conversation

Authors: Svitlana Vakulenko, Maarten de Rijke, Michael Cochez, Vadim Savenkov, Axel Polleres

Abstract: Conversational systems have become increasingly popular as a way for humans to interact with computers. To be able to provide intelligent responses, conversational systems must correctly model the structure and semantics of a conversation. We introduce the task of measuring semantic (in)coherence in a conversation with respect to background knowledge, which relies on the identification of semantic relations between concepts introduced during a conversation. We propose and evaluate graph-based and machine learning-based approaches for measuring semantic coherence using knowledge graphs, their vector space embeddings and word embedding models, as sources of background knowledge. We demonstrate how these approaches are able to uncover different coherence patterns in conversations on the Ubuntu Dialogue Corpus.

Submitted 17 June, 2018; originally announced June 2018.

arXiv:1805.12218 [pdf, other]

Convolutional Embedded Networks for Population Scale Clustering and Bio-ancestry Inferencing

Authors: Md. Rezaul Karim, Michael Cochez, Achille Zappa, Ratnesh Sahay, Oya Beyan, Dietrich-Rebholz Schuhmann, Stefan Decker

Abstract: The study of genetic variants (GVs) can help find correlating population groups to identify cohorts that are predisposed to common diseases and explain differences in disease susceptibility and how patients react to drugs. Machine learning algorithms are increasingly being applied to identify interacting GVs to understand their complex phenotypic traits. Since the performance of a learning algorithm depends not only on the size and nature of the data but also on the quality of the underlying representation, deep neural networks can learn non-linear mappings that allow transforming GV data into representations that are friendlier to clustering and classification than manual feature selection. In this paper, we proposed convolutional embedded networks in which we combine two DNN architectures called convolutional embedded clustering (CEC) and convolutional autoencoder (CAE) classifier for clustering individuals and predicting geographic ethnicity based on GVs, respectively. We employed CAE-based representation learning on 95 million GVs from the 1000 genomes and Simons genome diversity projects. Quantitative and qualitative analyses with a focus on accuracy and scalability show that our approach outperforms state-of-the-art approaches such as VariantSpark and ADMIXTURE. In particular, CEC can cluster targeted population groups in 22 hours with an adjusted Rand index of 0.915, a normalized mutual information of 0.92, and a clustering accuracy of 89%. The CAE classifier, in turn, can predict the geographic ethnicity of unknown samples with an F1 and Matthews correlation coefficient (MCC) score of 0.9004 and 0.8245, respectively. To provide interpretations of the predictions, we identify significant biomarkers using gradient boosted trees (GBT) and SHAP. Overall, our approach is transparent and faster than the baseline methods, and scalable for 5% to 100% of the full human genome.

Submitted 19 April, 2020; v1 submitted 30 May, 2018; originally announced May 2018.

Comments: This article is under review in IEEE/ACM Transactions on Computational Biology and Bioinformatics. It is based on a workshop paper discussed at the Extended Semantic Web Conference (ESWC'2017) workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics (SeWeBMeDA), Slovenia, May, 28-29, 2017

arXiv:1802.06609 [pdf, other]

On the computation of Shannon Entropy from Counting Bloom Filters

Authors: Michael Cochez

Abstract: In this short note a method for computing the naive plugin estimator of discrete entropy from a counting Bloom filter will be presented. The method does work reasonably as long as the collision probability in the bloom filter is kept low.

Submitted 19 February, 2018; originally announced February 2018.
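
To illustrate the idea in the entry above, here is a minimal sketch of a plug-in entropy estimate read off the counters of a counting Bloom filter. It is not taken from the note itself: the class `CountingBloomFilter`, the function `plugin_entropy_from_counters`, and the salted SHA-256 hashing scheme are all assumptions made for illustration. The sketch relies on the no-collision assumption: an item inserted c times then leaves k counters each holding c, so summing p*log2(p) over all non-zero counters and dividing by k recovers the plug-in estimate.

```python
import hashlib
import math


class CountingBloomFilter:
    """Hypothetical minimal counting Bloom filter: m counters, k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _slots(self, item: str):
        # Derive k slot indices by salting SHA-256 with the hash index
        # (an illustrative choice, not the scheme used in the note).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for slot in self._slots(item):
            self.counters[slot] += 1


def plugin_entropy_from_counters(counters, k: int) -> float:
    """Plug-in entropy estimate in bits, assuming no hash collisions."""
    total = sum(counters) / k  # every insertion increments k counters
    if total == 0:
        return 0.0
    acc = 0.0
    for c in counters:
        if c > 0:
            p = c / total
            acc += p * math.log2(p)
    # Each distinct item is counted k times, so divide by k.
    return -acc / k
```

For example, with k = 2 the counter array `[2, 2, 0, 1, 1, 0, 1, 1]` corresponds to three items with counts 2, 1 and 1, and the function returns 1.5 bits. When collisions do occur, counters merge several items and the estimate becomes biased, which is consistent with the abstract's remark that the collision probability must be kept low.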

arXiv:1607.04809 [pdf, other]

Knowledge Representation on the Web revisited: Tools for Prototype Based Ontologies

Authors: Michael Cochez, Stefan Decker, Eric Prud'hommeaux

Abstract: In recent years RDF and OWL have become the most common knowledge representation languages in use on the Web, propelled by the recommendation of the W3C. In this paper we present a practical implementation of a different kind of knowledge representation based on Prototypes. In detail, we present a concrete syntax easily and effectively parsable by applications. We also present extensible implementations of a prototype knowledge base, specifically designed for storage of Prototypes. These implementations are written in Java and can be extended by using the implementation as a library. Alternatively, the software can be deployed as such. Further, results of benchmarks for both local and web deployment are presented. This paper augments a research paper, in which we describe the more theoretical aspects of our Prototype system.

Submitted 16 July, 2016; originally announced July 2016.

Comments: Related software available from https://github.com/miselico/knowledgebase/

Showing 1–37 of 37 results for author: Cochez, M