Hospitalization Length of Stay Prediction using Patient Event Sequences

Authors: Emil Riis Hansen, Thomas Dyhre Nielsen, Thomas Mulvad, Mads Nibe Strausholm, Tomer Sagi, Katja Hose

Abstract: Predicting patients' hospital length of stay (LOS) is essential for improving resource allocation and supporting decision-making in healthcare organizations. This paper proposes a novel approach for predicting LOS by modeling patient information as sequences of events. Specifically, we present a transformer-based model, termed Medic-BERT (M-BERT), for LOS prediction using the unique features describing patients' medical event sequences. We performed empirical experiments on a cohort of more than 45k emergency care patients from a large Danish hospital. Experimental results show that M-BERT can achieve high accuracy on a variety of LOS problems and outperforms traditional non-sequence-based machine learning approaches.

Submitted 20 March, 2023; originally announced March 2023.

Comments: 11 pages, 5 figures

MSC Class: 68T07 ACM Class: I.2.7; J.3

arXiv:2303.02204

KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science

Authors: Mossad Helali, Niki Monjazeb, Shubham Vashisth, Philippe Carrier, Ahmed Helal, Antonio Cavalcante, Khaled Ammar, Katja Hose, Essam Mansour

Abstract: In recent years, we have witnessed the growing interest from academia and industry in applying data science technologies to analyze large amounts of data. In this process, a myriad of artifacts (datasets, pipeline scripts, etc.) are created. However, there has been no systematic attempt to holistically collect and exploit all the knowledge and experiences that are implicitly contained in those artifacts. Instead, data scientists recover information and expertise from colleagues or learn via trial and error. Hence, this paper presents a scalable platform, KGLiDS, that employs machine learning and knowledge graph technologies to abstract and capture the semantics of data science artifacts and their connections. Based on this information, KGLiDS enables various downstream applications, such as data discovery and pipeline automation. Our comprehensive evaluation covers use cases in data discovery, data cleaning, transformation, and AutoML. It shows that KGLiDS is significantly faster with a lower memory footprint than the state-of-the-art systems while achieving comparable or better accuracy.

Submitted 12 June, 2024; v1 submitted 3 March, 2023; originally announced March 2023.

Comments: 15 pages, 9 figures

arXiv:2210.05781

Transforming RDF-star to Property Graphs: A Preliminary Analysis of Transformation Approaches -- extended version

Authors: Ghadeer Abuoda, Daniele Dell'Aglio, Arthur Keen, Katja Hose

Abstract: RDF and property graph models have many similarities, such as using basic graph concepts like nodes and edges. However, such models differ in their modeling approach, expressivity, serialization, and the nature of applications. RDF is the de-facto standard model for knowledge graphs on the Semantic Web and is supported by a rich ecosystem for inference and processing. The property graph model, in contrast, provides advantages in scalable graph analytical tasks, such as graph matching, path analysis, and graph traversal. RDF-star extends RDF and allows capturing metadata as a first-class citizen. To tap into the advantages of alternative models, the literature proposes different ways of transforming knowledge graphs between property graphs and RDF. However, most of these approaches cannot provide complete transformations for RDF-star graphs. Hence, this paper provides a step towards transforming RDF-star graphs into property graphs. In particular, we identify different cases to evaluate transformation approaches from RDF-star to property graphs. Specifically, we categorize two classes of transformation approaches and analyze them based on the test cases. The obtained insights will form the foundation for building complete transformation approaches in the future.

Submitted 11 October, 2022; originally announced October 2022.

arXiv:2209.09673

doi 10.3847/1538-4365/ac9da4

ExoClock Project III: 450 new exoplanet ephemerides from ground and space observations

Authors: A. Kokori, A. Tsiaras, B. Edwards, A. Jones, G. Pantelidou, G. Tinetti, L. Bewersdorff, A. Iliadou, Y. Jongen, G. Lekkas, A. Nastasi, E. Poultourtzidis, C. Sidiropoulos, F. Walter, A. Wünsche, R. Abraham, V. K. Agnihotri, R. Albanesi, E. Arce-Mansego, D. Arnot, M. Audejean, C. Aumasson, M. Bachschmidt, G. Baj, P. R. Barroy, et al. (192 additional authors not shown)

Abstract: The ExoClock project has been created with the aim of increasing the efficiency of the Ariel mission. It will achieve this by continuously monitoring and updating the ephemerides of Ariel candidates over an extended period, in order to produce a consistent catalogue of reliable and precise ephemerides. This work presents a homogeneous catalogue of updated ephemerides for 450 planets, generated by the integration of ~18000 data points from multiple sources. These sources include observations from ground-based telescopes (ExoClock network and ETD), mid-time values from the literature and light-curves from space telescopes (Kepler/K2 and TESS). With all the above, we manage to collect observations for half of the post-discovery years (median), with data that have a median uncertainty less than one minute. In comparison with literature, the ephemerides generated by the project are more precise and less biased. More than 40% of the initial literature ephemerides had to be updated to reach the goals of the project, as they were either of low precision or drifting. Moreover, the integrated approach of the project enables both the monitoring of the majority of the Ariel candidates (95%), and also the identification of missing data. The dedicated ExoClock network effectively supports this task by contributing additional observations when a gap in the data is identified. These results highlight the need for continuous monitoring to increase the observing coverage of the candidate planets. Finally, the extended observing coverage of planets allows us to detect trends (TTVs - Transit Timing Variations) for a sample of 19 planets. All products, data, and codes used in this work are open and accessible to the wider scientific community.

Submitted 20 September, 2022; originally announced September 2022.

Comments: Recommended for publication to ApJS (reviewer's comments implemented). Main body: 13 pages, total: 77 pages, 7 figures, 7 tables. Data available at http://doi.org/10.17605/OSF.IO/P298N

arXiv:2209.04185

Simple and Powerful Architecture for Inductive Recommendation Using Knowledge Graph Convolutions

Authors: Theis E. Jendal, Matteo Lissandrini, Peter Dolog, Katja Hose

Abstract: Using graph models with relational information in recommender systems has shown promising results. Yet, most methods are transductive, i.e., they are based on dimensionality reduction architectures. Hence, they require heavy retraining every time new items or users are added. Conversely, inductive methods promise to solve these issues. Nonetheless, all inductive methods rely only on interactions, making recommendations for users with few interactions sub-optimal and even impossible for new items. Therefore, we focus on inductive methods able to also exploit knowledge graphs (KGs). In this work, we propose SimpleRec, a strong baseline that uses a graph neural network and a KG to provide better recommendations than related inductive methods for new users and items. We show that it is unnecessary to create complex model architectures for user representations, but it is enough to allow users to be represented by the few ratings they provide and the indirect connections among them without any user metadata. As a result, we re-evaluate state-of-the-art methods, identify better evaluation protocols, highlight unwarranted conclusions from previous proposals, and showcase a novel, stronger baseline for this task.

Submitted 13 September, 2022; v1 submitted 9 September, 2022; originally announced September 2022.

arXiv:2208.14692

The Lothbrok approach for SPARQL Query Optimization over Decentralized Knowledge Graphs

Authors: Christian Aebeloe, Gabriela Montoya, Katja Hose

Abstract: While the Web of Data in principle offers access to a wide range of interlinked data, the architecture of the Semantic Web today relies mostly on the data providers to maintain access to their data through SPARQL endpoints. Several studies, however, have shown that such endpoints often experience downtime, meaning that the data they maintain becomes inaccessible. While decentralized systems based on Peer-to-Peer (P2P) technology have previously been shown to increase the availability of knowledge graphs, even when a large proportion of the nodes fail, processing queries in such a setup can be an expensive task since data necessary to answer a single query might be distributed over multiple nodes. In this paper, we therefore propose an approach to optimizing SPARQL queries over decentralized knowledge graphs, called Lothbrok. While there are potentially many aspects to consider when optimizing such queries, we focus on three aspects: cardinality estimation, locality awareness, and data fragmentation. We empirically show that Lothbrok is able to achieve significantly faster query processing performance compared to the state of the art when processing challenging queries as well as when the network is under high load.

Submitted 31 August, 2022; originally announced August 2022.

arXiv:2204.12270

Graph Neural Networks for Microbial Genome Recovery

Authors: Andre Lamurias, Alessandro Tibo, Katja Hose, Mads Albertsen, Thomas Dyhre Nielsen

Abstract: Microbes have a profound impact on our health and environment, but our understanding of the diversity and function of microbial communities is severely limited. Through DNA sequencing of microbial communities (metagenomics), DNA fragments (reads) of the individual microbes can be obtained, which through assembly graphs can be combined into long contiguous DNA sequences (contigs). Given the complexity of microbial communities, single contig microbial genomes are rarely obtained. Instead, contigs are eventually clustered into bins, with each bin ideally making up a full genome. This process is referred to as metagenomic binning. Current state-of-the-art techniques for metagenomic binning rely only on the local features for the individual contigs. These techniques therefore fail to exploit the similarities between contigs as encoded by the assembly graph, in which the contigs are organized. In this paper, we propose to use Graph Neural Networks (GNNs) to leverage the assembly graph when learning contig representations for metagenomic binning. Our method, VaeG-Bin, combines variational autoencoders for learning latent representations of the individual contigs, with GNNs for refining these representations by taking into account the neighborhood structure of the contigs in the assembly graph. We explore several types of GNNs and demonstrate that VaeG-Bin recovers more high-quality genomes than other state-of-the-art binners on both simulated and real-world datasets.

Submitted 26 April, 2022; originally announced April 2022.

arXiv:2111.13186

Federated Data Science to Break Down Silos [Vision]

Authors: Essam Mansour, Kavitha Srinivas, Katja Hose

Abstract: Similar to Open Data initiatives, data science as a community has launched initiatives for sharing not only data but entire pipelines, derivatives, artifacts, etc. (Open Data Science). However, the few efforts that exist focus on the technical part of how to facilitate sharing, conversion, etc. This vision paper goes a step further and proposes KEK, an open federated data science platform that not only allows for sharing data science pipelines and their (meta)data but also provides methods for efficient search and, in the ideal case, even allows for combining and defining pipelines across platforms in a federated manner. In doing so, KEK addresses the so far neglected challenge of actually finding artifacts that are semantically related and that can be combined to achieve a certain goal.

Submitted 25 November, 2021; originally announced November 2021.

Comments: Accepted at SIGMOD Record

arXiv:2106.04209

doi 10.1145/3340531.3412759

MindReader: Recommendation over Knowledge Graph Entities with Explicit User Ratings

Authors: Anders H. Brams, Anders L. Jakobsen, Theis E. Jendal, Matteo Lissandrini, Peter Dolog, Katja Hose

Abstract: Knowledge Graphs (KGs) have been integrated in several models of recommendation to augment the informational value of an item by means of its related entities in the graph. Yet, existing datasets only provide explicit ratings on items and no information is provided about user opinions of other (non-recommendable) entities. To overcome this limitation, we introduce a new dataset, called MindReader, providing explicit user ratings both for items and for KG entities. In this first version, the MindReader dataset provides more than 102 thousand explicit ratings collected from 1,174 real users on both items and entities from a KG in the movie domain. This dataset has been collected through an online interview application that we also release open source. As a demonstration of the importance of this new dataset, we present a comparative study of the effect of the inclusion of ratings on non-item KG entities in a variety of state-of-the-art recommendation models. In particular, we show that most models, whether designed specifically for graph data or not, see improvements in recommendation quality when trained on explicit non-item ratings. Moreover, for some models, we show that non-item ratings can effectively replace item ratings without loss of recommendation quality. This finding, thanks also to an observed greater familiarity of users towards common KG entities than towards long-tail items, motivates the use of KG entities for both warm and cold-start recommendations.

Submitted 8 June, 2021; originally announced June 2021.

arXiv:2012.06171

doi 10.1145/3434642

The Future is Big Graphs! A Community View on Graph Processing Systems

Authors: Sherif Sakr, Angela Bonifati, Hannes Voigt, Alexandru Iosup, Khaled Ammar, Renzo Angles, Walid Aref, Marcelo Arenas, Maciej Besta, Peter A. Boncz, Khuzaima Daudjee, Emanuele Della Valle, Stefania Dumbrava, Olaf Hartig, Bernhard Haslhofer, Tim Hegeman, Jan Hidders, Katja Hose, Adriana Iamnitchi, Vasiliki Kalavri, Hugo Kapp, Wim Martens, M. Tamer Özsu, Eric Peukert, Stefan Plantikow, et al. (16 additional authors not shown)

Abstract: Graphs are by nature unifying abstractions that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the next decade for big graph processing to continue to succeed?

Submitted 11 December, 2020; originally announced December 2020.

Comments: 12 pages, 3 figures, collaboration between the large-scale systems and data management communities, work started at the Dagstuhl Seminar 19491 on Big Graph Processing Systems, to be published in the Communications of the ACM

ACM Class: C.3; E.0; H.2; J.0

arXiv:2006.07180

doi 10.3233/SW-210429

High-Level ETL for Semantic Data Warehouses -- Full Version

Authors: Rudra Pratap Deb Nath, Oscar Romero, Torben Bach Pedersen, Katja Hose

Abstract: The popularity of the Semantic Web (SW) encourages organizations to organize and publish semantic data using the RDF model. This growth poses new requirements to Business Intelligence (BI) technologies to enable On-Line Analytical Processing (OLAP)-like analysis over semantic data. The incorporation of semantic data into a Data Warehouse (DW) is not supported by the traditional Extract-Transform-Load (ETL) tools because they do not consider semantic issues in the integration process. In this paper, we propose a layer-based integration process and a set of high-level RDF-based ETL constructs required to define, map, extract, process, transform, integrate, update, and load (multidimensional) semantic data. Unlike other ETL tools, we automate the ETL data flows by creating metadata at the schema level. Therefore, it relieves ETL developers from the burden of manual mapping at the ETL operation level. We create a prototype, named Semantic ETL Construct (SETLCONSTRUCT), based on the innovative ETL constructs proposed here. To evaluate SETLCONSTRUCT, we create a multidimensional semantic DW by integrating a Danish Business dataset and an EU Subsidy dataset using it and compare it with the previous programmable framework SETLPROG in terms of productivity, development time and performance. The evaluation shows that 1) SETLCONSTRUCT uses 92% fewer Number of Typed Characters (NOTC) than SETLPROG, and SETLAUTO (the extension of SETLCONSTRUCT for generating ETL execution flow automatically) further reduces the Number of Used Concepts (NOUC) by another 25%; 2) using SETLCONSTRUCT, the development time is almost cut in half compared to SETLPROG, and is cut by another 27% using SETLAUTO; 3) SETLCONSTRUCT is scalable and has similar performance compared to SETLPROG.

Submitted 12 June, 2020; originally announced June 2020.

Comments: 44 pages including reference, 13 figures and 4 tables. This paper is submitted to Semantic Web Journal and now it is under review

Journal ref: Semantic Web, vol. 13, no. 1, pp. 85-132, 2022

arXiv:2002.09172

Star Pattern Fragments: Accessing Knowledge Graphs through Star Patterns

Authors: Christian Aebeloe, Ilkcan Keles, Gabriela Montoya, Katja Hose

Abstract: The Semantic Web offers access to a vast Web of interlinked information accessible via SPARQL endpoints. While such endpoints offer a well-defined interface for efficiently retrieving results for complex SPARQL queries, complex query loads can easily overload or crash endpoints as all the computational load of answering the queries resides entirely with the server hosting the endpoint. Recently proposed interfaces, such as Triple Pattern Fragments, have therefore shifted some of the query processing load from the server to the client at the expense of increased network traffic in the case of non-selective triple patterns. This paper therefore proposes Star Pattern Fragments (SPF), an RDF interface enabling a better load balancing between server and client by decomposing SPARQL queries into star-shaped subqueries, evaluating them on the server side. Experiments using synthetic data (WatDiv), as well as real data (DBpedia), show that SPF not only significantly reduces network traffic, it is also up to two orders of magnitude faster than the state-of-the-art interfaces under high query load.

Submitted 9 November, 2021; v1 submitted 21 February, 2020; originally announced February 2020.

arXiv:2002.06608

Multidimensional Enrichment of Spatial RDF Data for SOLAP -- Full Version

Authors: Nurefsan Gür, Torben Bach Pedersen, Katja Hose, Mikael Midtgaard

Abstract: Large volumes of spatial data and multidimensional data are being published on the Semantic Web, which has led to new opportunities for advanced analysis, such as Spatial Online Analytical Processing (SOLAP). The RDF Data Cube (QB) and QB4OLAP vocabularies have been widely used for annotating and publishing statistical and multidimensional RDF data. Although such statistical data sets might have spatial information, such as coordinates, the lack of spatial semantics and spatial multidimensional concepts in QB4OLAP and QB prevents users from employing SOLAP queries over spatial data using SPARQL. The QB4SOLAP vocabulary, on the other hand, fully supports annotating spatial and multidimensional data on the Semantic Web and enables users to query endpoints with SOLAP operators in SPARQL. To bridge the gap between QB/QB4OLAP and QB4SOLAP, we propose an RDF2SOLAP enrichment model that automatically annotates spatial multidimensional concepts with QB4SOLAP and in doing so enables SOLAP on existing QB and QB4OLAP data on the Semantic Web. Furthermore, we present and evaluate a wide range of enrichment algorithms and apply them on a non-trivial real-world use case involving governmental open data with complex geometry types.

Submitted 16 February, 2020; originally announced February 2020.

Comments: 33 pages, 8 figures, 7 tables, 10 listings, 7 algorithms, under review in Semantic Web Journal, available on http://www.semantic-web-journal.net/content/multidimensional-enrichment-spatial-rdf-data-solap

arXiv:1912.08010

Querying Linked Data: An Experimental Evaluation of State-of-the-Art Interfaces

Authors: Gabriela Montoya, Ilkcan Keles, Katja Hose

Abstract: The adoption of Semantic Web technologies, and in particular the Open Data initiative, has contributed to the steady growth of the number of datasets and triples accessible on the Web. Most commonly, queries over RDF data are evaluated over SPARQL endpoints. Recently, however, alternatives such as TPF have been proposed with the goal of shifting query processing load from the server running the SPARQL endpoint towards the client that issued the query. Although these interfaces have been evaluated against standard benchmarks and testbeds that showed their benefits over previous work in general, a fine-granular evaluation of what types of queries exploit the strengths of the different available interfaces has never been done. In this paper, we present the results of our in-depth evaluation of existing RDF interfaces. In addition, we also examine the influence of the backend on the performance of these interfaces. Using representative and diverse query loads based on the query log of a public SPARQL endpoint, we stress test the different interfaces and backends and identify their strengths and weaknesses.

Submitted 17 December, 2019; originally announced December 2019.

Comments: 18 pages, 14 figures

arXiv:1902.05134

Efficient Continuous Multi-Query Processing over Graph Streams

Authors: Lefteris Zervakis, Vinay Setty, Christos Tryfonopoulos, Katja Hose

Abstract: Graphs are ubiquitous and ever-present data structures that have a wide range of applications involving social networks, knowledge bases and biological interactions. The evolution of a graph in such scenarios can yield important insights about the nature and activities of the underlying network, which can then be utilized for applications such as news dissemination, network monitoring, and content curation. Capturing the continuous evolution of a graph can be achieved by long-standing sub-graph queries. Although for many applications this can only be achieved by a set of queries, state-of-the-art approaches focus on a single-query scenario. In this paper, we therefore introduce the notion of continuous multi-query processing over graph streams and discuss its application to a number of use cases. To this end, we designed and developed a novel algorithmic solution for efficient multi-query evaluation against a stream of graph updates and experimentally demonstrated its applicability. Our results against two baseline approaches using real-world, as well as synthetic datasets, confirm a two orders of magnitude improvement of the proposed solution.

Submitted 13 February, 2019; originally announced February 2019.

arXiv:1705.06135

doi 10.1007/978-3-319-68288-4_28

The Odyssey Approach for Optimizing Federated SPARQL Queries

Authors: Gabriela Montoya, Hala Skaf-Molli, Katja Hose

Abstract: Answering queries over a federation of SPARQL endpoints requires combining data from more than one data source. Optimizing queries in such scenarios is particularly challenging not only because of (i) the large variety of possible query execution plans that correctly answer the query but also because (ii) there is only limited access to statistics about schema and instance data of remote sources. To overcome these challenges, most federated query engines rely on heuristics to reduce the space of possible query execution plans or on dynamic programming strategies to produce optimal plans. Nevertheless, these plans may still exhibit a high number of intermediate results or high execution times because of heuristics and inaccurate cost estimations. In this paper, we present Odyssey, an approach that uses statistics that allow for a more accurate cost estimation for federated queries and therefore enables Odyssey to produce better query execution plans. Our experimental results show that Odyssey produces query execution plans that are better in terms of data transfer and execution time than state-of-the-art optimizers. Our experiments using the FedBench benchmark show execution time gains of at least 25 times on average.

Submitted 2 November, 2017; v1 submitted 17 May, 2017; originally announced May 2017.

Comments: 16 pages, 10 figures

arXiv:1212.5636

Partout: A Distributed Engine for Efficient RDF Processing

Authors: Luis Galárraga, Katja Hose, Ralf Schenkel

Abstract: The increasing interest in Semantic Web technologies has led not only to a rapid growth of semantic data on the Web but also to an increasing number of backend applications with already more than a trillion triples in some cases. Confronted with such huge amounts of data and the future growth, existing state-of-the-art systems for storing RDF and processing SPARQL queries are no longer sufficient. In this paper, we introduce Partout, a distributed engine for efficient RDF processing in a cluster of machines. We propose an effective approach for fragmenting RDF data sets based on a query log, allocating the fragments to nodes in a cluster, and finding the optimal configuration. Partout can efficiently handle updates and its query optimizer produces efficient query execution plans for ad-hoc SPARQL queries. Our experiments show the superiority of our approach to state-of-the-art approaches for partitioning and distributed SPARQL query processing.

Submitted 21 December, 2012; originally announced December 2012.

arXiv:1210.5403

An Experience Report of Large Scale Federations

Authors: Andreas Schwarte, Peter Haase, Michael Schmidt, Katja Hose, Ralf Schenkel

Abstract: We present an experimental study of large-scale RDF federations on top of the Bio2RDF data sources, involving 29 data sets with more than four billion RDF triples deployed in a local federation. Our federation is driven by FedX, a highly optimized federation mediator for Linked Data. We discuss design decisions, technical aspects, and experiences made in setting up and optimizing the Bio2RDF federation, and present an exhaustive experimental evaluation of the federation scenario. In addition to a controlled setting with local federation members, we study implications arising in a hybrid setting, where local federation members interact with remote federation members exhibiting higher network latency. The outcome demonstrates the feasibility of federated semantic data management in general and indicates remaining bottlenecks and research opportunities that shall serve as a guideline for future work in the area of federated semantic data processing.

Submitted 19 October, 2012; originally announced October 2012.

ACM Class: H.2.3; H.2.4; H.3.4

Showing 1–18 of 18 results for author: Hose, K