-
Validating ChatGPT Facts through RDF Knowledge Graphs and Sentence Similarity
Authors:
Michalis Mountantonakis,
Yannis Tzitzikas
Abstract:
Since ChatGPT offers detailed responses without justifications, and erroneous facts even for popular persons, events and places, in this paper we present a novel pipeline that retrieves the response of ChatGPT in RDF and tries to validate the ChatGPT facts using one or more RDF Knowledge Graphs (KGs). To this end we leverage DBpedia and LODsyndesis (an aggregated Knowledge Graph that contains 2 billion triples from 400 RDF KGs of many domains) and short sentence embeddings, and introduce an algorithm that returns the most relevant triple(s) accompanied by their provenance and a confidence score. This enables the validation of ChatGPT responses and their enrichment with justifications and provenance. To evaluate this service (and such services in general), we create an evaluation benchmark that includes 2,000 ChatGPT facts; specifically 1,000 facts for famous Greek Persons, 500 facts for popular Greek Places, and 500 facts for Events related to Greece. The facts were manually labelled (approximately 73% of ChatGPT facts were correct and 27% of facts were erroneous). The results are promising; indicatively for the whole benchmark, we managed to verify 85.3% of the correct ChatGPT facts and to find the correct answer for 58% of the erroneous ChatGPT facts.
Submitted 17 November, 2023; v1 submitted 8 November, 2023;
originally announced November 2023.
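The validation step described in the abstract above (compare a ChatGPT fact against candidate KG triples and return the closest one with its provenance and a score) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: a bag-of-words vector stands in for the sentence embeddings the paper mentions, the function names `embed`, `cosine` and `best_match` are hypothetical, and the candidate triples and their source labels are made-up sample data, not output of DBpedia or LODsyndesis.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(fact, candidates):
    """Return (triple, provenance, score) for the candidate closest to `fact`.

    `fact` is a (subject, predicate, object) tuple of strings; `candidates`
    is a list of (triple, provenance) pairs in the same form.
    """
    fact_vec = embed(" ".join(fact))
    scored = [
        (triple, source, cosine(fact_vec, embed(" ".join(triple))))
        for triple, source in candidates
    ]
    return max(scored, key=lambda item: item[2])


# Illustrative data only: an erroneous fact and two hypothetical KG triples.
chatgpt_fact = ("Aristotle", "birth place", "Athens")
kg_candidates = [
    (("Aristotle", "birth place", "Stagira"), "DBpedia"),
    (("Aristotle", "occupation", "philosopher"), "hypothetical-KG"),
]
triple, provenance, score = best_match(chatgpt_fact, kg_candidates)
print(triple, provenance, round(score, 2))
```

In this sketch the best-scoring triple shares subject and predicate with the ChatGPT fact but has a different object, which is the situation in which such a pipeline could flag the fact as erroneous and propose a correction. A real system would replace `embed` with an actual sentence-embedding model.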
-
Using Multiple RDF Knowledge Graphs for Enriching ChatGPT Responses
Authors:
Michalis Mountantonakis,
Yannis Tzitzikas
Abstract:
There is a recent trend for using the novel Artificial Intelligence ChatGPT chatbot, which provides detailed responses and articulate answers across many domains of knowledge. However, in many cases it returns plausible-sounding but incorrect or inaccurate responses, and it does not provide evidence. Therefore, any user has to search further to check the accuracy of the answer and/or to find more information about the entities of the response. At the same time there is a high proliferation of RDF Knowledge Graphs (KGs) over any real domain, that offer high quality structured data. For enabling the combination of ChatGPT and RDF KGs, we present a research prototype, called GPToLODS, which is able to enrich any ChatGPT response with more information from hundreds of RDF KGs. In particular, it identifies and annotates each entity of the response with statistics and hyperlinks to LODsyndesis KG (which contains integrated data from 400 RDF KGs and over 412 million entities). In this way, it is feasible to enrich the content of entities and to perform fact checking and validation for the facts of the response in real time.
Submitted 12 April, 2023;
originally announced April 2023.
-
FastCat Catalogues: Interactive Entity-based Exploratory Analysis of Archival Documents
Authors:
Georgios Rinakakis,
Kostas Petrakis,
Yannis Tzitzikas,
Pavlos Fafalios
Abstract:
We describe FastCat Catalogues, a Web application that supports researchers studying archival material, such as historians, in exploring and quantitatively analysing the data (transcripts) of archival documents. The application was designed based on real information needs provided by a large group of researchers, makes use of JSON technology, and is configurable for use over any type of archival documents whose contents have been transcribed and exported in JSON format. The supported functionalities include a) source- or record-specific entity browsing, b) source-independent entity browsing, c) data filtering, d) inspection of provenance information, e) data aggregation and visualisation in charts, f) table and chart data export for further (external) analysis. The application is provided as open source and is currently used by historians in maritime history research.
Submitted 20 June, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
A Workflow Model for Holistic Data Management and Semantic Interoperability in Quantitative Archival Research
Authors:
Pavlos Fafalios,
Yannis Marketakis,
Anastasia Axaridou,
Yannis Tzitzikas,
Martin Doerr
Abstract:
Archival research is a complicated task that involves several diverse activities for the extraction of evidence and knowledge from a set of archival documents. The involved activities are usually unconnected, in terms of data connection and flow, making their recursive revision and execution difficult, as well as the inspection of provenance information at data element level. This paper proposes a workflow model for holistic data management in archival research; from transcribing and documenting a set of archival documents, to curating the transcribed data, integrating it into a rich semantic network (knowledge graph), and then exploring the integrated data quantitatively. The workflow is provenance-aware, highly-recursive and focuses on semantic interoperability, aiming at the production of sustainable data of high value and long-term validity. We provide implementation details for each step of the workflow and present its application in maritime history research. We also discuss relevant quality aspects and lessons learned from its application in a real context.
Submitted 18 January, 2023;
originally announced January 2023.
-
Estimating the Cost of Executing Link Traversal based SPARQL Queries
Authors:
Antonis Sklavos,
Pavlos Fafalios,
Yannis Tzitzikas
Abstract:
An increasing number of organisations in almost all fields have started adopting semantic web technologies for publishing their data as open, linked and interoperable (RDF) datasets, queryable through the SPARQL language and protocol. Link traversal has emerged as a SPARQL query processing method that exploits the Linked Data principles and the dynamic nature of the Web to dynamically discover data relevant for answering a query by resolving online resources (URIs) during query evaluation. However, the execution time of link traversal queries can become prohibitively high for certain query types due to the high number of resources that need to be accessed during query execution. In this paper we propose and evaluate baseline methods for estimating the evaluation cost of link traversal queries. Such methods can be very useful for deciding on-the-fly the query execution strategy to follow for a given query, thereby reducing the load of a SPARQL endpoint and increasing the overall reliability of the query service. To evaluate the performance of the proposed methods, we have created (and make publicly available) a ground truth dataset consisting of 2,425 queries.
Submitted 17 October, 2022;
originally announced October 2022.
-
Towards Semantic Interoperability in Historical Research: Documenting Research Data and Knowledge with Synthesis
Authors:
Pavlos Fafalios,
Konstantina Konsolaki,
Lida Charami,
Kostas Petrakis,
Manos Paterakis,
Dimitris Angelakis,
Yannis Tzitzikas,
Chrysoula Bekiari,
Martin Doerr
Abstract:
A vast area of research in historical science concerns the documentation and study of artefacts and related evidence. Current practice mostly uses spreadsheets or simple relational databases to organise the information as rows with multiple columns of related attributes. This form offers itself for data analysis and scholarly interpretation, however it also poses problems including i) the difficulty for collaborative but controlled documentation by a large number of users, ii) the lack of representation of the details from which the documented relations are inferred, iii) the difficulty to extend the underlying data structures as well as to combine and integrate data from multiple and diverse information sources, and iv) the limitation to reuse the data beyond the context of a particular research activity. To support historians to cope with these problems, in this paper we describe the Synthesis documentation system and its use by a large number of historians in the context of an ongoing research project in the field of History of Art. The system is Web-based and collaborative, and makes use of existing standards for information documentation and publication (CIDOC-CRM, RDF), focusing on semantic interoperability and the production of data of high value and long-term validity.
Submitted 29 July, 2021;
originally announced July 2021.
-
FAST CAT: Collaborative Data Entry and Curation for Semantic Interoperability in Digital Humanities
Authors:
Pavlos Fafalios,
Kostas Petrakis,
Georgios Samaritakis,
Korina Doerr,
Athina Kritsotaki,
Yannis Tzitzikas,
Martin Doerr
Abstract:
Descriptive and empirical sciences, such as History, are the sciences that collect, observe and describe phenomena in order to explain them and draw interpretative conclusions about influences, driving forces and impacts under given circumstances. Spreadsheet software and relational database management systems are still the dominant tools for quantitative analysis and overall data management in these sciences, allowing researchers to directly analyse the gathered data and perform scholarly interpretation. However, this current practice has a set of limitations, including the high dependency of the collected data on the initial research hypothesis, usually useless for other research, the lack of representation of the details from which the registered relations are inferred, and the difficulty of revisiting the original data sources for verification, corrections or improvements. To cope with these problems, in this paper we present FAST CAT, a collaborative system for assistive data entry and curation in Digital Humanities and similar forms of empirical research. We describe the related challenges, the overall methodology we follow for supporting semantic interoperability, and discuss the use of FAST CAT in the context of a European (ERC) project of Maritime History, called SeaLiT, which examines economic, social and demographic impacts of the introduction of steamboats in the Mediterranean area between the 1850s and the 1920s.
Submitted 28 May, 2021;
originally announced May 2021.
-
CS563-QA: A Collection for Evaluating Question Answering Systems
Authors:
Katerina Papantoniou,
Yannis Tzitzikas
Abstract:
Question Answering (QA) is a challenging topic since it requires tackling the various difficulties of natural language understanding. Since evaluation is important not only for identifying the strong and weak points of the various techniques for QA, but also for facilitating the inception of new methods and techniques, in this paper we present a collection for evaluating QA methods over free text that we have created. Although it is a small collection, it contains cases of increasing difficulty, therefore it has an educational value and it can be used for rapid evaluation of QA systems.
Submitted 6 February, 2021; v1 submitted 2 July, 2019;
originally announced July 2019.
-
CPOI: A Compact Method to Archive Versioned RDF Triple-Sets
Authors:
Maria Psaraki,
Yannis Tzitzikas
Abstract:
Large amounts of RDF/S data have been produced and published lately, and several modern applications require the provision of versioning and archiving services over such datasets. In this paper we propose a novel storage index for archiving versions of such datasets, called CPOI (compact partial order index), that exploits the fact that an RDF Knowledge Base (KB) is a graph (or equivalently a set of triples), and thus does not have a unique serialization (as text does). If we want to keep several versions stored, we actually want to store multiple sets of triples. CPOI is a data structure for storing such sets, aiming at reducing the storage space, since this is important not only for reducing storage costs, but also for reducing the various communication costs and enabling hosting in main memory (and thus processing efficiently) large quantities of data. CPOI is based on a partial order structure over sets of triple identifiers, where the triple identifiers are represented in a gapped form using variable length encoding schemes. For this index we evaluate analytically and experimentally various identifier assignment techniques and their space savings. The results show significant storage savings; specifically, the storage space of the compressed sets in large and realistic synthetic datasets is about 8% of the size of the uncompressed sets.
Submitted 11 February, 2019;
originally announced February 2019.
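The "gapped form using variable length encoding schemes" mentioned in the CPOI abstract above can be illustrated with a small sketch. It is only an assumption about one possible scheme: sorted triple identifiers are replaced by the differences between consecutive values, and each difference is written as a LEB128-style variable-length integer (7 payload bits per byte). The abstract does not say which encoding schemes CPOI actually uses, the function names are hypothetical, and the partial order structure over sets is not modelled here.

```python
def encode_varint(n):
    """Encode a non-negative integer with 7 payload bits per byte.

    The high bit of each byte signals that another byte follows.
    """
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte of this integer
            return bytes(out)


def encode_id_set(ids):
    """Store a set of non-negative triple identifiers as varint-encoded gaps."""
    out = bytearray()
    prev = 0
    for i in sorted(set(ids)):
        out += encode_varint(i - prev)  # gap to the previous identifier
        prev = i
    return bytes(out)


def decode_id_set(data):
    """Inverse of encode_id_set: rebuild the sorted identifier list."""
    ids, current, shift, prev = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current  # add the gap back onto the running identifier
            ids.append(prev)
            current, shift = 0, 0
    return ids


triple_ids = [1000, 1003, 1010, 5000]
encoded = encode_id_set(triple_ids)
print(len(encoded), decode_id_set(encoded))
```

Because nearby identifiers yield small gaps that fit in a single byte, how identifiers are assigned to triples directly affects the encoded size, which is consistent with the abstract's interest in identifier assignment techniques.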
-
How Many and What Types of SPARQL Queries can be Answered through Zero-Knowledge Link Traversal?
Authors:
Pavlos Fafalios,
Yannis Tzitzikas
Abstract:
The current de-facto way to query the Web of Data is through the SPARQL protocol, where a client sends queries to a server through a SPARQL endpoint. Contrary to an HTTP server, providing and maintaining a robust and reliable endpoint requires a significant effort that not all publishers are willing or able to make. An alternative query evaluation method is through link traversal, where a query is answered by dereferencing online web resources (URIs) in real time. While several approaches for such a lookup-based query evaluation method have been proposed, there exists no analysis of the types (patterns) of queries that can be directly answered on the live Web, without accessing local or remote endpoints and without a-priori knowledge of available data sources. In this paper, we first provide a method for checking if a SPARQL query (to be evaluated on a SPARQL endpoint) can be answered through zero-knowledge link traversal (without accessing the endpoint), and analyse a large corpus of real SPARQL query logs for finding the frequency and distribution of answerable and non-answerable query patterns. Subsequently, we provide an algorithm for transforming answerable queries to SPARQL-LD queries that bypass the endpoints. We report experimental results about the efficiency of the transformed queries and discuss the benefits and the limitations of this query evaluation method.
Submitted 13 December, 2018;
originally announced January 2019.
-
Facetize: An Interactive Tool for Cleaning and Transforming Datasets for Facilitating Exploratory Search
Authors:
Anna Kokolaki,
Yannis Tzitzikas
Abstract:
There is a plethora of datasets in various formats which are usually stored in files, hosted in catalogs, or accessed through SPARQL endpoints. In most cases, these datasets cannot be straightforwardly explored by end users, for satisfying recall-oriented information needs. To fill this gap, in this paper we present the design and implementation of Facetize, an editor that allows users to transform (in an interactive manner) datasets, either static (i.e. stored in files), or dynamic (i.e. being the results of SPARQL queries), to datasets that can be directly explored effectively by themselves or other users. The latter (exploration) is achieved through the familiar interaction paradigm of Faceted Search (and Preference-enriched Faceted Search). Specifically in this paper we describe the requirements, we introduce the required set of transformations, and then we detail the functionality and the implementation of the editor Facetize that realizes these transformations. The supported operations cover a wide range of tasks (selection, visibility, deletions, edits, definition of hierarchies, intervals, derived attributes, and others) and Facetize enables the user to carry them out in a user-friendly and guided manner, without presupposing any technical background (regarding data representation or query languages). Finally we present the results of an evaluation with users. To the best of our knowledge, this is the first editor for this kind of task.
Submitted 27 December, 2018;
originally announced December 2018.
-
Heuristics-based Query Reordering for Federated Queries in SPARQL 1.1 and SPARQL-LD
Authors:
Thanos Yannakis,
Pavlos Fafalios,
Yannis Tzitzikas
Abstract:
The federated query extension of SPARQL 1.1 allows executing queries distributed over different SPARQL endpoints. SPARQL-LD is a recent extension of SPARQL 1.1 which makes it possible to directly query any HTTP web source containing RDF data, like web pages embedded with RDFa, JSON-LD or Microformats, without requiring the declaration of named graphs. This makes it possible to query a large number of data sources (including SPARQL endpoints, online resources, or even Web APIs returning RDF data) through a single concise query. However, a non-optimal formulation of SPARQL 1.1 and SPARQL-LD queries can lead to a large number of calls to remote resources, which in turn can lead to extremely high query execution times. In this paper, we address this problem and propose a set of query reordering methods which make use of heuristics to reorder a set of SERVICE graph patterns based on their restrictiveness, without requiring the gathering and use of statistics from the remote sources. Such a query optimization approach is widely applicable since it can be exploited on top of existing SPARQL 1.1 and SPARQL-LD implementations. Evaluation results show that query reordering can highly decrease the query-execution time, while a method that considers the number and type of unbound variables and joins achieves the optimal query plan in 88% of the cases.
Submitted 23 October, 2018;
originally announced October 2018.
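The idea of reordering SERVICE graph patterns by restrictiveness, as described in the abstract above, can be sketched in a few lines. This is a deliberately simplified assumption, not the paper's method: each SERVICE pattern is scored only by its number of distinct unbound variables (fewer variables is treated as more restrictive), whereas the abstract says the best-performing method also considers the type of the variables and the joins. The data layout, the function names and the `example.org` endpoint URLs are hypothetical.

```python
def unbound_variables(triple_patterns):
    """Collect the distinct variables (terms starting with '?') in a pattern."""
    return {term for tp in triple_patterns for term in tp if term.startswith("?")}


def reorder_services(services):
    """Order SERVICE patterns so that those with fewer unbound variables run first.

    `services` is a list of (endpoint, triple_patterns) pairs, where each
    triple pattern is a (subject, predicate, object) tuple of strings.
    sorted() is stable, so patterns with equal scores keep their original order.
    """
    return sorted(services, key=lambda s: len(unbound_variables(s[1])))


# Hypothetical federated query with three SERVICE patterns.
services = [
    ("http://example.org/a", [("?person", "?p", "?o")]),
    ("http://example.org/b", [("?person", "rdf:type", "ex:Painter")]),
    ("http://example.org/c", [("?person", "ex:bornIn", "?city")]),
]
reordered = reorder_services(services)
print([endpoint for endpoint, _ in reordered])
```

No statistics from the remote sources are needed for this ordering, which matches the motivation stated in the abstract; the scoring function is the part a fuller heuristic would refine.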
-
LD-SDS: Towards an Expressive Spoken Dialogue System based on Linked-Data
Authors:
Alexandros Papangelis,
Panagiotis Papadakos,
Margarita Kotti,
Yannis Stylianou,
Yannis Tzitzikas,
Dimitris Plexousakis
Abstract:
In this work we discuss the related challenges and describe an approach towards the fusion of state-of-the-art technologies from the Spoken Dialogue Systems (SDS) and the Semantic Web and Information Retrieval domains. We envision a dialogue system named LD-SDS that will support advanced, expressive, and engaging user requests, over multiple, complex, rich, and open-domain data sources that will leverage the wealth of the available Linked Data. Specifically, we focus on: a) improving the identification, disambiguation and linking of entities occurring in data sources and user input; b) offering advanced query services for exploiting the semantics of the data, with reasoning and exploratory capabilities; and c) expanding the typical information seeking dialogue model (slot filling) to better reflect real-world conversational search scenarios.
Submitted 9 October, 2017;
originally announced October 2017.
-
Tasks that Require, or can Benefit from, Matching Blank Nodes
Authors:
Christina Lantzaki,
Yannis Tzitzikas
Abstract:
In various domains and cases, we observe the creation and usage of information elements which are unnamed. Such elements do not have a name, or may have a name that is not externally referable (usually meaningless and not persistent over time). This paper discusses why we will never 'escape' from the problem of having to construct mappings between such unnamed elements in information systems. Since unnamed elements nowadays occur very often in the framework of the Semantic Web and Linked Data as blank nodes, the paper describes scenarios that can benefit from methods that compute mappings between the unnamed elements. For each scenario, the corresponding bnode matching problem is formally defined. Based on this analysis, we try to reach a more general formulation of the problem, which can be useful for guiding the required technological advances. To this end, the paper finally discusses methods to realize blank node matching, the implementations that exist, and identifies open issues and challenges.
Submitted 30 October, 2014;
originally announced October 2014.
-
A Simple Method to Produce Algorithmic MIDI Music based on Randomness, Simple Probabilities and Multi-Threading
Authors:
Yannis Tzitzikas
Abstract:
This paper introduces a simple method for producing multichannel MIDI music that is based on randomness and simple probabilities. One distinctive feature of the method is that it produces and sends in parallel to the sound card more than one unsynchronized channel by exploiting the multi-threading capabilities of general purpose programming languages. As a consequence, the derived sound offers a quite "full" and "unpredictable" acoustic experience to the listener. Subsequently the paper reports the results of an evaluation with users. The results were very surprising: the majority of users responded that they could tolerate this music in various occasions.
Submitted 14 December, 2013;
originally announced December 2013.
-
Information Carriers and Identification of Information Objects: An Ontological Approach
Authors:
Martin Doerr,
Yannis Tzitzikas
Abstract:
Even though library and archival practice, as well as Digital Preservation, have a long tradition in identifying information objects, the question of their precise identity under change of carrier or migration is still a riddle to science. The objective of this paper is to provide criteria for the unique identification of some important kinds of information objects, independent from the kind of carrier or specific encoding. Our approach is based on the idea that the substance of some kinds of information objects can completely be described in terms of discrete arrangements of finite numbers of known kinds of symbols, such as those implied by style guides for scientific journal submissions. Our theory is also useful for selecting or describing what has to be preserved. This is a fundamental problem since curators and archivists would like to formally record the decisions of what has to be preserved over time and to decide (or verify) whether a migration (transformation) preserves the intended information content. Furthermore, it is important for reasoning about the authenticity of digital objects, as well as for reducing the cost of digital preservation.
Submitted 12 December, 2012; v1 submitted 1 January, 2012;
originally announced January 2012.
-
Query processing in distributed, taxonomy-based information sources
Authors:
Carlo Meghini,
Yannis Tzitzikas,
Veronica Coltella,
Anastasia Analyti
Abstract:
We address the problem of answering queries over a distributed information system, storing objects indexed by terms organized in a taxonomy. The taxonomy consists of subsumption relationships between negation-free DNF formulas on terms and negation-free conjunctions of terms. In the first part of the paper, we consider the centralized case, deriving a hypergraph-based algorithm that is efficient in data complexity. In the second part of the paper, we consider the distributed case, presenting alternative ways of implementing the centralized algorithm. These ways descend from two basic criteria: direct vs. query re-writing evaluation, and centralized vs. distributed data or taxonomy allocation. Combinations of these criteria make it possible to cover a wide spectrum of architectures, ranging from client-server to peer-to-peer. We evaluate the performance of the various architectures by simulation on a network with O(10^4) nodes, and derive final results. An extensive review of the relevant literature is finally included.
Submitted 12 September, 2011;
originally announced September 2011.
-
Similarity-based Browsing over Linked Open Data
Authors:
Michael Hickson,
Yannis Kargakis,
Yannis Tzitzikas
Abstract:
An increasing amount of data is published on the Web according to the Linked Open Data (LOD) principles. End users would like to browse these data in a flexible manner. In this paper we focus on similarity-based browsing and we introduce a novel method for computing the similarity between two entities of a given RDF/S graph. The distinctive characteristics of the proposed metric are that it is generic (it can be used to compare nodes of any kind), it takes into account the neighborhoods of the nodes, and it is configurable (with respect to the accuracy vs computational complexity tradeoff). We demonstrate the behavior of the metric using examples from an application over LOD. Finally, we generalize and elaborate on implementation approaches harmonized with the distributed nature of LOD which can be used for computing the most similar entities using neighborhood-based similarity metrics.
Submitted 21 June, 2011;
originally announced June 2011.
-
Object-Relational Database Representations for Text Indexing
Authors:
Panagiotis Papadakos,
Yannis Theoharis,
Yannis Marketakis,
Nikos Armenatzoglou,
Yannis Tzitzikas
Abstract:
One of the distinctive features of Information Retrieval systems compared to Database Management systems is that they offer better compression for posting lists, resulting in better I/O performance and thus faster query evaluation. In this paper, we introduce database representations of the index that reduce the size (and thus the disk I/Os) of the posting lists. This is not achieved by redesigning the DBMS, but by exploiting the non-1NF features that existing Object-Relational DBM systems (ORDBMS) already offer. Specifically, four different database representations are described and detailed experimental results for one million pages are reported. Three of these representations are one order of magnitude more space efficient and faster (in query evaluation) than the plain relational representation.
Submitted 17 June, 2009;
originally announced June 2009.
-
The Anatomy of Mitos Web Search Engine
Authors:
Panagiotis Papadakos,
Giorgos Vasiliadis,
Yannis Theoharis,
Nikos Armenatzoglou,
Stella Kopidaki,
Yannis Marketakis,
Manos Daskalakis,
Kostas Karamaroudis,
Giorgos Linardakis,
Giannis Makrydakis,
Vangelis Papathanasiou,
Lefteris Sardis,
Petros Tsialiamanis,
Georgia Troullinou,
Kostas Vandikas,
Dimitris Velegrakis,
Yannis Tzitzikas
Abstract:
Engineering a Web search engine offering effective and efficient information retrieval is a challenging task. This document presents our experiences from designing and developing a Web search engine offering a wide spectrum of functionalities and we report some interesting experimental results. A rather peculiar design choice of the engine is that its index is based on a DBMS, while some of the distinctive functionalities that are offered include advanced Greek language stemming, real time result clustering, and advanced link analysis techniques (also for spam page detection).
Submitted 16 March, 2008; v1 submitted 14 March, 2008;
originally announced March 2008.
-
Query Evaluation in P2P Systems of Taxonomy-based Sources: Algorithms, Complexity, and Optimizations
Authors:
Carlo Meghini,
Yannis Tzitzikas,
Anastasia Analyti
Abstract:
In this study, we address the problem of answering queries over a peer-to-peer system of taxonomy-based sources. A taxonomy states subsumption relationships between negation-free DNF formulas on terms and negation-free conjunctions of terms. To lay the foundations of our study, we first consider the centralized case, deriving the complexity of the decision problem and of query evaluation. We conclude by presenting an algorithm that is efficient in data complexity and is based on hypergraphs. More expressive forms of taxonomies are also investigated, which however lead to intractability. We then move to the distributed case, and introduce a logical model of a network of taxonomy-based sources. On such a network, a distributed version of the centralized algorithm is then presented, based on a message passing paradigm, and its correctness is proved. We finally discuss optimization issues, and relate our work to the literature.
Submitted 19 September, 2007;
originally announced September 2007.