-
RAGE Against the Machine: Retrieval-Augmented LLM Explanations
Authors:
Joel Rorseth,
Parke Godfrey,
Lukasz Golab,
Divesh Srivastava,
Jaroslaw Szlichta
Abstract:
This paper demonstrates RAGE, an interactive tool for explaining Large Language Models (LLMs) augmented with retrieval capabilities; i.e., able to query external sources and pull relevant information into their input context. Our explanations are counterfactual in the sense that they identify parts of the input context that, when removed, change the answer to the question posed to the LLM. RAGE includes pruning methods to navigate the vast space of possible explanations, allowing users to view the provenance of the produced answers.
Submitted 11 May, 2024;
originally announced May 2024.
-
Explaining Expert Search and Team Formation Systems with ExES
Authors:
Kiarash Golzadeh,
Lukasz Golab,
Jaroslaw Szlichta
Abstract:
Expert search and team formation systems operate on collaboration networks, with nodes representing individuals, labeled with their skills, and edges denoting collaboration relationships. Given a keyword query corresponding to the desired skills, these systems identify experts that best match the query. However, state-of-the-art solutions to this problem lack transparency. To address this issue, we propose ExES, a tool designed to explain expert search and team formation systems using factual and counterfactual methods from the field of explainable artificial intelligence (XAI). ExES uses factual explanations to highlight important skills and collaborations, and counterfactual explanations to suggest new skills and collaborations to increase the likelihood of being identified as an expert. Towards a practical deployment as an interactive explanation tool, we present and experimentally evaluate a suite of pruning strategies to speed up the explanation search. In many cases, our pruning strategies make ExES an order of magnitude faster than exhaustive search, while still producing concise and actionable explanations.
Submitted 21 May, 2024;
originally announced May 2024.
-
Multi-Modal Discussion Transformer: Integrating Text, Images and Graph Transformers to Detect Hate Speech on Social Media
Authors:
Liam Hebert,
Gaurav Sahu,
Yuxuan Guo,
Nanda Kishore Sreenivas,
Lukasz Golab,
Robin Cohen
Abstract:
We present the Multi-Modal Discussion Transformer (mDT), a novel method for detecting hate speech in online social networks such as Reddit discussions. In contrast to traditional comment-only methods, our approach to labelling a comment as hate speech involves a holistic analysis of text and images grounded in the discussion context. This is done by leveraging graph transformers to capture the contextual relationships in the discussion surrounding a comment and grounding the interwoven fusion layers that combine text and image embeddings instead of processing modalities separately. To evaluate our work, we present a new dataset, HatefulDiscussions, comprising complete multi-modal discussions from multiple online communities on Reddit. We compare the performance of our model to baselines that only process individual comments and conduct extensive ablation studies.
Submitted 22 February, 2024; v1 submitted 18 July, 2023;
originally announced July 2023.
-
CREDENCE: Counterfactual Explanations for Document Ranking
Authors:
Joel Rorseth,
Parke Godfrey,
Lukasz Golab,
Mehdi Kargar,
Divesh Srivastava,
Jaroslaw Szlichta
Abstract:
Towards better explainability in the field of information retrieval, we present CREDENCE, an interactive tool capable of generating counterfactual explanations for document rankers. Embracing the unique properties of the ranking problem, we present counterfactual explanations in terms of document perturbations, query perturbations, and even other documents. Additionally, users may build and test their own perturbations, and extract insights about their query, documents, and ranker.
Submitted 9 February, 2023;
originally announced February 2023.
-
Qualitative Analysis of a Graph Transformer Approach to Addressing Hate Speech: Adapting to Dynamically Changing Content
Authors:
Liam Hebert,
Hong Yi Chen,
Robin Cohen,
Lukasz Golab
Abstract:
Our work advances an approach for predicting hate speech in social media, drawing out the critical need to consider the discussions that follow a post to successfully detect when hateful discourse may arise. Using graph transformer networks, coupled with modelling attention and BERT-level natural language processing, our approach can capture context and anticipate upcoming anti-social behaviour. In this paper, we offer a detailed qualitative analysis of this solution for hate speech detection in social networks, leading to insights into where the method has the most impressive outcomes in comparison with competitors and identifying scenarios where there are challenges to achieving ideal performance. Included is an exploration of the kinds of posts that permeate social media today, including the use of hateful images. This suggests avenues for extending our model to be more comprehensive. A key insight is that the focus on reasoning about the concept of context positions us well to be able to support multi-modal analysis of online posts. We conclude with a reflection on how the problem we are addressing relates especially well to the theme of dynamic change, a critical concern for all AI solutions for social impact. We also comment briefly on how mental health well-being can be advanced with our work, through curated content attuned to the extent of hate in posts.
Submitted 30 April, 2023; v1 submitted 25 January, 2023;
originally announced January 2023.
-
Predicting Hateful Discussions on Reddit using Graph Transformer Networks and Communal Context
Authors:
Liam Hebert,
Lukasz Golab,
Robin Cohen
Abstract:
We propose a system to predict harmful discussions on social media platforms. Our solution uses contextual deep language models and proposes the novel idea of integrating state-of-the-art Graph Transformer Networks to analyze all conversations that follow an initial post. This framework also supports adapting to future comments as the conversation unfolds. In addition, we study whether a community-specific analysis of hate speech leads to more effective detection of hateful discussions. We evaluate our approach on 333,487 Reddit discussions from various communities. We find that community-specific modeling improves performance two-fold and that models which capture wider-discussion context improve accuracy by 28% (35% for the most hateful content) compared to limited context models.
Submitted 10 January, 2023;
originally announced January 2023.
-
FedFormer: Contextual Federation with Attention in Reinforcement Learning
Authors:
Liam Hebert,
Lukasz Golab,
Pascal Poupart,
Robin Cohen
Abstract:
A core issue in multi-agent federated reinforcement learning is defining how to aggregate insights from multiple agents. This is commonly done by taking the average of each participating agent's model weights into one common model (FedAvg). We instead propose FedFormer, a novel federation strategy that utilizes Transformer Attention to contextually aggregate embeddings from models originating from different learner agents. In so doing, we attentively weigh the contributions of other agents with respect to the current agent's environment and learned relationships, thus providing a more effective and efficient federation. We evaluate our methods on the Meta-World environment and find that our approach yields significant improvements over FedAvg and non-federated Soft Actor-Critic single-agent methods. Our results compared to Soft Actor-Critic show that FedFormer achieves higher episodic return while still abiding by the privacy constraints of federated learning. Finally, we also demonstrate improvements in effectiveness with increased agent pools across all methods in certain tasks. This is contrasted by FedAvg, which fails to make noticeable improvements when scaled.
Submitted 2 March, 2023; v1 submitted 26 May, 2022;
originally announced May 2022.
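
The FedAvg baseline that the FedFormer abstract describes, averaging each agent's model weights into one common model, can be sketched as follows. This is an illustrative sketch only: the function name `fed_avg` and the dictionary-of-lists weight layout are assumptions, and it uses an unweighted mean rather than any weighting scheme a real implementation might apply. It does not show FedFormer's attention-based aggregation.

```python
from typing import Dict, List


def fed_avg(agent_weights: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Element-wise unweighted mean of each named parameter across agents.

    Assumes every agent has the same parameter names and shapes.
    """
    if not agent_weights:
        raise ValueError("need at least one agent")
    n = len(agent_weights)
    averaged: Dict[str, List[float]] = {}
    for name in agent_weights[0]:
        # Group the i-th entry of this parameter from every agent together.
        columns = zip(*(w[name] for w in agent_weights))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged


# Two hypothetical agents, each with a single two-element parameter.
agents = [{"layer": [1.0, 2.0]}, {"layer": [3.0, 4.0]}]
common_model = fed_avg(agents)
print(common_model)
```

Every agent contributes equally here regardless of its environment, which is the limitation the abstract contrasts with contextual aggregation.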
-
GRS: Combining Generation and Revision in Unsupervised Sentence Simplification
Authors:
Mohammad Dehghan,
Dhruv Kumar,
Lukasz Golab
Abstract:
We propose GRS: an unsupervised approach to sentence simplification that combines text generation and text revision. We start with an iterative framework in which an input sentence is revised using explicit edit operations, and add paraphrasing as a new edit operation. This allows us to combine the advantages of generative and revision-based approaches: paraphrasing captures complex edit operations, and the use of explicit edit operations in an iterative manner provides controllability and interpretability. We demonstrate these advantages of GRS compared to existing methods on the Newsela and ASSET datasets.
Submitted 22 March, 2022; v1 submitted 18 March, 2022;
originally announced March 2022.
-
Climate Action During COVID-19 Recovery and Beyond: A Twitter Text Mining Study
Authors:
Mohammad S. Parsa,
Lukasz Golab,
Srinivasan Keshav
Abstract:
The Coronavirus pandemic created a global crisis that prompted immediate large-scale action, including economic shutdowns and mobility restrictions. These actions have had devastating effects on the economy, but some positive effects on the environment. As the world recovers from the pandemic, we ask the following question: What is the public attitude towards climate action during COVID-19 recovery and beyond? We answer this question by analyzing discussions on the Twitter social media platform. We find that most discussions support climate action and point out lessons learned during pandemic response that can shape future climate policy, although skeptics continue to have a presence. Additionally, concerns arise in the context of climate action during the pandemic, such as mitigating the risk of COVID-19 transmission on public transit.
Submitted 25 May, 2021;
originally announced May 2021.
-
Discovery and Contextual Data Cleaning with Ontology Functional Dependencies
Authors:
Zheng Zheng,
Longtao Zheng,
Morteza Alipour Langouri,
Fei Chiang,
Lukasz Golab,
Jaroslaw Szlichta
Abstract:
Functional Dependencies (FDs) define attribute relationships based on syntactic equality, and, when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We explore dependency-based data cleaning with Ontology Functional Dependencies (OFDs), which express semantic attribute relationships such as synonyms and is-a hierarchies defined by an ontology. We study the theoretical foundations for OFDs, including sound and complete axioms and a linear-time inference procedure. We then propose an algorithm for discovering OFDs (exact ones and ones that hold with some exceptions) from data that uses the axioms to prune the search space. Towards enabling OFDs as data quality rules in practice, we study the problem of finding minimal repairs to a relation and ontology with respect to a set of OFDs. We demonstrate the effectiveness of our techniques on real datasets, and show that OFDs can significantly reduce the number of false positive errors in data cleaning techniques that rely on traditional FDs.
Submitted 12 March, 2022; v1 submitted 17 May, 2021;
originally announced May 2021.
-
Real-Time LSM-Trees for HTAP Workloads
Authors:
Hemant Saxena,
Lukasz Golab,
Stratos Idreos,
Ihab F. Ilyas
Abstract:
Real-time analytics systems employ hybrid data layouts in which data are stored in different formats throughout their lifecycle. Recent data are stored in a row-oriented format to serve OLTP workloads and support high insert rates, while older data are transformed to a column-oriented format for OLAP access patterns. We observe that a Log-Structured Merge (LSM) Tree is a natural fit for a lifecycle-aware storage engine due to its high write throughput and level-oriented structure, in which records propagate from one level to the next over time. To build a lifecycle-aware storage engine using an LSM-Tree, we make a crucial modification to allow different data layouts in different levels, ranging from purely row-oriented to purely column-oriented, leading to a Real-Time LSM-Tree. We give a cost model and an algorithm to design a Real-Time LSM-Tree that is suitable for a given workload, followed by an experimental evaluation of LASER - a prototype implementation of our idea built on top of the RocksDB key-value store.
Submitted 14 July, 2022; v1 submitted 17 January, 2021;
originally announced January 2021.
-
Efficient Discovery of Approximate Order Dependencies
Authors:
Reza Karegar,
Parke Godfrey,
Lukasz Golab,
Mehdi Kargar,
Divesh Srivastava,
Jaroslaw Szlichta
Abstract:
Order dependencies (ODs) capture relationships between ordered domains of attributes. Approximate ODs (AODs) capture such relationships even when there exist exceptions in the data. During automated discovery of ODs, validation is the process of verifying whether an OD holds. We present an algorithm for validating approximate ODs with significantly improved runtime performance over existing methods for AODs, and prove that it is correct and has optimal runtime. By replacing the validation step in a leading algorithm for approximate OD discovery with ours, we achieve orders-of-magnitude improvements in performance.
Submitted 6 January, 2021;
originally announced January 2021.
-
Iterative Edit-Based Unsupervised Sentence Simplification
Authors:
Dhruv Kumar,
Lili Mou,
Lukasz Golab,
Olga Vechtomova
Abstract:
We present a novel iterative, edit-based approach to unsupervised sentence simplification. Our model is guided by a scoring function involving fluency, simplicity, and meaning preservation. Then, we iteratively perform word and phrase-level edits on the complex sentence. Compared with previous approaches, our model does not require a parallel training set, but is more controllable and interpretable. Experiments on Newsela and WikiLarge datasets show that our approach is nearly as effective as state-of-the-art supervised approaches.
Submitted 16 June, 2020;
originally announced June 2020.
-
Discovering Domain Orders through Order Dependencies
Authors:
Reza Karegar,
Melicaalsadat Mirsafian,
Parke Godfrey,
Lukasz Golab,
Mehdi Kargar,
Divesh Srivastava,
Jaroslaw Szlichta
Abstract:
Much real-world data come with explicitly defined domain orders; e.g., lexicographic order for strings, numeric for integers, and chronological for time. Our goal is to discover implicit domain orders that we do not already know; for instance, that the order of months in the Chinese Lunar calendar is Corner < Apricot < Peach. To do so, we enhance data profiling methods by discovering implicit domain orders in data through order dependencies. We enumerate tractable special cases and proceed towards the most general case, which we prove is NP-complete. We show that the general case nevertheless can be effectively handled by a SAT solver. We also devise an interestingness measure to rank the discovered implicit domain orders, which we validate with a user study. Based on an extensive suite of experiments with real-world data, we establish the efficacy of our algorithms, and the utility of the domain orders discovered by demonstrating significant added value in three applications (data profiling, query optimization, and data mining).
Submitted 7 September, 2021; v1 submitted 28 May, 2020;
originally announced May 2020.
-
Consentio: Managing Consent to Data Access using Permissioned Blockchains
Authors:
Rishav Raj Agarwal,
Dhruv Kumar,
Lukasz Golab,
Srinivasan Keshav
Abstract:
The increasing amount of personal data is raising serious issues in the context of privacy, security, and data ownership. Entities whose data are being collected can benefit from mechanisms to manage the parties that can access their data and to audit who has accessed their data. Consent management systems address these issues. We present Consentio, a scalable consent management system based on the Hyperledger Fabric permissioned blockchain. The data management challenge we address is to ensure high throughput and low latency of endorsing data access requests and granting or revoking consent. Experimental results show that our system can handle as many as 6,000 access requests per second, allowing it to scale to very large deployments.
Submitted 9 March, 2020; v1 submitted 15 October, 2019;
originally announced October 2019.
-
XOX Fabric: A hybrid approach to blockchain transaction execution
Authors:
Christian Gorenflo,
Lukasz Golab,
Srinivasan Keshav
Abstract:
Performance and scalability are major concerns for blockchains: permissionless systems are typically limited by slow proof of X consensus algorithms and sequential post-order transaction execution on every node of the network. By introducing a small amount of trust in their participants, permissioned blockchain systems such as Hyperledger Fabric can benefit from more efficient consensus algorithms and make use of parallel pre-order execution on a subset of network nodes. Fabric, in particular, has been shown to handle tens of thousands of transactions per second. However, this performance is only achievable for contention-free transaction workloads. If many transactions compete for a small set of hot keys in the world state, the effective throughput drops drastically. We therefore propose XOX: a novel two-pronged transaction execution approach that both minimizes invalid transactions in the Fabric blockchain and maximizes concurrent execution. Our approach additionally prevents unintentional denial of service attacks by clients re-submitting conflicting transactions. Even under fully contentious workloads, XOX can handle more than 3000 transactions per second, all of which would be discarded by regular Fabric.
Submitted 9 March, 2020; v1 submitted 26 June, 2019;
originally announced June 2019.
-
Errata Note: Discovering Order Dependencies through Order Compatibility
Authors:
Parke Godfrey,
Lukasz Golab,
Mehdi Kargar,
Divesh Srivastava,
Jaroslaw Szlichta
Abstract:
A number of extensions to the classical notion of functional dependencies have been proposed to express and enforce application semantics. One of these extensions is that of order dependencies (ODs), which express rules involving order. The article entitled "Discovering Order Dependencies through Order Compatibility" by Consonni et al., published in the EDBT conference proceedings in March 2019, investigates the OD discovery problem. They claim to prove that their OD discovery algorithm, OCDDISCOVER, is complete, as well as being significantly more efficient in practice than the state-of-the-art. They further claim that the implementation of the existing FASTOD algorithm (ours; we shared our code base with the authors), which they benchmark against, is flawed, as OCDDISCOVER and FASTOD report different sets of ODs over the same data sets.
In this rebuttal, we show that their claim of completeness is, in fact, not true. Built upon their incorrect claim, OCDDISCOVER's pruning rules are overly aggressive, and prune parts of the search space that contain legitimate ODs. This is the reason their approach appears to be "faster" in practice. Finally, we show that Consonni et al. misinterpret our set-based canonical form for ODs, leading to an incorrect claim that our FASTOD implementation has an error.
Submitted 6 May, 2019;
originally announced May 2019.
-
Distributed Dependency Discovery
Authors:
Hemant Saxena,
Lukasz Golab,
Ihab F. Ilyas
Abstract:
We analyze the problem of discovering dependencies from distributed big data. Existing (non-distributed) algorithms focus on minimizing computation by pruning the search space of possible dependencies. However, distributed algorithms must also optimize communication costs, especially in shared-nothing settings, leading to a more complex optimization space. To understand this space, we introduce six primitives shared by existing dependency discovery algorithms, corresponding to data processing steps separated by communication barriers. Through case studies, we show how the primitives allow us to analyze the design space and develop communication-optimized implementations. Finally, we support our analysis with an experimental evaluation on real datasets.
Submitted 12 March, 2019;
originally announced March 2019.
-
FastFabric: Scaling Hyperledger Fabric to 20,000 Transactions per Second
Authors:
Christian Gorenflo,
Stephen Lee,
Lukasz Golab,
S. Keshav
Abstract:
Blockchain technologies are expected to make a significant impact on a variety of industries. However, one issue holding them back is their limited transaction throughput, especially compared to established solutions such as distributed database systems. In this paper, we re-architect a modern permissioned blockchain system, Hyperledger Fabric, to increase transaction throughput from 3,000 to 20,000 transactions per second. We focus on performance bottlenecks beyond the consensus mechanism, and we propose architectural changes that reduce computation and I/O overhead during transaction ordering and validation to greatly improve throughput. Notably, our optimizations are fully plug-and-play and do not require any interface changes to Hyperledger Fabric.
Submitted 4 March, 2019; v1 submitted 3 January, 2019;
originally announced January 2019.
-
Authority-based Team Discovery in Social Networks
Authors:
Morteza Zihayat,
Aijun An,
Lukasz Golab,
Mehdi Kargar,
Jaroslaw Szlichta
Abstract:
Given a social network of experts, we address the problem of discovering a team of experts that collectively holds a set of skills required to complete a given project. Most prior work ranks possible solutions by communication cost, represented by edge weights in the expert network. Our contribution is to take experts' authority into account, represented by node weights. We formulate several problems that combine communication cost and authority, we prove that they are NP-hard, and we propose and experimentally evaluate greedy algorithms to solve them.
Submitted 15 November, 2016; v1 submitted 8 November, 2016;
originally announced November 2016.
-
Effective and Complete Discovery of Order Dependencies via Set-based Axiomatization
Authors:
Jaroslaw Szlichta,
Parke Godfrey,
Lukasz Golab,
Mehdi Kargar,
Divesh Srivastava
Abstract:
Integrity constraints (ICs) provide a valuable tool for expressing and enforcing application semantics. However, formulating constraints manually requires domain expertise, is prone to human errors, and may be excessively time consuming, especially on large datasets. Hence, proposals for automatic discovery have been made for some classes of ICs, such as functional dependencies (FDs), and recently, order dependencies (ODs). ODs properly subsume FDs, as they can additionally express business rules involving order; e.g., an employee never has a higher salary while paying lower taxes compared with another employee.
We address the limitations of prior work on OD discovery which has factorial complexity in the number of attributes, is incomplete (i.e., it does not discover valid ODs that cannot be inferred from the ones found) and is not concise (i.e., it can result in "redundant" discovery and overly large discovery sets). We improve significantly on complexity, offer completeness, and define a compact canonical form. This is based on a novel polynomial mapping to a canonical form for ODs, and a sound and complete set of axioms (inference rules) for canonical ODs. This allows us to develop an efficient set-containment, lattice-driven OD discovery algorithm that uses the inference rules to prune the search space. Our algorithm has exponential worst-case time complexity in the number of attributes and linear complexity in the number of tuples. We prove that it produces a complete, minimal set of ODs (i.e., minimal with regard to the canonical representation). Finally, using real and synthetic datasets, we experimentally show orders-of-magnitude performance improvements over the current state-of-the-art algorithm and demonstrate effectiveness of our techniques.
Submitted 23 August, 2016; v1 submitted 22 August, 2016;
originally announced August 2016.
-
Effective Keyword Search in Graphs
Authors:
Mehdi Kargar,
Lukasz Golab,
Jaroslaw Szlichta
Abstract:
In a node-labeled graph, keyword search finds subtrees of the graph whose nodes contain all of the query keywords. This provides a way to query graph databases that neither requires mastery of a query language such as SPARQL, nor a deep knowledge of the database schema. Previous work ranks answer trees using combinations of structural and content-based metrics, such as path lengths between keywords or relevance of the labels in the answer tree to the query keywords. We propose two new ways to rank keyword search results over graphs. The first takes node importance into account while the second is a bi-objective optimization of edge weights and node importance. Since both of these problems are NP-hard, we propose greedy algorithms to solve them, and experimentally verify their effectiveness and efficiency on a real dataset.
Submitted 29 March, 2016; v1 submitted 20 December, 2015;
originally announced December 2015.
-
Distributed Data Placement via Graph Partitioning
Authors:
Lukasz Golab,
Marios Hadjieleftheriou,
Howard Karloff,
Barna Saha
Abstract:
With the widespread use of shared-nothing clusters of servers, there has been a proliferation of distributed object stores that offer high availability, reliability and enhanced performance for MapReduce-style workloads. However, relational workloads cannot always be evaluated efficiently using MapReduce without extensive data migrations, which cause network congestion and reduced query throughput. We study the problem of computing data placement strategies that minimize the data communication costs incurred by typical relational query workloads in a distributed setting.
Our main contribution is a reduction of the data placement problem to the well-studied problem of Graph Partitioning, which is NP-hard but for which efficient approximation algorithms exist. The novelty and significance of this result lie in representing the communication cost exactly and using standard graphs instead of hypergraphs, which were used in prior work on data placement that optimized for different objectives (not communication cost).
We study several practical extensions of the problem: with load balancing, with replication, with materialized views, and with complex query plans consisting of sequences of intermediate operations that may be computed on different servers. We provide integer linear programs (IPs) that may be used with any IP solver to find an optimal data placement. For the no-replication case, we use publicly available graph partitioning libraries (e.g., METIS) to efficiently compute nearly optimal solutions. For the versions with replication, we introduce two heuristics that utilize the Graph Partitioning solution of the no-replication case. Using the TPC-DS workload, it may take an IP solver weeks to compute an optimal data placement, whereas our reduction produces nearly optimal solutions in seconds.
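As a rough illustration of the objective behind such a reduction, the sketch below scores a placement by the total weight of graph edges that cross servers, and finds the best balanced two-server split by brute force. The abstract does not describe the paper's graph construction, so the edge-weight interpretation (data shipped when two objects are on different servers) and the names `communication_cost` and `brute_force_bisection` are assumptions. Brute force is feasible only on tiny inputs; a partitioning library would be used in practice.

```python
from itertools import combinations


def communication_cost(edges, placement):
    """Sum the weights of edges whose endpoints are on different servers.

    edges: iterable of (u, v, weight) triples (hypothetical layout).
    placement: dict mapping each node to a server id.
    """
    return sum(w for u, v, w in edges if placement[u] != placement[v])


def brute_force_bisection(nodes, edges):
    """Try every balanced two-server placement and keep the cheapest.

    Returns (cost, placement). Exponential in len(nodes).
    """
    nodes = list(nodes)
    best = None
    for left in combinations(nodes, len(nodes) // 2):
        left_set = set(left)
        placement = {n: (0 if n in left_set else 1) for n in nodes}
        cost = communication_cost(edges, placement)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best
```

For example, with nodes `A, B, C, D` and heavy edges `A-B` and `C-D`, the cheapest split keeps each heavily connected pair on the same server.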
Submitted 1 December, 2013;
originally announced December 2013.
-
On the Relative Trust between Inconsistent Data and Inaccurate Constraints
Authors:
George Beskales,
Ihab F. Ilyas,
Lukasz Golab,
Artur Galiullin
Abstract:
Functional dependencies (FDs) specify the intended data semantics while violations of FDs indicate deviation from these semantics. In this paper, we study a data cleaning problem in which the FDs may not be completely correct, e.g., due to data evolution or incomplete knowledge of the data semantics. We argue that the notion of relative trust is a crucial aspect of this problem: if the FDs are outdated, we should modify them to fit the data, but if we suspect that there are problems with the data, we should modify the data to fit the FDs. In practice, it is usually unclear how much to trust the data versus the FDs. To address this problem, we propose an algorithm for generating non-redundant solutions (i.e., simultaneous modifications of the data and the FDs) corresponding to various levels of relative trust. This can help users determine the best way to modify their data and/or FDs to achieve consistency.
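For readers unfamiliar with FD violations, the following is a minimal sketch of detecting them for one dependency. It only finds the inconsistencies; it does not implement the paper's repair algorithm or its notion of relative trust. The function name `fd_violations` and the list-of-dicts table layout are assumptions for illustration.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Find violations of the functional dependency lhs -> rhs.

    rows: list of dicts, one per tuple (hypothetical layout).
    lhs, rhs: lists of attribute names.
    Returns a dict mapping each offending lhs value to the set of
    distinct rhs values it is paired with.
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        seen[key].add(tuple(row[a] for a in rhs))
    # An lhs value paired with more than one rhs value violates the FD.
    return {key: vals for key, vals in seen.items() if len(vals) > 1}
```

Each reported group can be made consistent either by changing the data (making the right-hand-side values agree) or by changing the FD (for example, adding an attribute to the left-hand side), which is the choice the abstract frames as relative trust.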
Submitted 24 July, 2012; v1 submitted 22 July, 2012;
originally announced July 2012.