-
TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes
Authors:
Aamod Khatiwada,
Harsha Kokel,
Ibrahim Abdelaziz,
Subhajit Chaudhury,
Julian Dolby,
Oktie Hassanzadeh,
Zhenhan Huang,
Tejaswini Pedapati,
Horst Samulowitz,
Kavitha Srinivas
Abstract:
Enterprises have a growing need to identify relevant tables in data lakes; e.g. tables that are unionable, joinable, or subsets of each other. Tabular neural models can be helpful for such data discovery tasks. In this paper, we present TabSketchFM, a neural tabular model for data discovery over data lakes. First, we propose a novel pre-training sketch-based approach to enhance the effectiveness o…
▽ More
Enterprises have a growing need to identify relevant tables in data lakes; e.g. tables that are unionable, joinable, or subsets of each other. Tabular neural models can be helpful for such data discovery tasks. In this paper, we present TabSketchFM, a neural tabular model for data discovery over data lakes. First, we propose a novel pre-training sketch-based approach to enhance the effectiveness of data discovery techniques in neural tabular models. Second, to further finetune the pretrained model for several downstream tasks, we develop LakeBench, a collection of 8 benchmarks to help with different data discovery tasks such as finding tasks that are unionable, joinable, or subsets of each other. We then show on these finetuning tasks that TabSketchFM achieves state-of-the art performance compared to existing neural models. Third, we use these finetuned models to search for tables that are unionable, joinable, or can be subsets of each other. Our results demonstrate improvements in F1 scores for search compared to state-of-the-art techniques (even up to 70% improvement in a joinable search benchmark). Finally, we show significant transfer across datasets and tasks establishing that our model can generalize across different tasks over different data lakes
△ Less
Submitted 28 June, 2024;
originally announced July 2024.
-
Distilling Event Sequence Knowledge From Large Language Models
Authors:
Somin Wadhwa,
Oktie Hassanzadeh,
Debarun Bhattacharjya,
Ken Barker,
Jian Ni
Abstract:
Event sequence models have been found to be highly effective in the analysis and prediction of events. Building such models requires availability of abundant high-quality event sequence data. In certain applications, however, clean structured event sequences are not available, and automated sequence extraction results in data that is too noisy and incomplete. In this work, we explore the use of La…
▽ More
Event sequence models have been found to be highly effective in the analysis and prediction of events. Building such models requires availability of abundant high-quality event sequence data. In certain applications, however, clean structured event sequences are not available, and automated sequence extraction results in data that is too noisy and incomplete. In this work, we explore the use of Large Language Models (LLMs) to generate event sequences that can effectively be used for probabilistic event model construction. This can be viewed as a mechanism of distilling event sequence knowledge from LLMs. Our approach relies on a Knowledge Graph (KG) of event concepts with partial causal relations to guide the generative language model for causal event sequence generation. We show that our approach can generate high-quality event sequences, filling a knowledge gap in the input KG. Furthermore, we explore how the generated sequences can be leveraged to discover useful and more complex structured knowledge from pattern mining and probabilistic event models. We release our sequence generation code and evaluation framework, as well as corpus of event sequence data.
△ Less
Submitted 1 July, 2024; v1 submitted 14 January, 2024;
originally announced January 2024.
-
An Evaluation Framework for Map** News Headlines to Event Classes in a Knowledge Graph
Authors:
Steve Fonin Mbouadeu,
Martin Lorenzo,
Ken Barker,
Oktie Hassanzadeh
Abstract:
Map** ongoing news headlines to event-related classes in a rich knowledge base can be an important component in a knowledge-based event analysis and forecasting solution. In this paper, we present a methodology for creating a benchmark dataset of news headlines mapped to event classes in Wikidata, and resources for the evaluation of methods that perform the map**. We use the dataset to study t…
▽ More
Map** ongoing news headlines to event-related classes in a rich knowledge base can be an important component in a knowledge-based event analysis and forecasting solution. In this paper, we present a methodology for creating a benchmark dataset of news headlines mapped to event classes in Wikidata, and resources for the evaluation of methods that perform the map**. We use the dataset to study two classes of unsupervised methods for this task: 1) adaptations of classic entity linking methods, and 2) methods that treat the problem as a zero-shot text classification problem. For the first approach, we evaluate off-the-shelf entity linking systems. For the second approach, we explore a) pre-trained natural language inference (NLI) models, and b) pre-trained large generative language models. We present the results of our evaluation, lessons learned, and directions for future work. The dataset and scripts for evaluation are made publicly available.
△ Less
Submitted 4 December, 2023;
originally announced December 2023.
-
Event Prediction using Case-Based Reasoning over Knowledge Graphs
Authors:
Sola Shirai,
Debarun Bhattacharjya,
Oktie Hassanzadeh
Abstract:
Applying link prediction (LP) methods over knowledge graphs (KG) for tasks such as causal event prediction presents an exciting opportunity. However, typical LP models are ill-suited for this task as they are incapable of performing inductive link prediction for new, unseen event entities and they require retraining as knowledge is added or changed in the underlying KG. We introduce a case-based r…
▽ More
Applying link prediction (LP) methods over knowledge graphs (KG) for tasks such as causal event prediction presents an exciting opportunity. However, typical LP models are ill-suited for this task as they are incapable of performing inductive link prediction for new, unseen event entities and they require retraining as knowledge is added or changed in the underlying KG. We introduce a case-based reasoning model, EvCBR, to predict properties about new consequent events based on similar cause-effect events present in the KG. EvCBR uses statistical measures to identify similar events and performs path-based predictions, requiring no training step. To generalize our methods beyond the domain of event prediction, we frame our task as a 2-hop LP task, where the first hop is a causal relation connecting a cause event to a new effect event and the second hop is a property about the new event which we wish to predict. The effectiveness of our method is demonstrated using a novel dataset of newsworthy events with causal relations curated from Wikidata, where EvCBR outperforms baselines including translational-distance-based, GNN-based, and rule-based LP models.
△ Less
Submitted 21 September, 2023;
originally announced September 2023.
-
Matching Table Metadata with Business Glossaries Using Large Language Models
Authors:
Elita Lobo,
Oktie Hassanzadeh,
Nhan Pham,
Nandana Mihindukulasooriya,
Dharmashankar Subramanian,
Horst Samulowitz
Abstract:
Enterprises often own large collections of structured data in the form of large databases or an enterprise data lake. Such data collections come with limited metadata and strict access policies that could limit access to the data contents and, therefore, limit the application of classic retrieval and analysis solutions. As a result, there is a need for solutions that can effectively utilize the av…
▽ More
Enterprises often own large collections of structured data in the form of large databases or an enterprise data lake. Such data collections come with limited metadata and strict access policies that could limit access to the data contents and, therefore, limit the application of classic retrieval and analysis solutions. As a result, there is a need for solutions that can effectively utilize the available metadata. In this paper, we study the problem of matching table metadata to a business glossary containing data labels and descriptions. The resulting matching enables the use of an available or curated business glossary for retrieval and analysis without or before requesting access to the data contents. One solution to this problem is to use manually-defined rules or similarity measures on column names and glossary descriptions (or their vector embeddings) to find the closest match. However, such approaches need to be tuned through manual labeling and cannot handle many business glossaries that contain a combination of simple as well as complex and long descriptions. In this work, we leverage the power of large language models (LLMs) to design generic matching methods that do not require manual tuning and can identify complex relations between column names and glossaries. We propose methods that utilize LLMs in two ways: a) by generating additional context for column names that can aid with matching b) by using LLMs to directly infer if there is a relation between column names and glossary descriptions. Our preliminary experimental results show the effectiveness of our proposed methods.
△ Less
Submitted 7 September, 2023;
originally announced September 2023.
-
Improving Neural Ranking Models with Traditional IR Methods
Authors:
Anik Saha,
Oktie Hassanzadeh,
Alex Gittens,
Jian Ni,
Kavitha Srinivas,
Bulent Yener
Abstract:
Neural ranking methods based on large transformer models have recently gained significant attention in the information retrieval community, and have been adopted by major commercial solutions. Nevertheless, they are computationally expensive to create, and require a great deal of labeled data for specialized corpora. In this paper, we explore a low resource alternative which is a bag-of-embedding…
▽ More
Neural ranking methods based on large transformer models have recently gained significant attention in the information retrieval community, and have been adopted by major commercial solutions. Nevertheless, they are computationally expensive to create, and require a great deal of labeled data for specialized corpora. In this paper, we explore a low resource alternative which is a bag-of-embedding model for document retrieval and find that it is competitive with large transformer models fine tuned on information retrieval tasks. Our results show that a simple combination of TF-IDF, a traditional keyword matching method, with a shallow embedding model provides a low cost path to compete well with the performance of complex neural ranking models on 3 datasets. Furthermore, adding TF-IDF measures improves the performance of large-scale fine tuned models on these tasks.
△ Less
Submitted 29 August, 2023;
originally announced August 2023.
-
Open Government Data Corpus for Table Search
Authors:
Michael Glass,
Sugato Bagchi,
Oktie Hassanzadeh,
Gaetano Rossiello,
Alfio Gliozzo
Abstract:
Increasing amounts of structured data can provide value for research and business if the relevant data can be located. Often the data is in a data lake without a consistent schema, making locating useful data challenging. Table search is a growing research area, but existing benchmarks have been limited to displayed tables. Tables sized and formatted for display in a Wikipedia page or ArXiv paper…
▽ More
Increasing amounts of structured data can provide value for research and business if the relevant data can be located. Often the data is in a data lake without a consistent schema, making locating useful data challenging. Table search is a growing research area, but existing benchmarks have been limited to displayed tables. Tables sized and formatted for display in a Wikipedia page or ArXiv paper are considerably different from data tables in both scale and style. By using metadata associated with open data from government portals, we create the first dataset to benchmark search over data tables at scale. We demonstrate three styles of table-to-table related table search. The three notions of table relatedness are: tables produced by the same organization, tables distributed as part of the same dataset, and tables with a high degree of overlap in the annotated tags. The keyword tags provided with the metadata also permit the automatic creation of a keyword search over tables benchmark. We provide baselines on this dataset using existing methods including traditional and neural approaches.
△ Less
Submitted 24 August, 2023;
originally announced August 2023.
-
A Cross-Domain Evaluation of Approaches for Causal Knowledge Extraction
Authors:
Anik Saha,
Oktie Hassanzadeh,
Alex Gittens,
Jian Ni,
Kavitha Srinivas,
Bulent Yener
Abstract:
Causal knowledge extraction is the task of extracting relevant causes and effects from text by detecting the causal relation. Although this task is important for language understanding and knowledge discovery, recent works in this domain have largely focused on binary classification of a text segment as causal or non-causal. In this regard, we perform a thorough analysis of three sequence tagging…
▽ More
Causal knowledge extraction is the task of extracting relevant causes and effects from text by detecting the causal relation. Although this task is important for language understanding and knowledge discovery, recent works in this domain have largely focused on binary classification of a text segment as causal or non-causal. In this regard, we perform a thorough analysis of three sequence tagging models for causal knowledge extraction and compare it with a span based approach to causality extraction. Our experiments show that embeddings from pre-trained language models (e.g. BERT) provide a significant performance boost on this task compared to previous state-of-the-art models with complex architectures. We observe that span based models perform better than simple sequence tagging models based on BERT across all 4 data sets from diverse domains with different types of cause-effect phrases.
△ Less
Submitted 7 August, 2023;
originally announced August 2023.
-
LakeBench: Benchmarks for Data Discovery over Data Lakes
Authors:
Kavitha Srinivas,
Julian Dolby,
Ibrahim Abdelaziz,
Oktie Hassanzadeh,
Harsha Kokel,
Aamod Khatiwada,
Tejaswini Pedapati,
Subhajit Chaudhury,
Horst Samulowitz
Abstract:
Within enterprises, there is a growing need to intelligently navigate data lakes, specifically focusing on data discovery. Of particular importance to enterprises is the ability to find related tables in data repositories. These tables can be unionable, joinable, or subsets of each other. There is a dearth of benchmarks for these tasks in the public domain, with related work targeting private data…
▽ More
Within enterprises, there is a growing need to intelligently navigate data lakes, specifically focusing on data discovery. Of particular importance to enterprises is the ability to find related tables in data repositories. These tables can be unionable, joinable, or subsets of each other. There is a dearth of benchmarks for these tasks in the public domain, with related work targeting private datasets. In LakeBench, we develop multiple benchmarks for these tasks by using the tables that are drawn from a diverse set of data sources such as government data from CKAN, Socrata, and the European Central Bank. We compare the performance of 4 publicly available tabular foundational models on these tasks. None of the existing models had been trained on the data discovery tasks that we developed for this benchmark; not surprisingly, their performance shows significant room for improvement. The results suggest that the establishment of such benchmarks may be useful to the community to build tabular models usable for data discovery in data lakes.
△ Less
Submitted 9 July, 2023;
originally announced July 2023.
-
Summary Markov Models for Event Sequences
Authors:
Debarun Bhattacharjya,
Saurabh Sihag,
Oktie Hassanzadeh,
Liza Bialik
Abstract:
Datasets involving sequences of different types of events without meaningful time stamps are prevalent in many applications, for instance when extracted from textual corpora. We propose a family of models for such event sequences -- summary Markov models -- where the probability of observing an event type depends only on a summary of historical occurrences of its influencing set of event types. Th…
▽ More
Datasets involving sequences of different types of events without meaningful time stamps are prevalent in many applications, for instance when extracted from textual corpora. We propose a family of models for such event sequences -- summary Markov models -- where the probability of observing an event type depends only on a summary of historical occurrences of its influencing set of event types. This Markov model family is motivated by Granger causal models for time series, with the important distinction that only one event can occur in a position in an event sequence. We show that a unique minimal influencing set exists for any set of event types of interest and choice of summary function, formulate two novel models from the general family that represent specific sequence dynamics, and propose a greedy search algorithm for learning them from event sequence data. We conduct an experimental investigation comparing the proposed models with relevant baselines, and illustrate their knowledge acquisition and discovery capabilities through case studies involving sequences from text.
△ Less
Submitted 6 May, 2022;
originally announced May 2022.
-
Bipartite Graph Matching Algorithms for Clean-Clean Entity Resolution: An Empirical Evaluation
Authors:
George Papadakis,
Vasilis Efthymiou,
Emanouil Thanos,
Oktie Hassanzadeh
Abstract:
Entity Resolution (ER) is the task of finding records that refer to the same real-world entities. A common scenario is when entities across two clean sources need to be resolved, which we refer to as Clean-Clean ER. In this paper, we perform an extensive empirical evaluation of 8 bipartite graph matching algorithms that take in as input a bipartite similarity graph and provide as output a set of m…
▽ More
Entity Resolution (ER) is the task of finding records that refer to the same real-world entities. A common scenario is when entities across two clean sources need to be resolved, which we refer to as Clean-Clean ER. In this paper, we perform an extensive empirical evaluation of 8 bipartite graph matching algorithms that take in as input a bipartite similarity graph and provide as output a set of matched entities. We consider a wide range of matching algorithms, including algorithms that have not previously been applied to ER, or have been evaluated only in other ER settings. We assess the relative performance of the algorithms with respect to accuracy and time efficiency over 10 established, real datasets, from which we extract >700 different similarity graphs. Our results provide insights into the relative performance of these algorithms and guidelines for choosing the best one, depending on the data at hand.
△ Less
Submitted 25 February, 2022; v1 submitted 28 December, 2021;
originally announced December 2021.
-
LinkedCT: A Linked Data Space for Clinical Trials
Authors:
Oktie Hassanzadeh,
Anastasios Kementsietsidis,
Lipyeow Lim,
Renee J. Miller,
Min Wang
Abstract:
The Linked Clinical Trials (LinkedCT) project aims at publishing the first open semantic web data source for clinical trials data. The database exposed by LinkedCT is generated by (1) transforming existing data sources of clinical trials into RDF, and (2) discovering semantic links between the records in the trials data and several other data sources. In this paper, we discuss several challenges…
▽ More
The Linked Clinical Trials (LinkedCT) project aims at publishing the first open semantic web data source for clinical trials data. The database exposed by LinkedCT is generated by (1) transforming existing data sources of clinical trials into RDF, and (2) discovering semantic links between the records in the trials data and several other data sources. In this paper, we discuss several challenges involved in these two steps and present the methodology used in LinkedCT to overcome these challenges. Our approach for semantic link discovery involves using state-of-the-art approximate string matching techniques combined with ontology-based semantic matching of the records, all performed in a declarative and easy-to-use framework. We present an evaluation of the performance of our proposed techniques in several link discovery scenarios in LinkedCT.
△ Less
Submitted 4 August, 2009;
originally announced August 2009.
-
Benchmarking Declarative Approximate Selection Predicates
Authors:
Oktie Hassanzadeh
Abstract:
Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is the ease of use and integration with existing applications. Several similarity predicates have been proposed in t…
▽ More
Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is the ease of use and integration with existing applications. Several similarity predicates have been proposed in the past for common quality primitives (approximate selections, joins, etc.) and have been fully expressed using declarative SQL statements. In this thesis, new similarity predicates are proposed along with their declarative realization, based on notions of probabilistic information retrieval. Then, full declarative specifications of previously proposed similarity predicates in the literature are presented, grouped into classes according to their primary characteristics. Finally, a thorough performance and accuracy study comparing a large number of similarity predicates for data cleaning operations is performed.
△ Less
Submitted 14 July, 2009;
originally announced July 2009.
-
Automated Protein Structure Classification: A Survey
Authors:
Oktie Hassanzadeh
Abstract:
Classification of proteins based on their structure provides a valuable resource for studying protein structure, function and evolutionary relationships. With the rapidly increasing number of known protein structures, manual and semi-automatic classification is becoming ever more difficult and prohibitively slow. Therefore, there is a growing need for automated, accurate and efficient classifica…
▽ More
Classification of proteins based on their structure provides a valuable resource for studying protein structure, function and evolutionary relationships. With the rapidly increasing number of known protein structures, manual and semi-automatic classification is becoming ever more difficult and prohibitively slow. Therefore, there is a growing need for automated, accurate and efficient classification methods to generate classification databases or increase the speed and accuracy of semi-automatic techniques. Recognizing this need, several automated classification methods have been developed. In this survey, we overview recent developments in this area. We classify different methods based on their characteristics and compare their methodology, accuracy and efficiency. We then present a few open problems and explain future directions.
△ Less
Submitted 13 July, 2009;
originally announced July 2009.