-
A Closer Look at Spatial-Slice Features Learning for COVID-19 Detection
Authors:
Chih-Chung Hsu,
Chia-Ming Lee,
Yang Fan Chiang,
Yi-Shiuan Chou,
Chih-Yu Jiang,
Shen-Chieh Tai,
Chi-Han Tsai
Abstract:
Conventional Computed Tomography (CT) imaging recognition faces two significant challenges: (1) There is often considerable variability in the resolution and size of each CT scan, necessitating strict requirements for the input size and adaptability of models. (2) CT scans contain a large number of out-of-distribution (OOD) slices. The crucial features may only be present in specific spatial regions and slices of the entire CT scan. How can we effectively figure out where these are located? To deal with this, we introduce an enhanced Spatial-Slice Feature Learning (SSFL++) framework specifically designed for CT scans. It aims to filter out OOD data within the whole CT scan, enabling us to select crucial spatial-slice features for analysis by reducing redundancy by 70% in total. Meanwhile, we propose a Kernel-Density-based slice Sampling (KDS) method to improve stability during the training and inference stages, thereby speeding up convergence and boosting performance. As a result, the experiments demonstrate the promising performance of our model using a simple EfficientNet-2D (E2D) model, even with only 1% of the training data. The efficacy of our approach has been validated on the COVID-19-CT-DB datasets provided by the DEF-AI-MIA workshop, in conjunction with CVPR 2024. Our source code is available at https://github.com/ming053l/E2D
Submitted 20 April, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
Simple 2D Convolutional Neural Network-based Approach for COVID-19 Detection
Authors:
Chih-Chung Hsu,
Chia-Ming Lee,
Yang Fan Chiang,
Yi-Shiuan Chou,
Chih-Yu Jiang,
Shen-Chieh Tai,
Chi-Han Tsai
Abstract:
This study explores the use of deep learning techniques for analyzing lung Computed Tomography (CT) images. Classic deep learning approaches face challenges with varying slice counts and resolutions in CT images, a diversity arising from the utilization of assorted scanning equipment. Typically, predictions are made on single slices which are then combined for a comprehensive outcome. Yet, this method does not incorporate learning features specific to each slice, leading to a compromise in effectiveness. To address these challenges, we propose an advanced Spatial-Slice Feature Learning (SSFL++) framework specifically tailored for CT scans. It aims to filter out out-of-distribution (OOD) data within the entire CT scan, allowing us to select essential spatial-slice features for analysis by reducing data redundancy by 70%. Additionally, we introduce a Kernel-Density-based slice Sampling (KDS) method to enhance stability during training and inference phases, thereby accelerating convergence and enhancing overall performance. Remarkably, our experiments reveal that our model achieves promising results with a simple EfficientNet-2D (E2D) model. The effectiveness of our approach is confirmed on the COVID-19-CT-DB datasets provided by the DEF-AI-MIA workshop.
Submitted 17 March, 2024;
originally announced March 2024.
-
Hidden Flaws Behind Expert-Level Accuracy of GPT-4 Vision in Medicine
Authors:
Qiao Jin,
Fangyuan Chen,
Yiliang Zhou,
Ziyang Xu,
Justin M. Cheung,
Robert Chen,
Ronald M. Summers,
Justin F. Rousseau,
Peiyun Ni,
Marc J Landsman,
Sally L. Baxter,
Subhi J. Al'Aref,
Yijia Li,
Alex Chen,
Josef A. Brejt,
Michael F. Chiang,
Yifan Peng,
Zhiyong Lu
Abstract:
Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges - an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominently in image comprehension (27.2%). Regardless of GPT-4V's high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.
Submitted 22 April, 2024; v1 submitted 16 January, 2024;
originally announced January 2024.
-
Residual Scheduling: A New Reinforcement Learning Approach to Solving Job Shop Scheduling Problem
Authors:
Kuo-Hao Ho,
Ruei-Yu Jheng,
Ji-Han Wu,
Fan Chiang,
Yen-Chi Chen,
Yuan-Yu Wu,
I-Chen Wu
Abstract:
The job-shop scheduling problem (JSP) is a mathematical optimization problem widely used in industries like manufacturing, and flexible JSP (FJSP) is also a common variant. Since they are NP-hard, it is intractable to find the optimal solution for all cases within a reasonable time. Thus, it becomes important to develop efficient heuristics to solve JSP/FJSP. One class of methods for solving scheduling problems is construction heuristics, which construct scheduling solutions via heuristics. Recently, many methods for construction heuristics leverage deep reinforcement learning (DRL) with graph neural networks (GNN). In this paper, we propose a new approach, named residual scheduling, for solving JSP/FJSP. In this new approach, we remove irrelevant machines and jobs such as those finished, such that the states include the remaining (or relevant) machines and jobs only. Our experiments show that our approach reaches state-of-the-art (SOTA) among all known construction heuristics on most well-known open JSP and FJSP benchmarks. In addition, we also observe that even though our model is trained for scheduling problems of smaller sizes, our method still performs well for scheduling problems of large sizes. Interestingly in our experiments, our approach even reaches zero gap for 49 of 50 JSP instances whose job numbers are more than 150 on 20 machines.
Submitted 2 October, 2023; v1 submitted 27 September, 2023;
originally announced September 2023.
-
A generalized framework to predict continuous scores from medical ordinal labels
Authors:
Katharina V. Hoebel,
Andreanne Lemay,
John Peter Campbell,
Susan Ostmo,
Michael F. Chiang,
Christopher P. Bridge,
Matthew D. Li,
Praveer Singh,
Aaron S. Coyner,
Jayashree Kalpathy-Cramer
Abstract:
Many variables of interest in clinical medicine, like disease severity, are recorded using discrete ordinal categories such as normal/mild/moderate/severe. These labels are used to train and evaluate disease severity prediction models. However, ordinal categories represent a simplification of an underlying continuous severity spectrum. Using continuous scores instead of ordinal categories is more sensitive to detecting small changes in disease severity over time. Here, we present a generalized framework that accurately predicts continuously valued variables using only discrete ordinal labels during model development. We found that for three clinical prediction tasks, models that take the ordinal relationship of the training labels into account outperformed conventional multi-class classification models. Particularly the continuous scores generated by ordinal classification and regression models showed a significantly higher correlation with expert rankings of disease severity and lower mean squared errors compared to the multi-class classification models. Furthermore, the use of MC dropout significantly improved the ability of all evaluated deep learning approaches to predict continuously valued scores that truthfully reflect the underlying continuous target variable. We showed that accurate continuously valued predictions can be generated even if the model development only involves discrete ordinal labels. The novel framework has been validated on three different clinical prediction tasks and has proven to bridge the gap between discrete ordinal labels and the underlying continuously valued variables.
Submitted 30 May, 2023;
originally announced May 2023.
-
Discovery of Keys for Graphs [Extended Version]
Authors:
Morteza Alipourlangouri,
Fei Chiang
Abstract:
Keys for graphs use the topology and value constraints needed to uniquely identify entities in a graph database. They have been studied to support object identification, knowledge fusion, data deduplication, and social network reconciliation. In this paper, we present our algorithm to mine keys over graphs. Our algorithm discovers keys in a graph via frequent subgraph expansion. We present two properties that define a meaningful key, including minimality and support. Lastly, we experimentally verify the efficiency of our algorithm on real-world graphs.
Submitted 31 May, 2022;
originally announced May 2022.
-
Not Color Blind: AI Predicts Racial Identity from Black and White Retinal Vessel Segmentations
Authors:
Aaron S. Coyner,
Praveer Singh,
James M. Brown,
Susan Ostmo,
R. V. Paul Chan,
Michael F. Chiang,
Jayashree Kalpathy-Cramer,
J. Peter Campbell
Abstract:
Background: Artificial intelligence (AI) may demonstrate racial bias when skin or choroidal pigmentation is present in medical images. Recent studies have shown that convolutional neural networks (CNNs) can predict race from images that were not previously thought to contain race-specific features. We evaluate whether grayscale retinal vessel maps (RVMs) of patients screened for retinopathy of prematurity (ROP) contain race-specific features.
Methods: 4095 retinal fundus images (RFIs) were collected from 245 Black and White infants. A U-Net generated RVMs from RFIs, which were subsequently thresholded, binarized, or skeletonized. To determine whether RVM differences between Black and White eyes were physiological, CNNs were trained to predict race from color RFIs, raw RVMs, and thresholded, binarized, or skeletonized RVMs. Area under the precision-recall curve (AUC-PR) was evaluated.
Findings: CNNs predicted race from RFIs near perfectly (image-level AUC-PR: 0.999, subject-level AUC-PR: 1.000). Raw RVMs were almost as informative as color RFIs (image-level AUC-PR: 0.938, subject-level AUC-PR: 0.995). Ultimately, CNNs were able to detect whether RFIs or RVMs were from Black or White babies, regardless of whether images contained color, vessel segmentation brightness differences were nullified, or vessel segmentation widths were normalized.
Interpretation: AI can detect race from grayscale RVMs that were not thought to contain racial information. Two potential explanations for these findings are that: retinal vessels physiologically differ between Black and White babies or the U-Net segments the retinal vasculature differently for various fundus pigmentations. Either way, the implications remain the same: AI algorithms have potential to demonstrate racial bias in practice, even when preliminary attempts to remove such information from the underlying images appear to be successful.
Submitted 28 September, 2021;
originally announced September 2021.
-
Efficient Action Recognition Using Confidence Distillation
Authors:
Shervin Manzuri Shalmani,
Fei Chiang,
Rong Zheng
Abstract:
Modern neural networks are powerful predictive models. However, when it comes to recognizing that they may be wrong about their predictions, they perform poorly. For example, for one of the most common activation functions, the ReLU and its variants, even a well-calibrated model can produce incorrect but high confidence predictions. In the related task of action recognition, most current classification methods are based on clip-level classifiers that densely sample a given video for non-overlapping, same-sized clips and aggregate the results using an aggregation function - typically averaging - to achieve video level predictions. While this approach has been shown to be effective, it is sub-optimal in recognition accuracy and has a high computational overhead. To mitigate both these issues, we propose the confidence distillation framework to teach a representation of uncertainty of the teacher to the student sampler and divide the task of full video prediction between the student and the teacher models. We conduct extensive experiments on three action recognition datasets and demonstrate that our framework achieves significant improvements in action recognition accuracy (up to 20%) and computational efficiency (more than 40%).
Submitted 16 August, 2022; v1 submitted 5 September, 2021;
originally announced September 2021.
-
Temporal Graph Functional Dependencies [Extended Version]
Authors:
Morteza Alipourlangouri,
Adam Mansfield,
Fei Chiang,
Yinghui Wu
Abstract:
Data dependencies have been extended to graphs to characterize topological and value constraints. Existing data dependencies are defined to capture inconsistencies in static graphs. Nevertheless, inconsistencies may occur over evolving graphs and only for certain time periods. The need for capturing such inconsistencies in temporal graphs is evident in anomaly detection and predictive dynamic network analysis. This paper introduces a class of data dependencies called Temporal Graph Functional Dependencies (TGFDs). TGFDs generalize functional dependencies to temporal graphs as a sequence of graph snapshots that are induced by time intervals, and enforce both topological constraints and attribute value dependencies that must be satisfied by these snapshots. (1) We establish the complexity results for the satisfiability and implication problems of TGFDs. (2) We propose a sound and complete axiomatization system for TGFDs. (3) We also present efficient parallel algorithms to detect inconsistencies in temporal graphs as violations of TGFDs. The algorithms exploit data and temporal locality induced by time intervals, and use incremental pattern matching and load balancing strategies to enable feasible error detection in large temporal graphs. Using real datasets, we experimentally verify that our algorithms achieve lower runtimes compared to existing baselines, while improving the accuracy over error detection using existing graph data constraints, e.g., GFDs and GTARs with 55% and 74% gain in F1-score, respectively.
Submitted 25 July, 2022; v1 submitted 19 August, 2021;
originally announced August 2021.
-
Discovery and Contextual Data Cleaning with Ontology Functional Dependencies
Authors:
Zheng Zheng,
Longtao Zheng,
Morteza Alipour Langouri,
Fei Chiang,
Lukasz Golab,
Jaroslaw Szlichta
Abstract:
Functional Dependencies (FDs) define attribute relationships based on syntactic equality, and, when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We explore dependency-based data cleaning with Ontology Functional Dependencies (OFDs), which express semantic attribute relationships such as synonyms and is-a hierarchies defined by an ontology. We study the theoretical foundations for OFDs, including sound and complete axioms and a linear-time inference procedure. We then propose an algorithm for discovering OFDs (exact ones and ones that hold with some exceptions) from data that uses the axioms to prune the search space. Towards enabling OFDs as data quality rules in practice, we study the problem of finding minimal repairs to a relation and ontology with respect to a set of OFDs. We demonstrate the effectiveness of our techniques on real datasets, and show that OFDs can significantly reduce the number of false positive errors in data cleaning techniques that rely on traditional FDs.
Submitted 12 March, 2022; v1 submitted 17 May, 2021;
originally announced May 2021.
-
Privacy-Aware Data Cleaning-as-a-Service (Extended Version)
Authors:
Yu Huang,
Mostafa Milani,
Fei Chiang
Abstract:
Data cleaning is a pervasive problem for organizations as they try to reap value from their data. Recent advances in networking and cloud computing technology have fueled a new computing paradigm called Database-as-a-Service, where data management tasks are outsourced to large service providers. In this paper, we consider a Data Cleaning-as-a-Service model that allows a client to interact with a data cleaning provider who hosts curated and sensitive data. We present PACAS: a Privacy-Aware data Cleaning-As-a-Service model that facilitates interaction between the parties with client query requests for data, and a service provider using a data pricing scheme that computes prices according to data sensitivity. We propose new extensions to the model to define generalized data repairs that obfuscate sensitive data to allow data sharing between the client and service provider. We present a new semantic distance measure to quantify the utility of such repairs, and we re-define the notion of consistency in the presence of generalized values. The PACAS model uses (X,Y,L)-anonymity that extends existing data publishing techniques to consider the semantics in the data while protecting sensitive values. Our evaluation over real data shows that PACAS safeguards semantically related sensitive values, and provides lower repair errors compared to existing privacy-aware cleaning techniques.
Submitted 1 August, 2020;
originally announced August 2020.
-
Diversifying Anonymized Data with Diversity Constraints
Authors:
Mostafa Milani,
Yu Huang,
Fei Chiang
Abstract:
Recently introduced privacy legislation has aimed to restrict and control the amount of personal data published by companies and shared with third parties. Much of this real data is not only sensitive, requiring anonymization, but also contains characteristic details from a variety of individuals. This diversity is desirable in many applications ranging from Web search to drug and product development. Unfortunately, data anonymization techniques have largely ignored diversity in their published results. This inadvertently propagates underlying bias in subsequent data analysis. We study the problem of finding a diverse anonymized data instance where diversity is measured via a set of diversity constraints. We formalize diversity constraints and study their foundations such as implication and satisfiability. We show that determining the existence of a diverse, anonymized instance can be done in PTIME, and we present a clustering-based algorithm. We conduct extensive experiments using real and synthetic data showing the effectiveness of our techniques, and improvement over existing baselines. Our work aligns with recent trends towards responsible data science by coupling diversity with privacy-preserving data publishing.
Submitted 17 July, 2020;
originally announced July 2020.
-
Accelerated Experimental Design for Pairwise Comparisons
Authors:
Yuan Guo,
Jennifer Dy,
Deniz Erdogmus,
Jayashree Kalpathy-Cramer,
Susan Ostmo,
J. Peter Campbell,
Michael F. Chiang,
Stratis Ioannidis
Abstract:
Pairwise comparison labels are more informative and less variable than class labels, but generating them poses a challenge: their number grows quadratically in the dataset size. We study a natural experimental design objective, namely, D-optimality, that can be used to identify which $K$ pairwise comparisons to generate. This objective is known to perform well in practice, and is submodular, making the selection approximable via the greedy algorithm. A naïve greedy implementation has $O(N^2d^2K)$ complexity, where $N$ is the dataset size, $d$ is the feature space dimension, and $K$ is the number of generated comparisons. We show that, by exploiting the inherent geometry of the dataset--namely, that it consists of pairwise comparisons--the greedy algorithm's complexity can be reduced to $O(N^2(K+d)+N(dK+d^2) +d^2K).$ We apply the same acceleration also to the so-called lazy greedy algorithm. When combined, the above improvements lead to an execution time of less than 1 hour for a dataset with $10^8$ comparisons; the naïve greedy algorithm on the same dataset would require more than 10 days to terminate.
Submitted 17 January, 2019;
originally announced January 2019.
-
Deep feature transfer between localization and segmentation tasks
Authors:
Szu-Yeu Hu,
Andrew Beers,
Ken Chang,
Kathi Höbel,
J. Peter Campbell,
Deniz Erdogumus,
Stratis Ioannidis,
Jennifer Dy,
Michael F. Chiang,
Jayashree Kalpathy-Cramer,
James M. Brown
Abstract:
In this paper, we propose a new pre-training scheme for U-net based image segmentation. We first train the encoding arm as a localization network to predict the center of the target, before extending it into a U-net architecture for segmentation. We apply our proposed method to the problem of segmenting the optic disc from fundus photographs. Our work shows that the features learned by the encoding arm can be transferred to the segmentation network to reduce the annotation burden. We propose that such an approach could have broad utility for medical image segmentation, and alleviate the burden of delineating complex structures by pre-training on annotations that are much easier to acquire.
Submitted 10 November, 2018; v1 submitted 6 November, 2018;
originally announced November 2018.
-
High-resolution medical image synthesis using progressively grown generative adversarial networks
Authors:
Andrew Beers,
James Brown,
Ken Chang,
J. Peter Campbell,
Susan Ostmo,
Michael F. Chiang,
Jayashree Kalpathy-Cramer
Abstract:
Generative adversarial networks (GANs) are a class of unsupervised machine learning algorithms that can produce realistic images from randomly-sampled vectors in a multi-dimensional space. Until recently, it was not possible to generate realistic high-resolution images using GANs, which has limited their applicability to medical images that contain biomarkers only detectable at native resolution. Progressive growing of GANs is an approach wherein an image generator is trained to initially synthesize low resolution synthetic images (8x8 pixels), which are then fed to a discriminator that distinguishes these synthetic images from real downsampled images. Additional convolutional layers are then iteratively introduced to produce images at twice the previous resolution until the desired resolution is reached. In this work, we demonstrate that this approach can produce realistic medical images in two different domains: fundus photographs exhibiting vascular pathology associated with retinopathy of prematurity (ROP), and multi-modal magnetic resonance images of glioma. We also show that fine-grained details associated with pathology, such as retinal vessels or tumor heterogeneity, can be preserved and enhanced by including segmentation maps as additional channels. We envisage several applications of the approach, including image augmentation and unsupervised classification of pathology.
Submitted 9 May, 2018; v1 submitted 8 May, 2018;
originally announced May 2018.
-
Efficient Discovery of Ontology Functional Dependencies
Authors:
Sridevi Baskaran,
Alexander Keller,
Fei Chiang,
Golab Lukasz,
Jaroslaw Szlichta
Abstract:
Poor data quality has become a pervasive issue due to the increasing complexity and size of modern datasets. Constraint-based data cleaning techniques rely on integrity constraints as a benchmark to identify and correct errors. Data values that do not satisfy the given set of constraints are flagged as dirty, and data updates are made to re-align the data and the constraints. However, many errors often require user input to resolve due to domain expertise defining specific terminology and relationships. For example, in pharmaceuticals, 'Advil' is-a brand name for 'ibuprofen' that can be captured in a pharmaceutical ontology. While functional dependencies (FDs) have traditionally been used in existing data cleaning solutions to model syntactic equivalence, they are not able to model broader relationships (e.g., is-a) defined by an ontology. In this paper, we take a first step towards extending the set of data quality constraints used in data cleaning by defining and discovering Ontology Functional Dependencies (OFDs). We lay out theoretical and practical foundations for OFDs, including a set of sound and complete axioms, and a linear inference procedure. We then develop effective algorithms for discovering OFDs, and a set of optimizations that efficiently prune the search space. Our experimental evaluation using real data shows the scalability and accuracy of our algorithms.
Submitted 23 May, 2017; v1 submitted 8 November, 2016;
originally announced November 2016.