-
Graph-based Active Learning for Entity Cluster Repair
Authors:
Victor Christen,
Daniel Obraczka,
Marvin Hofer,
Martin Franke,
Erhard Rahm
Abstract:
Cluster repair methods aim to determine errors in clusters and modify them so that each cluster consists of records representing the same entity. Current cluster repair methodologies primarily assume duplicate-free data sources, where each record from one source corresponds to a unique record from another. However, real-world data often deviates from this assumption due to quality issues. Recent approaches apply clustering methods in combination with link categorization methods so they can be applied to data sources with duplicates. Nevertheless, the results do not show a clear picture, since the quality varies considerably depending on the configuration and dataset. In this study, we introduce a novel approach for cluster repair that utilizes graph metrics derived from the underlying similarity graphs. These metrics are used to construct a classification model that distinguishes between correct and incorrect edges. To address the challenge of limited training data, we integrate an active learning mechanism tailored to cluster-specific attributes. The evaluation shows that the method outperforms existing cluster repair methods without distinguishing between duplicate-free or dirty data sources. Notably, our modified active learning strategy exhibits enhanced performance when dealing with datasets containing duplicates, showcasing its effectiveness in such scenarios.
Submitted 26 January, 2024;
originally announced January 2024.
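As a rough illustration of the graph metrics mentioned in the abstract above, the sketch below derives a few per-edge features from a small similarity graph. The feature choice (endpoint degrees, shared neighbors, similarity weight), the function name `edge_features`, and the example data are assumptions made for illustration; they are not the feature set or code of the paper.

```python
from collections import defaultdict


def edge_features(edges):
    """Compute simple structural features for each edge of a similarity graph.

    `edges` maps an undirected edge (u, v) to its similarity score.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    features = {}
    for (u, v), similarity in edges.items():
        shared = neighbors[u] & neighbors[v]
        features[(u, v)] = {
            "similarity": similarity,
            "degree_u": len(neighbors[u]),
            "degree_v": len(neighbors[v]),
            "shared_neighbors": len(shared),
        }
    return features


# Hypothetical cluster: a, b, c form a triangle; d hangs off c by a weak link.
example = {("a", "b"): 0.9, ("b", "c"): 0.85, ("a", "c"): 0.8, ("c", "d"): 0.4}
feats = edge_features(example)
```

Feature vectors of this kind could then be passed to any classifier that separates correct from incorrect edges; an edge such as `("c", "d")`, with no shared neighbors and a low similarity, is the sort of candidate an active learner might ask a human to label.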
-
Generating Synthetic Health Sensor Data for Privacy-Preserving Wearable Stress Detection
Authors:
Lucas Lange,
Nils Wenzlitschke,
Erhard Rahm
Abstract:
Smartwatch health sensor data are increasingly utilized in smart health applications and patient monitoring, including stress detection. However, such medical data often comprise sensitive personal information and are resource-intensive to acquire for research purposes. In response to this challenge, we introduce the privacy-aware synthetization of multi-sensor smartwatch health readings related to moments of stress, employing Generative Adversarial Networks (GANs) and Differential Privacy (DP) safeguards. Our method not only protects patient information but also enhances data availability for research. To ensure its usefulness, we test synthetic data from multiple GANs and employ different data enhancement strategies on an actual stress detection task. Our GAN-based augmentation methods demonstrate significant improvements in model performance, with private DP training scenarios observing an 11.90-15.48% increase in F1-score, while non-private training scenarios still see a 0.45% boost. These results underline the potential of differentially private synthetic data in optimizing utility-privacy trade-offs, especially with the limited availability of real training samples. Through rigorous quality assessments, we confirm the integrity and plausibility of our synthetic data, which, however, are significantly impacted when increasing privacy requirements.
Submitted 14 May, 2024; v1 submitted 24 January, 2024;
originally announced January 2024.
-
Privacy at Risk: Exploiting Similarities in Health Data for Identity Inference
Authors:
Lucas Lange,
Tobias Schreieder,
Victor Christen,
Erhard Rahm
Abstract:
Smartwatches enable the efficient collection of health data that can be used for research and comprehensive analysis to improve the health of individuals. In addition to the analysis capabilities, ensuring privacy when handling health data is a critical concern as the collection and analysis of such data become pervasive. Since health data contains sensitive information, it should be handled responsibly and is therefore often anonymized. However, the data itself can also be exploited to reveal information and break anonymity. We propose a novel similarity-based re-identification attack on time-series health data and thereby unveil a significant vulnerability. Despite privacy measures that remove identifying information, our attack demonstrates that a small amount of varied sensor data from a target individual can be enough to identify them within a database of other samples, solely based on sensor-level similarities. In our example scenario, where data owners leverage health data from smartwatches, findings show that we are able to correctly link the target data in two out of three cases. User privacy is thus inherently threatened by the data itself, even when personal information is removed.
Submitted 16 August, 2023;
originally announced August 2023.
-
Construction of Knowledge Graphs: State and Challenges
Authors:
Marvin Hofer,
Daniel Obraczka,
Alieh Saeedi,
Hanna Köpcke,
Erhard Rahm
Abstract:
With knowledge graphs (KGs) at the center of numerous applications such as recommender systems and question answering, the need for generalized pipelines to construct and continuously update such KGs is increasing. While the individual steps that are necessary to create KGs from unstructured (e.g. text) and structured data sources (e.g. databases) are mostly well-researched for their one-shot execution, their adoption for incremental KG updates and the interplay of the individual steps have hardly been investigated in a systematic manner so far. In this work, we first discuss the main graph models for KGs and introduce the major requirements for future KG construction pipelines. Next, we provide an overview of the necessary steps to build high-quality KGs, including cross-cutting topics such as metadata management, ontology development, and quality assurance. We then evaluate the state of the art of KG construction w.r.t. the introduced requirements for specific popular KGs as well as some recent tools and strategies for KG construction. Finally, we identify areas in need of further research and improvement.
Submitted 11 October, 2023; v1 submitted 22 February, 2023;
originally announced February 2023.
-
Privacy in Practice: Private COVID-19 Detection in X-Ray Images (Extended Version)
Authors:
Lucas Lange,
Maja Schneider,
Peter Christen,
Erhard Rahm
Abstract:
Machine learning (ML) can help fight pandemics like COVID-19 by enabling rapid screening of large volumes of images. To perform data analysis while maintaining patient privacy, we create ML models that satisfy Differential Privacy (DP). Previous works exploring private COVID-19 models are in part based on small datasets, provide weaker or unclear privacy guarantees, and do not investigate practical privacy. We suggest improvements to address these open gaps. We account for inherent class imbalances and evaluate the utility-privacy trade-off more extensively and over stricter privacy budgets. Our evaluation is supported by empirically estimating practical privacy through black-box Membership Inference Attacks (MIAs). The introduced DP should help limit leakage threats posed by MIAs, and our practical analysis is the first to test this hypothesis on the COVID-19 classification task. Our results indicate that needed privacy levels might differ based on the task-dependent practical threat from MIAs. The results further suggest that with increasing DP guarantees, empirical privacy leakage only improves marginally, and DP therefore appears to have a limited impact on practical MIA defense. Our findings identify possibilities for better utility-privacy trade-offs, and we believe that empirical attack-specific privacy estimation can play a vital role in tuning for practical privacy.
Submitted 26 April, 2023; v1 submitted 21 November, 2022;
originally announced November 2022.
-
Bitemporal Property Graphs to Organize Evolving Systems
Authors:
Christopher Rost,
Philip Fritzsche,
Lucas Schons,
Maximilian Zimmer,
Dieter Gawlick,
Erhard Rahm
Abstract:
This work is a summarized view on the results of a one-year cooperation between Oracle Corp. and the University of Leipzig. The goal was to research the organization of relationships within multi-dimensional time-series data, such as sensor data from the IoT area. We showed in this project that temporal property graphs with some extensions are a prime candidate for this organizational task that combines the strengths of both data models (graph and time-series). The outcome of the cooperation includes four achievements: (1) a bitemporal property graph model, (2) a temporal graph query language, (3) a conception of continuous event detection, and (4) a prototype of a bitemporal graph database that supports the model, language and event detection.
Submitted 26 November, 2021;
originally announced November 2021.
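To make the bitemporal idea in the abstract above concrete, here is a minimal sketch of an edge that carries both a valid-time interval (when the relationship holds in the modeled world) and a transaction-time interval (when the database believed it). The class `BitemporalEdge`, its field names, the half-open interval convention, and the sensor example are all assumptions for illustration, not the model defined in the paper.

```python
from dataclasses import dataclass, field

FOREVER = float("inf")  # open-ended upper bound for an interval


@dataclass
class BitemporalEdge:
    source: str
    target: str
    label: str
    valid_from: float  # start of validity in the modeled world
    valid_to: float    # end of validity (exclusive)
    tx_from: float     # when this version was recorded
    tx_to: float       # when this version was superseded (exclusive)
    properties: dict = field(default_factory=dict)

    def visible(self, valid_at, known_at):
        """True if the edge holds at `valid_at` according to what was known at `known_at`."""
        return (self.valid_from <= valid_at < self.valid_to
                and self.tx_from <= known_at < self.tx_to)


def snapshot(edges, valid_at, known_at):
    """Return the edges visible for a given valid time and transaction time."""
    return [e for e in edges if e.visible(valid_at, known_at)]


# Hypothetical history: at transaction time 5 the end of the attachment
# is corrected from 10 to 7, so the old version is closed, not deleted.
history = [
    BitemporalEdge("sensor1", "pumpA", "attached_to", 0, 10, 0, 5),
    BitemporalEdge("sensor1", "pumpA", "attached_to", 0, 7, 5, FOREVER),
]
```

Querying `snapshot(history, valid_at=8, known_at=3)` answers "what did we believe at time 3 about time 8", while the same query with `known_at=6` reflects the later correction.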
-
EAGER: Embedding-Assisted Entity Resolution for Knowledge Graphs
Authors:
Daniel Obraczka,
Jonathan Schuchart,
Erhard Rahm
Abstract:
Entity Resolution (ER) is an essential part of integrating different knowledge graphs, as it identifies entities referring to the same real-world object. A promising approach is the use of graph embeddings for ER in order to determine the similarity of entities based on the similarity of their graph neighborhood. The similarity computations for such embeddings translate to calculating the distance between them in the embedding space, which is comparatively simple. However, previous work has shown that the use of graph embeddings alone is not sufficient to achieve high ER quality. We therefore propose a more comprehensive ER approach for knowledge graphs called EAGER (Embedding-Assisted Knowledge Graph Entity Resolution) to flexibly utilize both the similarity of graph embeddings and attribute values within a supervised machine learning approach. We evaluate our approach on 23 benchmark datasets with differently sized and structured knowledge graphs and use hypothesis tests to ensure statistical significance of our results. Furthermore, we compare our approach with state-of-the-art ER solutions, where our approach yields competitive results for table-oriented ER problems and shallow knowledge graphs but much better results for deeper knowledge graphs.
Submitted 15 January, 2021;
originally announced January 2021.
-
ErGAN: Generative Adversarial Networks for Entity Resolution
Authors:
Jingyu Shao,
Qing Wang,
Asiri Wijesinghe,
Erhard Rahm
Abstract:
Entity resolution aims to identify records that represent the same real-world entity from one or more datasets. A major challenge in learning-based entity resolution is how to reduce the label cost for training. Due to the quadratic nature of record pair comparison, labeling is a costly task that often requires a significant effort from human experts. Inspired by recent advances in generative adversarial networks (GANs), we propose a novel deep learning method, called ErGAN, to address the challenge. ErGAN consists of two key components: a label generator and a discriminator, which are optimized alternately through adversarial learning. To alleviate the issues of overfitting and highly imbalanced distribution, we design two novel modules for diversity and propagation, which can greatly improve the model generalization power. We have conducted extensive experiments to empirically verify the labeling and learning efficiency of ErGAN. The experimental results show that ErGAN beats the state-of-the-art baselines, including unsupervised, semi-supervised, and supervised learning methods.
Submitted 17 December, 2020;
originally announced December 2020.
-
LEAPME: Learning-based Property Matching with Embeddings
Authors:
Daniel Ayala,
Inma Hernández,
David Ruiz,
Erhard Rahm
Abstract:
Data integration tasks such as the creation and extension of knowledge graphs involve the fusion of heterogeneous entities from many sources. Matching and fusing such entities also requires matching and combining their properties (attributes). However, previous schema matching approaches mostly focus on two sources only and often rely on simple similarity measurements. They thus face problems in challenging use cases such as the integration of heterogeneous product entities from many sources.
We therefore present a new machine learning-based property matching approach called LEAPME (LEArning-based Property Matching with Embeddings) that utilizes numerous features of both property names and instance values. The approach makes heavy use of word embeddings to better utilize the domain-specific semantics of both property names and instance values. The use of supervised machine learning helps exploit the predictive power of word embeddings.
Our comparative evaluation against five baselines for several multi-source datasets with real-world data shows the high effectiveness of LEAPME. We also show that our approach is effective even when training data from another domain (transfer learning) is used.
Submitted 5 October, 2020;
originally announced October 2020.
-
Retrofitting Fine Grain Isolation in the Firefox Renderer (Extended Version)
Authors:
Shravan Narayan,
Craig Disselkoen,
Tal Garfinkel,
Nathan Froyd,
Eric Rahm,
Sorin Lerner,
Hovav Shacham,
Deian Stefan
Abstract:
Firefox and other major browsers rely on dozens of third-party libraries to render audio, video, images, and other content. These libraries are a frequent source of vulnerabilities. To mitigate this threat, we are migrating Firefox to an architecture that isolates these libraries in lightweight sandboxes, dramatically reducing the impact of a compromise.
Retrofitting isolation is labor-intensive, very prone to security bugs, and requires critical attention to performance. To help, we developed RLBox, a framework that minimizes the burden of converting Firefox to securely and efficiently use untrusted code. To enable this, RLBox employs static information flow enforcement and lightweight dynamic checks, expressed directly in the C++ type system.
RLBox supports efficient sandboxing through either software-based fault isolation or multi-core process isolation. Performance overheads are modest and transient, and have only minor impact on page latency. We demonstrate this by sandboxing performance-sensitive image decoding libraries (libjpeg and libpng), video decoding libraries (libtheora and libvpx), the libvorbis audio decoding library, and the zlib decompression library.
RLBox, using a WebAssembly sandbox, has been integrated into production Firefox to sandbox the libGraphite font shaping library.
Submitted 9 March, 2020; v1 submitted 1 March, 2020;
originally announced March 2020.
-
Incremental Clustering Techniques for Multi-Party Privacy-Preserving Record Linkage
Authors:
Dinusha Vatsalan,
Peter Christen,
Erhard Rahm
Abstract:
Privacy-Preserving Record Linkage (PPRL) supports the integration of sensitive information from multiple datasets, in particular the privacy-preserving matching of records referring to the same entity. PPRL has gained much attention in many application areas, with the most prominent ones in the healthcare domain. PPRL techniques tackle this problem by conducting linkage on masked (encoded) values. Employing PPRL on records from multiple (more than two) parties/sources (multi-party PPRL, MP-PPRL) is an increasingly important but challenging problem that so far has not been sufficiently solved. Existing MP-PPRL approaches are limited to finding only those entities that are present in all parties, thereby missing entities that match only in a subset of parties. Furthermore, previous MP-PPRL approaches face substantial scalability limitations due to the need for a large number of comparisons between masked records. We thus propose and evaluate new MP-PPRL approaches that find matches in any subset of parties and still scale to many parties. Our approaches maintain all matches within clusters, where these clusters are incrementally extended or refined by considering records from one party after the other. An empirical evaluation using multiple real datasets ranging from 3 to 26 parties, each containing up to 5 million records, validates that our protocols are efficient and significantly outperform existing MP-PPRL approaches in terms of linkage quality and scalability.
Submitted 28 November, 2019;
originally announced November 2019.
-
Graph Sampling with Distributed In-Memory Dataflow Systems
Authors:
Kevin Gomez,
Matthias Täschner,
M. Ali Rostami,
Christopher Rost,
Erhard Rahm
Abstract:
Given a large graph, a graph sample determines a subgraph with similar characteristics for certain metrics of the original graph. The samples are much smaller thereby accelerating and simplifying the analysis and visualization of large graphs. We focus on the implementation of distributed graph sampling for Big Data frameworks and in-memory dataflow systems such as Apache Spark or Apache Flink. We evaluate the scalability of the new implementations and analyze to what degree the sampling approaches preserve certain graph metrics compared to the original graph. The latter analysis also uses comparative graph visualizations. The presented methods will be open source and be integrated into Gradoop, a system for distributed graph analytics.
Submitted 10 October, 2019;
originally announced October 2019.
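The abstract above does not name specific sampling algorithms, so the following is only a single-machine sketch of one common technique, random vertex sampling, in which each vertex is kept with a fixed probability and only edges between kept vertices survive. The function name `random_vertex_sample` and its parameters are assumptions for illustration; it does not use Apache Flink, Apache Spark, or Gradoop.

```python
import random


def random_vertex_sample(vertices, edges, fraction, seed=0):
    """Keep each vertex with probability `fraction` and return the induced subgraph."""
    rng = random.Random(seed)  # seeded so that a sample is reproducible
    kept = {v for v in vertices if rng.random() < fraction}
    # An edge survives only if both of its endpoints were sampled.
    kept_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, kept_edges


# Hypothetical input graph: a ring of 100 vertices.
ring_vertices = list(range(100))
ring_edges = [(i, (i + 1) % 100) for i in range(100)]
sample_vertices, sample_edges = random_vertex_sample(ring_vertices, ring_edges, 0.3)
```

In a dataflow system the same logic would typically be expressed as a filter on the vertex dataset followed by joins that drop dangling edges, which is what makes this family of samplers straightforward to distribute.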
-
Distributed Holistic Clustering on Linked Data
Authors:
Markus Nentwig,
Anika Groß,
Maximilian Möller,
Erhard Rahm
Abstract:
Link discovery is an active field of research to support data integration in the Web of Data. Due to the huge size and number of available data sources, efficient and effective link discovery is a very challenging task. Common pairwise link discovery approaches do not scale to many sources with very large entity sets. We here propose a distributed holistic approach to link many data sources based on a clustering of entities that represent the same real-world object. Our clustering approach provides a compact and fused representation of entities, and can identify errors in existing links as well as many new links. We support a distributed execution of the clustering approach to achieve faster execution times and scalability for large real-world data sets. We provide a novel gold standard for multi-source clustering, and evaluate our methods with respect to effectiveness and efficiency for large data sets from the geographic and music domains.
Submitted 30 August, 2017;
originally announced August 2017.
-
DIMSpan - Transactional Frequent Subgraph Mining with Distributed In-Memory Dataflow Systems
Authors:
André Petermann,
Martin Junghanns,
Erhard Rahm
Abstract:
Transactional frequent subgraph mining identifies frequent subgraphs in a collection of graphs. This research problem has wide applicability and increasingly requires higher scalability over single machine solutions to address the needs of Big Data use cases. We introduce DIMSpan, an advanced approach to frequent subgraph mining that utilizes the features provided by distributed in-memory dataflow systems such as Apache Spark or Apache Flink. It determines the complete set of frequent subgraphs from arbitrary string-labeled directed multigraphs as they occur in social, business and knowledge networks. DIMSpan is optimized for runtime and minimal network traffic while remaining memory-aware. An extensive performance evaluation on large graph collections shows the scalability of DIMSpan and the effectiveness of its pruning and optimization techniques.
Submitted 6 March, 2017;
originally announced March 2017.
-
Scalable Multi-Database Privacy-Preserving Record Linkage using Counting Bloom Filters
Authors:
Dinusha Vatsalan,
Peter Christen,
Erhard Rahm
Abstract:
Privacy-preserving record linkage (PPRL) aims at integrating sensitive information from multiple disparate databases of different organizations. PPRL approaches are increasingly required in real-world application areas such as healthcare, national security, and business. Previous approaches have mostly focused on linking only two databases as well as the use of a dedicated linkage unit. Scaling PPRL to more databases (multi-party PPRL) is an open challenge since privacy threats as well as the computation and communication costs for record linkage increase significantly with the number of databases. We thus propose the use of a new encoding method of sensitive data based on Counting Bloom Filters (CBF) to improve privacy for multi-party PPRL. We also investigate optimizations to reduce communication and computation costs for CBF-based multi-party PPRL with and without the use of a dedicated linkage unit. Empirical evaluations conducted with real datasets show the viability of the proposed approaches and demonstrate their scalability, linkage quality, and privacy protection.
Submitted 5 January, 2017;
originally announced January 2017.
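The Counting Bloom Filter encoding described in the abstract above can be sketched as follows: each party hashes the q-grams of a value into a Bloom filter, the filters are summed position by position into a CBF, and a Dice-style similarity is read off the counts. The function names, the filter size, the choice of hash functions, and the similarity formula (number of parties times the number of positions set by all parties, divided by the sum of all counts) are assumptions for illustration. The plain summation here stands in for a secure summation protocol, which is what would keep individual filters hidden in a real deployment.

```python
import hashlib


def qgrams(value, q=2):
    """Split a string into its set of overlapping q-grams."""
    return {value[i:i + q] for i in range(len(value) - q + 1)}


def bloom_filter(value, size=64, num_hashes=2):
    """Encode the q-grams of `value` into a bit list using double hashing."""
    bits = [0] * size
    for gram in qgrams(value):
        h1 = int(hashlib.sha1(gram.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha256(gram.encode()).hexdigest(), 16)
        for i in range(num_hashes):
            bits[(h1 + i * h2) % size] = 1
    return bits


def counting_bloom_filter(filters):
    """Sum the parties' Bloom filters position by position."""
    return [sum(column) for column in zip(*filters)]


def dice_from_cbf(cbf, num_parties):
    """Dice-style similarity: positions set by every party versus all set bits."""
    total = sum(cbf)
    if total == 0:
        return 0.0
    common = sum(1 for count in cbf if count == num_parties)
    return num_parties * common / total


# Hypothetical name variants held by three different parties.
party_filters = [bloom_filter(v) for v in ("smith", "smyth", "smithe")]
cbf = counting_bloom_filter(party_filters)
similarity = dice_from_cbf(cbf, num_parties=3)
```

Because only the summed counts are needed to compute the similarity, no party has to reveal which bit positions its own record set.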
-
GRADOOP: Scalable Graph Data Management and Analytics with Hadoop
Authors:
Martin Junghanns,
André Petermann,
Kevin Gómez,
Erhard Rahm
Abstract:
Many Big Data applications in business and science require the management and analysis of huge amounts of graph data. Previous approaches for graph analytics such as graph databases and parallel graph processing systems (e.g., Pregel) either lack sufficient scalability or flexibility and expressiveness. We are therefore developing a new end-to-end approach for graph data management and analysis based on the Hadoop ecosystem, called Gradoop (Graph analytics on Hadoop). Gradoop is designed around the so-called Extended Property Graph Data Model (EPGM) supporting semantically rich, schema-free graph data within many distinct graphs. A set of high-level operators is provided for analyzing both single graphs and collections of graphs. Based on these operators, we propose a domain-specific language to define analytical workflows. The Gradoop graph store is currently utilizing HBase for distributed storage of graph data in Hadoop clusters. An initial version of Gradoop has been used to analyze graph data for business intelligence and social network analysis.
Submitted 2 June, 2015; v1 submitted 1 June, 2015;
originally announced June 2015.
-
Semi-automatic identification of counterfeit offers in online shopping platforms
Authors:
Christian Wartner,
Patrick Arnold,
Erhard Rahm
Abstract:
Product counterfeiting is a serious problem causing the industry estimated losses of billions of dollars every year. With the increasing spread of e-commerce, the number of counterfeit products sold online increased substantially. We propose the adoption of a semi-automatic workflow to identify likely counterfeit offers in online platforms and to present these offers to a domain expert for manual verification. The workflow includes steps to generate search queries for relevant product offers, to match and cluster similar product offers, and to assess the counterfeit suspiciousness based on different criteria. The goal is to support the periodic identification of many counterfeit offers with a limited amount of manual effort. We explain how the proposed approach can be realized. We also present a preliminary evaluation of its most important steps on a case study using the eBay platform.
Submitted 2 April, 2015;
originally announced April 2015.
-
How do Ontology Mappings Change in the Life Sciences?
Authors:
Anika Gross,
Michael Hartung,
Andreas Thor,
Erhard Rahm
Abstract:
Mappings between related ontologies are increasingly used to support data integration and analysis tasks. Changes in the ontologies also require the adaptation of ontology mappings. So far the evolution of ontology mappings has received little attention, although ontologies change continuously, especially in the life sciences. We therefore analyze how mappings between popular life science ontologies evolve for different match algorithms. We also evaluate which semantic ontology changes primarily affect the mappings. We further investigate alternatives to predict or estimate the degree of future mapping changes based on previous ontology and mapping transitions.
Submitted 12 April, 2012;
originally announced April 2012.
-
Rule-based Construction of Matching Processes
Authors:
Eric Peukert,
Julian Eberius,
Erhard Rahm
Abstract:
Mapping complex metadata structures is crucial in a number of domains such as data integration, ontology alignment or model management. To speed up that process, automatic matching systems were developed to compute mapping suggestions that can be corrected by a user. However, constructing and tuning match strategies still requires a high manual effort by matching experts as well as correct mappings to evaluate generated mappings. We therefore propose a self-configuring schema matching system that is able to automatically adapt to the given mapping problem at hand. Our approach is based on analyzing the input schemas as well as intermediate matching results. A variety of matching rules use the analysis results to automatically construct and adapt an underlying matching process for a given match task. We comprehensively evaluate our approach on different mapping problems from the schema, ontology and model management domains. The evaluation shows that our system is able to robustly return good quality mappings across different mapping problems and domains.
Submitted 9 August, 2011;
originally announced August 2011.
-
Load Balancing for MapReduce-based Entity Resolution
Authors:
Lars Kolb,
Andreas Thor,
Erhard Rahm
Abstract:
The effectiveness and scalability of MapReduce-based implementations of complex data-intensive tasks depend on an even redistribution of data between map and reduce tasks. In the presence of skewed data, sophisticated redistribution approaches thus become necessary to achieve load balancing among all reduce tasks to be executed in parallel. For the complex problem of entity resolution, we propose and evaluate two approaches for such skew handling and load balancing. The approaches support blocking techniques to reduce the search space of entity resolution, utilize a preprocessing MapReduce job to analyze the data distribution, and distribute the entities of large blocks among multiple reduce tasks. The evaluation on a real cloud infrastructure shows the value and effectiveness of the proposed load balancing approaches.
Submitted 8 August, 2011;
originally announced August 2011.
-
Target-driven merging of Taxonomies
Authors:
Salvatore Raunich,
Erhard Rahm
Abstract:
The proliferation of ontologies and taxonomies in many domains increasingly demands the integration of multiple such ontologies. The goal of ontology integration is to merge two or more given ontologies in order to provide a unified view on the input ontologies while maintaining all information coming from them. We propose a new taxonomy merging algorithm that, given as input two taxonomies and an equivalence matching between them, can generate an integrated taxonomy in a fully automatic manner. The approach is target-driven, i.e. we merge a source taxonomy into the target taxonomy and preserve the structure of the target ontology as much as possible. We also discuss how to extend the merge algorithm providing auxiliary information, like additional relationships between source and target concepts, in order to semantically improve the final result. The algorithm was implemented in a working prototype and evaluated using synthetic and real-world scenarios.
Submitted 21 December, 2010;
originally announced December 2010.
-
Parallel Sorted Neighborhood Blocking with MapReduce
Authors:
Lars Kolb,
Andreas Thor,
Erhard Rahm
Abstract:
Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate challenges and possible solutions of using the MapReduce programming model for parallel entity resolution. In particular, we propose and evaluate two MapReduce-based implementations for Sorted Neighborhood blocking that either use multiple MapReduce jobs or apply a tailored data replication.
Submitted 14 October, 2010;
originally announced October 2010.
-
Rule-based Generation of Diff Evolution Mappings between Ontology Versions
Authors:
Michael Hartung,
Anika Groß,
Erhard Rahm
Abstract:
Ontologies such as taxonomies, product catalogs or web directories are heavily used and hence evolve frequently to meet new requirements or to better reflect the current instance data of a domain. To effectively manage the evolution of ontologies it is essential to identify the difference (Diff) between two ontology versions. We propose a novel approach to determine an expressive and invertible diff evolution mapping between given versions of an ontology. Our approach utilizes the result of a match operation to determine an evolution mapping consisting of a set of basic change operations (insert/update/delete). To semantically enrich the evolution mapping we adopt a rule-based approach to transform the basic change operations into a smaller set of more complex change operations, such as merge, split, or changes of entire subgraphs. The proposed algorithm is customizable in different ways to meet the requirements of diverse ontologies and application scenarios. We evaluate the proposed approach by determining and analyzing evolution mappings for real-world life science ontologies and web directories.
Submitted 1 October, 2010;
originally announced October 2010.
-
Data Partitioning for Parallel Entity Matching
Authors:
Toralf Kirsten,
Lars Kolb,
Michael Hartung,
Anika Groß,
Hanna Köpcke,
Erhard Rahm
Abstract:
Entity matching is an important and difficult step for integrating web data. To reduce the typically high execution time for matching we investigate how we can perform entity matching in parallel on a distributed infrastructure. We propose different strategies to partition the input data and generate multiple match tasks that can be independently executed. One of our strategies supports both blocking to reduce the search space for matching and parallel matching to improve efficiency. Special attention is given to the number and size of data partitions as they impact the overall communication overhead and memory requirements of individual match tasks. We have developed a service-based distributed infrastructure for the parallel execution of match workflows. We evaluate our approach in detail for different match strategies for matching real-world product data of different web shops. We also consider caching of input entities and affinity-based scheduling of match tasks.
Submitted 28 June, 2010;
originally announced June 2010.
-
Evaluation of Query Generators for Entity Search Engines
Authors:
Stefan Endrullis,
Andreas Thor,
Erhard Rahm
Abstract:
Dynamic web applications such as mashups need efficient access to web data that is only accessible via entity search engines (e.g. product or publication search engines). However, most current mashup systems and applications only support simple keyword searches for retrieving data from search engines. We propose the use of more powerful search strategies building on so-called query generators. For a given set of entities query generators are able to automatically determine a set of search queries to retrieve these entities from an entity search engine. We demonstrate the usefulness of query generators for on-demand web data integration and evaluate the effectiveness and efficiency of query generators for a challenging real-world integration scenario.
Submitted 23 March, 2010;
originally announced March 2010.