Showing 1–25 of 25 results for author: Rahm, E

  1. arXiv:2401.14992  [pdf, other]

    cs.LG cs.DB

    Graph-based Active Learning for Entity Cluster Repair

    Authors: Victor Christen, Daniel Obraczka, Marvin Hofer, Martin Franke, Erhard Rahm

    Abstract: Cluster repair methods aim to determine errors in clusters and modify them so that each cluster consists of records representing the same entity. Current cluster repair methodologies primarily assume duplicate-free data sources, where each record from one source corresponds to a unique record from another. However, real-world data often deviates from this assumption due to quality issues. Recent a…

    Submitted 26 January, 2024; originally announced January 2024.

  2. Generating Synthetic Health Sensor Data for Privacy-Preserving Wearable Stress Detection

    Authors: Lucas Lange, Nils Wenzlitschke, Erhard Rahm

    Abstract: Smartwatch health sensor data are increasingly utilized in smart health applications and patient monitoring, including stress detection. However, such medical data often comprise sensitive personal information and are resource-intensive to acquire for research purposes. In response to this challenge, we introduce the privacy-aware synthetization of multi-sensor smartwatch health readings related t…

    Submitted 14 May, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Published in the MDPI Sensors Journal

    Journal ref: Sensors 2024, 24(10), 3052

  3. arXiv:2308.08310  [pdf, other]

    cs.CR

    Privacy at Risk: Exploiting Similarities in Health Data for Identity Inference

    Authors: Lucas Lange, Tobias Schreieder, Victor Christen, Erhard Rahm

    Abstract: Smartwatches enable the efficient collection of health data that can be used for research and comprehensive analysis to improve the health of individuals. In addition to the analysis capabilities, ensuring privacy when handling health data is a critical concern as the collection and analysis of such data become pervasive. Since health data contains sensitive information, it should be handled with…

    Submitted 16 August, 2023; originally announced August 2023.

  4. arXiv:2302.11509  [pdf, other]

    cs.AI cs.DB cs.LG

    Construction of Knowledge Graphs: State and Challenges

    Authors: Marvin Hofer, Daniel Obraczka, Alieh Saeedi, Hanna Köpcke, Erhard Rahm

    Abstract: With knowledge graphs (KGs) at the center of numerous applications such as recommender systems and question answering, the need for generalized pipelines to construct and continuously update such KGs is increasing. While the individual steps that are necessary to create KGs from unstructured (e.g. text) and structured data sources (e.g. databases) are mostly well-researched for their one-shot exec…

    Submitted 11 October, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

    Comments: 51 pages, 5 figures, 4 tables, 328 references

  5. arXiv:2211.11434  [pdf, other]

    cs.LG cs.CR cs.CV

    Privacy in Practice: Private COVID-19 Detection in X-Ray Images (Extended Version)

    Authors: Lucas Lange, Maja Schneider, Peter Christen, Erhard Rahm

    Abstract: Machine learning (ML) can help fight pandemics like COVID-19 by enabling rapid screening of large volumes of images. To perform data analysis while maintaining patient privacy, we create ML models that satisfy Differential Privacy (DP). Previous works exploring private COVID-19 models are in part based on small datasets, provide weaker or unclear privacy guarantees, and do not investigate practica…

    Submitted 26 April, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

    Comments: Extended version of the paper accepted at the 20th International Conference on Security and Cryptography SECRYPT 2023. This version is more detailed and includes additional content: a longer results chapter and an appendix containing a proof

    Journal ref: Proceedings of the 20th International Conference on Security and Cryptography - SECRYPT 2023

  6. arXiv:2111.13499  [pdf, other]

    cs.DB cs.SI

    Bitemporal Property Graphs to Organize Evolving Systems

    Authors: Christopher Rost, Philip Fritzsche, Lucas Schons, Maximilian Zimmer, Dieter Gawlick, Erhard Rahm

    Abstract: This work is a summarized view on the results of a one-year cooperation between Oracle Corp. and the University of Leipzig. The goal was to research the organization of relationships within multi-dimensional time-series data, such as sensor data from the IoT area. We showed in this project that temporal property graphs with some extensions are a prime candidate for this organizational task that co…

    Submitted 26 November, 2021; originally announced November 2021.

    Comments: 21 pages

  7. arXiv:2101.06126  [pdf, other]

    cs.LG cs.DB

    EAGER: Embedding-Assisted Entity Resolution for Knowledge Graphs

    Authors: Daniel Obraczka, Jonathan Schuchart, Erhard Rahm

    Abstract: Entity Resolution (ER) is a constitutional part for integrating different knowledge graphs in order to identify entities referring to the same real-world object. A promising approach is the use of graph embeddings for ER in order to determine the similarity of entities based on the similarity of their graph neighborhood. The similarity computations for such embeddings translates to calculating the…

    Submitted 15 January, 2021; originally announced January 2021.

    Comments: 10 pages, 7 figures

    ACM Class: I.2.4; I.2.6

  8. arXiv:2012.10004  [pdf, other]

    cs.LG cs.DB

    ErGAN: Generative Adversarial Networks for Entity Resolution

    Authors: Jingyu Shao, Qing Wang, Asiri Wijesinghe, Erhard Rahm

    Abstract: Entity resolution targets at identifying records that represent the same real-world entity from one or more datasets. A major challenge in learning-based entity resolution is how to reduce the label cost for training. Due to the quadratic nature of record pair comparison, labeling is a costly task that often requires a significant effort from human experts. Inspired by recent advances of generativ…

    Submitted 17 December, 2020; originally announced December 2020.

  9. arXiv:2010.01951  [pdf, other]

    cs.DB cs.LG

    LEAPME: Learning-based Property Matching with Embeddings

    Authors: Daniel Ayala, Inma Hernández, David Ruiz, Erhard Rahm

    Abstract: Data integration tasks such as the creation and extension of knowledge graphs involve the fusion of heterogeneous entities from many sources. Matching and fusion of such entities require to also match and combine their properties (attributes). However, previous schema matching approaches mostly focus on two sources only and often rely on simple similarity measurements. They thus face problems in c…

    Submitted 5 October, 2020; originally announced October 2020.

    MSC Class: 68U35

  10. arXiv:2003.00572  [pdf, other]

    cs.CR

    Retrofitting Fine Grain Isolation in the Firefox Renderer (Extended Version)

    Authors: Shravan Narayan, Craig Disselkoen, Tal Garfinkel, Nathan Froyd, Eric Rahm, Sorin Lerner, Hovav Shacham, Deian Stefan

    Abstract: Firefox and other major browsers rely on dozens of third-party libraries to render audio, video, images, and other content. These libraries are a frequent source of vulnerabilities. To mitigate this threat, we are migrating Firefox to an architecture that isolates these libraries in lightweight sandboxes, dramatically reducing the impact of a compromise. Retrofitting isolation can be labor-inten…

    Submitted 9 March, 2020; v1 submitted 1 March, 2020; originally announced March 2020.

    Comments: Accepted at Usenix Security 2020

    MSC Class: D.4.6 ACM Class: D.4.6

  11. arXiv:1911.12930  [pdf, ps, other]

    cs.DB cs.DC

    Incremental Clustering Techniques for Multi-Party Privacy-Preserving Record Linkage

    Authors: Dinusha Vatsalan, Peter Christen, Erhard Rahm

    Abstract: Privacy-Preserving Record Linkage (PPRL) supports the integration of sensitive information from multiple datasets, in particular the privacy-preserving matching of records referring to the same entity. PPRL has gained much attention in many application areas, with the most prominent ones in the healthcare domain. PPRL techniques tackle this problem by conducting linkage on masked (encoded) values.…

    Submitted 28 November, 2019; originally announced November 2019.

  12. arXiv:1910.04493  [pdf, other]

    cs.DC

    Graph Sampling with Distributed In-Memory Dataflow Systems

    Authors: Kevin Gomez, Matthias Täschner, M. Ali Rostami, Christopher Rost, Erhard Rahm

    Abstract: Given a large graph, a graph sample determines a subgraph with similar characteristics for certain metrics of the original graph. The samples are much smaller thereby accelerating and simplifying the analysis and visualization of large graphs. We focus on the implementation of distributed graph sampling for Big Data frameworks and in-memory dataflow systems such as Apache Spark or Apache Flink. We…

    Submitted 10 October, 2019; originally announced October 2019.

  13. arXiv:1708.09299  [pdf, ps, other]

    cs.DB

    Distributed Holistic Clustering on Linked Data

    Authors: Markus Nentwig, Anika Groß, Maximilian Möller, Erhard Rahm

    Abstract: Link discovery is an active field of research to support data integration in the Web of Data. Due to the huge size and number of available data sources, efficient and effective link discovery is a very challenging task. Common pairwise link discovery approaches do not scale to many sources with very large entity sets. We here propose a distributed holistic approach to link many data sources based…

    Submitted 30 August, 2017; originally announced August 2017.

  14. arXiv:1703.01910  [pdf, ps, other]

    cs.DB

    DIMSpan - Transactional Frequent Subgraph Mining with Distributed In-Memory Dataflow Systems

    Authors: André Petermann, Martin Junghanns, Erhard Rahm

    Abstract: Transactional frequent subgraph mining identifies frequent subgraphs in a collection of graphs. This research problem has wide applicability and increasingly requires higher scalability over single machine solutions to address the needs of Big Data use cases. We introduce DIMSpan, an advanced approach to frequent subgraph mining that utilizes the features provided by distributed in-memory dataflow…

    Submitted 6 March, 2017; originally announced March 2017.

  15. arXiv:1701.01232  [pdf, ps, other]

    cs.DB

    Scalable Multi-Database Privacy-Preserving Record Linkage using Counting Bloom Filters

    Authors: Dinusha Vatsalan, Peter Christen, Erhard Rahm

    Abstract: Privacy-preserving record linkage (PPRL) aims at integrating sensitive information from multiple disparate databases of different organizations. PPRL approaches are increasingly required in real-world application areas such as healthcare, national security, and business. Previous approaches have mostly focused on linking only two databases as well as the use of a dedicated linkage unit. Scaling PP…

    Submitted 5 January, 2017; originally announced January 2017.

    Comments: This is an extended version of an article published in IEEE ICDM International Workshop on Privacy and Discrimination in Data Mining (PDDM) 2016 - Scalable privacy-preserving linking of multiple databases using counting Bloom filters

  16. arXiv:1506.00548  [pdf, ps, other]

    cs.DB

    GRADOOP: Scalable Graph Data Management and Analytics with Hadoop

    Authors: Martin Junghanns, André Petermann, Kevin Gómez, Erhard Rahm

    Abstract: Many Big Data applications in business and science require the management and analysis of huge amounts of graph data. Previous approaches for graph analytics such as graph databases and parallel graph processing systems (e.g., Pregel) either lack sufficient scalability or flexibility and expressiveness. We are therefore developing a new end-to-end approach for graph data management and analysis ba…

    Submitted 2 June, 2015; v1 submitted 1 June, 2015; originally announced June 2015.

    Comments: Technical Report

  17. arXiv:1504.00457  [pdf]

    cs.DB cs.CY

    Semi-automatic identification of counterfeit offers in online shopping platforms

    Authors: Christian Wartner, Patrick Arnold, Erhard Rahm

    Abstract: Product counterfeiting is a serious problem causing the industry estimated losses of billions of dollars every year. With the increasing spread of e-commerce, the number of counterfeit products sold online increased substantially. We propose the adoption of a semi-automatic workflow to identify likely counterfeit offers in online platforms and to present these offers to a domain expert for manual…

    Submitted 2 April, 2015; originally announced April 2015.

  18. arXiv:1204.2731  [pdf, other]

    cs.DB

    How do Ontology Mappings Change in the Life Sciences?

    Authors: Anika Gross, Michael Hartung, Andreas Thor, Erhard Rahm

    Abstract: Mappings between related ontologies are increasingly used to support data integration and analysis tasks. Changes in the ontologies also require the adaptation of ontology mappings. So far the evolution of ontology mappings has received little attention albeit ontologies change continuously especially in the life sciences. We therefore analyze how mappings between popular life science ontologies e…

    Submitted 12 April, 2012; originally announced April 2012.

    Comments: Keywords: mapping evolution, ontology matching, ontology evolution

  19. arXiv:1108.1925  [pdf]

    cs.DB

    Rule-based Construction of Matching Processes

    Authors: Eric Peukert, Julian Eberius, Erhard Rahm

    Abstract: Mapping complex metadata structures is crucial in a number of domains such as data integration, ontology alignment or model management. To speed up that process automatic matching systems were developed to compute mapping suggestions that can be corrected by a user. However, constructing and tuning match strategies still requires a high manual effort by matching experts as well as correct mappings…

    Submitted 9 August, 2011; originally announced August 2011.

    Comments: 10 Pages

    ACM Class: D.2.12

  20. arXiv:1108.1631  [pdf, other]

    cs.DC

    Load Balancing for MapReduce-based Entity Resolution

    Authors: Lars Kolb, Andreas Thor, Erhard Rahm

    Abstract: The effectiveness and scalability of MapReduce-based implementations of complex data-intensive tasks depend on an even redistribution of data between map and reduce tasks. In the presence of skewed data, sophisticated redistribution approaches thus become necessary to achieve load balancing among all reduce tasks to be executed in parallel. For the complex problem of entity resolution, we propose…

    Submitted 8 August, 2011; originally announced August 2011.

    ACM Class: H.2.4

  21. arXiv:1012.4855  [pdf]

    cs.DB

    Target-driven merging of Taxonomies

    Authors: Salvatore Raunich, Erhard Rahm

    Abstract: The proliferation of ontologies and taxonomies in many domains increasingly demands the integration of multiple such ontologies. The goal of ontology integration is to merge two or more given ontologies in order to provide a unified view on the input ontologies while maintaining all information coming from them. We propose a new taxonomy merging algorithm that, given as input two taxonomies and an…

    Submitted 21 December, 2010; originally announced December 2010.

  22. arXiv:1010.3053  [pdf, other]

    cs.DC

    Parallel Sorted Neighborhood Blocking with MapReduce

    Authors: Lars Kolb, Andreas Thor, Erhard Rahm

    Abstract: Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate challenges and possible solutions of using the MapReduce programming model for parallel entity resolution. In particular, we propose and evaluate two MapReduce-based implementations for Sorted Neighborhood blocking that either use multiple MapReduce jobs…

    Submitted 14 October, 2010; originally announced October 2010.

  23. arXiv:1010.0122  [pdf]

    cs.DB

    Rule-based Generation of Diff Evolution Mappings between Ontology Versions

    Authors: Michael Hartung, Anika Groß, Erhard Rahm

    Abstract: Ontologies such as taxonomies, product catalogs or web directories are heavily used and hence evolve frequently to meet new requirements or to better reflect the current instance data of a domain. To effectively manage the evolution of ontologies it is essential to identify the difference (Diff) between two ontology versions. We propose a novel approach to determine an expressive and invertible di…

    Submitted 1 October, 2010; originally announced October 2010.

    Comments: 12 pages

  24. arXiv:1006.5309  [pdf]

    cs.DC

    Data Partitioning for Parallel Entity Matching

    Authors: Toralf Kirsten, Lars Kolb, Michael Hartung, Anika Groß, Hanna Köpcke, Erhard Rahm

    Abstract: Entity matching is an important and difficult step for integrating web data. To reduce the typically high execution time for matching we investigate how we can perform entity matching in parallel on a distributed infrastructure. We propose different strategies to partition the input data and generate multiple match tasks that can be independently executed. One of our strategies supports both, bloc…

    Submitted 28 June, 2010; originally announced June 2010.

    Comments: 11 pages

    ACM Class: H.3.4

  25. arXiv:1003.4418  [pdf]

    cs.DB cs.IR

    Evaluation of Query Generators for Entity Search Engines

    Authors: Stefan Endrullis, Andreas Thor, Erhard Rahm

    Abstract: Dynamic web applications such as mashups need efficient access to web data that is only accessible via entity search engines (e.g. product or publication search engines). However, most current mashup systems and applications only support simple keyword searches for retrieving data from search engines. We propose the use of more powerful search strategies building on so-called query generators. For…

    Submitted 23 March, 2010; originally announced March 2010.

    ACM Class: H.3.3; H.3.4