-
A Distributed Approach for Persistent Homology Computation on a Large Scale
Authors:
Riccardo Ceccaroni,
Lorenzo Di Rocco,
Umberto Ferraro Petrillo,
Pierpaolo Brutti
Abstract:
Persistent homology (PH) is a powerful mathematical method for automatically extracting relevant insights from images, such as those obtained by high-resolution imaging devices like electron microscopes or new-generation telescopes. However, applying this method comes at a very high computational cost, which is bound to grow further as new imaging devices generate an ever-growing amount of data. In this paper, we present PixHomology, a novel algorithm for efficiently computing $0$-dimensional PH on 2D images, optimizing memory and processing time. By leveraging the Apache Spark framework, we also present a distributed version of our algorithm with several optimized variants, able to concurrently process large batches of astronomical images. Finally, we present the results of an experimental analysis showing that our algorithm and its distributed version are efficient in terms of required memory, execution time, and scalability, consistently outperforming existing state-of-the-art PH computation tools when used to process large datasets.
Submitted 12 April, 2024;
originally announced April 2024.
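To make the notion of $0$-dimensional PH on a 2D image concrete, the following is a minimal sketch of the textbook union-find approach. It is not PixHomology itself, whose details the abstract does not give. It assumes a sublevel-set filtration (pixels enter in order of increasing intensity) and 4-connectivity; the function name `persistence_0d` and the sample image are hypothetical.

```python
def persistence_0d(image):
    """Return sorted (birth, death) pairs of connected components.

    Assumptions (not taken from the paper): sublevel-set filtration,
    4-connected pixel grid, zero-persistence pairs are dropped.
    """
    rows, cols = len(image), len(image[0])
    # Pixels enter the filtration in order of increasing intensity.
    order = sorted((image[r][c], r, c) for r in range(rows) for c in range(cols))
    parent = {}  # union-find forest over the pixels added so far
    birth = {}   # birth value of the component rooted at each pixel

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    pairs = []
    for value, r, c in order:
        p = (r, c)
        parent[p] = p
        birth[p] = value
        for q in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if q not in parent:
                continue  # neighbor is outside the grid or not added yet
            elder, younger = find(p), find(q)
            if elder == younger:
                continue
            # Elder rule: the component born later dies at the merge.
            if birth[elder] > birth[younger]:
                elder, younger = younger, elder
            if birth[younger] < value:
                pairs.append((birth[younger], value))
            parent[younger] = elder
    # The component born at the global minimum never dies.
    pairs.append((order[0][0], float("inf")))
    return sorted(pairs)


# Hypothetical 3x3 image with four local minima (the corners).
image = [
    [1, 5, 2],
    [6, 7, 6],
    [3, 5, 0],
]
diagram = persistence_0d(image)
print(diagram)
```

Each pair records the intensity at which a component appears and the intensity at which it merges into an older one. Sorting all pixels makes this sketch O(n log n) in the number of pixels, which hints at why memory and time become a concern for very large images.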
-
DomainNet: Homograph Detection for Data Lake Disambiguation
Authors:
Aristotelis Leventidis,
Laura Di Rocco,
Wolfgang Gatterbauer,
Renée J. Miller,
Mirek Riedewald
Abstract:
Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values. DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs has a precision and a recall of 38% versus 69% with our method on a synthetic benchmark. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake our top-200 precision is 89%.
Submitted 22 March, 2021; v1 submitted 17 March, 2021;
originally announced March 2021.
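The idea of scoring values by centrality on a value-attribute bipartite graph can be sketched as follows. This is an illustration under stated assumptions, not the DomainNet implementation: the abstract does not name the centrality measure, so betweenness centrality (computed with Brandes' algorithm) is used here as one plausible choice, and the toy table, `build_bipartite`, and `betweenness` are hypothetical.

```python
from collections import defaultdict, deque


def build_bipartite(columns):
    """Build an adjacency map linking each attribute to the values it contains."""
    adj = defaultdict(set)
    for attr, values in columns.items():
        for value in values:
            adj[("attr", attr)].add(("val", value))
            adj[("val", value)].add(("attr", attr))
    return adj


def betweenness(adj):
    """Unnormalized betweenness for an unweighted, undirected graph (Brandes)."""
    score = dict.fromkeys(adj, 0.0)
    for source in adj:
        stack, preds = [], defaultdict(list)
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from source
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected path was counted once from each endpoint.
    return {node: total / 2 for node, total in score.items()}


# Hypothetical data lake with two columns; "jaguar" appears in both.
columns = {
    "animal": ["jaguar", "puma", "lion"],
    "car_maker": ["jaguar", "fiat", "toyota"],
}
scores = betweenness(build_bipartite(columns))
ranked = sorted(
    (node for node in scores if node[0] == "val"),
    key=lambda node: scores[node],
    reverse=True,
)
print(ranked[0], scores[ranked[0]])
```

In this toy graph, "jaguar" is the only value that bridges the two attributes, so it receives the highest score among value nodes. Values with a single meaning stay confined to one attribute's neighborhood and score zero.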
-
Impact of Semantic Granularity on Geographic Information Search Support
Authors:
Noemi Mauro,
Liliana Ardissono,
Laura Di Rocco,
Michela Bertolotto,
Giovanna Guerrini
Abstract:
Information Retrieval research has used semantics to provide accurate search results, but the analysis of conceptual abstraction has mainly focused on information integration. We consider session-based query expansion in Geographical Information Retrieval and investigate the impact of semantic granularity (i.e., the specificity of concept representation) on the suggestion of relevant types of information to search for. We study how different levels of detail in knowledge representation influence the ability to guide the user in exploring a complex information space. A comparative analysis of the performance of a query expansion model, using three spatial ontologies defined at different semantic granularity levels, reveals that a fine-grained representation enhances recall. However, precision depends on how closely the ontologies match the way people conceptualize and verbally describe geographic space.
Submitted 1 April, 2020;
originally announced April 2020.