-
ROIBIN-SZ: Fast and Science-Preserving Compression for Serial Crystallography
Authors:
Robert Underwood,
Chun Yoon,
Ali Gok,
Sheng Di,
Franck Cappello
Abstract:
Crystallography is the leading technique to study atomic structures of proteins and produces enormous volumes of information that can place strains on the storage and data transfer capabilities of synchrotron and free-electron laser light sources. Lossy compression has been identified as a possible means to cope with the growing data volumes; however, prior approaches have not produced sufficient…
▽ More
Crystallography is the leading technique to study atomic structures of proteins and produces enormous volumes of information that can place strains on the storage and data transfer capabilities of synchrotron and free-electron laser light sources. Lossy compression has been identified as a possible means to cope with the growing data volumes; however, prior approaches have not produced sufficient quality at a sufficient rate to meet scientific needs. This paper presents Region Of Interest BINning with SZ lossy compression (ROIBIN-SZ) a novel, parallel, and accelerated compression scheme that separates the dynamically selected preservation of key regions with lossy compression of background information. We perform and present an extensive evaluation of the performance and quality results made by the co-design of this compression scheme. We can achieve up to a 196x and 46.44x compression ratio on lysozyme and selenobiotinyl-streptavidin while preserving the data sufficiently to reconstruct the structure at bandwidths and scales that approach the needs of the upcoming light sources
△ Less
Submitted 22 June, 2022;
originally announced June 2022.
-
SZ3: A Modular Framework for Composing Prediction-Based Error-Bounded Lossy Compressors
Authors:
Xin Liang,
Kai Zhao,
Sheng Di,
Sihuan Li,
Robert Underwood,
Ali M. Gok,
Jiannan Tian,
Jun**g Deng,
Jon C. Calhoun,
Dingwen Tao,
Zizhong Chen,
Franck Cappello
Abstract:
Today's scientific simulations require a significant reduction of data volume because of extremely large amounts of data they produce and the limited I/O bandwidth and storage space. Error-bounded lossy compressor has been considered one of the most effective solutions to the above problem. In practice, however, the best-fit compression method often needs to be customized/optimized in particular b…
▽ More
Today's scientific simulations require a significant reduction of data volume because of extremely large amounts of data they produce and the limited I/O bandwidth and storage space. Error-bounded lossy compressor has been considered one of the most effective solutions to the above problem. In practice, however, the best-fit compression method often needs to be customized/optimized in particular because of diverse characteristics in different datasets and various user requirements on the compression quality and performance. In this paper, we develop a novel modular, composable compression framework (namely SZ3), which involves three significant contributions. (1) SZ3 features a modular abstraction for the prediction-based compression framework such that the new compression modules can be plugged in easily. (2) SZ3 supports multialgorithm predictors and can automatically select the best-fit predictor for each data block based on the designed error estimation criterion. (3) SZ3 allows users to easily compose different compression pipelines on demand, such that both compression quality and performance can be significantly improved for their specific datasets and requirements. (4) In addition, we evaluate several lossy compressors composed from SZ3 using the real-world datasets. Specifically, we leverage SZ3 to improve the compression quality and performance for different use-cases, including GAMESS quantum chemistry dataset and Advanced Photon Source (APS) instrument dataset. Experiments show that our customized compression pipelines lead to up to 20% improvement in compression ratios under the same data distortion compared with the state-of-the-art approaches.
△ Less
Submitted 11 November, 2021; v1 submitted 4 November, 2021;
originally announced November 2021.
-
End-to-End Open Vocabulary Keyword Search
Authors:
Bolaji Yusuf,
Alican Gok,
Batuhan Gundogdu,
Murat Saraclar
Abstract:
Recently, neural approaches to spoken content retrieval have become popular. However, they tend to be restricted in their vocabulary or in their ability to deal with imbalanced test settings. These restrictions limit their applicability in keyword search, where the set of queries is not known beforehand, and where the system should return not just whether an utterance contains a query but the exac…
▽ More
Recently, neural approaches to spoken content retrieval have become popular. However, they tend to be restricted in their vocabulary or in their ability to deal with imbalanced test settings. These restrictions limit their applicability in keyword search, where the set of queries is not known beforehand, and where the system should return not just whether an utterance contains a query but the exact location of any such occurrences. In this work, we propose a model directly optimized for keyword search. The model takes a query and an utterance as input and returns a sequence of probabilities for each frame of the utterance of the query having occurred in that frame. Experiments show that the proposed model not only outperforms similar end-to-end models on a task where the ratio of positive and negative trials is artificially balanced, but it is also able to deal with the far more challenging task of keyword search with its inherent imbalance. Furthermore, using our system to rescore the outputs an LVCSR-based keyword search system leads to significant improvements on the latter.
△ Less
Submitted 23 August, 2021;
originally announced August 2021.
-
From web crawled text to project descriptions: automatic summarizing of social innovation projects
Authors:
Nikola Milosevic,
Dimitar Marinov,
Abdullah Gok,
Goran Nenadic
Abstract:
In the past decade, social innovation projects have gained the attention of policy makers, as they address important social issues in an innovative manner. A database of social innovation is an important source of information that can expand collaboration between social innovators, drive policy and serve as an important resource for research. Such a database needs to have projects described and su…
▽ More
In the past decade, social innovation projects have gained the attention of policy makers, as they address important social issues in an innovative manner. A database of social innovation is an important source of information that can expand collaboration between social innovators, drive policy and serve as an important resource for research. Such a database needs to have projects described and summarized. In this paper, we propose and compare several methods (e.g. SVM-based, recurrent neural network based, ensambled) for describing projects based on the text that is available on project websites. We also address and propose a new metric for automated evaluation of summaries based on topic modelling.
△ Less
Submitted 22 May, 2019;
originally announced May 2019.