Plot2Spectra: an Automatic Spectra Extraction Tool

Authors: Weixin Jiang, Eric Schwenker, Trevor Spreadbury, Kai Li, Maria K. Y. Chan, Oliver Cossairt

Abstract: Different types of spectroscopies, such as X-ray absorption near edge structure (XANES) and Raman spectroscopy, play a very important role in analyzing the characteristics of different materials. In scientific literature, XANES/Raman data are usually plotted in line graphs, which is a visually appropriate way to represent the information when the end-user is a human reader. However, such graphs are not conducive to direct programmatic analysis due to the lack of automatic tools. In this paper, we develop a plot digitizer, named Plot2Spectra, to extract data points from spectroscopy graph images in an automatic fashion, which makes large-scale data acquisition and analysis possible. Specifically, the plot digitizer is a two-stage framework. In the first stage, axis alignment, we adopt an anchor-free detector to detect the plot region and then refine the detected bounding boxes with an edge-based constraint to locate the position of the two axes. We also apply a scene text detector to extract and interpret all tick information below the x-axis. In the second stage, plot data extraction, we first employ semantic segmentation to separate pixels belonging to plot lines from the background, and from there, apply optical flow constraints to the plot line pixels to assign them to the appropriate line (data instance) they encode. Extensive experiments are conducted to validate the effectiveness of the proposed plot digitizer, which shows that such a tool could help accelerate the discovery and machine learning of materials properties.

Submitted 6 July, 2021; originally announced July 2021.

arXiv:2105.10532 [pdf]

Ingrained -- An automated framework for fusing atomic-scale image simulations into experiments

Authors: Eric Schwenker, V. S. Chaitanya Kolluru, Jinglong Guo, Xiaobing Hu, Qiucheng Li, Mark C. Hersam, Vinayak P. Dravid, Robert F. Klie, Jeffrey R. Guest, Maria K. Y. Chan

Abstract: To fully leverage the power of image simulation to corroborate and explain patterns and structures in atomic resolution microscopy (e.g., electron and scanning probe), an initial correspondence between the simulation and experimental image must be established at the outset of further high accuracy simulations or calculations. Furthermore, if simulation is to be used in the context of highly automated processes or high-throughput optimization, the process of finding this correspondence itself must be automated. In this work, we introduce ingrained, an open-source automation framework which solves for this correspondence and fuses atomic resolution image simulations into the experimental images to which they correspond. We describe herein the overall ingrained workflow, focusing on its application to interface structure approximations, and the development of an experimentally rationalized forward model for scanning tunneling microscopy simulation.

Submitted 21 May, 2021; originally announced May 2021.

arXiv:2103.10631 [pdf]

EXSCLAIM! -- An automated pipeline for the construction of labeled materials imaging datasets from literature

Authors: Eric Schwenker, Weixin Jiang, Trevor Spreadbury, Nicola Ferrier, Oliver Cossairt, Maria K. Y. Chan

Abstract: Due to recent improvements in image resolution and acquisition speed, materials microscopy is experiencing an explosion of published imaging data. The standard publication format, while sufficient for traditional data ingestion scenarios where a select number of images can be critically examined and curated manually, is not conducive to large-scale data aggregation or analysis, hindering data sharing and reuse. Most images in publications are presented as components of a larger figure with their explicit context buried in the main body or caption text, so even if aggregated, collections of images with weak or no digitized contextual labels have limited value. To solve the problem of curating labeled microscopy data from literature, this work introduces the EXSCLAIM! Python toolkit for the automatic EXtraction, Separation, and Caption-based natural Language Annotation of IMages from scientific literature. We highlight the methodology behind the construction of EXSCLAIM! and demonstrate its ability to extract and label open-source scientific images at high volume.

Submitted 19 March, 2021; originally announced March 2021.

arXiv:2101.09903 [pdf, other]

doi 10.1109/ICIP42928.2021.9506171

A Two-stage Framework for Compound Figure Separation

Authors: Weixin Jiang, Eric Schwenker, Trevor Spreadbury, Nicola Ferrier, Maria K. Y. Chan, Oliver Cossairt

Abstract: Scientific literature contains large volumes of complex, unstructured figures that are compound in nature (i.e., composed of multiple images, graphs, and drawings). Separation of these compound figures is critical for information retrieval from these figures. In this paper, we propose a new strategy for compound figure separation, which decomposes the compound figures into constituent subfigures while preserving the association between the subfigures and their respective caption components. We propose a two-stage framework to address the compound figure separation problem. In the first stage, the subfigure label detection module detects all subfigure labels. Then, in the subfigure detection module, the detected subfigure labels help to detect the subfigures by optimizing the feature selection process and providing the global layout information as extra features. Extensive experiments are conducted to validate the effectiveness and superiority of the proposed framework, which improves the detection precision by 9%.

Submitted 7 October, 2021; v1 submitted 25 January, 2021; originally announced January 2021.

Journal ref: IEEE International Conference on Image Processing (ICIP), 2021, pp. 1204-1208

arXiv:1912.07142 [pdf, other]

Semantic Segmentation for Compound figures

Authors: Weixin Jiang, Eric Schwenker, Maria Chan, Oliver Cossairt

Abstract: Scientific literature contains large volumes of unstructured data, with over 30% of figures constructed as a combination of multiple images. These compound figures cannot be analyzed directly with existing information retrieval tools. In this paper, we propose a semantic segmentation approach for compound figure separation, decomposing the compound figures into "master images". Each master image is one part of a compound figure governed by a subfigure label (typically "(a), (b), (c), etc."). In this way, the separated subfigures can be easily associated with the description information in the caption. In particular, we propose an anchor-based master image detection algorithm, which leverages the correlation between master images and subfigure labels and locates the master images in a two-step manner. First, a subfigure label detector is built to extract the global layout information of the compound figure. Second, the layout information is combined with local features to locate the master images. We validate the effectiveness of the proposed method on our labeled testing dataset both quantitatively and qualitatively.

Submitted 4 February, 2021; v1 submitted 15 December, 2019; originally announced December 2019.

Comments: This paper has been withdrawn by the authors. This paper has been superseded by arXiv:2101.09903

Showing 1–5 of 5 results for author: Schwenker, E