
doi 10.1109/JCDL52503.2021.00039

Automatic Recognition of Learning Resource Category in a Digital Library

Authors: Soumya Banerjee, Debarshi Kumar Sanyal, Samiran Chattopadhyay, Plaban Kumar Bhowmick, Partha Pratim Das

Abstract: Digital libraries often face the challenge of processing a large volume of diverse document types. The manual collection and tagging of metadata can be a time-consuming and error-prone task. To address this, we aim to develop an automatic metadata extractor for digital libraries. In this work, we introduce the Heterogeneous Learning Resources (HLR) dataset designed for document image classification. The approach involves decomposing individual learning resources into constituent document images (sheets). These images are then processed through an OCR tool to extract textual representation. State-of-the-art classifiers are employed to classify both the document image and its textual content. Subsequently, the labels of the constituent document images are utilized to predict the label of the overall document.

Submitted 28 November, 2023; originally announced January 2024.

Comments: 2 pages, 3 figures, Published in JCDL 21

arXiv:2302.07729

doi 10.1109/ACCESS.2023.3292300

Generation of Highlights from Research Papers Using Pointer-Generator Networks and SciBERT Embeddings

Authors: Tohida Rehman, Debarshi Kumar Sanyal, Samiran Chattopadhyay, Plaban Kumar Bhowmick, Partha Pratim Das

Abstract: Nowadays many research articles are prefaced with research highlights to summarize the main findings of the paper. Highlights not only help researchers precisely and quickly identify the contributions of a paper, they also enhance the discoverability of the article via search engines. We aim to automatically construct research highlights given certain segments of a research paper. We use a pointer-generator network with coverage mechanism and a contextual embedding layer at the input that encodes the input tokens into SciBERT embeddings. We test our model on a benchmark dataset, CSPubSum, and also present MixSub, a new multi-disciplinary corpus of papers for automatic research highlight generation. For both CSPubSum and MixSub, we have observed that the proposed model achieves the best performance compared to related variants and other models proposed in the literature. On the CSPubSum dataset, our model achieves the best performance when the input is only the abstract of a paper as opposed to other segments of the paper. It produces ROUGE-1, ROUGE-2 and ROUGE-L F1-scores of 38.26, 14.26 and 35.51, respectively, METEOR score of 32.62, and BERTScore F1 of 86.65 which outperform all other baselines. On the new MixSub dataset, where only the abstract is the input, our proposed model (when trained on the whole training corpus without distinguishing between the subject categories) achieves ROUGE-1, ROUGE-2 and ROUGE-L F1-scores of 31.78, 9.76 and 29.3, respectively, METEOR score of 24.00, and BERTScore F1 of 85.25.

Submitted 17 September, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

Comments: 19 Pages, 7 Figures, 8 Tables

Journal ref: IEEE Access, 2023

arXiv:2005.05414

doi 10.1145/3383583.3398598

Segmenting Scientific Abstracts into Discourse Categories: A Deep Learning-Based Approach for Sparse Labeled Data

Authors: Soumya Banerjee, Debarshi Kumar Sanyal, Samiran Chattopadhyay, Plaban Kumar Bhowmick, Parthapratim Das

Abstract: The abstract of a scientific paper distills the contents of the paper into a short paragraph. In the biomedical literature, it is customary to structure an abstract into discourse categories like BACKGROUND, OBJECTIVE, METHOD, RESULT, and CONCLUSION, but this segmentation is uncommon in other fields like computer science. Explicit categories could be helpful for more granular, that is, discourse-level search and recommendation. The sparsity of labeled data makes it challenging to construct supervised machine learning solutions for automatic discourse-level segmentation of abstracts in non-bio domains. In this paper, we address this problem using transfer learning. In particular, we define three discourse categories (BACKGROUND, TECHNIQUE, OBSERVATION) for an abstract because these three categories are the most common. We train a deep neural network on structured abstracts from PubMed, then fine-tune it on a small hand-labeled corpus of computer science papers. We observe an accuracy of 75% on the test corpus. We perform an ablation study to highlight the roles of the different parts of the model. Our method appears to be a promising solution to the automatic segmentation of abstracts, where the labeled data is sparse.

Submitted 27 May, 2020; v1 submitted 11 May, 2020; originally announced May 2020.

Comments: to appear in the proceedings of JCDL'2020

ACM Class: I.5.1; H.3.7

Showing 1–3 of 3 results for author: Bhowmick, P K