-
L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models
Authors:
Harsh Chaudhari,
Anuja Patil,
Dhanashree Lavekar,
Pranav Khairnar,
Raviraj Joshi
Abstract:
This work introduces the L3Cube-MahaSocialNER dataset, the first and largest social media dataset specifically designed for Named Entity Recognition (NER) in the Marathi language. The dataset comprises 18,000 manually labeled sentences covering eight entity classes, addressing challenges posed by social media data, including non-standard language and informal idioms. Deep learning models, includin…
▽ More
This work introduces the L3Cube-MahaSocialNER dataset, the first and largest social media dataset specifically designed for Named Entity Recognition (NER) in the Marathi language. The dataset comprises 18,000 manually labeled sentences covering eight entity classes, addressing challenges posed by social media data, including non-standard language and informal idioms. Deep learning models, including CNN, LSTM, BiLSTM, and Transformer models, are evaluated on the individual dataset with IOB and non-IOB notations. The results demonstrate the effectiveness of these models in accurately recognizing named entities in Marathi informal text. The L3Cube-MahaSocialNER dataset offers user-centric information extraction and supports real-time applications, providing a valuable resource for public opinion analysis, news, and marketing on social media platforms. We also show that the zero-shot results of the regular NER model are poor on the social NER test set thus highlighting the need for more social NER datasets. The datasets and models are publicly available at https://github.com/l3cube-pune/MarathiNLP
△ Less
Submitted 30 December, 2023;
originally announced January 2024.
-
On Significance of Subword tokenization for Low Resource and Efficient Named Entity Recognition: A case study in Marathi
Authors:
Harsh Chaudhari,
Anuja Patil,
Dhanashree Lavekar,
Pranav Khairnar,
Raviraj Joshi,
Sachin Pande
Abstract:
Named Entity Recognition (NER) systems play a vital role in NLP applications such as machine translation, summarization, and question-answering. These systems identify named entities, which encompass real-world concepts like locations, persons, and organizations. Despite extensive research on NER systems for the English language, they have not received adequate attention in the context of low reso…
▽ More
Named Entity Recognition (NER) systems play a vital role in NLP applications such as machine translation, summarization, and question-answering. These systems identify named entities, which encompass real-world concepts like locations, persons, and organizations. Despite extensive research on NER systems for the English language, they have not received adequate attention in the context of low resource languages. In this work, we focus on NER for low-resource language and present our case study in the context of the Indian language Marathi. The advancement of NLP research revolves around the utilization of pre-trained transformer models such as BERT for the development of NER models. However, we focus on improving the performance of shallow models based on CNN, and LSTM by combining the best of both worlds. In the era of transformers, these traditional deep learning models are still relevant because of their high computational efficiency. We propose a hybrid approach for efficient NER by integrating a BERT-based subword tokenizer into vanilla CNN/LSTM models. We show that this simple approach of replacing a traditional word-based tokenizer with a BERT-tokenizer brings the accuracy of vanilla single-layer models closer to that of deep pre-trained models like BERT. We show the importance of using sub-word tokenization for NER and present our study toward building efficient NLP systems. The evaluation is performed on L3Cube-MahaNER dataset using tokenizers from MahaBERT, MahaGPT, IndicBERT, and mBERT.
△ Less
Submitted 3 December, 2023;
originally announced December 2023.
-
Explanation and Use of Uncertainty Quantified by Bayesian Neural Network Classifiers for Breast Histopathology Images
Authors:
Ponkrshnan Thiagarajan,
Pushkar Khairnar,
Susanta Ghosh
Abstract:
Despite the promise of Convolutional neural network (CNN) based classification models for histopathological images, it is infeasible to quantify its uncertainties. Moreover, CNNs may suffer from overfitting when the data is biased. We show that Bayesian-CNN can overcome these limitations by regularizing automatically and by quantifying the uncertainty. We have developed a novel technique to utiliz…
▽ More
Despite the promise of Convolutional neural network (CNN) based classification models for histopathological images, it is infeasible to quantify its uncertainties. Moreover, CNNs may suffer from overfitting when the data is biased. We show that Bayesian-CNN can overcome these limitations by regularizing automatically and by quantifying the uncertainty. We have developed a novel technique to utilize the uncertainties provided by the Bayesian-CNN that significantly improves the performance on a large fraction of the test data (about 6% improvement in accuracy on 77% of test data). Further, we provide a novel explanation for the uncertainty by projecting the data into a low dimensional space through a nonlinear dimensionality reduction technique. This dimensionality reduction enables interpretation of the test data through visualization and reveals the structure of the data in a low dimensional feature space. We show that the Bayesian-CNN can perform much better than the state-of-the-art transfer learning CNN (TL-CNN) by reducing the false negative and false positive by 11% and 7.7% respectively for the present data set. It achieves this performance with only 1.86 million parameters as compared to 134.33 million for TL-CNN. Besides, we modify the Bayesian-CNN by introducing a stochastic adaptive activation function. The modified Bayesian-CNN performs slightly better than Bayesian-CNN on all performance metrics and significantly reduces the number of false negatives and false positives (3% reduction for both). We also show that these results are statistically significant by performing McNemar's statistical significance test. This work shows the advantages of Bayesian-CNN against the state-of-the-art, explains and utilizes the uncertainties for histopathological images. It should find applications in various medical image classifications.
△ Less
Submitted 5 November, 2021; v1 submitted 7 October, 2020;
originally announced October 2020.
-
Image Resolution and Contrast Enhancement of Satellite Geographical Images with Removal of Noise using Wavelet Transforms
Authors:
Prajakta P. Khairnar,
C. A. Manjare
Abstract:
In this paper the technique for resolution and contrast enhancement of satellite geographical images based on discrete wavelet transform (DWT), stationary wavelet transform (SWT) and singular value decomposition (SVD) has been proposed. In this, the noise is added in the input low resolution and low contrast image. The median filter is used remove noise from the input image. This low resolution, l…
▽ More
In this paper the technique for resolution and contrast enhancement of satellite geographical images based on discrete wavelet transform (DWT), stationary wavelet transform (SWT) and singular value decomposition (SVD) has been proposed. In this, the noise is added in the input low resolution and low contrast image. The median filter is used remove noise from the input image. This low resolution, low contrast image without noise is decomposed into four sub-bands by using DWT and SWT. The resolution enhancement technique is based on the interpolation of high frequency components obtained by DWT and input image. SWT is used to enhance input image. DWT is used to decompose an image into four frequency sub bands and these four sub-bands are interpolated using bicubic interpolation technique. All these sub-bands are reconstructed as high resolution image by using inverse DWT (IDWT). To increase the contrast the proposed technique uses DWT and SVD. GHE is used to equalize an image. The equalized image is decomposed into four sub-bands using DWT and new LL sub-band is reconstructed using SVD. All sub-bands are reconstructed using IDWT to generate high resolution and contrast image over conventional techniques. The experimental result shows superiority of the proposed technique over conventional techniques.
Key words: Discrete wavelet transform (DWT), General histogram equalization (GHE), Median filter, Singular value decomposition (SVD), Stationary wavelet transform (SWT).
△ Less
Submitted 8 May, 2014;
originally announced May 2014.