-
Speech Representation Analysis based on Inter- and Intra-Model Similarities
Authors:
Yassine El Kheir,
Ahmed Ali,
Shammur Absar Chowdhury
Abstract:
Self-supervised models have revolutionized speech processing, achieving new levels of performance in a wide variety of tasks with limited resources. However, the inner workings of these models are still opaque. In this paper, we aim to analyze the encoded contextual representation of these foundation models based on their inter- and intra-model similarity, independent of any external annotation an…
▽ More
Self-supervised models have revolutionized speech processing, achieving new levels of performance in a wide variety of tasks with limited resources. However, the inner workings of these models are still opaque. In this paper, we aim to analyze the encoded contextual representation of these foundation models based on their inter- and intra-model similarity, independent of any external annotation and task-specific constraint. We examine different SSL models varying their training paradigm -- Contrastive (Wav2Vec2.0) and Predictive models (HuBERT); and model sizes (base and large). We explore these models on different levels of localization/distributivity of information including (i) individual neurons; (ii) layer representation; (iii) attention weights and (iv) compare the representations with their finetuned counterparts.Our results highlight that these models converge to similar representation subspaces but not to similar neuron-localized concepts\footnote{A concept represents a coherent fragment of knowledge, such as ``a class containing certain objects as elements, where the objects have certain properties. We made the code publicly available for facilitating further research, we publicly released our code.
△ Less
Submitted 23 June, 2024;
originally announced June 2024.
-
Children's Speech Recognition through Discrete Token Enhancement
Authors:
Vrunda N. Sukhadia,
Shammur Absar Chowdhury
Abstract:
Children's speech recognition is considered a low-resource task mainly due to the lack of publicly available data. There are several reasons for such data scarcity, including expensive data collection and annotation processes, and data privacy, among others. Transforming speech signals into discrete tokens that do not carry sensitive information but capture both linguistic and acoustic information…
▽ More
Children's speech recognition is considered a low-resource task mainly due to the lack of publicly available data. There are several reasons for such data scarcity, including expensive data collection and annotation processes, and data privacy, among others. Transforming speech signals into discrete tokens that do not carry sensitive information but capture both linguistic and acoustic information could be a solution for privacy concerns. In this study, we investigate the integration of discrete speech tokens into children's speech recognition systems as input without significantly degrading the ASR performance. Additionally, we explored single-view and multi-view strategies for creating these discrete labels. Furthermore, we tested the models for generalization capabilities with unseen domain and nativity dataset. Results reveal that the discrete token ASR for children achieves nearly equivalent performance with an approximate 83% reduction in parameters.
△ Less
Submitted 24 June, 2024; v1 submitted 19 June, 2024;
originally announced June 2024.
-
Classification of Non-native Handwritten Characters Using Convolutional Neural Network
Authors:
F. A. Mamun,
S. A. H. Chowdhury,
J. E. Giti,
H. Sarker
Abstract:
The use of convolutional neural networks (CNNs) has accelerated the progress of handwritten character classification/recognition. Handwritten character recognition (HCR) has found applications in various domains, such as traffic signal detection, language translation, and document information extraction. However, the widespread use of existing HCR technology is yet to be seen as it does not provid…
▽ More
The use of convolutional neural networks (CNNs) has accelerated the progress of handwritten character classification/recognition. Handwritten character recognition (HCR) has found applications in various domains, such as traffic signal detection, language translation, and document information extraction. However, the widespread use of existing HCR technology is yet to be seen as it does not provide reliable character recognition with outstanding accuracy. One of the reasons for unreliable HCR is that existing HCR methods do not take the handwriting styles of non-native writers into account. Hence, further improvement is needed to ensure the reliability and extensive deployment of character recognition technologies for critical tasks. In this work, the classification of English characters written by non-native users is performed by proposing a custom-tailored CNN model. We train this CNN with a new dataset called the handwritten isolated English character (HIEC) dataset. This dataset consists of 16,496 images collected from 260 persons. This paper also includes an ablation study of our CNN by adjusting hyperparameters to identify the best model for the HIEC dataset. The proposed model with five convolutional layers and one hidden layer outperforms state-of-the-art models in terms of character recognition accuracy and achieves an accuracy of $\mathbf{97.04}$%. Compared with the second-best model, the relative improvement of our model in terms of classification accuracy is $\mathbf{4.38}$%.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
Unveiling the Potential of Big Data Analytics for Transforming Higher Education in Bangladesh; Needs, Prospects, and Challenges
Authors:
Sabbir Ahmed Chowdhury,
Md Aminul Islam,
Mostafa Azad Kamal
Abstract:
Big Data Analytics has gained tremendous momentum in many sectors worldwide. Big Data has substantial influence in the field of Learning Analytics that may allow academic institutions to better understand the learners needs and proactively address them. Hence, it is essential to understand Big Data and its application. With the capability of Big Data to find a broad understanding of the scientific…
▽ More
Big Data Analytics has gained tremendous momentum in many sectors worldwide. Big Data has substantial influence in the field of Learning Analytics that may allow academic institutions to better understand the learners needs and proactively address them. Hence, it is essential to understand Big Data and its application. With the capability of Big Data to find a broad understanding of the scientific decision making process, Big Data Analytics (BDA) can be a piece of the answer to accomplishing Bangladesh Higher Education (BHE) objectives. This paper reviews the capacity of BDA, considers possible applications in BHE, gives an insight into how to improve the quality of education or uncover additional values from the data generated by educational institutions, and lastly, identifies needs and difficulties, opportunities, and some frameworks to probable implications about the BDA in BHE sector.
Keywords; Big Data Analytics, Learning Analytics, Quality of Education, Challenges, Higher Education, Bangladesh
△ Less
Submitted 24 November, 2023; v1 submitted 10 October, 2023;
originally announced November 2023.
-
Pseudo-Labeling for Domain-Agnostic Bangla Automatic Speech Recognition
Authors:
Rabindra Nath Nandi,
Mehadi Hasan Menon,
Tareq Al Muntasir,
Sagor Sarker,
Quazi Sarwar Muhtaseem,
Md. Tariqul Islam,
Shammur Absar Chowdhury,
Firoj Alam
Abstract:
One of the major challenges for develo** automatic speech recognition (ASR) for low-resource languages is the limited access to labeled data with domain-specific variations. In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset. With the proposed methodology, we developed a 20k+ hours labeled Bangla speech dataset covering diverse topics, speak…
▽ More
One of the major challenges for develo** automatic speech recognition (ASR) for low-resource languages is the limited access to labeled data with domain-specific variations. In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset. With the proposed methodology, we developed a 20k+ hours labeled Bangla speech dataset covering diverse topics, speaking styles, dialects, noisy environments, and conversational scenarios. We then exploited the developed corpus to design a conformer-based ASR system. We benchmarked the trained ASR with publicly available datasets and compared it with other available models. To investigate the efficacy, we designed and developed a human-annotated domain-agnostic test set composed of news, telephony, and conversational data among others. Our results demonstrate the efficacy of the model trained on psuedo-label data for the designed test-set along with publicly-available Bangla datasets. The experimental resources will be publicly available.(https://github.com/hishab-nlp/Pseudo-Labeling-for-Domain-Agnostic-Bangla-ASR)
△ Less
Submitted 6 November, 2023;
originally announced November 2023.
-
Automatic Pronunciation Assessment -- A Review
Authors:
Yassine El Kheir,
Ahmed Ali,
Shammur Absar Chowdhury
Abstract:
Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth in language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment for both phonemic and prosodic. We categorize the main challeng…
▽ More
Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth in language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment for both phonemic and prosodic. We categorize the main challenges observed in prominent research trends, and highlight existing limitations, and available resources. This is followed by a discussion of the remaining challenges and possible directions for future work.
△ Less
Submitted 21 October, 2023;
originally announced October 2023.
-
The complementary roles of non-verbal cues for Robust Pronunciation Assessment
Authors:
Yassine El Kheir,
Shammur Absar Chowdhury,
Ahmed Ali
Abstract:
Research on pronunciation assessment systems focuses on utilizing phonetic and phonological aspects of non-native (L2) speech, often neglecting the rich layer of information hidden within the non-verbal cues. In this study, we proposed a novel pronunciation assessment framework, IntraVerbalPA. % The framework innovatively incorporates both fine-grained frame- and abstract utterance-level non-verba…
▽ More
Research on pronunciation assessment systems focuses on utilizing phonetic and phonological aspects of non-native (L2) speech, often neglecting the rich layer of information hidden within the non-verbal cues. In this study, we proposed a novel pronunciation assessment framework, IntraVerbalPA. % The framework innovatively incorporates both fine-grained frame- and abstract utterance-level non-verbal cues, alongside the conventional speech and phoneme representations. Additionally, we introduce ''Goodness of phonemic-duration'' metric to effectively model duration distribution within the framework. Our results validate the effectiveness of the proposed IntraVerbalPA framework and its individual components, yielding performance that either matches or outperforms existing research works.
△ Less
Submitted 14 September, 2023;
originally announced September 2023.
-
L1-aware Multilingual Mispronunciation Detection Framework
Authors:
Yassine El Kheir,
Shammur Absar Chowdhury,
Ahmed Ali
Abstract:
The phonological discrepancies between a speaker's native (L1) and the non-native language (L2) serves as a major factor for mispronunciation. This paper introduces a novel multilingual MDD architecture, L1-MultiMDD, enriched with L1-aware speech representation. An end-to-end speech encoder is trained on the input signal and its corresponding reference phoneme sequence. First, an attention mechani…
▽ More
The phonological discrepancies between a speaker's native (L1) and the non-native language (L2) serves as a major factor for mispronunciation. This paper introduces a novel multilingual MDD architecture, L1-MultiMDD, enriched with L1-aware speech representation. An end-to-end speech encoder is trained on the input signal and its corresponding reference phoneme sequence. First, an attention mechanism is deployed to align the input audio with the reference phoneme sequence. Afterwards, the L1-L2-speech embedding are extracted from an auxiliary model, pretrained in a multi-task setup identifying L1 and L2 language, and are infused with the primary network. Finally, the L1-MultiMDD is then optimized for a unified multilingual phoneme recognition task using connectionist temporal classification (CTC) loss for the target languages: English, Arabic, and Mandarin. Our experiments demonstrate the effectiveness of the proposed L1-MultiMDD framework on both seen -- L2-ARTIC, LATIC, and AraVoiceL2v2; and unseen -- EpaDB and Speechocean762 datasets. The consistent gains in PER, and false rejection rate (FRR) across all target languages confirm our approach's robustness, efficacy, and generalizability.
△ Less
Submitted 21 September, 2023; v1 submitted 14 September, 2023;
originally announced September 2023.
-
Bornil: An open-source sign language data crowdsourcing platform for AI enabled dialect-agnostic communication
Authors:
Shahriar Elahi Dhruvo,
Mohammad Akhlaqur Rahman,
Manash Kumar Mandal,
Md. Istiak Hossain Shihab,
A. A. Noman Ansary,
Kaneez Fatema Shithi,
Sanjida Khanom,
Rabeya Akter,
Safaeid Hossain Arib,
M. N. Ansary,
Sazia Mehnaz,
Rezwana Sultana,
Sejuti Rahman,
Sayma Sultana Chowdhury,
Sabbir Ahmed Chowdhury,
Farig Sadeque,
Asif Sushmit
Abstract:
The absence of annotated sign language datasets has hindered the development of sign language recognition and translation technologies. In this paper, we introduce Bornil; a crowdsource-friendly, multilingual sign language data collection, annotation, and validation platform. Bornil allows users to record sign language gestures and lets annotators perform sentence and gloss-level annotation. It al…
▽ More
The absence of annotated sign language datasets has hindered the development of sign language recognition and translation technologies. In this paper, we introduce Bornil; a crowdsource-friendly, multilingual sign language data collection, annotation, and validation platform. Bornil allows users to record sign language gestures and lets annotators perform sentence and gloss-level annotation. It also allows validators to make sure of the quality of both the recorded videos and the annotations through manual validation to develop high-quality datasets for deep learning-based Automatic Sign Language Recognition. To demonstrate the system's efficacy; we collected the largest sign language dataset for Bangladeshi Sign Language dialect, perform deep learning based Sign Language Recognition modeling, and report the benchmark performance. The Bornil platform, BornilDB v1.0 Dataset, and the codebases are available on https://bornil.bengali.ai
△ Less
Submitted 29 August, 2023;
originally announced August 2023.
-
LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking
Authors:
Fahim Dalvi,
Maram Hasanain,
Sabri Boughorbel,
Basel Mousi,
Samir Abdaljalil,
Nizi Nazar,
Ahmed Abdelali,
Shammur Absar Chowdhury,
Hamdy Mubarak,
Ahmed Ali,
Majd Hawasly,
Nadir Durrani,
Firoj Alam
Abstract:
The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available, their customization capabilities for specific tasks and datasets are often complex for different users. In this study, we introduce the LLMeBench framework, whi…
▽ More
The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available, their customization capabilities for specific tasks and datasets are often complex for different users. In this study, we introduce the LLMeBench framework, which can be seamlessly customized to evaluate LLMs for any NLP task, regardless of language. The framework features generic dataset loaders, several model providers, and pre-implements most standard evaluation metrics. It supports in-context learning with zero- and few-shot settings. A specific dataset and task can be evaluated for a given LLM in less than 20 lines of code while allowing full flexibility to extend the framework for custom datasets, models, or tasks. The framework has been tested on 31 unique NLP tasks using 53 publicly available datasets within 90 experimental setups, involving approximately 296K data points. We open-sourced LLMeBench for the community (https://github.com/qcri/LLMeBench/) and a video demonstrating the framework is available online. (https://youtu.be/9cC2m_abk3A)
△ Less
Submitted 26 February, 2024; v1 submitted 9 August, 2023;
originally announced August 2023.
-
MyVoice: Arabic Speech Resource Collaboration Platform
Authors:
Yousseif Elshahawy,
Yassine El Kheir,
Shammur Absar Chowdhury,
Ahmed Ali
Abstract:
We introduce MyVoice, a crowdsourcing platform designed to collect Arabic speech to enhance dialectal speech technologies. This platform offers an opportunity to design large dialectal speech datasets; and makes them publicly available. MyVoice allows contributors to select city/country-level fine-grained dialect and record the displayed utterances. Users can switch roles between contributors and…
▽ More
We introduce MyVoice, a crowdsourcing platform designed to collect Arabic speech to enhance dialectal speech technologies. This platform offers an opportunity to design large dialectal speech datasets; and makes them publicly available. MyVoice allows contributors to select city/country-level fine-grained dialect and record the displayed utterances. Users can switch roles between contributors and annotators. The platform incorporates a quality assurance system that filters out low-quality and spurious recordings before sending them for validation. During the validation phase, contributors can assess the quality of recordings, annotate them, and provide feedback which is then reviewed by administrators. Furthermore, the platform offers flexibility to admin roles to add new data or tasks beyond dialectal speech and word collection, which are displayed to contributors. Thus, enabling collaborative efforts in gathering diverse and large Arabic speech data.
△ Less
Submitted 23 July, 2023;
originally announced August 2023.
-
Replicability Study: Corpora For Understanding Simulink Models & Projects
Authors:
Sohil Lal Shrestha,
Shafiul Azam Chowdhury,
Christoph Csallner
Abstract:
Background: Empirical studies on widely used model-based development tools such as MATLAB/Simulink are limited despite the tools' importance in various industries.
Aims: The aim of this paper is to investigate the reproducibility of previous empirical studies that used Simulink model corpora and to evaluate the generalizability of their results to a newer and larger corpus, including a compariso…
▽ More
Background: Empirical studies on widely used model-based development tools such as MATLAB/Simulink are limited despite the tools' importance in various industries.
Aims: The aim of this paper is to investigate the reproducibility of previous empirical studies that used Simulink model corpora and to evaluate the generalizability of their results to a newer and larger corpus, including a comparison with proprietary models.
Method: The study reviews methodologies and data sources employed in prior Simulink model studies and replicates the previous analysis using SLNET. In addition, we propose a heuristic for determining code-generating Simulink models and assess the open-source models' similarity to proprietary models.
Results: Our analysis of SLNET confirms and contradicts earlier findings and highlights its potential as a valuable resource for model-based development research. We found that open-source Simulink models follow good modeling practices and contain models comparable in size and properties to proprietary models. We also collected and distribute 208 git repositories with over 9k commits, facilitating studies on model evolution.
Conclusions: The replication study offers actionable insights and lessons learned from the reproduction process, including valuable information on the generalizability of research findings based on earlier open-source corpora to the newer and larger SLNET corpus. The study sheds light on noteworthy attributes of SLNET, which is self-contained and redistributable.
△ Less
Submitted 9 August, 2023; v1 submitted 3 August, 2023;
originally announced August 2023.
-
Multi-View Multi-Task Representation Learning for Mispronunciation Detection
Authors:
Yassine El Kheir,
Shammur Absar Chowdhury,
Ahmed Ali
Abstract:
The disparity in phonology between learner's native (L1) and target (L2) language poses a significant challenge for mispronunciation detection and diagnosis (MDD) systems. This challenge is further intensified by lack of annotated L2 data. This paper proposes a novel MDD architecture that exploits multiple `views' of the same input data assisted by auxiliary tasks to learn more distinctive phoneti…
▽ More
The disparity in phonology between learner's native (L1) and target (L2) language poses a significant challenge for mispronunciation detection and diagnosis (MDD) systems. This challenge is further intensified by lack of annotated L2 data. This paper proposes a novel MDD architecture that exploits multiple `views' of the same input data assisted by auxiliary tasks to learn more distinctive phonetic representation in a low-resource setting. Using the mono- and multilingual encoders, the model learn multiple views of the input, and capture the sound properties across diverse languages and accents. These encoded representations are further enriched by learning articulatory features in a multi-task setup. Our reported results using the L2-ARCTIC data outperformed the SOTA models, with a phoneme error rate reduction of 11.13% and 8.60% and absolute F1 score increase of 5.89%, and 2.49% compared to the single-view mono- and multilingual systems, with a limited L2 dataset.
△ Less
Submitted 7 August, 2023; v1 submitted 2 June, 2023;
originally announced June 2023.
-
LAraBench: Benchmarking Arabic AI with Large Language Models
Authors:
Ahmed Abdelali,
Hamdy Mubarak,
Shammur Absar Chowdhury,
Maram Hasanain,
Basel Mousi,
Sabri Boughorbel,
Yassine El Kheir,
Daniel Izham,
Fahim Dalvi,
Majd Hawasly,
Nizi Nazar,
Yousseif Elshahawy,
Ahmed Ali,
Nadir Durrani,
Natasa Milic-Frayling,
Firoj Alam
Abstract:
Recent advancements in Large Language Models (LLMs) have significantly influenced the landscape of language and speech research. Despite this progress, these models lack specific benchmarking against state-of-the-art (SOTA) models tailored to particular languages and tasks. LAraBench addresses this gap for Arabic Natural Language Processing (NLP) and Speech Processing tasks, including sequence tag…
▽ More
Recent advancements in Large Language Models (LLMs) have significantly influenced the landscape of language and speech research. Despite this progress, these models lack specific benchmarking against state-of-the-art (SOTA) models tailored to particular languages and tasks. LAraBench addresses this gap for Arabic Natural Language Processing (NLP) and Speech Processing tasks, including sequence tagging and content classification across different domains. We utilized models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM, employing zero and few-shot learning techniques to tackle 33 distinct tasks across 61 publicly available datasets. This involved 98 experimental setups, encompassing ~296K data points, ~46 hours of speech, and 30 sentences for Text-to-Speech (TTS). This effort resulted in 330+ sets of experiments. Our analysis focused on measuring the performance gap between SOTA models and LLMs. The overarching trend observed was that SOTA models generally outperformed LLMs in zero-shot learning, with a few exceptions. Notably, larger computational models with few-shot learning techniques managed to reduce these performance gaps. Our findings provide valuable insights into the applicability of LLMs for Arabic NLP and speech processing tasks.
△ Less
Submitted 5 February, 2024; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Automated Grain Boundary (GB) Segmentation and Microstructural Analysis in 347H Stainless Steel Using Deep Learning and Multimodal Microscopy
Authors:
Shoieb Ahmed Chowdhury,
M. F. N. Taufique,
**g Wang,
Marissa Masden,
Madison Wenzlick,
Ram Devanathan,
Alan L Schemer-Kohrn,
Keerti S Kappagantula
Abstract:
Austenitic 347H stainless steel offers superior mechanical properties and corrosion resistance required for extreme operating conditions such as high temperature. The change in microstructure due to composition and process variations is expected to impact material properties. Identifying microstructural features such as grain boundaries thus becomes an important task in the process-microstructure-…
▽ More
Austenitic 347H stainless steel offers superior mechanical properties and corrosion resistance required for extreme operating conditions such as high temperature. The change in microstructure due to composition and process variations is expected to impact material properties. Identifying microstructural features such as grain boundaries thus becomes an important task in the process-microstructure-properties loop. Applying convolutional neural network (CNN) based deep-learning models is a powerful technique to detect features from material micrographs in an automated manner. Manual labeling of the images for the segmentation task poses a major bottleneck for generating training data and labels in a reliable and reproducible way within a reasonable timeframe. In this study, we attempt to overcome such limitations by utilizing multi-modal microscopy to generate labels directly instead of manual labeling. We combine scanning electron microscopy (SEM) images of 347H stainless steel as training data and electron backscatter diffraction (EBSD) micrographs as pixel-wise labels for grain boundary detection as a semantic segmentation task. We demonstrate that despite producing instrumentation drift during data collection between two modes of microscopy, this method performs comparably to similar segmentation tasks that used manual labeling. Additionally, we find that naïve pixel-wise segmentation results in small gaps and missing boundaries in the predicted grain boundary map. By incorporating topological information during model training, the connectivity of the grain boundary network and segmentation performance is improved. Finally, our approach is validated by accurate computation on downstream tasks of predicting the underlying grain morphology distributions which are the ultimate quantities of interest for microstructural characterization.
△ Less
Submitted 12 May, 2023;
originally announced May 2023.
-
QVoice: Arabic Speech Pronunciation Learning Application
Authors:
Yassine El Kheir,
Fouad Khnaisser,
Shammur Absar Chowdhury,
Hamdy Mubarak,
Shazia Afzal,
Ahmed Ali
Abstract:
This paper introduces a novel Arabic pronunciation learning application QVoice, powered with end-to-end mispronunciation detection and feedback generator module. The application is designed to support non-native Arabic speakers in enhancing their pronunciation skills, while also hel** native speakers mitigate any potential influence from regional dialects on their Modern Standard Arabic (MSA) pr…
▽ More
This paper introduces a novel Arabic pronunciation learning application QVoice, powered with end-to-end mispronunciation detection and feedback generator module. The application is designed to support non-native Arabic speakers in enhancing their pronunciation skills, while also hel** native speakers mitigate any potential influence from regional dialects on their Modern Standard Arabic (MSA) pronunciation. QVoice employs various learning cues to aid learners in comprehending meaning, drawing connections with their existing knowledge of English language, and offers detailed feedback for pronunciation correction, along with contextual examples showcasing word usage. The learning cues featured in QVoice encompass a wide range of meaningful information, such as visualizations of phrases/words and their translations, as well as phonetic transcriptions and transliterations. QVoice provides pronunciation feedback at the character level and assesses performance at the word level.
△ Less
Submitted 9 May, 2023;
originally announced May 2023.
-
Multilingual Word Error Rate Estimation: e-WER3
Authors:
Shammur Absar Chowdhury,
Ahmed Ali
Abstract:
The success of the multilingual automatic speech recognition systems empowered many voice-driven applications. However, measuring the performance of such systems remains a major challenge, due to its dependency on manually transcribed speech data in both mono- and multilingual scenarios. In this paper, we propose a novel multilingual framework -- eWER3 -- jointly trained on acoustic and lexical re…
▽ More
The success of the multilingual automatic speech recognition systems empowered many voice-driven applications. However, measuring the performance of such systems remains a major challenge, due to its dependency on manually transcribed speech data in both mono- and multilingual scenarios. In this paper, we propose a novel multilingual framework -- eWER3 -- jointly trained on acoustic and lexical representation to estimate word error rate. We demonstrate the effectiveness of eWER3 to (i) predict WER without using any internal states from the ASR and (ii) use the multilingual shared latent space to push the performance of the close-related languages. We show our proposed multilingual model outperforms the previous monolingual word error rate estimation method (eWER2) by an absolute 9\% increase in Pearson correlation coefficient (PCC), with better overall estimation between the predicted and reference WER.
△ Less
Submitted 2 April, 2023;
originally announced April 2023.
-
SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation
Authors:
Yassine El Kheir,
Shammur Absar Chowdhury,
Ahmed Ali,
Hamdy Mubarak,
Shazia Afzal
Abstract:
The lack of labeled second language (L2) speech data is a major challenge in designing mispronunciation detection models. We introduce SpeechBlender - a fine-grained data augmentation pipeline for generating mispronunciation errors to overcome such data scarcity. The SpeechBlender utilizes varieties of masks to target different regions of phonetic units, and use the mixing factors to linearly inte…
▽ More
The lack of labeled second language (L2) speech data is a major challenge in designing mispronunciation detection models. We introduce SpeechBlender - a fine-grained data augmentation pipeline for generating mispronunciation errors to overcome such data scarcity. The SpeechBlender utilizes varieties of masks to target different regions of phonetic units, and use the mixing factors to linearly interpolate raw speech signals while augmenting pronunciation. The masks facilitate smooth blending of the signals, generating more effective samples than the `Cut/Paste' method. Our proposed technique achieves state-of-the-art results, with Speechocean762, on ASR dependent mispronunciation detection models at phoneme level, with a 2.0% gain in Pearson Correlation Coefficient (PCC) compared to the previous state-of-the-art [1]. Additionally, we demonstrate a 5.0% improvement at the phoneme level compared to our baseline. We also observed a 4.6% increase in F1-score with Arabic AraVoiceL2 testset.
△ Less
Submitted 12 July, 2023; v1 submitted 2 November, 2022;
originally announced November 2022.
-
What can Speech and Language Tell us About the Working Alliance in Psychotherapy
Authors:
Sebastian P. Bayerl,
Gabriel Roccabruna,
Shammur Absar Chowdhury,
Tommaso Ciulli,
Morena Danieli,
Korbinian Riedhammer,
Giuseppe Riccardi
Abstract:
We are interested in the problem of conversational analysis and its application to the health domain. Cognitive Behavioral Therapy is a structured approach in psychotherapy, allowing the therapist to help the patient to identify and modify the malicious thoughts, behavior, or actions. This cooperative effort can be evaluated using the Working Alliance Inventory Observer-rated Shortened - a 12 item…
▽ More
We are interested in the problem of conversational analysis and its application to the health domain. Cognitive Behavioral Therapy is a structured approach in psychotherapy, allowing the therapist to help the patient to identify and modify the malicious thoughts, behavior, or actions. This cooperative effort can be evaluated using the Working Alliance Inventory Observer-rated Shortened - a 12 items inventory covering task, goal, and relationship - which has a relevant influence on therapeutic outcomes. In this work, we investigate the relation between this alliance inventory and the spoken conversations (sessions) between the patient and the psychotherapist. We have delivered eight weeks of e-therapy, collected their audio and video call sessions, and manually transcribed them. The spoken conversations have been annotated and evaluated with WAI ratings by professional therapists. We have investigated speech and language features and their association with WAI items. The feature types include turn dynamics, lexical entrainment, and conversational descriptors extracted from the speech and language signals. Our findings provide strong evidence that a subset of these features are strong indicators of working alliance. To the best of our knowledge, this is the first and a novel study to exploit speech and language for characterising working alliance.
△ Less
Submitted 27 June, 2022; v1 submitted 17 June, 2022;
originally announced June 2022.
-
An Empirical Study on Maintainable Method Size in Java
Authors:
Shaiful Alam Chowdhury,
Gias Uddin,
Reid Holmes
Abstract:
Code metrics have been widely used to estimate software maintenance effort. Metrics have generally been used to guide developer effort to reduce or avoid future maintenance burdens. Size is the simplest and most widely deployed metric. The size metric is pervasive because size correlates with many other common metrics (e.g., McCabe complexity, readability, etc.). Given the ease of computing a meth…
▽ More
Code metrics have been widely used to estimate software maintenance effort. Metrics have generally been used to guide developer effort to reduce or avoid future maintenance burdens. Size is the simplest and most widely deployed metric. The size metric is pervasive because size correlates with many other common metrics (e.g., McCabe complexity, readability, etc.). Given the ease of computing a method's size, and the ubiquity of these metrics in industrial settings, it is surprising that no systematic study has been performed to provide developers with meaningful method size guidelines with respect to future maintenance effort. In this paper we examine the evolution of around 785K Java methods and show that developers should strive to keep their Java methods under 24 lines in length. Additionally, we show that decomposing larger methods to smaller methods also decreases overall maintenance efforts. Taken together, these findings provide empirical guidelines to help developers design their systems in a way that can reduce future maintenance.
△ Less
Submitted 3 May, 2022;
originally announced May 2022.
-
SLNET: A Redistributable Corpus of 3rd-party Simulink Models
Authors:
Sohil Lal Shrestha,
Shafiul Azam Chowdhury,
Christoph Csallner
Abstract:
MATLAB/Simulink is widely used for model-based design. Engineers create Simulink models and compile them to embedded code, often to control safety-critical cyber-physical systems in automotive, aerospace, and healthcare applications. Despite Simulink's importance, there are few large-scale empirical Simulink studies, perhaps because there is no large readily available corpus of third-party open-so…
▽ More
MATLAB/Simulink is widely used for model-based design. Engineers create Simulink models and compile them to embedded code, often to control safety-critical cyber-physical systems in automotive, aerospace, and healthcare applications. Despite Simulink's importance, there are few large-scale empirical Simulink studies, perhaps because there is no large readily available corpus of third-party open-source Simulink models. To enable empirical Simulink studies, this paper introduces SLNET, the largest corpus of freely available third-party Simulink models. SLNET has several advantages over earlier collections. Specifically, SLNET is 8 times larger than the largest previous corpus of Simulink models, includes fine-grained metadata, is constructed automatically, is self-contained, and allows redistribution. SLNET is available under permissive open-source licenses and contains all of its collection and analysis tools.
△ Less
Submitted 31 March, 2022;
originally announced March 2022.
-
ArabGend: Gender Analysis and Inference on Arabic Twitter
Authors:
Hamdy Mubarak,
Shammur Absar Chowdhury,
Firoj Alam
Abstract:
Gender analysis of Twitter can reveal important socio-cultural differences between male and female users. There has been a significant effort to analyze and automatically infer gender in the past for most widely spoken languages' content, however, to our knowledge very limited work has been done for Arabic. In this paper, we perform an extensive analysis of differences between male and female user…
▽ More
Gender analysis of Twitter can reveal important socio-cultural differences between male and female users. There has been a significant effort to analyze and automatically infer gender in the past for most widely spoken languages' content, however, to our knowledge very limited work has been done for Arabic. In this paper, we perform an extensive analysis of differences between male and female users on the Arabic Twitter-sphere. We study differences in user engagement, topics of interest, and the gender gap in professions. Along with gender analysis, we also propose a method to infer gender by utilizing usernames, profile pictures, tweets, and networks of friends. In order to do so, we manually annotated gender and locations for ~166K Twitter accounts associated with ~92K user location, which we plan to make publicly available at http://anonymous.com. Our proposed gender inference method achieve an F1 score of 82.1%, which is 47.3% higher than majority baseline. In addition, we also developed a demo and made it publicly available.
△ Less
Submitted 1 March, 2022;
originally announced March 2022.
-
Emojis as Anchors to Detect Arabic Offensive Language and Hate Speech
Authors:
Hamdy Mubarak,
Sabit Hassan,
Shammur Absar Chowdhury
Abstract:
We introduce a generic, language-independent method to collect a large percentage of offensive and hate tweets regardless of their topics or genres. We harness the extralinguistic information embedded in the emojis to collect a large number of offensive tweets. We apply the proposed method on Arabic tweets and compare it with English tweets - analysing key cultural differences. We observed a const…
▽ More
We introduce a generic, language-independent method to collect a large percentage of offensive and hate tweets regardless of their topics or genres. We harness the extralinguistic information embedded in the emojis to collect a large number of offensive tweets. We apply the proposed method on Arabic tweets and compare it with English tweets - analysing key cultural differences. We observed a constant usage of these emojis to represent offensiveness throughout different timespans on Twitter. We manually annotate and publicly release the largest Arabic dataset for offensive, fine-grained hate speech, vulgar and violence content. Furthermore, we benchmark the dataset for detecting offensiveness and hate speech using different transformer architectures and perform in-depth linguistic analysis. We evaluate our models on external datasets - a Twitter dataset collected using a completely different method, and a multi-platform dataset containing comments from Twitter, YouTube and Facebook, for assessing generalization capability. Competitive results on these datasets suggest that the data collected using our method captures universal characteristics of offensive language. Our findings also highlight the common words used in offensive communications, common targets for hate speech, specific patterns in violence tweets; and pinpoint common classification errors that can be attributed to limitations of NLP models. We observe that even state-of-the-art transformer models may fail to take into account culture, background and context or understand nuances present in real-world data such as sarcasm.
△ Less
Submitted 18 May, 2022; v1 submitted 17 January, 2022;
originally announced January 2022.
-
ArCovidVac: Analyzing Arabic Tweets About COVID-19 Vaccination
Authors:
Hamdy Mubarak,
Sabit Hassan,
Shammur Absar Chowdhury,
Firoj Alam
Abstract:
The emergence of the COVID-19 pandemic and the first global infodemic have changed our lives in many different ways. We relied on social media to get the latest information about the COVID-19 pandemic and at the same time to disseminate information. The content in social media consisted not only health related advises, plans, and informative news from policy makers, but also contains conspiracies…
▽ More
The emergence of the COVID-19 pandemic and the first global infodemic have changed our lives in many different ways. We relied on social media to get the latest information about the COVID-19 pandemic and at the same time to disseminate information. The content in social media consisted not only health related advises, plans, and informative news from policy makers, but also contains conspiracies and rumors. It became important to identify such information as soon as they are posted to make actionable decisions (e.g., debunking rumors, or taking certain measures for traveling). To address this challenge, we develop and publicly release the first largest manually annotated Arabic tweet dataset, ArCovidVac, for the COVID-19 vaccination campaign, covering many countries in the Arab region. The dataset is enriched with different layers of annotation, including, (i) Informativeness (more vs. less importance of the tweets); (ii) fine-grained tweet content types (e.g., advice, rumors, restriction, authenticate news/information); and (iii) stance towards vaccination (pro-vaccination, neutral, anti-vaccination). Further, we performed in-depth analysis of the data, exploring the popularity of different vaccines, trending hashtags, topics and presence of offensiveness in the tweets. We studied the data for individual types of tweets and temporal changes in stance towards vaccine. We benchmarked the ArCovidVac dataset using transformer architectures for informativeness, content types, and stance detection.
△ Less
Submitted 17 January, 2022;
originally announced January 2022.
-
Textual Data Augmentation for Arabic-English Code-Switching Speech Recognition
Authors:
Amir Hussein,
Shammur Absar Chowdhury,
Ahmed Abdelali,
Najim Dehak,
Ahmed Ali,
Sanjeev Khudanpur
Abstract:
The pervasiveness of intra-utterance code-switching (CS) in spoken content requires that speech recognition (ASR) systems handle mixed language. Designing a CS-ASR system has many challenges, mainly due to data scarcity, grammatical structure complexity, and domain mismatch. The most common method for addressing CS is to train an ASR system with the available transcribed CS speech, along with mono…
▽ More
The pervasiveness of intra-utterance code-switching (CS) in spoken content requires that speech recognition (ASR) systems handle mixed language. Designing a CS-ASR system has many challenges, mainly due to data scarcity, grammatical structure complexity, and domain mismatch. The most common method for addressing CS is to train an ASR system with the available transcribed CS speech, along with monolingual data. In this work, we propose a zero-shot learning methodology for CS-ASR by augmenting the monolingual data with artificially generating CS text. We based our approach on random lexical replacements and Equivalence Constraint (EC) while exploiting aligned translation pairs to generate random and grammatically valid CS content. Our empirical results show a 65.5% relative reduction in language model perplexity, and 7.7% in ASR WER on two ecologically valid CS test sets. The human evaluation of the generated text using EC suggests that more than 80% is of adequate quality.
△ Less
Submitted 11 January, 2023; v1 submitted 7 January, 2022;
originally announced January 2022.
-
A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models
Authors:
Firoj Alam,
Arid Hasan,
Tanvirul Alam,
Akib Khan,
Janntatul Tajrin,
Naira Khan,
Shammur Absar Chowdhury
Abstract:
Bangla -- ranked as the 6th most widely spoken language across the world (https://www.ethnologue.com/guides/ethnologue200), with 230 million native speakers -- is still considered as a low-resource language in the natural language processing (NLP) community. With three decades of research, Bangla NLP (BNLP) is still lagging behind mainly due to the scarcity of resources and the challenges that com…
▽ More
Bangla -- ranked as the 6th most widely spoken language across the world (https://www.ethnologue.com/guides/ethnologue200), with 230 million native speakers -- is still considered as a low-resource language in the natural language processing (NLP) community. With three decades of research, Bangla NLP (BNLP) is still lagging behind mainly due to the scarcity of resources and the challenges that come with it. There is sparse work in different areas of BNLP; however, a thorough survey reporting previous work and recent advances is yet to be done. In this study, we first provide a review of Bangla NLP tasks, resources, and tools available to the research community; we benchmark datasets collected from various platforms for nine NLP tasks using current state-of-the-art algorithms (i.e., transformer-based models). We provide comparative results for the studied NLP tasks by comparing monolingual vs. multilingual models of varying sizes. We report our results using both individual and consolidated datasets and provide data splits for future research. We reviewed a total of 108 papers and conducted 175 sets of experiments. Our results show promising performance using transformer-based models while highlighting the trade-off with computational costs. We hope that such a comprehensive survey will motivate the community to build on and further advance the research on Bangla NLP.
△ Less
Submitted 25 July, 2021; v1 submitted 8 July, 2021;
originally announced July 2021.
-
What do End-to-End Speech Models Learn about Speaker, Language and Channel Information? A Layer-wise and Neuron-level Analysis
Authors:
Shammur Absar Chowdhury,
Nadir Durrani,
Ahmed Ali
Abstract:
Deep neural networks are inherently opaque and challenging to interpret. Unlike hand-crafted feature-based models, we struggle to comprehend the concepts learned and how they interact within these models. This understanding is crucial not only for debugging purposes but also for ensuring fairness in ethical decision-making. In our study, we conduct a post-hoc functional interpretability analysis o…
▽ More
Deep neural networks are inherently opaque and challenging to interpret. Unlike hand-crafted feature-based models, we struggle to comprehend the concepts learned and how they interact within these models. This understanding is crucial not only for debugging purposes but also for ensuring fairness in ethical decision-making. In our study, we conduct a post-hoc functional interpretability analysis of pretrained speech models using the probing framework [1]. Specifically, we analyze utterance-level representations of speech models trained for various tasks such as speaker recognition and dialect identification. We conduct layer and neuron-wise analyses, probing for speaker, language, and channel properties. Our study aims to answer the following questions: i) what information is captured within the representations? ii) how is it represented and distributed? and iii) can we identify a minimal subset of the network that possesses this information?
Our results reveal several novel findings, including: i) channel and gender information are distributed across the network, ii) the information is redundantly available in neurons with respect to a task, iii) complex properties such as dialectal information are encoded only in the task-oriented pretrained network, iv) and is localised in the upper layers, v) we can extract a minimal subset of neurons encoding the pre-defined property, vi) salient neurons are sometimes shared between properties, vii) our analysis highlights the presence of biases (for example gender) in the network. Our cross-architectural comparison indicates that: i) the pretrained models capture speaker-invariant information, and ii) CNN models are competitive with Transformer models in encoding various understudied properties.
△ Less
Submitted 10 July, 2023; v1 submitted 1 July, 2021;
originally announced July 2021.
-
QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus
Authors:
Hamdy Mubarak,
Amir Hussein,
Shammur Absar Chowdhury,
Ahmed Ali
Abstract:
We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16kHz crawled from Aljazeera news channel. The dataset is released with lightly supervised transcriptions, aligned with the audio segments. Unlike previous datasets, QASR contains linguistically motivated segmentation, pun…
▽ More
We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16kHz crawled from Aljazeera news channel. The dataset is released with lightly supervised transcriptions, aligned with the audio segments. Unlike previous datasets, QASR contains linguistically motivated segmentation, punctuation, speaker information among others. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics- based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data. In addition to QASR transcription, we release a dataset of 130M words to aid in designing and training a better language model. We show that end-to-end automatic speech recognition trained on QASR reports a competitive word error rate compared to the previous MGB-2 corpus. We report baseline results for downstream natural language processing tasks such as named entity recognition using speech transcript. We also report the first baseline for Arabic punctuation restoration. We make the corpus available for the research community.
△ Less
Submitted 24 June, 2021;
originally announced June 2021.
-
Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR
Authors:
Shammur Absar Chowdhury,
Amir Hussein,
Ahmed Abdelali,
Ahmed Ali
Abstract:
With the advent of globalization, there is an increasing demand for multilingual automatic speech recognition (ASR), handling language and dialectal variation of spoken content. Recent studies show its efficacy over monolingual systems. In this study, we design a large multilingual end-to-end ASR using self-attention based conformer architecture. We trained the system using Arabic (Ar), English (E…
▽ More
With the advent of globalization, there is an increasing demand for multilingual automatic speech recognition (ASR), handling language and dialectal variation of spoken content. Recent studies show its efficacy over monolingual systems. In this study, we design a large multilingual end-to-end ASR using self-attention based conformer architecture. We trained the system using Arabic (Ar), English (En) and French (Fr) languages. We evaluate the system performance handling: (i) monolingual (Ar, En and Fr); (ii) multi-dialectal (Modern Standard Arabic, along with dialectal variation such as Egyptian and Moroccan); (iii) code-switching -- cross-lingual (Ar-En/Fr) and dialectal (MSA-Egyptian dialect) test cases, and compare with current state-of-the-art systems. Furthermore, we investigate the influence of different embedding/character representations including character vs word-piece; shared vs distinct input symbol per language. Our findings demonstrate the strength of such a model by outperforming state-of-the-art monolingual dialectal Arabic and code-switching Arabic ASR.
△ Less
Submitted 5 July, 2021; v1 submitted 31 May, 2021;
originally announced May 2021.
-
Sentiment Classification in Bangla Textual Content: A Comparative Study
Authors:
Md. Arid Hasan,
Jannatul Tajrin,
Shammur Absar Chowdhury,
Firoj Alam
Abstract:
Sentiment analysis has been widely used to understand our views on social and political agendas or user experiences over a product. It is one of the cores and well-researched areas in NLP. However, for low-resource languages, like Bangla, one of the prominent challenge is the lack of resources. Another important limitation, in the current literature for Bangla, is the absence of comparable results…
▽ More
Sentiment analysis has been widely used to understand our views on social and political agendas or user experiences over a product. It is one of the cores and well-researched areas in NLP. However, for low-resource languages, like Bangla, one of the prominent challenge is the lack of resources. Another important limitation, in the current literature for Bangla, is the absence of comparable results due to the lack of a well-defined train/test split. In this study, we explore several publicly available sentiment labeled datasets and designed classifiers using both classical and deep learning algorithms. In our study, the classical algorithms include SVM and Random Forest, and deep learning algorithms include CNN, FastText, and transformer-based models. We compare these models in terms of model performance and time-resource complexity. Our finding suggests transformer-based models, which have not been explored earlier for Bangla, outperform all other models. Furthermore, we created a weighted list of lexicon content based on the valence score per class. We then analyzed the content for high significance entries per class, in the datasets. For reproducibility, we make publicly available data splits and the ranked lexicon list. The presented results can be used for future studies as a benchmark.
△ Less
Submitted 19 November, 2020;
originally announced November 2020.
-
Discovering Users Topic of Interest from Tweet
Authors:
Muhammad Kamal Hossen,
Md. Ali Faiad,
Md. Shahnur Azad Chowdhury,
Md. Sajjatul Islam
Abstract:
Nowadays social media has become one of the largest gatherings of people in online. There are many ways for the industries to promote their products to the public through advertising. The variety of advertisement is increasing dramatically. Businessmen are so much dependent on the advertisement that significantly it really brought out success in the market and hence practiced by major industries.…
▽ More
Nowadays social media has become one of the largest gatherings of people in online. There are many ways for the industries to promote their products to the public through advertising. The variety of advertisement is increasing dramatically. Businessmen are so much dependent on the advertisement that significantly it really brought out success in the market and hence practiced by major industries. Thus, companies are trying hard to draw the attention of customers on social networks through online advertisement. One of the most popular social media is Twitter which is popular for short text sharing named Tweet. People here create their profile with basic information. To ensure the advertisements are shown to relative people, Twitter targets people based on language, gender, interest, follower, device, behavior, tailored audiences, keyword, and geography targeting. Twitter generates interest sets based on their activities on Twitter. What our framework does is that it determines the topic of interest from a given list of Tweets if it has any. This process is called Entity Intersect Categorizing Value (EICV). Each category topic generates a set of words or phrases related to that topic. An entity set is created from processing tweets by keyword generation and Twitters data using Twitter API. Value of entities is matched with the set of categories. If they cross a threshold value, it results in the category which matched the desired interest category. For smaller amounts of data sizes, the results show that our framework performs with higher accuracy rate.
△ Less
Submitted 10 March, 2018;
originally announced March 2018.
-
Depression Severity Estimation from Multiple Modalities
Authors:
Evgeny Stepanov,
Stephane Lathuiliere,
Shammur Absar Chowdhury,
Arindam Ghosh,
Radu-Laurentiu Vieriu,
Nicu Sebe,
Giuseppe Riccardi
Abstract:
Depression is a major debilitating disorder which can affect people from all ages. With a continuous increase in the number of annual cases of depression, there is a need to develop automatic techniques for the detection of the presence and extent of depression. In this AVEC challenge we explore different modalities (speech, language and visual features extracted from face) to design and develop a…
▽ More
Depression is a major debilitating disorder which can affect people from all ages. With a continuous increase in the number of annual cases of depression, there is a need to develop automatic techniques for the detection of the presence and extent of depression. In this AVEC challenge we explore different modalities (speech, language and visual features extracted from face) to design and develop automatic methods for the detection of depression. In psychology literature, the PHQ-8 questionnaire is well established as a tool for measuring the severity of depression. In this paper we aim to automatically predict the PHQ-8 scores from features extracted from the different modalities. We show that visual features extracted from facial landmarks obtain the best performance in terms of estimating the PHQ-8 results with a mean absolute error (MAE) of 4.66 on the development set. Behavioral characteristics from speech provide an MAE of 4.73. Language features yield a slightly higher MAE of 5.17. When switching to the test set, our Turn Features derived from audio transcriptions achieve the best performance, scoring an MAE of 4.11 (corresponding to an RMSE of 4.94), which makes our system the winner of the AVEC 2017 depression sub-challenge.
△ Less
Submitted 10 November, 2017;
originally announced November 2017.