Search | arXiv e-print repository

AVR: Synergizing Foundation Models for Audio-Visual Humor Detection

Authors: Sarthak Sharma, Orchid Chetia Phukan, Drishti Singh, Arun Balaji Buduru, Rajesh Sharma

Abstract: In this work, we present AVR, an application for audio-visual humor detection. While humor detection has traditionally centered around textual analysis, recent advancements have spotlighted multimodal approaches. However, these methods lean on textual cues as a modality, necessitating the use of ASR systems for transcribing the audio data. This heavy reliance on ASR accuracy can pose challenges in real-world applications. To address this bottleneck, we propose an innovative audio-visual humor detection system that circumvents textual reliance, eliminating the need for ASR models. Instead, the proposed approach hinges on the intricate interplay between audio and visual content for effective humor detection.

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted to INTERSPEECH 2024 Show & Tell Demonstrations

arXiv:2406.09156 [pdf, other]

Towards Multilingual Audio-Visual Question Answering

Authors: Orchid Chetia Phukan, Priyabrata Mallick, Swarup Ranjan Behera, Aalekhya Satya Narayani, Arun Balaji Buduru, Rajesh Sharma

Abstract: In this paper, we work towards extending Audio-Visual Question Answering (AVQA) to multilingual settings. Existing AVQA research has predominantly revolved around English, and replicating it for AVQA in other languages requires a substantial allocation of resources. As a scalable solution, we leverage machine translation and present two multilingual AVQA datasets for eight languages created from existing benchmark AVQA datasets. This avoids the extra human annotation effort of collecting questions and answers manually. To this end, we propose the MERA framework, which leverages state-of-the-art (SOTA) video, audio, and textual foundation models for AVQA in multiple languages. We introduce a suite of models, namely MERA-L, MERA-C, and MERA-T, with varied model architectures to benchmark the proposed datasets. We believe our work will open new research directions and act as a reference benchmark for future works in multilingual AVQA.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech 2024

MSC Class: 68T45

arXiv:2406.06798 [pdf, other]

The Reasonable Effectiveness of Speaker Embeddings for Violence Detection

Authors: Sarthak Jain, Orchid Chetia Phukan, Arun Balaji Buduru, Rajesh Sharma

Abstract: In this paper, we focus on audio violence detection (AVD). AVD is necessary for several reasons, especially in the context of maintaining safety, preventing harm, and ensuring security in various environments. This calls for accurate AVD systems. As in many related applications in audio processing, the most common approach for improving performance is to leverage self-supervised (SSL) pre-trained models (PTMs). However, these SSL models are very large, with millions of parameters, which can hinder real-world deployment, especially in compute-constrained environments. To resolve this, we propose the use of speaker recognition models, which are much smaller than the SSL models. Experimenting with speaker recognition model embeddings with SVM and Random Forest as classifiers, we show that speaker recognition model embeddings perform the best in comparison to state-of-the-art (SOTA) SSL models and achieve SOTA results.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted to INTERSPEECH 24 Show & Tell Demonstrations

arXiv:2406.06781 [pdf, other]

PERSONA: An Application for Emotion Recognition, Gender Recognition and Age Estimation

Authors: Devyani Koshal, Orchid Chetia Phukan, Sarthak Jain, Arun Balaji Buduru, Rajesh Sharma

Abstract: Emotion Recognition (ER), Gender Recognition (GR), and Age Estimation (AE) constitute paralinguistic tasks that rely not on the spoken content but primarily on speech characteristics such as pitch and tone. While previous research has made significant strides in developing models for each task individually, there has been comparatively less emphasis on concurrently learning these tasks, despite their inherent interconnectedness. In this demonstration, we present PERSONA, an application for predicting ER, GR, and AE with a single model in the backend. Notably, through a comparative study, we show that representations from a speaker recognition pre-trained model (PTM) are better suited to such a multi-task learning format than those from the state-of-the-art (SOTA) self-supervised (SSL) PTM. Our methodology obviates the need for deploying separate models for each task and can potentially conserve resources and time during the training and deployment phases.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted to INTERSPEECH 2024 Show & Tell Demonstrations

arXiv:2406.06774 [pdf, other]

ComFeAT: Combination of Neural and Spectral Features for Improved Depression Detection

Authors: Orchid Chetia Phukan, Sarthak Jain, Shubham Singh, Muskaan Singh, Arun Balaji Buduru, Rajesh Sharma

Abstract: In this work, we focus on the detection of depression through speech analysis. Previous research has widely explored features extracted from pre-trained models (PTMs) primarily trained for paralinguistic tasks. Although these features have led to sufficient advances in speech-based depression detection, their performance declines in real-world settings. To address this, we introduce ComFeAT, an application that employs a CNN model trained on a combination of features extracted from PTMs (a.k.a. neural features) and spectral features to enhance depression detection. Spectral features are robust to domain variations but do not perform as well as neural features. Surprisingly, combining them shows complementary behavior and improves over both neural and spectral features individually. The proposed method also improves over previous state-of-the-art (SOTA) works on the E-DAIC benchmark.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted to INTERSPEECH 2024 Show & Tell Demonstrations

arXiv:2406.03514 [pdf, other]

NeuRO: An Application for Code-Switched Autism Detection in Children

Authors: Mohd Mujtaba Akhtar, Girish, Orchid Chetia Phukan, Muskaan Singh

Abstract: Code-switching is a common communication phenomenon where individuals alternate between two or more languages or linguistic styles within a single conversation. Autism Spectrum Disorder (ASD) is a developmental disorder posing challenges in social interaction, communication, and repetitive behaviors. Detecting ASD in individuals in code-switched scenarios presents unique challenges. In this paper, we address this problem by building an application, NeuRO, which aims to detect potential signs of autism in code-switched conversations, facilitating early intervention and support for individuals with ASD.

Submitted 5 June, 2024; originally announced June 2024.

Comments: Accepted to INTERSPEECH 24 Show & Tell Demonstrations

arXiv:2406.03205 [pdf, other]

CoLLAB: A Collaborative Approach for Multilingual Abuse Detection

Authors: Orchid Chetia Phukan, Yashasvi Chaurasia, Arun Balaji Buduru, Rajesh Sharma

Abstract: In this study, we investigate representations from a paralingual Pre-Trained Model (PTM) for Audio Abuse Detection (AAD), which have not previously been explored for this task. Our results demonstrate their superiority compared to other PTM representations on the ADIMA benchmark. Furthermore, combining PTM representations enhances AAD performance. Despite these improvements, challenges with cross-lingual generalizability remain, and certain languages require training in the same language. This demands individual models for different languages, leading to scalability, maintenance, and resource allocation issues and hindering the practical deployment of AAD systems in linguistically diverse real-world environments. To address this, we introduce CoLLAB, a novel framework that doesn't require training and allows seamless merging of models trained in different languages through weight-averaging. This results in a unified model with competitive AAD performance across multiple languages.

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2404.00827 [pdf, other]

SONIC: Synergizing VisiON Foundation Models for Stress RecogNItion from ECG signals

Authors: Orchid Chetia Phukan, Ankita Das, Arun Balaji Buduru, Rajesh Sharma

Abstract: Stress recognition through physiological signals such as Electrocardiogram (ECG) signals has garnered significant attention. Traditionally, research in this field predominantly focused on utilizing handcrafted features or raw signals as inputs for learning algorithms. However, there is now a burgeoning interest within the community in leveraging large-scale vision foundation models (VFMs) like ResNet50, VGG19, and others. These VFMs are increasingly preferred due to their ability to capture complex features, enhancing the accuracy and effectiveness of stress recognition systems. However, no particular focus has been given to combining these VFMs. The combination of VFMs offers promising benefits by harnessing their collective knowledge to extract richer representations for improved stress recognition. To address this research gap, we focus on combining different VFMs for stress recognition from ECG and propose SONIC, a novel framework that combines VFMs through their logits and trains a fully connected network on the combined logits. Through extensive experimentation, SONIC outperformed the individual VFMs on the WESAD benchmark. With SONIC, we report state-of-the-art (SOTA) performance on WESAD with 99.36% and 99.24% (stress vs non-stress) and 97.66% and 97.10% (amusement vs stress vs baseline) in accuracy and F1 respectively.

Submitted 31 March, 2024; originally announced April 2024.

arXiv:2404.00809 [pdf, other]

Heterogeneity over Homogeneity: Investigating Multilingual Speech Pre-Trained Models for Detecting Audio Deepfake

Authors: Orchid Chetia Phukan, Gautam Siddharth Kashyap, Arun Balaji Buduru, Rajesh Sharma

Abstract: In this work, we investigate multilingual speech Pre-Trained models (PTMs) for Audio deepfake detection (ADD). We hypothesize that multilingual PTMs trained on large-scale diverse multilingual data gain knowledge about diverse pitches, accents, and tones during their pre-training phase, making them more robust to variations. As a result, they will be more effective for detecting audio deepfakes. To validate our hypothesis, we extract representations from state-of-the-art (SOTA) PTMs, including monolingual and multilingual PTMs as well as PTMs trained for speaker and emotion recognition, and evaluate them on the ASVSpoof 2019 (ASV), In-the-Wild (ITW), and DECRO benchmark databases. We show that representations from multilingual PTMs, with simple downstream networks, attain the best performance for ADD compared to other PTM representations, which validates our hypothesis. We also explore the possibility of fusion of selected PTM representations for further improvements in ADD, and we propose a framework, MiO (Merge into One), for this purpose. With MiO, we achieve SOTA performance on ASV and ITW and comparable performance on DECRO with current SOTA works.

Submitted 31 March, 2024; originally announced April 2024.

Comments: Accepted to NAACL (Findings) 2024

arXiv:2402.01579 [pdf, other]

Are Paralinguistic Representations all that is needed for Speech Emotion Recognition?

Authors: Orchid Chetia Phukan, Gautam Siddharth Kashyap, Arun Balaji Buduru, Rajesh Sharma

Abstract: The availability of representations from pre-trained models (PTMs) has facilitated substantial progress in speech emotion recognition (SER). In particular, representations from PTMs trained for paralinguistic speech processing have shown state-of-the-art (SOTA) performance for SER. However, such paralinguistic PTM representations haven't been evaluated for SER in linguistic environments other than English. Also, paralinguistic PTM representations haven't been investigated in benchmarks such as SUPERB, EMO-SUPERB, and ML-SUPERB for SER. This makes it difficult to assess the efficacy of paralinguistic PTM representations for SER in multiple languages. To fill this gap, we perform a comprehensive comparative study of five SOTA PTM representations. Our results show that paralinguistic PTM (TRILLsson) representations perform the best, and this performance can be attributed to their capturing pitch, tone, and other speech characteristics more effectively than other PTM representations.

Submitted 11 July, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

Comments: Accepted to INTERSPEECH 24

arXiv:2401.05968 [pdf, other]

A Lightweight Feature Fusion Architecture For Resource-Constrained Crowd Counting

Authors: Yashwardhan Chaudhuri, Ankit Kumar, Orchid Chetia Phukan, Arun Balaji Buduru

Abstract: Crowd counting finds direct applications in real-world situations, making computational efficiency and performance crucial. However, most of the previous methods rely on a heavy backbone and a complex downstream architecture that restricts deployment. To address this challenge and enhance the versatility of crowd-counting models, we introduce two lightweight models. These models maintain the same downstream architecture while incorporating two distinct backbones: MobileNet and MobileViT. We leverage Adjacent Feature Fusion to extract diverse scale features from a Pre-Trained Model (PTM) and subsequently combine these features seamlessly. This approach empowers our models to achieve improved performance while maintaining a compact and efficient design. Compared with previously available state-of-the-art (SOTA) methods on the ShanghaiTech-A, ShanghaiTech-B, and UCF-CC-50 datasets, our proposed models achieve comparable results while being the most computationally efficient. Finally, we present a comparative study and an extensive ablation study, along with pruning, to show the effectiveness of our models.

Submitted 11 January, 2024; originally announced January 2024.

arXiv:2311.16958 [pdf]

From Simulations to Reality: Enhancing Multi-Robot Exploration for Urban Search and Rescue

Authors: Gautam Siddharth Kashyap, Deepkashi Mahajan, Orchid Chetia Phukan, Ankit Kumar, Alexander E. I. Brownlee, Jiechao Gao

Abstract: In this study, we present a novel hybrid algorithm, combining Levy Flight (LF) and Particle Swarm Optimization (PSO) (LF-PSO), tailored for efficient multi-robot exploration in unknown environments with limited communication and no global positioning information. The research addresses the growing interest in employing multiple autonomous robots for exploration tasks, particularly in scenarios such as Urban Search and Rescue (USAR) operations. Multiple robots offer advantages like increased task coverage, robustness, flexibility, and scalability. However, existing approaches often make assumptions about search area, robot positioning, communication restrictions, and target information that may not hold in real-world situations. The hybrid algorithm leverages LF, known for its effectiveness in large space exploration with sparse targets, and incorporates inter-robot repulsion as a social component through PSO. This combination enhances area exploration efficiency. We redefine the local best and global best positions to suit scenarios without continuous target information. Experimental simulations in a controlled environment demonstrate the algorithm's effectiveness, showcasing improved area coverage compared to traditional methods. As we refine our approach and test it in complex, obstacle-rich environments, the presented work holds promise for enhancing multi-robot exploration in scenarios with limited information and communication capabilities.

Submitted 28 November, 2023; originally announced November 2023.

arXiv:2310.07613 [pdf, other]

Reinforcement Learning-based Knowledge Graph Reasoning for Explainable Fact-checking

Authors: Gustav Nikopensius, Mohit Mayank, Orchid Chetia Phukan, Rajesh Sharma

Abstract: Fact-checking is a crucial task as it ensures the prevention of misinformation. However, manual fact-checking cannot keep up with the rate at which false information is generated and disseminated online. Automated fact-checking by machines is significantly quicker than by humans. But for better trust and transparency of these automated systems, explainability in the fact-checking process is necessary. Fact-checking often entails contrasting a factual assertion with a body of knowledge for such explanations. An effective way of representing knowledge is the Knowledge Graph (KG). Many works have been proposed on fact-checking with KGs, but little focus has been given to the application of reinforcement learning (RL) in such cases. To address this gap, we propose an RL-based KG reasoning approach for explainable fact-checking. Extensive experiments on FB15K-277 and NELL-995 datasets reveal that reasoning over a KG is an effective way of producing human-readable explanations in the form of paths and classifications for fact claims. The RL reasoning agent computes a path that either proves or disproves a factual claim, but does not provide a verdict itself. A verdict is reached by a voting mechanism that utilizes paths produced by the agent. These paths can be presented to human readers so that they themselves can decide whether or not the provided evidence is convincing. This work will encourage further research in this direction on incorporating RL for explainable fact-checking, as it increases trustworthiness by providing a human-in-the-loop approach.

Submitted 11 October, 2023; originally announced October 2023.

Comments: Accepted to ASONAM 2023

arXiv:2306.10338 [pdf, other]

Trauma lurking in the shadows: A Reddit case study of mental health issues in online posts about Childhood Sexual Abuse

Authors: Orchid Chetia Phukan, Rajesh Sharma, Arun Balaji Buduru

Abstract: Childhood Sexual Abuse (CSA) is a menace to society and has long-lasting effects on the mental health of the survivors. From time to time CSA survivors are haunted by various mental health issues in their lifetime. Proper care and attention towards CSA survivors facing mental health issues can drastically improve their mental health conditions. Previous works leveraging online social media (OSM) data for understanding mental health issues haven't focused on mental health issues in individuals with CSA background. Our work fills this gap by studying Reddit posts related to CSA to understand their mental health issues. Mental health issues such as depression, anxiety, and Post-Traumatic Stress Disorder (PTSD) are most commonly observed in posts with CSA background. Observable differences exist between posts related to mental health issues with and without CSA background. Keeping this difference in mind, we develop a two-stage framework for identifying mental health issues in posts with CSA exposure. The first stage involves classifying posts with and without CSA background, and the second stage involves recognizing mental health issues in posts that are classified as belonging to CSA background. The top model in the first stage achieves accuracy and F1-score (macro) of 96.26% and 96.24%, respectively, and in the second stage, the top model reports a Hamming score of 67.09%. Content Warning: Reader discretion is recommended as our study tackles topics such as child sexual abuse, molestation, etc.

Submitted 17 June, 2023; originally announced June 2023.

arXiv:2306.02308 [pdf]

Roulette-Wheel Selection-Based PSO Algorithm for Solving the Vehicle Routing Problem with Time Windows

Authors: Gautam Siddharth Kashyap, Alexander E. I. Brownlee, Orchid Chetia Phukan, Karan Malik, Samar Wazir

Abstract: The well-known Vehicle Routing Problem with Time Windows (VRPTW) aims to reduce the cost of moving goods between several destinations while accommodating constraints like set time windows for certain locations and vehicle capacity. Applications of the VRPTW in the real world include Supply Chain Management (SCM) and logistic dispatching, both of which are crucial to the economy and are expanding quickly as work habits change. To solve the VRPTW, metaheuristic algorithms such as Particle Swarm Optimization (PSO) have been found to work effectively; however, they can experience premature convergence. To lower the risk of PSO's premature convergence, the authors have solved VRPTW in this paper utilising a novel form of the PSO methodology that uses the Roulette Wheel Method (RWPSO). Computing experiments using the Solomon VRPTW benchmark datasets on the RWPSO demonstrate that RWPSO is competitive with other state-of-the-art algorithms from the literature. Also, comparisons with two cutting-edge algorithms from the literature show how competitive the suggested algorithm is.

Submitted 4 June, 2023; originally announced June 2023.

arXiv:2305.18640 [pdf, other]

Transforming the Embeddings: A Lightweight Technique for Speech Emotion Recognition Tasks

Authors: Orchid Chetia Phukan, Arun Balaji Buduru, Rajesh Sharma

Abstract: Speech emotion recognition (SER) is a field that has drawn a lot of attention due to its applications in diverse fields. A current trend in methods used for SER is to leverage embeddings from pre-trained models (PTMs) as input features to downstream models. However, the use of embeddings from speaker recognition PTMs hasn't garnered much focus in comparison to other PTM embeddings. To fill this gap and to understand the efficacy of speaker recognition PTM embeddings, we perform a comparative analysis of five PTM embeddings. Among all, x-vector embeddings performed the best, possibly because training for speaker recognition leads them to capture various components of speech such as tone, pitch, etc. Our modeling approach, which utilizes x-vector embeddings and mel-frequency cepstral coefficients (MFCC) as input features, is the most lightweight approach while achieving comparable accuracy to previous state-of-the-art (SOTA) methods on the CREMA-D benchmark.

Submitted 29 May, 2023; originally announced May 2023.

Comments: Accepted to Interspeech 2023

arXiv:2304.11472 [pdf, other]

A Comparative Study of Pre-trained Speech and Audio Embeddings for Speech Emotion Recognition

Authors: Orchid Chetia Phukan, Arun Balaji Buduru, Rajesh Sharma

Abstract: Pre-trained models (PTMs) have shown great promise in the speech and audio domain. Embeddings leveraged from these models serve as inputs for learning algorithms with applications in various downstream tasks. One such crucial task is Speech Emotion Recognition (SER), which has a wide range of applications, including dynamic analysis of customer calls, mental health assessment, and personalized language learning. PTM embeddings have helped advance SER; however, a comprehensive comparison of these PTM embeddings that considers multiple facets such as embedding model architecture, data used for pre-training, and the pre-training procedure being followed is missing. A thorough comparison of PTM embeddings will aid in the faster and more efficient development of models and enable their deployment in real-world scenarios. In this work, we address this research gap and perform a comparative analysis of embeddings extracted from eight speech and audio PTMs (wav2vec 2.0, data2vec, wavLM, UniSpeech-SAT, wav2clip, YAMNet, x-vector, ECAPA). We perform an extensive empirical analysis with four speech emotion datasets (CREMA-D, TESS, SAVEE, Emo-DB) by training three algorithms (XGBoost, Random Forest, FCN) on the derived embeddings. The results of our study indicate that the best performance is achieved by algorithms trained on embeddings derived from PTMs trained for speaker recognition, followed by wav2clip and UniSpeech-SAT. This suggests that the top performance of embeddings from speaker recognition PTMs is most likely due to the models capturing information about numerous speech features such as tone, accent, pitch, and so on during speaker recognition training. Insights from this work will assist future studies in their selection of embeddings for applications related to SER.

Submitted 22 April, 2023; originally announced April 2023.

arXiv:2304.10512 [pdf, other]

"Can We Detect Substance Use Disorder?": Knowledge and Time Aware Classification on Social Media from Darkweb

Authors: Usha Lokala, Orchid Chetia Phukan, Triyasha Ghosh Dastidar, Francois Lamy, Raminta Daniulaityte, Amit Sheth

Abstract: Opioid and substance misuse is rampant in the United States today, with the phenomenon known as the "opioid crisis". The relationship between substance use and mental health has been extensively studied, with one possible relationship being: substance misuse causes poor mental health. However, the lack of evidence on the relationship has resulted in opioids being largely inaccessible through legal means. This study analyzes the substance use posts on social media with opioids being sold through crypto market listings. We use the Drug Abuse Ontology, state-of-the-art deep learning, and knowledge-aware BERT-based models to generate sentiment and emotion for the social media posts to understand users' perceptions on social media by investigating questions such as: which synthetic opioids are people optimistic, neutral, or negative about? What kind of drugs induce fear and sorrow? What kind of drugs do people love or are thankful for? Which drugs do people think negatively about? Which opioids cause little to no sentimental reaction? We discuss how we crawled crypto market data and its use in extracting posts for fentanyl, fentanyl analogs, and other novel synthetic opioids. We also perform topic analysis associated with the generated sentiments and emotions to understand which topics correlate with people's responses to various drugs. Additionally, we analyze time-aware neural models built on these features while considering historical sentiment and emotional activity of posts related to a drug. The most effective model performs well (statistically significant), with macro F1 = 82.12 and recall = 83.58, in identifying substance use disorder.

Submitted 20 April, 2023; originally announced April 2023.

Showing 1–18 of 18 results for author: Phukan, O C