Showing 1–11 of 11 results for author: Busso, C

  1. arXiv:2406.04494  [pdf, other]

    eess.AS

    Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline

    Authors: Ali N. Salman, Zongyang Du, Shreeram Suresh Chandra, Ismail Rasim Ulgen, Carlos Busso, Berrak Sisman

    Abstract: Voice conversion (VC) research traditionally depends on scripted or acted speech, which lacks the natural spontaneity of real-life conversations. While natural speech data is limited for VC, our study focuses on filling in this gap. We introduce a novel data-sourcing pipeline that makes the release of a natural speech dataset for VC, named NaturalVoices. The pipeline extracts rich information in s…

    Submitted 6 June, 2024; originally announced June 2024.

  2. arXiv:2403.14083  [pdf, other]

    cs.SD cs.LG eess.AS

    emoDARTS: Joint Optimisation of CNN & Sequential Neural Network Architectures for Superior Speech Emotion Recognition

    Authors: Thejan Rajapakshe, Rajib Rana, Sara Khalifa, Berrak Sisman, Bjorn W. Schuller, Carlos Busso

    Abstract: Speech Emotion Recognition (SER) is crucial for enabling computers to understand the emotions conveyed in human communication. With recent advancements in Deep Learning (DL), the performance of SER models has significantly improved. However, designing an optimal DL architecture requires specialised knowledge and experimental assessments. Fortunately, Neural Architecture Search (NAS) provides a pot…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Submitted to IEEE Transactions on Affective Computing on February 19, 2024. arXiv admin note: text overlap with arXiv:2305.14402

  3. Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition

    Authors: Ismail Rasim Ulgen, Zongyang Du, Carlos Busso, Berrak Sisman

    Abstract: Speaker embeddings carry valuable emotion-related information, which makes them a promising resource for enhancing speech emotion recognition (SER), especially with limited labeled data. Traditionally, it has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization. Our study reveals a direct and useful link between emotion and stat…

    Submitted 19 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  4. arXiv:2305.07216  [pdf, other]

    cs.LG cs.MM cs.SD eess.AS

    Versatile Audio-Visual Learning for Handling Single and Multi Modalities in Emotion Regression and Classification Tasks

    Authors: Lucas Goncalves, Seong-Gyun Leem, Wei-Cheng Lin, Berrak Sisman, Carlos Busso

    Abstract: Most current audio-visual emotion recognition models lack the flexibility needed for deployment in practical applications. We envision a multimodal system that works even when only one modality is available and can be implemented interchangeably for either predicting emotional attributes or recognizing categorical emotions. Achieving such flexibility in a multimodal emotion recognition system is d…

    Submitted 11 May, 2023; originally announced May 2023.

    Comments: 14 pages, 2 Figures, 2 tables

  5. arXiv:2210.13756  [pdf, other]

    eess.AS cs.SD

    Mixed-EVC: Mixed Emotion Synthesis and Control in Voice Conversion

    Authors: Kun Zhou, Berrak Sisman, Carlos Busso, Bin Ma, Haizhou Li

    Abstract: Emotional voice conversion (EVC) traditionally targets the transformation of spoken utterances from one emotional state to another, with previous research mainly focusing on discrete emotion categories. This paper departs from the norm by introducing a novel perspective: a nuanced rendering of mixed emotions and enhancing control over emotional expression. To achieve this, we propose a novel EVC f…

    Submitted 17 September, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

  6. arXiv:2201.07876  [pdf, other]

    cs.SD cs.CL cs.HC eess.AS

    Unsupervised Personalization of an Emotion Recognition System: The Unique Properties of the Externalization of Valence in Speech

    Authors: Kusha Sridhar, Carlos Busso

    Abstract: The prediction of valence from speech is an important, but challenging problem. The externalization of valence in speech has speaker-dependent cues, which contribute to performances that are often significantly lower than the prediction of other emotional attributes such as arousal and dominance. A practical approach to improve valence prediction from speech is to adapt the models to the target sp…

    Submitted 19 January, 2022; originally announced January 2022.

    Comments: 8 Figures and 5 tables

    Journal ref: IEEE Transactions on Affective Computing, vol. 13, no. 4, pp. 1959-1972, October-December 2022

  7. Semi-Supervised Speech Emotion Recognition with Ladder Networks

    Authors: Srinivas Parthasarathy, Carlos Busso

    Abstract: Speech emotion recognition (SER) systems find applications in various fields such as healthcare, education, and security and defense. A major drawback of these systems is their lack of generalization across different conditions. This problem can be solved by training models on large amounts of labeled data from the target domain, which is expensive and time-consuming. Another approach is to increa…

    Submitted 8 May, 2019; originally announced May 2019.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2697-2709, September 2020

  8. End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models

    Authors: Fei Tao, Carlos Busso

    Abstract: Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, increasing the robustness of the SAD approach. An audiovisual system has the advantage of being robust to different speech modes (e.g.…

    Submitted 12 September, 2018; originally announced September 2018.

    Comments: Submitted to Speech Communication

    Journal ref: Speech Communication, vol. 113, pp. 25-35, October 2019

  9. Curriculum Learning for Speech Emotion Recognition from Crowdsourced Labels

    Authors: Reza Lotfian, Carlos Busso

    Abstract: This study introduces a method to design a curriculum for machine-learning to maximize the efficiency during the training process of deep neural networks (DNNs) for speech emotion recognition. Previous studies in other machine-learning problems have shown the benefits of training a classifier following a curriculum where samples are gradually presented in increasing level of difficulty. For speech…

    Submitted 25 May, 2018; originally announced May 2018.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019

  10. Ladder Networks for Emotion Recognition: Using Unsupervised Auxiliary Tasks to Improve Predictions of Emotional Attributes

    Authors: Srinivas Parthasarathy, Carlos Busso

    Abstract: Recognizing emotions using few attribute dimensions such as arousal, valence and dominance provides the flexibility to effectively represent complex range of emotional behaviors. Conventional methods to learn these emotional descriptors primarily focus on separate models to recognize each of these attributes. Recent work has shown that learning these attributes together regularizes the models, lea…

    Submitted 28 April, 2018; originally announced April 2018.

    Comments: Submitted to Interspeech 2018

    Journal ref: Interspeech 2018, Hyderabad, India, September 2018, pp. 3698-3702

  11. Domain Adversarial for Acoustic Emotion Recognition

    Authors: Mohammed Abdelwahab, Carlos Busso

    Abstract: The performance of speech emotion recognition is affected by the differences in data distributions between train (source domain) and test (target domain) sets used to build and evaluate the models. This is a common problem, as multiple studies have shown that the performance of emotional classifiers drop when they are exposed to data that does not match the distribution used to build the emotion c…

    Submitted 20 April, 2018; originally announced April 2018.

    Comments: Submitted to IEEE Transactions on Signal Processing

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 12, pp. 2423-2435, December 2018