-
A Dataset and Baselines for Measuring and Predicting the Music Piece Memorability
Authors:
Li-Yang Tseng,
Tzu-Ling Lin,
Hong-Han Shuai,
Jen-Wei Huang,
Wen-Whei Chang
Abstract:
Nowadays, humans are constantly exposed to music, whether through voluntary streaming services or incidental encounters during commercial breaks. Despite the abundance of music, certain pieces remain more memorable and often gain greater popularity. Inspired by this phenomenon, we focus on measuring and predicting music memorability. To achieve this, we collect a new music piece dataset with reliable memorability labels using a novel interactive experimental procedure. We then train baselines to predict and analyze music memorability, leveraging both interpretable features and audio mel-spectrograms as inputs. To the best of our knowledge, we are the first to explore music memorability using data-driven deep learning-based methods. Through a series of experiments and ablation studies, we demonstrate that while there is room for improvement, predicting music memorability with limited data is possible. Certain intrinsic elements, such as higher valence, arousal, and faster tempo, contribute to memorable music. As prediction techniques continue to evolve, real-life applications like music recommendation systems and music style transfer will undoubtedly benefit from this new area of research.
Submitted 21 May, 2024;
originally announced May 2024.
-
REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR
Authors:
Liang-Hsuan Tseng,
En-Pei Hu,
Cheng-Han Chiang,
Yuan Tseng,
Hung-yi Lee,
Lin-shan Lee,
Shao-Hua Sun
Abstract:
Unsupervised automatic speech recognition (ASR) aims to learn the mapping between the speech signal and its corresponding textual transcription without the supervision of paired speech-text data. A word/phoneme in the speech signal is represented by a segment of speech signal with variable length and unknown boundary, and this segmental structure makes learning the mapping between speech and text challenging, especially without paired data. In this paper, we propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR. REBORN alternates between (1) training a segmentation model that predicts the boundaries of the segmental structures in speech signals and (2) training the phoneme prediction model, whose input is the speech feature segmented by the segmentation model, to predict a phoneme transcription. Since supervised data for training the segmentation model is not available, we use reinforcement learning to train the segmentation model to favor segmentations that yield phoneme sequence predictions with a lower perplexity. We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech. We comprehensively analyze why the boundaries learned by REBORN improve the unsupervised ASR performance.
Submitted 28 May, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
Improving Cascaded Unsupervised Speech Translation with Denoising Back-translation
Authors:
Yu-Kuan Fu,
Liang-Hsuan Tseng,
Jiatong Shi,
Chen-An Li,
Tsu-Yuan Hsu,
Shinji Watanabe,
Hung-yi Lee
Abstract:
Most speech translation models rely heavily on parallel data, which is hard to collect, especially for low-resource languages. To tackle this issue, we propose to build a cascaded speech translation system without leveraging any kind of paired data. We use fully unpaired data to train our unsupervised systems and evaluate our results on CoVoST 2 and CVSS. The results show that our work is comparable with some early supervised methods in some language pairs. Since cascaded systems suffer from severe error propagation, we propose denoising back-translation (DBT), a novel approach to building robust unsupervised neural machine translation (UNMT). DBT successfully increases the BLEU score by 0.7–0.9 in all three translation directions. Moreover, we simplify the pipeline of our cascaded system to reduce inference latency and conduct a comprehensive analysis of every part of our work. We also demonstrate our unsupervised speech translation results on the established website.
Submitted 12 May, 2023;
originally announced May 2023.
-
Introducing Semantics into Speech Encoders
Authors:
Derek Xu,
Shuyan Dong,
Changhan Wang,
Suyoun Kim,
Zhaojiang Lin,
Akshat Shrivastava,
Shang-Wen Li,
Liang-Hsuan Tseng,
Alexei Baevski,
Guan-Ting Lin,
Hung-yi Lee,
Yizhou Sun,
Wei Wang
Abstract:
Recent studies find existing self-supervised speech encoders contain primarily acoustic rather than semantic information. As a result, pipelined supervised automatic speech recognition (ASR) to large language model (LLM) systems achieve state-of-the-art results on semantic spoken language tasks by utilizing rich semantic representations from the LLM. These systems come at the cost of labeled audio transcriptions, which is expensive and time-consuming to obtain. We propose a task-agnostic unsupervised way of incorporating semantic information from LLMs into self-supervised speech encoders without labeled audio transcriptions. By introducing semantics, we improve existing speech encoder spoken language understanding performance by over 10% on intent classification, with modest gains in named entity resolution and slot filling, and spoken question answering F1 score by over 2%. Our unsupervised approach achieves similar performance as supervised methods trained on over 100 hours of labeled audio transcripts, demonstrating the feasibility of unsupervised semantic augmentations to existing speech encoders.
Submitted 15 November, 2022;
originally announced November 2022.
-
Improving generalizability of distilled self-supervised speech processing models under distorted settings
Authors:
Kuan-Po Huang,
Yu-Kuan Fu,
Tsu-Yuan Hsu,
Fabian Ritter Gutierrez,
Fan-Lin Wang,
Liang-Hsuan Tseng,
Yu Zhang,
Hung-yi Lee
Abstract:
Self-supervised learned (SSL) speech pre-trained models perform well across various speech processing tasks. Distilled versions of SSL models have been developed to match the needs of on-device speech applications. Though having similar performance as original SSL models, distilled counterparts suffer from performance degradation even more than their original versions in distorted environments. This paper proposes to apply Cross-Distortion Mapping and Domain Adversarial Training to SSL models during knowledge distillation to alleviate the performance gap caused by the domain mismatch problem. Results show consistent performance improvements under both in- and out-of-domain distorted setups for different downstream tasks while keeping efficient model size.
Submitted 20 October, 2022; v1 submitted 14 October, 2022;
originally announced October 2022.
-
Mandarin-English Code-switching Speech Recognition with Self-supervised Speech Representation Models
Authors:
Liang-Hsuan Tseng,
Yu-Kuan Fu,
Heng-Jui Chang,
Hung-yi Lee
Abstract:
Code-switching (CS) is common in daily conversations where more than one language is used within a sentence. The difficulties of CS speech recognition lie in alternating languages and the lack of transcribed data. Therefore, this paper uses the recently successful self-supervised learning (SSL) methods to leverage large amounts of unlabeled speech data without CS. We show that hidden representations of SSL models offer frame-level language identity even if the models are trained with English speech only. Jointly training CTC and language identification modules with self-supervised speech representations improves CS speech recognition performance. Furthermore, using multilingual speech data for pre-training yields the best CS speech recognition performance.
Submitted 7 October, 2021;
originally announced October 2021.