A Textless Metric for Speech-to-Speech Comparison

Authors: Laurent Besacier, Swen Ribeiro, Olivier Galibert, Ioan Calapodescu

Abstract: In this paper, we introduce a new and simple method for comparing speech utterances without relying on text transcripts. Our speech-to-speech comparison metric utilizes state-of-the-art speech2unit encoders like HuBERT to convert speech utterances into discrete acoustic units. We then propose a simple and easily replicable neural architecture that learns a speech-based metric that closely corresponds to its text-based counterpart. This textless metric has numerous potential applications, including evaluating speech-to-speech translation for oral languages or languages without dependable ASR systems, or avoiding the need for ASR transcription altogether. This paper also shows that for speech-to-speech translation evaluation, ASR-BLEU (which consists of automatically transcribing both the speech hypothesis and the reference and computing sentence-level BLEU between the transcripts) is a poor proxy for real text-BLEU even when the ASR system is strong.

Submitted 20 July, 2023; v1 submitted 21 October, 2022; originally announced October 2022.

Comments: link to supplementary material: https://github.com/besacier/textless-metric

arXiv:2102.13589 [pdf, other]

Evaluate On-the-job Learning Dialogue Systems and a Case Study for Natural Language Understanding

Authors: Mathilde Veron, Sophie Rosset, Olivier Galibert, Guillaume Bernard

Abstract: On-the-job learning consists of continuously learning while being used in production, in an open environment, meaning that the system has to deal on its own with situations and elements never seen before. Dialogue systems seem especially well suited to on-the-job learning, since they can take advantage of their interactions with users to collect feedback and so adapt and improve their components over time. Some dialogue systems performing on-the-job learning have been built and evaluated, but no general methodology has yet been defined. In this paper, we therefore propose a first general methodology for evaluating on-the-job learning dialogue systems. We also describe a task-oriented dialogue system that improves its natural language component on the job through its user interactions. Finally, we evaluate our system with the described methodology.

Submitted 26 February, 2021; originally announced February 2021.

Comments: Accepted to NeurIPS 2020 Human in the Loop Dialogue Systems Workshop

arXiv:1808.08573 [pdf, other]

Analyzing Learned Representations of a Deep ASR Performance Prediction Model

Authors: Zied Elloumi, Laurent Besacier, Olivier Galibert, Benjamin Lecouteux

Abstract: This paper addresses a relatively new task: prediction of ASR performance on unseen broadcast programs. In a previous paper, we presented an ASR performance prediction system using CNNs that encode both text (ASR transcript) and speech, in order to predict word error rate. This work is dedicated to the analysis of the speech signal embeddings and text embeddings learnt by the CNN while training our prediction model. We try to better understand which information is captured by the deep model and how it relates to different conditioning factors. It is shown that hidden layers convey a clear signal about speech style, accent and broadcast type. We then try to leverage these three types of information at training time through multi-task learning. Our experiments show that this makes it possible to train slightly more efficient ASR performance prediction systems that, in addition, simultaneously tag the analyzed utterances according to their speech style, accent and broadcast program origin.

Submitted 28 August, 2018; v1 submitted 26 August, 2018; originally announced August 2018.

Comments: EMNLP 2018 Workshop

arXiv:1804.08477 [pdf, other]

ASR Performance Prediction on Unseen Broadcast Programs using Convolutional Neural Networks

Authors: Zied Elloumi, Laurent Besacier, Olivier Galibert, Juliette Kahn, Benjamin Lecouteux

Abstract: In this paper, we address a relatively new task: prediction of ASR performance on unseen broadcast programs. We first propose a heterogeneous French corpus dedicated to this task. Two prediction approaches are compared: a state-of-the-art performance prediction based on regression (engineered features) and a new strategy based on convolutional neural networks (learnt features). We particularly focus on the combination of both textual (ASR transcription) and signal inputs. While the joint use of textual and signal features did not work for the regression baseline, the combination of inputs for CNNs leads to the best WER prediction performance. We also show that our CNN predicts the WER distribution on a collection of speech recordings remarkably well.

Submitted 23 April, 2018; originally announced April 2018.

Comments: IEEE ICASSP 2018

Showing 1–4 of 4 results for author: Galibert, O