Search | arXiv e-print repository

arXiv:2403.11092 [pdf, other]

Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts

Authors: Michael Saxon, Yiran Luo, Sharon Levy, Chitta Baral, Yezhou Yang, William Yang Wang

Abstract: Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set. One such benchmark, "Conceptual Coverage Across Languages" (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages and co… ▽ More Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set. One such benchmark, "Conceptual Coverage Across Languages" (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages and comparing the output image populations. Unfortunately, we find that this benchmark contains translation errors of varying severity in Spanish, Japanese, and Chinese. We provide corrections for these errors and analyze how impactful they are on the utility and validity of CoCo-CroLa as a benchmark. We reassess multiple baseline T2I models with the revisions, compare the outputs elicited under the new translations to those conditioned on the old, and show that a correction's impactfulness on the image-domain benchmark results can be predicted in the text domain with similarity scores. Our findings will guide the future development of T2I multilinguality metrics by providing analytical tools for practical translation decisions. △ Less

Submitted 17 March, 2024; originally announced March 2024.

Comments: NAACL 2024 Main Conference

arXiv:2306.01735 [pdf, other]

Multilingual Conceptual Coverage in Text-to-Image Models

Authors: Michael Saxon, William Yang Wang

Abstract: We propose "Conceptual Coverage Across Languages" (CoCo-CroLa), a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns. For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series… ▽ More We propose "Conceptual Coverage Across Languages" (CoCo-CroLa), a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns. For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language. This technique allows us to estimate how well-suited a model is to a target language as well as identify model-specific weaknesses, spurious correlations, and biases without a-priori assumptions. We demonstrate how it can be used to benchmark T2I models in terms of multilinguality, and how despite its simplicity it is a good proxy for impressive generalization. △ Less

Submitted 2 June, 2023; originally announced June 2023.

Comments: ACL 2023 main conference; 16 pages, 13 figures

arXiv:2305.10684 [pdf, other]

Data Augmentation for Diverse Voice Conversion in Noisy Environments

Authors: Avani Tanna, Michael Saxon, Amr El Abbadi, William Yang Wang

Abstract: Voice conversion (VC) models have demonstrated impressive few-shot conversion quality on the clean, native speech populations they're trained on. However, when source or target speech accents, background noise conditions, or microphone characteristics differ from training, quality voice conversion is not guaranteed. These problems are often left unexamined in VC research, giving rise to frustratio… ▽ More Voice conversion (VC) models have demonstrated impressive few-shot conversion quality on the clean, native speech populations they're trained on. However, when source or target speech accents, background noise conditions, or microphone characteristics differ from training, quality voice conversion is not guaranteed. These problems are often left unexamined in VC research, giving rise to frustration in users trying to use pretrained VC models on their own data. We are interested in accent-preserving voice conversion for name pronunciation from self-recorded examples, a domain in which all three of the aforementioned conditions are present, and posit that demonstrating higher performance in this domain correlates with creating VC models that are more usable by otherwise frustrated users. We demonstrate that existing SOTA encoder-decoder VC models can be made robust to these variations and endowed with natural denoising capabilities using more diverse data and simple data augmentation techniques in pretraining. △ Less

Submitted 17 May, 2023; originally announced May 2023.

Comments: Interspeech 2023 Show and Tell, 2 pp

arXiv:2106.09009 [pdf, other]

doi 10.21437/Interspeech.2021-1826

End-to-End Spoken Language Understanding for Generalized Voice Assistants

Authors: Michael Saxon, Samridhi Choudhary, Joseph P. McKenna, Athanasios Mouchtaris

Abstract: End-to-end (E2E) spoken language understanding (SLU) systems predict utterance semantics directly from speech using a single model. Previous work in this area has focused on targeted tasks in fixed domains, where the output semantic structure is assumed a priori and the input speech is of limited complexity. In this work we present our approach to develo** an E2E model for generalized SLU in com… ▽ More End-to-end (E2E) spoken language understanding (SLU) systems predict utterance semantics directly from speech using a single model. Previous work in this area has focused on targeted tasks in fixed domains, where the output semantic structure is assumed a priori and the input speech is of limited complexity. In this work we present our approach to develo** an E2E model for generalized SLU in commercial voice assistants (VAs). We propose a fully differentiable, transformer-based, hierarchical system that can be pretrained at both the ASR and NLU levels. This is then fine-tuned on both transcription and semantic classification losses to handle a diverse set of intent and argument combinations. This leads to an SLU system that achieves significant improvements over baselines on a complex internal generalized VA dataset with a 43% improvement in accuracy, while still meeting the 99% accuracy benchmark on the popular Fluent Speech Commands dataset. We further evaluate our model on a hard test set, exclusively containing slot arguments unseen in training, and demonstrate a nearly 20% improvement, showing the efficacy of our approach in truly demanding VA scenarios. △ Less

Submitted 19 July, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

Comments: Accepted to Interspeech 2021; 5 pages, 2 tables, 1 figure

Journal ref: Proc. Interspeech 2021, 4738-4742

arXiv:2008.02858 [pdf, other]

Semantic Complexity in End-to-End Spoken Language Understanding

Authors: Joseph P. McKenna, Samridhi Choudhary, Michael Saxon, Grant P. Strimel, Athanasios Mouchtaris

Abstract: End-to-end spoken language understanding (SLU) models are a class of model architectures that predict semantics directly from speech. Because of their input and output types, we refer to them as speech-to-interpretation (STI) models. Previous works have successfully applied STI models to targeted use cases, such as recognizing home automation commands, however no study has yet addressed how these… ▽ More End-to-end spoken language understanding (SLU) models are a class of model architectures that predict semantics directly from speech. Because of their input and output types, we refer to them as speech-to-interpretation (STI) models. Previous works have successfully applied STI models to targeted use cases, such as recognizing home automation commands, however no study has yet addressed how these models generalize to broader use cases. In this work, we analyze the relationship between the performance of STI models and the difficulty of the use case to which they are applied. We introduce empirical measures of dataset semantic complexity to quantify the difficulty of the SLU tasks. We show that near-perfect performance metrics for STI models reported in the literature were obtained with datasets that have low semantic complexity values. We perform experiments where we vary the semantic complexity of a large, proprietary dataset and show that STI model performance correlates with our semantic complexity measures, such that performance increases as complexity values decrease. Our results show that it is important to contextualize an STI model's performance with the complexity values of its training dataset to reveal the scope of its applicability. △ Less

Submitted 6 August, 2020; originally announced August 2020.

Comments: Accepted at Interspeech, 2020

arXiv:1911.11360 [pdf, other]

doi 10.1109/TASLP.2020.3015035

Robust Estimation of Hypernasality in Dysarthria with Acoustic Model Likelihood Features

Authors: Michael Saxon, Ayush Tripathi, Yishan Jiao, Julie Liss, Visar Berisha

Abstract: Hypernasality is a common characteristic symptom across many motor-speech disorders. For voiced sounds, hypernasality introduces an additional resonance in the lower frequencies and, for unvoiced sounds, there is reduced articulatory precision due to air esca** through the nasal cavity. However, the acoustic manifestation of these symptoms is highly variable, making hypernasality estimation very… ▽ More Hypernasality is a common characteristic symptom across many motor-speech disorders. For voiced sounds, hypernasality introduces an additional resonance in the lower frequencies and, for unvoiced sounds, there is reduced articulatory precision due to air esca** through the nasal cavity. However, the acoustic manifestation of these symptoms is highly variable, making hypernasality estimation very challenging, both for human specialists and automated systems. Previous work in this area relies on either engineered features based on statistical signal processing or machine learning models trained on clinical ratings. Engineered features often fail to capture the complex acoustic patterns associated with hypernasality, whereas metrics based on machine learning are prone to overfitting to the small disease-specific speech datasets on which they are trained. Here we propose a new set of acoustic features that capture these complementary dimensions. The features are based on two acoustic models trained on a large corpus of healthy speech. The first acoustic model aims to measure nasal resonance from voiced sounds, whereas the second acoustic model aims to measure articulatory imprecision from unvoiced sounds. To demonstrate that the features derived from these acoustic models are specific to hypernasal speech, we evaluate them across different dysarthria corpora. Our results show that the features generalize even when training on hypernasal speech from one disease and evaluating on hypernasal speech from another disease (e.g. training on Parkinson's disease, evaluation on Huntington's disease), and when training on neurologically disordered speech but evaluating on cleft palate speech. △ Less

Submitted 5 August, 2020; v1 submitted 26 November, 2019; originally announced November 2019.

Comments: 12 pages, 9 figures, 2 tables

Journal ref: IEEE/ACM Trans. on Audio, Speech, and Language Proc. 28 (2020) 2511-2522

Showing 1–6 of 6 results for author: Saxon, M