-
Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024
Authors:
Sai Koneru,
Thai-Binh Nguyen,
Ngoc-Quan Pham,
Danni Liu,
Zhaolin Li,
Alexander Waibel,
Jan Niehues
Abstract:
Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST). In this paper, we present KIT's offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation system. Specifically, we integrate Mistral-7B (mistralai/Mistral-7B-Instruct-v0.1) into our system to enhance it in two ways. Firstly, we refine the ASR outputs by utilizing the N-best lists generated by our system and fine-tuning the LLM to predict the transcript accurately. Secondly, we refine the MT outputs at the document level by fine-tuning the LLM, leveraging both ASR and MT predictions to improve translation quality. We find that integrating the LLM into the ASR and MT systems results in an absolute improvement of $0.3\%$ in Word Error Rate and $0.65\%$ in COMET for the tst2019 test set. In challenging test sets with overlapping speakers and background noise, we find that integrating the LLM is not beneficial due to poor ASR performance. Here, we use ASR with chunked long-form decoding to improve context usage that may be unavailable when transcribing with Voice Activity Detection segmentation alone.
Submitted 24 June, 2024;
originally announced June 2024.
-
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading
Authors:
Tu Anh Dinh,
Carlos Mullov,
Leonard Bärmann,
Zhaolin Li,
Danni Liu,
Simon Reiß,
Jueun Lee,
Nathan Lerzer,
Fabian Ternava,
Jianfeng Gao,
Alexander Waibel,
Tamim Asfour,
Michael Beigl,
Rainer Stiefelhagen,
Carsten Dachsbacher,
Klemens Böhm,
Jan Niehues
Abstract:
With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks which can evaluate the ability of LLMs on different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper, we propose SciEx - a benchmark consisting of university computer science exam questions, to evaluate LLMs' ability to solve scientific tasks. SciEx (1) is multilingual, containing both English and German exams, (2) is multi-modal, containing questions that involve images, and (3) contains various types of freeform questions with different difficulty levels, due to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are freeform, it is not straightforward to evaluate LLM performance. Therefore, we provide human expert grading of the LLM outputs on SciEx. We show that the freeform exams in SciEx remain challenging for the current LLMs, where the best LLM only achieves 59.4\% exam grade on average. We also provide detailed comparisons between LLM performance and student performance on SciEx. To enable future evaluation of new LLMs, we propose using LLM-as-a-judge to grade the LLM answers on SciEx. Our experiments show that, although they do not perform perfectly on solving the exams, LLMs are decent as graders, achieving 0.948 Pearson correlation with expert grading.
Submitted 14 June, 2024;
originally announced June 2024.
-
Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation
Authors:
Matthias Sperber,
Ondřej Bojar,
Barry Haddow,
Dávid Javorský,
Xutai Ma,
Matteo Negri,
Jan Niehues,
Peter Polák,
Elizabeth Salesky,
Katsuhito Sudoh,
Marco Turchi
Abstract:
Human evaluation is a critical component in machine translation system development and has received much attention in text translation research. However, little prior work exists on the topic of human evaluation for speech translation, which adds additional challenges such as noisy data and segmentation mismatches. We take first steps to fill this gap by conducting a comprehensive human evaluation of the results of several shared tasks from the last International Workshop on Spoken Language Translation (IWSLT 2023). We propose an effective evaluation strategy based on automatic resegmentation and direct assessment with segment context. Our analysis revealed that: 1) the proposed evaluation strategy is robust and its scores are well correlated with other types of human judgements; 2) automatic metrics are usually, but not always, well correlated with direct assessment scores; and 3) COMET is a slightly stronger automatic metric than chrF, despite the segmentation noise introduced by the resegmentation step. We release the collected human-annotated data in order to encourage further investigation.
Submitted 6 June, 2024;
originally announced June 2024.
-
Quality Estimation with $k$-nearest Neighbors and Automatic Evaluation for Model-specific Quality Estimation
Authors:
Tu Anh Dinh,
Tobias Palzer,
Jan Niehues
Abstract:
Providing quality scores along with Machine Translation (MT) output, so-called reference-free Quality Estimation (QE), is crucial to inform users about the reliability of the translation. We propose a model-specific, unsupervised QE approach, termed $k$NN-QE, that extracts information from the MT model's training data using $k$-nearest neighbors. Measuring the performance of model-specific QE is not straightforward, since such approaches provide quality scores on their own MT output and thus cannot be evaluated using benchmark QE test sets containing human quality scores on premade MT output. Therefore, we propose an automatic evaluation method that uses quality scores from reference-based metrics as gold standard instead of human-generated ones. We are the first to conduct detailed analyses and conclude that this automatic method is sufficient, and the reference-based MetricX-23 is best for the task.
Submitted 27 April, 2024;
originally announced April 2024.
-
Language-Independent Representations Improve Zero-Shot Summarization
Authors:
Vladimir Solovyev,
Danni Liu,
Jan Niehues
Abstract:
Finetuning pretrained models on downstream generation tasks often leads to catastrophic forgetting in zero-shot conditions. In this work, we focus on summarization and tackle the problem through the lens of language-independent representations. After training on monolingual summarization, we perform zero-shot transfer to new languages or language pairs. We first show that naively finetuned models are highly language-specific in both output behavior and internal representations, resulting in poor zero-shot performance. Next, we propose query-key (QK) finetuning to decouple task-specific knowledge from the pretrained language generation abilities. Then, after showing downsides of the standard adversarial language classifier, we propose a balanced variant that more directly enforces language-agnostic representations. Moreover, our qualitative analyses show that removing source language identity correlates with zero-shot summarization performance. Our code is openly available.
Submitted 8 April, 2024;
originally announced April 2024.
-
Contextual Refinement of Translations: Large Language Models for Sentence and Document-Level Post-Editing
Authors:
Sai Koneru,
Miriam Exel,
Matthias Huck,
Jan Niehues
Abstract:
Large Language Models (LLMs) have demonstrated considerable success in various Natural Language Processing tasks, but they have yet to attain state-of-the-art performance in Neural Machine Translation (NMT). Nevertheless, their significant performance in tasks demanding a broad understanding and contextual processing shows their potential for translation. To exploit these abilities, we investigate using LLMs for MT and explore recent parameter-efficient fine-tuning techniques. Surprisingly, our initial experiments find that fine-tuning for translation purposes even led to performance degradation. To overcome this, we propose an alternative approach: adapting LLMs as Automatic Post-Editors (APE) rather than direct translators. Building on the LLM's exceptional ability to process and generate lengthy sequences, we also propose extending our approach to document-level translation. We show that leveraging Low-Rank-Adapter fine-tuning for APE can yield significant improvements across both sentence and document-level metrics while generalizing to out-of-domain data. Most notably, we achieve a state-of-the-art accuracy rate of 89\% on the ContraPro test set, which specifically assesses the model's ability to resolve pronoun ambiguities when translating from English to German. Lastly, we investigate a practical scenario involving manual post-editing for document-level translation, where reference context is made available. Here, we demonstrate that leveraging human corrections can significantly reduce the number of edits required for subsequent translations (an interactive demo for integrating manual feedback can be found here: https://huggingface.co/spaces/skoneru/contextual_refinement_ende).
Submitted 18 March, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
-
Audience-specific Explanations for Machine Translation
Authors:
Renhan Lou,
Jan Niehues
Abstract:
In machine translation, a common problem is that certain words, even when translated, can be incomprehensible to the target-language audience because of different cultural backgrounds. One solution is to add explanations for these words. As a first step, we therefore need to identify these words or phrases. In this work we explore techniques to extract example explanations from a parallel corpus. However, the sparsity of sentences containing words that need to be explained makes building the training dataset extremely difficult. We therefore propose a semi-automatic technique to extract these explanations from a large parallel corpus. Experiments on the English->German language pair show that our method is able to extract sentences such that more than 10% of them contain explanations, while only 1.9% of the original sentences contain explanations. In addition, experiments on the English->French and English->Chinese language pairs show similar results. This is therefore an essential first automatic step towards creating an explanation dataset. Furthermore, we show that the technique is robust for all three language pairs.
Submitted 22 September, 2023;
originally announced September 2023.
-
How Transferable are Attribute Controllers on Pretrained Multilingual Translation Models?
Authors:
Danni Liu,
Jan Niehues
Abstract:
Customizing machine translation models to comply with desired attributes (e.g., formality or grammatical gender) is a well-studied topic. However, most current approaches rely on (semi-)supervised data with attribute annotations. This data scarcity bottlenecks democratizing such customization possibilities to a wider range of languages, particularly lower-resource ones. This gap is out of sync with recent progress in pretrained massively multilingual translation models. In response, we transfer the attribute controlling capabilities to languages without attribute-annotated data with an NLLB-200 model as a foundation. Inspired by techniques from controllable generation, we employ a gradient-based inference-time controller to steer the pretrained model. The controller transfers well to zero-shot conditions, as it operates on pretrained multilingual representations and is attribute-specific rather than language-specific. With a comprehensive comparison to finetuning-based control, we demonstrate that, despite finetuning's clear dominance in supervised settings, the gap to inference-time control closes when moving to zero-shot conditions, especially with new and distant target languages. The latter also shows stronger domain robustness. We further show that our inference-time control complements finetuning. A human evaluation on a real low-resource language, Bengali, confirms our findings. Our code is available at https://github.com/dannigt/attribute-controller-transfer
Submitted 24 January, 2024; v1 submitted 15 September, 2023;
originally announced September 2023.
-
Incremental Learning of Humanoid Robot Behavior from Natural Interaction and Large Language Models
Authors:
Leonard Bärmann,
Rainer Kartmann,
Fabian Peller-Konrad,
Jan Niehues,
Alex Waibel,
Tamim Asfour
Abstract:
Natural-language dialog is key for intuitive human-robot interaction. It can be used not only to express humans' intents, but also to communicate instructions for improvement if a robot does not understand a command correctly. Of great importance is to endow robots with the ability to learn from such interaction experience in an incremental way to allow them to improve their behaviors or avoid mistakes in the future. In this paper, we propose a system to achieve incremental learning of complex behavior from natural interaction, and demonstrate its implementation on a humanoid robot. Building on recent advances, we present a system that deploys Large Language Models (LLMs) for high-level orchestration of the robot's behavior, based on the idea of enabling the LLM to generate Python statements in an interactive console to invoke both robot perception and action. The interaction loop is closed by feeding back human instructions, environment observations, and execution results to the LLM, thus informing the generation of the next statement. Specifically, we introduce incremental prompt learning, which enables the system to interactively learn from its mistakes. For that purpose, the LLM can call another LLM responsible for code-level improvements of the current interaction based on human feedback. The improved interaction is then saved in the robot's memory, and thus retrieved on similar requests. We integrate the system in the robot cognitive architecture of the humanoid robot ARMAR-6 and evaluate our methods both quantitatively (in simulation) and qualitatively (in simulation and real-world) by demonstrating generalized incrementally-learned knowledge.
Submitted 16 May, 2024; v1 submitted 8 September, 2023;
originally announced September 2023.
-
End-to-End Evaluation for Low-Latency Simultaneous Speech Translation
Authors:
Christian Huber,
Tu Anh Dinh,
Carlos Mullov,
Ngoc Quan Pham,
Thai Binh Nguyen,
Fabian Retkowski,
Stefan Constantin,
Enes Yavuz Ugan,
Danni Liu,
Zhaolin Li,
Sai Koneru,
Jan Niehues,
Alexander Waibel
Abstract:
The challenge of low-latency speech translation has recently drawn significant interest in the research community, as shown by several publications and shared tasks. Therefore, it is essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated and often it is not possible to compare different approaches.
In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion. This includes the segmentation of the audio as well as the run-time of the different components.
Secondly, we compare different approaches to low-latency speech translation using this framework. We evaluate models with the option to revise the output as well as methods with fixed output. Furthermore, we directly compare state-of-the-art cascaded as well as end-to-end systems. Finally, the framework allows automatic evaluation of translation quality as well as latency and also provides a web interface to show the low-latency model outputs to the user.
Submitted 23 October, 2023; v1 submitted 7 August, 2023;
originally announced August 2023.
-
KIT's Multilingual Speech Translation System for IWSLT 2023
Authors:
Danni Liu,
Thai Binh Nguyen,
Sai Koneru,
Enes Yavuz Ugan,
Ngoc-Quan Pham,
Tuan-Nam Nguyen,
Tu Anh Dinh,
Carlos Mullov,
Alexander Waibel,
Jan Niehues
Abstract:
Many existing speech translation benchmarks focus on native-English speech in high-quality recording conditions, which often do not match the conditions in real-life use-cases. In this paper, we describe our speech translation system for the multilingual track of IWSLT 2023, which evaluates translation quality on scientific conference talks. The test condition features accented input speech and terminology-dense content. The task requires translation into 10 languages of varying amounts of resources. In the absence of training data from the target domain, we use a retrieval-based approach (kNN-MT) for effective adaptation (+0.8 BLEU for speech translation). We also use adapters to easily integrate incremental training data from data augmentation, and show that this matches the performance of re-training. We observe that cascaded systems are more easily adaptable towards specific target domains, due to their separate modules. Our cascaded speech system substantially outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks.
Submitted 12 July, 2023; v1 submitted 8 June, 2023;
originally announced June 2023.
-
Gender Lost In Translation: How Bridging The Gap Between Languages Affects Gender Bias in Zero-Shot Multilingual Translation
Authors:
Lena Cabrera,
Jan Niehues
Abstract:
Neural machine translation (NMT) models often suffer from gender biases that harm users and society at large. In this work, we explore how bridging the gap between languages for which parallel data is not available affects gender bias in multilingual NMT, specifically for zero-shot directions. We evaluate translation between grammatical gender languages which requires preserving the inherent gender information from the source in the target language. We study the effect of encouraging language-agnostic hidden representations on models' ability to preserve gender and compare pivot-based and zero-shot translation regarding the influence of the bridge language (participating in all language pairs during training) on gender preservation. We find that language-agnostic representations mitigate zero-shot models' masculine bias, and with increased levels of gender inflection in the bridge language, pivoting surpasses zero-shot translation regarding fairer gender preservation for speaker-related gender agreement.
Submitted 26 May, 2023;
originally announced May 2023.
-
Perturbation-based QE: An Explainable, Unsupervised Word-level Quality Estimation Method for Blackbox Machine Translation
Authors:
Tu Anh Dinh,
Jan Niehues
Abstract:
Quality Estimation (QE) is the task of predicting the quality of Machine Translation (MT) system output, without using any gold-standard translation references. State-of-the-art QE models are supervised: they require human-labeled quality of some MT system output on some datasets for training, making them domain-dependent and MT-system-dependent. There has been research on unsupervised QE, which requires glass-box access to the MT systems, or parallel MT data to generate synthetic errors for training QE models. In this paper, we present Perturbation-based QE - a word-level Quality Estimation approach that works simply by analyzing MT system output on perturbed input source sentences. Our approach is unsupervised, explainable, and can evaluate any type of blackbox MT systems, including the currently prominent large language models (LLMs) with opaque internal processes. For language directions with no labeled QE data, our approach has similar or better performance than the zero-shot supervised approach on the WMT21 shared task. Our approach is better at detecting gender bias and word-sense-disambiguation errors in translation than supervised QE, indicating its robustness to out-of-domain usage. The performance gap is larger when detecting errors on a nontraditional translation-prompting LLM, indicating that our approach is more generalizable to different MT systems. We give examples demonstrating our approach's explainability power, where it shows which input source words have influence on a certain MT output word.
Submitted 13 July, 2023; v1 submitted 12 May, 2023;
originally announced May 2023.
-
Train Global, Tailor Local: Minimalist Multilingual Translation into Endangered Languages
Authors:
Zhong Zhou,
Jan Niehues,
Alex Waibel
Abstract:
In many humanitarian scenarios, translation into severely low resource languages often does not require a universal translation engine, but a dedicated text-specific translation engine. For example, healthcare records, hygienic procedures, government communication, emergency procedures and religious texts are all limited texts. While generic translation engines for all languages do not exist, translation of multilingually known limited texts into new, endangered languages may be possible and reduce human translation effort. We attempt to leverage translation resources from many rich resource languages to efficiently produce the best possible translation quality for a well known text, which is available in multiple languages, in a new, severely low resource language. We examine two approaches: (1) selecting the best seed sentences to jump-start translation in a new language, with a view to the best generalization to the remainder of a larger targeted text(s), and (2) adapting large general multilingual translation engines from many other languages to focus on a specific text in a new, unknown language. We find that adapting large pretrained multilingual models to the domain/text first and then to the severely low resource language works best. If we also select a best set of seed sentences, we can improve average chrF performance on new test languages from a baseline of 21.9 to 50.7, while reducing the number of seed sentences to only around 1,000 in the new, unknown language.
Submitted 5 May, 2023;
originally announced May 2023.
-
Fully transformer-based biomarker prediction from colorectal cancer histology: a large-scale multicentric study
Authors:
Sophia J. Wagner,
Daniel Reisenbüchler,
Nicholas P. West,
Jan Moritz Niehues,
Gregory Patrick Veldhuizen,
Philip Quirke,
Heike I. Grabsch,
Piet A. van den Brandt,
Gordon G. A. Hutchins,
Susan D. Richman,
Tanwei Yuan,
Rupert Langer,
Josien Christina Anna Jenniskens,
Kelly Offermans,
Wolfram Mueller,
Richard Gray,
Stephen B. Gruber,
Joel K. Greenson,
Gad Rennert,
Joseph D. Bonner,
Daniel Schmolze,
Jacqueline A. James,
Maurice B. Loughrey,
Manuel Salto-Tellez,
Hermann Brenner
, et al. (6 additional authors not shown)
Abstract:
Background: Deep learning (DL) can extract predictive and prognostic biomarkers from routine pathology slides in colorectal cancer (CRC). For example, a DL test for the diagnosis of microsatellite instability (MSI) in CRC was approved in 2022. Current approaches rely on convolutional neural networks (CNNs). Transformer networks are outperforming CNNs and are replacing them in many applications, but have not been used for biomarker prediction in cancer at a large scale. In addition, most DL approaches have been trained on small patient cohorts, which limits their clinical utility. Methods: In this study, we developed a new fully transformer-based pipeline for end-to-end biomarker prediction from pathology slides. We combine a pre-trained transformer encoder and a transformer network for patch aggregation, capable of yielding single and multi-target prediction at patient level. We train our pipeline on over 9,000 patients from 10 colorectal cancer cohorts. Results: A fully transformer-based approach massively improves the performance, generalizability, data efficiency, and interpretability as compared with current state-of-the-art algorithms. After training on a large multicenter cohort, we achieve a sensitivity of 0.97 with a negative predictive value of 0.99 for MSI prediction on surgical resection specimens. We demonstrate for the first time that resection specimen-only training reaches clinical-grade performance on endoscopic biopsy tissue, solving a long-standing diagnostic problem. Interpretation: A fully transformer-based end-to-end pipeline trained on thousands of pathology slides yields clinical-grade performance for biomarker prediction on surgical resections and biopsies. Our new methods are freely available under an open source license.
Submitted 1 March, 2023; v1 submitted 23 January, 2023;
originally announced January 2023.
-
Diffusion Probabilistic Models beat GANs on Medical Images
Authors:
Gustav Müller-Franzes,
Jan Moritz Niehues,
Firas Khader,
Soroosh Tayebi Arasteh,
Christoph Haarburger,
Christiane Kuhl,
Tianci Wang,
Tianyu Han,
Sven Nebelung,
Jakob Nikolas Kather,
Daniel Truhn
Abstract:
The success of Deep Learning applications critically depends on the quality and scale of the underlying training data. Generative adversarial networks (GANs) can generate arbitrarily large datasets, but diversity and fidelity are limited, which has recently been addressed by denoising diffusion probabilistic models (DDPMs) whose superiority has been demonstrated on natural images. In this study, we propose Medfusion, a conditional latent DDPM for medical images. We compare our DDPM-based model against GAN-based models, which constitute the current state-of-the-art in the medical domain. Medfusion was trained and compared with (i) StyleGan-3 on n=101,442 images from the AIROGS challenge dataset to generate fundoscopies with and without glaucoma, (ii) ProGAN on n=191,027 images from the CheXpert dataset to generate radiographs with and without cardiomegaly and (iii) wGAN on n=19,557 images from the CRCMS dataset to generate histopathological images with and without microsatellite stability. In the AIROGS, CRCMS, and CheXpert datasets, Medfusion achieved lower (=better) FID than the GANs (11.63 versus 20.43, 30.03 versus 49.26, and 17.28 versus 84.31). Also, fidelity (precision) and diversity (recall) were higher (=better) for Medfusion in all three datasets. Our study shows that DDPMs are a superior alternative to GANs for image synthesis in the medical domain.
△ Less
Submitted 14 December, 2022;
originally announced December 2022.
-
Towards continually learning new languages
Authors:
Ngoc-Quan Pham,
Jan Niehues,
Alexander Waibel
Abstract:
Multilingual speech recognition with neural networks is often implemented with batch learning, where all of the languages are available before training. The ability to add new languages after prior training sessions can be economically beneficial, but the main challenge is catastrophic forgetting. In this work, we combine the qualities of weight factorization and elastic weight consolidation in order to counter catastrophic forgetting and facilitate learning new languages quickly. In experiments that extend a model from an initial 10 languages to 26, this combination eliminated catastrophic forgetting while achieving performance on the new languages comparable to training on all languages at once from scratch.
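For readers unfamiliar with elastic weight consolidation, the following is a minimal sketch of the standard quadratic penalty, (strength / 2) * sum_i F_i * (theta_i - theta_old_i)^2, which is added to the loss on the new task. It is illustrative only and is not the authors' implementation: the function name, the flat parameter lists, and the `strength` argument are assumptions, and the weight factorization component described in the abstract is not shown.

```python
def ewc_penalty(params, old_params, fisher, strength=1.0):
    """Standard EWC penalty over flat lists of parameters.

    params:     current parameter values (hypothetical flat list)
    old_params: parameter values after training on the earlier languages
    fisher:     per-parameter Fisher information estimates (importance)
    strength:   weight of the penalty relative to the new-task loss
    """
    if not (len(params) == len(old_params) == len(fisher)):
        raise ValueError("all inputs must have the same length")
    # Parameters with high Fisher information are pulled back more strongly
    # towards their old values.
    return 0.5 * strength * sum(
        f * (p - q) ** 2 for p, q, f in zip(params, old_params, fisher)
    )
```

The penalty is zero when the parameters have not moved, and grows fastest along directions the earlier languages were most sensitive to.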
Submitted 1 March, 2023; v1 submitted 21 November, 2022;
originally announced November 2022.
-
Efficient Speech Translation with Pre-trained Models
Authors:
Zhaolin Li,
Jan Niehues
Abstract:
When building state-of-the-art speech translation models, the need for large computational resources is a significant obstacle due to the large training data size and complex models. The availability of pre-trained models is a promising opportunity to build strong speech translation systems efficiently. In a first step, we investigate efficient strategies to build cascaded and end-to-end speech translation systems based on pre-trained models. Using this strategy, we can train and apply the models on a single GPU. While the end-to-end models show superior translation performance to cascaded ones, their application is limited by the need for additional end-to-end training data. In a second step, we propose an additional similarity loss to encourage the model to generate similar hidden representations for speech and transcript. Using this technique, we can increase the data efficiency and improve the translation quality by 6 BLEU points in scenarios with limited end-to-end training data.
Submitted 9 November, 2022;
originally announced November 2022.
-
Learning an Artificial Language for Knowledge-Sharing in Multilingual Translation
Authors:
Danni Liu,
Jan Niehues
Abstract:
The cornerstone of multilingual neural translation is shared representations across languages. Given the theoretically infinite representation power of neural networks, semantically identical sentences are likely represented differently. While representing sentences in the continuous latent space ensures expressiveness, it introduces the risk of capturing irrelevant features, which hinders the learning of a common representation. In this work, we discretize the encoder output latent space of multilingual models by assigning encoder states to entries in a codebook, which in effect represents source sentences in a new artificial language. This discretization process not only offers a new way to interpret the otherwise black-box model representations, but, more importantly, gives potential for increasing robustness in unseen testing conditions. We validate our approach on large-scale experiments with realistic data volumes and domains. When tested in zero-shot conditions, our approach is competitive with two strong alternatives from the literature. We also use the learned artificial language to analyze model behavior, and discover that using a similar bridge language increases knowledge-sharing among the remaining languages.
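The idea of assigning encoder states to codebook entries can be sketched as a nearest-neighbour lookup. This is a toy illustration under stated assumptions, not the paper's method: the squared Euclidean distance, the function names `nearest_code` and `discretize`, and the plain-list vectors are all hypothetical choices made here for clarity.

```python
def nearest_code(state, codebook):
    """Return the index of the codebook entry closest to `state`.

    Assumes squared Euclidean distance; the abstract does not say
    which assignment rule the authors use.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(state, codebook[i]))


def discretize(states, codebook):
    """Map a sequence of encoder states to a sequence of discrete code indices."""
    return [nearest_code(s, codebook) for s in states]
```

Under these assumptions, the resulting index sequence plays the role of a "sentence" in the artificial language: two source sentences whose states map to the same indices are represented identically.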
Submitted 18 November, 2022; v1 submitted 2 November, 2022;
originally announced November 2022.
-
Adaptive multilingual speech recognition with pretrained models
Authors:
Ngoc-Quan Pham,
Alex Waibel,
Jan Niehues
Abstract:
Multilingual speech recognition with supervised learning has achieved great results as reflected in recent research. With the development of pretraining methods on audio and text data, it is imperative to transfer the knowledge from unsupervised multilingual models to facilitate recognition, especially in many languages with limited data. Our work investigated the effectiveness of using two pretrained models for two modalities: wav2vec 2.0 for audio and MBART50 for text, together with adaptive weight techniques, to massively improve the recognition quality on public datasets including CommonVoice and Europarl. Overall, we noticed a 44% improvement over purely supervised learning, and more importantly, each technique provides a different reinforcement in different languages. We also explore other possibilities to potentially obtain the best model by slightly adding either depth or relative attention to the architecture.
Submitted 24 May, 2022;
originally announced May 2022.
-
LibriS2S: A German-English Speech-to-Speech Translation Corpus
Authors:
Pedro Jeuris,
Jan Niehues
Abstract:
Recently, we have seen an increasing interest in the area of speech-to-text translation. This has led to astonishing improvements in this area. In contrast, activity in the area of speech-to-speech translation is still limited, although it is essential to overcome the language barrier. We believe that one of the limiting factors is the availability of appropriate training data. We address this issue by creating LibriS2S, to our knowledge the first publicly available speech-to-speech training corpus between German and English. For this corpus, we used independently created audio for German and English, leading to an unbiased pronunciation of the text in both languages. This allows the creation of a new text-to-speech and speech-to-speech translation model that directly learns to generate the speech signal based on the pronunciation of the source language. Using this created corpus, we propose Text-to-Speech models based on the example of the recently proposed FastSpeech 2 model that integrates source language information. We do this by adapting the model to take information such as the pitch, energy or transcript from the source speech as additional input.
Submitted 22 April, 2022;
originally announced April 2022.
-
CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022
Authors:
Peter Polák,
Ngoc-Quan Ngoc,
Tuan-Nam Nguyen,
Danni Liu,
Carlos Mullov,
Jan Niehues,
Ondřej Bojar,
Alexander Waibel
Abstract:
In this paper, we describe our submission to the Simultaneous Speech Translation task at IWSLT 2022. We explore strategies to utilize an offline model in a simultaneous setting without the need to modify the original model. In our experiments, we show that our onlinization algorithm is almost on par with the offline setting while being $3\times$ faster than offline in terms of latency on the test set. We also show that the onlinized offline model outperforms the best IWSLT2021 simultaneous system in medium and high latency regimes and is almost on par in the low latency regime. We make our system publicly available.
Submitted 11 May, 2022; v1 submitted 12 April, 2022;
originally announced April 2022.
-
Multilingual Simultaneous Speech Translation
Authors:
Shashank Subramanya,
Jan Niehues
Abstract:
Applications designed for simultaneous speech translation during events such as conferences or meetings need to balance quality and lag while displaying translated text to deliver a good user experience. One common approach to building online spoken language translation systems is by leveraging models built for offline speech translation. Based on a technique to adapt end-to-end monolingual models, we investigate multilingual models and different architectures (end-to-end and cascade) on the ability to perform online speech translation. On the multilingual TEDx corpus, we show that the approach generalizes to different architectures. We see similar gains in latency reduction (40% relative) across languages and architectures. However, the end-to-end architecture leads to smaller translation quality losses after adapting to the online model. Furthermore, the approach even scales to zero-shot directions.
Submitted 29 March, 2022; v1 submitted 28 March, 2022;
originally announced March 2022.
-
Tackling data scarcity in speech translation using zero-shot multilingual machine translation techniques
Authors:
Tu Anh Dinh,
Danni Liu,
Jan Niehues
Abstract:
Recently, end-to-end speech translation (ST) has gained significant attention as it avoids error propagation. However, the approach suffers from data scarcity. It heavily depends on direct ST data and is less efficient in making use of speech transcription and text translation data, which is often more easily available. In the related field of multilingual text translation, several techniques have been proposed for zero-shot translation. A main idea is to increase the similarity of semantically similar sentences in different languages. We investigate whether these ideas can be applied to speech translation, by building ST models trained on speech transcription and text translation data. We investigate the effects of data augmentation and an auxiliary loss function. The techniques were successfully applied to few-shot ST using limited ST data, with improvements of up to +12.9 BLEU points compared to direct end-to-end ST and +3.1 BLEU points compared to ST models fine-tuned from an ASR model.
Submitted 26 January, 2022;
originally announced January 2022.
-
Cost-Effective Training in Low-Resource Neural Machine Translation
Authors:
Sai Koneru,
Danni Liu,
Jan Niehues
Abstract:
While Active Learning (AL) techniques are explored in Neural Machine Translation (NMT), only a few works focus on tackling low annotation budgets where a limited number of sentences can get translated. Such situations are especially challenging and can occur for endangered languages with few human annotators or when cost constraints rule out labeling large amounts of data. Although AL is shown to be helpful with large budgets, it is not enough to build high-quality translation systems in these low-resource conditions. In this work, we propose a cost-effective training procedure to increase the performance of NMT models utilizing a small number of annotated sentences and dictionary entries. Our method leverages monolingual data with self-supervised objectives and a small-scale, inexpensive dictionary for additional supervision to initialize the NMT model before applying AL. We show that improving the model using a combination of these knowledge sources is essential to exploit AL strategies and increase gains in low-resource conditions. We also present a novel AL strategy inspired by domain adaptation for NMT and show that it is effective for low budgets. We propose a new hybrid data-driven approach, which samples sentences that are diverse from the labelled data and also most similar to unlabelled data. Finally, we show that initializing the NMT model and further using our AL strategy can achieve gains of up to $13$ BLEU compared to conventional AL methods.
Submitted 14 January, 2022;
originally announced January 2022.
-
Unsupervised Machine Translation On Dravidian Languages
Authors:
Sai Koneru,
Danni Liu,
Jan Niehues
Abstract:
Unsupervised neural machine translation (UNMT) is beneficial especially for low resource languages such as those from the Dravidian family. However, UNMT systems tend to fail in realistic scenarios involving actual low resource languages. Recent works propose to utilize auxiliary parallel data and have achieved state-of-the-art results. In this work, we focus on unsupervised translation between English and Kannada, a low resource Dravidian language. We additionally utilize a limited amount of auxiliary data between English and other related Dravidian languages. We show that unifying the writing systems is essential in unsupervised translation between the Dravidian languages. We explore several model architectures that use the auxiliary data in order to maximize knowledge sharing and enable UNMT for distant language pairs. Our experiments demonstrate that it is crucial to include auxiliary languages that are similar to our focal language, Kannada. Furthermore, we propose a metric to measure language similarity and show that it serves as a good indicator for selecting the auxiliary languages.
Submitted 29 March, 2021;
originally announced March 2021.
-
Continuous Learning in Neural Machine Translation using Bilingual Dictionaries
Authors:
Jan Niehues
Abstract:
While recent advances in deep learning led to significant improvements in machine translation, neural machine translation is often still not able to continuously adapt to the environment. For humans, as well as for machine translation, bilingual dictionaries are a promising knowledge source to continuously integrate new knowledge. However, their exploitation poses several challenges: The system needs to be able to perform one-shot learning as well as model the morphology of source and target language.
In this work, we propose an evaluation framework to assess the ability of neural machine translation to continuously learn new phrases. We integrate one-shot learning methods for neural machine translation with different word representations and show that it is important to address both in order to successfully make use of bilingual dictionaries. By addressing both challenges we are able to improve the ability to translate new, rare words and phrases from 30% to up to 70%. The correct lemma is generated in more than 90% of cases.
Submitted 12 February, 2021;
originally announced February 2021.
-
Improving Zero-Shot Translation by Disentangling Positional Information
Authors:
Danni Liu,
Jan Niehues,
James Cross,
Francisco Guzmán,
Xian Li
Abstract:
Multilingual neural machine translation has shown the capability of directly translating between language pairs unseen in training, i.e. zero-shot translation. Despite being conceptually attractive, it often suffers from low output quality. The difficulty of generalizing to new translation directions suggests the model representations are highly specific to those language pairs seen in training. We demonstrate that a main factor causing the language-specific representations is the positional correspondence to input tokens. We show that this can be easily alleviated by removing residual connections in an encoder layer. With this modification, we gain up to 18.5 BLEU points on zero-shot translation while retaining quality on supervised directions. The improvements are particularly prominent between related languages, where our proposed model outperforms pivot-based translation. Moreover, our approach allows easy integration of new languages, which substantially expands translation coverage. By thorough inspections of the hidden layer outputs, we show that our approach indeed leads to more language-independent representations.
Submitted 30 June, 2021; v1 submitted 30 December, 2020;
originally announced December 2020.
-
Adapting End-to-End Speech Recognition for Readable Subtitles
Authors:
Danni Liu,
Jan Niehues,
Gerasimos Spanakis
Abstract:
Automatic speech recognition (ASR) systems are primarily evaluated on transcription accuracy. However, in some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time. Therefore, this work focuses on ASR with output compression, a task challenging for supervised approaches due to the scarcity of training data. We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech. We then compare several methods of end-to-end speech recognition under output length constraints. The experiments show that with limited data far less than needed for training a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities. Furthermore, the best performance in terms of WER and ROUGE scores is achieved by explicitly modeling the length constraints within the end-to-end ASR system.
Submitted 25 May, 2020;
originally announced May 2020.
-
Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
Authors:
Danni Liu,
Gerasimos Spanakis,
Jan Niehues
Abstract:
Encoder-decoder models provide a generic architecture for sequence-to-sequence tasks such as speech recognition and translation. While offline systems are often evaluated on quality metrics like word error rates (WER) and BLEU, latency is also a crucial factor in many practical use-cases. We propose three latency reduction techniques for chunk-based incremental inference and evaluate their efficiency in terms of accuracy-latency trade-off. On the 300-hour How2 dataset, we reduce latency by 83% to 0.8 seconds by sacrificing 1% WER (6% rel.) compared to offline transcription. Although our experiments use the Transformer, the hypothesis selection strategies are applicable to other encoder-decoder models. To avoid expensive re-computation, we use a unidirectionally-attending encoder. After an adaptation procedure to partial sequences, the unidirectional model performs on par with the original model. We further show that our approach is also applicable to low-latency speech translation. On How2 English-Portuguese speech translation, we reduce latency to 0.7 seconds (-84% rel.) while incurring a loss of 2.4 BLEU points (5% rel.) compared to the offline system.
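As background for the WER figures quoted above, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a generic illustration, not the evaluation script used in the paper; the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Tokenization by whitespace is an assumption; real evaluation
    pipelines usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.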
Submitted 13 October, 2020; v1 submitted 22 May, 2020;
originally announced May 2020.
-
Relative Positional Encoding for Speech Recognition and Direct Translation
Authors:
Ngoc-Quan Pham,
Thanh-Le Ha,
Tuan-Nam Nguyen,
Thai-Son Nguyen,
Elizabeth Salesky,
Sebastian Stueker,
Jan Niehues,
Alexander Waibel
Abstract:
Transformer models are powerful sequence-to-sequence architectures that are capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less ideal for acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is relative distance between input states in the self-attention network. As a result, the network can better adapt to the variable distributions present in speech data. Our experiments show that our resulting model achieves the best recognition result on the Switchboard benchmark in the non-augmentation condition, and the best published result in the MuST-C speech translation benchmark. We also show that this model is able to better utilize synthetic data than the Transformer, and adapts better to variable sentence segmentation quality for speech translation.
Submitted 20 May, 2020;
originally announced May 2020.
-
Machine Translation with Unsupervised Length-Constraints
Authors:
Jan Niehues
Abstract:
We have seen significant improvements in machine translation due to the usage of deep learning. While the improvements in translation quality are impressive, the encoder-decoder architecture enables many more possibilities. In this paper, we explore one of these, the generation of constrained translation. We focus on length constraints, which are essential if the translation should be displayed in a given format. In this work, we propose an end-to-end approach for this task. Compared to a traditional method that first translates and then performs sentence compression, the text compression is learned completely unsupervised. By combining the idea with zero-shot multilingual machine translation, we are also able to perform unsupervised monolingual sentence compression. In order to fulfill the length constraints, we investigate several methods to integrate the constraints into the model. Using the presented technique, we are able to significantly improve the translation quality under constraints. Furthermore, we are able to perform unsupervised monolingual sentence compression.
Submitted 7 April, 2020;
originally announced April 2020.
-
Low Latency ASR for Simultaneous Speech Translation
Authors:
Thai Son Nguyen,
Jan Niehues,
Eunah Cho,
Thanh-Le Ha,
Kevin Kilgour,
Markus Muller,
Matthias Sperber,
Sebastian Stueker,
Alex Waibel
Abstract:
User studies have shown that reducing the latency of our simultaneous lecture translation system should be the most important goal. We therefore have worked on several techniques for reducing the latency for both components, the automatic speech recognition and the speech translation module. Since the commonly used commitment latency is not appropriate in our case of continuous stream decoding, we focused on word latency. We used it to analyze the performance of our current system and to identify opportunities for improvements. In order to minimize the latency, we combined run-on decoding with a technique for identifying stable partial hypotheses during stream decoding and a protocol for dynamic output update that allows revising the most recent parts of the transcription. This combination reduces the latency at word level, where the words are final and will never be updated again in the future, from 18.1s to 1.1s without sacrificing performance in terms of word error rate.
Submitted 22 March, 2020;
originally announced March 2020.
-
Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation
Authors:
Thai-Son Nguyen,
Sebastian Stueker,
Jan Niehues,
Alex Waibel
Abstract:
Sequence-to-Sequence (S2S) models recently started to show state-of-the-art performance for automatic speech recognition (ASR). With these large and deep models, overfitting remains the largest problem, outweighing performance improvements that can be obtained from better architectures. One solution to the overfitting problem is increasing the amount of available training data and the variety exhibited by the training data with the help of data augmentation. In this paper we examine the influence of three data augmentation methods on the performance of two S2S model architectures. One of the data augmentation methods comes from the literature, while the other two are our own development: a time perturbation in the frequency domain and sub-sequence sampling. Our experiments on Switchboard and Fisher data show state-of-the-art performance for S2S models that are trained solely on the speech training data and do not use additional text data.
Submitted 3 February, 2020; v1 submitted 29 October, 2019;
originally announced October 2019.
-
Modeling Confidence in Sequence-to-Sequence Models
Authors:
Jan Niehues,
Ngoc-Quan Pham
Abstract:
Recently, significant improvements have been achieved in various natural language processing tasks using neural sequence-to-sequence models. While aiming for the best generation quality is important, ultimately it is also necessary to develop models that can assess the quality of their output.
In this work, we propose to use the similarity between training and test conditions as a measure for models' confidence. We investigate methods solely using the similarity as well as methods combining it with the posterior probability. While traditionally only target tokens are annotated with confidence measures, we also investigate methods to annotate source tokens with confidence. By learning an internal alignment model, we can significantly improve confidence projection over using state-of-the-art external alignment tools. We evaluate the proposed methods on downstream confidence estimation for machine translation (MT). We show improvements on segment-level confidence estimation as well as on confidence estimation for source tokens. In addition, we show that the same methods can also be applied to other tasks using sequence-to-sequence models. On the automatic speech recognition (ASR) task, we are able to find 60% of the errors by looking at 20% of the data.
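A claim such as "find 60% of the errors by looking at 20% of the data" can be read as error recall under a review budget: rank items by confidence, inspect the least confident share, and count how many of the true errors fall in that share. The sketch below illustrates that reading only. It is not the authors' evaluation code, and the function name, the per-item error flags, and the rounding-down of the budget are assumptions.

```python
def error_recall_at_budget(confidences, is_error, budget):
    """Fraction of all errors found among the lowest-confidence items.

    confidences: one confidence score per item (higher = more confident)
    is_error:    one boolean per item, True if the item is actually wrong
    budget:      share of items to inspect, between 0.0 and 1.0
    """
    if len(confidences) != len(is_error):
        raise ValueError("confidences and is_error must have the same length")
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must be between 0.0 and 1.0")

    total_errors = sum(is_error)
    if total_errors == 0:
        return 0.0

    # Inspect the least confident items first.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    n_inspect = int(len(order) * budget)  # rounds down (an assumption)
    found = sum(is_error[i] for i in order[:n_inspect])
    return found / total_errors
```

A confidence measure that is uninformative would yield a recall close to the budget itself, so the gap between the two numbers is what indicates a useful measure.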
Submitted 4 October, 2019;
originally announced October 2019.
-
Incremental processing of noisy user utterances in the spoken language understanding task
Authors:
Stefan Constantin,
Jan Niehues,
Alex Waibel
Abstract:
The state-of-the-art neural network architectures make it possible to create spoken language understanding systems with high quality and fast processing time. One major challenge for real-world applications is the high latency of these systems caused by triggered actions with high execution times. If an action can be separated into subactions, the reaction time of the systems can be improved through incremental processing of the user utterance and starting subactions while the utterance is still being uttered. In this work, we present a model-agnostic method to achieve high quality in processing incrementally produced partial utterances. Based on clean and noisy versions of the ATIS dataset, we show how to create datasets with our method to create low-latency natural language understanding components. We get improvements of up to 47.91 absolute percentage points in the metric F1-score.
Submitted 30 September, 2019;
originally announced September 2019.
-
Improving Zero-shot Translation with Language-Independent Constraints
Authors:
Ngoc-Quan Pham,
Jan Niehues,
Thanh-Le Ha,
Alex Waibel
Abstract:
An important concern in training multilingual neural machine translation (NMT) is to translate between language pairs unseen during training, i.e., zero-shot translation. Improving this ability kills two birds with one stone by providing an alternative to pivot translation which also allows us to better understand how the model captures information between languages.
In this work, we carried out an investigation on this capability of the multilingual NMT models. First, we intentionally create an encoder architecture which is independent with respect to the source language. Such experiments shed light on the ability of NMT encoders to learn multilingual representations, in general. Based on such proof of concept, we were able to design regularization methods into the standard Transformer model, so that the whole architecture becomes more robust in zero-shot conditions. We investigated the behaviour of such models on the standard IWSLT 2017 multilingual dataset. We achieved an average improvement of 2.23 BLEU points across 12 language pairs compared to the zero-shot performance of a state-of-the-art multilingual system. Additionally, we carry out further experiments in which the effect is confirmed even for language pairs with multiple intermediate pivots.
Submitted 20 June, 2019;
originally announced June 2019.
-
Very Deep Self-Attention Networks for End-to-End Speech Recognition
Authors:
Ngoc-Quan Pham,
Thai-Son Nguyen,
Jan Niehues,
Markus Müller,
Sebastian Stüker,
Alexander Waibel
Abstract:
Recently, end-to-end sequence-to-sequence models for speech recognition have gained significant interest in the research community. While previous architecture choices revolve around time-delay neural networks (TDNN) and long short-term memory (LSTM) recurrent neural networks, we propose to use self-attention via the Transformer architecture as an alternative. Our analysis shows that deep Transformer networks with high learning capacity are able to exceed performance from previous end-to-end approaches and even match the conventional hybrid systems. Moreover, we trained very deep models with up to 48 Transformer layers for both encoder and decoder combined with stochastic residual connections, which greatly improve generalizability and training efficiency. The resulting models outperform all previous end-to-end ASR approaches on the Switchboard benchmark. An ensemble of these models achieves 9.9% and 17.7% WER on Switchboard and CallHome test sets respectively. This finding brings our end-to-end models to competitive levels with previous hybrid systems. Further, with model ensembling the Transformers can outperform certain hybrid systems, which are more complicated in terms of both structure and training procedure.
Submitted 3 May, 2019; v1 submitted 30 April, 2019;
originally announced April 2019.
-
Attention-Passing Models for Robust and Data-Efficient End-to-End Speech Translation
Authors:
Matthias Sperber,
Graham Neubig,
Jan Niehues,
Alex Waibel
Abstract:
Speech translation has traditionally been approached through cascaded models consisting of a speech recognizer trained on a corpus of transcribed speech, and a machine translation system trained on parallel texts. Several recent works have shown the feasibility of collapsing the cascade into a single, direct model that can be trained in an end-to-end fashion on a corpus of translated speech. However, experiments are inconclusive on whether the cascade or the direct model is stronger, and have only been conducted under the unrealistic assumption that both are trained on equal amounts of data, ignoring other available speech recognition and machine translation corpora.
In this paper, we demonstrate that direct speech translation models require more data to perform well than cascaded models, and while they allow including auxiliary data through multi-task training, they are poor at exploiting such data, putting them at a severe disadvantage. As a remedy, we propose the use of end-to-end trainable models with two attention mechanisms, the first establishing source speech to source text alignments, the second modeling source to target text alignment. We show that such models naturally decompose into multi-task-trainable recognition and translation tasks and propose an attention-passing technique that alleviates error propagation issues in a previous formulation of a model with two attention stages. Our proposed model outperforms all examined baselines and is able to exploit auxiliary training data much more effectively than direct attentional models.
Submitted 15 April, 2019;
originally announced April 2019.
-
Multi-task learning to improve natural language understanding
Authors:
Stefan Constantin,
Jan Niehues,
Alex Waibel
Abstract:
Recent advancements in sequence-to-sequence neural network architectures have led to improved natural language understanding. When building a neural network-based Natural Language Understanding component, one main challenge is to collect enough training data. The generation of a synthetic dataset is an inexpensive and quick way to collect data. Since this data often has less variety than real natural language, neural networks often have problems generalizing to unseen utterances during testing. In this work, we address this challenge by using multi-task learning. We train on out-of-domain real data alongside in-domain synthetic data to improve natural language understanding. We evaluate this approach in the domain of airline travel information with two synthetic datasets. As out-of-domain real data, we test two datasets based on the subtitles of movies and series. By using an attention-based encoder-decoder model, we were able to improve the F1-score over strong baselines from 80.76% to 84.98% on the smaller synthetic dataset.
Submitted 15 February, 2019; v1 submitted 17 December, 2018;
originally announced December 2018.
-
Towards Fluent Translations from Disfluent Speech
Authors:
Elizabeth Salesky,
Susanne Burger,
Jan Niehues,
Alex Waibel
Abstract:
When translating from speech, special consideration for conversational speech phenomena such as disfluencies is necessary. Most machine translation training data consists of well-formed written texts, causing issues when translating spontaneous speech. Previous work has introduced an intermediate step between speech recognition (ASR) and machine translation (MT) to remove disfluencies, making the data better-matched to typical translation text and significantly improving performance. However, with the rise of end-to-end speech translation systems, this intermediate step must be incorporated into the sequence-to-sequence architecture. Further, though translated speech datasets exist, they are typically news or rehearsed speech without many disfluencies (e.g. TED), or the disfluencies are translated into the references (e.g. Fisher). To generate clean translations from disfluent speech, cleaned references are necessary for evaluation. We introduce a corpus of cleaned target data for the Fisher Spanish-English dataset for this task. We compare how different architectures handle disfluencies and provide a baseline for removing disfluencies in end-to-end translation.
Submitted 7 November, 2018;
originally announced November 2018.
-
Optimizing Segmentation Granularity for Neural Machine Translation
Authors:
Elizabeth Salesky,
Andrew Runge,
Alex Coda,
Jan Niehues,
Graham Neubig
Abstract:
In neural machine translation (NMT), it has become standard to translate using subword units to allow for an open vocabulary and improve accuracy on infrequent words. Byte-pair encoding (BPE) and its variants are the predominant approach to generating these subwords, as they are unsupervised, resource-free, and empirically effective. However, the granularity of these subword units is a hyperparameter to be tuned for each language and task, using methods such as grid search. Tuning may be done inexhaustively or skipped entirely due to resource constraints, leading to sub-optimal performance. In this paper, we propose a method to automatically tune this parameter using only one training pass. We incrementally introduce new vocabulary online based on the held-out validation loss, beginning with smaller, general subwords and adding larger, more specific units over the course of training. Our method matches the results found with grid search, optimizing segmentation granularity without any additional training time. We also show benefits in training efficiency and performance improvements for rare words due to the way embeddings for larger units are incrementally constructed by combining those from smaller units.
Submitted 19 October, 2018;
originally announced October 2018.
-
Towards one-shot learning for rare-word translation with external experts
Authors:
Ngoc-Quan Pham,
Jan Niehues,
Alex Waibel
Abstract:
Neural machine translation (NMT) has significantly improved the quality of automatic translation models. One of the main challenges in current systems is the translation of rare words. We present a generic approach to address this weakness by having external models annotate the training data as Experts, and control the model-expert interaction with a pointer network and reinforcement learning. Our experiments using phrase-based models to simulate Experts to complement neural machine translation models show that the model can be trained to copy the annotations into the output consistently. We demonstrate the benefit of our proposed framework in out-of-domain translation scenarios with only lexical resources, improving more than 1.0 BLEU point in both translation directions English to Spanish and German to English.
Submitted 10 September, 2018;
originally announced September 2018.
-
Low-Latency Neural Speech Translation
Authors:
Jan Niehues,
Ngoc-Quan Pham,
Thanh-Le Ha,
Matthias Sperber,
Alex Waibel
Abstract:
Through the development of neural machine translation, the quality of machine translation systems has been improved significantly. By exploiting advancements in deep learning, systems are now able to better approximate the complex mapping from source sentences to target sentences. But with this ability, new challenges also arise. An example is the translation of partial sentences in low-latency speech translation. Since the model has only seen complete sentences in training, it will always try to generate a complete sentence, though the input may only be a partial sentence. We show that NMT systems can be adapted to scenarios where no task-specific training data is available. Furthermore, this is possible without losing performance on the original training data. We achieve this by creating artificial data and by using multi-task learning. After adaptation, we are able to reduce the number of corrections displayed during incremental output construction by 45%, without a decrease in translation quality.
Submitted 1 August, 2018;
originally announced August 2018.
-
A Hierarchical Approach to Neural Context-Aware Modeling
Authors:
Patrick Huber,
Jan Niehues,
Alex Waibel
Abstract:
We present a new recurrent neural network topology to enhance state-of-the-art machine learning systems by incorporating a broader context. Our approach overcomes recent limitations with extended narratives through a multi-layered computational approach to generate an abstract context representation. Therefore, the developed system captures the narrative on word-level, sentence-level, and context-level. Through the hierarchical set-up, our proposed model summarizes the most salient information on each level and creates an abstract representation of the extended context. We subsequently use this representation to enhance neural language processing systems on the task of semantic error detection. To show the potential of the newly introduced topology, we compare the approach against a context-agnostic set-up including a standard neural language model and a supervised binary classification network. The performance measures on the error detection task show the advantage of the hierarchical context-aware topologies, improving the baseline by 12.75% relative for unsupervised models and 20.37% relative for supervised models.
Submitted 6 August, 2018; v1 submitted 27 July, 2018;
originally announced July 2018.
-
Robust and Scalable Differentiable Neural Computer for Question Answering
Authors:
Jörg Franke,
Jan Niehues,
Alex Waibel
Abstract:
Deep learning models are often not easily adaptable to new tasks and require task-specific adjustments. The differentiable neural computer (DNC), a memory-augmented neural network, is designed as a general problem solver which can be used in a wide range of tasks. But in reality, it is hard to apply this model to new tasks. We analyze the DNC and identify possible improvements within the application of question answering. This motivates a more robust and scalable DNC (rsDNC). The objective precondition is to keep the general character of this model intact while making its application more reliable and speeding up its required training time. The rsDNC is distinguished by a more robust training, a slim memory unit and a bidirectional architecture. We not only achieve new state-of-the-art performance on the bAbI task, but also minimize the performance variance between different initializations. Furthermore, we demonstrate the simplified applicability of the rsDNC to new tasks with passable results on the CNN RC task without adaptions.
Submitted 7 July, 2018;
originally announced July 2018.
-
Self-Attentional Acoustic Models
Authors:
Matthias Sperber,
Jan Niehues,
Graham Neubig,
Sebastian Stüker,
Alex Waibel
Abstract:
Self-attention is a method of encoding sequences of vectors by relating these vectors to each other based on pairwise similarities. These models have recently shown promising results for modeling discrete sequences, but they are non-trivial to apply to acoustic modeling due to computational and modeling issues. In this paper, we apply self-attention to acoustic modeling, proposing several improvements to mitigate these issues: First, self-attention memory grows quadratically in the sequence length, which we address through a downsampling technique. Second, we find that previous approaches to incorporate position information into the model are unsuitable and explore other representations and hybrid models to this end. Third, to stress the importance of local context in the acoustic signal, we propose a Gaussian biasing approach that allows explicit control over the context range. Experiments find that our model approaches a strong baseline based on LSTMs with network-in-network connections while being much faster to compute. Besides speed, we find that interpretability is a strength of self-attentional acoustic models, and demonstrate that self-attention heads learn a linguistically plausible division of labor.
Submitted 18 June, 2018; v1 submitted 26 March, 2018;
originally announced March 2018.
-
Automated Evaluation of Out-of-Context Errors
Authors:
Patrick Huber,
Jan Niehues,
Alex Waibel
Abstract:
We present a new approach to evaluate computational models for the task of text understanding by the means of out-of-context error detection. Through the novel design of our automated modification process, existing large-scale data sources can be adopted for a vast number of text understanding tasks. The data is thereby altered on a semantic level, allowing models to be tested against a challenging set of modified text passages that require to comprise a broader narrative discourse. Our newly introduced task targets actual real-world problems of transcription and translation systems by inserting authentic out-of-context errors. The automated modification process is applied to the 2016 TEDTalk corpus. Entirely automating the process allows the adoption of complete datasets at low cost, facilitating supervised learning procedures and deeper networks to be trained and tested. To evaluate the quality of the modification algorithm a language model and a supervised binary classification model are trained and tested on the altered dataset. A human baseline evaluation is examined to compare the results with human performance. The outcome of the evaluation task indicates the difficulty to detect semantic errors for machine-learning algorithms and humans, showing that the errors cannot be identified when limited to a single sentence.
Submitted 23 March, 2018;
originally announced March 2018.
-
An End-to-End Goal-Oriented Dialog System with a Generative Natural Language Response Generation
Authors:
Stefan Constantin,
Jan Niehues,
Alex Waibel
Abstract:
Recent advancements in deep learning have allowed the development of end-to-end trained goal-oriented dialog systems. Although these systems already achieve good performance, some simplifications limit their usage in real-life scenarios.
In this work, we address two of these limitations: ignoring positional information and a fixed number of possible response candidates. We propose to use positional encodings in the input to model the word order of the user utterances. Furthermore, by using a feedforward neural network, we are able to generate the output word by word and are no longer restricted to a fixed number of possible response candidates. Using the positional encoding, we were able to achieve better accuracies in the Dialog bAbI Tasks and using the feedforward neural network for generating the response, we were able to save computation time and space consumption.
Submitted 15 March, 2018; v1 submitted 6 March, 2018;
originally announced March 2018.
-
Effective Strategies in Zero-Shot Neural Machine Translation
Authors:
Thanh-Le Ha,
Jan Niehues,
Alexander Waibel
Abstract:
In this paper, we proposed two strategies which can be applied to a multilingual neural machine translation system in order to better tackle zero-shot scenarios despite not having any parallel corpus. The experiments show that they are effective in terms of both performance and computing resources, especially in multilingual translation of unbalanced data in real zero-resourced condition when they alleviate the language bias problem.
Submitted 22 November, 2017; v1 submitted 21 November, 2017;
originally announced November 2017.