-
Anatomy of Industrial Scale Multilingual ASR
Authors:
Francis McCann Ramirez,
Luka Chkhetiani,
Andrew Ehrenberg,
Robert McHardy,
Rami Botros,
Yash Khare,
Andrea Vanzo,
Taufiquzzaman Peyash,
Gabriel Oexle,
Michael Liang,
Ilya Sklyar,
Enver Fakhan,
Ahmed Etefy,
Daniel McCrystal,
Sam Flamini,
Domenic Donato,
Takuya Yoshioka
Abstract:
This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.
Submitted 16 April, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping
Authors:
Kevin Zhang,
Luka Chkhetiani,
Francis McCann Ramirez,
Yash Khare,
Andrea Vanzo,
Michael Liang,
Sergio Ramirez Martin,
Gabriel Oexle,
Ruben Bousbib,
Taufiquzzaman Peyash,
Michael Nguyen,
Dillon Pulliam,
Domenic Donato
Abstract:
This paper presents Conformer-1, an end-to-end Automatic Speech Recognition (ASR) model trained on an extensive dataset of 570k hours of speech audio data, 91% of which was acquired from publicly available sources. To achieve this, we perform Noisy Student Training after generating pseudo-labels for the unlabeled public data using a strong Conformer RNN-T baseline model. The addition of these pseudo-labeled data results in remarkable improvements in relative Word Error Rate (WER) by 11.5% and 24.3% for our asynchronous and realtime models, respectively. Additionally, the model is more robust to background noise owing to the addition of these data. The results obtained in this study demonstrate that the incorporation of pseudo-labeled publicly available data is a highly effective strategy for improving ASR accuracy and noise robustness.
Submitted 12 April, 2024; v1 submitted 10 April, 2024;
originally announced April 2024.
-
Going for GOAL: A Resource for Grounded Football Commentaries
Authors:
Alessandro Suglia,
José Lopes,
Emanuele Bastianelli,
Andrea Vanzo,
Shubham Agarwal,
Malvina Nikandrou,
Lu Yu,
Ioannis Konstas,
Verena Rieser
Abstract:
Recent video+language datasets cover domains where the interaction is highly structured, such as instructional videos, or where the interaction is scripted, such as TV shows. Both of these properties can lead to spurious cues to be exploited by models rather than learning to ground language. In this paper, we present GrOunded footbAlL commentaries (GOAL), a novel dataset of football (or `soccer') highlights videos with transcribed live commentaries in English. As the course of a game is unpredictable, so are commentaries, which makes them a unique resource to investigate dynamic language grounding. We also provide state-of-the-art baselines for the following tasks: frame reordering, moment retrieval, live commentary retrieval and play-by-play live commentary generation. Results show that SOTA models perform reasonably well in most tasks. We discuss the implications of these results and suggest new tasks for which GOAL can be used. Our codebase is available at: https://gitlab.com/grounded-sport-convai/goal-baselines.
Submitted 8 November, 2022;
originally announced November 2022.
-
Playing with words: Do people exploit loaded language to affect others' decisions for their own benefit?
Authors:
Valerio Capraro,
Andrea Vanzo,
Antonio Cabrales
Abstract:
We report on three pre-registered studies testing whether people in the position of describing a decision problem to decision-makers exploit this opportunity for their benefit, by choosing descriptions that may be potentially beneficial for themselves. In Study 1, recipients of an extreme dictator game (where dictators can either take the whole pie for themselves or give it entirely to the receiver) are asked to choose the instructions used to introduce the game to dictators, among six different instructions that are known from previous research to affect dictators' decisions. The results demonstrate that some dictator game recipients tend to choose instructions that make them more likely to receive a higher payoff. Study 2 shows that people who choose descriptions that make them more likely to receive a higher payoff indeed believe that they will receive a higher payoff. Study 3 shows that receivers are more likely than dictators to choose these descriptions. In sum, our work suggests that some people choose descriptions that are beneficial to themselves; we also found some evidence that deliberative thinking and young age are associated with this tendency.
Submitted 7 December, 2021; v1 submitted 7 June, 2021;
originally announced June 2021.
-
An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games
Authors:
Alessandro Suglia,
Yonatan Bisk,
Ioannis Konstas,
Antonio Vergari,
Emanuele Bastianelli,
Andrea Vanzo,
Oliver Lemon
Abstract:
Guessing games are a prototypical instance of the "learning by interacting" paradigm. This work investigates how well an artificial agent can benefit from playing guessing games when later asked to perform on novel NLP downstream tasks such as Visual Question Answering (VQA). We propose two ways to exploit playing guessing games: 1) a supervised learning scenario in which the agent learns to mimic successful guessing games and 2) a novel way for an agent to play by itself, called Self-play via Iterated Experience Learning (SPIEL).
We evaluate the ability of both procedures to generalize: an in-domain evaluation shows an increased accuracy (+7.79) compared with competitors on the evaluation suite CompGuessWhat?!; a transfer evaluation shows improved performance for VQA on the TDIUC dataset in terms of harmonic average accuracy (+5.31) thanks to more fine-grained object representations learned via SPIEL.
Submitted 31 January, 2021;
originally announced February 2021.
-
Encoding Syntactic Constituency Paths for Frame-Semantic Parsing with Graph Convolutional Networks
Authors:
Emanuele Bastianelli,
Andrea Vanzo,
Oliver Lemon
Abstract:
We study the problem of integrating syntactic information from constituency trees into a neural model in Frame-semantic parsing sub-tasks, namely Target Identification (TI), Frame Identification (FI), and Semantic Role Labeling (SRL). We use a Graph Convolutional Network to learn specific representations of constituents, such that each constituent is profiled as the production grammar rule it corresponds to. We leverage these representations to build syntactic features for each word in a sentence, computed as the sum of all the constituents on the path between a word and a task-specific node in the tree, e.g. the target predicate for SRL. Our approach improves state-of-the-art results on TI and SRL by ~1% and ~3.5% points, respectively (+2.5% additional points are gained with BERT as input), when tested on FrameNet 1.5, while yielding comparable results on the CoNLL05 dataset to other syntax-aware systems.
Submitted 26 November, 2020;
originally announced November 2020.
-
SLURP: A Spoken Language Understanding Resource Package
Authors:
Emanuele Bastianelli,
Andrea Vanzo,
Pawel Swietojanski,
Verena Rieser
Abstract:
Spoken Language Understanding infers semantic meaning directly from audio data, and thus promises to reduce error propagation and misunderstandings in end-user applications. However, publicly available SLU resources are limited. In this paper, we release SLURP, a new SLU package containing the following: (1) A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets; (2) Competitive baselines based on state-of-the-art NLU and ASR systems; (3) A new transparent metric for entity labelling which enables a detailed error analysis for identifying potential areas of improvement. SLURP is available at https://github.com/pswietojanski/slurp.
△ Less
Submitted 26 November, 2020;
originally announced November 2020.
-
Imagining Grounded Conceptual Representations from Perceptual Information in Situated Guessing Games
Authors:
Alessandro Suglia,
Antonio Vergari,
Ioannis Konstas,
Yonatan Bisk,
Emanuele Bastianelli,
Andrea Vanzo,
Oliver Lemon
Abstract:
In visual guessing games, a Guesser has to identify a target object in a scene by asking questions to an Oracle. An effective strategy for the players is to learn conceptual representations of objects that are both discriminative and expressive enough to ask questions and guess correctly. However, as shown by Suglia et al. (2020), existing models fail to learn truly multi-modal representations, relying instead on gold category labels for objects in the scene both at training and inference time. This provides an unnatural performance advantage when categories at inference time match those at training time, and it causes models to fail in more realistic "zero-shot" scenarios where out-of-domain object categories are involved. To overcome this issue, we introduce a novel "imagination" module based on Regularized Auto-Encoders, that learns context-aware and category-aware latent embeddings without relying on category labels at inference time. Our imagination module outperforms state-of-the-art competitors by 8.26% gameplay accuracy in the CompGuessWhat?! zero-shot scenario (Suglia et al., 2020), and it improves the Oracle and Guesser accuracy by 2.08% and 12.86% in the GuessWhat?! benchmark, when no gold categories are available at inference time. The imagination module also boosts reasoning about object properties and attributes.
Submitted 5 November, 2020;
originally announced November 2020.
-
CompGuessWhat?!: A Multi-task Evaluation Framework for Grounded Language Learning
Authors:
Alessandro Suglia,
Ioannis Konstas,
Andrea Vanzo,
Emanuele Bastianelli,
Desmond Elliott,
Stella Frank,
Oliver Lemon
Abstract:
Approaches to Grounded Language Learning typically focus on a single task-based final performance measure that may not depend on desirable properties of the learned hidden representations, such as their ability to predict salient attributes or to generalise to unseen situations. To remedy this, we present GROLLA, an evaluation framework for Grounded Language Learning with Attributes with three sub-tasks: 1) Goal-oriented evaluation; 2) Object attribute prediction evaluation; and 3) Zero-shot evaluation. We also propose a new dataset CompGuessWhat?! as an instance of this framework for evaluating the quality of learned neural representations, in particular concerning attribute grounding. To this end, we extend the original GuessWhat?! dataset by including a semantic layer on top of the perceptual one. Specifically, we enrich the VisualGenome scene graphs associated with the GuessWhat?! images with abstract and situated attributes. By using diagnostic classifiers, we show that current models learn representations that are not expressive enough to encode object attributes (average F1 of 44.27). In addition, they do not learn strategies nor representations that are robust enough to perform well when novel scenes or objects are involved in gameplay (zero-shot best accuracy 50.06%).
Submitted 3 June, 2020;
originally announced June 2020.
-
Hierarchical Multi-Task Natural Language Understanding for Cross-domain Conversational AI: HERMIT NLU
Authors:
Andrea Vanzo,
Emanuele Bastianelli,
Oliver Lemon
Abstract:
We present a new neural architecture for wide-coverage Natural Language Understanding in Spoken Dialogue Systems. We develop a hierarchical multi-task architecture, which delivers a multi-layer representation of sentence meaning (i.e., Dialogue Acts and Frame-like structures). The architecture is a hierarchy of self-attention mechanisms and BiLSTM encoders followed by CRF tagging layers. We describe a variety of experiments, showing that our approach obtains promising results on a dataset annotated with Dialogue Acts and Frame Semantics. Moreover, we demonstrate its applicability to a different, publicly available NLU dataset annotated with domain-specific intents and corresponding semantic roles, providing overall performance higher than state-of-the-art tools such as RASA, Dialogflow, LUIS, and Watson. For example, we show an average 4.45% improvement in entity tagging F-score over Rasa, Dialogflow and LUIS.
Submitted 2 October, 2019;
originally announced October 2019.
-
MuMMER: Socially Intelligent Human-Robot Interaction in Public Spaces
Authors:
Mary Ellen Foster,
Bart Craenen,
Amol Deshmukh,
Oliver Lemon,
Emanuele Bastianelli,
Christian Dondrup,
Ioannis Papaioannou,
Andrea Vanzo,
Jean-Marc Odobez,
Olivier Canévet,
Yuanzhouhan Cao,
Weipeng He,
Angel Martínez-González,
Petr Motlicek,
Rémy Siegfried,
Rachid Alami,
Kathleen Belhassein,
Guilhem Buisan,
Aurélie Clodic,
Amandine Mayima,
Yoan Sallami,
Guillaume Sarthou,
Phani-Teja Singamaneni,
Jules Waldhart,
Alexandre Mazel
, et al. (5 additional authors not shown)
Abstract:
In the EU-funded MuMMER project, we have developed a social robot designed to interact naturally and flexibly with users in public spaces such as a shopping mall. We present the latest version of the robot system developed during the project. This system encompasses audio-visual sensing, social signal processing, conversational interaction, perspective taking, geometric reasoning, and motion planning. It successfully combines all these components in an overarching framework using the Robot Operating System (ROS) and has been deployed to a shopping mall in Finland interacting with customers. In this paper, we describe the system components, their interplay, and the resulting robot behaviours and scenarios provided at the shopping mall.
Submitted 15 September, 2019;
originally announced September 2019.
-
The power of moral words: Loaded language generates framing effects in the extreme dictator game
Authors:
Valerio Capraro,
Andrea Vanzo
Abstract:
Understanding whether preferences are sensitive to the frame has been a major topic of debate in the last decades. For example, several works have explored whether the dictator game in the give frame gives rise to a different rate of pro-sociality than the same game in the take frame, leading to mixed results. Here we contribute to this debate with two experiments. In Study 1 ($N=567$) we implement an extreme dictator game in which the dictator either gets \$0.50 and the recipient gets nothing, or the opposite (i.e., the recipient gets \$0.50 and the dictator gets nothing). We experimentally manipulate the words describing the available actions using six terms, from very negative (e.g., stealing) to very positive (e.g., donating) connotations. We find that the rate of pro-sociality is affected by the words used to describe the available actions. In Study 2 ($N=221$) we ask brand new participants to rate each of the words used in Study 1 from "extremely wrong" to "extremely right". We find that these moral judgments explain the framing effect in Study 1. In sum, our studies provide evidence that framing effects in an extreme Dictator game can be generated using morally loaded language.
Submitted 6 April, 2019; v1 submitted 8 January, 2019;
originally announced January 2019.