Showing 1–20 of 20 results for author: Moreno, P

Searching in archive eess.

  1. arXiv:2404.10836  [pdf, other]

    cs.CV eess.IV

    Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors

    Authors: João Luzio, Alexandre Bernardino, Plinio Moreno

    Abstract: The aim of this work is to establish how accurately a recent semantic-based foveal active perception model is able to complete visual tasks that are regularly performed by humans, namely, scene exploration and visual search. This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across mu…

    Submitted 16 April, 2024; originally announced April 2024.

  2. arXiv:2402.17184  [pdf, other]

    cs.CL cs.SD eess.AS

    Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models

    Authors: Rohit Prabhavalkar, Zhong Meng, Weiran Wang, Adam Stooke, Xingyu Cai, Yanzhang He, Arun Narayanan, Dongseong Hwang, Tara N. Sainath, Pedro J. Moreno

    Abstract: The accuracy of end-to-end (E2E) automatic speech recognition (ASR) models continues to improve as they are scaled to larger sizes, with some now reaching billions of parameters. Widespread deployment and adoption of these models, however, requires computationally efficient strategies for decoding. In the present work, we study one such strategy: applying multiple frame reduction layers in the enc…

    Submitted 26 February, 2024; originally announced February 2024.

    Comments: Accepted to 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)

  3. arXiv:2306.08133  [pdf, ps, other]

    eess.AS cs.CL

    Large-scale Language Model Rescoring on Long-form Data

    Authors: Tongzhou Chen, Cyril Allauzen, Yinghui Huang, Daniel Park, David Rybach, W. Ronny Huang, Rodrigo Cabrera, Kartik Audhkhasi, Bhuvana Ramabhadran, Pedro J. Moreno, Michael Riley

    Abstract: In this work, we study the impact of Large-scale Language Models (LLM) on Automated Speech Recognition (ASR) of YouTube videos, which we use as a source for long-form ASR. We demonstrate up to 8% relative reduction in Word Error Rate (WER) on US English (en-us) and code-switched Indian English (en-in) long-form ASR test sets and a reduction of up to 30% relative on Salient Term Error Rate (STER)…

    Submitted 5 September, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: 5 pages, accepted in ICASSP 2023

    Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  4. arXiv:2305.15590  [pdf]

    q-bio.QM eess.IV

    Deep Representation Learning of Tissue Metabolome and Computed Tomography Images Annotates Non-invasive Classification and Prognosis Prediction of NSCLC

    Authors: Marc Boubnovski Martell, Kristofer Linton-Reid, Sumeet Hindocha, Mitchell Chen, OCTAPUS-AI, Paula Moreno, Marina Álvarez-Benito, Ángel Salvatierra, Richard Lee, Joram M. Posma, Marco A Calzado, Eric O Aboagye

    Abstract: The rich chemical information from tissue metabolomics provides a powerful means to elaborate tissue physiology or tumor characteristics at cellular and tumor microenvironment levels. However, the process of obtaining such information requires invasive biopsies, is costly, and can delay clinical patient management. Conversely, computed tomography (CT) is a clinical standard of care but does not in…

    Submitted 26 May, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

  5. arXiv:2303.01037  [pdf, other]

    cs.CL cs.SD eess.AS

    Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

    Authors: Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, et al. (2 additional authors not shown)

    Abstract: We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quant…

    Submitted 24 September, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: 20 pages, 7 figures, 8 tables

  6. arXiv:2210.17049  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Modular Hybrid Autoregressive Transducer

    Authors: Zhong Meng, Tongzhou Chen, Rohit Prabhavalkar, Yu Zhang, Gary Wang, Kartik Audhkhasi, Jesse Emond, Trevor Strohman, Bhuvana Ramabhadran, W. Ronny Huang, Ehsan Variani, Yinghui Huang, Pedro J. Moreno

    Abstract: Text-only adaptation of a transducer model remains challenging for end-to-end speech recognition since the transducer has no clearly separated acoustic model (AM), language model (LM) or blank model. In this work, we propose a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict label and blank distributions, respectively, along with a…

    Submitted 16 February, 2023; v1 submitted 30 October, 2022; originally announced October 2022.

    Comments: 8 pages, 1 figure, in SLT 2022

    Journal ref: 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar

  7. arXiv:2210.10879  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    G-Augment: Searching for the Meta-Structure of Data Augmentation Policies for ASR

    Authors: Gary Wang, Ekin D. Cubuk, Andrew Rosenberg, Shuyang Cheng, Ron J. Weiss, Bhuvana Ramabhadran, Pedro J. Moreno, Quoc V. Le, Daniel S. Park

    Abstract: Data augmentation is a ubiquitous technique used to provide robustness to automatic speech recognition (ASR) training. However, even as so much of the ASR training process has become automated and more "end-to-end", the data augmentation policy (what augmentation functions to use, and how to apply them) remains hand-crafted. We present Graph-Augment, a technique to define the augmentation space as…

    Submitted 24 October, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

    Comments: 6 pages, accepted at SLT 2022. Updated with copyright

  8. arXiv:2210.10027  [pdf, other]

    cs.CL cs.SD eess.AS

    Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR

    Authors: Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Pedro Moreno, Nanxin Chen

    Abstract: Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech a…

    Submitted 21 October, 2022; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: Accepted by SLT 2022

    MSC Class: 68T10 ACM Class: I.2.7

  9. arXiv:2209.06096  [pdf, other]

    cs.CL cs.SD eess.AS

    Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition

    Authors: Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno

    Abstract: Attention layers are an integral part of modern end-to-end automatic speech recognition systems, for instance as part of the Transformer or Conformer architecture. Attention is typically multi-headed, where each head has an independent set of learned parameters and operates on the same input feature sequence. The output of multi-headed attention is a fusion of the outputs from the individual heads…

    Submitted 13 September, 2022; originally announced September 2022.

    Comments: Accepted for publication in Interspeech 2022

  10. arXiv:2208.11594  [pdf, other]

    cs.CV eess.SY

    Active Gaze Control for Foveal Scene Exploration

    Authors: Alexandre M. F. Dias, Luís Simões, Plinio Moreno, Alexandre Bernardino

    Abstract: Active perception and foveal vision are the foundations of the human visual system. While foveal vision reduces the amount of information to process during a gaze fixation, active perception will change the gaze direction to the most promising parts of the visual field. We propose a methodology to emulate how humans and robots with foveal cameras would explore a scene, identifying the objects pres…

    Submitted 24 August, 2022; originally announced August 2022.

    Comments: 6 pages, 8 figures, ICDL 2022 (International Conference on Development and Learning, formerly ICDL-EpiRob)

  11. arXiv:2204.03409  [pdf, other]

    cs.CL cs.SD eess.AS

    MAESTRO: Matched Speech Text Representations through Modality Matching

    Authors: Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Moreno, Ankur Bapna, Heiga Zen

    Abstract: We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task.…

    Submitted 1 July, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: Accepted by Interspeech 2022

    MSC Class: 68T10 ACM Class: I.2.7

  12. A Scalable Model Specialization Framework for Training and Inference using Submodels and its Application to Speech Model Personalization

    Authors: Fadi Biadsy, Youzheng Chen, Xia Zhang, Oleg Rybakov, Andrew Rosenberg, Pedro J. Moreno

    Abstract: Model fine-tuning and adaptation have become a common approach for model specialization for downstream tasks or domains. Fine-tuning the entire model or a subset of the parameters using light-weight adaptation has shown considerable success across different specialization tasks. Fine-tuning a model for a large number of domains typically requires starting a new training job for every domain posing…

    Submitted 13 September, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH

  13. arXiv:2202.12719  [pdf, other]

    cs.SD cs.CL eess.AS

    Ask2Mask: Guided Data Selection for Masked Speech Modeling

    Authors: Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang, Pedro Moreno

    Abstract: Masked speech modeling (MSM) methods such as wav2vec2 or w2v-BERT learn representations over speech frames which are randomly masked within an utterance. While these methods improve performance of Automatic Speech Recognition (ASR) systems, they have one major limitation. They treat all unsupervised speech samples with equal weight, which hinders learning as not all samples have relevant informati…

    Submitted 24 February, 2022; originally announced February 2022.

  14. arXiv:2108.12226  [pdf, other]

    cs.CL cs.SD eess.AS

    Injecting Text in Self-Supervised Speech Pretraining

    Authors: Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro Moreno

    Abstract: Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied degrees of success. In this paper, we propose to jointly learn representations during pretraining from two different modalities: speech and text. The proposed method, tts4pretrain complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speec…

    Submitted 27 August, 2021; originally announced August 2021.

    Comments: submit to ASRU 2021

    MSC Class: 68T10 ACM Class: I.2.7

  15. arXiv:2009.14495  [pdf, other]

    eess.SY

    Forced variational integrator for distance-based shape control with flocking behavior of multi-agent systems

    Authors: Leonardo Colombo, Patricio Moreno, Mengbin Ye, Hector Garcia de Marina, Ming Cao

    Abstract: A multi-agent system designed to achieve distance-based shape control with flocking behavior can be seen as a mechanical system described by a Lagrangian function and subject to additional external forces. Forced variational integrators are given by the discretization of Lagrange-d'Alembert principle for systems subject to external forces, and have proved useful for numerical simulation studies of…

    Submitted 30 September, 2020; originally announced September 2020.

    Comments: Presented at IFAC World Congress 2020, 6 pages + refs

  16. arXiv:1910.02564  [pdf, other]

    cs.CV cs.RO eess.IV

    Action-conditioned Benchmarking of Robotic Video Prediction Models: a Comparative Study

    Authors: Manuel Serra Nunes, Atabak Dehban, Plinio Moreno, José Santos-Victor

    Abstract: A defining characteristic of intelligent systems is the ability to make action decisions based on the anticipated outcomes. Video prediction systems have been demonstrated as a solution for predicting how the future will unfold visually, and thus, many models have been proposed that are capable of predicting future frames based on a history of observed frames (and sometimes robot actions). However…

    Submitted 6 October, 2019; originally announced October 2019.

  17. arXiv:1909.11699  [pdf, other]

    cs.CL cs.SD eess.AS

    Speech Recognition with Augmented Synthesized Speech

    Authors: Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Ye Jia, Pedro Moreno, Yonghui Wu, Zelin Wu

    Abstract: Recent success of the Tacotron speech synthesis architecture and its variants in producing natural sounding multi-speaker synthesized speech has raised the exciting possibility of replacing expensive, manually transcribed, domain-specific, human speech that is used to train speech recognizers. The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker and…

    Submitted 25 September, 2019; originally announced September 2019.

    Comments: Accepted for publication at ASRU 2020

  18. arXiv:1904.04169  [pdf, other]

    eess.AS cs.SD

    Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation

    Authors: Fadi Biadsy, Ron J. Weiss, Pedro J. Moreno, Dimitri Kanevsky, Ye Jia

    Abstract: We describe Parrotron, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram, without utilizing any intermediate discrete representation. The network is composed of an encoder, spectrogram and phoneme decoders, followed by a vocoder to synthesize a time-domain waveform. We demonstrate that this model can be trained to normalize speec…

    Submitted 29 October, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

    Comments: 5 pages, submitted to Interspeech 2019

  19. arXiv:1809.09190  [pdf, other]

    eess.AS cs.CL cs.SD

    From Audio to Semantics: Approaches to end-to-end spoken language understanding

    Authors: Parisa Haghani, Arun Narayanan, Michiel Bacchiani, Galen Chuang, Neeraj Gaur, Pedro Moreno, Rohit Prabhavalkar, Zhongdi Qu, Austin Waters

    Abstract: Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to a transcript, and a natural language understanding module that transforms the resulting text (or top N hypotheses) into a set of domains, intents, and arguments. These modules are typically optimized independently. In this paper, we formulate audio to sem…

    Submitted 24 September, 2018; originally announced September 2018.

  20. arXiv:1711.01694  [pdf, other]

    eess.AS cs.AI cs.CL

    Multilingual Speech Recognition With A Single End-To-End Model

    Authors: Shubham Toshniwal, Tara N. Sainath, Ron J. Weiss, Bo Li, Pedro Moreno, Eugene Weinstein, Kanishka Rao

    Abstract: Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word unit, lexicon and word inventories are typically language specific. In contrast, sequence-to-sequence models are well suited for multilingual ASR because they encapsulate an acoustic, pronunciation and language model jointly in a single network. In this work we presen…

    Submitted 15 February, 2018; v1 submitted 5 November, 2017; originally announced November 2017.

    Comments: Accepted in ICASSP 2018