Showing 1–9 of 9 results for author: Andrusenko, A

Searching in archive eess.
  1. arXiv:2406.07096  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter

    Authors: Andrei Andrusenko, Aleksandr Laptev, Vladimir Bataev, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: Accurate recognition of rare and new words remains a pressing problem for contextualized Automatic Speech Recognition (ASR) systems. Most context-biasing methods involve modification of the ASR model or the beam-search decoding algorithm, complicating model reuse and slowing down inference. This work presents a new approach to fast context-biasing with CTC-based Word Spotter (CTC-WS) for CTC and T…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  2. arXiv:2310.09424  [pdf, other]

    cs.CL cs.HC cs.SD eess.AS

    SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation

    Authors: Zhehuai Chen, He Huang, Andrei Andrusenko, Oleksii Hrinchuk, Krishna C. Puvvada, Jason Li, Subhankar Ghosh, Jagadeesh Balam, Boris Ginsburg

    Abstract: We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions. The unified SALM not only achieves performance on par with task-specific Conformer baselines for Automatic Speech Recogni…

    Submitted 13 October, 2023; originally announced October 2023.

    Comments: submit to ICASSP 2024

    MSC Class: 68T10 ACM Class: I.2.7

  3. arXiv:2208.07657  [pdf, other]

    eess.AS cs.LG cs.SD

    Uconv-Conformer: High Reduction of Input Sequence Length for End-to-End Speech Recognition

    Authors: Andrei Andrusenko, Rauf Nasretdinov, Aleksei Romanenko

    Abstract: Optimization of modern ASR architectures is among the highest priority tasks since it saves many computational resources for model training and inference. The work proposes a new Uconv-Conformer architecture based on the standard Conformer model. It consistently reduces the input sequence length by 16 times, which results in speeding up the work of the intermediate layers. To solve the convergence…

    Submitted 11 March, 2023; v1 submitted 16 August, 2022; originally announced August 2022.

    Comments: 5 pages, 1 figure, accepted by ICASSP 2023

  4. arXiv:2104.02526  [pdf, ps, other]

    eess.AS cs.CL cs.LG

    LT-LM: a novel non-autoregressive language model for single-shot lattice rescoring

    Authors: Anton Mitrofanov, Mariya Korenevskaya, Ivan Podluzhny, Yuri Khokhlov, Aleksandr Laptev, Andrei Andrusenko, Aleksei Ilin, Maxim Korenevsky, Ivan Medennikov, Aleksei Romanenko

    Abstract: Neural network-based language models are commonly used in rescoring approaches to improve the quality of modern automatic speech recognition (ASR) systems. Most of the existing methods are computationally expensive since they use autoregressive language models. We propose a novel rescoring approach, which processes the entire lattice in a single call to the model. The key feature of our rescoring…

    Submitted 6 April, 2021; originally announced April 2021.

    Comments: Submitted to InterSpeech 2021

  5. arXiv:2103.07186  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition

    Authors: Aleksandr Laptev, Andrei Andrusenko, Ivan Podluzhny, Anton Mitrofanov, Ivan Medennikov, Yuri Matveev

    Abstract: With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. Researchers and industry prefer to use end-to-end ASR systems for on-device speech recognition tasks. This is because end-to-end systems can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, bu…

    Submitted 12 March, 2021; originally announced March 2021.

    Comments: 16 pages, 7 figures

  6. arXiv:2006.08274  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Exploration of End-to-End ASR for OpenSTT -- Russian Open Speech-to-Text Dataset

    Authors: Andrei Andrusenko, Aleksandr Laptev, Ivan Medennikov

    Abstract: This paper presents an exploration of end-to-end automatic speech recognition systems (ASR) for the largest open-source Russian language data set -- OpenSTT. We evaluate different existing end-to-end approaches such as joint CTC/Attention, RNN-Transducer, and Transformer. All of them are compared with the strong hybrid ASR system based on LF-MMI TDNN-F acoustic model. For the three available valid…

    Submitted 26 July, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

    Comments: Accepted by SPECOM 2020

  7. Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario

    Authors: Ivan Medennikov, Maxim Korenevsky, Tatiana Prisyach, Yuri Khokhlov, Mariya Korenevskaya, Ivan Sorokin, Tatiana Timofeeva, Anton Mitrofanov, Andrei Andrusenko, Ivan Podluzhny, Aleksandr Laptev, Aleksei Romanenko

    Abstract: Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to the limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts an activity of each speaker on each time frame. TS-VAD mode…

    Submitted 27 July, 2020; v1 submitted 14 May, 2020; originally announced May 2020.

    Comments: Accepted to Interspeech 2020

  8. arXiv:2005.07157  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation

    Authors: Aleksandr Laptev, Roman Korostik, Aleksey Svischev, Andrei Andrusenko, Ivan Medennikov, Sergey Rybin

    Abstract: Data augmentation is one of the most effective ways to make end-to-end automatic speech recognition (ASR) perform close to the conventional hybrid approach, especially when dealing with low-resource tasks. Using recent advances in speech synthesis (text-to-speech, or TTS), we build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition mo…

    Submitted 30 July, 2020; v1 submitted 14 May, 2020; originally announced May 2020.

  9. arXiv:2004.10799  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription

    Authors: Andrei Andrusenko, Aleksandr Laptev, Ivan Medennikov

    Abstract: While end-to-end ASR systems have proven competitive with the conventional hybrid approach, they are prone to accuracy degradation when it comes to noisy and low-resource conditions. In this paper, we argue that, even in such difficult cases, some end-to-end approaches show performance close to the hybrid baseline. To demonstrate this, we use the CHiME-6 Challenge data as an example of challenging…

    Submitted 7 August, 2020; v1 submitted 22 April, 2020; originally announced April 2020.

    Comments: Accepted by Interspeech 2020