
Showing 1–10 of 10 results for author: Alumäe, T

  1. arXiv:2403.02288

    eess.AS

    PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings

    Authors: Joonas Kalda, Clément Pagés, Ricard Marxer, Tanel Alumäe, Hervé Bredin

    Abstract: A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) was proposed as an unsupervised alternative that uses real recordings, yet struggles with overseparation and adapting to long-form audio. We introduce PixIT, a joint approach that combines permutation invariant training (…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: submitted to Speaker Odyssey 2024

  2. arXiv:2310.17448

    cs.CL eess.AS

    Dialect Adaptation and Data Augmentation for Low-Resource ASR: TalTech Systems for the MADASR 2023 Challenge

    Authors: Tanel Alumäe, Jiaming Kong, Daniil Robnikov

    Abstract: This paper describes the Tallinn University of Technology (TalTech) systems developed for the ASRU MADASR 2023 Challenge. The challenge focuses on automatic speech recognition of dialect-rich Indian languages with limited training audio and text data. TalTech participated in two tracks of the challenge: Track 1, which allowed using only the provided training data, and Track 3, which allowed using addition…

    Submitted 26 October, 2023; originally announced October 2023.

    Journal ref: 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

  3. arXiv:2205.07086

    eess.AS cs.CL cs.SD

    Collar-aware Training for Streaming Speaker Change Detection in Broadcast Speech

    Authors: Joonas Kalda, Tanel Alumäe

    Abstract: In this paper, we present a novel training method for speaker change detection models. Speaker change detection is often viewed as a binary sequence labelling problem. The main challenges with this approach are the vagueness of annotated change points, caused by the silences between speaker turns, and the data imbalance that arises because most frames do not contain a speaker change. Conventional training…

    Submitted 14 May, 2022; originally announced May 2022.

    Comments: Accepted to Speaker Odyssey 2022

  4. arXiv:2205.07083

    eess.AS cs.CL

    Pretraining Approaches for Spoken Language Recognition: TalTech Submission to the OLR 2021 Challenge

    Authors: Tanel Alumäe, Kunnar Kukk

    Abstract: This paper investigates different pretraining approaches to spoken language identification. The paper is based on our submission to the Oriental Language Recognition 2021 Challenge. We participated in two tracks of the challenge: constrained and unconstrained language recognition. For the constrained track, we first trained a Conformer-based encoder-decoder model for multilingual automatic speech…

    Submitted 14 May, 2022; originally announced May 2022.

    Comments: Accepted to Speaker Odyssey 2022

  5. arXiv:2203.16972

    eess.AS

    Improving Language Identification of Accented Speech

    Authors: Kunnar Kukk, Tanel Alumäe

    Abstract: Language identification from speech is a common preprocessing step in many spoken language processing systems. In recent years, this field has seen fast progress, mostly due to the use of self-supervised models pretrained on multilingual data and the use of large training corpora. This paper shows that for speech with a non-native or regional accent, the accuracy of spoken language identification…

    Submitted 1 July, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted to INTERSPEECH 2022

  6. arXiv:2011.12998

    eess.AS

    VoxLingua107: a Dataset for Spoken Language Recognition

    Authors: Jörgen Valk, Tanel Alumäe

    Abstract: This paper investigates the use of automatically collected web audio data for the task of spoken language recognition. We generate semi-random search phrases from language-specific Wikipedia data that are then used to retrieve videos from YouTube for 107 languages. Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech. Post-filtering is…

    Submitted 25 November, 2020; originally announced November 2020.

    Comments: Accepted at IEEE Spoken Language Technology Workshop (SLT) 2021

  7. arXiv:2005.08520

    cs.LG cs.CL stat.ML

    Robust Training of Vector Quantized Bottleneck Models

    Authors: Adrian Łańcucki, Jan Chorowski, Guillaume Sanchez, Ricard Marxer, Nanxin Chen, Hans J. G. A. Dolfing, Sameer Khurana, Tanel Alumäe, Antoine Laurent

    Abstract: In this paper, we demonstrate methods for reliable and efficient training of discrete representations using Vector-Quantized Variational Auto-Encoder models (VQ-VAEs). Discrete latent variable models have been shown to learn nontrivial representations of speech, applicable to unsupervised voice conversion and reaching state-of-the-art performance on unit discovery tasks. For unsupervised representat…

    Submitted 18 May, 2020; originally announced May 2020.

    Comments: Published at IJCNN 2020

  8. Advanced Rich Transcription System for Estonian Speech

    Authors: Tanel Alumäe, Ottokar Tilk, Asadullah

    Abstract: This paper describes the current TTÜ speech transcription system for Estonian speech. The system is designed to handle semi-spontaneous speech, such as broadcast conversations, lecture recordings and interviews recorded in diverse acoustic conditions. The system is based on the Kaldi toolkit. Multi-condition training using background noise profiles extracted automatically from untranscribed data i…

    Submitted 11 January, 2019; originally announced January 2019.

    Comments: Published in Baltic HLT 2018 (putting it on arXiv because Google Scholar doesn't index it properly)

    Journal ref: Series: Frontiers in Artificial Intelligence and Applications; Ebook Volume 307: Human Language Technologies – The Baltic Perspective, 2018

  9. arXiv:1806.08621

    cs.SD cs.CL cs.HC eess.AS

    Weakly Supervised Training of Speaker Identification Models

    Authors: Martin Karu, Tanel Alumäe

    Abstract: We propose an approach for training speaker identification models in a weakly supervised manner. We concentrate on the setting where the training data consists of a set of audio recordings and the speaker annotation is provided only at the recording level. The method uses speaker diarization to find unique speakers in each recording, and i-vectors to project the speech of each speaker to a fixed-d…

    Submitted 22 June, 2018; originally announced June 2018.

    Comments: Odyssey 2018: The Speaker and Language Recognition Workshop

  10. arXiv:1707.09769

    cs.CL

    Low-Resource Neural Headline Generation

    Authors: Ottokar Tilk, Tanel Alumäe

    Abstract: Recent neural headline generation models have shown great results but are generally trained on very large datasets. We focus our efforts on improving headline quality on smaller datasets by means of pretraining. We propose new methods that enable pretraining all the parameters of the model and utilize all available text, resulting in improvements of up to 32.4% relative in perplexity and 2.8…

    Submitted 31 July, 2017; originally announced July 2017.

    Comments: Accepted to EMNLP 2017 Workshop on New Frontiers in Summarization