Showing 1–14 of 14 results for author: Khassanov, Y

  1. arXiv:2406.07842  [pdf, other]

    eess.AS cs.CL

    Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR

    Authors: Yerbolat Khassanov, Zhipeng Chen, Tianfeng Chen, Tze Yuang Chong, Wei Li, Jun Zhang, Lu Lu, Yuxuan Wang

    Abstract: This paper addresses challenges in integrating new languages into a pre-trained multilingual automatic speech recognition (mASR) system, particularly in scenarios where training data for existing languages is limited or unavailable. The proposed method employs a dual-pipeline with low-rank adaptation (LoRA). It maintains two data flow pipelines-one for existing languages and another for new langua…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, 4 tables

  2. arXiv:2305.15749  [pdf, other]

    eess.AS cs.CL

    Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration

    Authors: Rustem Yeshpanov, Saida Mussakhojayeva, Yerbolat Khassanov

    Abstract: This work aims to build a multilingual text-to-speech (TTS) synthesis system for ten lower-resourced Turkic languages: Azerbaijani, Bashkir, Kazakh, Kyrgyz, Sakha, Tatar, Turkish, Turkmen, Uyghur, and Uzbek. We specifically target the zero-shot learning scenario, where a TTS model trained using the data of one language is applied to synthesise speech for other, unseen languages. An end-to-end TTS…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: 5 pages, 1 figure, 3 tables, accepted to Interspeech

  3. arXiv:2210.15876  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

    Authors: Yist Y. Lin, Tao Han, Haihua Xu, Van Tung Pham, Yerbolat Khassanov, Tze Yuang Chong, Yi He, Lu Lu, Zejun Ma

    Abstract: One of limitations in end-to-end automatic speech recognition (ASR) framework is its performance would be compromised if train-test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate train-test utterance length mismatch issue for short-video ASR task. Specifically, we are motivated by observatio…

    Submitted 25 May, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: 5 pages, 3 figures, 4 tables

  4. arXiv:2201.05771  [pdf, other]

    eess.AS cs.CL cs.SD

    KazakhTTS2: Extending the Open-Source Kazakh TTS Corpus With More Data, Speakers, and Topics

    Authors: Saida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan Varol

    Abstract: We present an expanded version of our previously released Kazakh text-to-speech (KazakhTTS) synthesis corpus. In the new KazakhTTS2 corpus, the overall size has increased from 93 hours to 271 hours, the number of speakers has risen from two to five (three females and two males), and the topic coverage has been diversified with the help of new sources, including a book and Wikipedia articles. This…

    Submitted 20 April, 2022; v1 submitted 15 January, 2022; originally announced January 2022.

    Comments: 8 pages, 2 figures, 5 tables, accepted to LREC 2022

  5. arXiv:2110.12136  [pdf, other]

    cs.CV cs.SD eess.AS eess.IV

    A Study of Multimodal Person Verification Using Audio-Visual-Thermal Data

    Authors: Madina Abdrakhmanova, Saniya Abushakimova, Yerbolat Khassanov, Huseyin Atakan Varol

    Abstract: In this paper, we study an approach to multimodal person verification using audio, visual, and thermal modalities. The combination of audio and visual modalities has already been shown to be effective for robust person verification. From this perspective, we investigate the impact of further increasing the number of modalities by adding thermal images. In particular, we implemented unimodal, bimod…

    Submitted 4 March, 2022; v1 submitted 23 October, 2021; originally announced October 2021.

    Comments: 7 pages, 4 figures, 4 tables

  6. arXiv:2108.01280  [pdf, ps, other]

    eess.AS cs.CL

    A Study of Multilingual End-to-End Speech Recognition for Kazakh, Russian, and English

    Authors: Saida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan Varol

    Abstract: We study training a single end-to-end (E2E) automatic speech recognition (ASR) model for three languages used in Kazakhstan: Kazakh, Russian, and English. We first describe the development of multilingual E2E ASR based on Transformer networks and then perform an extensive assessment on the aforementioned languages. We also compare two variants of output grapheme set construction: combined and inde…

    Submitted 3 August, 2021; originally announced August 2021.

    Comments: 12 pages, 3 tables, accepted to SPECOM 2021

  7. arXiv:2107.14419  [pdf, other]

    eess.AS cs.CL

    USC: An Open-Source Uzbek Speech Corpus and Initial Speech Recognition Experiments

    Authors: Muhammadjon Musaev, Saida Mussakhojayeva, Ilyos Khujayorov, Yerbolat Khassanov, Mannon Ochilov, Huseyin Atakan Varol

    Abstract: We present a freely available speech corpus for the Uzbek language and report preliminary automatic speech recognition (ASR) results using both the deep neural network hidden Markov model (DNN-HMM) and end-to-end (E2E) architectures. The Uzbek speech corpus (USC) comprises 958 different speakers with a total of 105 hours of transcribed audio recordings. To the best of our knowledge, this is the fi…

    Submitted 29 July, 2021; originally announced July 2021.

    Comments: 11 pages, 2 figures, 2 tables, accepted to SPECOM 2021

  8. KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset

    Authors: Saida Mussakhojayeva, Aigerim Janaliyeva, Almas Mirzakhmetov, Yerbolat Khassanov, Huseyin Atakan Varol

    Abstract: This paper introduces a high-quality open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide. The dataset consists of about 93 hours of transcribed audio recordings spoken by two professional speakers (female and male). It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech (TTS) applications in…

    Submitted 16 June, 2021; v1 submitted 17 April, 2021; originally announced April 2021.

    Comments: 5 pages, 4 tables, 2 figures, accepted to INTERSPEECH 2021

  9. arXiv:2010.12143  [pdf, other]

    cs.SD eess.AS

    Enriching Under-Represented Named-Entities To Improve Speech Recognition Performance

    Authors: Tingzhi Mao, Yerbolat Khassanov, Van Tung Pham, Haihua Xu, Hao Huang, Aishan Wumaier, Eng Siong Chng

    Abstract: Automatic speech recognition (ASR) for under-represented named-entity (UR-NE) is challenging due to such named-entities (NE) have insufficient instances and poor contextual coverage in the training data to learn reliable estimates and representations. In this paper, we propose approaches to enriching UR-NEs to improve speech recognition performance. Specifically, our first priority is to ensure th…

    Submitted 22 October, 2020; originally announced October 2020.

  10. arXiv:2009.10334  [pdf, other]

    eess.AS cs.CL cs.SD

    A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline

    Authors: Yerbolat Khassanov, Saida Mussakhojayeva, Almas Mirzakhmetov, Alen Adiyev, Mukhamet Nurpeiissov, Huseyin Atakan Varol

    Abstract: We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups, as well as both genders. It was carefully inspected by native Kazakh speakers to ensure high quality. The KSC is the largest publicly available database develop…

    Submitted 13 January, 2021; v1 submitted 22 September, 2020; originally announced September 2020.

    Comments: 10 pages, 5 figures, 4 tables, accepted by EACL2021

    Journal ref: https://aclanthology.org/2021.eacl-main.58

  11. arXiv:2005.10407  [pdf, other]

    eess.AS cs.LG cs.SD

    Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

    Authors: Zhiping Zeng, Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Eng Siong Chng, Chongjia Ni, Bin Ma

    Abstract: In this work, we study leveraging extra text data to improve low-resource end-to-end ASR under cross-lingual transfer learning setting. To this end, we extend our prior work [1], and propose a hybrid Transformer-LSTM based architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data due to the L…

    Submitted 28 May, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

  12. arXiv:2005.08742  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Approaches to Improving Recognition of Underrepresented Named Entities in Hybrid ASR Systems

    Authors: Tingzhi Mao, Yerbolat Khassanov, Van Tung Pham, Haihua Xu, Hao Huang, Eng Siong Chng

    Abstract: In this paper, we present a series of complementary approaches to improve the recognition of underrepresented named entities (NE) in hybrid ASR systems without compromising overall word error rate performance. The underrepresented words correspond to rare or out-of-vocabulary (OOV) words in the training data, and thereby can't be modeled reliably. We begin with graphemic lexicon which allows to dr…

    Submitted 18 May, 2020; originally announced May 2020.

  13. arXiv:1912.00863  [pdf, other]

    cs.CL eess.AS

    Independent language modeling architecture for end-to-end ASR

    Authors: Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma, Haizhou Li

    Abstract: The attention-based end-to-end (E2E) automatic speech recognition (ASR) architecture allows for joint optimization of acoustic and language models within a single network. However, in a vanilla E2E ASR architecture, the decoder sub-network (subnet), which incorporates the role of the language model (LM), is conditioned on the encoder output. This means that the acoustic encoder and the language mo…

    Submitted 25 November, 2019; originally announced December 2019.

  14. Unsupervised and Efficient Vocabulary Expansion for Recurrent Neural Network Language Models in ASR

    Authors: Yerbolat Khassanov, Eng Siong Chng

    Abstract: In automatic speech recognition (ASR) systems, recurrent neural network language models (RNNLM) are used to rescore a word lattice or N-best hypotheses list. Due to the expensive training, the RNNLM's vocabulary set accommodates only small shortlist of most frequent words. This leads to suboptimal performance if an input speech contains many out-of-shortlist (OOS) words. An effective solution is t…

    Submitted 27 June, 2018; originally announced June 2018.

    Comments: 5 pages, 1 figure, accepted at INTERSPEECH 2018