
Transcription and translation of videos using fine-tuned XLSR Wav2Vec2 on custom dataset and mBART

Authors: Aniket Tathe, Anand Kamble, Suyash Kumbharkar, Atharva Bhandare, Anirban C. Mitra

Abstract: This research addresses the challenge of training an ASR model for personalized voices with minimal data. Utilizing just 14 minutes of custom audio from a YouTube video, we employ Retrieval-Based Voice Conversion (RVC) to create a custom Common Voice 16.0 corpus. Subsequently, a Cross-lingual Self-supervised Representations (XLSR) Wav2Vec2 model is fine-tuned on this dataset. The developed web-based GUI efficiently transcribes and translates input Hindi videos. By integrating XLSR Wav2Vec2 and mBART, the system aligns the translated text with the video timeline, delivering an accessible solution for multilingual video content transcription and translation for personalized voice.

Submitted 29 February, 2024; originally announced March 2024.

arXiv:2401.06183 [pdf, other]

End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2

Authors: Aniket Tathe, Anand Kamble, Suyash Kumbharkar, Atharva Bhandare, Anirban C. Mitra

Abstract: Speech has long been a barrier to effective communication and connection, persisting as a challenge in our increasingly interconnected world. This research paper introduces a transformative solution to this persistent obstacle: an end-to-end speech conversion framework tailored for Hindi-to-English translation, culminating in the synthesis of English audio. By integrating cutting-edge technologies such as XLSR Wav2Vec2 for automatic speech recognition (ASR), mBART for neural machine translation (NMT), and a Text-to-Speech (TTS) synthesis component, this framework offers a unified and seamless approach to cross-lingual communication. We delve into the intricate details of each component, elucidating their individual contributions and exploring the synergies that enable a fluid transition from spoken Hindi to synthesized English audio.

Submitted 10 January, 2024; originally announced January 2024.

arXiv:2311.14836 [pdf, other]

Custom Data Augmentation for low resource ASR using Bark and Retrieval-Based Voice Conversion

Authors: Anand Kamble, Aniket Tathe, Suyash Kumbharkar, Atharva Bhandare, Anirban C. Mitra

Abstract: This paper proposes two innovative methodologies to construct customized Common Voice datasets for low-resource languages like Hindi. The first methodology leverages Bark, a transformer-based text-to-audio model developed by Suno, and incorporates Meta's enCodec and a pre-trained HuBert model to enhance Bark's performance. The second methodology employs Retrieval-Based Voice Conversion (RVC) and uses the Ozen toolkit for data preparation. Both methodologies contribute to the advancement of ASR technology and offer valuable insights into addressing the challenges of constructing customized Common Voice datasets for under-resourced languages. Furthermore, they provide a pathway to achieving high-quality, personalized voice generation for a range of applications.

Submitted 9 January, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

Showing 1–3 of 3 results for author: Mitra, A C