Showing 1–9 of 9 results for author: Wasnik, P

  1. arXiv:2406.08802  [pdf, other]

    eess.AS cs.SD

    DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing

    Authors: Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Rajiv Ratn Shah

    Abstract: Audio-visual alignment after dubbing is a challenging research problem. To this end, we propose a novel method, DubWise Multi-modal Large Language Model (LLM)-based Text-to-Speech (TTS), which can control the speech duration of synthesized speech in such a way that it aligns well with the speaker's lip movements given in the reference video even when the spoken text is different or in a different l…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted at INTERSPEECH 2024

  2. arXiv:2406.08076  [pdf, other]

    eess.AS cs.SD

    VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech

    Authors: Ashishkumar Gudmalwar, Nirmesh Shah, Sai Akarsh, Pankaj Wasnik, Rajiv Ratn Shah

    Abstract: Despite the significant advancements in Text-to-Speech (TTS) systems, their full utilization in automatic dubbing remains limited. This task necessitates the extraction of voice identity and emotional style from a reference speech in a source language and subsequently transferring them to a target language using cross-lingual TTS techniques. While previous approaches have mainly concentrated on co…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted at INTERSPEECH 2024

  3. arXiv:2404.12628  [pdf, other]

    cs.CL

    Efficient infusion of self-supervised representations in Automatic Speech Recognition

    Authors: Darshan Prabhu, Sai Ganesh Mirishkar, Pankaj Wasnik

    Abstract: Self-supervised learned (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks. Given the effectiveness of such models, it is advantageous to use them in conventional ASR systems. While some approaches suggest incorporating these models as a trainable encoder or a learnable frontend, training such systems is extremely slow and requires a lot of computation c…

    Submitted 19 April, 2024; originally announced April 2024.

    Comments: Accepted to ENLSP workshop, NeurIPS 2023

  4. arXiv:2403.15469  [pdf, other]

    cs.CL cs.LG eess.AS

    Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning

    Authors: Shivam Ratnakant Mhaskar, Nirmesh J. Shah, Mohammadi Zaki, Ashishkumar P. Gudmalwar, Pankaj Wasnik, Rajiv Ratn Shah

    Abstract: A traditional Automatic Video Dubbing (AVD) pipeline consists of three key modules, namely, Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Text-to-Speech (TTS). Within AVD pipelines, isometric-NMT algorithms are employed to regulate the length of the synthesized output text. This is done to guarantee synchronization with respect to the alignment of video and audio subseque…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Accepted in NAACL2024 Findings

  5. arXiv:2402.15044  [pdf, other]

    cs.CV cs.LG

    Fiducial Focus Augmentation for Facial Landmark Detection

    Authors: Purbayan Kar, Vishal Chudasama, Naoyuki Onoe, Pankaj Wasnik, Vineeth Balasubramanian

    Abstract: Deep learning methods have led to significant improvements in the performance on the facial landmark detection (FLD) task. However, detecting landmarks in challenging settings, such as head pose changes, exaggerated expressions, or uneven illumination, continues to remain a challenge due to high variability and insufficient samples. This inadequacy can be attributed to the model's inability to effe…

    Submitted 22 February, 2024; originally announced February 2024.

    Comments: Accepted to BMVC'23

  6. arXiv:2306.02268  [pdf, other]

    cs.CV cs.AI cs.LG

    Revisiting Class Imbalance for End-to-end Semi-Supervised Object Detection

    Authors: Purbayan Kar, Vishal Chudasama, Naoyuki Onoe, Pankaj Wasnik

    Abstract: Semi-supervised object detection (SSOD) has made significant progress with the development of pseudo-label-based end-to-end methods. However, many of these methods face challenges due to class imbalance, which hinders the effectiveness of the pseudo-label generator. Furthermore, in the literature, it has been observed that low-quality pseudo-labels severely limit the performance of SSOD. In this p…

    Submitted 4 June, 2023; originally announced June 2023.

    Comments: Accepted at the Efficient Deep Learning for Computer Vision Workshop, CVPR 2023

  7. arXiv:2206.02187  [pdf, other]

    cs.CV cs.SD eess.AS

    M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation

    Authors: Vishal Chudasama, Purbayan Kar, Ashish Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Naoyuki Onoe

    Abstract: Emotion Recognition in Conversations (ERC) is crucial in developing sympathetic human-machine interaction. In conversational videos, emotion can be present in multiple modalities, i.e., audio, video, and transcript. However, due to the inherent characteristics of these modalities, multi-modal ERC has always been considered a challenging undertaking. Existing ERC research focuses mainly on using te…

    Submitted 5 June, 2022; originally announced June 2022.

    Comments: Accepted for publication in the 5th Multimodal Learning and Applications (MULA) Workshop at CVPR 2022

  8. arXiv:1912.02487  [pdf, other]

    cs.CV

    Smartphone Multi-modal Biometric Authentication: Database and Evaluation

    Authors: Raghavendra Ramachandra, Martin Stokkenes, Amir Mohammadi, Sushma Venkatesh, Kiran Raja, Pankaj Wasnik, Eric Poiret, Sébastien Marcel, Christoph Busch

    Abstract: Biometric-based verification is widely employed on smartphones for various applications, including financial transactions. In this work, we present a new multimodal biometric dataset (face, voice, and periocular) acquired using a smartphone. The new dataset is comprised of 150 subjects that are captured in six different sessions reflecting real-life scenarios of smartphone-assisted authenticat…

    Submitted 5 December, 2019; originally announced December 2019.

  9. arXiv:1409.0977  [pdf, ps, other]

    quant-ph physics.optics

    Photon statistics of radiation in an incoherently pumped three-level cascade system

    Authors: Shaik Ahmed, Preethi N. Wasnik, Suneel Singh, P. Anantha Lakshmi

    Abstract: We study the intensity-intensity correlations of the radiation emitted on the probe transition in a three-level cascade electromagnetically induced transparency (EIT) scheme. By applying an incoherent pump, we also monitor further changes in the characteristics of the emitted radiation. It is found that application of even a very weak incoherent pump can significantly alter the characteristics of the…

    Submitted 3 September, 2014; originally announced September 2014.

    Comments: 12 pages, 5 figures (including subfigures, total of 11 figures)