Showing 1–7 of 7 results for author: Mathur, P

Searching in archive eess.

  1. arXiv:2405.08295  [pdf, other]

    cs.CL cs.SD eess.AS

    SpeechVerse: A Large-scale Generalizable Audio Language Model

    Authors: Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

    Abstract: Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore devel…

    Submitted 31 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

    Comments: Single column, 13 pages

  2. arXiv:2404.07336  [pdf, other]

    cs.CV cs.MM eess.AS

    PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores

    Authors: Lucas Goncalves, Prashant Mathur, Chandrashekhar Lavania, Metehan Cekic, Marcello Federico, Kyu J. Han

    Abstract: Recent advancements in audio-visual generative modeling have been propelled by progress in deep learning and the availability of data-rich benchmarks. However, the growth is not attributed solely to models and benchmarks. Universally accepted evaluation metrics also play an important role in advancing the field. While there are many metrics available to evaluate audio and visual content separately…

    Submitted 10 April, 2024; originally announced April 2024.

    Comments: 24 pages

  3. arXiv:2311.00697  [pdf, other]

    cs.CL eess.AS

    End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation

    Authors: Juan Zuluaga-Gomez, Zhaocheng Huang, Xing Niu, Rohit Paturi, Sundararajan Srinivasan, Prashant Mathur, Brian Thompson, Marcello Federico

    Abstract: Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations by multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end and multi-task training model, named Speaker-Turn Aware Conversational Speech Translation, that combin…

    Submitted 1 November, 2023; originally announced November 2023.

    Comments: Accepted at EMNLP 2023. Code: https://github.com/amazon-science/stac-speech-translation

  4. arXiv:2305.13204  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters

    Authors: Proyag Pal, Brian Thompson, Yogesh Virkar, Prashant Mathur, Alexandra Chronopoulou, Marcello Federico

    Abstract: To translate speech for automatic dubbing, machine translation needs to be isochronous, i.e. translated speech needs to be aligned with the source in terms of speech durations. We introduce target factors in a transformer model to predict durations jointly with target language phoneme sequences. We also introduce auxiliary counters to help the decoder to keep track of the timing information while…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted at INTERSPEECH 2023

  5. arXiv:2302.12979  [pdf, other]

    cs.CL cs.SD eess.AS

    Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing

    Authors: Alexandra Chronopoulou, Brian Thompson, Prashant Mathur, Yogesh Virkar, Surafel M. Lakew, Marcello Federico

    Abstract: Automatic dubbing (AD) is the task of translating the original speech in a video into target language speech. The new target language speech should satisfy isochrony; that is, the new speech should be time aligned with the original video, including mouth movements, pauses, hand gestures, etc. In this paper, we propose training a model that directly optimizes both the translation as well as the spe…

    Submitted 24 February, 2023; originally announced February 2023.

    Comments: 5 pages

  6. arXiv:2102.11922  [pdf, other]

    eess.SP cs.AI

    Dynamic Graph Modeling of Simultaneous EEG and Eye-tracking Data for Reading Task Identification

    Authors: Puneet Mathur, Trisha Mittal, Dinesh Manocha

    Abstract: We present a new approach, that we call AdaGTCN, for identifying human reader intent from Electroencephalogram (EEG) and Eye movement (EM) data in order to help differentiate between normal reading and task-oriented reading. Understanding the physiological aspects of the reading process (the cognitive load and the reading intent) can help improve the quality of crowd-sourced annotated data. Our me…

    Submitted 21 February, 2021; originally announced February 2021.

    Comments: Accepted to ICASSP 2021

  7. arXiv:1904.11882  [pdf]

    cs.OH cs.LG eess.SP

    Smart Laptop Bag with Machine Learning for Activity Recognition

    Authors: Dwij Sukeshkumar Sheth, Shantanu Singh, Prakhar S Mathur, Vydeki D

    Abstract: In today's world of smart living, the smart laptop bag, presented in this paper, provides a better solution to keep track of our precious possessions and monitor them in real time. As the world moves towards a much tech-savvy direction, the novel laptop bag discussed here facilitates the user to perform location tracking, ambiance monitoring, user-state monitoring etc. in one device. The innovat…

    Submitted 14 April, 2019; originally announced April 2019.