
Showing 1–28 of 28 results for author: Ro, Y M

Searching in archive eess.
  1. arXiv:2402.16021  [pdf, other]

    cs.CL cs.AI cs.CV eess.AS

    TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

    Authors: Minsu Kim, Jee-weon Jung, Hyeongseop Rha, Soumi Maiti, Siddhant Arora, Xuankai Chang, Shinji Watanabe, Yong Man Ro

    Abstract: The capability to jointly process multi-modal information is becoming an essential task. However, the limited number of paired multi-modal data and the large computational requirements in multi-modal learning hinder the development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, whe…

    Submitted 25 February, 2024; originally announced February 2024.

  2. arXiv:2402.15151  [pdf, other]

    cs.CV cs.CL eess.AS eess.IV

    Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

    Authors: Jeong Hun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro

    Abstract: In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM),…

    Submitted 13 May, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

    Comments: An Erratum was added on the last page of this paper

  3. arXiv:2401.09802  [pdf, other]

    eess.AS cs.CV cs.SD

    Multilingual Visual Speech Recognition with a Single Model by Learning with Discrete Visual Speech Units

    Authors: Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Se Jin Park, Yong Man Ro

    Abstract: This paper explores sentence-level Multilingual Visual Speech Recognition with a single model for the first time. As the massive multilingual modeling of visual data requires huge computational costs, we propose a novel strategy, processing with visual speech units. Motivated by the recent success of the audio speech unit, the proposed visual speech unit is obtained by discretizing the visual spee…

    Submitted 18 January, 2024; originally announced January 2024.

  4. arXiv:2312.02512  [pdf, other]

    cs.CV cs.AI cs.MM eess.AS

    AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

    Authors: Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro

    Abstract: This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal (i.e., audio and visual speech). With the proposed AV2AV, two key advantages can be brought: 1) We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages. In contrast…

    Submitted 26 March, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: CVPR 2024. Code & Demo: https://choijeongsoo.github.io/av2av

  5. arXiv:2310.14946  [pdf, other]

    cs.MM cs.SD eess.AS

    Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model

    Authors: Joanna Hong, Se Jin Park, Yong Man Ro

    Abstract: We present a novel approach to multilingual audio-visual speech recognition tasks by introducing a single model on a multilingual dataset. Motivated by a human cognitive system where humans can intuitively distinguish different languages without any conscious effort or guidance, we propose a model that can capture which language is given as an input speech by distinguishing the inherent similariti…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023 Findings

  6. arXiv:2310.05934  [pdf, other]

    cs.CV cs.AI cs.MM eess.IV

    DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion

    Authors: Se Jin Park, Joanna Hong, Minsu Kim, Yong Man Ro

    Abstract: Speech-driven 3D facial animation has gained significant attention for its ability to create realistic and expressive facial animations in 3D space based on speech. Learning-based methods have shown promising progress in achieving accurate facial motion synchronized with speech. However, one-to-many nature of speech-to-3D facial synthesis has not been fully explored: while the lip accurately synch…

    Submitted 23 August, 2023; originally announced October 2023.

  7. arXiv:2309.08535  [pdf, other]

    cs.CV cs.AI eess.AS

    Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper

    Authors: Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro

    Abstract: This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages that have a limited number of labeled data. Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the…

    Submitted 12 January, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024

  8. arXiv:2309.08531  [pdf, other]

    cs.CV cs.CL eess.AS eess.IV

    Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

    Authors: Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro

    Abstract: In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-s…

    Submitted 15 September, 2023; originally announced September 2023.

  9. arXiv:2308.09311  [pdf, other]

    cs.CV cs.CL cs.SD eess.AS eess.IV

    Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge

    Authors: Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro

    Abstract: This paper proposes a novel lip reading framework, especially for low-resource languages, which has not been well addressed in the previous literature. Since low-resource languages do not have enough video-text paired data to train the model to have sufficient power to model lip movements and language, it is regarded as challenging to develop lip reading models for low-resource languages. In order…

    Submitted 12 January, 2024; v1 submitted 18 August, 2023; originally announced August 2023.

    Comments: Accepted at ICCV 2023

  10. arXiv:2308.07787  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding

    Authors: Jeongsoo Choi, Joanna Hong, Yong Man Ro

    Abstract: Recent research has demonstrated impressive results in video-to-speech synthesis which involves reconstructing speech solely from visual input. However, previous works have struggled to accurately synthesize speech due to a lack of sufficient guidance for the model to infer the correct content with the appropriate sound. To resolve the issue, they have adopted an extra speaker embedding as a speak…

    Submitted 15 August, 2023; originally announced August 2023.

    Comments: ICCV 2023

  11. arXiv:2308.07593  [pdf, other]

    cs.CV cs.MM eess.AS eess.IV

    AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model

    Authors: Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro

    Abstract: Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because of the insufficient information on lip movements. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of visual modality by using audio modality. Different fro…

    Submitted 11 January, 2024; v1 submitted 15 August, 2023; originally announced August 2023.

    Comments: Accepted by IEEE Transactions on Multimedia

  12. arXiv:2308.01831  [pdf, other]

    cs.CL eess.AS eess.SP

    Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation

    Authors: Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro

    Abstract: In this paper, we propose a method to learn unified representations of multilingual speech and text with a single model, especially focusing on the purpose of speech synthesis. We represent multilingual speech audio with speech units, the quantized representations of speech features encoded from a self-supervised speech model. Therefore, we can focus on their linguistic content by treating the aud…

    Submitted 3 August, 2023; originally announced August 2023.

  13. arXiv:2306.16003  [pdf, other]

    cs.GR cs.CV cs.SD eess.AS

    Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models

    Authors: Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro

    Abstract: In this paper, we present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner. Consequently, we can easily generate face videos that articulate the provided textual sentences, eliminating the necessity of recording speech for each inference, as required in the audio-driven model. To this end, we propose to embed the input text into t…

    Submitted 18 January, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

    Comments: ICASSP 2024

  14. arXiv:2305.19603  [pdf, other]

    cs.SD cs.CV eess.AS

    Intelligible Lip-to-Speech Synthesis with Speech Units

    Authors: Jeongsoo Choi, Minsu Kim, Yong Man Ro

    Abstract: In this paper, we propose a novel Lip-to-Speech synthesis (L2S) framework, for synthesizing intelligible speech from a silent lip movement video. Specifically, to complement the insufficient supervisory signal of the previous L2S model, we propose to use quantized self-supervised speech representations, named speech units, as an additional prediction target for the L2S model. Therefore, the propos…

    Submitted 31 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023

  15. arXiv:2305.19556  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS eess.IV

    Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation

    Authors: Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro

    Abstract: Talking face generation is the challenging task of synthesizing a natural and realistic face that requires accurate synchronization with a given audio. Due to co-articulation, where an isolated phone is influenced by the preceding or following phones, the articulation of a phone varies upon the phonetic context. Therefore, modeling lip motion with the phonetic context can generate more spatio-temp…

    Submitted 1 April, 2024; v1 submitted 31 May, 2023; originally announced May 2023.

    Comments: Accepted at ICASSP 2024

  16. arXiv:2305.04542  [pdf, other]

    cs.CV eess.AS

    Multi-Temporal Lip-Audio Memory for Visual Speech Recognition

    Authors: Jeong Hun Yeo, Minsu Kim, Yong Man Ro

    Abstract: Visual Speech Recognition (VSR) is a task to predict a sentence or word from lip movements. Some works have been recently presented which use audio signals to supplement visual information. However, existing methods utilize only limited information such as phoneme-level features and soft labels of Automatic Speech Recognition (ASR) networks. In this paper, we present a Multi-Temporal Lip-Audio Mem…

    Submitted 8 May, 2023; originally announced May 2023.

    Comments: Presented at ICASSP 2023

  17. arXiv:2303.08670  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video

    Authors: Minsu Kim, Chae Won Kim, Yong Man Ro

    Abstract: Forced alignment refers to a technology that time-aligns a given transcription with a corresponding speech. However, as the forced alignment technologies have developed using speech audio, they might fail in alignment when the input speech audio is noise-corrupted or is not accessible. We focus on that there is another component that the speech can be inferred from, the speech video (i.e., talking…

    Submitted 26 February, 2023; originally announced March 2023.

    Comments: Accepted in AAAI2023

  18. arXiv:2303.08536  [pdf, other]

    cs.MM cs.CV cs.LG cs.SD eess.AS

    Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring

    Authors: Joanna Hong, Minsu Kim, Jeongsoo Choi, Yong Man Ro

    Abstract: This paper deals with Audio-Visual Speech Recognition (AVSR) under multimodal input corruption situations where audio inputs and visual inputs are both corrupted, which is not well addressed in previous research directions. Previous studies have focused on how to complement the corrupted audio inputs with the clean visual inputs with the assumption of the availability of clean visual inputs. Howev…

    Submitted 20 March, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted at CVPR 2023. Implementation available: https://github.com/joannahong/AV-RelScore

  19. arXiv:2302.08841  [pdf, other]

    cs.SD cs.CV cs.LG cs.MM eess.AS

    Lip-to-Speech Synthesis in the Wild with Multi-task Learning

    Authors: Minsu Kim, Joanna Hong, Yong Man Ro

    Abstract: Recent studies have shown impressive performance in Lip-to-speech synthesis that aims to reconstruct speech from visual information alone. However, they have been suffering from synthesizing accurate speech in the wild, due to insufficient supervision for guiding the model to infer the correct content. Distinct from the previous methods, in this paper, we develop a powerful Lip2Speech method that…

    Submitted 17 February, 2023; originally announced February 2023.

    Comments: Accepted at ICASSP 2023. Demo available: https://github.com/joannahong/Lip-to-Speech-Synthesis-in-the-Wild

  20. arXiv:2302.08102  [pdf, other]

    cs.CL cs.AI cs.CV cs.SD eess.AS eess.IV

    Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition

    Authors: Minsu Kim, Hyung-Il Kim, Yong Man Ro

    Abstract: Visual Speech Recognition (VSR) aims to infer speech into text depending on lip movements alone. As it focuses on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements, and this makes the VSR models show degraded performance when they are applied to unseen speakers. In this paper, to remedy the performance degradation of the VSR m…

    Submitted 16 February, 2023; originally announced February 2023.

  21. arXiv:2211.00924  [pdf, other]

    cs.CV cs.AI eess.IV

    SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

    Authors: Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro

    Abstract: The challenge of talking face generation from speech lies in aligning two different modal information, audio and video, such that the mouth region corresponds to input audio. Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models. However, they struggle to synthesize fine details of the lips varying at th…

    Submitted 2 November, 2022; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted at AAAI 2022 (Oral)

  22. arXiv:2208.04498  [pdf, other]

    cs.CV cs.AI eess.AS eess.IV

    Speaker-adaptive Lip Reading with User-dependent Padding

    Authors: Minsu Kim, Hyunjun Kim, Yong Man Ro

    Abstract: Lip reading aims to predict speech based on lip movements alone. As it focuses on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements. This makes the lip reading models show degraded performance when they are applied to unseen speakers due to the mismatch between training and testing conditions. Speaker adaptation technique aims…

    Submitted 8 August, 2022; originally announced August 2022.

    Comments: Accepted at ECCV2022

  23. arXiv:2207.06020  [pdf, other]

    cs.SD cs.AI cs.CV cs.MM eess.AS eess.IV

    Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition

    Authors: Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro

    Abstract: This paper focuses on designing a noise-robust end-to-end Audio-Visual Speech Recognition (AVSR) system. To this end, we propose Visual Context-driven Audio Feature Enhancement module (V-CAFE) to enhance the input noisy audio speech with a help of audio-visual correspondence. The proposed V-CAFE is designed to capture the transition of lip movements, namely visual context and to generate a noise r…

    Submitted 13 July, 2022; originally announced July 2022.

    Comments: Accepted at Interspeech 2022

  24. arXiv:2206.07458  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection

    Authors: Joanna Hong, Minsu Kim, Yong Man Ro

    Abstract: The goal of this work is to reconstruct speech from a silent talking face video. Recent studies have shown impressive performance on synthesizing speech from silent talking face videos. However, they have not explicitly considered on varying identity characteristics of different speakers, which place a challenge in the video-to-speech synthesis, and this becomes more critical in unseen-speaker set…

    Submitted 20 July, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

    Comments: Accepted by ECCV 2022

  25. arXiv:2204.01726  [pdf, other]

    cs.CV cs.AI eess.AS

    Lip to Speech Synthesis with Visual Context Attentional GAN

    Authors: Minsu Kim, Joanna Hong, Yong Man Ro

    Abstract: In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis. Specifically, the proposed VCA-GAN synthesizes the speech from local lip visual features by finding a mapping function of viseme-to-phoneme, while global visual context is embedded into the intermed…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: Published at NeurIPS 2021

  26. arXiv:2204.01265  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video

    Authors: Minsu Kim, Joanna Hong, Se Jin Park, Yong Man Ro

    Abstract: In this paper, we introduce a novel audio-visual multi-modal bridging framework that can utilize both audio and visual information, even with uni-modal inputs. We exploit a memory network that stores source (i.e., visual) and target (i.e., audio) modal representations, where source modal representation is what we are given, and target modal representations are what we want to obtain from the memor…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: Published at ICCV 2021

  27. arXiv:2104.06782  [pdf, other]

    cs.CV eess.IV

    Visual Comfort Aware-Reinforcement Learning for Depth Adjustment of Stereoscopic 3D Images

    Authors: Hak Gu Kim, Minho Park, Sangmin Lee, Seongyeop Kim, Yong Man Ro

    Abstract: Depth adjustment aims to enhance the visual experience of stereoscopic 3D (S3D) images, which accompanied with improving visual comfort and depth perception. For a human expert, the depth adjustment procedure is a sequence of iterative decision making. The human expert iteratively adjusts the depth until he is satisfied with the both levels of visual comfort and the perceived depth. In this work,…

    Submitted 14 April, 2021; originally announced April 2021.

    Comments: AAAI 2021

  28. arXiv:2104.06780  [pdf, other]

    cs.CV eess.IV

    Towards a Better Understanding of VR Sickness: Physical Symptom Prediction for VR Contents

    Authors: Hak Gu Kim, Sangmin Lee, Seongyeop Kim, Heoun-taek Lim, Yong Man Ro

    Abstract: We address the black-box issue of VR sickness assessment (VRSA) by evaluating the level of physical symptoms of VR sickness. For the VR contents inducing the similar VR sickness level, the physical symptoms can vary depending on the characteristics of the contents. Most of existing VRSA methods focused on assessing the overall VR sickness score. To make better understanding of VR sickness, it is r…

    Submitted 14 April, 2021; originally announced April 2021.

    Comments: AAAI 2021