Showing 1–50 of 76 results for author: Chung, J S

  1. arXiv:2406.14559  [pdf, other]

    cs.SD eess.AS

    Disentangled Representation Learning for Environment-agnostic Speaker Recognition

    Authors: KiHyun Nam, Hee-Soo Heo, Jee-weon Jung, Joon Son Chung

    Abstract: This work presents a framework based on feature disentanglement to learn speaker embeddings that are robust to environmental variations. Our framework utilises an auto-encoder as a disentangler, dividing the input speaker embedding into components related to the speaker and other residual information. We employ a group of objective functions to ensure that the auto-encoder's code representation -…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024. The official webpage can be found at https://mm.kaist.ac.kr/projects/voxceleb-disentangler/

  2. arXiv:2406.10549  [pdf, other]

    eess.AS cs.CL cs.SD

    Lightweight Audio Segmentation for Long-form Speech Translation

    Authors: Jaesong Lee, Soyoon Kim, Hanbyul Kim, Joon Son Chung

    Abstract: Speech segmentation is an essential part of speech translation (ST) systems in real-world scenarios. Since most ST models are designed to process speech segments, long-form audio must be partitioned into shorter segments before translation. Recently, data-driven approaches for the speech segmentation task have been developed. Although the approaches improve overall translation quality, a performan…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  3. arXiv:2406.09286  [pdf, other]

    eess.AS cs.SD

    FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching

    Authors: Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung

    Abstract: This work proposes an efficient method to enhance the quality of corrupted speech signals by leveraging both acoustic and visual cues. While existing diffusion-based approaches have demonstrated remarkable quality, their applicability is limited by slow inference speeds and computational complexity. To address this issue, we present FlowAVSE which enhances the inference speed and reduces the numbe…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: INTERSPEECH 2024

  4. arXiv:2406.05339  [pdf, other]

    eess.AS cs.AI

    To what extent can ASV systems naturally defend against spoofing attacks?

    Authors: Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Sidhhant Arora, Junichi Yamagishi, Joon Son Chung

    Abstract: The current automatic speaker verification (ASV) task involves making binary decisions on two types of trials: target and non-target. However, emerging advancements in speech generation technology pose significant threats to the reliability of ASV systems. This study investigates whether ASV effortlessly acquires robustness against spoofing attacks (i.e., zero-shot capability) by systematically ex…

    Submitted 14 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

    Comments: 5 pages, 3 figures, 3 tables, Interspeech 2024

  5. arXiv:2406.03344  [pdf, other]

    cs.SD cs.AI eess.AS

    Audio Mamba: Bidirectional State Space Model for Audio Representation Learning

    Authors: Mehmet Hamza Erol, Arda Senocak, Jiu Feng, Joon Son Chung

    Abstract: Transformers have rapidly become the preferred choice for audio classification, surpassing methods based on CNNs. However, Audio Spectrogram Transformers (ASTs) exhibit quadratic scaling due to self-attention. The removal of this quadratic self-attention cost presents an appealing direction. Recently, state space models (SSMs), such as Mamba, have demonstrated potential in language and vision task…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Code is available at https://github.com/mhamzaerol/Audio-Mamba-AuM

  6. arXiv:2405.10272  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS eess.IV

    Faces that Speak: Jointly Synthesising Talking Face and Speech from Text

    Authors: Youngjoon Jang, Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim, Joon Son Chung

    Abstract: The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: CVPR 2024

  7. arXiv:2404.03477  [pdf, other]

    cs.CV

    Towards Automated Movie Trailer Generation

    Authors: Dawit Mureja Argaw, Mattia Soldan, Alejandro Pardo, Chen Zhao, Fabian Caba Heilbron, Joon Son Chung, Bernard Ghanem

    Abstract: Movie trailers are an essential tool for promoting films and attracting audiences. However, the process of creating trailers can be time-consuming and expensive. To streamline this process, we propose an automatic trailer generation framework that generates plausible trailers from a full movie by automating shot selection and composition. Our approach draws inspiration from machine translation tec…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024

  8. arXiv:2404.03398  [pdf, other]

    cs.CV

    Scaling Up Video Summarization Pretraining with Large Language Models

    Authors: Dawit Mureja Argaw, Seunghyun Yoon, Fabian Caba Heilbron, Hanieh Deilamsalehy, Trung Bui, Zhaowen Wang, Franck Dernoncourt, Joon Son Chung

    Abstract: Long-form video content constitutes a significant portion of internet traffic, making automated video summarization an essential research problem. However, existing video summarization datasets are notably limited in their size, constraining the effectiveness of state-of-the-art methods for generalization. Our work aims to overcome this limitation by capitalizing on the abundance of long-form vide…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024

  9. arXiv:2403.09502  [pdf, other]

    cs.LG cs.AI cs.MM

    EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning

    Authors: Jongsuk Kim, Hyeongkeun Lee, Kyeongha Rho, Junmo Kim, Joon Son Chung

    Abstract: Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning methods, audio-visual learning has struggled to fully harness these benefits, as augmentations can easily disrupt the correspondence between input pairs. To addre…

    Submitted 20 June, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

    Comments: 15 pages, 3 figures; Accepted to ICML 2024 (camera ready version)

  10. arXiv:2401.10032  [pdf, other]

    eess.AS cs.AI eess.SP

    FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder

    Authors: Tan Dat Nguyen, Ji-Hoon Kim, Youngjoon Jang, Jaehun Kim, Joon Son Chung

    Abstract: The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad. Our framework consists of the following three key components: (1) We employ discrete wavelet transform that decomposes a complicated waveform into sub-band wavelets, which helps FreGrad to operate on a simple and concise feature space, (2) We design a frequency-aware dilated co…

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  11. arXiv:2401.08415  [pdf, other]

    cs.SD cs.LG eess.AS

    From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers

    Authors: Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

    Abstract: Transformers have become central to recent advances in audio classification. However, training an audio spectrogram transformer, e.g. AST, from scratch can be resource and time-intensive. Furthermore, the complexity of transformers heavily depends on the input audio spectrogram size. In this work, we aim to optimize AST training by linking to the resolution in the time-axis. We introduce multi-pha…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: ICASSP 2024

  12. arXiv:2311.04066  [pdf, other]

    cs.CV cs.AI cs.MM cs.SD eess.AS

    Can CLIP Help Sound Source Localization?

    Authors: Sooyoung Park, Arda Senocak, Joon Son Chung

    Abstract: Large-scale pre-trained image-text models demonstrate remarkable versatility across diverse tasks, benefiting from their robust representational capabilities and effective multimodal alignment. We extend the application of these models, specifically CLIP, to the domain of sound source localization. Unlike conventional approaches, we employ the pre-trained CLIP model without explicit text input, re…

    Submitted 7 November, 2023; originally announced November 2023.

    Comments: WACV 2024

  13. arXiv:2310.19581  [pdf, other]

    eess.AS cs.CV cs.SD

    Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model

    Authors: Suyeon Lee, Chaeyoung Jung, Youngjoon Jang, Jaehun Kim, Joon Son Chung

    Abstract: The objective of this work is to extract a target speaker's voice from a mixture of voices using visual cues. Existing works on audio-visual speech separation have demonstrated their performance with promising intelligibility, but maintaining naturalness remains a challenge. To address this issue, we propose AVDiffuSS, an audio-visual speech separation model based on a diffusion mechanism known for…

    Submitted 30 October, 2023; originally announced October 2023.

    Comments: Project page with demo: https://mm.kaist.ac.kr/projects/avdiffuss/

  14. arXiv:2309.14741  [pdf, other]

    eess.AS cs.SD

    Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification

    Authors: Hee-Soo Heo, KiHyun Nam, Bong-Jin Lee, Youngki Kwon, Minjae Lee, You Jin Kim, Joon Son Chung

    Abstract: In the field of speaker verification, session or channel variability poses a significant challenge. While many contemporary methods aim to disentangle session information from speaker embeddings, we introduce a novel approach using an additional embedding to represent the session information. This is achieved by training an auxiliary network appended to the speaker embedding extractor which remain…

    Submitted 26 September, 2023; originally announced September 2023.

  15. arXiv:2309.13664  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    VoiceLDM: Text-to-Speech with Environmental Context

    Authors: Yeonghyeon Lee, Inmo Yeon, Juhan Nam, Joon Son Chung

    Abstract: This paper presents VoiceLDM, a model designed to produce audio that accurately follows two distinct natural language text prompts: the description prompt and the content prompt. The former provides information about the overall environmental context of the audio, while the latter conveys the linguistic content. To achieve this, we adopt a text-to-audio (TTA) model based on latent diffusion models…

    Submitted 24 September, 2023; originally announced September 2023.

    Comments: Demos and code are available at https://voiceldm.github.io

  16. arXiv:2309.12306  [pdf, other]

    cs.CV cs.SD eess.AS

    TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning

    Authors: Chaeyoung Jung, Suyeon Lee, Kihyun Nam, Kyeongha Rho, You Jin Kim, Youngjoon Jang, Joon Son Chung

    Abstract: The goal of this work is Active Speaker Detection (ASD), a task to determine whether a person is speaking or not in a series of video frames. Previous works have dealt with the task by exploring network architectures while learning effective representations has been less explored. In this work, we propose TalkNCE, a novel talk-aware contrastive loss. The loss is only applied to part of the full se…

    Submitted 21 September, 2023; originally announced September 2023.

  17. arXiv:2309.12304  [pdf, other]

    cs.CV

    SlowFast Network for Continuous Sign Language Recognition

    Authors: Junseok Ahn, Youngjoon Jang, Joon Son Chung

    Abstract: The objective of this work is the effective extraction of spatial and dynamic features for Continuous Sign Language Recognition (CSLR). To accomplish this, we utilise a two-pathway SlowFast network, where each pathway operates at distinct temporal resolutions to separately capture spatial (hand shapes, facial expressions) and dynamic (movements) information. In addition, we introduce two distinct…

    Submitted 21 September, 2023; originally announced September 2023.

  18. arXiv:2309.10724  [pdf, other]

    cs.CV cs.AI cs.MM cs.SD eess.AS

    Sound Source Localization is All about Cross-Modal Alignment

    Authors: Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

    Abstract: Humans can easily perceive the direction of sound sources in a visual scene, termed sound source localization. Recent studies on learning-based sound source localization have mainly explored the problem from a localization perspective. However, prior arts and existing benchmarks do not account for a more important aspect of the problem, cross-modal semantic understanding, which is essential for ge…

    Submitted 19 September, 2023; originally announced September 2023.

    Comments: ICCV 2023

  19. arXiv:2308.15256  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

    Authors: Ji-Hoon Kim, Jaehun Kim, Joon Son Chung

    Abstract: The goal of this work is to reconstruct high quality speech from lip motions alone, a task also known as lip-to-speech. A key challenge of lip-to-speech systems is the one-to-many mapping caused by (1) the existence of homophenes and (2) multiple speech variations, resulting in a mispronounced and over-smoothed speech. In this paper, we propose a novel lip-to-speech system that significantly impro…

    Submitted 4 January, 2024; v1 submitted 29 August, 2023; originally announced August 2023.

    Comments: Accepted to AAAI 2024

  20. arXiv:2307.09286  [pdf, other]

    cs.SD cs.LG eess.AS

    FlexiAST: Flexibility is What AST Needs

    Authors: Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

    Abstract: The objective of this work is to give patch-size flexibility to Audio Spectrogram Transformers (AST). Recent advancements in ASTs have shown superior performance in various audio-based tasks. However, the performance of standard ASTs degrades drastically when evaluated using different patch sizes from that used during training. As a result, AST models are typically re-trained to accommodate change…

    Submitted 18 July, 2023; originally announced July 2023.

    Comments: Interspeech 2023

  21. arXiv:2304.03275  [pdf, other]

    cs.CV

    That's What I Said: Fully-Controllable Talking Face Generation

    Authors: Youngjoon Jang, Kyeongha Rho, Jong-Bin Woo, Hyeongkeun Lee, Jihwan Park, Youshin Lim, Byeong-Yeol Kim, Joon Son Chung

    Abstract: The goal of this paper is to synthesise talking faces with controllable facial motions. To achieve this goal, we propose two key ideas. The first is to establish a canonical space where every face has the same motion patterns but different identities. The second is to navigate a multimodal motion space that only represents motion-related features while eliminating identity information. To disentan…

    Submitted 18 September, 2023; v1 submitted 6 April, 2023; originally announced April 2023.

  22. arXiv:2303.17517  [pdf, other]

    cs.CL cs.CV cs.SD eess.AS

    Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples

    Authors: Hyeonggon Ryu, Arda Senocak, In So Kweon, Joon Son Chung

    Abstract: The objective of this work is to explore the learning of visually grounded speech models (VGS) from a multilingual perspective. Bilingual VGS models are generally trained with an equal number of spoken captions from both languages. However, in reality, there can be an imbalance among the languages for the available spoken captions. Our key contribution in this work is to leverage the power of a high…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023

  23. arXiv:2303.11771  [pdf, other]

    cs.CV

    Self-Sufficient Framework for Continuous Sign Language Recognition

    Authors: Youngjoon Jang, Youngtaek Oh, Jae Won Cho, Myungchul Kim, Dong-Jin Kim, In So Kweon, Joon Son Chung

    Abstract: The goal of this work is to develop a self-sufficient framework for Continuous Sign Language Recognition (CSLR) that addresses key issues of sign language recognition. These include the need for complex multi-scale features such as hands, face, and mouth for understanding, and the absence of frame-level annotations. To this end, we propose (1) Divide and Focus Convolution (DFConv) which extracts both ma…

    Submitted 21 March, 2023; originally announced March 2023.

  24. arXiv:2302.13700  [pdf, other]

    cs.LG cs.CV cs.SD eess.AS

    Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech

    Authors: Jiyoung Lee, Joon Son Chung, Soo-Whan Chung

    Abstract: The goal of this work is zero-shot text-to-speech synthesis, with speaking styles and voices learnt from facial characteristics. Inspired by the natural fact that people can imagine the voice of someone when they look at his or her face, we introduce a face-styled diffusion text-to-speech (TTS) model within a unified framework learnt from visible attributes, called Face-TTS. This is the first time…

    Submitted 27 February, 2023; originally announced February 2023.

    Comments: ICASSP 2023. Project page: https://facetts.github.io

  25. arXiv:2302.10248  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge

    Authors: Jaesung Huh, Andrew Brown, Jee-weon Jung, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman

    Abstract: This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022. The goal of this challenge was to evaluate how well state-of-the-art speaker recognition systems can diarise and recognise speakers from speech obtained "in the wild". The challenge consisted of: (i) the provision of publicly available speaker re…

    Submitted 6 March, 2023; v1 submitted 20 February, 2023; originally announced February 2023.

  26. arXiv:2211.01966  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS eess.IV

    MarginNCE: Robust Sound Localization with a Negative Margin

    Authors: Sooyoung Park, Arda Senocak, Joon Son Chung

    Abstract: The goal of this work is to localize sound sources in visual scenes with a self-supervised approach. Contrastive learning in the context of sound source localization leverages the natural correspondence between audio and visual signals where the audio-visual pairs from the same source are assumed as positive, while randomly selected pairs are negatives. However, this approach brings in noisy corre…

    Submitted 3 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023. SOTA performance in Audio-Visual Sound Localization. 5 Pages

  27. arXiv:2211.00448  [pdf, other]

    cs.CV

    Signing Outside the Studio: Benchmarking Background Robustness for Continuous Sign Language Recognition

    Authors: Youngjoon Jang, Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, Joon Son Chung, In So Kweon

    Abstract: The goal of this work is background-robust continuous sign language recognition. Most existing Continuous Sign Language Recognition (CSLR) benchmarks have fixed backgrounds and are filmed in studios with a static monochromatic background. However, signing is not limited only to studios in the real world. In order to analyze the robustness of CSLR models under background shifts, we first evaluate e…

    Submitted 1 November, 2022; originally announced November 2022.

    Comments: Our dataset is available at https://github.com/art-jang/Signing-Outside-the-Studio

  28. arXiv:2211.00439  [pdf, other]

    eess.AS cs.SD

    Metric Learning for User-defined Keyword Spotting

    Authors: Jaemin Jung, Youkyum Kim, Jihwan Park, Youshin Lim, Byeong-Yeol Kim, Youngjoon Jang, Joon Son Chung

    Abstract: The goal of this work is to detect new spoken terms defined by users. While most previous works address Keyword Spotting (KWS) as a closed-set classification problem, this limits their transferability to unseen terms. The ability to define custom keywords has advantages in terms of user experience. In this paper, we propose a metric learning-based training strategy for user-defined keyword spott…

    Submitted 1 November, 2022; originally announced November 2022.

  29. arXiv:2211.00437  [pdf, other]

    eess.AS cs.SD

    Disentangled representation learning for multilingual speaker recognition

    Authors: Kihyun Nam, Youkyum Kim, Jaesung Huh, Hee Soo Heo, Jee-weon Jung, Joon Son Chung

    Abstract: The goal of this paper is to learn a robust speaker representation for the bilingual speaking scenario. The majority of the world's population speak at least two languages; however, most speaker recognition systems fail to recognise the same speaker when speaking in different languages. Popular speaker recognition evaluation sets do not consider the bilingual scenario, making it difficult to analyse t…

    Submitted 6 June, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: Interspeech 2023

  30. arXiv:2210.14682  [pdf, other]

    cs.SD cs.AI eess.AS

    In search of strong embedding extractors for speaker diarisation

    Authors: Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown, Youngki Kwon, Shinji Watanabe, Joon Son Chung

    Abstract: Speaker embedding extractors (EEs), which map input audio to a speaker discriminant latent space, are of paramount importance in speaker diarisation. However, there are several challenges when adopting EEs for diarisation, from which we tackle two key problems. First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: 5 pages, 1 figure, 2 tables, submitted to ICASSP

  31. arXiv:2210.10985  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Large-scale learning of generalised representations for speaker recognition

    Authors: Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesong Lee, Hye-jin Shim, Youngki Kwon, Joon Son Chung, Shinji Watanabe

    Abstract: The objective of this work is to develop a speaker recognition model to be used in diverse scenarios. We hypothesise that two components should be adequately configured to build such a model. First, adequate architecture would be required. We explore several recent state-of-the-art models, including ECAPA-TDNN and MFA-Conformer, as well as other baselines. Second, a massive amount of data would be…

    Submitted 27 October, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

    Comments: 5 pages, 5 tables, submitted to ICASSP

  32. arXiv:2204.09976  [pdf, other]

    cs.SD eess.AS

    Baseline Systems for the First Spoofing-Aware Speaker Verification Challenge: Score and Embedding Fusion

    Authors: Hye-jin Shim, Hemlata Tak, Xuechen Liu, Hee-Soo Heo, Jee-weon Jung, Joon Son Chung, Soo-Whan Chung, Ha-Jin Yu, Bong-Jin Lee, Massimiliano Todisco, Héctor Delgado, Kong Aik Lee, Md Sahidullah, Tomi Kinnunen, Nicholas Evans

    Abstract: Deep learning has brought impressive progress in the study of both automatic speaker verification (ASV) and spoofing countermeasures (CM). Although solutions are mutually dependent, they have typically evolved as standalone sub-systems whereby CM solutions are usually designed for a fixed ASV system. The work reported in this paper aims to gauge the improvements in reliability that can be gained f…

    Submitted 21 April, 2022; originally announced April 2022.

    Comments: 8 pages, accepted by Odyssey 2022

  33. arXiv:2203.08488  [pdf, other]

    eess.AS cs.AI

    Pushing the limits of raw waveform speaker recognition

    Authors: Jee-weon Jung, You Jin Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Joon Son Chung

    Abstract: In recent years, speaker recognition systems based on raw waveform inputs have received increasing attention. However, the performance of such systems is typically inferior to the state-of-the-art handcrafted feature-based counterparts, which demonstrate equal error rates under 1% on the popular VoxCeleb1 test set. This paper proposes a novel speaker recognition model based on raw waveform inputs…

    Submitted 28 March, 2022; v1 submitted 16 March, 2022; originally announced March 2022.

    Comments: submitted to INTERSPEECH 2022 as a conference paper. 5 pages, 2 figures, 5 tables

  34. arXiv:2201.04583  [pdf, other]

    cs.SD eess.AS

    VoxSRC 2021: The Third VoxCeleb Speaker Recognition Challenge

    Authors: Andrew Brown, Jaesung Huh, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman

    Abstract: The third instalment of the VoxCeleb Speaker Recognition Challenge was held in conjunction with Interspeech 2021. The aim of this challenge was to assess how well current speaker recognition technology is able to diarise and recognise speakers in unconstrained or 'in the wild' data. The challenge consisted of: (i) the provision of publicly available speaker recognition and diarisation data from Yo…

    Submitted 16 November, 2022; v1 submitted 12 January, 2022; originally announced January 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2012.06867

  35. arXiv:2110.03380  [pdf, other]

    cs.SD cs.CL

    Advancing the dimensionality reduction of speaker embeddings for speaker diarisation: disentangling noise and informing speech activity

    Authors: You Jin Kim, Hee-Soo Heo, Jee-weon Jung, Youngki Kwon, Bong-Jin Lee, Joon Son Chung

    Abstract: The objective of this work is to train noise-robust speaker embeddings adapted for speaker diarisation. Speaker embeddings play a crucial role in the performance of diarisation systems, but they often capture spurious information such as noise, adversely affecting performance. Our previous work has proposed an auto-encoder-based dimensionality reduction module to help remove the redundant informat…

    Submitted 3 November, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: This paper was submitted to ICASSP 2023

  36. arXiv:2110.03361  [pdf, other]

    eess.AS cs.AI

    Multi-scale speaker embedding-based graph attention networks for speaker diarisation

    Authors: Youngki Kwon, Hee-Soo Heo, Jee-weon Jung, You Jin Kim, Bong-Jin Lee, Joon Son Chung

    Abstract: The objective of this work is effective speaker diarisation using multi-scale speaker embeddings. Typically, there is a trade-off between the ability to recognise short speaker segments and the discriminative power of the embedding, according to the segment length used for embedding extraction. To this end, recent works have proposed the use of multi-scale embeddings where segments with varying le…

    Submitted 7 October, 2021; originally announced October 2021.

    Comments: 5 pages, 2 figures, submitted to ICASSP as a conference paper

  37. arXiv:2110.02791  [pdf, other]

    cs.SD cs.CL eess.AS

    Spell my name: keyword boosted speech recognition

    Authors: Namkyu Jung, Geonmin Kim, Joon Son Chung

    Abstract: Recognition of uncommon words such as names and technical terminology is important to understanding conversations in context. However, the ability to recognise such words remains a challenge in modern automatic speech recognition (ASR) systems. In this paper, we propose a simple but powerful ASR decoding method that can better recognise these uncommon keywords, which in turn enables better reada…

    Submitted 6 October, 2021; originally announced October 2021.

  38. arXiv:2110.01200  [pdf, other]

    eess.AS cs.AI cs.LG

    AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks

    Authors: Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-Jin Yu, Nicholas Evans

    Abstract: Artefacts that differentiate spoofed from bona-fide utterances can reside in spectral or temporal domains. Their reliable detection usually depends upon computationally demanding ensemble systems where each subsystem is tuned to some specific artefacts. We seek to develop an efficient, single system that can detect a broad range of different spoofing attacks without score-level ensembles. We propo…

    Submitted 4 October, 2021; originally announced October 2021.

    Comments: 5 pages, 1 figure, 3 tables, submitted to ICASSP2022

  39. arXiv:2108.07640  [pdf, other]

    cs.CV cs.SD eess.AS eess.IV

    Look Who's Talking: Active Speaker Detection in the Wild

    Authors: You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon, Bong-Jin Lee, Youngki Kwon, Joon Son Chung

    Abstract: In this work, we present a novel audio-visual dataset for active speaker detection in the wild. A speaker is considered active when his or her face is visible and the voice is audible simultaneously. Although active speaker detection is a crucial pre-processing step for many audio-visual tasks, there is no existing dataset of natural human speech to evaluate the performance of active speaker detec…

    Submitted 17 August, 2021; originally announced August 2021.

    Comments: To appear in Interspeech 2021. Data will be available from https://github.com/clovaai/lookwhostalking

  40. arXiv:2104.02879  [pdf, other]

    eess.AS cs.LG cs.SD

    Adapting Speaker Embeddings for Speaker Diarisation

    Authors: Youngki Kwon, Jee-weon Jung, Hee-Soo Heo, You Jin Kim, Bong-Jin Lee, Joon Son Chung

    Abstract: The goal of this paper is to adapt speaker embeddings for solving the problem of speaker diarisation. The quality of speaker embeddings is paramount to the performance of speaker diarisation systems. Despite this, prior works in the field have directly used embeddings designed only to be effective on the speaker verification task. In this paper, we propose three techniques that can be used to bett…

    Submitted 6 April, 2021; originally announced April 2021.

    Comments: 5 pages, 2 figures, 3 tables, submitted to Interspeech as a conference paper

  41. arXiv:2104.02878  [pdf, other]

    eess.AS cs.LG cs.SD

    Three-class Overlapped Speech Detection using a Convolutional Recurrent Neural Network

    Authors: Jee-weon Jung, Hee-Soo Heo, Youngki Kwon, Joon Son Chung, Bong-Jin Lee

    Abstract: In this work, we propose an overlapped speech detection system trained as a three-class classifier. Unlike conventional systems that perform binary classification as to whether or not a frame contains overlapped speech, the proposed approach classifies into three classes: non-speech, single speaker speech, and overlapped speech. By training a network with the more detailed label definition, the mo…

    Submitted 6 April, 2021; originally announced April 2021.

    Comments: 5 pages, 2 figures, 4 tables, submitted to Interspeech as a conference paper

  42. arXiv:2012.06867  [pdf, other]

    cs.SD cs.LG eess.AS

    VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge

    Authors: Arsha Nagrani, Joon Son Chung, Jaesung Huh, Andrew Brown, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A Reynolds, Andrew Zisserman

    Abstract: We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020. The goal of this challenge was to assess how well current speaker recognition technology is able to diarise and recognize speakers in unconstrained or 'in the wild' data. It consisted of: (i) a publicly available speaker recognition and diarisation dataset from YouTube videos together…

    Submitted 12 December, 2020; originally announced December 2020.

  43. arXiv:2011.14885  [pdf, ps, other]

    cs.SD eess.AS

    Look who's not talking

    Authors: Youngki Kwon, Hee Soo Heo, Jaesung Huh, Bong-Jin Lee, Joon Son Chung

    Abstract: The objective of this work is speaker diarisation of speech recordings 'in the wild'. The ability to determine speech segments is a crucial part of diarisation systems, accounting for a large proportion of errors. In this paper, we present a simple but effective solution for speech activity detection based on the speaker embeddings. In particular, we discover that the norm of the speaker embedding…

    Submitted 30 November, 2020; originally announced November 2020.

    Comments: SLT 2021

  44. arXiv:2011.05189  [pdf, other]

    cs.SD eess.AS

    Supervised attention for speaker recognition

    Authors: Seong Min Kye, Joon Son Chung, Hoirin Kim

    Abstract: The recently proposed self-attentive pooling (SAP) has shown good performance in several speaker recognition systems. In SAP systems, the context vector is trained end-to-end together with the feature extractor, where the role of context vector is to select the most discriminative frames for speaker recognition. However, the SAP underperforms compared to the temporal average pooling (TAP) baseline…

    Submitted 3 December, 2020; v1 submitted 10 November, 2020; originally announced November 2020.

    Comments: SLT 2021

  45. arXiv:2010.15809  [pdf, other]

    cs.SD eess.AS

    The ins and outs of speaker recognition: lessons from VoxSRC 2020

    Authors: Yoohwan Kwon, Hee-Soo Heo, Bong-Jin Lee, Joon Son Chung

    Abstract: The VoxCeleb Speaker Recognition Challenge (VoxSRC) at Interspeech 2020 offers a challenging evaluation for speaker recognition systems, which includes celebrities playing different parts in movies. The goal of this work is robust speaker recognition of utterances recorded in these challenging environments. We utilise variants of the popular ResNet architecture for speaker recognition and perform…

    Submitted 29 October, 2020; originally announced October 2020.

  46. arXiv:2010.15716  [pdf, other]

    cs.SD eess.AS

    Playing a Part: Speaker Verification at the Movies

    Authors: Andrew Brown, Jaesung Huh, Arsha Nagrani, Joon Son Chung, Andrew Zisserman

    Abstract: The goal of this work is to investigate the performance of popular speaker recognition models on speech segments from movies, where often actors intentionally disguise their voice to play a character. We make the following three contributions: (i) We collect a novel, challenging speaker recognition dataset called VoxMovies, with speech for 856 identities from almost 4000 movie clips. VoxMovies con…

    Submitted 11 February, 2021; v1 submitted 29 October, 2020; originally announced October 2020.

    Comments: The first three authors contributed equally to this work

  47. arXiv:2010.11543  [pdf, other]

    eess.AS cs.CL cs.SD

    Graph Attention Networks for Speaker Verification

    Authors: Jee-weon Jung, Hee-Soo Heo, Ha-Jin Yu, Joon Son Chung

    Abstract: This work presents a novel back-end framework for speaker verification using graph attention networks. Segment-wise speaker embeddings extracted from multiple crops within an utterance are interpreted as node representations of a graph. The proposed framework inputs segment-wise speaker embeddings from an enrollment and a test utterance and directly outputs a similarity score. We first construct a…

    Submitted 8 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: 5 pages, 1 figure, 2 tables, accepted for presentation at ICASSP 2021 as a conference paper

  48. arXiv:2009.14153  [pdf, other]

    eess.AS cs.SD

    Clova Baseline System for the VoxCeleb Speaker Recognition Challenge 2020

    Authors: Hee Soo Heo, Bong-Jin Lee, Jaesung Huh, Joon Son Chung

    Abstract: This report describes our submission to the VoxCeleb Speaker Recognition Challenge (VoxSRC) at Interspeech 2020. We perform a careful analysis of speaker recognition models based on the popular ResNet architecture, and train a number of variants using a range of loss functions. Our results show significant improvements over most existing works without the use of model ensemble or post-processing.…

    Submitted 29 September, 2020; originally announced September 2020.

  49. arXiv:2008.05983  [pdf, other]

    eess.AS cs.SD

    Cross attentive pooling for speaker verification

    Authors: Seong Min Kye, Yoohwan Kwon, Joon Son Chung

    Abstract: The goal of this paper is text-independent speaker verification where utterances come from 'in the wild' videos and may contain irrelevant signal. While speaker verification is naturally a pair-wise problem, existing methods to produce the speaker embeddings are instance-wise. In this paper, we propose Cross Attentive Pooling (CAP) that utilizes the context information across the reference-query p…

    Submitted 3 December, 2020; v1 submitted 13 August, 2020; originally announced August 2020.

    Comments: SLT 2021. Code available at https://github.com/seongmin-kye/CAP

  50. arXiv:2008.04237  [pdf, other]

    cs.CV cs.SD eess.AS

    Self-Supervised Learning of Audio-Visual Objects from Video

    Authors: Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman

    Abstract: Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented…

    Submitted 10 August, 2020; originally announced August 2020.

    Comments: ECCV 2020