Showing 1–35 of 35 results for author: Das, R K

Searching in archive eess.
  1. arXiv:2407.00291  [pdf, other]

    eess.AS cs.SD

    FMSG-JLESS Submission for DCASE 2024 Task4 on Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels

    Authors: Yang Xiao, Han Yin, Jisheng Bai, Rohan Kumar Das

    Abstract: This report presents the systems developed and submitted by Fortemedia Singapore (FMSG) and Joint Laboratory of Environmental Sound Sensing (JLESS) for DCASE 2024 Task 4. The task focuses on recognizing event classes and their time boundaries, given that multiple events can be present and may overlap in an audio recording. The novelty this year is a dataset with two sources, making it challenging…

    Submitted 28 June, 2024; originally announced July 2024.

    Comments: Technical report for DCASE 2024 Challenge Task 4

  2. arXiv:2406.02483  [pdf, other]

    eess.AS cs.AI cs.SD

    How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?

    Authors: Tianchi Liu, Lin Zhang, Rohan Kumar Das, Yi Ma, Ruijie Tao, Haizhou Li

    Abstract: Partially manipulating a sentence can greatly change its meaning. Recent work shows that countermeasures (CMs) trained on partially spoofed audio can effectively detect such spoofing. However, the current understanding of the decision-making process of CMs is limited. We utilize Grad-CAM and introduce a quantitative analysis metric to interpret CMs' decisions. We find that CMs prioritize the artif…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  3. arXiv:2404.17280  [pdf, other]

    cs.SD eess.AS

    Device Feature based on Graph Fourier Transformation with Logarithmic Processing For Detection of Replay Speech Attacks

    Authors: Mingrui He, Longting Xu, Han Wang, Mingjun Zhang, Rohan Kumar Das

    Abstract: The most common spoofing attacks on automatic speaker verification systems are replay speech attacks. Detection of replay speech heavily relies on replay configuration information. Previous studies have shown that graph Fourier transform-derived features can effectively detect replay speech but ignore device and environmental noise effects. In this work, we propose a new feature, the graph frequen…

    Submitted 26 April, 2024; originally announced April 2024.

  4. arXiv:2404.09342  [pdf, other]

    cs.CV cs.SD eess.AS

    Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan

    Authors: Muhammad Saad Saeed, Shah Nawaz, Muhammad Salman Tahir, Rohan Kumar Das, Muhammad Zaigham Zaheer, Marta Moscati, Markus Schedl, Muhammad Haris Khan, Karthik Nandakumar, Muhammad Haroon Yousaf

    Abstract: The advancements of technology have led to the use of multimodal systems in various real-world applications. Among them, the audio-visual systems are one of the widely used multimodal systems. In the recent years, associating face and voice of a person has gained attention due to presence of unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2…

    Submitted 16 April, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

    Comments: ACM Multimedia Conference - Grand Challenge

  5. arXiv:2404.00861  [pdf, other]

    eess.AS eess.IV

    Enhancing Real-World Active Speaker Detection with Multi-Modal Extraction Pre-Training

    Authors: Ruijie Tao, Xinyuan Qian, Rohan Kumar Das, Xiaoxue Gao, Jiadong Wang, Haizhou Li

    Abstract: Audio-visual active speaker detection (AV-ASD) aims to identify which visible face is speaking in a scene with one or more persons. Most existing AV-ASD methods prioritize capturing speech-lip correspondence. However, there is a noticeable gap in addressing the challenges from real-world AV-ASD scenarios. Due to the presence of low-quality noisy videos in such cases, AV-ASD systems without a selec…

    Submitted 31 March, 2024; originally announced April 2024.

    Comments: 10 pages

  6. arXiv:2402.02781  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG eess.AS

    Dual Knowledge Distillation for Efficient Sound Event Detection

    Authors: Yang Xiao, Rohan Kumar Das

    Abstract: Sound event detection (SED) is essential for recognizing specific sounds and their temporal locations within acoustic signals. This becomes challenging particularly for on-device applications, where computational resources are limited. To address this issue, we introduce a novel framework referred to as dual knowledge distillation for developing efficient SED systems in this work. Our proposed dua…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024 (Deep Neural Network Model Compression Workshop)

  7. arXiv:2401.04953  [pdf, other]

    eess.IV eess.SP

    Adaptive-avg-pooling based Attention Vision Transformer for Face Anti-spoofing

    Authors: Jichen Yang, Fangfan Chen, Rohan Kumar Das, Zhengyu Zhu, Shunsi Zhang

    Abstract: Traditional vision transformer consists of two parts: transformer encoder and multi-layer perception (MLP). The former plays the role of feature learning to obtain better representation, while the latter plays the role of classification. Here, the MLP is constituted of two fully connected (FC) layers, average value computing, FC layer and softmax layer. However, due to the use of average value com…

    Submitted 10 January, 2024; originally announced January 2024.

    Comments: Accepted for Publication in IEEE ICASSP 2024

  8. arXiv:2305.10729  [pdf, other]

    eess.AS

    A Multi-Task Learning Framework for Sound Event Detection using High-level Acoustic Characteristics of Sounds

    Authors: Tanmay Khandelwal, Rohan Kumar Das

    Abstract: Sound event detection (SED) entails identifying the type of sound and estimating its temporal boundaries from acoustic signals. These events are uniquely characterized by their spatio-temporal features, which are determined by the way they are produced. In this study, we leverage some distinctive high-level acoustic characteristics of various sound events to assist the SED model training, without…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted for Publication at INTERSPEECH 2023

  9. arXiv:2304.12688  [pdf, other]

    eess.AS

    Leveraging Audio-Tagging Assisted Sound Event Detection using Weakified Strong Labels and Frequency Dynamic Convolutions

    Authors: Tanmay Khandelwal, Rohan Kumar Das, Andrew Koh, Eng Siong Chng

    Abstract: Jointly learning from a small labeled set and a larger unlabeled set is an active research topic under semi-supervised learning (SSL). In this paper, we propose a novel SSL method based on a two-stage framework for leveraging a large unlabeled in-domain set. Stage-1 of our proposed framework focuses on audio-tagging (AT), which assists the sound event detection (SED) system in Stage-2. The AT syst…

    Submitted 25 April, 2023; originally announced April 2023.

    Comments: Accepted for Publication in IEEE-Statistical Signal Processing (SSP) Workshop 2023

  10. arXiv:2211.01091  [pdf, ps, other]

    eess.AS cs.AI cs.SD

    I4U System Description for NIST SRE'20 CTS Challenge

    Authors: Kong Aik Lee, Tomi Kinnunen, Daniele Colibro, Claudio Vair, Andreas Nautsch, Hanwu Sun, Liang He, Tianyu Liang, Qiongqiong Wang, Mickael Rouvier, Pierre-Michel Bousquet, Rohan Kumar Das, Ignacio Viñals Bailo, Meng Liu, Héctor Deldago, Xuechen Liu, Md Sahidullah, Sandro Cumani, Boning Zhang, Koji Okabe, Hitoshi Yamamoto, Ruijie Tao, Haizhou Li, Alfonso Ortega Giménez, Longbiao Wang, et al. (1 additional author not shown)

    Abstract: This manuscript describes the I4U submission to the 2020 NIST Speaker Recognition Evaluation (SRE'20) Conversational Telephone Speech (CTS) Challenge. The I4U's submission was resulted from active collaboration among researchers across eight research teams - I$^2$R (Singapore), UEF (Finland), VALPT (Italy, Spain), NEC (Japan), THUEE (China), LIA (France), NUS (Singapore), INRIA (France) and TJU (C…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: SRE 2021, NIST Speaker Recognition Evaluation Workshop, CTS Speaker Recognition Challenge, 14-12 December 2021

  11. arXiv:2210.15385  [pdf, other]

    eess.AS cs.SD eess.SP

    Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs

    Authors: Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, Haizhou Li

    Abstract: We study a novel neural architecture and its training strategies of speaker encoder for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-size speaker embedding from a spoken utterance of various length. Contrastive learning is a typical self-supervised learning technique. However, the quality of the speaker encoder depends very much on the sa…

    Submitted 27 October, 2022; originally announced October 2022.

    Comments: 13 pages

  12. arXiv:2202.01624  [pdf, other]

    cs.SD cs.CL eess.AS eess.SP

    MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances

    Authors: Tianchi Liu, Rohan Kumar Das, Kong Aik Lee, Haizhou Li

    Abstract: The time delay neural network (TDNN) represents one of the state-of-the-art of neural solutions to text-independent speaker verification. However, they require a large number of filters to capture the speaker characteristics at any local frequency region. In addition, the performance of such systems may degrade under short utterance scenarios. To address these issues, we propose a multi-scale freq…

    Submitted 15 February, 2022; v1 submitted 3 February, 2022; originally announced February 2022.

    Comments: Accepted by ICASSP 2022

  13. arXiv:2111.06671  [pdf, ps, other]

    eess.AS eess.SP

    HLT-NUS SUBMISSION FOR 2020 NIST Conversational Telephone Speech SRE

    Authors: Rohan Kumar Das, Ruijie Tao, Haizhou Li

    Abstract: This work provides a brief description of Human Language Technology (HLT) Laboratory, National University of Singapore (NUS) system submission for 2020 NIST conversational telephone speech (CTS) speaker recognition evaluation (SRE). The challenge focuses on evaluation under CTS data containing multilingual speech. The systems developed at HLT-NUS consider time-delay neural network (TDNN) x-vector…

    Submitted 12 November, 2021; originally announced November 2021.

    Comments: 3 pages

  14. arXiv:2110.03869  [pdf, other]

    eess.AS eess.SP

    Self-supervised Speaker Recognition with Loss-gated Learning

    Authors: Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, Haizhou Li

    Abstract: In self-supervised learning for speaker recognition, pseudo labels are useful as the supervision signals. It is a known fact that a speaker recognition model doesn't always benefit from pseudo labels due to their unreliability. In this work, we observe that a speaker recognition network tends to model the data with reliable labels faster than those with unreliable labels. This motivates us to stud…

    Submitted 14 July, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures

  15. arXiv:2110.00797  [pdf, other]

    eess.AS cs.SD

    Significance of Data Augmentation for Improving Cleft Lip and Palate Speech Recognition

    Authors: Protima Nomo Sudro, Rohan Kumar Das, Rohit Sinha, S. R. Mahadeva Prasanna

    Abstract: The automatic recognition of pathological speech, particularly from children with any articulatory impairment, is a challenging task due to various reasons. The lack of available domain specific data is one such obstacle that hinders its usage for different speech-based applications targeting pathological speakers. In line with the challenge, in this work, we investigate a few data augmentation te…

    Submitted 2 October, 2021; originally announced October 2021.

  16. arXiv:2109.08007  [pdf, other]

    cs.MM cs.SD eess.AS

    Graph Fourier Transform based Audio Zero-watermarking

    Authors: Longting Xu, Daiyu Huang, Syed Faham Ali Zaidi, Abdul Rauf, Rohan Kumar Das

    Abstract: The frequent exchange of multimedia information in the present era projects an increasing demand for copyright protection. In this work, we propose a novel audio zero-watermarking technology based on graph Fourier transform for enhancing the robustness with respect to copyright protection. In this approach, the combined shift operator is used to construct the graph signal, upon which the graph Fou…

    Submitted 16 September, 2021; originally announced September 2021.

  17. arXiv:2107.06592  [pdf, other]

    eess.AS cs.SD eess.IV

    Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

    Authors: Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li

    Abstract: Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. The successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike the prior work where systems make decision instantaneously using short-term features, we propose a novel framework, named TalkNet, that ma…

    Submitted 25 July, 2021; v1 submitted 14 July, 2021; originally announced July 2021.

    Comments: ACM Multimedia 2021

  18. arXiv:2102.06332  [pdf, ps, other]

    eess.AS

    Data Augmentation with Signal Companding for Detection of Logical Access Attacks

    Authors: Rohan Kumar Das, Jichen Yang, Haizhou Li

    Abstract: The recent advances in voice conversion (VC) and text-to-speech (TTS) make it possible to produce natural sounding speech that poses threat to automatic speaker verification (ASV) systems. To this end, research on spoofing countermeasures has gained attention to protect ASV systems from such attacks. While the advanced spoofing countermeasures are able to detect known nature of spoofing attacks, t…

    Submitted 11 February, 2021; originally announced February 2021.

    Comments: 5 pages, Accepted for publication in International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2021

  19. arXiv:2102.00270  [pdf, other]

    eess.AS

    Enhancing the Intelligibility of Cleft Lip and Palate Speech using Cycle-consistent Adversarial Networks

    Authors: Protima Nomo Sudro, Rohan Kumar Das, Rohit Sinha, S R Mahadeva Prasanna

    Abstract: Cleft lip and palate (CLP) refer to a congenital craniofacial condition that causes various speech-related disorders. As a result of structural and functional deformities, the affected subjects' speech intelligibility is significantly degraded, limiting the accessibility and usability of speech-controlled devices. Towards addressing this problem, it is desirable to improve the CLP speech intelligi…

    Submitted 30 January, 2021; originally announced February 2021.

    Comments: 8 pages, 4 figures, IEEE spoken language and technology workshop

  20. arXiv:2011.00699  [pdf, other]

    eess.AS

    Transformer-based Arabic Dialect Identification

    Authors: Wanqiu Lin, Maulik Madhavi, Rohan Kumar Das, Haizhou Li

    Abstract: This paper presents a dialect identification (DID) system based on the transformer neural network architecture. The conventional convolutional neural network (CNN)-based systems use the shorter receptive fields. We believe that long range information is equally important for language and DID, and self-attention mechanism in transformer captures the long range dependencies. In addition, to reduce t…

    Submitted 1 November, 2020; originally announced November 2020.

    Comments: Accepted for publication in International Conference on Asian Language Processing (IALP) 2020

  21. arXiv:2010.03909  [pdf, other]

    eess.AS cs.SD

    Emotion Invariant Speaker Embeddings for Speaker Identification with Emotional Speech

    Authors: Biswajit Dev Sarma, Rohan Kumar Das

    Abstract: Emotional state of a speaker is found to have significant effect in speech production, which can deviate speech from that arising from neutral state. This makes identifying speakers with different emotions a challenging task as generally the speaker models are trained using neutral speech. In this work, we propose to overcome this problem by creation of emotion invariant speaker embedding. We lear…

    Submitted 8 October, 2020; originally announced October 2020.

    Comments: Accepted for publication in APSIPA ASC 2020

  22. arXiv:2010.03907  [pdf, ps, other]

    eess.AS cs.SD

    Classification of Speech with and without Face Mask using Acoustic Features

    Authors: Rohan Kumar Das, Haizhou Li

    Abstract: The understanding and interpretation of speech can be affected by various external factors. The use of face masks is one such factors that can create obstruction to speech while communicating. This may lead to degradation of speech processing and affect humans perceptually. Knowing whether a speaker wears a mask may be useful for modeling speech for different applications. With this motivation, fi…

    Submitted 8 October, 2020; originally announced October 2020.

    Comments: Accepted for publication in APSIPA ASC 2020

  23. arXiv:2010.03905  [pdf, other]

    eess.AS cs.SD

    HLT-NUS Submission for NIST 2019 Multimedia Speaker Recognition Evaluation

    Authors: Rohan Kumar Das, Ruijie Tao, Jichen Yang, Wei Rao, Cheng Yu, Haizhou Li

    Abstract: This work describes the speaker verification system developed by Human Language Technology Laboratory, National University of Singapore (HLT-NUS) for 2019 NIST Multimedia Speaker Recognition Evaluation (SRE). The multimedia research has gained attention to a wide range of applications and speaker recognition is no exception to it. In contrast to the previous NIST SREs, the latest edition focuses o…

    Submitted 8 October, 2020; originally announced October 2020.

    Comments: Accepted for publication in APSIPA ASC 2020

  24. arXiv:2009.09637  [pdf, other]

    eess.AS

    Light Convolutional Neural Network with Feature Genuinization for Detection of Synthetic Speech Attacks

    Authors: Zhenzong Wu, Rohan Kumar Das, Jichen Yang, Haizhou Li

    Abstract: Modern text-to-speech (TTS) and voice conversion (VC) systems produce natural sounding speech that questions the security of automatic speaker verification (ASV). This makes detection of such synthetic speech very important to safeguard ASV systems from unauthorized access. Most of the existing spoofing countermeasures perform well when the nature of the attacks is made known to the system during…

    Submitted 21 September, 2020; originally announced September 2020.

    Comments: Accepted for publication in Interspeech 2020

  25. arXiv:2009.03554  [pdf, other]

    eess.AS cs.SD

    Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions

    Authors: Rohan Kumar Das, Tomi Kinnunen, Wen-Chin Huang, Zhenhua Ling, Junichi Yamagishi, Yi Zhao, Xiaohai Tian, Tomoki Toda

    Abstract: The Voice Conversion Challenge 2020 is the third edition under its flagship that promotes intra-lingual semiparallel and cross-lingual voice conversion (VC). While the primary evaluation of the challenge submissions was done through crowd-sourced listening tests, we also performed an objective assessment of the submitted systems. The aim of the objective assessment is to provide complementary perf…

    Submitted 8 September, 2020; originally announced September 2020.

    Comments: Submitted to ISCA Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020

  26. arXiv:2008.12527  [pdf, other]

    eess.AS cs.SD

    Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion

    Authors: Yi Zhao, Wen-Chin Huang, Xiaohai Tian, Junichi Yamagishi, Rohan Kumar Das, Tomi Kinnunen, Zhenhua Ling, Tomoki Toda

    Abstract: The voice conversion challenge is a bi-annual scientific event held to compare and understand different voice conversion (VC) systems built on a common dataset. In 2020, we organized the third edition of the challenge and constructed and distributed a new database for two tasks, intra-lingual semi-parallel and cross-lingual VC. After a two-month challenge period, we received 33 submissions, includ…

    Submitted 28 August, 2020; originally announced August 2020.

    Comments: Submitted to ISCA Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020

  27. arXiv:2008.08901  [pdf, other]

    eess.AS cs.CL cs.SD eess.SP

    Speaker-Utterance Dual Attention for Speaker and Utterance Verification

    Authors: Tianchi Liu, Rohan Kumar Das, Maulik Madhavi, Shengmei Shen, Haizhou Li

    Abstract: In this paper, we study a novel technique that exploits the interaction between speaker traits and linguistic content to improve both speaker verification and utterance verification performance. We implement an idea of speaker-utterance dual attention (SUDA) in a unified neural network. The dual attention refers to an attention mechanism for the two tasks of speaker and utterance verification. The…

    Submitted 20 August, 2020; originally announced August 2020.

    Comments: Accepted by Interspeech 2020

  28. arXiv:2008.03894  [pdf, other]

    eess.AS

    Audio-visual Speaker Recognition with a Cross-modal Discriminative Network

    Authors: Ruijie Tao, Rohan Kumar Das, Haizhou Li

    Abstract: Audio-visual speaker recognition is one of the tasks in the recent 2019 NIST speaker recognition evaluation (SRE). Studies in neuroscience and computer science all point to the fact that vision and auditory neural signals interact in the cognitive process. This motivated us to study a cross-modal network, namely voice-face discriminative network (VFNet) that establishes the general relation betwee…

    Submitted 10 August, 2020; originally announced August 2020.

  29. arXiv:2005.08046  [pdf, other]

    eess.AS cs.SD

    The INTERSPEECH 2020 Far-Field Speaker Verification Challenge

    Authors: Xiaoyi Qin, Ming Li, Hui Bu, Wei Rao, Rohan Kumar Das, Shrikanth Narayanan, Haizhou Li

    Abstract: The INTERSPEECH 2020 Far-Field Speaker Verification Challenge (FFSVC 2020) addresses three different research problems under well-defined conditions: far-field text-dependent speaker verification from single microphone array, far-field text-independent speaker verification from single microphone array, and far-field text-dependent speaker verification from distributed microphone arrays. All three…

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: Submitted to INTERSPEECH 2020

  30. arXiv:2004.08849  [pdf, other]

    eess.AS cs.CR

    The Attacker's Perspective on Automatic Speaker Verification: An Overview

    Authors: Rohan Kumar Das, Xiaohai Tian, Tomi Kinnunen, Haizhou Li

    Abstract: Security of automatic speaker verification (ASV) systems is compromised by various spoofing attacks. While many types of non-proactive attacks (and their defenses) have been studied in the past, attacker's perspective on ASV, represents a far less explored direction. It can potentially help to identify the weakest parts of ASV systems and be used to develop attacker-aware systems. We present an ov…

    Submitted 19 April, 2020; originally announced April 2020.

    Comments: 5 pages, 1 figure, Submitted to Interspeech 2020

  31. arXiv:2002.00387  [pdf, other]

    cs.SD eess.AS

    The FFSVC 2020 Evaluation Plan

    Authors: Xiaoyi Qin, Ming Li, Hui Bu, Rohan Kumar Das, Wei Rao, Shrikanth Narayanan, Haizhou Li

    Abstract: The Far-Field Speaker Verification Challenge 2020 (FFSVC20) is designed to boost the speaker verification research with special focus on far-field distributed microphone arrays under noisy conditions in real scenarios. The objectives of this challenge are to: 1) benchmark the current speech verification technology under this challenging condition, 2) promote the development of new ideas and techno…

    Submitted 4 February, 2020; v1 submitted 2 February, 2020; originally announced February 2020.

  32. arXiv:1910.00496  [pdf, other]

    eess.AS

    A Modularized Neural Network with Language-Specific Output Layers for Cross-lingual Voice Conversion

    Authors: Yi Zhou, Xiaohai Tian, Emre Yılmaz, Rohan Kumar Das, Haizhou Li

    Abstract: This paper presents a cross-lingual voice conversion framework that adopts a modularized neural network. The modularized neural network has a common input structure that is shared for both languages, and two separate output modules, one for each language. The idea is motivated by the fact that phonetic systems of languages are similar because humans share a common vocal production system, but acou…

    Submitted 1 October, 2019; originally announced October 2019.

    Comments: Accepted for publication at IEEE ASRU Workshop 2019

  33. arXiv:1909.07655  [pdf, other]

    eess.AS

    Black-box Attacks on Automatic Speaker Verification using Feedback-controlled Voice Conversion

    Authors: Xiaohai Tian, Rohan Kumar Das, Haizhou Li

    Abstract: Automatic speaker verification (ASV) systems in practice are greatly vulnerable to spoofing attacks. The latest voice conversion technologies are able to produce perceptually natural sounding speech that mimics any target speakers. However, the perceptual closeness to a speaker's identity may not be enough to deceive an ASV system. In this work, we propose a framework that uses the output scores o…

    Submitted 29 October, 2019; v1 submitted 17 September, 2019; originally announced September 2019.

    Comments: 6 pages, 3 figures, This paper is submitted to ICASSP 2020

  34. arXiv:1904.07386  [pdf, other]

    eess.AS cs.CL cs.SD

    I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences

    Authors: Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Hitoshi Yamamoto, Koji Okabe, Ville Vestman, Jing Huang, Guohong Ding, Hanwu Sun, Anthony Larcher, Rohan Kumar Das, Haizhou Li, Mickael Rouvier, Pierre-Michel Bousquet, Wei Rao, Qing Wang, Chunlei Zhang, Fahimeh Bahmaninezhad, Hector Delgado, Jose Patino, Qiongqiong Wang, Ling Guo, Takafumi Koshinaka, Jiacen Zhang, Koichi Shinoda, et al. (21 additional authors not shown)

    Abstract: The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of I4U consortium into NIST SRE series of evaluation. The primary objective of the current paper is to summarize the res…

    Submitted 15 April, 2019; originally announced April 2019.

    Comments: 5 pages

  35. arXiv:1809.06798  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Generative x-vectors for text-independent speaker verification

    Authors: Longting Xu, Rohan Kumar Das, Emre Yılmaz, Jichen Yang, Haizhou Li

    Abstract: Speaker verification (SV) systems using deep neural network embeddings, so-called the x-vector systems, are becoming popular due to its good performance superior to the i-vector systems. The fusion of these systems provides improved performance benefiting both from the discriminatively trained x-vectors and generative i-vectors capturing distinct speaker characteristics. In this paper, we propose…

    Submitted 17 September, 2018; originally announced September 2018.

    Comments: Accepted for publication at SLT 2018