Showing 1–12 of 12 results for author: Hautamaki, V

  1. arXiv:2406.09999  [pdf, other]

    eess.AS

    ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2Vec2.0 Based ASR

    Authors: Vishwanath Pratap Singh, Federico Malato, Ville Hautamaki, Md. Sahidullah, Tomi Kinnunen

    Abstract: While automatic speech recognition (ASR) greatly benefits from data augmentation, the augmentation recipes themselves tend to be heuristic. In this paper, we address one of the heuristic approach associated with balancing the right amount of augmented data in ASR training by introducing a reinforcement learning (RL) based dynamic adjustment of original-to-augmented data ratio (OAR). Unlike the fix…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted: Interspeech 2024

    Journal ref: Interspeech 2024

  2. arXiv:2401.02626  [pdf, other]

    cs.SD eess.AS

    Gradient weighting for speaker verification in extremely low Signal-to-Noise Ratio

    Authors: Yi Ma, Kong Aik Lee, Ville Hautamäki, Meng Ge, Haizhou Li

    Abstract: Speaker verification is hampered by background noise, particularly at extremely low Signal-to-Noise Ratio (SNR) under 0 dB. It is difficult to suppress noise without introducing unwanted artifacts, which adversely affects speaker verification. We proposed the mechanism called Gradient Weighting (Grad-W), which dynamically identifies and reduces artifact noise during prediction. The mechanism is ba…

    Submitted 4 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  3. arXiv:2210.15385  [pdf, other]

    eess.AS cs.SD eess.SP

    Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs

    Authors: Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, Haizhou Li

    Abstract: We study a novel neural architecture and its training strategies of speaker encoder for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-size speaker embedding from a spoken utterance of various length. Contrastive learning is a typical self-supervised learning technique. However, the quality of the speaker encoder depends very much on the sa…

    Submitted 27 October, 2022; originally announced October 2022.

    Comments: 13 pages

  4. arXiv:2201.09709  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Optimizing Tandem Speaker Verification and Anti-Spoofing Systems

    Authors: Anssi Kanervisto, Ville Hautamäki, Tomi Kinnunen, Junichi Yamagishi

    Abstract: As automatic speaker verification (ASV) systems are vulnerable to spoofing attacks, they are typically used in conjunction with spoofing countermeasure (CM) systems to improve security. For example, the CM can first determine whether the input is human speech, then the ASV can determine whether this speech matches the speaker's identity. The performance of such a tandem system can be measured with…

    Submitted 24 January, 2022; originally announced January 2022.

    Comments: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing. Published version available at: https://ieeexplore.ieee.org/document/9664367

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 477-488, 2022

  5. arXiv:2110.03869  [pdf, other]

    eess.AS eess.SP

    Self-supervised Speaker Recognition with Loss-gated Learning

    Authors: Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, Haizhou Li

    Abstract: In self-supervised learning for speaker recognition, pseudo labels are useful as the supervision signals. It is a known fact that a speaker recognition model doesn't always benefit from pseudo labels due to their unreliability. In this work, we observe that a speaker recognition network tends to model the data with reliable labels faster than those with unreliable labels. This motivates us to stud…

    Submitted 14 July, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures

  6. arXiv:2110.00940  [pdf, other]

    cs.SD cs.AI eess.AS

    PL-EESR: Perceptual Loss Based END-TO-END Robust Speaker Representation Extraction

    Authors: Yi Ma, Kong Aik Lee, Ville Hautamaki, Haizhou Li

    Abstract: Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise. However, excessive suppression may lead to speech distortion and speaker information loss, which degrades the performance of speaker embedding extraction. To alleviate this problem, we propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation…

    Submitted 3 October, 2021; originally announced October 2021.

  7. arXiv:2109.13510  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    VoxCeleb Enrichment for Age and Gender Recognition

    Authors: Khaled Hechmi, Trung Ngo Trong, Ville Hautamaki, Tomi Kinnunen

    Abstract: VoxCeleb datasets are widely used in speaker recognition studies. Our work serves two purposes. First, we provide speaker age labels and (an alternative) annotation of speaker gender. Second, we demonstrate the use of this metadata by constructing age and gender recognition models with different features and classifiers. We query different celebrity databases and apply consensus rules to derive ag…

    Submitted 20 December, 2021; v1 submitted 28 September, 2021; originally announced September 2021.

    Comments: Accepted for presentation at ASRU 2021; repository: https://github.com/hechmik/voxceleb_enrichment_age_gender

  8. arXiv:2002.03801  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    An initial investigation on optimizing tandem speaker verification and countermeasure systems using reinforcement learning

    Authors: Anssi Kanervisto, Ville Hautamäki, Tomi Kinnunen, Junichi Yamagishi

    Abstract: The spoofing countermeasure (CM) systems in automatic speaker verification (ASV) are not typically used in isolation of each other. These systems can be combined, for example, into a cascaded system where CM produces first a decision whether the input is synthetic or bona fide speech. In case the CM decides it is a bona fide sample, then the ASV system will consider it for speaker verification. En…

    Submitted 8 April, 2020; v1 submitted 6 February, 2020; originally announced February 2020.

    Comments: Odyssey 2020 The Speaker and Language Recognition Workshop. Code available at https://github.com/Miffyli/asv-cm-reinforce

  9. arXiv:1907.03164  [pdf, other]

    cs.LG eess.AS stat.ML

    Towards Debugging Deep Neural Networks by Generating Speech Utterances

    Authors: Bilal Soomro, Anssi Kanervisto, Trung Ngo Trong, Ville Hautamäki

    Abstract: Deep neural networks (DNN) are able to successfully process and classify speech utterances. However, understanding the reason behind a classification by DNN is difficult. One such debugging method used with image classification DNNs is activation maximization, which generates example-images that are classified as one of the classes. In this work, we evaluate applicability of this method to speech…

    Submitted 6 July, 2019; originally announced July 2019.

    Comments: Accepted to Interspeech 2019

  10. arXiv:1904.07386  [pdf, other]

    eess.AS cs.CL cs.SD

    I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences

    Authors: Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Hitoshi Yamamoto, Koji Okabe, Ville Vestman, Jing Huang, Guohong Ding, Hanwu Sun, Anthony Larcher, Rohan Kumar Das, Haizhou Li, Mickael Rouvier, Pierre-Michel Bousquet, Wei Rao, Qing Wang, Chunlei Zhang, Fahimeh Bahmaninezhad, Hector Delgado, Jose Patino, Qiongqiong Wang, Ling Guo, Takafumi Koshinaka, Jiacen Zhang, Koichi Shinoda, et al. (21 additional authors not shown)

    Abstract: The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of I4U consortium into NIST SRE series of evaluation. The primary objective of the current paper is to summarize the res…

    Submitted 15 April, 2019; originally announced April 2019.

    Comments: 5 pages

  11. arXiv:1811.03293  [pdf, other]

    eess.AS cs.SD

    Who Do I Sound Like? Showcasing Speaker Recognition Technology by YouTube Voice Search

    Authors: Ville Vestman, Bilal Soomro, Anssi Kanervisto, Ville Hautamäki, Tomi Kinnunen

    Abstract: The popularization of science can often be disregarded by scientists as it may be challenging to put highly sophisticated research into words that general public can understand. This work aims to help presenting speaker recognition research to public by proposing a publicly appealing concept for showcasing recognition systems. We leverage data from YouTube and use it in a large-scale voice search…

    Submitted 10 February, 2019; v1 submitted 8 November, 2018; originally announced November 2018.

    Comments: Accepted for presentation in ICASSP 2019

  12. arXiv:1804.08910  [pdf, other]

    cs.SD cs.CY eess.AS

    Perceptual Evaluation of the Effectiveness of Voice Disguise by Age Modification

    Authors: Rosa González Hautamäki, Anssi Kanervisto, Ville Hautamäki, Tomi Kinnunen

    Abstract: Voice disguise, purposeful modification of one's speaker identity with the aim of avoiding being identified as oneself, is a low-effort way to fool speaker recognition, whether performed by a human or an automatic speaker verification (ASV) system. We present an evaluation of the effectiveness of age stereotypes as a voice disguise strategy, as a follow up to our recent work where 60 native Finnis…

    Submitted 28 May, 2018; v1 submitted 24 April, 2018; originally announced April 2018.

    Comments: Accepted to Speaker Odyssey 2018: The Speaker and Language Recognition Workshop