Search | arXiv e-print repository

ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2Vec2.0 Based ASR

Authors: Vishwanath Pratap Singh, Federico Malato, Ville Hautamaki, Md. Sahidullah, Tomi Kinnunen

Abstract: While automatic speech recognition (ASR) greatly benefits from data augmentation, the augmentation recipes themselves tend to be heuristic. In this paper, we address one of the heuristic approach associated with balancing the right amount of augmented data in ASR training by introducing a reinforcement learning (RL) based dynamic adjustment of original-to-augmented data ratio (OAR). Unlike the fix… ▽ More While automatic speech recognition (ASR) greatly benefits from data augmentation, the augmentation recipes themselves tend to be heuristic. In this paper, we address one of the heuristic approach associated with balancing the right amount of augmented data in ASR training by introducing a reinforcement learning (RL) based dynamic adjustment of original-to-augmented data ratio (OAR). Unlike the fixed OAR approach in conventional data augmentation, our proposed method employs a deep Q-network (DQN) as the RL mechanism to learn the optimal dynamics of OAR throughout the wav2vec2.0 based ASR training. We conduct experiments using the LibriSpeech dataset with varying amounts of training data, specifically, the 10Min, 1H, 10H, and 100H splits to evaluate the efficacy of the proposed method under different data conditions. Our proposed method, on average, achieves a relative improvement of 4.96% over the open-source wav2vec2.0 base model on standard LibriSpeech test sets. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted: Interspeech 2024

Journal ref: Interspeech 2024

arXiv:2406.04913 [pdf, other]

Online Adaptation for Enhancing Imitation Learning Policies

Authors: Federico Malato, Ville Hautamaki

Abstract: Imitation learning enables autonomous agents to learn from human examples, without the need for a reward signal. Still, if the provided dataset does not encapsulate the task correctly, or when the task is too complex to be modeled, such agents fail to reproduce the expert policy. We propose to recover from these failures through online adaptation. Our approach combines the action proposal coming f… ▽ More Imitation learning enables autonomous agents to learn from human examples, without the need for a reward signal. Still, if the provided dataset does not encapsulate the task correctly, or when the task is too complex to be modeled, such agents fail to reproduce the expert policy. We propose to recover from these failures through online adaptation. Our approach combines the action proposal coming from a pre-trained policy with relevant experience recorded by an expert. The combination results in an adapted action that closely follows the expert. Our experiments show that an adapted agent performs better than its pure imitation learning counterpart. Notably, adapted agents can achieve reasonable performance even when the base, non-adapted policy catastrophically fails. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted at IEEE Conference on Games 2024, Milan, Italy

arXiv:2403.13801 [pdf, other]

Natural Language as Policies: Reasoning for Coordinate-Level Embodied Control with LLMs

Authors: Yusuke Mikami, Andrew Melnik, Jun Miura, Ville Hautamäki

Abstract: We demonstrate experimental results with LLMs that address robotics task planning problems. Recently, LLMs have been applied in robotics task planning, particularly using a code generation approach that converts complex high-level instructions into mid-level policy codes. In contrast, our approach acquires text descriptions of the task and scene objects, then formulates task planning through natur… ▽ More We demonstrate experimental results with LLMs that address robotics task planning problems. Recently, LLMs have been applied in robotics task planning, particularly using a code generation approach that converts complex high-level instructions into mid-level policy codes. In contrast, our approach acquires text descriptions of the task and scene objects, then formulates task planning through natural language reasoning, and outputs coordinate level control commands, thus reducing the necessity for intermediate representation code as policies with pre-defined APIs. Our approach is evaluated on a multi-modal prompt simulation benchmark, demonstrating that our prompt engineering experiments with natural language reasoning significantly enhance success rates compared to its absence. Furthermore, our approach illustrates the potential for natural language descriptions to transfer robotics skills from known tasks to previously unseen tasks. The project website: https://natural-language-as-policies.github.io/ △ Less

Submitted 6 April, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

Comments: 8 pages, 2 figures

ACM Class: I.2.9; I.2.7

arXiv:2401.16398 [pdf, other]

doi 10.1109/ICASSP48485.2024.10447339

Zero-shot Imitation Policy via Search in Demonstration Dataset

Authors: Federco Malato, Florian Leopold, Andrew Melnik, Ville Hautamaki

Abstract: Behavioral cloning uses a dataset of demonstrations to learn a policy. To overcome computationally expensive training procedures and address the policy adaptation problem, we propose to use latent spaces of pre-trained foundation models to index a demonstration dataset, instantly access similar relevant experiences, and copy behavior from these situations. Actions from a selected similar situation… ▽ More Behavioral cloning uses a dataset of demonstrations to learn a policy. To overcome computationally expensive training procedures and address the policy adaptation problem, we propose to use latent spaces of pre-trained foundation models to index a demonstration dataset, instantly access similar relevant experiences, and copy behavior from these situations. Actions from a selected similar situation can be performed by the agent until representations of the agent's current situation and the selected experience diverge in the latent space. Thus, we formulate our control problem as a dynamic search problem over a dataset of experts' demonstrations. We test our approach on BASALT MineRL-dataset in the latent representation of a Video Pre-Training model. We compare our model to state-of-the-art, Imitation Learning-based Minecraft agents. Our approach can effectively recover meaningful demonstrations and show human-like behavior of an agent in the Minecraft environment in a wide variety of scenarios. Experimental results reveal that performance of our search-based approach clearly wins in terms of accuracy and perceptual evaluation over learning-based models. △ Less

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.02626 [pdf, other]

Gradient weighting for speaker verification in extremely low Signal-to-Noise Ratio

Authors: Yi Ma, Kong Aik Lee, Ville Hautamäki, Meng Ge, Haizhou Li

Abstract: Speaker verification is hampered by background noise, particularly at extremely low Signal-to-Noise Ratio (SNR) under 0 dB. It is difficult to suppress noise without introducing unwanted artifacts, which adversely affects speaker verification. We proposed the mechanism called Gradient Weighting (Grad-W), which dynamically identifies and reduces artifact noise during prediction. The mechanism is ba… ▽ More Speaker verification is hampered by background noise, particularly at extremely low Signal-to-Noise Ratio (SNR) under 0 dB. It is difficult to suppress noise without introducing unwanted artifacts, which adversely affects speaker verification. We proposed the mechanism called Gradient Weighting (Grad-W), which dynamically identifies and reduces artifact noise during prediction. The mechanism is based on the property that the gradient indicates which parts of the input the model is paying attention to. Specifically, when the speaker network focuses on a region in the denoised utterance but not on the clean counterpart, we consider it artifact noise and assign higher weights for this region during optimization of enhancement. We validate it by training an enhancement model and testing the enhanced utterance on speaker verification. The experimental results show that our approach effectively reduces artifact noise, improving speaker verification across various SNR levels. △ Less

Submitted 4 January, 2024; originally announced January 2024.

Comments: Accepted by ICASSP 2024

arXiv:2306.09082 [pdf, other]

Behavioral Cloning via Search in Embedded Demonstration Dataset

Authors: Federico Malato, Florian Leopold, Ville Hautamaki, Andrew Melnik

Abstract: Behavioural cloning uses a dataset of demonstrations to learn a behavioural policy. To overcome various learning and policy adaptation problems, we propose to use latent space to index a demonstration dataset, instantly access similar relevant experiences, and copy behavior from these situations. Actions from a selected similar situation can be performed by the agent until representations of the a… ▽ More Behavioural cloning uses a dataset of demonstrations to learn a behavioural policy. To overcome various learning and policy adaptation problems, we propose to use latent space to index a demonstration dataset, instantly access similar relevant experiences, and copy behavior from these situations. Actions from a selected similar situation can be performed by the agent until representations of the agent's current situation and the selected experience diverge in the latent space. Thus, we formulate our control problem as a search problem over a dataset of experts' demonstrations. We test our approach on BASALT MineRL-dataset in the latent representation of a Video PreTraining model. We compare our model to state-of-the-art Minecraft agents. Our approach can effectively recover meaningful demonstrations and show human-like behavior of an agent in the Minecraft environment in a wide variety of scenarios. Experimental results reveal that performance of our search-based approach is comparable to trained models, while allowing zero-shot task adaptation by changing the demonstration examples. △ Less

Submitted 15 June, 2023; originally announced June 2023.

arXiv:2303.13512 [pdf, other]

Towards Solving Fuzzy Tasks with Human Feedback: A Retrospective of the MineRL BASALT 2022 Competition

Authors: Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Sharada Mohanty, Byron Galbraith, Ke Chen, Yan Song, Tianze Zhou, Bingquan Yu, He Liu, Kai Guan, Yu**g Hu, Tangjie Lv, Federico Malato, Florian Leopold, Amogh Raut, Ville Hautamäki, Andrew Melnik, Shu Ishida, João F. Henriques, Robert Klassert, Walter Laurito, Ellen Novoseller , et al. (5 additional authors not shown)

Abstract: To facilitate research in the direction of fine-tuning foundation models from human feedback, we held the MineRL BASALT Competition on Fine-Tuning from Human Feedback at NeurIPS 2022. The BASALT challenge asks teams to compete to develop algorithms to solve tasks with hard-to-specify reward functions in Minecraft. Through this competition, we aimed to promote the development of algorithms that use… ▽ More To facilitate research in the direction of fine-tuning foundation models from human feedback, we held the MineRL BASALT Competition on Fine-Tuning from Human Feedback at NeurIPS 2022. The BASALT challenge asks teams to compete to develop algorithms to solve tasks with hard-to-specify reward functions in Minecraft. Through this competition, we aimed to promote the development of algorithms that use human feedback as channels to learn the desired behavior. We describe the competition and provide an overview of the top solutions. We conclude by discussing the impact of the competition and future directions for improvement. △ Less

Submitted 23 March, 2023; originally announced March 2023.

arXiv:2212.13326 [pdf, other]

Behavioral Cloning via Search in Video PreTraining Latent Space

Authors: Federico Malato, Florian Leopold, Amogh Raut, Ville Hautamäki, Andrew Melnik

Abstract: Our aim is to build autonomous agents that can solve tasks in environments like Minecraft. To do so, we used an imitation learning-based approach. We formulate our control problem as a search problem over a dataset of experts' demonstrations, where the agent copies actions from a similar demonstration trajectory of image-action pairs. We perform a proximity search over the BASALT MineRL-dataset in… ▽ More Our aim is to build autonomous agents that can solve tasks in environments like Minecraft. To do so, we used an imitation learning-based approach. We formulate our control problem as a search problem over a dataset of experts' demonstrations, where the agent copies actions from a similar demonstration trajectory of image-action pairs. We perform a proximity search over the BASALT MineRL-dataset in the latent representation of a Video PreTraining model. The agent copies the actions from the expert trajectory as long as the distance between the state representations of the agent and the selected expert trajectory from the dataset do not diverge. Then the proximity search is repeated. Our approach can effectively recover meaningful demonstration trajectories and show human-like behavior of an agent in the Minecraft environment. △ Less

Submitted 17 April, 2023; v1 submitted 26 December, 2022; originally announced December 2022.

arXiv:2210.15385 [pdf, other]

Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs

Authors: Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, Haizhou Li

Abstract: We study a novel neural architecture and its training strategies of speaker encoder for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-size speaker embedding from a spoken utterance of various length. Contrastive learning is a typical self-supervised learning technique. However, the quality of the speaker encoder depends very much on the sa… ▽ More We study a novel neural architecture and its training strategies of speaker encoder for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-size speaker embedding from a spoken utterance of various length. Contrastive learning is a typical self-supervised learning technique. However, the quality of the speaker encoder depends very much on the sampling strategy of positive and negative pairs. It is common that we sample a positive pair of segments from the same utterance. Unfortunately, such poor-man's positive pairs (PPP) lack necessary diversity for the training of a robust encoder. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we study a method that finds diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any speaker labels, and achieve an equal error rate (EER) of 2.89\%, 3.17\% and 6.27\% under the proposed progressive clustering strategy, and an EER of 1.44\%, 1.77\% and 3.27\% under the two-stage learning strategy with pseudo labels, on the three test sets of VoxCeleb1. This novel solution outperforms the state-of-the-art self-supervised learning methods by a large margin, at the same time, achieves comparable results with the supervised learning counterpart. We also evaluate our self-supervised learning technique on LRS2 and LRW datasets, where the speaker information is unknown. All experiments suggest that the proposed neural architecture and sampling strategies are robust across datasets. △ Less

Submitted 27 October, 2022; originally announced October 2022.

Comments: 13 pages

arXiv:2205.07060 [pdf, other]

doi 10.1109/TG.2022.3173450

GAN-Aimbots: Using Machine Learning for Cheating in First Person Shooters

Authors: Anssi Kanervisto, Tomi Kinnunen, Ville Hautamäki

Abstract: Playing games with cheaters is not fun, and in a multi-billion-dollar video game industry with hundreds of millions of players, game developers aim to improve the security and, consequently, the user experience of their games by preventing cheating. Both traditional software-based methods and statistical systems have been successful in protecting against cheating, but recent advances in the automa… ▽ More Playing games with cheaters is not fun, and in a multi-billion-dollar video game industry with hundreds of millions of players, game developers aim to improve the security and, consequently, the user experience of their games by preventing cheating. Both traditional software-based methods and statistical systems have been successful in protecting against cheating, but recent advances in the automatic generation of content, such as images or speech, threaten the video game industry; they could be used to generate artificial gameplay indistinguishable from that of legitimate human players. To better understand this threat, we begin by reviewing the current state of multiplayer video game cheating, and then proceed to build a proof-of-concept method, GAN-Aimbot. By gathering data from various players in a first-person shooter game we show that the method improves players' performance while remaining hidden from automatic and manual protection mechanisms. By sharing this work we hope to raise awareness on this issue and encourage further research into protecting the gaming communities. △ Less

Submitted 14 May, 2022; originally announced May 2022.

Comments: Accepted to IEEE Transactions on Games. Source code available at https://github.com/miffyli/gan-aimbots

arXiv:2203.05074 [pdf, other]

The Transitive Information Theory and its Application to Deep Generative Models

Authors: Trung Ngo, Najwa Laabid, Ville Hautamäki, Merja Heinäniemi

Abstract: Paradoxically, a Variational Autoencoder (VAE) could be pushed in two opposite directions, utilizing powerful decoder model for generating realistic images but collapsing the learned representation, or increasing regularization coefficient for disentangling representation but ultimately generating blurry examples. Existing methods narrow the issues to the rate-distortion trade-off between compress… ▽ More Paradoxically, a Variational Autoencoder (VAE) could be pushed in two opposite directions, utilizing powerful decoder model for generating realistic images but collapsing the learned representation, or increasing regularization coefficient for disentangling representation but ultimately generating blurry examples. Existing methods narrow the issues to the rate-distortion trade-off between compression and reconstruction. We argue that a good reconstruction model does learn high capacity latents that encode more details, however, its use is hindered by two major issues: the prior is random noise which is completely detached from the posterior and allow no controllability in the generation; mean-field variational inference doesn't enforce hierarchy structure which makes the task of recombining those units into plausible novel output infeasible. As a result, we develop a system that learns a hierarchy of disentangled representation together with a mechanism for recombining the learned representation for generalization. This is achieved by introducing a minimal amount of inductive bias to learn controllable prior for the VAE. The idea is supported by here developed transitive information theory, that is, the mutual information between two target variables could alternately be maximized through the mutual information to the third variable, thus bypassing the rate-distortion bottleneck in VAE design. In particular, we show that our model, named SemafoVAE (inspired by the similar concept in computer science), could generate high-quality examples in a controllable manner, perform smooth traversals of the disentangled factors and intervention at a different level of representation hierarchy. △ Less

Submitted 28 March, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

arXiv:2201.09709 [pdf, other]

doi 10.1109/TASLP.2021.3138681

Optimizing Tandem Speaker Verification and Anti-Spoofing Systems

Authors: Anssi Kanervisto, Ville Hautamäki, Tomi Kinnunen, Junichi Yamagishi

Abstract: As automatic speaker verification (ASV) systems are vulnerable to spoofing attacks, they are typically used in conjunction with spoofing countermeasure (CM) systems to improve security. For example, the CM can first determine whether the input is human speech, then the ASV can determine whether this speech matches the speaker's identity. The performance of such a tandem system can be measured with… ▽ More As automatic speaker verification (ASV) systems are vulnerable to spoofing attacks, they are typically used in conjunction with spoofing countermeasure (CM) systems to improve security. For example, the CM can first determine whether the input is human speech, then the ASV can determine whether this speech matches the speaker's identity. The performance of such a tandem system can be measured with a tandem detection cost function (t-DCF). However, ASV and CM systems are usually trained separately, using different metrics and data, which does not optimize their combined performance. In this work, we propose to optimize the tandem system directly by creating a differentiable version of t-DCF and employing techniques from reinforcement learning. The results indicate that these approaches offer better outcomes than finetuning, with our method providing a 20% relative improvement in the t-DCF in the ASVSpoof19 dataset in a constrained setting. △ Less

Submitted 24 January, 2022; originally announced January 2022.

Comments: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing. Published version available at: https://ieeexplore.ieee.org/document/9664367

Journal ref: in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 477-488, 2022

arXiv:2201.07719 [pdf, other]

Improving Behavioural Cloning with Human-Driven Dynamic Dataset Augmentation

Authors: Federico Malato, Joona Jehkonen, Ville Hautamäki

Abstract: Behavioural cloning has been extensively used to train agents and is recognized as a fast and solid approach to teach general behaviours based on expert trajectories. Such method follows the supervised learning paradigm and it strongly depends on the distribution of the data. In our paper, we show how combining behavioural cloning with human-in-the-loop training solves some of its flaws and provid… ▽ More Behavioural cloning has been extensively used to train agents and is recognized as a fast and solid approach to teach general behaviours based on expert trajectories. Such method follows the supervised learning paradigm and it strongly depends on the distribution of the data. In our paper, we show how combining behavioural cloning with human-in-the-loop training solves some of its flaws and provides an agent task-specific corrections to overcome tricky situations while speeding up the training time and lowering the required resources. To do this, we introduce a novel approach that allows an expert to take control of the agent at any moment during a simulation and provide optimal solutions to its problematic situations. Our experiments show that this approach leads to better policies both in terms of quantitative evaluation and in human-likeliness. △ Less

Submitted 19 January, 2022; originally announced January 2022.

Comments: 6 pages, 5 figures, 2 code snippets, accepted at the AAAI-22 Workshop on Interactive Machine Learning

arXiv:2110.03869 [pdf, other]

Self-supervised Speaker Recognition with Loss-gated Learning

Authors: Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, Haizhou Li

Abstract: In self-supervised learning for speaker recognition, pseudo labels are useful as the supervision signals. It is a known fact that a speaker recognition model doesn't always benefit from pseudo labels due to their unreliability. In this work, we observe that a speaker recognition network tends to model the data with reliable labels faster than those with unreliable labels. This motivates us to stud… ▽ More In self-supervised learning for speaker recognition, pseudo labels are useful as the supervision signals. It is a known fact that a speaker recognition model doesn't always benefit from pseudo labels due to their unreliability. In this work, we observe that a speaker recognition network tends to model the data with reliable labels faster than those with unreliable labels. This motivates us to study a loss-gated learning (LGL) strategy, which extracts the reliable labels through the fitting ability of the neural network during training. With the proposed LGL, our speaker recognition model obtains a $46.3\%$ performance gain over the system without it. Further, the proposed self-supervised speaker recognition with LGL trained on the VoxCeleb2 dataset without any labels achieves an equal error rate of $1.66\%$ on the VoxCeleb1 original test set. Code has been made available at: https://github.com/TaoRuijie/Loss-Gated-Learning. △ Less

Submitted 14 July, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: 5 pages, 3 figures

arXiv:2110.00940 [pdf, other]

PL-EESR: Perceptual Loss Based END-TO-END Robust Speaker Representation Extraction

Authors: Yi Ma, Kong Aik Lee, Ville Hautamaki, Haizhou Li

Abstract: Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise. However, excessive suppression may lead to speech distortion and speaker information loss, which degrades the performance of speaker embedding extraction. To alleviate this problem, we propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation… ▽ More Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise. However, excessive suppression may lead to speech distortion and speaker information loss, which degrades the performance of speaker embedding extraction. To alleviate this problem, we propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction. This framework is optimized based on the feedback of the speaker identification task and the high-level perceptual deviation between the raw speech signal and its noisy version. We conducted speaker verification tasks in both noisy and clean environment respectively to evaluate our system. Compared to the baseline, our method shows better performance in both clean and noisy environments, which means our method can not only enhance the speaker relative information but also avoid adding distortions. △ Less

Submitted 3 October, 2021; originally announced October 2021.

arXiv:2109.13510 [pdf, other]

VoxCeleb Enrichment for Age and Gender Recognition

Authors: Khaled Hechmi, Trung Ngo Trong, Ville Hautamaki, Tomi Kinnunen

Abstract: VoxCeleb datasets are widely used in speaker recognition studies. Our work serves two purposes. First, we provide speaker age labels and (an alternative) annotation of speaker gender. Second, we demonstrate the use of this metadata by constructing age and gender recognition models with different features and classifiers. We query different celebrity databases and apply consensus rules to derive ag… ▽ More VoxCeleb datasets are widely used in speaker recognition studies. Our work serves two purposes. First, we provide speaker age labels and (an alternative) annotation of speaker gender. Second, we demonstrate the use of this metadata by constructing age and gender recognition models with different features and classifiers. We query different celebrity databases and apply consensus rules to derive age and gender labels. We also compare the original VoxCeleb gender labels with our labels to identify records that might be mislabeled in the original VoxCeleb data. On modeling side, we design a comprehensive study of multiple features and models for recognizing gender and age. Our best system, using i-vector features, achieved an F1-score of 0.9829 for gender recognition task using logistic regression, and the lowest mean absolute error (MAE) in age regression, 9.443 years, is obtained with ridge regression. This indicates challenge in age estimation from in-the-wild style speech data. △ Less

Submitted 20 December, 2021; v1 submitted 28 September, 2021; originally announced September 2021.

Comments: Accepted for presentation at ASRU 2021; repository: https://github.com/hechmik/voxceleb_enrichment_age_gender

arXiv:2107.00703 [pdf, other]

Distilling Reinforcement Learning Tricks for Video Games

Authors: Anssi Kanervisto, Christian Scheller, Yanick Schraner, Ville Hautamäki

Abstract: Reinforcement learning (RL) research focuses on general solutions that can be applied across different domains. This results in methods that RL practitioners can use in almost any domain. However, recent studies often lack the engineering steps ("tricks") which may be needed to effectively use RL, such as reward sha**, curriculum learning, and splitting a large task into smaller chunks. Such tri… ▽ More Reinforcement learning (RL) research focuses on general solutions that can be applied across different domains. This results in methods that RL practitioners can use in almost any domain. However, recent studies often lack the engineering steps ("tricks") which may be needed to effectively use RL, such as reward sha**, curriculum learning, and splitting a large task into smaller chunks. Such tricks are common, if not necessary, to achieve state-of-the-art results and win RL competitions. To ease the engineering efforts, we distill descriptions of tricks from state-of-the-art results and study how well these tricks can improve a standard deep Q-learning agent. The long-term goal of this work is to enable combining proven RL methods with domain-specific tricks by providing a unified software framework and accompanying insights in multiple domains. △ Less

Submitted 1 July, 2021; originally announced July 2021.

Comments: To appear in IEEE Conference on Games 2021. Experiment code is available at https://github.com/Miffyli/rl-human-prior-tricks

arXiv:2104.10753 [pdf, other]

Multi-task Learning with Attention for End-to-end Autonomous Driving

Authors: Keishi Ishihara, Anssi Kanervisto, Jun Miura, Ville Hautamäki

Abstract: Autonomous driving systems need to handle complex scenarios such as lane following, avoiding collisions, taking turns, and responding to traffic signals. In recent years, approaches based on end-to-end behavioral cloning have demonstrated remarkable performance in point-to-point navigational scenarios, using a realistic simulator and standard benchmarks. Offline imitation learning is readily avail… ▽ More Autonomous driving systems need to handle complex scenarios such as lane following, avoiding collisions, taking turns, and responding to traffic signals. In recent years, approaches based on end-to-end behavioral cloning have demonstrated remarkable performance in point-to-point navigational scenarios, using a realistic simulator and standard benchmarks. Offline imitation learning is readily available, as it does not require expensive hand annotation or interaction with the target environment, but it is difficult to obtain a reliable system. In addition, existing methods have not specifically addressed the learning of reaction for traffic lights, which are a rare occurrence in the training datasets. Inspired by the previous work on multi-task learning and attention modeling, we propose a novel multi-task attention-aware network in the conditional imitation learning (CIL) framework. This does not only improve the success rate of standard benchmarks, but also the ability to react to traffic lights, which we show with standard benchmarks. △ Less

Submitted 21 April, 2021; originally announced April 2021.

Comments: Accepted to CVPR 2021 Workshop on Autonomous Driving

arXiv:2012.04199 [pdf, other]

Cost Sensitive Optimization of Deepfake Detector

Authors: Ivan Kukanov, Janne Karttunen, Hannu Sillanpää, Ville Hautamäki

Abstract: Since the invention of cinema, the manipulated videos have existed. But generating manipulated videos that can fool the viewer has been a time-consuming endeavor. With the dramatic improvements in the deep generative modeling, generating believable looking fake videos has become a reality. In the present work, we concentrate on the so-called deepfake videos, where the source face is swapped with t… ▽ More Since the invention of cinema, the manipulated videos have existed. But generating manipulated videos that can fool the viewer has been a time-consuming endeavor. With the dramatic improvements in the deep generative modeling, generating believable looking fake videos has become a reality. In the present work, we concentrate on the so-called deepfake videos, where the source face is swapped with the targets. We argue that deepfake detection task should be viewed as a screening task, where the user, such as the video streaming platform, will screen a large number of videos daily. It is clear then that only a small fraction of the uploaded videos are deepfakes, so the detection performance needs to be measured in a cost-sensitive way. Preferably, the model parameters also need to be estimated in the same way. This is precisely what we propose here. △ Less

Submitted 7 December, 2020; originally announced December 2020.

arXiv:2012.01244 [pdf, other]

General Characterization of Agents by States they Visit

Authors: Anssi Kanervisto, Tomi Kinnunen, Ville Hautamäki

Abstract: Behavioural characterizations (BCs) of decision-making agents, or their policies, are used to study outcomes of training algorithms and as part of the algorithms themselves to encourage unique policies, match expert policy or restrict changes to policy per update. However, previously presented solutions are not applicable in general, either due to lack of expressive power, computational constraint… ▽ More Behavioural characterizations (BCs) of decision-making agents, or their policies, are used to study outcomes of training algorithms and as part of the algorithms themselves to encourage unique policies, match expert policy or restrict changes to policy per update. However, previously presented solutions are not applicable in general, either due to lack of expressive power, computational constraint or constraints on the policy or environment. Furthermore, many BCs rely on the actions of policies. We discuss and demonstrate how these BCs can be misleading, especially in stochastic environments, and propose a novel solution based on what states policies visit. We run experiments to evaluate the quality of the proposed BC against baselines and evaluate their use in studying training algorithms, novelty search and trust-region policy optimization. The code is available at https://github.com/miffyli/policy-supervectors. △ Less

Submitted 28 October, 2021; v1 submitted 2 December, 2020; originally announced December 2020.

Comments: Deep Reinforcement Learning Workshop, NeurIPS 2021

arXiv:2006.07698 [pdf, other]

Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya

Authors: Abrhalei Tela, Abraham Woubie, Ville Hautamaki

Abstract: In recent years, transformer models have achieved great success in natural language processing (NLP) tasks. Most of the current state-of-the-art NLP results are achieved by using monolingual transformer models, where the model is pre-trained using a single language unlabelled text corpus. Then, the model is fine-tuned to the specific downstream task. However, the cost of pre-training a new transfo… ▽ More In recent years, transformer models have achieved great success in natural language processing (NLP) tasks. Most of the current state-of-the-art NLP results are achieved by using monolingual transformer models, where the model is pre-trained using a single language unlabelled text corpus. Then, the model is fine-tuned to the specific downstream task. However, the cost of pre-training a new transformer model is high for most languages. In this work, we propose a cost-effective transfer learning method to adopt a strong source language model, trained from a large monolingual corpus to a low-resource language. Thus, using XLNet language model, we demonstrate competitive performance with mBERT and a pre-trained target language model on the cross-lingual sentiment (CLS) dataset and on a new sentiment analysis dataset for low-resourced language Tigrinya. With only 10k examples of the given Tigrinya sentiment analysis dataset, English XLNet has achieved 78.88% F1-Score outperforming BERT and mBERT by 10% and 7%, respectively. More interestingly, fine-tuning (English) XLNet model on the CLS dataset has promising results compared to mBERT and even outperformed mBERT for one dataset of the Japanese language. △ Less

Submitted 19 June, 2020; v1 submitted 13 June, 2020; originally announced June 2020.

arXiv:2005.03374 [pdf, other]

Playing Minecraft with Behavioural Cloning

Authors: Anssi Kanervisto, Janne Karttunen, Ville Hautamäki

Abstract: MineRL 2019 competition challenged participants to train sample-efficient agents to play Minecraft, by using a dataset of human gameplay and a limit number of steps the environment. We approached this task with behavioural cloning by predicting what actions human players would take, and reached fifth place in the final ranking. Despite being a simple algorithm, we observed the performance of such… ▽ More MineRL 2019 competition challenged participants to train sample-efficient agents to play Minecraft, by using a dataset of human gameplay and a limit number of steps the environment. We approached this task with behavioural cloning by predicting what actions human players would take, and reached fifth place in the final ranking. Despite being a simple algorithm, we observed the performance of such an approach can vary significantly, based on when the training is stopped. In this paper, we detail our submission to the competition, run further experiments to study how performance varied over training and study how different engineering decisions affected these results. △ Less

Submitted 7 May, 2020; originally announced May 2020.

Comments: To appear in Post Proceedings of the Competitions & Demonstrations Track @ NeurIPS2019. Source code available at https://github.com/Miffyli/minecraft-bc

arXiv:2004.00981 [pdf, other]

Benchmarking End-to-End Behavioural Cloning on Video Games

Authors: Anssi Kanervisto, Joonas Pussinen, Ville Hautamäki

Abstract: Behavioural cloning, where a computer is taught to perform a task based on demonstrations, has been successfully applied to various video games and robotics tasks, with and without reinforcement learning. This also includes end-to-end approaches, where a computer plays a video game like humans do: by looking at the image displayed on the screen, and sending keystrokes to the game. As a general app… ▽ More Behavioural cloning, where a computer is taught to perform a task based on demonstrations, has been successfully applied to various video games and robotics tasks, with and without reinforcement learning. This also includes end-to-end approaches, where a computer plays a video game like humans do: by looking at the image displayed on the screen, and sending keystrokes to the game. As a general approach to playing video games, this has many inviting properties: no need for specialized modifications to the game, no lengthy training sessions and the ability to re-use the same tools across different games. However, related work includes game-specific engineering to achieve the results. We take a step towards a general approach and study the general applicability of behavioural cloning on twelve video games, including six modern video games (published after 2010), by using human demonstrations as training data. Our results show that these agents cannot match humans in raw performance but do learn basic dynamics and rules. We also demonstrate how the quality of the data matters, and how recording data from humans is subject to a state-action mismatch, due to human reflexes. △ Less

Submitted 18 May, 2020; v1 submitted 2 April, 2020; originally announced April 2020.

Comments: To appear in IEEE Conference on Games 2020. Experiment code available at https://github.com/joonaspu/video-game-behavioural-cloning and https://github.com/joonaspu/ViControl

arXiv:2004.00980 [pdf, other]

Action Space Sha** in Deep Reinforcement Learning

Authors: Anssi Kanervisto, Christian Scheller, Ville Hautamäki

Abstract: Reinforcement learning (RL) has been successful in training agents in various learning environments, including video-games. However, such work modifies and shrinks the action space from the game's original. This is to avoid trying "pointless" actions and to ease the implementation. Currently, this is mostly done based on intuition, with little systematic research supporting the design decisions. I… ▽ More Reinforcement learning (RL) has been successful in training agents in various learning environments, including video-games. However, such work modifies and shrinks the action space from the game's original. This is to avoid trying "pointless" actions and to ease the implementation. Currently, this is mostly done based on intuition, with little systematic research supporting the design decisions. In this work, we aim to gain insight on these action space modifications by conducting extensive experiments in video-game environments. Our results show how domain-specific removal of actions and discretization of continuous actions can be crucial for successful learning. With these insights, we hope to ease the use of RL in new environments, by clarifying what action-spaces are easy to learn. △ Less

Submitted 26 May, 2020; v1 submitted 2 April, 2020; originally announced April 2020.

Comments: To appear in IEEE Conference on Games 2020. Experiment code is available at https://github.com/Miffyli/rl-action-space-sha**

arXiv:2002.03801 [pdf, other]

An initial investigation on optimizing tandem speaker verification and countermeasure systems using reinforcement learning

Authors: Anssi Kanervisto, Ville Hautamäki, Tomi Kinnunen, Junichi Yamagishi

Abstract: The spoofing countermeasure (CM) systems in automatic speaker verification (ASV) are not typically used in isolation of each other. These systems can be combined, for example, into a cascaded system where CM produces first a decision whether the input is synthetic or bona fide speech. In case the CM decides it is a bona fide sample, then the ASV system will consider it for speaker verification. En… ▽ More The spoofing countermeasure (CM) systems in automatic speaker verification (ASV) are not typically used in isolation of each other. These systems can be combined, for example, into a cascaded system where CM produces first a decision whether the input is synthetic or bona fide speech. In case the CM decides it is a bona fide sample, then the ASV system will consider it for speaker verification. End users of the system are not interested in the performance of the individual sub-modules, but instead are interested in the performance of the combined system. Such combination can be evaluated with tandem detection cost function (t-DCF) measure, yet the individual components are trained separately from each other using their own performance metrics. In this work we study training the ASV and CM components together for a better t-DCF measure by using reinforcement learning. We demonstrate that such training procedure indeed is able to improve the performance of the combined system, and does so with more reliable results than with the standard supervised learning techniques we compare against. △ Less

Submitted 8 April, 2020; v1 submitted 6 February, 2020; originally announced February 2020.

Comments: Odyssey 2020 The Speaker and Language Recognition Workshop. Code available at https://github.com/Miffyli/asv-cm-reinforce

arXiv:1907.03164 [pdf, other]

Towards Debugging Deep Neural Networks by Generating Speech Utterances

Authors: Bilal Soomro, Anssi Kanervisto, Trung Ngo Trong, Ville Hautamäki

Abstract: Deep neural networks (DNN) are able to successfully process and classify speech utterances. However, understanding the reason behind a classification by DNN is difficult. One such debugging method used with image classification DNNs is activation maximization, which generates example-images that are classified as one of the classes. In this work, we evaluate applicability of this method to speech… ▽ More Deep neural networks (DNN) are able to successfully process and classify speech utterances. However, understanding the reason behind a classification by DNN is difficult. One such debugging method used with image classification DNNs is activation maximization, which generates example-images that are classified as one of the classes. In this work, we evaluate applicability of this method to speech utterance classifiers as the means to understanding what DNN "listens to". We trained a classifier using the speech command corpus and then use activation maximization to pull samples from the trained model. Then we synthesize audio from features using WaveNet vocoder for subjective analysis. We measure the quality of generated samples by objective measurements and crowd-sourced human evaluations. Results show that when combined with the prior of natural speech, activation maximization can be used to generate examples of different classes. Based on these results, activation maximization can be used to start opening up the DNN black-box in speech tasks. △ Less

Submitted 6 July, 2019; originally announced July 2019.

Comments: Accepted to Interspeech 2019

arXiv:1905.04192 [pdf, other]

Do Autonomous Agents Benefit from Hearing?

Authors: Abraham Woubie, Anssi Kanervisto, Janne Karttunen, Ville Hautamaki

Abstract: Map** states to actions in deep reinforcement learning is mainly based on visual information. The commonly used approach for dealing with visual information is to extract pixels from images and use them as state representation for reinforcement learning agent. But, any vision only agent is handicapped by not being able to sense audible cues. Using hearing, animals are able to sense targets that… ▽ More Map** states to actions in deep reinforcement learning is mainly based on visual information. The commonly used approach for dealing with visual information is to extract pixels from images and use them as state representation for reinforcement learning agent. But, any vision only agent is handicapped by not being able to sense audible cues. Using hearing, animals are able to sense targets that are outside of their visual range. In this work, we propose the use of audio as complementary information to visual only in state representation. We assess the impact of such multi-modal setup in reach-the-goal tasks in ViZDoom environment. Results show that the agent improves its behavior when visual information is accompanied with audio features. △ Less

Submitted 10 May, 2019; originally announced May 2019.

arXiv:1905.00741 [pdf, other]

From Video Game to Real Robot: The Transfer between Action Spaces

Authors: Janne Karttunen, Anssi Kanervisto, Ville Kyrki, Ville Hautamäki

Abstract: Deep reinforcement learning has proven to be successful for learning tasks in simulated environments, but applying same techniques for robots in real-world domain is more challenging, as they require hours of training. To address this, transfer learning can be used to train the policy first in a simulated environment and then transfer it to physical agent. As the simulation never matches reality p… ▽ More Deep reinforcement learning has proven to be successful for learning tasks in simulated environments, but applying same techniques for robots in real-world domain is more challenging, as they require hours of training. To address this, transfer learning can be used to train the policy first in a simulated environment and then transfer it to physical agent. As the simulation never matches reality perfectly, the physics, visuals and action spaces by necessity differ between these environments to some degree. In this work, we study how general video games can be directly used instead of fine-tuned simulations for the sim-to-real transfer. Especially, we study how the agent can learn the new action space autonomously, when the game actions do not match the robot actions. Our results show that the different action space can be learned by re-training only part of neural network and we obtain above 90% mean success rate in simulation and robot experiments. △ Less

Submitted 23 March, 2020; v1 submitted 2 May, 2019; originally announced May 2019.

Comments: Two first authors contributed equally. Accepted by ICASSP 2020

arXiv:1904.07386 [pdf, other]

I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences

Authors: Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Hitoshi Yamamoto, Koji Okabe, Ville Vestman, **g Huang, Guohong Ding, Hanwu Sun, Anthony Larcher, Rohan Kumar Das, Haizhou Li, Mickael Rouvier, Pierre-Michel Bousquet, Wei Rao, Qing Wang, Chunlei Zhang, Fahimeh Bahmaninezhad, Hector Delgado, Jose Patino, Qiongqiong Wang, Ling Guo, Takafumi Koshinaka, Jiacen Zhang, Koichi Shinoda , et al. (21 additional authors not shown)

Abstract: The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of I4U consortium into NIST SRE series of evaluation. The primary objective of the current paper is to summarize the res… ▽ More The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of I4U consortium into NIST SRE series of evaluation. The primary objective of the current paper is to summarize the results and lessons learned based on the twelve sub-systems and their fusion submitted to SRE'18. It is also our intention to present a shared view on the advancements, progresses, and major paradigm shifts that we have witnessed as an SRE participant in the past decade from SRE'08 to SRE'18. In this regard, we have seen, among others, a paradigm shift from supervector representation to deep speaker embedding, and a switch of research challenge from channel compensation to domain adaptation. △ Less

Submitted 15 April, 2019; originally announced April 2019.

Comments: 5 pages

arXiv:1811.03293 [pdf, other]

Who Do I Sound Like? Showcasing Speaker Recognition Technology by YouTube Voice Search

Authors: Ville Vestman, Bilal Soomro, Anssi Kanervisto, Ville Hautamäki, Tomi Kinnunen

Abstract: The popularization of science can often be disregarded by scientists as it may be challenging to put highly sophisticated research into words that general public can understand. This work aims to help presenting speaker recognition research to public by proposing a publicly appealing concept for showcasing recognition systems. We leverage data from YouTube and use it in a large-scale voice search… ▽ More The popularization of science can often be disregarded by scientists as it may be challenging to put highly sophisticated research into words that general public can understand. This work aims to help presenting speaker recognition research to public by proposing a publicly appealing concept for showcasing recognition systems. We leverage data from YouTube and use it in a large-scale voice search web application that finds the celebrity voices that best match to the user's voice. The concept was tested in a public event as well as "in the wild" and the received feedback was mostly positive. The i-vector based speaker identification back end was found to be fast (665 ms per request) and had a high identification accuracy (93 %) for the YouTube target speakers. To help other researchers to develop the idea further, we share the source codes of the web platform used for the demo at https://github.com/bilalsoomro/speech-demo-platform. △ Less

Submitted 10 February, 2019; v1 submitted 8 November, 2018; originally announced November 2018.

Comments: Accepted for presentation in ICASSP 2019

arXiv:1807.10110 [pdf, other]

ToriLLE: Learning Environment for Hand-to-Hand Combat

Authors: Anssi Kanervisto, Ville Hautamäki

Abstract: We present Toribash Learning Environment (ToriLLE), a learning environment for machine learning agents based on the video game Toribash. Toribash is a MuJoCo-like environment of two humanoid character fighting each other hand-to-hand, controlled by changing actuation modes of the joints. Competitive nature of Toribash as well its focused domain provide a platform for evaluating self-play methods,… ▽ More We present Toribash Learning Environment (ToriLLE), a learning environment for machine learning agents based on the video game Toribash. Toribash is a MuJoCo-like environment of two humanoid character fighting each other hand-to-hand, controlled by changing actuation modes of the joints. Competitive nature of Toribash as well its focused domain provide a platform for evaluating self-play methods, and evaluating machine learning agents against human players. In this paper we describe the environment with ToriLLE's capabilities and limitations, and experimentally show its applicability as a learning environment. The source code of the environment and conducted experiments can be found at https://github.com/Miffyli/ToriLLE. △ Less

Submitted 4 June, 2019; v1 submitted 26 July, 2018; originally announced July 2018.

Comments: https://github.com/Miffyli/ToriLLE . Accepted to IEEE Conference on Games 2019

arXiv:1804.11067 [pdf, other]

Staircase Network: structural language identification via hierarchical attentive units

Authors: Trung Ngo Trong, Ville Hautamäki, Kristiina Jokinen

Abstract: Language recognition system is typically trained directly to optimize classification error on the target language labels, without using the external, or meta-information in the estimation of the model parameters. However labels are not independent of each other, there is a dependency enforced by, for example, the language family, which affects negatively on classification. The other external infor… ▽ More Language recognition system is typically trained directly to optimize classification error on the target language labels, without using the external, or meta-information in the estimation of the model parameters. However labels are not independent of each other, there is a dependency enforced by, for example, the language family, which affects negatively on classification. The other external information sources (e.g. audio encoding, telephony or video speech) can also decrease classification accuracy. In this paper, we attempt to solve these issues by constructing a deep hierarchical neural network, where different levels of meta-information are encapsulated by attentive prediction units and also embedded into the training progress. The proposed method learns auxiliary tasks to obtain robust internal representation and to construct a variant of attentive units within the hierarchical model. The final result is the structural prediction of the target language and a closely related language family. The algorithm reflects a "staircase" way of learning in both its architecture and training, advancing from the fundamental audio encoding to the language family level and finally to the target language level. This process not only improves generalization but also tackles the issues of imbalanced class priors and channel variability in the deep neural network model. Our experimental findings show that the proposed architecture outperforms the state-of-the-art i-vector approaches on both small and big language corpora by a significant margin. △ Less

Submitted 30 April, 2018; originally announced April 2018.

arXiv:1804.08910 [pdf, other]

Perceptual Evaluation of the Effectiveness of Voice Disguise by Age Modification

Authors: Rosa González Hautamäki, Anssi Kanervisto, Ville Hautamäki, Tomi Kinnunen

Abstract: Voice disguise, purposeful modification of one's speaker identity with the aim of avoiding being identified as oneself, is a low-effort way to fool speaker recognition, whether performed by a human or an automatic speaker verification (ASV) system. We present an evaluation of the effectiveness of age stereotypes as a voice disguise strategy, as a follow up to our recent work where 60 native Finnis… ▽ More Voice disguise, purposeful modification of one's speaker identity with the aim of avoiding being identified as oneself, is a low-effort way to fool speaker recognition, whether performed by a human or an automatic speaker verification (ASV) system. We present an evaluation of the effectiveness of age stereotypes as a voice disguise strategy, as a follow up to our recent work where 60 native Finnish speakers attempted to sound like an elderly and like a child. In that study, we presented evidence that both ASV and human observers could easily miss the target speaker but we did not address how believable the presented vocal age stereotypes were; this study serves to fill that gap. The interesting cases would be speakers who succeed in being missed by the ASV system, and which a typical listener cannot detect as being a disguise. We carry out a perceptual test to study the quality of the disguised speech samples. The listening test was carried out both locally and with the help of Amazon's Mechanical Turk (MT) crowd-workers. A total of 91 listeners participated in the test and were instructed to estimate both the speaker's chronological and intended age. The results indicate that age estimations for the intended old and child voices for female speakers were towards the target age groups, while for male speakers, the age estimations corresponded to the direction of the target voice only for elderly voices. In the case of intended child's voice, listeners estimated the age of male speakers to be older than their chronological age for most of the speakers and not the intended target age. △ Less

Submitted 28 May, 2018; v1 submitted 24 April, 2018; originally announced April 2018.

Comments: Accepted to Speaker Odyssey 2018: The Speaker and Language Recognition Workshop

arXiv:1602.01929 [pdf, other]

Fantastic 4 system for NIST 2015 Language Recognition Evaluation

Authors: Kong Aik Lee, Ville Hautamäki, Anthony Larcher, Wei Rao, Hanwu Sun, Trung Hieu Nguyen, Guangsen Wang, Aleksandr Sizov, Ivan Kukanov, Amir Poorjam, Trung Ngo Trong, Xiong Xiao, Cheng-Lin Xu, Hai-Hua Xu, Bin Ma, Haizhou Li, Sylvain Meignier

Abstract: This article describes the systems jointly submitted by Institute for Infocomm (I$^2$R), the Laboratoire d'Informatique de l'Université du Maine (LIUM), Nanyang Technology University (NTU) and the University of Eastern Finland (UEF) for 2015 NIST Language Recognition Evaluation (LRE). The submitted system is a fusion of nine sub-systems based on i-vectors extracted from different types of features… ▽ More This article describes the systems jointly submitted by Institute for Infocomm (I$^2$R), the Laboratoire d'Informatique de l'Université du Maine (LIUM), Nanyang Technology University (NTU) and the University of Eastern Finland (UEF) for 2015 NIST Language Recognition Evaluation (LRE). The submitted system is a fusion of nine sub-systems based on i-vectors extracted from different types of features. Given the i-vectors, several classifiers are adopted for the language detection task including support vector machines (SVM), multi-class logistic regression (MCLR), Probabilistic Linear Discriminant Analysis (PLDA) and Deep Neural Networks (DNN). △ Less

Submitted 5 February, 2016; originally announced February 2016.

Comments: Technical report for NIST LRE 2015 Workshop

Showing 1–34 of 34 results for author: Hautamaki, V