-
Distilling the Knowledge from Conditional Normalizing Flows
Authors:
Dmitry Baranchuk,
Vladimir Aliev,
Artem Babenko
Abstract:
Normalizing flows are a powerful class of generative models demonstrating strong performance in several speech and vision problems. In contrast to other generative models, normalizing flows are latent variable models with tractable likelihoods and allow for stable training. However, they have to be carefully designed to represent invertible functions with efficient Jacobian determinant calculation…
▽ More
Normalizing flows are a powerful class of generative models demonstrating strong performance in several speech and vision problems. In contrast to other generative models, normalizing flows are latent variable models with tractable likelihoods and allow for stable training. However, they have to be carefully designed to represent invertible functions with efficient Jacobian determinant calculation. In practice, these requirements lead to overparameterized and sophisticated architectures that are inferior to alternative feed-forward models in terms of inference time and memory consumption. In this work, we investigate whether one can distill flow-based models into more efficient alternatives. We provide a positive answer to this question by proposing a simple distillation approach and demonstrating its effectiveness on state-of-the-art conditional flow-based models for image super-resolution and speech synthesis.
△ Less
Submitted 5 August, 2021; v1 submitted 23 June, 2021;
originally announced June 2021.
-
Free-Lunch Saliency via Attention in Atari Agents
Authors:
Dmitry Nikulin,
Anastasia Ianina,
Vladimir Aliev,
Sergey Nikolenko
Abstract:
We propose a new approach to visualize saliency maps for deep neural network models and apply it to deep reinforcement learning agents trained on Atari environments. Our method adds an attention module that we call FLS (Free Lunch Saliency) to the feature extractor from an established baseline (Mnih et al., 2015). This addition results in a trainable model that can produce saliency maps, i.e., vis…
▽ More
We propose a new approach to visualize saliency maps for deep neural network models and apply it to deep reinforcement learning agents trained on Atari environments. Our method adds an attention module that we call FLS (Free Lunch Saliency) to the feature extractor from an established baseline (Mnih et al., 2015). This addition results in a trainable model that can produce saliency maps, i.e., visualizations of the importance of different parts of the input for the agent's current decision making. We show experimentally that a network with an FLS module exhibits performance similar to the baseline (i.e., it is "free", with no performance cost) and can be used as a drop-in replacement for reinforcement learning agents. We also design another feature extractor that scores slightly lower but provides higher-fidelity visualizations. In addition to attained scores, we report saliency metrics evaluated on the Atari-HEAD dataset of human gameplay.
△ Less
Submitted 30 October, 2019; v1 submitted 7 August, 2019;
originally announced August 2019.
-
Learning State Representations in Complex Systems with Multimodal Data
Authors:
Pavel Solovev,
Vladimir Aliev,
Pavel Ostyakov,
Gleb Sterkin,
Elizaveta Logacheva,
Stepan Troeshestov,
Roman Suvorov,
Anton Mashikhin,
Oleg Khomenko,
Sergey I. Nikolenko
Abstract:
Representation learning becomes especially important for complex systems with multimodal data sources such as cameras or sensors. Recent advances in reinforcement learning and optimal control make it possible to design control algorithms on these latent representations, but the field still lacks a large-scale standard dataset for unified comparison. In this work, we present a large-scale dataset a…
▽ More
Representation learning becomes especially important for complex systems with multimodal data sources such as cameras or sensors. Recent advances in reinforcement learning and optimal control make it possible to design control algorithms on these latent representations, but the field still lacks a large-scale standard dataset for unified comparison. In this work, we present a large-scale dataset and evaluation framework for representation learning for the complex task of landing an airplane. We implement and compare several approaches to representation learning on this dataset in terms of the quality of simple supervised learning tasks and disentanglement scores. The resulting representations can be used for further tasks such as anomaly detection, optimal control, model-based reinforcement learning, and other applications.
△ Less
Submitted 15 January, 2019; v1 submitted 27 November, 2018;
originally announced November 2018.
-
Deep Learning Approaches for Understanding Simple Speech Commands
Authors:
Roman A. Solovyev,
Maxim Vakhrushev,
Alexander Radionov,
Vladimir Aliev,
Alexey A. Shvets
Abstract:
Automatic classification of sound commands is becoming increasingly important, especially for mobile and embedded devices. Many of these devices contain both cameras and microphones, and companies that develop them would like to use the same technology for both of these classification tasks. One way of achieving this is to represent sound commands as images, and use convolutional neural networks w…
▽ More
Automatic classification of sound commands is becoming increasingly important, especially for mobile and embedded devices. Many of these devices contain both cameras and microphones, and companies that develop them would like to use the same technology for both of these classification tasks. One way of achieving this is to represent sound commands as images, and use convolutional neural networks when classifying images as well as sounds. In this paper we consider several approaches to the problem of sound classification that we applied in TensorFlow Speech Recognition Challenge organized by Google Brain team on the Kaggle platform. Here we show different representation of sounds (Wave frames, Spectrograms, Mel-Spectrograms, MFCCs) and apply several 1D and 2D convolutional neural networks in order to get the best performance. Our experiments show that we found appropriate sound representation and corresponding convolutional neural networks. As a result we achieved good classification accuracy that allowed us to finish the challenge on 8-th place among 1315 teams.
△ Less
Submitted 4 October, 2018;
originally announced October 2018.
-
Label Denoising with Large Ensembles of Heterogeneous Neural Networks
Authors:
Pavel Ostyakov,
Elizaveta Logacheva,
Roman Suvorov,
Vladimir Aliev,
Gleb Sterkin,
Oleg Khomenko,
Sergey I. Nikolenko
Abstract:
Despite recent advances in computer vision based on various convolutional architectures, video understanding remains an important challenge. In this work, we present and discuss a top solution for the large-scale video classification (labeling) problem introduced as a Kaggle competition based on the YouTube-8M dataset. We show and compare different approaches to preprocessing, data augmentation, m…
▽ More
Despite recent advances in computer vision based on various convolutional architectures, video understanding remains an important challenge. In this work, we present and discuss a top solution for the large-scale video classification (labeling) problem introduced as a Kaggle competition based on the YouTube-8M dataset. We show and compare different approaches to preprocessing, data augmentation, model architectures, and model combination. Our final model is based on a large ensemble of video- and frame-level models but fits into rather limiting hardware constraints. We apply an approach based on knowledge distillation to deal with noisy labels in the original dataset and the recently developed mixup technique to improve the basic models.
△ Less
Submitted 15 January, 2019; v1 submitted 12 September, 2018;
originally announced September 2018.