Search | arXiv e-print repository

arXiv:2007.02676 [pdf, other]

Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning

Authors: Khoa Nguyen, Konstantinos Drossos, Tuomas Virtanen

Abstract: Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. Though, the length of the textual description is considerably less than the length of the audio signal, for example 10 words versus some thousands of audio feature vectors. This clearly indicates that an output word corresponds to multiple input feature vectors. In this work we present an approach that focuses on explicitly taking advantage of this difference of lengths between sequences, by applying a temporal sub-sampling to the audio input sequence. We employ a sequence-to-sequence method, which uses a fixed-length vector as an output from the encoder, and we apply temporal sub-sampling between the RNNs of the encoder. We evaluate the benefit of our approach by employing the freely available dataset Clotho and we evaluate the impact of different factors of temporal sub-sampling. Our results show an improvement to all considered metrics.

Submitted 6 July, 2020; originally announced July 2020.
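
As a rough illustration of the temporal sub-sampling idea in the abstract above, the sketch below keeps every `factor`-th feature vector of a sequence, as one might do between two encoder RNN layers. The function name `temporal_subsample`, the factor of 4, and the sequence dimensions are illustrative assumptions, not details taken from the paper.

```python
def temporal_subsample(sequence, factor):
    """Keep every `factor`-th time step of a sequence of feature vectors."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return sequence[::factor]


# Hypothetical input: 2000 feature vectors with 64 dimensions each.
sequence = [[0.0] * 64 for _ in range(2000)]
shortened = temporal_subsample(sequence, factor=4)
print(len(shortened))  # 500
```

With a factor of 4, a later RNN layer would see a sequence a quarter as long, which moves the input length closer to the caption length.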

arXiv:2006.08386 [pdf, other]

COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

Authors: Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

Abstract: Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features. For achieving high performance, DNNs often need a large amount of annotated data which can be difficult and costly to obtain. In this paper, we propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags. Aligning is done by maximizing the agreement of the latent representations of audio and tags, using a contrastive loss. The result is an audio embedding model which reflects acoustic and semantic characteristics of sounds. We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks (namely, sound event recognition, and music genre and musical instrument classification), and investigate what type of characteristics the model captures. Our results are promising, sometimes in par with the state-of-the-art in the considered tasks and the embeddings produced with our method are well correlated with some acoustic descriptors.

Submitted 8 July, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

Comments: 8 pages, 1 figure, workshop on Self-supervision in Audio and Speech at the 37th International Conference on Machine Learning (ICML), 2020, Vienna, Austria

arXiv:2006.01919 [pdf, other]

A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection

Authors: Archontis Politis, Sharath Adavanne, Tuomas Virtanen

Abstract: This report presents the dataset and the evaluation setup of the Sound Event Localization & Detection (SELD) task for the DCASE 2020 Challenge. The SELD task refers to the problem of trying to simultaneously classify a known set of sound event classes, detect their temporal activations, and estimate their spatial directions or locations while they are active. To train and test SELD systems, datasets of diverse sound events occurring under realistic acoustic conditions are needed. Compared to the previous challenge, a significantly more complex dataset was created for DCASE 2020. The two key differences are a more diverse range of acoustical conditions, and dynamic conditions, i.e. moving sources. The spatial sound scenes are created using real room impulse responses captured in a continuous manner with a slowly moving excitation source. Both static and moving sound events are synthesized from them. Ambient noise recorded on location is added to complete the generation of scene recordings. A baseline SELD method accompanies the dataset, based on a convolutional recurrent neural network, to provide benchmark scores for the task. The baseline is an updated version of the one used in the previous challenge, with input features and training modifications to improve its performance.

Submitted 27 June, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

arXiv:2005.14623 [pdf, other]

Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions

Authors: Toni Heittola, Annamaria Mesaros, Tuomas Virtanen

Abstract: This paper presents the details of Task 1: Acoustic Scene Classification in the DCASE 2020 Challenge. The task consists of two subtasks: classification of data from multiple devices, requiring good generalization properties, and classification using low-complexity solutions. Here we describe the datasets and baseline systems. After the challenge submission deadline, challenge results and analysis of the submissions will be added.

Submitted 2 November, 2020; v1 submitted 29 May, 2020; originally announced May 2020.

Comments: published in DCASE 2020 Workshop

arXiv:2002.05033 [pdf, other]

Active Learning for Sound Event Detection

Authors: Shuyang Zhao, Toni Heittola, Tuomas Virtanen

Abstract: This paper proposes an active learning system for sound event detection (SED). It aims at maximizing the accuracy of a learned SED model with limited annotation effort. The proposed system analyzes an initially unlabeled audio dataset, from which it selects sound segments for manual annotation. The candidate segments are generated based on a proposed change point detection approach, and the selection is based on the principle of mismatch-first farthest-traversal. During the training of SED models, recordings are used as training inputs, preserving the long-term context for annotated segments. The proposed system clearly outperforms reference methods in the two datasets used for evaluation (TUT Rare Sound 2017 and TAU Spatial Sound 2019). Training with recordings as context outperforms training with only annotated segments. Mismatch-first farthest-traversal outperforms reference sample selection methods based on random sampling and uncertainty sampling. Remarkably, the required annotation effort can be greatly reduced on the dataset where target sound events are rare: by annotating only 2% of the training data, the achieved SED performance is similar to annotating all the training data.

Submitted 9 September, 2020; v1 submitted 12 February, 2020; originally announced February 2020.

arXiv:2002.00476 [pdf, other]

Sound Event Detection with Depthwise Separable and Dilated Convolutions

Authors: Konstantinos Drossos, Stylianos I. Mimilakis, Shayan Gharib, Yanxiong Li, Tuomas Virtanen

Abstract: State-of-the-art sound event detection (SED) methods usually employ a series of convolutional neural networks (CNNs) to extract useful features from the input audio signal, and then recurrent neural networks (RNNs) to model longer temporal context in the extracted features. The number of the channels of the CNNs and size of the weight matrices of the RNNs have a direct effect on the total amount of parameters of the SED method, which is to a couple of millions. Additionally, the usually long sequences that are used as an input to an SED method along with the employment of an RNN, introduce implications like increased training time, difficulty at gradient flow, and impeding the parallelization of the SED method. To tackle all these problems, we propose the replacement of the CNNs with depthwise separable convolutions and the replacement of the RNNs with dilated convolutions. We compare the proposed method to a baseline convolutional neural network on a SED task, and achieve a reduction of the amount of parameters by 85% and average training time per epoch by 78%, and an increase the average frame-wise F1 score and reduction of the average error rate by 4.6% and 3.8%, respectively.

Submitted 2 February, 2020; originally announced February 2020.
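
To make the parameter saving of depthwise separable convolutions concrete, the sketch below compares weight counts for a single layer using the standard formulas (biases ignored). The 3x3 kernel and the 64 and 128 channel counts are illustrative assumptions, not values from the paper, and the single-layer saving shown here is not the 85% whole-model figure reported in the abstract above.

```python
def standard_conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights in a standard 2D convolution (bias terms ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels


def depthwise_separable_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights in a depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = kernel_h * kernel_w * in_channels   # one spatial filter per input channel
    pointwise = in_channels * out_channels          # 1x1 convolution mixing channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 3, 64, 128)
separable = depthwise_separable_params(3, 3, 64, 128)
print(standard)   # 73728
print(separable)  # 8768
```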

arXiv:1911.10888 [pdf]

Sound event detection via dilated convolutional recurrent neural networks

Authors: Yanxiong Li, Mingle Liu, Konstantinos Drossos, Tuomas Virtanen

Abstract: Convolutional recurrent neural networks (CRNNs) have achieved state-of-the-art performance for sound event detection (SED). In this paper, we propose to use a dilated CRNN, namely a CRNN with a dilated convolutional kernel, as the classifier for the task of SED. We investigate the effectiveness of dilation operations which provide a CRNN with expanded receptive fields to capture long temporal context without increasing the amount of CRNN's parameters. Compared to the classifier of the baseline CRNN, the classifier of the dilated CRNN obtains a maximum increase of 1.9%, 6.3% and 2.5% at F1 score and a maximum decrease of 1.7%, 4.1% and 3.9% at error rate (ER), on the publicly available audio corpora of the TUT-SED Synthetic 2016, the TUT Sound Event 2016 and the TUT Sound Event 2017, respectively.

Submitted 20 July, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

Comments: 5 pages, 3 tables and 3 figures
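
The claim that dilation expands the receptive field without adding parameters can be checked with the usual formula for the extent of a dilated kernel along one axis, `dilation * (kernel_size - 1) + 1`. The sketch below is a minimal illustration; the kernel size of 3 and the dilation rates are assumptions chosen for the example, not settings from the paper.

```python
def dilated_kernel_extent(kernel_size, dilation):
    """Number of input steps spanned by one dilated kernel along one axis."""
    if kernel_size < 1 or dilation < 1:
        raise ValueError("kernel_size and dilation must be positive integers")
    return dilation * (kernel_size - 1) + 1


# The number of weights per kernel stays at kernel_size; only the span grows.
for dilation in (1, 2, 4):
    print(dilation, dilated_kernel_extent(3, dilation))
# prints:
# 1 3
# 2 5
# 4 9
```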

arXiv:1911.07098 [pdf, other]

VOICe: A Sound Event Detection Dataset For Generalizable Domain Adaptation

Authors: Shayan Gharib, Konstantinos Drossos, Eemi Fagerlund, Tuomas Virtanen

Abstract: The performance of sound event detection methods can significantly degrade when they are used in unseen conditions (e.g. recording devices, ambient noise). Domain adaptation is a promising way to tackle this problem. In this paper, we present VOICe, the first dataset for the development and evaluation of domain adaptation methods for sound event detection. VOICe consists of mixtures with three different sound events ("baby crying", "glass breaking", and "gunshot"), which are over-imposed over three different categories of acoustic scenes: vehicle, outdoors, and indoors. Moreover, the mixtures are also offered without any background noise. VOICe is freely available online (https://doi.org/10.5281/zenodo.3514950). In addition, using an adversarial-based training method, we evaluate the performance of a domain adaptation method on VOICe.

Submitted 25 November, 2019; v1 submitted 16 November, 2019; originally announced November 2019.

Comments: Fixed the footnote at the abstract

arXiv:1911.03128 [pdf, ps, other]

doi 10.1109/LSP.2020.2970310

Online Spectrogram Inversion for Low-Latency Audio Source Separation

Authors: Paul Magron, Tuomas Virtanen

Abstract: Audio source separation is usually achieved by estimating the short-time Fourier transform (STFT) magnitude of each source, and then applying a spectrogram inversion algorithm to retrieve time-domain signals. In particular, the multiple input spectrogram inversion (MISI) algorithm has been exploited successfully in several recent works. However, this algorithm suffers from two drawbacks, which we address in this paper. First, it has originally been introduced in a heuristic fashion: we propose here a rigorous optimization framework in which MISI is derived, thus proving the convergence of this algorithm. Besides, while MISI operates offline, we propose here an online version of MISI called oMISI, which is suitable for low-latency source separation, an important requirement for e.g., hearing aids applications. oMISI also allows one to use alternative phase initialization schemes exploiting the temporal structure of audio signals. Experiments conducted on a speech separation task show that oMISI performs as well as its offline counterpart, thus demonstrating its potential for real-time source separation.

Submitted 24 February, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

arXiv:1911.00527 [pdf, other]

Memory Requirement Reduction of Deep Neural Networks Using Low-bit Quantization of Parameters

Authors: Niccoló Nicodemo, Gaurav Naithani, Konstantinos Drossos, Tuomas Virtanen, Roberto Saletti

Abstract: Effective employment of deep neural networks (DNNs) in mobile devices and embedded systems is hampered by requirements for memory and computational power. This paper presents a non-uniform quantization approach which allows for dynamic quantization of DNN parameters for different layers and within the same layer. A virtual bit shift (VBS) scheme is also proposed to improve the accuracy of the proposed scheme. Our method reduces the memory requirements, preserving the performance of the network. The performance of our method is validated in a speech enhancement application, where a fully connected DNN is used to predict the clean speech spectrum from the input noisy speech spectrum. A DNN is optimized and its memory footprint and performance are evaluated using the short-time objective intelligibility, STOI, metric. The application of the low-bit quantization allows a 50% reduction of the DNN memory footprint while the STOI performance drops only by 2.7%.

Submitted 1 November, 2019; originally announced November 2019.

arXiv:1910.09387 [pdf, ps, other]

Clotho: An Audio Captioning Dataset

Authors: Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen

Abstract: Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).

Submitted 21 October, 2019; originally announced October 2019.

arXiv:1907.09238 [pdf, other]

Crowdsourcing a Dataset of Audio Captions

Authors: Samuel Lipping, Konstantinos Drossos, Tuomas Virtanen

Abstract: Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. "people talking in a big room"). The creation of a dataset for this task requires a considerable amount of work, rendering the crowdsourcing a very attractive option. In this paper we present a three steps based framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translations datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has less typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.

Submitted 22 July, 2019; originally announced July 2019.
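
The "two ten-word captions sharing four words" reading of a Jaccard similarity near 0.24 in the abstract above can be checked directly: the intersection has 4 words and the union has 10 + 10 - 4 = 16, giving 4 / 16 = 0.25. The sketch below works this out on placeholder tokens. It assumes each caption contains ten distinct words and treats words with the same root as identical tokens; the token names are invented for the example.

```python
def jaccard_similarity(words_a, words_b):
    """Jaccard similarity between two collections of words, compared as sets."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Two hypothetical ten-word captions that share exactly four words (w6..w9).
caption_a = [f"w{i}" for i in range(10)]
caption_b = [f"w{i}" for i in range(6, 16)]
print(jaccard_similarity(caption_a, caption_b))  # 0.25
```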

arXiv:1907.08506 [pdf, other]

Language Modelling for Sound Event Detection with Teacher Forcing and Scheduled Sampling

Authors: Konstantinos Drossos, Shayan Gharib, Paul Magron, Tuomas Virtanen

Abstract: A sound event detection (SED) method typically takes as an input a sequence of audio frames and predicts the activities of sound events in each frame. In real-life recordings, the sound events exhibit some temporal structure: for instance, a "car horn" will likely be followed by a "car passing by". While this temporal structure is widely exploited in sequence prediction tasks (e.g., in machine translation), where language models (LM) are exploited, it is not satisfactorily modeled in SED. In this work we propose a method which allows a recurrent neural network (RNN) to learn an LM for the SED task. The method conditions the input of the RNN with the activities of classes at the previous time step. We evaluate our method using F1 score and error rate (ER) over three different and publicly available datasets; the TUT-SED Synthetic 2016 and the TUT Sound Events 2016 and 2017 datasets. The obtained results show an increase of 9% and 2% at the F1 (higher is better) and a decrease of 7% and 2% at ER (lower is better) for the TUT Sound Events 2016 and 2017 datasets, respectively, when using our method. On the contrary, with our method there is a decrease of 4% at F1 score and an increase of 7% at ER for the TUT-SED Synthetic 2016 dataset.

Submitted 6 November, 2019; v1 submitted 19 July, 2019; originally announced July 2019.

Comments: Fixed the display of URLs at footnote, updated the results
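
A generic sketch of the choice between teacher forcing and scheduled sampling named in the title above: at each time step, the previous-step class activities used to condition the RNN input are taken from the ground truth with some probability, and from the model's own previous prediction otherwise. The function name, the `p_truth` parameter, and the example activity vectors are hypothetical; this is the general scheduled-sampling idea, not the authors' implementation.

```python
import random


def choose_conditioning(ground_truth_prev, predicted_prev, p_truth, rng):
    """Pick the previous-step class activities used to condition the next input.

    With probability `p_truth` the ground truth is used (teacher forcing);
    otherwise the model's own previous prediction is fed back.
    """
    if rng.random() < p_truth:
        return ground_truth_prev
    return predicted_prev


rng = random.Random(0)
truth = [1, 0, 0]   # hypothetical class activities at the previous step
guess = [0, 1, 0]   # hypothetical model prediction at the previous step
print(choose_conditioning(truth, guess, p_truth=1.0, rng=rng))  # [1, 0, 0]
print(choose_conditioning(truth, guess, p_truth=0.0, rng=rng))  # [0, 1, 0]
```

In scheduled sampling, `p_truth` is typically lowered over the course of training so the model gradually learns to rely on its own predictions.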

arXiv:1905.08546 [pdf, other]

A multi-room reverberant dataset for sound event localization and detection

Authors: Sharath Adavanne, Archontis Politis, Tuomas Virtanen

Abstract: This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge. The goal of the SELD task is to detect the temporal activities of a known set of sound event classes, and further localize them in space when active. As part of the challenge, a synthesized dataset with each sound event associated with a spatial coordinate represented using azimuth and elevation angles is provided. These sound events are spatialized using real-life impulse responses collected at multiple spatial coordinates in five different rooms with varying dimensions and material properties. A baseline SELD method employing a convolutional recurrent neural network is used to generate benchmark scores for this reverberant dataset. The benchmark scores are obtained using the recommended cross-validation setup.

Submitted 24 May, 2019; v1 submitted 21 May, 2019; originally announced May 2019.

arXiv:1905.01926 [pdf]

Zero-Shot Audio Classification Based on Class Label Embeddings

Authors: Huang Xie, Tuomas Virtanen

Abstract: This paper proposes a zero-shot learning approach for audio classification based on the textual information about class labels without any audio samples from target classes. We propose an audio classification system built on the bilinear model, which takes audio feature embeddings and semantic class label embeddings as input, and measures the compatibility between an audio feature embedding and a class label embedding. We use VGGish to extract audio feature embeddings from audio recordings. We treat textual labels as semantic side information of audio classes, and use Word2Vec to generate class label embeddings. Results on the ESC-50 dataset show that the proposed system can perform zero-shot audio classification with small training dataset. It can achieve accuracy (26 % on average) better than random guess (10 %) on each audio category. Particularly, it reaches up to 39.7 % for the category of natural audio classes.

Submitted 7 August, 2019; v1 submitted 6 May, 2019; originally announced May 2019.

Comments: 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

arXiv:1905.00979 [pdf, other]

City classification from multiple real-world sound scenes

Authors: Helen L. Bear, Toni Heittola, Annamaria Mesaros, Emmanouil Benetos, Tuomas Virtanen

Abstract: The majority of sound scene analysis work focuses on one of two clearly defined tasks: acoustic scene classification or sound event detection. Whilst this separation of tasks is useful for problem definition, they inherently ignore some subtleties of the real-world, in particular how humans vary in how they describe a scene. Some will describe the weather and features within it, others will use a holistic descriptor like `park', and others still will use unique identifiers such as cities or names. In this paper, we undertake the task of automatic city classification to ask whether we can recognize a city from a set of sound scenes? In this problem each city has recordings from multiple scenes. We test a series of methods for this novel task and show that a simple convolutional neural network (CNN) can achieve accuracy of 50%. This is less than the acoustic scene classification task baseline in the DCASE 2018 ASC challenge on the same data. A simple adaptation to the class labels of pairing city labels with grouped scenes, accuracy increases to 52%, closer to the simpler scene classification task. Finally we also formulate the problem in a multi-task learning framework and achieve an accuracy of 56%, outperforming the aforementioned approaches.

Submitted 29 July, 2019; v1 submitted 2 May, 2019; originally announced May 2019.

Comments: Accepted to WASPAA 2019

arXiv:1905.00078 [pdf, other]

doi 10.1109/JSTSP.2019.2908700

Deep Learning for Audio Signal Processing

Authors: Hendrik Purwins, Bo Li, Tuomas Virtanen, Jan Schlüter, Shuo-yiin Chang, Tara Sainath

Abstract: Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.

Submitted 25 May, 2019; v1 submitted 30 April, 2019; originally announced May 2019.

Comments: 15 pages, 2 pdf figures

ACM Class: I.2.6; H.5.1

Journal ref: Journal of Selected Topics of Signal Processing 14, No. 8 (2019)

arXiv:1904.12769 [pdf, other]

Localization, Detection and Tracking of Multiple Moving Sound Sources with a Convolutional Recurrent Neural Network

Authors: Sharath Adavanne, Archontis Politis, Tuomas Virtanen

Abstract: This paper investigates the joint localization, detection, and tracking of sound events using a convolutional recurrent neural network (CRNN). We use a CRNN previously proposed for the localization and detection of stationary sources, and show that the recurrent layers enable the spatial tracking of moving sources when trained with dynamic scenes. The tracking performance of the CRNN is compared with a stand-alone tracking method that combines a multi-source (DOA) estimator and a particle filter. Their respective performance is evaluated in various acoustic conditions such as anechoic and reverberant scenarios, stationary and moving sources at several angular velocities, and with a varying number of overlapping sources. The results show that the CRNN manages to track multiple sources more consistently than the parametric method across acoustic scenarios, but at the cost of higher localization error.

Submitted 29 April, 2019; originally announced April 2019.

arXiv:1904.10678 [pdf, ps, other]

Unsupervised Adversarial Domain Adaptation Based On The Wasserstein Distance For Acoustic Scene Classification

Authors: Konstantinos Drossos, Paul Magron, Tuomas Virtanen

Abstract: A challenging problem in the deep learning-based machine listening field is the degradation of performance when using data from unseen conditions. In this paper we focus on the acoustic scene classification (ASC) task and propose an adversarial deep learning method that adapts an acoustic scene classification system to a new acoustic channel resulting from data captured with a different recording device. We build upon the theoretical model of HΔH-distance and a previous adversarial discriminative deep learning method for ASC unsupervised domain adaptation, and we present an adversarial training based method using the Wasserstein distance. We improve the state-of-the-art mean accuracy on the data from the unseen conditions from 32% to 45%, using the TUT Acoustic Scenes dataset.

Submitted 6 November, 2019; v1 submitted 24 April, 2019; originally announced April 2019.

Comments: Updated indices at Eq 6

arXiv:1902.07033 [pdf, other]

Low-Latency Deep Clustering For Speech Separation

Authors: Shanshan Wang, Gaurav Naithani, Tuomas Virtanen

Abstract: This paper proposes a low algorithmic latency adaptation of the deep clustering approach to speaker-independent speech separation. It consists of three parts: a) the usage of long short-term memory (LSTM) networks instead of their bidirectional variant used in the original work, b) using a short synthesis window (here 8 ms) required for low-latency operation, and c) using a buffer in the beginning of the audio mixture to estimate cluster centres corresponding to the constituent speakers, which are then utilized to separate speakers within the rest of the signal. The buffer duration serves as an initialization phase after which the system is capable of operating with 8 ms algorithmic latency. We evaluate our proposed approach on two-speaker mixtures from the Wall Street Journal (WSJ0) corpus. We observe that the use of LSTM yields around one dB lower source to distortion ratio (SDR) than the baseline bidirectional LSTM. Moreover, using an 8 ms synthesis window instead of 32 ms degrades the separation performance by around 2.1 dB as compared to the baseline. Finally, we also report separation performance with different buffer durations, noting that separation can be achieved even for buffer durations as low as 300 ms.

Submitted 19 February, 2019; originally announced February 2019.

Comments: To appear in ICASSP 2019

arXiv:1808.05777 [pdf, other]

Unsupervised adversarial domain adaptation for acoustic scene classification

Authors: Shayan Gharib, Konstantinos Drossos, Emre Çakir, Dmitriy Serdyuk, Tuomas Virtanen

Abstract: A general problem in the acoustic scene classification task is the mismatched conditions between training and testing data, which significantly reduce the classification accuracy of the developed methods. As a countermeasure, we present the first method of unsupervised adversarial domain adaptation for acoustic scene classification. We employ a model pre-trained on data from one set of conditions and, by using data from another set of conditions, we adapt the model so that its output cannot be used for classifying the set of conditions that the input data belong to. We use a freely available dataset from the DCASE 2018 challenge Task 1, subtask B, that contains data from mismatched recording devices. We consider the scenario where the annotations are available for the data recorded from one device, but not for the rest. Our results show that with our model agnostic method we can achieve a $\sim 10\%$ increase in accuracy on an unseen and unlabeled dataset, while keeping almost the same performance on the labeled dataset.

Submitted 17 August, 2018; originally announced August 2018.

arXiv:1808.02357 [pdf, other]

Acoustic Scene Classification: A Competition Review

Authors: Shayan Gharib, Honain Derrar, Daisuke Niizumi, Tuukka Senttula, Janne Tommola, Toni Heittola, Tuomas Virtanen, Heikki Huttunen

Abstract: In this paper we study the problem of acoustic scene classification, i.e., categorization of audio sequences into mutually exclusive classes based on their spectral content. We describe the methods and results discovered during a competition organized in the context of a graduate machine learning course; both by the students and external participants. We identify the most suitable methods and study the impact of each by performing an ablation study of the mixture of approaches. We also compare the results with a neural network baseline, and show the improvement over that. Finally, we discuss the impact of using a competition as a part of a university course, and justify its importance in the curriculum based on student feedback.

Submitted 2 August, 2018; originally announced August 2018.

Comments: This work has been accepted in IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2018). Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:1807.11298 [pdf, other]

Harmonic-Percussive Source Separation with Deep Neural Networks and Phase Recovery

Authors: Konstantinos Drossos, Paul Magron, Stylianos Ioannis Mimilakis, Tuomas Virtanen

Abstract: Harmonic/percussive source separation (HPSS) consists in separating the pitched instruments from the percussive parts in a music mixture. In this paper, we propose to apply the recently introduced Masker-Denoiser with twin networks (MaD TwinNet) system to this task. MaD TwinNet is a deep learning architecture that has reached state-of-the-art results in monaural singing voice separation. Herein, we propose to apply it to HPSS by using it to estimate the magnitude spectrogram of the percussive source. Then, we retrieve the complex-valued short-time Fourier transform of the sources by means of a phase recovery algorithm, which minimizes the reconstruction error and enforces the phase of the harmonic part to follow a sinusoidal phase model. Experiments conducted on realistic music mixtures show that this novel separation system outperforms the previous state-of-the-art kernel additive model approach.

Submitted 30 July, 2018; originally announced July 2018.

arXiv:1807.09840 [pdf, other]

A multi-device dataset for urban acoustic scene classification

Authors: Annamaria Mesaros, Toni Heittola, Tuomas Virtanen

Abstract: This paper introduces the acoustic scene classification task of the DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of the predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities; therefore, it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.

Submitted 11 October, 2018; v1 submitted 25 July, 2018; originally announced July 2018.

Comments: accepted to DCASE 2018 Workshop

arXiv:1807.06899 [pdf, other]

Deep neural network based speech separation optimizing an objective estimator of intelligibility for low latency applications

Authors: Gaurav Naithani, Joonas Nikunen, Lars Bramsløw, Tuomas Virtanen

Abstract: Mean square error (MSE) has been the preferred choice of loss function in current deep neural network (DNN) based speech separation techniques. In this paper, we propose a new cost function with the aim of optimizing the extended short time objective intelligibility (ESTOI) measure. We focus on applications where low algorithmic latency ($\leq 10$ ms) is important. We use long short-term memory (LSTM) networks and evaluate our proposed approach on four sets of two-speaker mixtures from the extended Danish hearing in noise (HINT) dataset. We show that the proposed loss function can offer improved or on-par objective intelligibility (in terms of ESTOI) compared to an MSE optimized baseline while resulting in lower objective separation performance (in terms of the source to distortion ratio (SDR)). We then proceed to propose an approach where the network is first initialized with weights optimized for the MSE criterion and then trained with the proposed ESTOI loss criterion. This approach mitigates some of the losses in objective separation performance while preserving the gains in objective intelligibility.

Submitted 18 July, 2018; originally announced July 2018.

Comments: To appear at International Workshop on Acoustic Signal Enhancement (IWAENC) 2018

arXiv:1807.00129 [pdf, other]

doi 10.1109/JSTSP.2018.2885636

Sound Event Localization and Detection of Overlap** Sources Using Convolutional Recurrent Neural Networks

Authors: Sharath Adavanne, Archontis Politis, Joonas Nikunen, Tuomas Virtanen

Abstract: In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.

Submitted 17 December, 2018; v1 submitted 30 June, 2018; originally announced July 2018.

Comments: Published in Journal of Selected Topics in Signal Processing 2018

arXiv:1805.03647 [pdf, other]

End-to-End Polyphonic Sound Event Detection Using Convolutional Recurrent Neural Networks with Learned Time-Frequency Representation Input

Authors: Emre Çakır, Tuomas Virtanen

Abstract: Sound event detection systems typically consist of two stages: extracting hand-crafted features from the raw audio waveform, and learning a mapping between these features and the target sound events using a classifier. Recently, the focus of sound event detection research has mostly shifted to the latter stage, using standard features such as the mel spectrogram as the input for classifiers such as deep neural networks. In this work, we utilize an end-to-end approach and propose to combine these two stages in a single deep neural network classifier. The feature extraction over the raw waveform is conducted by a feedforward layer block, whose parameters are initialized to extract the time-frequency representations. The feature extraction parameters are updated during training, resulting in a representation that is optimized for the specific task. This feature extraction block is followed by (and jointly trained with) a convolutional recurrent network, which has recently given state-of-the-art results in many sound recognition tasks. The proposed system does not outperform a convolutional recurrent network with fixed hand-crafted features. The final magnitude spectrum characteristics of the feature extraction block parameters indicate that the most relevant information for the given task is contained in the 0 - 3 kHz frequency range, and this is also supported by the empirical results on the SED performance.

Submitted 9 May, 2018; originally announced May 2018.

Comments: accepted to IJCNN 2018

arXiv:1802.05132 [pdf, ps, other]

Close Miking Empirical Practice Verification: A Source Separation Approach

Authors: Konstantinos Drossos, Stylianos Ioannis Mimilakis, Andreas Floros, Tuomas Virtanen, Gerald Schuller

Abstract: Close miking represents a widely employed practice of placing a microphone very near to the sound source in order to capture more direct sound and minimize any pickup of ambient sound, including other, concurrently active sources. It has been used by the audio engineering community for decades for audio recording, based on a number of empirical rules that evolved during the recording practice itself. But can this empirical knowledge and close miking practice be systematically verified? In this work we aim to address this question based on an analytic methodology that employs techniques and metrics originating from the sound source separation evaluation field. In particular, we apply a quantitative analysis of the source separation capabilities of the close miking technique. The analysis is applied on a recording dataset obtained at multiple positions of a typical musical hall, multiple distances between the microphone and the sound source, multiple microphone types, and multiple level differences between the sound source and the ambient acoustic component. For all the above cases we compute the Source to Interference Ratio (SIR) metric. The results obtained clearly demonstrate an optimum close-miking performance that matches the current empirical knowledge of professional audio recording.

Submitted 13 February, 2018; originally announced February 2018.

Journal ref: In Proceedings of the 142nd Audio Engineering Society Convention, Berlin, Germany, 2017

arXiv:1802.03156 [pdf, ps, other]

Complex ISNMF: a Phase-Aware Model for Monaural Audio Source Separation

Authors: Paul Magron, Tuomas Virtanen

Abstract: This paper introduces a phase-aware probabilistic model for audio source separation. Classical source models in the short-term Fourier transform domain use circularly-symmetric Gaussian or Poisson random variables. This is equivalent to assuming that the phase of each source is uniformly distributed, which is not suitable for exploiting the underlying structure of the phase. Drawing on preliminary works, we introduce here a Bayesian anisotropic Gaussian source model in which the phase is no longer uniform. Such a model permits us to favor a phase value that originates from a signal model through a Markov chain prior structure. The variances of the latent variables are structured with nonnegative matrix factorization (NMF). The resulting model is called complex Itakura-Saito NMF (ISNMF) since it generalizes the ISNMF model to the case of non-isotropic variables. It combines the advantages of ISNMF, which uses a distortion measure adapted to audio and yields a set of estimates which preserve the overall energy of the mixture, and of complex NMF, which enables one to account for some phase constraints. We derive a generalized expectation-maximization algorithm to estimate the model parameters. Experiments conducted on a musical source separation task in a semi-informed setting show that the proposed approach outperforms state-of-the-art phase-aware separation techniques.

Submitted 30 September, 2018; v1 submitted 9 February, 2018; originally announced February 2018.

arXiv:1802.00300 [pdf, other]

MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation

Authors: Konstantinos Drossos, Stylianos Ioannis Mimilakis, Dmitriy Serdyuk, Gerald Schuller, Tuomas Virtanen, Yoshua Bengio

Abstract: The monaural singing voice separation task focuses on the prediction of the singing voice from a single channel music mixture signal. Current state of the art (SOTA) results in monaural singing voice separation are obtained with deep learning based methods. In this work we present a novel deep learning based method that learns long-term temporal patterns and structures of a musical piece. We build upon the recently proposed Masker-Denoiser (MaD) architecture and we enhance it with the Twin Networks, a technique to regularize a recurrent generative network using a backward running copy of the network. We evaluate our method using the Demixing Secret Dataset and we obtain an increment to signal-to-distortion ratio (SDR) of 0.37 dB and to signal-to-interference ratio (SIR) of 0.23 dB, compared to previous SOTA results.

Submitted 1 February, 2018; originally announced February 2018.

arXiv:1801.09522 [pdf, other]

Multichannel Sound Event Detection Using 3D Convolutional Neural Networks for Learning Inter-channel Features

Authors: Sharath Adavanne, Archontis Politis, Tuomas Virtanen

Abstract: In this paper, we propose a stacked convolutional and recurrent neural network (CRNN) with a 3D convolutional neural network (CNN) in the first layer for the multichannel sound event detection (SED) task. The 3D CNN enables the network to simultaneously learn the inter- and intra-channel features from the input multichannel audio. In order to evaluate the proposed method, multichannel audio datasets with different numbers of overlapping sound sources are synthesized. Each of these datasets has four-channel first-order Ambisonic, binaural, and single-channel versions, on which the performance of SED using the proposed method is compared to study the potential of SED using multichannel audio. A similar study is also done with the binaural and single-channel versions of the real-life recording TUT-SED 2017 development dataset. The proposed method learns to recognize overlapping sound events from multichannel features faster and performs better SED with fewer training epochs. The results show that on using multichannel Ambisonic audio in place of single-channel audio we improve the overall F-score by 7.5%, overall error rate by 10% and recognize 15.6% more sound events in time frames with four overlapping sound sources.

Submitted 29 January, 2018; originally announced January 2018.

arXiv:1711.01437 [pdf, other]

Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask

Authors: Stylianos Ioannis Mimilakis, Konstantinos Drossos, João F. Santos, Gerald Schuller, Tuomas Virtanen, Yoshua Bengio

Abstract: Singing voice separation based on deep learning relies on the usage of time-frequency masking. In many cases the masking process is not a learnable function or is not encapsulated into the deep learning optimization. Consequently, most of the existing methods rely on a post processing step using the generalized Wiener filtering. This work proposes a method that learns and optimizes (during training) a source-dependent mask and does not need the aforementioned post processing step. We introduce a recurrent inference algorithm, a sparse transformation step to improve the mask generation process, and a learned denoising filter. Obtained results show an increase of 0.49 dB for the signal to distortion ratio and 0.30 dB for the signal to interference ratio, compared to previous state-of-the-art approaches for monaural singing voice separation.

Submitted 13 February, 2018; v1 submitted 4 November, 2017; originally announced November 2017.

arXiv:1710.10059 [pdf, other]

Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network

Authors: Sharath Adavanne, Archontis Politis, Tuomas Virtanen

Abstract: This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all the channels as input to the network. The proposed DOAnet is evaluated by estimating the DOAs of multiple concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results show that the proposed DOAnet is capable of estimating the number of sources and their respective DOAs with good precision and of generating SPS with high signal-to-noise ratio.

Submitted 5 August, 2018; v1 submitted 27 October, 2017; originally announced October 2017.

Comments: EUSIPCO 2018

arXiv:1710.10005 [pdf, other]

Separation of Moving Sound Sources Using Multichannel NMF and Acoustic Tracking

Authors: Joonas Nikunen, Aleksandr Diment, Tuomas Virtanen

Abstract: In this paper we propose a method for separation of moving sound sources. The method is based on first tracking the sources and then estimation of source spectrograms using multichannel non-negative matrix factorization (NMF) and extracting the sources from the mixture by single-channel Wiener filtering. We propose a novel multichannel NMF model with time-varying mixing of the sources denoted by spatial covariance matrices (SCM) and provide update equations for optimizing model parameters minimizing squared Frobenius norm. The SCMs of the model are obtained based on estimated directions of arrival of tracked sources at each time frame. The evaluation is based on established objective separation criteria and using real recordings of two and three simultaneous moving sound sources. The compared methods include conventional beamforming and ideal ratio mask separation. The proposed method is shown to exceed the separation quality of other evaluated blind approaches according to all measured quantities. Additionally, we evaluate the method's susceptibility towards tracking errors by comparing the separation quality achieved using annotated ground truth source trajectories.

Submitted 27 October, 2017; originally announced October 2017.

Comments: Preprint of manuscript submitted to IEEE/ACM Transactions on Audio Speech and Language processing (R1)

arXiv:1710.02998 [pdf, other]

Sound event detection using weakly labeled dataset with stacked convolutional and recurrent neural network

Authors: Sharath Adavanne, Tuomas Virtanen

Abstract: This paper proposes a neural network architecture and training scheme to learn the start and end time of sound events (strong labels) in an audio recording given just the list of sound events existing in the audio without time information (weak labels). We achieve this by using a stacked convolutional and recurrent neural network with two prediction layers in sequence, one for the strong labels followed by one for the weak labels. The network is trained using frame-wise log mel-band energy as the input audio feature, and weak labels provided in the dataset as labels for the weak label prediction layer. Strong labels are generated by replicating the weak labels as many times as there are frames in the input audio feature, and are used for the strong label layer during training. We propose to control what the network learns from the weak and strong labels by different weighting for the loss computed in the two prediction layers. The proposed method is evaluated on a publicly available dataset of 155 hours with 17 sound event classes. The method achieves the best error rate of 0.84 for strong labels and F-score of 43.3% for weak labels on the unseen test split.

Submitted 9 October, 2017; originally announced October 2017.

Comments: Accepted in Detection and Classification of Acoustic Scenes and Events (DCASE 2017)

arXiv:1710.02997 [pdf, other]

A report on sound event detection with different binaural features

Authors: Sharath Adavanne, Tuomas Virtanen

Abstract: In this paper, we compare the performance of using binaural audio features in place of single-channel features for sound event detection. Three different binaural features are studied and evaluated on the publicly available TUT Sound Events 2017 dataset of length 70 minutes. Sound event detection is performed separately with single-channel and binaural features using stacked convolutional and recurrent neural network and the evaluation is reported using standard metrics of error rate and F-score. The studied binaural features are seen to consistently perform equal to or better than the single-channel features with respect to error rate metric.

Submitted 9 October, 2017; originally announced October 2017.

Comments: Technical report for the top performing method in Task 3: Real life sound event detection challenge, at Detection and Classification of Acoustic Scenes and Events (DCASE) 2017

arXiv:1709.00611 [pdf, other]

A Recurrent Encoder-Decoder Approach with Skip-filtering Connections for Monaural Singing Voice Separation

Authors: Stylianos Ioannis Mimilakis, Konstantinos Drossos, Tuomas Virtanen, Gerald Schuller

Abstract: The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s). The spectral representations are then used to derive time-frequency masks. In this work we introduce a method to directly learn time-frequency masks from an observed mixture magnitude spectrum. We employ recurrent neural networks and train them using prior knowledge only for the magnitude spectrum of the target source. To assess the performance of the proposed method, we focus on the task of singing voice separation. The results from an objective evaluation show that our proposed method provides comparable results to deep learning based methods which operate over complicated signal representations. Compared to previous methods that approximate time-frequency masks, our method has increased performance of signal to distortion ratio by an average of 3.8 dB.

Submitted 24 April, 2018; v1 submitted 2 September, 2017; originally announced September 2017.

arXiv:1706.10006 [pdf, other]

Automated Audio Captioning with Recurrent Neural Networks

Authors: Konstantinos Drossos, Sharath Adavanne, Tuomas Virtanen

Abstract: We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.

Submitted 24 October, 2017; v1 submitted 29 June, 2017; originally announced June 2017.

Comments: Presented at the 11th IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017

arXiv:1706.02293 [pdf, other]

Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features

Authors: Sharath Adavanne, Giambattista Parascandolo, Pasi Pertilä, Toni Heittola, Tuomas Virtanen

Abstract: In this paper, we propose the use of spatial and harmonic features in combination with a long short term memory (LSTM) recurrent neural network (RNN) for the automatic sound event detection (SED) task. Real life sound recordings typically have many overlapping sound events, making it hard to recognize them with just mono channel audio. Human listeners have been successfully recognizing the mixture of overlapping sound events using pitch cues and exploiting the stereo (multichannel) audio signal available at their ears to spatially localize these events. Traditionally, SED systems have only been using mono channel audio; motivated by the human listener, we propose to extend them to use multichannel audio. The proposed SED system is compared against the state of the art mono channel method on the development subset of TUT sound events detection 2016 database. The usage of spatial and harmonic features is shown to improve the performance of SED.

Submitted 7 June, 2017; originally announced June 2017.

arXiv:1706.02292 [pdf, other]

Stacked Convolutional and Recurrent Neural Networks for Music Emotion Recognition

Authors: Miroslav Malik, Sharath Adavanne, Konstantinos Drossos, Tuomas Virtanen, Dasa Ticha, Roman Jarina

Abstract: This paper studies the emotion recognition from musical tracks in the 2-dimensional valence-arousal (V-A) emotional space. We propose a method based on convolutional (CNN) and recurrent neural networks (RNN), having significantly fewer parameters compared with the state-of-the-art method for the same task. We utilize one CNN layer followed by two branches of RNNs trained separately for arousal and valence. The method was evaluated using the 'MediaEval2015 emotion in music' dataset. We achieved an RMSE of 0.202 for arousal and 0.268 for valence, which is the best result reported on this dataset.

Submitted 7 June, 2017; originally announced June 2017.

Comments: Accepted for Sound and Music Computing (SMC 2017)

arXiv:1706.02291 [pdf, other]

Sound Event Detection Using Spatial Features and Convolutional Recurrent Neural Network

Authors: Sharath Adavanne, Pasi Pertilä, Tuomas Virtanen

Abstract: This paper proposes to use low-level spatial features extracted from multichannel audio for sound event detection. We extend the convolutional recurrent neural network to handle more than one type of these multichannel features by learning from each of them separately in the initial stages. We show that instead of concatenating the features of each channel into a single feature vector the network learns sound events in multichannel audio better when they are presented as separate layers of a volume. Using the proposed spatial features over monaural features on the same network gives an absolute F-score improvement of 6.1% on the publicly available TUT-SED 2016 dataset and 2.7% on the TUT-SED 2009 dataset that is fifteen times larger.

Submitted 7 June, 2017; originally announced June 2017.

Comments: Accepted for IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017)

arXiv:1706.02047 [pdf, other]

Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection

Authors: Sharath Adavanne, Konstantinos Drossos, Emre Çakır, Tuomas Virtanen

Abstract: This paper studies the detection of bird calls in audio segments using stacked convolutional and recurrent neural networks. Data augmentation by blocks mixing and domain adaptation using a novel method of test mixing are proposed and evaluated in regard to making the method robust to unseen data. The contributions of two kinds of acoustic features (dominant frequency and log mel-band energy) and their combinations are studied in the context of bird audio detection. Our best achieved AUC measure on five cross-validations of the development data is 95.5% and 88.1% on the unseen evaluation data.

Submitted 7 June, 2017; originally announced June 2017.

Comments: Accepted for European Signal Processing Conference 2017

arXiv:1703.02317 [pdf, other]

Convolutional Recurrent Neural Networks for Bird Audio Detection

Authors: Emre Çakır, Sharath Adavanne, Giambattista Parascandolo, Konstantinos Drossos, Tuomas Virtanen

Abstract: Bird sounds possess distinctive spectral structure which may exhibit small shifts in spectrum depending on the bird species and environmental conditions. In this paper, we propose using convolutional recurrent neural networks on the task of automated bird audio detection in real-life environments. In the proposed method, convolutional layers extract high dimensional, local frequency shift invariant features, while recurrent layers capture longer term dependencies between the features extracted from short time frames. This method achieves 88.5% Area Under ROC Curve (AUC) score on the unseen evaluation data and obtains the second place in the Bird Audio Detection challenge.

Submitted 7 March, 2017; originally announced March 2017.

Comments: Submitted to EUSIPCO 2017 Special Session on Bird Audio Signal Processing

arXiv:1702.06286 [pdf, other]

doi 10.1109/TASLP.2017.2690575

Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

Authors: Emre Çakır, Giambattista Parascandolo, Toni Heittola, Heikki Huttunen, Tuomas Virtanen

Abstract: Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNN) are able to extract higher level features that are invariant to local spectral and temporal variations. Recurrent neural networks (RNNs) are powerful in learning the longer term temporal context in the audio signals. CNNs and RNNs as classifiers have recently shown improved performances over established methods in various sound recognition tasks. We combine these two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it on a polyphonic sound event detection task. We compare the performance of the proposed CRNN method with CNN, RNN, and other established methods, and observe a considerable improvement for four different datasets consisting of everyday sound events.

Submitted 21 February, 2017; originally announced February 2017.

Comments: Accepted for IEEE Transactions on Audio, Speech and Language Processing, Special Issue on Sound Scene and Event Analysis

arXiv:1610.08444 [pdf, other]

Temperedness of measures defined by polynomial equations over local fields

Authors: David W. Taylor, V. S. Varadarajan, Jukka T. Virtanen, David E. Weisbart

Abstract: We investigate the asymptotic growth of the canonical measures on the fibers of morphisms between vector spaces over local fields of arbitrary characteristic. For non-archimedean local fields we use a version of the Łojasiewicz inequality (\cite{lojasiewicz1959}, \cite{hormander1958division}) which follows from Greenberg \cite{greenberg1966rational}, \cite{bollaerts1990estimate}, together with the theory of the Brauer group of local fields to construct definite forms of arbitrarily high degree, and to transfer questions at infinity to questions near the origin. We then use these to generalize results of Hörmander \cite{hormander1958division} on estimating the growth of polynomials at infinity in terms of the distance to their zero loci. Specifically, when a fiber corresponds to a non-critical value which is stable, i.e. remains non-critical under small perturbations, we show that the canonical measure on the fiber is tempered, which generalizes results of Igusa and Raghavan \cite{igusa1978lectures}, and Virtanen and Weisbart \cite{virtanen2014elementary}.

Submitted 20 November, 2016; v1 submitted 26 October, 2016; originally announced October 2016.

Comments: Paper read in New Trends in Mathematics and Physics, Conference held in Moscow, Russia, on October 7 2016

arXiv:1604.00861 [pdf, other]

doi 10.1109/ICASSP.2016.7472917

Recurrent Neural Networks for Polyphonic Sound Event Detection in Real Life Recordings

Authors: Giambattista Parascandolo, Heikki Huttunen, Tuomas Virtanen

Abstract: In this paper we present an approach to polyphonic sound event detection in real life recordings based on bi-directional long short term memory (BLSTM) recurrent neural networks (RNNs). A single multilabel BLSTM RNN is trained to map acoustic features of a mixture signal consisting of sounds from multiple classes, to binary activity indicators of each event class. Our method is tested on a large database of real-life recordings, with 61 classes (e.g. music, car, speech) from 10 different everyday contexts. The proposed method outperforms previous approaches by a large margin, and the results are further improved using data augmentation techniques. Overall, our system reports an average F1-score of 65.5% on 1 second blocks and 64.7% on single frames, a relative improvement over previous state-of-the-art approach of 6.8% and 15.1% respectively.

Submitted 4 April, 2016; originally announced April 2016.

Comments: To appear in Proceedings of IEEE ICASSP 2016

arXiv:1101.3528 [pdf, other]

doi 10.1103/PhysRevB.83.224521

Fermi liquid theory applied to vibrating wire measurements in 3He-4He mixtures

Authors: Timo H. Virtanen, Erkki Thuneberg

Abstract: We use Fermi liquid theory to study the mechanical impedance of 3He-4He mixtures at low temperatures. The theory is applied to the case of vibrating wires, immersed in the liquid. We present numerical results based on a direct solution of Landau-Boltzmann equation for the 3He quasiparticle distribution for the full scale of the quasiparticle mean-free-path. The two-fluid nature of mixtures is taken into account in the theory, and the effect of Fermi liquid interactions and boundary conditions are studied in detail. The results are in fair quantitative agreement with experimental data. In particular, we can reproduce the anomalous decrease in inertia, observed in vibrating wire experiments reaching the ballistic limit. The essential effect of the experimental container and second-sound resonances is demonstrated.

Submitted 18 January, 2011; originally announced January 2011.

Comments: 17 pages, 10 figures

Journal ref: Phys. Rev. B 83, 224521 (2011)

arXiv:1010.4016 [pdf, ps, other]

doi 10.1103/PhysRevB.83.245137

Fermi liquid theory of Fermi-Bose mixtures

Authors: E. V. Thuneberg, T. H. Virtanen

Abstract: We write down the basic equations of Fermi-liquid theory for mixtures of fermions and bosons, an example being 3He-4He mixtures at low temperatures. Basically the theory is identical to the one derived by Khalatnikov, but it is derived in a different way, and includes more discussion. A simplifying transformation of the equations is found where the coupling of the normal and superfluid components appears in a simple form. The boundary conditions are discussed.

Submitted 19 October, 2010; originally announced October 2010.

Comments: 9 pages, no figures

Journal ref: Phys. Rev. B 83, 245137 (2011)

arXiv:1010.4015 [pdf, other]

doi 10.1103/PhysRevLett.106.055301

Pendulum in Fermi liquid

Authors: Timo H. Virtanen, Erkki Thuneberg

Abstract: The Fermi liquid theory formulated by Landau is a basic paradigm of the behavior of an interacting many-body system. We present a new application of this theory to calculate "Landau force" on a macroscopic object. We show that immersing a pendulum in Fermi liquid can increase its oscillation frequency, and evidence of this has been observed in mixtures of 3He and 4He.

Submitted 19 October, 2010; originally announced October 2010.

Comments: 4 pages, 2 figures

Journal ref: Phys. Rev. Lett. 106, 055301 (2011)

arXiv:1002.0047 [pdf, ps, other]

Structure, classification, and conformal symmetry of elementary particles over non-archimedean space-time

Authors: V. S. Varadarajan, Jukka T. Virtanen

Abstract: It is well known that at distances shorter than Planck length, no length measurements are possible. The Volovich hypothesis asserts that at sub-Planckian distances and times, spacetime itself has a non-Archimedean geometry. We discuss the structure of elementary particles, their classification, and their conformal symmetry under this hypothesis. Specifically, we investigate the projective representations of the $p$-adic Poincaré and Galilean groups, using a new variant of the Mackey machine for projective unitary representations of semidirect products of locally compact and second countable (lcsc) groups. We construct the conformal spacetime over $p$-adic fields and discuss the imbedding of the $p$-adic Poincaré group into the $p$-adic conformal group. Finally, we show that the massive and so-called eventually massive particles of the Poincaré group do not have conformal symmetry. The whole picture bears a close resemblance to what happens over the field of real numbers, but with some significant variations.

Submitted 30 January, 2010; originally announced February 2010.

MSC Class: 22E50; 22E70; 20C35; 81R05

Showing 51–100 of 100 results for author: Virtanen, T