
Showing 51–100 of 100 results for author: Virtanen, T

  1. arXiv:2007.02676  [pdf, other]

    eess.AS cs.LG cs.SD

    Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning

    Authors: Khoa Nguyen, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. Though, the length of the textual description is considerably less than the length of the audio…

    Submitted 6 July, 2020; originally announced July 2020.

  2. arXiv:2006.08386  [pdf, other]

    cs.LG cs.IR eess.AS stat.ML

    COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

    Authors: Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

    Abstract: Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features. For achieving high performance, DNNs often need a large amount of annotated data which can be difficult and costly to obtain. In this paper, we propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags. A…

    Submitted 8 July, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

    Comments: 8 pages, 1 figure, workshop on Self-supervision in Audio and Speech at the 37th International Conference on Machine Learning (ICML), 2020, Vienna, Austria

  3. arXiv:2006.01919  [pdf, other]

    eess.AS cs.SD

    A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection

    Authors: Archontis Politis, Sharath Adavanne, Tuomas Virtanen

    Abstract: This report presents the dataset and the evaluation setup of the Sound Event Localization & Detection (SELD) task for the DCASE 2020 Challenge. The SELD task refers to the problem of trying to simultaneously classify a known set of sound event classes, detect their temporal activations, and estimate their spatial directions or locations while they are active. To train and test SELD systems, datase…

    Submitted 27 June, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

  4. arXiv:2005.14623  [pdf, other]

    eess.AS

    Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions

    Authors: Toni Heittola, Annamaria Mesaros, Tuomas Virtanen

    Abstract: This paper presents the details of Task 1: Acoustic Scene Classification in the DCASE 2020 Challenge. The task consists of two subtasks: classification of data from multiple devices, requiring good generalization properties, and classification using low-complexity solutions. Here we describe the datasets and baseline systems. After the challenge submission deadline, challenge results and analysis…

    Submitted 2 November, 2020; v1 submitted 29 May, 2020; originally announced May 2020.

    Comments: published in DCASE 2020 Workshop

  5. arXiv:2002.05033  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Active Learning for Sound Event Detection

    Authors: Shuyang Zhao, Toni Heittola, Tuomas Virtanen

    Abstract: This paper proposes an active learning system for sound event detection (SED). It aims at maximizing the accuracy of a learned SED model with limited annotation effort. The proposed system analyzes an initially unlabeled audio dataset, from which it selects sound segments for manual annotation. The candidate segments are generated based on a proposed change point detection approach, and the select…

    Submitted 9 September, 2020; v1 submitted 12 February, 2020; originally announced February 2020.

  6. arXiv:2002.00476  [pdf, other]

    cs.SD cs.LG eess.AS

    Sound Event Detection with Depthwise Separable and Dilated Convolutions

    Authors: Konstantinos Drossos, Stylianos I. Mimilakis, Shayan Gharib, Yanxiong Li, Tuomas Virtanen

    Abstract: State-of-the-art sound event detection (SED) methods usually employ a series of convolutional neural networks (CNNs) to extract useful features from the input audio signal, and then recurrent neural networks (RNNs) to model longer temporal context in the extracted features. The number of the channels of the CNNs and size of the weight matrices of the RNNs have a direct effect on the total amount o…

    Submitted 2 February, 2020; originally announced February 2020.

  7. arXiv:1911.10888  [pdf]

    eess.AS

    Sound event detection via dilated convolutional recurrent neural networks

    Authors: Yanxiong Li, Mingle Liu, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Convolutional recurrent neural networks (CRNNs) have achieved state-of-the-art performance for sound event detection (SED). In this paper, we propose to use a dilated CRNN, namely a CRNN with a dilated convolutional kernel, as the classifier for the task of SED. We investigate the effectiveness of dilation operations which provide a CRNN with expanded receptive fields to capture long temporal cont…

    Submitted 20 July, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

    Comments: 5 pages, 3 tables and 3 figures

  8. arXiv:1911.07098  [pdf, other]

    cs.SD eess.AS

    VOICe: A Sound Event Detection Dataset For Generalizable Domain Adaptation

    Authors: Shayan Gharib, Konstantinos Drossos, Eemi Fagerlund, Tuomas Virtanen

    Abstract: The performance of sound event detection methods can significantly degrade when they are used in unseen conditions (e.g. recording devices, ambient noise). Domain adaptation is a promising way to tackle this problem. In this paper, we present VOICe, the first dataset for the development and evaluation of domain adaptation methods for sound event detection. VOICe consists of mixtures with three dif…

    Submitted 25 November, 2019; v1 submitted 16 November, 2019; originally announced November 2019.

    Comments: Fixed the footnote at the abstract

  9. Online Spectrogram Inversion for Low-Latency Audio Source Separation

    Authors: Paul Magron, Tuomas Virtanen

    Abstract: Audio source separation is usually achieved by estimating the short-time Fourier transform (STFT) magnitude of each source, and then applying a spectrogram inversion algorithm to retrieve time-domain signals. In particular, the multiple input spectrogram inversion (MISI) algorithm has been exploited successfully in several recent works. However, this algorithm suffers from two drawbacks, which we…

    Submitted 24 February, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

  10. arXiv:1911.00527  [pdf, other]

    eess.AS cs.LG cs.PF cs.SD

    Memory Requirement Reduction of Deep Neural Networks Using Low-bit Quantization of Parameters

    Authors: Niccoló Nicodemo, Gaurav Naithani, Konstantinos Drossos, Tuomas Virtanen, Roberto Saletti

    Abstract: Effective employment of deep neural networks (DNNs) in mobile devices and embedded systems is hampered by requirements for memory and computational power. This paper presents a non-uniform quantization approach which allows for dynamic quantization of DNN parameters for different layers and within the same layer. A virtual bit shift (VBS) scheme is also proposed to improve the accuracy of the prop…

    Submitted 1 November, 2019; originally announced November 2019.

  11. arXiv:1910.09387  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    Clotho: An Audio Captioning Dataset

    Authors: Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen

    Abstract: Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and…

    Submitted 21 October, 2019; originally announced October 2019.

  12. arXiv:1907.09238  [pdf, other]

    cs.SD eess.AS

    Crowdsourcing a Dataset of Audio Captions

    Authors: Samuel Lipping, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. "people talking in a big room"). The creation of a dataset for this task requires a considerable amount of work, rendering the crowdsourcing a very attractive option. In this paper we present a three steps based framework for crowdsourcing an aud…

    Submitted 22 July, 2019; originally announced July 2019.

  13. arXiv:1907.08506  [pdf, other]

    cs.SD cs.LG eess.AS

    Language Modelling for Sound Event Detection with Teacher Forcing and Scheduled Sampling

    Authors: Konstantinos Drossos, Shayan Gharib, Paul Magron, Tuomas Virtanen

    Abstract: A sound event detection (SED) method typically takes as an input a sequence of audio frames and predicts the activities of sound events in each frame. In real-life recordings, the sound events exhibit some temporal structure: for instance, a "car horn" will likely be followed by a "car passing by". While this temporal structure is widely exploited in sequence prediction tasks (e.g., in machine tra…

    Submitted 6 November, 2019; v1 submitted 19 July, 2019; originally announced July 2019.

    Comments: Fixed the display of URLs at footnote, updated the results

  14. arXiv:1905.08546  [pdf, other]

    cs.SD eess.AS

    A multi-room reverberant dataset for sound event localization and detection

    Authors: Sharath Adavanne, Archontis Politis, Tuomas Virtanen

    Abstract: This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge. The goal of the SELD task is to detect the temporal activities of a known set of sound event classes, and further localize them in space when active. As part of the challenge, a synthesized dataset with each sound event associated with a spatial coordinate represented using azimuth and el…

    Submitted 24 May, 2019; v1 submitted 21 May, 2019; originally announced May 2019.

  15. arXiv:1905.01926  [pdf]

    cs.LG cs.SD eess.AS stat.ML

    Zero-Shot Audio Classification Based on Class Label Embeddings

    Authors: Huang Xie, Tuomas Virtanen

    Abstract: This paper proposes a zero-shot learning approach for audio classification based on the textual information about class labels without any audio samples from target classes. We propose an audio classification system built on the bilinear model, which takes audio feature embeddings and semantic class label embeddings as input, and measures the compatibility between an audio feature embedding and a…

    Submitted 7 August, 2019; v1 submitted 6 May, 2019; originally announced May 2019.

    Comments: 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

  16. arXiv:1905.00979  [pdf, other]

    eess.AS cs.SD

    City classification from multiple real-world sound scenes

    Authors: Helen L. Bear, Toni Heittola, Annamaria Mesaros, Emmanouil Benetos, Tuomas Virtanen

    Abstract: The majority of sound scene analysis work focuses on one of two clearly defined tasks: acoustic scene classification or sound event detection. Whilst this separation of tasks is useful for problem definition, they inherently ignore some subtleties of the real-world, in particular how humans vary in how they describe a scene. Some will describe the weather and features within it, others will use a…

    Submitted 29 July, 2019; v1 submitted 2 May, 2019; originally announced May 2019.

    Comments: Accepted to WASPAA 2019

  17. arXiv:1905.00078  [pdf, other]

    cs.SD eess.AS stat.ML

    Deep Learning for Audio Signal Processing

    Authors: Hendrik Purwins, Bo Li, Tuomas Virtanen, Jan Schlüter, Shuo-yiin Chang, Tara Sainath

    Abstract: Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fer…

    Submitted 25 May, 2019; v1 submitted 30 April, 2019; originally announced May 2019.

    Comments: 15 pages, 2 pdf figures

    ACM Class: I.2.6; H.5.1

    Journal ref: Journal of Selected Topics of Signal Processing 14, No. 8 (2019)

  18. arXiv:1904.12769  [pdf, other]

    cs.SD cs.LG eess.AS

    Localization, Detection and Tracking of Multiple Moving Sound Sources with a Convolutional Recurrent Neural Network

    Authors: Sharath Adavanne, Archontis Politis, Tuomas Virtanen

    Abstract: This paper investigates the joint localization, detection, and tracking of sound events using a convolutional recurrent neural network (CRNN). We use a CRNN previously proposed for the localization and detection of stationary sources, and show that the recurrent layers enable the spatial tracking of moving sources when trained with dynamic scenes. The tracking performance of the CRNN is compared w…

    Submitted 29 April, 2019; originally announced April 2019.

  19. arXiv:1904.10678  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Unsupervised Adversarial Domain Adaptation Based On The Wasserstein Distance For Acoustic Scene Classification

    Authors: Konstantinos Drossos, Paul Magron, Tuomas Virtanen

    Abstract: A challenging problem in deep learning-based machine listening field is the degradation of the performance when using data from unseen conditions. In this paper we focus on the acoustic scene classification (ASC) task and propose an adversarial deep learning method to allow adapting an acoustic scene classification system to deal with a new acoustic channel resulting from data captured with a diff…

    Submitted 6 November, 2019; v1 submitted 24 April, 2019; originally announced April 2019.

    Comments: Updated indices at Eq 6

  20. arXiv:1902.07033  [pdf, other]

    cs.SD eess.AS

    Low-Latency Deep Clustering For Speech Separation

    Authors: Shanshan Wang, Gaurav Naithani, Tuomas Virtanen

    Abstract: This paper proposes a low algorithmic latency adaptation of the deep clustering approach to speaker-independent speech separation. It consists of three parts: a) the usage of long-short-term-memory (LSTM) networks instead of their bidirectional variant used in the original work, b) using a short synthesis window (here 8 ms) required for low-latency operation, and, c) using a buffer in the beginnin…

    Submitted 19 February, 2019; originally announced February 2019.

    Comments: To appear in ICASSP 2019

  21. arXiv:1808.05777  [pdf, other]

    eess.AS cs.LG cs.SD

    Unsupervised adversarial domain adaptation for acoustic scene classification

    Authors: Shayan Gharib, Konstantinos Drossos, Emre Çakir, Dmitriy Serdyuk, Tuomas Virtanen

    Abstract: A general problem in acoustic scene classification task is the mismatched conditions between training and testing data, which significantly reduces the performance of the developed methods on classification accuracy. As a countermeasure, we present the first method of unsupervised adversarial domain adaptation for acoustic scene classification. We employ a model pre-trained on data from one set of…

    Submitted 17 August, 2018; originally announced August 2018.

  22. arXiv:1808.02357  [pdf, other]

    eess.AS cs.CV cs.LG cs.SD stat.ML

    Acoustic Scene Classification: A Competition Review

    Authors: Shayan Gharib, Honain Derrar, Daisuke Niizumi, Tuukka Senttula, Janne Tommola, Toni Heittola, Tuomas Virtanen, Heikki Huttunen

    Abstract: In this paper we study the problem of acoustic scene classification, i.e., categorization of audio sequences into mutually exclusive classes based on their spectral content. We describe the methods and results discovered during a competition organized in the context of a graduate machine learning course; both by the students and external participants. We identify the most suitable methods and stud…

    Submitted 2 August, 2018; originally announced August 2018.

    Comments: This work has been accepted in IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2018). Copyright may be transferred without notice, after which this version may no longer be accessible

  23. arXiv:1807.11298  [pdf, other]

    cs.SD eess.AS

    Harmonic-Percussive Source Separation with Deep Neural Networks and Phase Recovery

    Authors: Konstantinos Drossos, Paul Magron, Stylianos Ioannis Mimilakis, Tuomas Virtanen

    Abstract: Harmonic/percussive source separation (HPSS) consists in separating the pitched instruments from the percussive parts in a music mixture. In this paper, we propose to apply the recently introduced Masker-Denoiser with twin networks (MaD TwinNet) system to this task. MaD TwinNet is a deep learning architecture that has reached state-of-the-art results in monaural singing voice separation. Herein, w…

    Submitted 30 July, 2018; originally announced July 2018.

  24. arXiv:1807.09840  [pdf, other]

    eess.AS cs.SD

    A multi-device dataset for urban acoustic scene classification

    Authors: Annamaria Mesaros, Toni Heittola, Tuomas Virtanen

    Abstract: This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set cl…

    Submitted 11 October, 2018; v1 submitted 25 July, 2018; originally announced July 2018.

    Comments: accepted to DCASE 2018 Workshop

  25. arXiv:1807.06899  [pdf, other]

    cs.SD eess.AS

    Deep neural network based speech separation optimizing an objective estimator of intelligibility for low latency applications

    Authors: Gaurav Naithani, Joonas Nikunen, Lars Bramsløw, Tuomas Virtanen

    Abstract: Mean square error (MSE) has been the preferred choice as loss function in the current deep neural network (DNN) based speech separation techniques. In this paper, we propose a new cost function with the aim of optimizing the extended short time objective intelligibility (ESTOI) measure. We focus on applications where low algorithmic latency ($\leq 10$ ms) is important. We use long short-term memor…

    Submitted 18 July, 2018; originally announced July 2018.

    Comments: To appear at International Workshop on Acoustic Signal Enhancement (IWAENC) 2018

  26. Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

    Authors: Sharath Adavanne, Archontis Politis, Joonas Nikunen, Tuomas Virtanen

    Abstract: In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-labe…

    Submitted 17 December, 2018; v1 submitted 30 June, 2018; originally announced July 2018.

    Comments: Published in Journal of Selected Topics in Signal Processing 2018

  27. arXiv:1805.03647  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    End-to-End Polyphonic Sound Event Detection Using Convolutional Recurrent Neural Networks with Learned Time-Frequency Representation Input

    Authors: Emre Çakır, Tuomas Virtanen

    Abstract: Sound event detection systems typically consist of two stages: extracting hand-crafted features from the raw audio waveform, and learning a mapping between these features and the target sound events using a classifier. Recently, the focus of sound event detection research has been mostly shifted to the latter stage using standard features such as mel spectrogram as the input for classifiers such a…

    Submitted 9 May, 2018; originally announced May 2018.

    Comments: accepted to IJCNN 2018

  28. arXiv:1802.05132  [pdf, ps, other]

    eess.AS cs.SD

    Close Miking Empirical Practice Verification: A Source Separation Approach

    Authors: Konstantinos Drossos, Stylianos Ioannis Mimilakis, Andreas Floros, Tuomas Virtanen, Gerald Schuller

    Abstract: Close miking represents a widely employed practice of placing a microphone very near to the sound source in order to capture more direct sound and minimize any pickup of ambient sound, including other, concurrently active sources. It is used by the audio engineering community for decades for audio recording, based on a number of empirical rules that were evolved during the recording practice itsel…

    Submitted 13 February, 2018; originally announced February 2018.

    Journal ref: In Proceedings of the 142nd Audio Engineering Society Convention, Berlin, Germany, 2017

  29. arXiv:1802.03156  [pdf, ps, other]

    cs.SD eess.AS

    Complex ISNMF: a Phase-Aware Model for Monaural Audio Source Separation

    Authors: Paul Magron, Tuomas Virtanen

    Abstract: This paper introduces a phase-aware probabilistic model for audio source separation. Classical source models in the short-term Fourier transform domain use circularly-symmetric Gaussian or Poisson random variables. This is equivalent to assuming that the phase of each source is uniformly distributed, which is not suitable for exploiting the underlying structure of the phase. Drawing on preliminary…

    Submitted 30 September, 2018; v1 submitted 9 February, 2018; originally announced February 2018.

  30. arXiv:1802.00300  [pdf, other]

    cs.SD eess.AS

    MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation

    Authors: Konstantinos Drossos, Stylianos Ioannis Mimilakis, Dmitriy Serdyuk, Gerald Schuller, Tuomas Virtanen, Yoshua Bengio

    Abstract: Monaural singing voice separation task focuses on the prediction of the singing voice from a single channel music mixture signal. Current state of the art (SOTA) results in monaural singing voice separation are obtained with deep learning based methods. In this work we present a novel deep learning based method that learns long-term temporal patterns and structures of a musical piece. We build upo…

    Submitted 1 February, 2018; originally announced February 2018.

  31. arXiv:1801.09522  [pdf, other]

    cs.SD cs.LG eess.AS

    Multichannel Sound Event Detection Using 3D Convolutional Neural Networks for Learning Inter-channel Features

    Authors: Sharath Adavanne, Archontis Politis, Tuomas Virtanen

    Abstract: In this paper, we propose a stacked convolutional and recurrent neural network (CRNN) with a 3D convolutional neural network (CNN) in the first layer for the multichannel sound event detection (SED) task. The 3D CNN enables the network to simultaneously learn the inter- and intra-channel features from the input multichannel audio. In order to evaluate the proposed method, multichannel audio datase…

    Submitted 29 January, 2018; originally announced January 2018.

  32. arXiv:1711.01437  [pdf, other]

    cs.SD eess.AS

    Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask

    Authors: Stylianos Ioannis Mimilakis, Konstantinos Drossos, João F. Santos, Gerald Schuller, Tuomas Virtanen, Yoshua Bengio

    Abstract: Singing voice separation based on deep learning relies on the usage of time-frequency masking. In many cases the masking process is not a learnable function or is not encapsulated into the deep learning optimization. Consequently, most of the existing methods rely on a post processing step using the generalized Wiener filtering. This work proposes a method that learns and optimizes (during trainin…

    Submitted 13 February, 2018; v1 submitted 4 November, 2017; originally announced November 2017.

  33. arXiv:1710.10059  [pdf, other]

    cs.SD cs.LG eess.AS

    Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network

    Authors: Sharath Adavanne, Archontis Politis, Tuomas Virtanen

    Abstract: This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all t…

    Submitted 5 August, 2018; v1 submitted 27 October, 2017; originally announced October 2017.

    Comments: EUSIPCO 2018

  34. arXiv:1710.10005  [pdf, other]

    cs.SD eess.AS

    Separation of Moving Sound Sources Using Multichannel NMF and Acoustic Tracking

    Authors: Joonas Nikunen, Aleksandr Diment, Tuomas Virtanen

    Abstract: In this paper we propose a method for separation of moving sound sources. The method is based on first tracking the sources and then estimation of source spectrograms using multichannel non-negative matrix factorization (NMF) and extracting the sources from the mixture by single-channel Wiener filtering. We propose a novel multichannel NMF model with time-varying mixing of the sources denoted by s…

    Submitted 27 October, 2017; originally announced October 2017.

    Comments: Preprint of manuscript submitted to IEEE/ACM Transactions on Audio Speech and Language processing (R1)

  35. arXiv:1710.02998  [pdf, other]

    cs.SD eess.AS

    Sound event detection using weakly labeled dataset with stacked convolutional and recurrent neural network

    Authors: Sharath Adavanne, Tuomas Virtanen

    Abstract: This paper proposes a neural network architecture and training scheme to learn the start and end time of sound events (strong labels) in an audio recording given just the list of sound events existing in the audio without time information (weak labels). We achieve this by using a stacked convolutional and recurrent neural network with two prediction layers in sequence one for the strong followed b…

    Submitted 9 October, 2017; originally announced October 2017.

    Comments: Accepted in Detection and Classification of Acoustic Scenes and Events (DCASE 2017)

  36. arXiv:1710.02997  [pdf, other]

    cs.SD eess.AS

    A report on sound event detection with different binaural features

    Authors: Sharath Adavanne, Tuomas Virtanen

    Abstract: In this paper, we compare the performance of using binaural audio features in place of single-channel features for sound event detection. Three different binaural features are studied and evaluated on the publicly available TUT Sound Events 2017 dataset of length 70 minutes. Sound event detection is performed separately with single-channel and binaural features using stacked convolutional and recu…

    Submitted 9 October, 2017; originally announced October 2017.

    Comments: Technical report for the top performing method in Task 3: Real life sound event detection challenge, at Detection and classification of acoustic scene and events (DCASE) 2017

  37. arXiv:1709.00611  [pdf, other]

    cs.SD

    A Recurrent Encoder-Decoder Approach with Skip-filtering Connections for Monaural Singing Voice Separation

    Authors: Stylianos Ioannis Mimilakis, Konstantinos Drossos, Tuomas Virtanen, Gerald Schuller

    Abstract: The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s). The spectral representations are then used to derive time-frequency masks. In this work we introduce a method to directly learn time-frequency masks from an observed mixture magnitude…

    Submitted 24 April, 2018; v1 submitted 2 September, 2017; originally announced September 2017.

  38. arXiv:1706.10006  [pdf, other]

    cs.SD cs.CL cs.LG

    Automated Audio Captioning with Recurrent Neural Networks

    Authors: Konstantinos Drossos, Sharath Adavanne, Tuomas Virtanen

    Abstract: We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with…

    Submitted 24 October, 2017; v1 submitted 29 June, 2017; originally announced June 2017.

    Comments: Presented at the 11th IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017

  39. arXiv:1706.02293  [pdf, other]

    cs.SD cs.LG

    Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features

    Authors: Sharath Adavanne, Giambattista Parascandolo, Pasi Pertilä, Toni Heittola, Tuomas Virtanen

    Abstract: In this paper, we propose the use of spatial and harmonic features in combination with long short term memory (LSTM) recurrent neural network (RNN) for automatic sound event detection (SED) task. Real life sound recordings typically have many overlapping sound events, making it hard to recognize with just mono channel audio. Human listeners have been successfully recognizing the mixture of overlap…

    Submitted 7 June, 2017; originally announced June 2017.

  40. arXiv:1706.02292  [pdf, other]

    cs.SD cs.LG

    Stacked Convolutional and Recurrent Neural Networks for Music Emotion Recognition

    Authors: Miroslav Malik, Sharath Adavanne, Konstantinos Drossos, Tuomas Virtanen, Dasa Ticha, Roman Jarina

    Abstract: This paper studies the emotion recognition from musical tracks in the 2-dimensional valence-arousal (V-A) emotional space. We propose a method based on convolutional (CNN) and recurrent neural networks (RNN), having significantly fewer parameters compared with the state-of-the-art method for the same task. We utilize one CNN layer followed by two branches of RNNs trained separately for arousal and…

    Submitted 7 June, 2017; originally announced June 2017.

    Comments: Accepted for Sound and Music Computing (SMC 2017)

  41. arXiv:1706.02291  [pdf, other]

    cs.SD cs.LG

    Sound Event Detection Using Spatial Features and Convolutional Recurrent Neural Network

    Authors: Sharath Adavanne, Pasi Pertilä, Tuomas Virtanen

    Abstract: This paper proposes to use low-level spatial features extracted from multichannel audio for sound event detection. We extend the convolutional recurrent neural network to handle more than one type of these multichannel features by learning from each of them separately in the initial stages. We show that instead of concatenating the features of each channel into a single feature vector the network…

    Submitted 7 June, 2017; originally announced June 2017.

    Comments: Accepted for IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017)

  42. arXiv:1706.02047  [pdf, other]

    cs.SD cs.LG

    Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection

    Authors: Sharath Adavanne, Konstantinos Drossos, Emre Çakır, Tuomas Virtanen

    Abstract: This paper studies the detection of bird calls in audio segments using stacked convolutional and recurrent neural networks. Data augmentation by blocks mixing and domain adaptation using a novel method of test mixing are proposed and evaluated in regard to making the method robust to unseen data. The contributions of two kinds of acoustic features (dominant frequency and log mel-band energy) and t…

    Submitted 7 June, 2017; originally announced June 2017.

    Comments: Accepted for European Signal Processing Conference 2017

  43. arXiv:1703.02317  [pdf, other

    cs.SD cs.LG stat.ML

    Convolutional Recurrent Neural Networks for Bird Audio Detection

    Authors: Emre Çakır, Sharath Adavanne, Giambattista Parascandolo, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Bird sounds possess distinctive spectral structure which may exhibit small shifts in spectrum depending on the bird species and environmental conditions. In this paper, we propose using convolutional recurrent neural networks on the task of automated bird audio detection in real-life environments. In the proposed method, convolutional layers extract high dimensional, local frequency shift invarian…

    Submitted 7 March, 2017; originally announced March 2017.

    Comments: Submitted to EUSIPCO 2017 Special Session on Bird Audio Signal Processing

  44. Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

    Authors: Emre Çakır, Giambattista Parascandolo, Toni Heittola, Heikki Huttunen, Tuomas Virtanen

    Abstract: Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNN) are able to extract higher level features that are invariant to local spectral and temporal variations. Recurrent neural networks (RNNs) are powerful in learning the longer term temporal context in the audio signals. CNNs an…

    Submitted 21 February, 2017; originally announced February 2017.

    Comments: Accepted for IEEE Transactions on Audio, Speech and Language Processing, Special Issue on Sound Scene and Event Analysis

  45. arXiv:1610.08444  [pdf, other

    math.AG

    Temperedness of measures defined by polynomial equations over local fields

    Authors: David W. Taylor, V. S. Varadarajan, Jukka T. Virtanen, David E. Weisbart

    Abstract: We investigate the asymptotic growth of the canonical measures on the fibers of morphisms between vector spaces over local fields of arbitrary characteristic. For non-archimedean local fields we use a version of the Łojasiewicz inequality (\cite{lojasiewicz1959}, \cite{hormander1958division}) which follows from Greenberg \cite{greenberg1966rational}, \cite{bollaerts1990estimate}, together with the…

    Submitted 20 November, 2016; v1 submitted 26 October, 2016; originally announced October 2016.

    Comments: Paper read in New Trends in Mathematics and Physics, Conference held in Moscow, Russia, on October 7 2016

  46. Recurrent Neural Networks for Polyphonic Sound Event Detection in Real Life Recordings

    Authors: Giambattista Parascandolo, Heikki Huttunen, Tuomas Virtanen

    Abstract: In this paper we present an approach to polyphonic sound event detection in real life recordings based on bi-directional long short term memory (BLSTM) recurrent neural networks (RNNs). A single multilabel BLSTM RNN is trained to map acoustic features of a mixture signal consisting of sounds from multiple classes, to binary activity indicators of each event class. Our method is tested on a large d…

    Submitted 4 April, 2016; originally announced April 2016.

    Comments: To appear in Proceedings of IEEE ICASSP 2016

  47. arXiv:1101.3528  [pdf, other

    cond-mat.str-el cond-mat.supr-con

    Fermi liquid theory applied to vibrating wire measurements in 3He-4He mixtures

    Authors: Timo H. Virtanen, Erkki Thuneberg

    Abstract: We use Fermi liquid theory to study the mechanical impedance of 3He-4He mixtures at low temperatures. The theory is applied to the case of vibrating wires, immersed in the liquid. We present numerical results based on a direct solution of Landau-Boltzmann equation for the 3He quasiparticle distribution for the full scale of the quasiparticle mean-free-path. The two-fluid nature of mixtures is take…

    Submitted 18 January, 2011; originally announced January 2011.

    Comments: 17 pages, 10 figures

    Journal ref: Phys. Rev. B 83, 224521 (2011)

  48. Fermi liquid theory of Fermi-Bose mixtures

    Authors: E. V. Thuneberg, T. H. Virtanen

    Abstract: We write down the basic equations of Fermi-liquid theory for mixtures of fermions and bosons, an example being 3He-4He mixtures at low temperatures. Basically the theory is identical to the one derived by Khalatnikov, but it is derived in a different way, and includes more discussion. A simplifying transformation of the equations is found where the coupling of the normal and superfluid components…

    Submitted 19 October, 2010; originally announced October 2010.

    Comments: 9 pages, no figures

    Journal ref: Phys. Rev. B 83, 245137 (2011)

  49. Pendulum in Fermi liquid

    Authors: Timo H. Virtanen, Erkki Thuneberg

    Abstract: The Fermi liquid theory formulated by Landau is a basic paradigm of the behavior of an interacting many-body system. We present a new application of this theory to calculate "Landau force" on a macroscopic object. We show that immersing a pendulum in Fermi liquid can increase its oscillation frequency, and evidence of this has been observed in mixtures of 3He and 4He.

    Submitted 19 October, 2010; originally announced October 2010.

    Comments: 4 pages, 2 figures

    Journal ref: Phys. Rev. Lett. 106, 055301 (2011)

  50. arXiv:1002.0047  [pdf, ps, other

    math-ph

    Structure, classification, and conformal symmetry of elementary particles over non-archimedean space-time

    Authors: V. S. Varadarajan, Jukka T. Virtanen

    Abstract: It is well known that at distances shorter than Planck length, no length measurements are possible. The Volovich hypothesis asserts that at sub-Planckian distances and times, spacetime itself has a non-Archimedean geometry. We discuss the structure of elementary particles, their classification, and their conformal symmetry under this hypothesis. Specifically, we investigate the projective repres…

    Submitted 30 January, 2010; originally announced February 2010.

    MSC Class: 22E50; 22E70; 20C35; 81R05