Showing 1–50 of 82 results for author: Virtanen, T

Searching in archive eess.
  1. arXiv:2406.03228  [pdf, other]

    eess.AS

    Reference Channel Selection by Multi-Channel Masking for End-to-End Multi-Channel Speech Enhancement

    Authors: Wang Dai, Xiaofei Li, Archontis Politis, Tuomas Virtanen

    Abstract: In end-to-end multi-channel speech enhancement, the traditional approach of designating one microphone signal as the reference for processing may not always yield optimal results. The limitation is particularly evident in scenarios with large distributed microphone arrays with varying speaker-to-microphone distances or compact, highly directional microphone arrays where speaker or microphone positions cha…

    Submitted 11 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted by EUSIPCO 2024

  2. Speaker Distance Estimation in Enclosures from Single-Channel Audio

    Authors: Michael Neri, Archontis Politis, Daniel Krause, Marco Carli, Tuomas Virtanen

    Abstract: Distance estimation from audio plays a crucial role in various applications, such as acoustic scene analysis, sound source localization, and room modeling. Most studies predominantly center on employing a classification approach, where distances are discretized into distinct categories, enabling smoother model training and achieving higher accuracy but imposing restrictions on the precision of the…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: Accepted for publication in IEEE/ACM Transactions on Audio, Speech, and Language Processing

  3. arXiv:2403.08525  [pdf, other]

    cs.SD cs.LG eess.AS

    From Weak to Strong Sound Event Labels using Adaptive Change-Point Detection and Active Learning

    Authors: John Martinsson, Olof Mogren, Maria Sandsten, Tuomas Virtanen

    Abstract: In this work we propose an audio recording segmentation method based on an adaptive change point detection (A-CPD) for machine guided weak label annotation of audio recording segments. The goal is to maximize the amount of information gained about the temporal activations of the target sounds. For each unlabeled audio recording, we use a prediction model to derive a probability curve used to guid…

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: Under review at EUSIPCO 2024

  4. arXiv:2401.05916  [pdf, other]

    eess.AS cs.SD

    Neural Ambisonics encoding for compact irregular microphone arrays

    Authors: Mikko Heikkinen, Archontis Politis, Tuomas Virtanen

    Abstract: Ambisonics encoding of microphone array signals can enable various spatial audio applications, such as virtual reality or telepresence, but it is typically designed for uniformly-spaced spherical microphone arrays. This paper proposes a method for Ambisonics encoding that uses a deep neural network (DNN) to estimate a signal transform from microphone inputs to Ambisonics signals. The approach uses…

    Submitted 11 January, 2024; originally announced January 2024.

    Comments: Accepted for publication in Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing

  5. arXiv:2312.10756  [pdf, other]

    eess.AS cs.LG eess.SP

    Attention-Driven Multichannel Speech Enhancement in Moving Sound Source Scenarios

    Authors: Yuzhu Wang, Archontis Politis, Tuomas Virtanen

    Abstract: Current multichannel speech enhancement algorithms typically assume a stationary sound source, a common mismatch with reality that limits their performance in real-world scenarios. This paper focuses on attention-driven spatial filtering techniques designed for dynamic settings. Specifically, we study the application of linear and nonlinear attention-based methods for estimating time-varying spati…

    Submitted 17 December, 2023; originally announced December 2023.

  6. arXiv:2310.16550  [pdf, other]

    cs.SD eess.AS

    Dynamic Processing Neural Network Architecture For Hearing Loss Compensation

    Authors: Szymon Drgas, Lars Bramsløw, Archontis Politis, Gaurav Naithani, Tuomas Virtanen

    Abstract: This paper proposes neural networks for compensating sensorineural hearing loss. The aim of the hearing loss compensation task is to transform a speech signal to increase speech intelligibility after further processing by a person with a hearing impairment, which is modeled by a hearing loss model. We propose an interpretable model called dynamic processing network, which has a structure similar t…

    Submitted 25 October, 2023; originally announced October 2023.

  7. arXiv:2308.04960  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Representation Learning for Audio Privacy Preservation using Source Separation and Robust Adversarial Learning

    Authors: Diep Luong, Minh Tran, Shayan Gharib, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Privacy preservation has long been a concern in smart acoustic monitoring systems, where speech can be passively recorded along with a target signal in the system's operating environment. In this study, we propose the integration of two commonly used approaches in privacy preservation: source separation and adversarial representation learning. The proposed system learns the latent representation o…

    Submitted 9 August, 2023; originally announced August 2023.

  8. arXiv:2306.09820  [pdf, other]

    eess.AS cs.SD

    Crowdsourcing and Evaluating Text-Based Audio Retrieval Relevances

    Authors: Huang Xie, Khazar Khorrami, Okko Räsänen, Tuomas Virtanen

    Abstract: This paper explores grading text-based audio retrieval relevances with crowdsourcing assessments. Given a free-form text (e.g., a caption) as a query, crowdworkers are asked to grade audio clips using numeric scores (between 0 and 100) to indicate their judgements of how much the sound content of an audio clip matches the text, where 0 indicates no content match at all and 100 indicates perfect co…

    Submitted 15 August, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

    Comments: Accepted at DCASE 2023 Workshop

  9. arXiv:2306.09126  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS eess.IV

    STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

    Authors: Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji

    Abstract: While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information…

    Submitted 14 November, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: 27 pages, 9 figures, accepted for publication in NeurIPS 2023 Track on Datasets and Benchmarks

  10. arXiv:2306.08510  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Permutation Invariant Recurrent Neural Networks for Sound Source Tracking Applications

    Authors: David Diaz-Guerra, Archontis Politis, Antonio Miguel, Jose R. Beltran, Tuomas Virtanen

    Abstract: Many multi-source localization and tracking models based on neural networks use one or several recurrent layers at their final stages to track the movement of the sources. Conventional recurrent neural networks (RNNs), such as the long short-term memories (LSTMs) or the gated recurrent units (GRUs), take a vector as their input and use another vector to store their state. However, this approach re…

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: Accepted for publication at Forum Acusticum 2023

  11. Simultaneous or Sequential Training? How Speech Representations Cooperate in a Multi-Task Self-Supervised Learning System

    Authors: Khazar Khorrami, María Andrea Cruz Blandón, Tuomas Virtanen, Okko Räsänen

    Abstract: Speech representation learning with self-supervised algorithms has resulted in notable performance boosts in many downstream tasks. Recent work combined self-supervised learning (SSL) and visually grounded speech (VGS) processing mechanisms for representation learning. The joint training with SSL and VGS mechanisms provides the opportunity to utilize both unlabeled speech and speech-related visual…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: 5 pages, accepted by EUSIPCO 2023

  12. arXiv:2305.19769  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Attention-Based Methods For Audio Question Answering

    Authors: Parthasaarathy Sudarsanam, Tuomas Virtanen

    Abstract: Audio question answering (AQA) is the task of producing natural language answers when a system is provided with audio and natural language questions. In this paper, we propose neural network architectures based on self-attention and cross-attention for the AQA task. The self-attention layers extract powerful audio and textual representations. The cross-attention maps audio features that are releva…

    Submitted 31 May, 2023; originally announced May 2023.

  13. arXiv:2305.18045  [pdf, ps, other]

    cs.SD cs.MM eess.AS

    Few-shot Class-incremental Audio Classification Using Adaptively-refined Prototypes

    Authors: Wei Xie, Yanxiong Li, Qianhua He, Wenchang Cao, Tuomas Virtanen

    Abstract: New classes of sounds constantly emerge with a few samples, making it challenging for models to adapt to dynamic acoustic environments. This challenge motivates us to address the new problem of few-shot class-incremental audio classification. This study aims to enable a model to continuously recognize new classes of sounds with a few training samples of new classes while remembering the learned on…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: 5 pages, 2 figures, accepted by Interspeech 2023

  14. arXiv:2305.00011  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Adversarial Representation Learning for Robust Privacy Preservation in Audio

    Authors: Shayan Gharib, Minh Tran, Diep Luong, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Sound event detection systems are widely used in various applications such as surveillance and environmental monitoring where data is automatically collected, processed, and sent to a cloud for sound recognition. However, this process may inadvertently reveal sensitive information about users or their surroundings, hence raising privacy concerns. In this study, we propose a novel adversarial train…

    Submitted 3 January, 2024; v1 submitted 29 April, 2023; originally announced May 2023.

    Comments: Published in IEEE Open Journal of Signal Processing

  15. arXiv:2303.07816  [pdf, other]

    eess.AS cs.SD

    Multi-Channel Masking with Learnable Filterbank for Sound Source Separation

    Authors: Wang Dai, Archontis Politis, Tuomas Virtanen

    Abstract: This work proposes a learnable filterbank based on a multi-channel masking framework for multi-channel source separation. The learnable filterbank is a 1D Conv layer, which transforms the raw waveform into a 2D representation. In contrast to the conventional single-channel masking method, we estimate a mask for each individual microphone channel. The estimated masks are then applied to the transfo…

    Submitted 14 March, 2023; originally announced March 2023.

  16. arXiv:2303.01864  [pdf, ps, other]

    cs.SD eess.AS

    Spectrogram Inversion for Audio Source Separation via Consistency, Mixing, and Magnitude Constraints

    Authors: Paul Magron, Tuomas Virtanen

    Abstract: Audio source separation is often achieved by estimating the magnitude spectrogram of each source, and then applying a phase recovery (or spectrogram inversion) algorithm to retrieve time-domain signals. Typically, spectrogram inversion is treated as an optimization problem involving one or several terms in order to promote estimates that comply with a consistency property, a mixing constraint, and…

    Submitted 30 June, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

  17. arXiv:2211.04070  [pdf, other]

    eess.AS cs.SD

    On Negative Sampling for Contrastive Audio-Text Retrieval

    Authors: Huang Xie, Okko Räsänen, Tuomas Virtanen

    Abstract: This paper investigates negative sampling for contrastive learning in the context of audio-text retrieval. The strategy for negative sampling refers to selecting negatives (either audio clips or textual descriptions) from a pool of candidates for a positive audio-text pair. We explore sampling strategies via model-estimated within-modality and cross-modality relevance scores for audio and text sam…

    Submitted 17 February, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

    Comments: Accepted at ICASSP2023

  18. arXiv:2210.14536  [pdf, ps, other]

    eess.AS cs.LG cs.SD eess.SP

    Position tracking of a varying number of sound sources with sliding permutation invariant training

    Authors: David Diaz-Guerra, Archontis Politis, Tuomas Virtanen

    Abstract: Recent data- and learning-based sound source localization (SSL) methods have shown strong performance in challenging acoustic scenarios. However, little work has been done on adapting such methods to consistently track multiple sources appearing and disappearing, as would occur in reality. In this paper, we present a new training strategy for deep learning SSL models with a straightforward impleme…

    Submitted 5 June, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

    Comments: Accepted for publication at the 31st European Signal Processing Conference (EUSIPCO 2023)

  19. arXiv:2209.09967   

    eess.AS cs.SD

    Language-based Audio Retrieval Task in DCASE 2022 Challenge

    Authors: Huang Xie, Samuel Lipping, Tuomas Virtanen

    Abstract: Language-based audio retrieval is a task, where natural language textual captions are used as queries to retrieve audio signals from a dataset. It has been first introduced into DCASE 2022 Challenge as Subtask 6B of task 6, which aims at developing computational systems to model relationships between audio signals and free-form textual descriptions. Compared with audio captioning (Subtask 6A), whi…

    Submitted 4 October, 2022; v1 submitted 20 September, 2022; originally announced September 2022.

    Comments: Update for arXiv:2206.06108 mistakenly submitted as a new article

  20. arXiv:2208.05057  [pdf, other]

    cs.SD cs.MM eess.AS

    Subjective Evaluation of Deep Neural Network Based Speech Enhancement Systems in Real-World Conditions

    Authors: Gaurav Naithani, Kirsi Pietilä, Riitta Niemistö, Erkki Paajanen, Tero Takala, Tuomas Virtanen

    Abstract: Subjective evaluation results for two low-latency deep neural networks (DNN) are compared to a matured version of a traditional Wiener-filter based noise suppressor. The target use-case is real-world single-channel speech enhancement applications, e.g., communications. Real-world recordings consisting of additive stationary and non-stationary noise types are included. The evaluation is divided int…

    Submitted 14 August, 2022; v1 submitted 9 August, 2022; originally announced August 2022.

    Comments: Accepted for publication in IEEE MMSP 2022

  21. arXiv:2208.02406  [pdf]

    eess.AS cs.SD

    Domestic Activity Clustering from Audio via Depthwise Separable Convolutional Autoencoder Network

    Authors: Yanxiong Li, Wenchang Cao, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Automatic estimation of domestic activities from audio can be used to solve many problems, such as reducing the labor cost of nursing elderly people. This study focuses on solving the problem of domestic activity clustering from audio. The target of domestic activity clustering is to cluster audio clips which belong to the same category of domestic activity into one cluster in an unsupervised…

    Submitted 3 August, 2022; originally announced August 2022.

    Comments: 6 pages, 5 figures, 4 tables. Accepted by IEEE MMSP 2022

  22. arXiv:2206.06108  [pdf, other]

    eess.AS

    Language-based Audio Retrieval Task in DCASE 2022 Challenge

    Authors: Huang Xie, Samuel Lipping, Tuomas Virtanen

    Abstract: Language-based audio retrieval is a task, where natural language textual captions are used as queries to retrieve audio signals from a dataset. It has been first introduced into DCASE 2022 Challenge as Subtask 6B of task 6, which aims at developing computational systems to model relationships between audio signals and free-form textual descriptions. Compared with audio captioning (Subtask 6A), whi…

    Submitted 30 September, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

    Comments: Accepted at DCASE 2022 Workshop

  23. arXiv:2206.04984  [pdf, other]

    cs.SD cs.LG eess.AS

    Zero-Shot Audio Classification using Image Embeddings

    Authors: Duygu Dogan, Huang Xie, Toni Heittola, Tuomas Virtanen

    Abstract: Supervised learning methods can solve the given problem in the presence of a large set of labeled data. However, the acquisition of a dataset covering all the target classes typically requires manual labeling which is expensive and time-consuming. Zero-shot learning models are capable of classifying the unseen concepts by utilizing their semantic information. The present study introduces image emb…

    Submitted 10 June, 2022; originally announced June 2022.

    Comments: Accepted to the European Signal Processing Conference (EUSIPCO) 2022

  24. arXiv:2206.03835  [pdf, other]

    eess.AS

    Low-complexity acoustic scene classification in DCASE 2022 Challenge

    Authors: Irene Martín-Morató, Francesco Paissan, Alberto Ancilotto, Toni Heittola, Annamaria Mesaros, Elisabetta Farella, Alessio Brutti, Tuomas Virtanen

    Abstract: This paper presents an analysis of the Low-Complexity Acoustic Scene Classification task in DCASE 2022 Challenge. The task was a continuation from the previous years, but the low-complexity requirements were changed to the following: the maximum number of allowed parameters, including the zero-valued ones, was 128 K, with parameters being represented using INT8 numerical format; and the maximum nu…

    Submitted 13 July, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

  25. arXiv:2206.01948  [pdf, other]

    eess.AS cs.SD

    STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

    Authors: Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen

    Abstract: This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset for sound event localization and detection, comprised of spatial recordings of real scenes collected in various interiors of two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone arr…

    Submitted 2 September, 2022; v1 submitted 4 June, 2022; originally announced June 2022.

  26. Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment

    Authors: Shanshan Wang, Archontis Politis, Annamaria Mesaros, Tuomas Virtanen

    Abstract: Learning from audio-visual data offers many possibilities to express correspondence between the audio and visual content, similar to the human perception that relates aural and visual information. In this work, we present a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than the audio-visual correspondence (AVC…

    Submitted 2 June, 2022; originally announced June 2022.

  27. arXiv:2204.09634  [pdf, other]

    cs.SD cs.LG eess.AS

    Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering

    Authors: Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Audio question answering (AQA) is a multimodal translation task where a system analyzes an audio signal and a natural language question, to generate a desirable natural language answer. In this paper, we introduce Clotho-AQA, a dataset for Audio question answering consisting of 1991 audio files each between 15 and 30 seconds in duration selected from the Clotho dataset. For each audio file, we coll…

    Submitted 17 June, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

  28. arXiv:2111.00030  [pdf, other]

    eess.AS cs.SD

    Differentiable Tracking-Based Training of Deep Learning Sound Source Localizers

    Authors: Sharath Adavanne, Archontis Politis, Tuomas Virtanen

    Abstract: Data-based and learning-based sound source localization (SSL) has shown promising results in challenging conditions, and is commonly set as a classification or a regression problem. Regression-based approaches have certain advantages over classification-based, such as continuous direction-of-arrival estimation of static and moving sources. However, multi-source scenarios require multiple regressor…

    Submitted 29 October, 2021; originally announced November 2021.

    Comments: Submitted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA2021)

  29. arXiv:2110.02939  [pdf, other]

    eess.AS eess.SP

    Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases

    Authors: Huang Xie, Okko Räsänen, Konstantinos Drossos, Tuomas Virtanen

    Abstract: We investigate unsupervised learning of correspondences between sound events and textual phrases through aligning audio clips with textual captions describing the content of a whole audio clip. We align originally unaligned and unannotated audio clips and their captions by scoring the similarities between audio frames and words, as encoded by modality-specific encoders and using a ranking-loss cri…

    Submitted 21 February, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: Accepted at ICASSP 2022

  30. Sound Event Detection: A Tutorial

    Authors: Annamaria Mesaros, Toni Heittola, Tuomas Virtanen, Mark D. Plumbley

    Abstract: The goal of automatic sound event detection (SED) methods is to recognize what is happening in an audio signal and when it is happening. In practice, the goal is to recognize at what temporal instances different sounds are active within an audio signal. This paper gives a tutorial presentation of sound event detection, including its definition, signal processing and machine learning approaches, ev…

    Submitted 12 July, 2021; originally announced July 2021.

    Comments: to appear in IEEE Signal Processing Magazine, Volume 38, Issue 5

  31. arXiv:2106.14787  [pdf, other]

    eess.AS

    Mobile Microphone Array Speech Detection and Localization in Diverse Everyday Environments

    Authors: Pasi Pertilä, Emre Cakir, Aapo Hakala, Eemi Fagerlund, Tuomas Virtanen, Archontis Politis, Antti Eronen

    Abstract: Joint sound event localization and detection (SELD) is an integral part of developing context awareness into communication interfaces of mobile robots, smartphones, and home assistants. For example, an automatic audio focus for video capture on a mobile phone requires robust detection of relevant acoustic events around the device and their direction. Existing SELD approaches have been evaluated us…

    Submitted 28 June, 2021; originally announced June 2021.

    Comments: to be published in the proceedings of the 29th European Signal Processing Conference, EUSIPCO 2021

  32. arXiv:2106.11794  [pdf, other]

    eess.AS cs.SD

    Deep neural network Based Low-latency Speech Separation with Asymmetric analysis-Synthesis Window Pair

    Authors: Shanshan Wang, Gaurav Naithani, Archontis Politis, Tuomas Virtanen

    Abstract: Time-frequency masking or spectrum prediction computed via short symmetric windows are commonly used in low-latency deep neural network (DNN) based source separation. In this paper, we propose the usage of an asymmetric analysis-synthesis window pair which allows for training with targets with better frequency resolution, while retaining the low-latency during inference suitable for real-time spee…

    Submitted 22 June, 2021; originally announced June 2021.

    Comments: Accepted to EUSIPCO-2021

  33. arXiv:2106.06999  [pdf, other]

    eess.AS cs.SD

    A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection

    Authors: Archontis Politis, Sharath Adavanne, Daniel Krause, Antoine Deleforge, Prerak Srivastava, Tuomas Virtanen

    Abstract: This report presents the dataset and baseline of Task 3 of the DCASE2021 Challenge on Sound Event Localization and Detection (SELD). The dataset is based on emulation of real recordings of static or moving sound events under real conditions of reverberation and ambient noise, using spatial room impulse responses captured in a variety of rooms and delivered in two spatial formats. The acoustical sy…

    Submitted 4 July, 2021; v1 submitted 13 June, 2021; originally announced June 2021.

  34. arXiv:2105.13734  [pdf, other]

    eess.AS

    Low-complexity acoustic scene classification for multi-device audio: analysis of DCASE 2021 Challenge systems

    Authors: Irene Martín-Morató, Toni Heittola, Annamaria Mesaros, Tuomas Virtanen

    Abstract: This paper presents the details of Task 1A Acoustic Scene Classification in the DCASE 2021 Challenge. The task targeted development of low-complexity solutions with good generalization properties. The provided baseline system is based on a CNN architecture and post-training quantization of parameters. The system is trained using all the available training data, without any specific technique for h…

    Submitted 20 July, 2021; v1 submitted 28 May, 2021; originally announced May 2021.

  35. arXiv:2105.13675  [pdf, other]

    eess.AS cs.SD

    Audio-visual scene classification: analysis of DCASE 2021 Challenge submissions

    Authors: Shanshan Wang, Toni Heittola, Annamaria Mesaros, Tuomas Virtanen

    Abstract: This paper presents the details of the Audio-Visual Scene Classification task in the DCASE 2021 Challenge (Task 1 Subtask B). The task is concerned with classification using audio and video modalities, using a dataset of synchronized recordings. This task has attracted 43 submissions from 13 different teams around the world. Among all submissions, more than half of the submitted systems have bette…

    Submitted 20 July, 2021; v1 submitted 28 May, 2021; originally announced May 2021.

  36. arXiv:2011.12657  [pdf, other]

    eess.AS

    Zero-Shot Audio Classification with Factored Linear and Nonlinear Acoustic-Semantic Projections

    Authors: Huang Xie, Okko Räsänen, Tuomas Virtanen

    Abstract: In this paper, we study zero-shot learning in audio classification through factored linear and nonlinear acoustic-semantic projections between audio instances and sound classes. Zero-shot learning in audio classification refers to classification problems that aim at recognizing audio instances of sound classes, which have no available training data but only semantic side information. In this paper…

    Submitted 2 February, 2021; v1 submitted 25 November, 2020; originally announced November 2020.

    Comments: Accepted by ICASSP 2021

  37. arXiv:2011.12133  [pdf, other]

    eess.AS

    Zero-Shot Audio Classification via Semantic Embeddings

    Authors: Huang Xie, Tuomas Virtanen

    Abstract: In this paper, we study zero-shot learning in audio classification via semantic embeddings extracted from textual labels and sentence descriptions of sound classes. Our goal is to obtain a classifier that is capable of recognizing audio instances of sound classes that have no available training samples, but only semantic side information. We employ a bilinear compatibility framework to learn an ac…

    Submitted 11 February, 2021; v1 submitted 24 November, 2020; originally announced November 2020.

    Comments: Submitted to Transactions on Audio, Speech and Language Processing

  38. arXiv:2011.00030  [pdf, other]

    eess.AS

    A Curated Dataset of Urban Scenes for Audio-Visual Scene Analysis

    Authors: Shanshan Wang, Annamaria Mesaros, Toni Heittola, Tuomas Virtanen

    Abstract: This paper introduces a curated dataset of urban scenes for audio-visual scene analysis which consists of carefully selected and recorded material. The data was recorded in multiple European cities, using the same equipment, in multiple locations for each scene, and is openly available. We also present a case study for audio-visual scene recognition and show that joint modeling of audio and visual…

    Submitted 11 February, 2021; v1 submitted 30 October, 2020; originally announced November 2020.

    Comments: accepted by ICASSP 2021

  39. arXiv:2010.14171  [pdf, other]

    cs.SD cs.IR cs.LG eess.AS stat.ML

    Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags

    Authors: Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

    Abstract: Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings, capable of being employed in various downstream tasks. Published approaches that consider both audio and words/tags associated with audio do not employ text processing models that are capable of generalizing to tags unknown during training. In this work we propose a method for learni…

    Submitted 27 October, 2020; originally announced October 2020.

    Comments: 5 pages, 1 figure

  40. arXiv:2010.11716  [pdf, other]

    cs.SD cs.LG eess.AS

    Robust Audio-Based Vehicle Counting in Low-to-Moderate Traffic Flow

    Authors: Slobodan Djukanović, Jiři Matas, Tuomas Virtanen

    Abstract: The paper presents a method for audio-based vehicle counting (VC) in low-to-moderate traffic using one-channel sound. We formulate VC as a regression problem, i.e., we predict the distance between a vehicle and the microphone. Minima of the proposed distance function correspond to vehicles passing by the microphone. VC is carried out via local minima detection in the predicted distance. We propose…

    Submitted 22 October, 2020; originally announced October 2020.

    Comments: The paper has been accepted for the IV2020 conference

  41. arXiv:2010.11659  [pdf, other]

    cs.SD cs.LG eess.AS

    Neural Network-based Acoustic Vehicle Counting

    Authors: Slobodan Djukanović, Yash Patel, Jiři Matas, Tuomas Virtanen

    Abstract: This paper addresses acoustic vehicle counting using one-channel audio. We predict the pass-by instants of vehicles from local minima of clipped vehicle-to-microphone distance. This distance is predicted from audio using a two-stage (coarse-fine) regression, with both stages realised via neural networks (NNs). Experiments show that the NN-based distance regression outperforms by far the previously…

    Submitted 27 March, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

  42. arXiv:2010.11098  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information

    Authors: An Tran, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Automated audio captioning (AAC) is a novel task, where a method takes as input an audio sample and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from the image captioning or machine translation fields. In this work we present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio. We…

    Submitted 21 October, 2020; originally announced October 2020.

    Comments: Submitted for review at ICASSP2021

  43. Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019

    Authors: Archontis Politis, Annamaria Mesaros, Sharath Adavanne, Toni Heittola, Tuomas Virtanen

    Abstract: Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge. A large-scale realistic datase…

    Submitted 11 January, 2021; v1 submitted 6 September, 2020; originally announced September 2020.

  44. arXiv:2007.05183  [pdf, other

    cs.SD cs.LG eess.AS

    Conditioned Time-Dilated Convolutions for Sound Event Detection

    Authors: Konstantinos Drossos, Stylianos I. Mimilakis, Tuomas Virtanen

    Abstract: Sound event detection (SED) is the task of identifying sound events along with their onset and offset times. A recent SED method based on convolutional neural networks proposed the use of depthwise separable (DWS) and time-dilated convolutions. DWS and time-dilated convolutions yielded state-of-the-art results for SED, with a considerably small number of parameters. In this work we propose the expa…

    Submitted 10 July, 2020; originally announced July 2020.

  45. arXiv:2007.04660  [pdf, other

    cs.SD cs.LG cs.MM eess.AS

    Multi-task Regularization Based on Infrequent Classes for Audio Captioning

    Authors: Emre Çakır, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the c…

    Submitted 9 July, 2020; originally announced July 2020.

  46. arXiv:2007.02683  [pdf, ps, other

    eess.AS cs.LG cs.SD

    Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation

    Authors: Pyry Pyykkönen, Stylianos I. Mimilakis, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior to other types of deep neural networks for sequence processing, they are known to have specific difficulties in training and parallelization, especially for the typically long sequences encountered in music…

    Submitted 6 July, 2020; originally announced July 2020.

  47. arXiv:2007.02676  [pdf, other

    eess.AS cs.LG cs.SD

    Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning

    Authors: Khoa Nguyen, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. However, the length of the textual description is considerably shorter than the length of the audio…

    Submitted 6 July, 2020; originally announced July 2020.

  48. arXiv:2006.08386  [pdf, other

    cs.LG cs.IR eess.AS stat.ML

    COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

    Authors: Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

    Abstract: Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features. For achieving high performance, DNNs often need a large amount of annotated data which can be difficult and costly to obtain. In this paper, we propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags. A…

    Submitted 8 July, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

    Comments: 8 pages, 1 figure, workshop on Self-supervision in Audio and Speech at the 37th International Conference on Machine Learning (ICML), 2020, Vienna, Austria

  49. arXiv:2006.01919  [pdf, other

    eess.AS cs.SD

    A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection

    Authors: Archontis Politis, Sharath Adavanne, Tuomas Virtanen

    Abstract: This report presents the dataset and the evaluation setup of the Sound Event Localization & Detection (SELD) task for the DCASE 2020 Challenge. The SELD task refers to the problem of trying to simultaneously classify a known set of sound event classes, detect their temporal activations, and estimate their spatial directions or locations while they are active. To train and test SELD systems, datase…

    Submitted 27 June, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

  50. arXiv:2005.14623  [pdf, other

    eess.AS

    Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions

    Authors: Toni Heittola, Annamaria Mesaros, Tuomas Virtanen

    Abstract: This paper presents the details of Task 1: Acoustic Scene Classification in the DCASE 2020 Challenge. The task consists of two subtasks: classification of data from multiple devices, requiring good generalization properties, and classification using low-complexity solutions. Here we describe the datasets and baseline systems. After the challenge submission deadline, challenge results and analysis…

    Submitted 2 November, 2020; v1 submitted 29 May, 2020; originally announced May 2020.

    Comments: published in DCASE 2020 Workshop