arXiv:2406.01253 [pdf, other]

animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics

Authors: Julian C. Schäfer-Zimmermann, Vlad Demartsev, Baptiste Averly, Kiran Dhanjal-Adams, Mathieu Duteil, Gabriella Gall, Marius Faiß, Lily Johnson-Ulrich, Dan Stowell, Marta B. Manser, Marie A. Roch, Ariana Strandburg-Peshkin

Abstract: Bioacoustic research provides invaluable insights into the behavior, ecology, and conservation of animals. Most bioacoustic datasets consist of long recordings where events of interest, such as vocalizations, are exceedingly rare. Analyzing these datasets poses a monumental challenge to researchers, where deep learning techniques have emerged as a standard method. Their adaptation remains challenging, focusing on models conceived for computer vision, where the audio waveforms are engineered into spectrographic representations for training and inference. We improve the current state of deep learning in bioacoustics in two ways: First, we present the animal2vec framework: a fully interpretable transformer model and self-supervised training scheme tailored for sparse and unbalanced bioacoustic data. Second, we openly publish MeerKAT: Meerkat Kalahari Audio Transcripts, a large-scale dataset containing audio collected via biologgers deployed on free-ranging meerkats with a length of over 1068h, of which 184h have twelve time-resolved vocalization-type classes, each with ms-resolution, making it the largest publicly-available labeled dataset on terrestrial mammals. Further, we benchmark animal2vec against the NIPS4Bplus birdsong dataset. We report new state-of-the-art results on both datasets and evaluate the few-shot capabilities of animal2vec of labeled training data. Finally, we perform ablation studies to highlight the differences between our architecture and a vanilla transformer baseline for human-produced sounds. animal2vec allows researchers to classify massive amounts of sparse bioacoustic data even with little ground truth information available. In addition, the MeerKAT dataset is the first large-scale, millisecond-resolution corpus for benchmarking bioacoustic models in the pretrain/finetune paradigm. We believe this sets the stage for a new reference point for bioacoustics.

Submitted 3 June, 2024; originally announced June 2024.

Comments: Code available at: https://github.com/livingingroups/animal2vec | Dataset available at: https://doi.org/10.17617/3.0J0DYB

arXiv:2404.03474 [pdf, other]

Performance of computer vision algorithms for fine-grained classification using crowdsourced insect images

Authors: Rita Pucci, Vincent J. Kalkman, Dan Stowell

Abstract: With fine-grained classification, we identify unique characteristics to distinguish among classes of the same super-class. We are focusing on species recognition in Insecta, as they are critical for biodiversity monitoring and at the base of many ecosystems. With citizen science campaigns, billions of images are collected in the wild. Once these are labelled, experts can use them to create distribution maps. However, the labelling process is time-consuming, which is where computer vision comes in. The field of computer vision offers a wide range of algorithms, each with its strengths and weaknesses; how do we identify the algorithm that is in line with our application? To answer this question, we provide a full and detailed evaluation of nine algorithms among deep convolutional networks (CNN), vision transformers (ViT), and locality-based vision transformers (LBVT) on 4 different aspects: classification performance, embedding quality, computational cost, and gradient activity. We offer insights that we haven't yet had in this domain proving to which extent these algorithms solve the fine-grained tasks in Insecta. We found that the ViT performs the best on inference speed and computational cost while the LBVT outperforms the others on performance and embedding quality; the CNN provide a trade-off among the metrics.

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2312.09269 [pdf, other]

Efficient speech detection in environmental audio using acoustic recognition and knowledge distillation

Authors: Drew Priebe, Burooj Ghani, Dan Stowell

Abstract: The ongoing biodiversity crisis, driven by factors such as land-use change and global warming, emphasizes the need for effective ecological monitoring methods. Acoustic monitoring of biodiversity has emerged as an important monitoring tool. Detecting human voices in soundscape monitoring projects is useful both for analysing human disturbance and for privacy filtering. Despite significant strides in deep learning in recent years, the deployment of large neural networks on compact devices poses challenges due to memory and latency constraints. Our approach focuses on leveraging knowledge distillation techniques to design efficient, lightweight student models for speech detection in bioacoustics. In particular, we employed the MobileNetV3-Small-Pi model to create compact yet effective student architectures to compare against the larger EcoVAD teacher model, a well-regarded voice detection architecture in eco-acoustic monitoring. The comparative analysis included examining various configurations of the MobileNetV3-Small-Pi derived student models to identify optimal performance. Additionally, a thorough evaluation of different distillation techniques was conducted to ascertain the most effective method for model selection. Our findings revealed that the distilled models exhibited comparable performance to the EcoVAD teacher model, indicating a promising approach to overcoming computational barriers for real-time ecological monitoring.

Submitted 14 December, 2023; originally announced December 2023.
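
Illustrative note on the entry above: the abstract does not say which distillation techniques were compared, so the following is only a minimal sketch of the classic soft-target formulation, in which a student is trained to match a teacher's temperature-softened output distribution. The function names, the temperature value, and the example logits are all assumptions made for illustration; nothing here is taken from the paper's code.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between softened distributions, scaled by T^2.

    This is the generic soft-target loss; it is an assumed stand-in, not the
    specific method evaluated in the paper.
    """
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


# Hypothetical two-class logits (speech vs. no speech).
matching = distillation_loss([2.0, -1.0], [2.0, -1.0])
mismatched = distillation_loss([0.0, 0.0], [3.0, -3.0])
```

The loss is zero when student and teacher agree and grows as their softened distributions diverge; in practice it is usually combined with an ordinary supervised loss on the hard labels.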

arXiv:2311.04945 [pdf, other]

Auto deep learning for bioacoustic signals

Authors: Giulio Tosato, Abdelrahman Shehata, Joshua Janssen, Kees Kamp, Pramatya Jati, Dan Stowell

Abstract: This study investigates the potential of automated deep learning to enhance the accuracy and efficiency of multi-class classification of bird vocalizations, compared against traditional manually-designed deep learning models. Using the Western Mediterranean Wetland Birds dataset, we investigated the use of AutoKeras, an automated machine learning framework, to automate neural architecture search and hyperparameter tuning. Comparative analysis validates our hypothesis that the AutoKeras-derived model consistently outperforms traditional models like MobileNet, ResNet50 and VGG16. Our approach and findings underscore the transformative potential of automated deep learning for advancing bioacoustics research and models. In fact, the automated techniques eliminate the need for manual feature engineering and model design while improving performance. This study illuminates best practices in sampling, evaluation and reporting to enhance reproducibility in this nascent field. All the code used is available at https://github.com/giuliotosato/AutoKeras-bioacustic Keywords: AutoKeras; automated deep learning; audio classification; Wetlands Bird dataset; comparative analysis; bioacoustics; validation dataset; multi-class classification; spectrograms.

Submitted 26 December, 2023; v1 submitted 8 November, 2023; originally announced November 2023.

arXiv:2311.01526 [pdf, other]

ATGNN: Audio Tagging Graph Neural Network

Authors: Shubhr Singh, Christian J. Steinmetz, Emmanouil Benetos, Huy Phan, Dan Stowell

Abstract: Deep learning models such as CNNs and Transformers have achieved impressive performance for end-to-end audio tagging. Recent works have shown that despite stacking multiple layers, the receptive field of CNNs remains severely limited. Transformers on the other hand are able to map global context through self-attention, but treat the spectrogram as a sequence of patches which is not flexible enough to capture irregular audio objects. In this work, we treat the spectrogram in a more flexible way by considering it as graph structure and process it with a novel graph neural architecture called ATGNN. ATGNN not only combines the capability of CNNs with the global information sharing ability of Graph Neural Networks, but also maps semantic relationships between learnable class embeddings and corresponding spectrogram regions. We evaluate ATGNN on two audio tagging tasks, where it achieves 0.585 mAP on the FSD50K dataset and 0.335 mAP on the AudioSet-balanced dataset, achieving comparable results to Transformer based models with significantly lower number of learnable parameters.

Submitted 2 November, 2023; originally announced November 2023.

arXiv:2307.11112 [pdf, other]

Comparison between transformers and convolutional models for fine-grained classification of insects

Authors: Rita Pucci, Vincent J. Kalkman, Dan Stowell

Abstract: Fine-grained classification is challenging due to the difficulty of finding discriminatory features. This problem is exacerbated when applied to identifying species within the same taxonomical class. This is because species are often sharing morphological characteristics that make them difficult to differentiate. We consider the taxonomical class of Insecta. The identification of insects is essential in biodiversity monitoring as they are one of the inhabitants at the base of many ecosystems. Citizen science is doing brilliant work of collecting images of insects in the wild giving the possibility to experts to create improved distribution maps in all countries. We have billions of images that need to be automatically classified and deep neural network algorithms are one of the main techniques explored for fine-grained tasks. At the SOTA, the field of deep learning algorithms is extremely fruitful, so how to identify the algorithm to use? We focus on Odonata and Coleoptera orders, and we propose an initial comparative study to analyse the two best-known layer structures for computer vision: transformer and convolutional layers. We compare the performance of T2TViT, a fully transformer-base, EfficientNet, a fully convolutional-base, and ViTAE, a hybrid. We analyse the performance of the three models in identical conditions evaluating the performance per species, per morph together with sex, the inference time, and the overall performance with unbalanced datasets of images from smartphones. Although we observe high performances with all three families of models, our analysis shows that the hybrid model outperforms the fully convolutional-base and fully transformer-base models on accuracy performance and the fully transformer-base model outperforms the others on inference speed and, these prove the transformer to be robust to the shortage of samples and to be faster at inference time.

Submitted 20 July, 2023; originally announced July 2023.

arXiv:2306.09223 [pdf, other]

Few-shot bioacoustic event detection at the DCASE 2023 challenge

Authors: Ines Nolasco, Burooj Ghani, Shubhr Singh, Ester Vidaña-Vila, Helen Whitehead, Emily Grout, Michael Emmerson, Frants Jensen, Ivan Kiskin, Joe Morford, Ariana Strandburg-Peshkin, Lisa Gill, Hanna Pamuła, Vincent Lostanlen, Dan Stowell

Abstract: Few-shot bioacoustic event detection consists in detecting sound events of specified types, in varying soundscapes, while having access to only a few examples of the class of interest. This task ran as part of the DCASE challenge for the third time this year with an evaluation set expanded to include new animal species, and a new rule: ensemble models were no longer allowed. The 2023 few shot task received submissions from 6 different teams with F-scores reaching as high as 63% on the evaluation set. Here we describe the task, focusing on describing the elements that differed from previous years. We also take a look back at past editions to describe how the task has evolved. Not only have the F-score results steadily improved (40% to 60% to 63%), but the type of systems proposed have also become more complex. Sound event detection systems are no longer simple variations of the baselines provided: multiple few-shot learning methodologies are still strong contenders for the task.

Submitted 15 June, 2023; originally announced June 2023.

Comments: submitted to DCASE 2023 workshop

arXiv:2305.13210 [pdf, other]

doi 10.1016/j.ecoinf.2023.102258

Learning to detect an animal sound from five examples

Authors: Inês Nolasco, Shubhr Singh, Veronica Morfi, Vincent Lostanlen, Ariana Strandburg-Peshkin, Ester Vidaña-Vila, Lisa Gill, Hanna Pamuła, Helen Whitehead, Ivan Kiskin, Frants H. Jensen, Joe Morford, Michael G. Emmerson, Elisabetta Versace, Emily Grout, Haohe Liu, Dan Stowell

Abstract: Automatic detection and classification of animal sounds has many applications in biodiversity monitoring and animal behaviour. In the past twenty years, the volume of digitised wildlife sound available has massively increased, and automatic classification through deep learning now shows strong results. However, bioacoustics is not a single task but a vast range of small-scale tasks (such as individual ID, call type, emotional indication) with wide variety in data characteristics, and most bioacoustic tasks do not come with strongly-labelled training data. The standard paradigm of supervised learning, focussed on a single large-scale dataset and/or a generic pre-trained algorithm, is insufficient. In this work we recast bioacoustic sound event detection within the AI framework of few-shot learning. We adapt this framework to sound event detection, such that a system can be given the annotated start/end times of as few as 5 events, and can then detect events in long-duration audio -- even when the sound category was not known at the time of algorithm training. We introduce a collection of open datasets designed to strongly test a system's ability to perform few-shot sound event detections, and we present the results of a public contest to address the task. We show that prototypical networks are a strong-performing method, when enhanced with adaptations for general characteristics of animal sounds. We demonstrate that widely-varying sound event durations are an important factor in performance, as well as non-stationarity, i.e. gradual changes in conditions throughout the duration of a recording. For fine-grained bioacoustic recognition tasks without massive annotated training data, our results demonstrate that few-shot sound event detection is a powerful new method, strongly outperforming traditional signal-processing detection methods in the fully automated scenario.

Submitted 22 May, 2023; originally announced May 2023.
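
Illustrative note on the entry above: the core idea behind prototypical networks can be sketched in a few lines. Each class is represented by the mean of the embeddings of its few labelled examples, and a query is assigned to the class with the nearest prototype. The sketch below assumes the embeddings have already been computed; in a real system they would come from a trained neural encoder, which is omitted here. The 2-D vectors, labels, and function names are hypothetical, and the animal-sound adaptations mentioned in the abstract are not shown.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_prototypes(support):
    """Map each class label to the mean embedding of its support examples."""
    return {label: mean_vector(vectors) for label, vectors in support.items()}


def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: euclidean(query, prototypes[label]))


# Hypothetical embeddings: five annotated events plus five background frames.
support = {
    "event": [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [1.1, 1.0], [0.8, 1.1]],
    "background": [[-1.0, -1.0], [-0.8, -1.2], [-1.1, -0.9], [-1.0, -1.1], [-1.1, -0.8]],
}
prototypes = build_prototypes(support)
label_near_event = classify([0.7, 0.9], prototypes)
label_near_background = classify([-0.9, -0.7], prototypes)
```

With only five examples per class the prototypes are cheap to compute at inference time, which is what makes the approach attractive when the target sound category was unknown during training.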

arXiv:2304.12739 [pdf]

doi 10.1371/journal.pcbi.1011541

Adaptive Representations of Sound for Automatic Insect Recognition

Authors: Marius Faiß, Dan Stowell

Abstract: Insect population numbers and biodiversity have been rapidly declining with time, and monitoring these trends has become increasingly important for conservation measures to be effectively implemented. But monitoring methods are often invasive, time and resource intense, and prone to various biases. Many insect species produce characteristic sounds that can easily be detected and recorded without large cost or effort. Using deep learning methods, insect sounds from field recordings could be automatically detected and classified to monitor biodiversity and species distribution ranges. We implement this using recently published datasets of insect sounds (Orthoptera and Cicadidae) and machine learning methods and evaluate their potential for acoustic insect monitoring. We compare the performance of the conventional spectrogram-based audio representation against LEAF, a new adaptive and waveform-based frontend. LEAF achieved better classification performance than the mel-spectrogram frontend by adapting its feature extraction parameters during training. This result is encouraging for future implementations of deep learning technology for automatic insect sound recognition, especially as larger datasets become available.

Submitted 25 April, 2023; originally announced April 2023.

Comments: 35 pages, 11 figures. arXiv admin note: substantial text overlap with arXiv:2211.09503

arXiv:2210.07685 [pdf]

Full-Stack Bioacoustics: Field Kit to AI to Action (Workshop report)

Authors: Dan Stowell, Caitlin Black, Florencia Noriega, Sarab S. Sethi

Abstract: Acoustic data (sound recordings) are a vital source of evidence for detecting, counting, and distinguishing wildlife. This domain of "bioacoustics" has grown in the past decade due to the massive advances in signal processing and machine learning, recording devices, and the capacity of data processing and storage. Numerous research papers describe the use of Raspberry Pi or similar devices for acoustic monitoring, and other research papers describe automatic classification of animal sounds by machine learning. But for most ecologists, zoologists, conservationists, the pieces of the puzzle do not come together: the domain is fragmented. In this Lorentz workshop we bridge this gap by bringing together leading exponents of open hardware and open-source software for bioacoustic monitoring and machine learning, as well as ecologists and other field researchers. We share skills while also building a vision for the future development of "bioacoustic AI". This report contains an overview of the workshop aims and structure, as well as reports from the six groups.

Submitted 14 October, 2022; originally announced October 2022.

Comments: Workshop report: Lorentz Center, Leiden, the Netherlands, 1-5 August 2022

arXiv:2207.07911 [pdf, other]

Few-shot bioacoustic event detection at the DCASE 2022 challenge

Authors: I. Nolasco, S. Singh, E. Vidana-Villa, E. Grout, J. Morford, M. Emmerson, F. Jensens, H. Whitehead, I. Kiskin, A. Strandburg-Peshkin, L. Gill, H. Pamula, V. Lostanlen, V. Morfi, D. Stowell

Abstract: Few-shot sound event detection is the task of detecting sound events, despite having only a few labelled examples of the class of interest. This framework is particularly useful in bioacoustics, where often there is a need to annotate very long recordings but the expert annotator time is limited. This paper presents an overview of the second edition of the few-shot bioacoustic sound event detection task included in the DCASE 2022 challenge. A detailed description of the task objectives, dataset, and baselines is presented, together with the main results obtained and characteristics of the submitted systems. This task received submissions from 15 different teams from which 13 scored higher than the baselines. The highest F-score was of 60% on the evaluation set, which leads to a huge improvement over last year's edition. Highly-performing methods made use of prototypical networks, transductive learning, and addressed the variable length of events from all target classes. Furthermore, by analysing results on each of the subsets we can identify the main difficulties that the systems face, and conclude that few-shot bioacoustic sound event detection remains an open challenge.

Submitted 14 July, 2022; originally announced July 2022.

Comments: submitted to DCASE2022 workshop

arXiv:2207.06349 [pdf]

Polyphonic sound event detection for highly dense birdsong scenes

Authors: Alberto García Arroba Parrilla, Dan Stowell

Abstract: One hour before sunrise, one can experience the dawn chorus where birds from different species sing together. In this scenario, high levels of polyphony, as in the number of overlapping sound sources, are prone to happen resulting in a complex acoustic outcome. Sound Event Detection (SED) tasks analyze acoustic scenarios in order to identify the occurring events and their respective temporal information. However, highly dense scenarios can be hard to process and have not been studied in depth. Here we show, using a Convolutional Recurrent Neural Network (CRNN), how birdsong polyphonic scenarios can be detected when dealing with higher polyphony and how effectively this type of model can face a very dense scene with up to 10 overlapping birds. We found that models trained with denser examples (i.e., higher polyphony) learn at a similar rate as models that used simpler samples in their training set. Additionally, the model trained with the densest samples maintained a consistent score for all polyphonies, while the model trained with the least dense samples degraded as the polyphony increased. Our results demonstrate that highly dense acoustic scenarios can be dealt with using CRNNs. We expect that this study serves as a starting point for working on highly populated bird scenarios such as dawn chorus or other dense acoustic problems.

Submitted 13 July, 2022; originally announced July 2022.

arXiv:2112.06725 [pdf, other]

doi 10.7717/peerj.13152

Computational bioacoustics with deep learning: a review and roadmap

Authors: Dan Stowell

Abstract: Animal vocalisations and natural soundscapes are fascinating objects of study, and contain valuable evidence about animal behaviours, populations and ecosystems. They are studied in bioacoustics and ecoacoustics, with signal processing and analysis an important component. Computational bioacoustics has accelerated in recent decades due to the growth of affordable digital sound recording devices, and to huge progress in informatics such as big data, signal processing and machine learning. Methods are inherited from the wider field of deep learning, including speech and image processing. However, the tasks, demands and data characteristics are often different from those addressed in speech or music analysis. There remain unsolved problems, and tasks for which evidence is surely present in many acoustic signals, but not yet realised. In this paper I perform a review of the state of the art in deep learning for computational bioacoustics, aiming to clarify key concepts and identify and analyse knowledge gaps. Based on this, I offer a subjective but principled roadmap for computational bioacoustics with deep learning: topics that the community should aim to address, in order to make the most of future developments in AI and informatics, and to use audio data in answering zoological and ecological questions.

Submitted 13 December, 2021; originally announced December 2021.

arXiv:2110.05941 [pdf, ps, other]

doi 10.1109/ICASSP43922.2022.9746907

Rank-based loss for learning hierarchical representations

Authors: Ines Nolasco, Dan Stowell

Abstract: Hierarchical taxonomies are common in many contexts, and they are a very natural structure humans use to organise information. In machine learning, the family of methods that use the 'extra' information is called hierarchical classification. However, applied to audio classification, this remains relatively unexplored. Here we focus on how to integrate the hierarchical information of a problem to learn embeddings representative of the hierarchical relationships. Previously, triplet loss has been proposed to address this problem, however it presents some issues like requiring the careful construction of the triplets, and being limited in the extent of hierarchical information it uses at each iteration. In this work we propose a rank based loss function that uses hierarchical information and translates this into a rank ordering of target distances between the examples. We show that rank based loss is suitable to learn hierarchical representations of the data. By testing on unseen fine level classes we show that this method is also capable of learning hierarchically correct representations of the new classes. Rank based loss has two promising aspects, it is generalisable to hierarchies with any number of levels, and is capable of dealing with data with incomplete hierarchical labels.

Submitted 11 February, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

Comments: This version corrects a bug in the baseline results

arXiv:2012.03216 [pdf, other]

doi 10.17743/jaes.2021.0019

Guitar Effects Recognition and Parameter Estimation with Convolutional Neural Networks

Authors: Marco Comunità, Dan Stowell, Joshua D. Reiss

Abstract: Despite the popularity of guitar effects, there is very little existing research on classification and parameter estimation of specific plugins or effect units from guitar recordings. In this paper, convolutional neural networks were used for classification and parameter estimation for 13 overdrive, distortion and fuzz guitar effects. A novel dataset of processed electric guitar samples was assembled, with four sub-datasets consisting of monophonic or polyphonic samples and discrete or continuous settings values, for a total of about 250 hours of processed samples. Results were compared for networks trained and tested on the same or on a different sub-dataset. We found that discrete datasets could lead to equally high performance as continuous ones, whilst being easier to design, analyse and modify. Classification accuracy was above 80%, with confusion matrices reflecting similarities in the effects timbre and circuits design. With parameter values between 0.0 and 1.0, the mean absolute error is in most cases below 0.05, while the root mean square error is below 0.1 in all cases but one.

Submitted 6 December, 2020; originally announced December 2020.

Journal ref: JAES Volume 69 Issue 7/8 pp. 594-604; July 2021

arXiv:2010.02275 [pdf, other]

Short-term prediction of photovoltaic power generation using Gaussian process regression

Authors: Yahya Al Lawati, Jack Kelly, Dan Stowell

Abstract: Photovoltaic (PV) power is affected by weather conditions, making the power generated from the PV systems uncertain. Solving this problem would help improve the reliability and cost effectiveness of the grid, and could help reduce reliance on fossil fuel plants. The present paper focuses on evaluating predictions of the energy generated by PV systems in the United Kingdom using Gaussian process regression (GPR). Gaussian process regression is a Bayesian non-parametric model that can provide predictions along with the uncertainty in the predicted value, which can be very useful in applications with a high degree of uncertainty. The model is evaluated for short-term forecasts of 48 hours against three main factors -- training period, sky area coverage and kernel model selection -- and for very short-term forecasts of four hours against sky area. We also compare very short-term forecasts in terms of cloud coverage within the prediction period and only initial cloud coverage as a predictor.

Submitted 5 October, 2020; originally announced October 2020.

arXiv:1908.04672 [pdf, other]

Estimating & Mitigating the Impact of Acoustic Environments on Machine-to-Machine Signalling

Authors: Amogh Matt, Dan Stowell

Abstract: The advance of technology for transmitting Data-over-Sound in various IoT and telecommunication applications has led to the concept of machine-to-machine over-the-air acoustic signalling. Reverberation can have a detrimental effect on such machine-to-machine signals while decoding. Various methods have been studied to combat the effects of reverberation in speech and audio signals, but it is not clear how well they generalise to other sound types. We look at extending these models to facilitate machine-to-machine acoustic signalling. This research investigates dereverberation techniques to shortlist a single-channel reverberation suppression method through a pilot test. In order to apply the chosen dereverberation method a novel method of estimating acoustic parameters governing reverberation is proposed. The performance of the final algorithm is evaluated on quality metrics as well as the performance of a real machine-to-machine decoder. We demonstrate a dramatic reduction in error rate for both audible and ultrasonic signals.

Submitted 13 August, 2019; originally announced August 2019.

arXiv:1905.03204 [pdf, other]

doi 10.1103/PhysRevResearch.2.023069

Efficient On-line Computation of Visibility Graphs

Authors: Delia Fano Yela, Florian Thalmann, Vincenzo Nicosia, Dan Stowell, Mark Sandler

Abstract: A visibility algorithm maps time series into complex networks following a simple criterion. The resulting visibility graph has recently proven to be a powerful tool for time series analysis. However its straightforward computation is time-consuming and rigid, motivating the development of more efficient algorithms. Here we present a highly efficient method to compute visibility graphs with the further benefit of flexibility: on-line computation. We propose an encoder/decoder approach, with an on-line adjustable binary search tree codec for time series as well as its corresponding decoder for visibility graphs. The empirical evidence suggests the proposed method for computation of visibility graphs offers an on-line computation solution at no additional computation time cost. The source code is available online.

Submitted 8 May, 2019; originally announced May 2019.

Comments: code https://github.com/delialia/bst

Journal ref: Phys. Rev. Research 2, 023069 (2020)
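
To make the "simple criterion" in the abstract above concrete, here is a minimal sketch of the standard natural visibility criterion: two samples are linked if every sample between them lies strictly below the straight line joining them. This is the brute-force baseline (cubic in the worst case), not the binary-search-tree codec proposed in the paper; the function name and the example series are hypothetical, and equal sample spacing is assumed.

```python
def natural_visibility_edges(series):
    """Return the edges (a, b), a < b, of the natural visibility graph.

    Assumption: samples are equally spaced, so the index serves as time.
    """
    edges = set()
    n = len(series)
    for a in range(n):
        for b in range(a + 1, n):
            # a and b "see" each other if every intermediate sample c lies
            # strictly below the line from (a, series[a]) to (b, series[b]).
            visible = all(
                series[c]
                < series[b] + (series[a] - series[b]) * (b - c) / (b - a)
                for c in range(a + 1, b)
            )
            if visible:
                edges.add((a, b))
    return edges


# Hypothetical four-sample series, used only for illustration.
example_series = [1.0, 3.0, 2.0, 4.0]
example_edges = natural_visibility_edges(example_series)
print(sorted(example_edges))
```

Neighbouring samples are always connected, since there is nothing between them to block the view. In the example, sample 1 (value 3.0) blocks sample 0 from seeing samples 2 and 3, while samples 1 and 3 see each other over the dip at sample 2.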

arXiv:1903.01976 [pdf, other]

Spectral Visibility Graphs: Application to Similarity of Harmonic Signals

Authors: Delia Fano Yela, Dan Stowell, Mark Sandler

Abstract: Graph theory is emerging as a new source of tools for time series analysis. One promising method is to transform a signal into its visibility graph, a representation which captures many interesting aspects of the signal. Here we introduce the visibility graph for audio spectra and propose a novel representation for audio analysis: the spectral visibility graph degree. Such representation inherently captures the harmonic content of the signal whilst being resilient to broadband noise. We present experiments demonstrating its utility to measure robust similarity between harmonic signals in real and synthesised audio data. The source code is available online.

Submitted 20 June, 2019; v1 submitted 5 March, 2019; originally announced March 2019.

Comments: European Signal Processing Conference (EUSIPCO)

arXiv:1901.11436 [pdf, other]

End-to-End Probabilistic Inference for Nonstationary Audio Analysis

Authors: William J. Wilkinson, Michael Riis Andersen, Joshua D. Reiss, Dan Stowell, Arno Solin

Abstract: A typical audio signal processing pipeline includes multiple disjoint analysis stages, including calculation of a time-frequency representation followed by spectrogram-based feature analysis. We show how time-frequency analysis and nonnegative matrix factorisation can be jointly formulated as a spectral mixture Gaussian process model with nonstationary priors over the amplitude variance parameters. Further, we formulate this nonlinear model's state space representation, making it amenable to infinite-horizon Gaussian process regression with approximate inference via expectation propagation, which scales linearly in the number of time steps and quadratically in the state dimensionality. By doing so, we are able to process audio signals with hundreds of thousands of data points. We demonstrate, on various tasks with empirical data, how this inference scheme outperforms more standard techniques that rely on extended Kalman filtering.

Submitted 27 April, 2019; v1 submitted 31 January, 2019; originally announced January 2019.

Comments: Accepted to the Thirty-sixth International Conference on Machine Learning (ICML) 2019

arXiv:1811.02489 [pdf, other]

Unifying Probabilistic Models for Time-Frequency Analysis

Authors: William J. Wilkinson, Michael Riis Andersen, Joshua D. Reiss, Dan Stowell, Arno Solin

Abstract: In audio signal processing, probabilistic time-frequency models have many benefits over their non-probabilistic counterparts. They adapt to the incoming signal, quantify uncertainty, and measure correlation between the signal's amplitude and phase information, making time domain resynthesis straightforward. However, these models are still not widely used since they come at a high computational cost, and because they are formulated in such a way that it can be difficult to interpret all the modelling assumptions. By showing their equivalence to Spectral Mixture Gaussian processes, we illuminate the underlying model assumptions and provide a general framework for constructing more complex models that better approximate real-world signals. Our interpretation makes it intuitive to inspect, compare, and alter the models since all prior knowledge is encoded in the Gaussian process kernel functions. We utilise a state space representation to perform efficient inference via Kalman smoothing, and we demonstrate how our interpretation allows for efficient parameter learning in the frequency domain.

Submitted 12 February, 2019; v1 submitted 6 November, 2018; originally announced November 2018.

Comments: Accepted to International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019

arXiv:1811.02275 [pdf, other]

NIPS4Bplus: a richly annotated birdsong audio dataset

Authors: Veronica Morfi, Yves Bas, Hanna Pamuła, Hervé Glotin, Dan Stowell

Abstract: Recent advances in birdsong detection and classification have approached a limit due to the lack of fully annotated recordings. In this paper, we present NIPS4Bplus, the first richly annotated birdsong audio dataset, that is comprised of recordings containing bird vocalisations along with their active species tags plus the temporal annotations acquired for them. Statistical information about the recordings, their species specific tags and their temporal annotations are presented along with example uses. NIPS4Bplus could be used in various ecoacoustic tasks, such as training models for bird population monitoring, species classification, birdsong vocalisation detection and classification.

Submitted 14 November, 2018; v1 submitted 6 November, 2018; originally announced November 2018.

Comments: 5 pages, 5 figures, submitted to ICASSP 2019

arXiv:1810.12679 [pdf, other]

Sparse Gaussian Process Audio Source Separation Using Spectrum Priors in the Time-Domain

Authors: Pablo A. Alvarado, Mauricio A. Álvarez, Dan Stowell

Abstract: Gaussian process (GP) audio source separation is a time-domain approach that circumvents the inherent phase approximation issue of spectrogram based methods. Furthermore, through its kernel, GPs elegantly incorporate prior knowledge about the sources into the separation model. Despite these compelling advantages, the computational complexity of GP inference scales cubically with the number of audio samples. As a result, source separation GP models have been restricted to the analysis of short audio frames. We introduce an efficient application of GPs to time-domain audio source separation, without compromising performance. For this purpose, we used GP regression, together with spectral mixture kernels, and variational sparse GPs. We compared our method with LD-PSDTF (positive semi-definite tensor factorization), KL-NMF (Kullback-Leibler non-negative matrix factorization), and IS-NMF (Itakura-Saito NMF). Results show that the proposed method outperforms these techniques.

Submitted 21 November, 2018; v1 submitted 30 October, 2018; originally announced October 2018.

Comments: Paper submitted to the 44th International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019. To be held in Brighton, United Kingdom, between May 12 and May 17, 2019

arXiv:1810.09273 [pdf, other]

Automatic acoustic identification of individual animals: Improving generalisation across species and recording conditions

Authors: Dan Stowell, Tereza Petrusková, Martin Šálek, Pavel Linhart

Abstract: Many animals emit vocal sounds which, independently from the sounds' function, embed some individually-distinctive signature. Thus the automatic recognition of individuals by sound is a potentially powerful tool for zoology and ecology research and practical monitoring. Here we present a general automatic identification method, that can work across multiple animal species with various levels of complexity in their communication systems. We further introduce new analysis techniques based on dataset manipulations that can evaluate the robustness and generality of a classifier. By using these techniques we confirmed the presence of experimental confounds in situations resembling those from past studies. We introduce data manipulations that can reduce the impact of these confounds, compatible with any classifier. We suggest that assessment of confounds should become a standard part of future studies to ensure they do not report over-optimistic results. We provide annotated recordings used for analyses along with this study and we call for dataset sharing to be a common practice to enhance development of methods and comparisons of results.

Submitted 22 October, 2018; originally announced October 2018.

arXiv:1807.06972 [pdf, other]

Data-Efficient Weakly Supervised Learning for Low-Resource Audio Event Detection Using Deep Learning

Authors: Veronica Morfi, Dan Stowell

Abstract: We propose a method to perform audio event detection under the common constraint that only limited training data are available. In training a deep learning system to perform audio event detection, two practical problems arise. Firstly, most datasets are "weakly labelled" having only a list of events present in each recording without any temporal information for training. Secondly, deep neural networks need a very large amount of labelled training data to achieve good quality performance, yet in practice it is difficult to collect enough samples for most classes of interest. In this paper, we propose a data-efficient training of a stacked convolutional and recurrent neural network. This neural network is trained in a multi instance learning setting for which we introduce a new loss function that leads to improved training compared to the usual approaches for weakly supervised learning. We successfully test our approach on two low-resource datasets that lack temporal labels.

Submitted 26 October, 2018; v1 submitted 17 July, 2018; originally announced July 2018.

Comments: 5 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:1807.03697

arXiv:1807.05812 [pdf, other]

doi 10.1111/2041-210X.13103

Automatic acoustic detection of birds through deep learning: the first Bird Audio Detection challenge

Authors: Dan Stowell, Yannis Stylianou, Mike Wood, Hanna Pamuła, Hervé Glotin

Abstract: Assessing the presence and abundance of birds is important for monitoring specific species as well as overall ecosystem health. Many birds are most readily detected by their sounds, and thus passive acoustic monitoring is highly appropriate. Yet acoustic monitoring is often held back by practical limitations such as the need for manual configuration, reliance on example sound libraries, low accuracy, low robustness, and limited ability to generalise to novel acoustic conditions. Here we report outcomes from a collaborative data challenge showing that with modern machine learning including deep learning, general-purpose acoustic bird detection can achieve very high retrieval rates in remote monitoring data, with no manual recalibration, and no pre-training of the detector for the target species or the acoustic conditions in the target environment. Multiple methods were able to attain performance of around 88% AUC (area under the ROC curve), much higher performance than previous general-purpose methods. We present new acoustic monitoring datasets, summarise the machine learning techniques proposed by challenge teams, conduct detailed performance evaluation, and discuss how such approaches to detection can be integrated into remote monitoring projects.

Submitted 16 July, 2018; originally announced July 2018.
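
The AUC figure in the abstract above can be read as the probability that a randomly chosen positive clip (bird present) receives a higher detector score than a randomly chosen negative clip. The sketch below computes AUC directly from that pairwise definition; the function name, labels and scores are hypothetical and do not come from the challenge data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from the pairwise-ranking definition.

    labels: 1 for positive examples, 0 for negative examples.
    scores: detector outputs, higher meaning "more likely positive".
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical detector scores for four clips.
example_labels = [1, 0, 1, 0]
example_scores = [0.9, 0.8, 0.3, 0.1]
example_auc = roc_auc(example_labels, example_scores)
print(f"AUC = {example_auc:.2f}")
```

A detector that scores at random gives an AUC of about 0.5 and a perfect ranking gives 1.0. The double loop is quadratic in the number of clips, so a sort-based implementation would be preferable for large evaluation sets.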

arXiv:1807.03697 [pdf, other]

Deep Learning for Audio Transcription on Low-Resource Datasets

Authors: Veronica Morfi, Dan Stowell

Abstract: In training a deep learning system to perform audio transcription, two practical problems may arise. Firstly, most datasets are weakly labelled, having only a list of events present in each recording without any temporal information for training. Secondly, deep neural networks need a very large amount of labelled training data to achieve good quality performance, yet in practice it is difficult to collect enough samples for most classes of interest. In this paper, we propose factorising the final task of audio transcription into multiple intermediate tasks in order to improve the training performance when dealing with this kind of low-resource datasets. We evaluate three data-efficient approaches of training a stacked convolutional and recurrent neural network for the intermediate tasks. Our results show that different methods of training have different advantages and disadvantages.

Submitted 11 July, 2018; v1 submitted 10 July, 2018; originally announced July 2018.

Comments: 20 pages, 5 figures

arXiv:1804.02325 [pdf, other]

Does k Matter? k-NN Hubness Analysis for Kernel Additive Modelling Vocal Separation

Authors: Delia Fano Yela, Dan Stowell, Mark Sandler

Abstract: Kernel Additive Modelling (KAM) is a framework for source separation aiming to explicitly model inherent properties of sound sources to help with their identification and separation. KAM separates a given source by applying robust statistics on the selection of time-frequency bins obtained through a source-specific kernel, typically the k-NN function. Even though the parameter k appears to be key for a successful separation, little discussion on its influence or optimisation can be found in the literature. Here we propose a novel method, based on graph theory statistics, to automatically optimise k in a vocal separation task. We introduce the k-NN hubness as an indicator to find a tailored k at a low computational cost. Subsequently, we evaluate our method in comparison to the common approach to choose k. We further discuss the influence and importance of this parameter with illuminating results.

Submitted 6 April, 2018; originally announced April 2018.

Comments: LVA-ICA 2018 - Feedback always welcome

arXiv:1802.00680 [pdf, other]

A Generative Model for Natural Sounds Based on Latent Force Modelling

Authors: William J. Wilkinson, Joshua D. Reiss, Dan Stowell

Abstract: Recent advances in analysis of subband amplitude envelopes of natural sounds have resulted in convincing synthesis, showing subband amplitudes to be a crucial component of perception. Probabilistic latent variable analysis is particularly revealing, but existing approaches don't incorporate prior knowledge about the physical behaviour of amplitude envelopes, such as exponential decay and feedback. We use latent force modelling, a probabilistic learning paradigm that incorporates physical knowledge into Gaussian process regression, to model correlation across spectral subband envelopes. We augment the standard latent force model approach by explicitly modelling correlations over multiple time steps. Incorporating this prior knowledge strengthens the interpretation of the latent functions as the source that generated the signal. We examine this interpretation via an experiment which shows that sounds generated by sampling from our probabilistic model are perceived to be more realistic than those generated by similar models based on nonnegative matrix factorisation, even in cases where our model is outperformed from a reconstruction error perspective.

Submitted 27 March, 2019; v1 submitted 2 February, 2018; originally announced February 2018.

Comments: 10 pages, 5 figures

arXiv:1705.07104 [pdf, other]

Efficient Learning of Harmonic Priors for Pitch Detection in Polyphonic Music

Authors: Pablo A. Alvarado, Dan Stowell

Abstract: Automatic music transcription (AMT) aims to infer a latent symbolic representation of a piece of music (piano-roll), given a corresponding observed audio recording. Transcribing polyphonic music (when multiple notes are played simultaneously) is a challenging problem, due to highly structured overlapping between harmonics. We study whether the introduction of physically inspired Gaussian process (GP) priors into audio content analysis models improves the extraction of patterns required for AMT. Audio signals are described as a linear combination of sources. Each source is decomposed into the product of an amplitude-envelope, and a quasi-periodic component process. We introduce the Matérn spectral mixture (MSM) kernel for describing frequency content of single notes. We consider two different regression approaches. In the sigmoid model every pitch-activation is independently non-linearly transformed. In the softmax model several activation GPs are jointly non-linearly transformed. This introduces cross-correlation between activations. We use variational Bayes for approximate inference. We empirically evaluate how these models work in practice transcribing polyphonic music. We demonstrate that rather than encourage dependency between activations, what is relevant for improving pitch detection is to learn priors that fit the frequency content of the sound events to detect.

Submitted 16 November, 2018; v1 submitted 19 May, 2017; originally announced May 2017.

Comments: Updated version with appendix section about derivation of amplitude modulated GP

arXiv:1612.05489 [pdf, other]

On-bird Sound Recordings: Automatic Acoustic Recognition of Activities and Contexts

Authors: Dan Stowell, Emmanouil Benetos, Lisa F. Gill

Abstract: We introduce a novel approach to studying animal behaviour and the context in which it occurs, through the use of microphone backpacks carried on the backs of individual free-flying birds. These sensors are increasingly used by animal behaviour researchers to study individual vocalisations of freely behaving animals, even in the field. However such devices may record more than an animal's vocal behaviour, and have the potential to be used for investigating specific activities (movement) and context (background) within which vocalisations occur. To facilitate this approach, we investigate the automatic annotation of such recordings through two different sound scene analysis paradigms: a scene-classification method using feature learning, and an event-detection method using probabilistic latent component analysis (PLCA). We analyse recordings made with Eurasian jackdaws (Corvus monedula) in both captive and field settings. Results are comparable with the state of the art in sound scene analysis; we find that the current recognition quality level enables scalable automatic annotation of audio logger data, given partial annotation, but also find that individual differences between animals and/or their backpacks limit the generalisation from one individual to another. We consider the interrelation of 'scenes' and 'events' in this particular task, and issues of temporal resolution.

Submitted 16 December, 2016; originally announced December 2016.

arXiv:1608.03417 [pdf, other]

doi 10.1109/MLSP.2016.7738875

Bird detection in audio: a survey and a challenge

Authors: Dan Stowell, Mike Wood, Yannis Stylianou, Hervé Glotin

Abstract: Many biological monitoring projects rely on acoustic detection of birds. Despite increasingly large datasets, this detection is often manual or semi-automatic, requiring manual tuning/postprocessing. We review the state of the art in automatic bird sound detection, and identify a widespread need for tuning-free and species-agnostic approaches. We introduce new datasets and an IEEE research challenge to address this need, to make possible the development of fully automatic algorithms for bird sound detection.

Submitted 11 August, 2016; originally announced August 2016.

Comments: Slightly extended preprint of paper accepted for MLSP 2016

arXiv:1606.01039 [pdf, ps, other]

Gaussian Processes for Music Audio Modelling and Content Analysis

Authors: Pablo A. Alvarado, Dan Stowell

Abstract: Real music signals are highly variable, yet they have strong statistical structure. Prior information about the underlying physical mechanisms by which sounds are generated and rules by which complex sound structure is constructed (notes, chords, a complete musical score), can be naturally unified using Bayesian modelling techniques. Typically algorithms for Automatic Music Transcription independently carry out individual tasks such as multiple-F0 detection and beat tracking. The challenge remains to perform joint estimation of all parameters. We present a Bayesian approach for modelling music audio, and content analysis. The proposed methodology based on Gaussian processes seeks joint estimation of multiple music concepts by incorporating into the kernel prior information about non-stationary behaviour, dynamics, and rich spectral content present in the modelled music signal. We illustrate the benefits of this approach via two tasks: pitch estimation, and inferring missing segments in a polyphonic audio recording.

Submitted 10 June, 2016; v1 submitted 3 June, 2016; originally announced June 2016.

arXiv:1603.07236 [pdf, other]

Individual identity in songbirds: signal representations and metric learning for locating the information in complex corvid calls

Authors: Dan Stowell, Veronica Morfi, Lisa F. Gill

Abstract: Bird calls range from simple tones to rich dynamic multi-harmonic structures. The more complex calls are very poorly understood at present, such as those of the scientifically important corvid family (jackdaws, crows, ravens, etc.). Individual birds can recognise familiar individuals from calls, but where in the signal is this identity encoded? We studied the question by applying a combination of feature representations to a dataset of jackdaw calls, including linear predictive coding (LPC) and adaptive discrete Fourier transform (aDFT). We demonstrate through a classification paradigm that we can strongly outperform a standard spectrogram representation for identifying individuals, and we apply metric learning to determine which time-frequency regions contribute most strongly to robust individual identification. Computational methods can help to direct our search for understanding of these complex biological signals.

Submitted 26 April, 2016; v1 submitted 23 March, 2016; originally announced March 2016.

arXiv:1603.07173 [pdf, other]

Deductive Refinement of Species Labelling in Weakly Labelled Birdsong Recordings

Authors: Veronica Morfi, Dan Stowell

Abstract: Many approaches have been used in bird species classification from their sound in order to provide labels for the whole of a recording. However, a more precise classification of each bird vocalization would be of great importance to the use and management of sound archives and bird monitoring. In this work, we introduce a technique that, using a two-step process, can first automatically detect all bird vocalizations and then, with the use of 'weakly' labelled recordings, classify them. Evaluations of our proposed method show that it achieves a correct classification of 61% when used in a synthetic dataset, and up to 89% when the synthetic dataset only consists of vocalizations larger than 1000 pixels.

Submitted 23 March, 2016; originally announced March 2016.

Comments: 11 pages, 1 figure

arXiv:1601.05449 [pdf, other]

Detailed temporal structure of communication networks in groups of songbirds

Authors: Dan Stowell, Lisa Gill, David Clayton

Abstract: Animals in groups often exchange calls, in patterns whose temporal structure may be influenced by contextual factors such as physical location and the social network structure of the group. We introduce a model-based analysis for temporal patterns of animal call timing, originally developed for networks of firing neurons. This has advantages over cross-correlation analysis in that it can correctly handle common-cause confounds and provides a generative model of call patterns with explicit parameters for the influences between individuals. It also has advantages over standard Markovian analysis in that it incorporates detailed temporal interactions which affect timing as well as sequencing of calls. Further, a fitted model can be used to generate novel synthetic call sequences. We apply the method to calls recorded from groups of domesticated zebra finch (Taeniopygia guttata) individuals. We find that the communication network in these groups has stable structure that persists from one day to the next, and that "kernels" reflecting the temporal range of influence have a characteristic structure for a calling individual's effect on itself, its partner, and on others in the group. We further find characteristic patterns of influences by call type as well as by individual.

Submitted 20 January, 2016; originally announced January 2016.

arXiv:1509.05982 [pdf, other]

Denoising without access to clean data using a partitioned autoencoder

Authors: Dan Stowell, Richard E. Turner

Abstract: Training a denoising autoencoder neural network requires access to truly clean data, a requirement which is often impractical. To remedy this, we introduce a method to train an autoencoder using only noisy data, having examples with and without the signal class of interest. The autoencoder learns a partitioned representation of signal and noise, learning to reconstruct each separately. We illustrate the method by denoising birdsong audio (available abundantly in uncontrolled noisy datasets) using a convolutional autoencoder.

Submitted 22 September, 2015; v1 submitted 20 September, 2015; originally announced September 2015.

arXiv:1503.07150 [pdf, other]

Acoustic event detection for multiple overlapping similar sources

Authors: Dan Stowell, David Clayton

Abstract: Many current paradigms for acoustic event detection (AED) are not adapted to the organic variability of natural sounds, and/or they assume a limit on the number of simultaneous sources: often only one source, or one source of each type, may be active. These aspects are highly undesirable for applications such as bird population monitoring. We introduce a simple method modelling the onsets, durations and offsets of acoustic events to avoid intrinsic limits on polyphony or on inter-event temporal patterns. We evaluate the method in a case study with over 3000 zebra finch calls. In comparison against a HMM-based method we find it more accurate at recovering acoustic events, and more robust for estimating calling rates.

Submitted 9 July, 2015; v1 submitted 24 March, 2015; originally announced March 2015.

Comments: Accepted for WASPAA 2015

arXiv:1411.3715 [pdf, other]

doi 10.1109/MSP.2014.2326181

Acoustic Scene Classification

Authors: Daniele Barchiesi, Dimitrios Giannoulis, Dan Stowell, Mark D. Plumbley

Abstract: In this article we present an account of the state-of-the-art in acoustic scene classification (ASC), the task of classifying environments from the sounds they produce. Starting from a historical review of previous research in this area, we define a general framework for ASC and present different implementations of its components. We then describe a range of different algorithms submitted for a data challenge that was held to provide a general and fair benchmark for ASC techniques. The dataset recorded for this purpose is presented, along with the performance metrics that are used to evaluate the algorithms and statistical significance tests to compare the submitted methods. We use a baseline method that employs MFCCs, GMMs and a maximum likelihood criterion as a benchmark, and only find sufficient evidence to conclude that three algorithms significantly outperform it. We also evaluate the human classification accuracy in performing a similar classification task. The best performing algorithm achieves a mean accuracy that matches the median accuracy obtained by humans, and common pairs of classes are misclassified by both computers and humans. However, all acoustic scenes are correctly classified by at least some individuals, while there are scenes that are misclassified by all algorithms.

Submitted 13 November, 2014; originally announced November 2014.

Journal ref: IEEE Signal Processing Magazine 32(3) (May 2015) 16-34

arXiv:1405.6524 [pdf, other]

doi 10.7717/peerj.488

Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning

Authors: Dan Stowell, Mark D. Plumbley

Abstract: Automatic species classification of birds from their sound is a computational tool of increasing importance in ecology, conservation monitoring and vocal communication studies. To make classification useful in practice, it is crucial to improve its accuracy while ensuring that it can run at big data scales. Many approaches use acoustic measures based on spectrogram-type data, such as the Mel-frequency cepstral coefficient (MFCC) features which represent a manually-designed summary of spectral information. However, recent work in machine learning has demonstrated that features learnt automatically from data can often outperform manually-designed feature transforms. Feature learning can be performed at large scale and "unsupervised", meaning it requires no manual data labelling, yet it can improve performance on "supervised" tasks such as classification. In this work we introduce a technique for feature learning from large volumes of bird sound recordings, inspired by techniques that have proven useful in other domains. We experimentally compare twelve different feature representations derived from the Mel spectrum (of which six use this technique), using four large and diverse databases of bird vocalisations, with a random forest classifier. We demonstrate that MFCCs are of limited power in this context, leading to worse performance than the raw Mel spectral data. Conversely, we demonstrate that unsupervised feature learning provides a substantial boost over MFCCs and Mel spectra without adding computational complexity after the model has been trained. The boost is particularly notable for single-label classification tasks at large scale. The spectro-temporal activations learned through our procedure resemble spectro-temporal receptive fields calculated from avian primary auditory forebrain.

Submitted 26 May, 2014; originally announced May 2014.

Journal ref: PeerJ 2:e488, 2014

arXiv:1311.4764 [pdf, other]

doi 10.1111/2041-210X.12223

Large-scale analysis of frequency modulation in birdsong databases

Authors: Dan Stowell, Mark D. Plumbley

Abstract: Birdsong often contains large amounts of rapid frequency modulation (FM). It is believed that the use or otherwise of FM is adaptive to the acoustic environment, and also that there are specific social uses of FM such as trills in aggressive territorial encounters. Yet temporal fine detail of FM is often absent or obscured in standard audio signal analysis methods such as Fourier analysis or linear prediction. Hence it is important to consider high resolution signal processing techniques for analysis of FM in bird vocalisations. If such methods can be applied at big data scales, this offers a further advantage as large datasets become available. We introduce methods from the signal processing literature which can go beyond spectrogram representations to analyse the fine modulations present in a signal at very short timescales. Focusing primarily on the genus Phylloscopus, we investigate which of a set of four analysis methods most strongly captures the species signal encoded in birdsong. In order to find tools useful in practical analysis of large databases, we also study the computational time taken by the methods, and their robustness to additive noise and MP3 compression. We find three methods which can robustly represent species-correlated FM attributes, and that the simplest method tested also appears to perform the best. We find that features representing the extremes of FM encode species identity supplementary to that captured in frequency features, whereas bandwidth features do not encode additional information. Large-scale FM analysis can efficiently extract information useful for bioacoustic studies, in addition to measures more commonly used to characterise vocalisations.

Submitted 19 November, 2013; originally announced November 2013.

Journal ref: Methods in Ecology and Evolution, Volume 5, Issue 9, pages 901-912, September 2014

arXiv:1309.5275 [pdf, other]

An open dataset for research on audio field recording archives: freefield1010

Authors: Dan Stowell, Mark D. Plumbley

Abstract: We introduce a free and open dataset of 7690 audio clips sampled from the field-recording tag in the Freesound audio archive. The dataset is designed for use in research related to data mining in audio archives of field recordings / soundscapes. Audio is standardised, and audio and metadata are Creative Commons licensed. We describe the data preparation process, characterise the dataset descriptively, and illustrate its use through an auto-tagging experiment.

Submitted 1 October, 2013; v1 submitted 20 September, 2013; originally announced September 2013.

arXiv:1302.3462 [pdf, other]

doi 10.1109/ICASSP.2013.6637691

Improved multiple birdsong tracking with distribution derivative method and Markov renewal process clustering

Authors: Dan Stowell, Sašo Muševič, Jordi Bonada, Mark D. Plumbley

Abstract: Segregating an audio mixture containing multiple simultaneous bird sounds is a challenging task. However, birdsong often contains rapid pitch modulations, and these modulations carry information which may be of use in automatic recognition. In this paper we demonstrate that an improved spectrogram representation, based on the distribution derivative method, leads to improved performance of a segregation algorithm which uses a Markov renewal process model to track vocalisation patterns consisting of singing and silences.

Submitted 15 February, 2013; v1 submitted 14 February, 2013; originally announced February 2013.

Comments: Submitted to ICASSP 2013

arXiv:1302.0136 [pdf, other]

Maximum a posteriori estimation of piecewise arcs in tempo time-series

Authors: Dan Stowell, Elaine Chew

Abstract: In musical performances with expressive tempo modulation, the tempo variation can be modelled as a sequence of tempo arcs. Previous authors have used this idea to estimate series of piecewise arc segments from data. In this paper we describe a probabilistic model for a time-series process of this nature, and use this to perform inference of single- and multi-level arc processes from data. We describe an efficient Viterbi-like process for MAP inference of arcs. Our approach is score-agnostic, and together with efficient inference allows for online analysis of performances including improvisations, and can predict immediate future tempo trajectories.

Submitted 1 February, 2013; originally announced February 2013.

Comments: Submitted to postprint volume for Computer Music Modeling and Retrieval (CMMR) 2012

arXiv:1211.2972 [pdf, other]

Segregating event streams and noise with a Markov renewal process model

Authors: Dan Stowell, Mark D. Plumbley

Abstract: We describe an inference task in which a set of timestamped event observations must be clustered into an unknown number of temporal sequences with independent and varying rates of observations. Various existing approaches to multi-object tracking assume a fixed number of sources and/or a fixed observation rate; we develop an approach to inferring structure in timestamped data produced by a mixture of an unknown and varying number of similar Markov renewal processes, plus independent clutter noise. The inference simultaneously distinguishes signal from noise as well as clustering signal observations into separate source streams. We illustrate the technique via a synthetic experiment as well as an experiment to track a mixture of singing birds.

Submitted 13 November, 2012; originally announced November 2012.

ACM Class: I.5.1

Journal ref: Journal of Machine Learning Research, 14(Aug):2213-2238, 2013

Showing 1–45 of 45 results for author: Stowell, D