-
animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics
Authors:
Julian C. Schäfer-Zimmermann,
Vlad Demartsev,
Baptiste Averly,
Kiran Dhanjal-Adams,
Mathieu Duteil,
Gabriella Gall,
Marius Faiß,
Lily Johnson-Ulrich,
Dan Stowell,
Marta B. Manser,
Marie A. Roch,
Ariana Strandburg-Peshkin
Abstract:
Bioacoustic research provides invaluable insights into the behavior, ecology, and conservation of animals. Most bioacoustic datasets consist of long recordings where events of interest, such as vocalizations, are exceedingly rare. Analyzing these datasets poses a monumental challenge to researchers, where deep learning techniques have emerged as a standard method. Their adaptation remains challenging, focusing on models conceived for computer vision, where the audio waveforms are engineered into spectrographic representations for training and inference. We improve the current state of deep learning in bioacoustics in two ways: First, we present the animal2vec framework: a fully interpretable transformer model and self-supervised training scheme tailored for sparse and unbalanced bioacoustic data. Second, we openly publish MeerKAT: Meerkat Kalahari Audio Transcripts, a large-scale dataset containing audio collected via biologgers deployed on free-ranging meerkats with a length of over 1068h, of which 184h have twelve time-resolved vocalization-type classes, each with ms-resolution, making it the largest publicly-available labeled dataset on terrestrial mammals. Further, we benchmark animal2vec against the NIPS4Bplus birdsong dataset. We report new state-of-the-art results on both datasets and evaluate the few-shot capabilities of animal2vec of labeled training data. Finally, we perform ablation studies to highlight the differences between our architecture and a vanilla transformer baseline for human-produced sounds. animal2vec allows researchers to classify massive amounts of sparse bioacoustic data even with little ground truth information available. In addition, the MeerKAT dataset is the first large-scale, millisecond-resolution corpus for benchmarking bioacoustic models in the pretrain/finetune paradigm. We believe this sets the stage for a new reference point for bioacoustics.
Submitted 3 June, 2024;
originally announced June 2024.
-
Performance of computer vision algorithms for fine-grained classification using crowdsourced insect images
Authors:
Rita Pucci,
Vincent J. Kalkman,
Dan Stowell
Abstract:
With fine-grained classification, we identify unique characteristics to distinguish among classes of the same super-class. We are focusing on species recognition in Insecta, as they are critical for biodiversity monitoring and at the base of many ecosystems. With citizen science campaigns, billions of images are collected in the wild. Once these are labelled, experts can use them to create distribution maps. However, the labelling process is time-consuming, which is where computer vision comes in. The field of computer vision offers a wide range of algorithms, each with its strengths and weaknesses; how do we identify the algorithm that is in line with our application? To answer this question, we provide a full and detailed evaluation of nine algorithms among deep convolutional networks (CNN), vision transformers (ViT), and locality-based vision transformers (LBVT) on 4 different aspects: classification performance, embedding quality, computational cost, and gradient activity. We offer insights that we haven't yet had in this domain proving to which extent these algorithms solve the fine-grained tasks in Insecta. We found that the ViT performs the best on inference speed and computational cost while the LBVT outperforms the others on performance and embedding quality; the CNN provide a trade-off among the metrics.
Submitted 4 April, 2024;
originally announced April 2024.
-
Efficient speech detection in environmental audio using acoustic recognition and knowledge distillation
Authors:
Drew Priebe,
Burooj Ghani,
Dan Stowell
Abstract:
The ongoing biodiversity crisis, driven by factors such as land-use change and global warming, emphasizes the need for effective ecological monitoring methods. Acoustic monitoring of biodiversity has emerged as an important monitoring tool. Detecting human voices in soundscape monitoring projects is useful both for analysing human disturbance and for privacy filtering. Despite significant strides in deep learning in recent years, the deployment of large neural networks on compact devices poses challenges due to memory and latency constraints. Our approach focuses on leveraging knowledge distillation techniques to design efficient, lightweight student models for speech detection in bioacoustics. In particular, we employed the MobileNetV3-Small-Pi model to create compact yet effective student architectures to compare against the larger EcoVAD teacher model, a well-regarded voice detection architecture in eco-acoustic monitoring. The comparative analysis included examining various configurations of the MobileNetV3-Small-Pi derived student models to identify optimal performance. Additionally, a thorough evaluation of different distillation techniques was conducted to ascertain the most effective method for model selection. Our findings revealed that the distilled models exhibited comparable performance to the EcoVAD teacher model, indicating a promising approach to overcoming computational barriers for real-time ecological monitoring.
Submitted 14 December, 2023;
originally announced December 2023.
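As a rough illustration of the knowledge distillation idea named in the abstract above, the sketch below computes the classic temperature-scaled soft-target loss between a teacher and a student. This is an assumption for illustration only: the abstract does not say which distillation techniques were compared, and the function names `softmax` and `distillation_loss` and the default temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the student's.

    Hypothetical helper: the classic soft-target formulation, not
    necessarily the method evaluated in the paper listed above.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl
```

A student whose logits match the teacher's gets a loss of zero; the loss grows as the two softened distributions diverge.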
-
Auto deep learning for bioacoustic signals
Authors:
Giulio Tosato,
Abdelrahman Shehata,
Joshua Janssen,
Kees Kamp,
Pramatya Jati,
Dan Stowell
Abstract:
This study investigates the potential of automated deep learning to enhance the accuracy and efficiency of multi-class classification of bird vocalizations, compared against traditional manually-designed deep learning models. Using the Western Mediterranean Wetland Birds dataset, we investigated the use of AutoKeras, an automated machine learning framework, to automate neural architecture search and hyperparameter tuning. Comparative analysis validates our hypothesis that the AutoKeras-derived model consistently outperforms traditional models like MobileNet, ResNet50 and VGG16. Our approach and findings underscore the transformative potential of automated deep learning for advancing bioacoustics research and models. In fact, the automated techniques eliminate the need for manual feature engineering and model design while improving performance. This study illuminates best practices in sampling, evaluation and reporting to enhance reproducibility in this nascent field. All the code used is available at https://github.com/giuliotosato/AutoKeras-bioacustic
Keywords: AutoKeras; automated deep learning; audio classification; Wetlands Bird dataset; comparative analysis; bioacoustics; validation dataset; multi-class classification; spectrograms.
Submitted 26 December, 2023; v1 submitted 8 November, 2023;
originally announced November 2023.
-
ATGNN: Audio Tagging Graph Neural Network
Authors:
Shubhr Singh,
Christian J. Steinmetz,
Emmanouil Benetos,
Huy Phan,
Dan Stowell
Abstract:
Deep learning models such as CNNs and Transformers have achieved impressive performance for end-to-end audio tagging. Recent works have shown that despite stacking multiple layers, the receptive field of CNNs remains severely limited. Transformers on the other hand are able to map global context through self-attention, but treat the spectrogram as a sequence of patches which is not flexible enough to capture irregular audio objects. In this work, we treat the spectrogram in a more flexible way by considering it as graph structure and process it with a novel graph neural architecture called ATGNN. ATGNN not only combines the capability of CNNs with the global information sharing ability of Graph Neural Networks, but also maps semantic relationships between learnable class embeddings and corresponding spectrogram regions. We evaluate ATGNN on two audio tagging tasks, where it achieves 0.585 mAP on the FSD50K dataset and 0.335 mAP on the AudioSet-balanced dataset, achieving comparable results to Transformer based models with significantly lower number of learnable parameters.
Submitted 2 November, 2023;
originally announced November 2023.
-
Comparison between transformers and convolutional models for fine-grained classification of insects
Authors:
Rita Pucci,
Vincent J. Kalkman,
Dan Stowell
Abstract:
Fine-grained classification is challenging due to the difficulty of finding discriminatory features. This problem is exacerbated when applied to identifying species within the same taxonomical class. This is because species often share morphological characteristics that make them difficult to differentiate. We consider the taxonomical class of Insecta. The identification of insects is essential in biodiversity monitoring as they are one of the inhabitants at the base of many ecosystems. Citizen science is doing brilliant work of collecting images of insects in the wild, giving experts the possibility to create improved distribution maps in all countries. We have billions of images that need to be automatically classified and deep neural network algorithms are one of the main techniques explored for fine-grained tasks. At the SOTA, the field of deep learning algorithms is extremely fruitful, so how to identify the algorithm to use? We focus on Odonata and Coleoptera orders, and we propose an initial comparative study to analyse the two best-known layer structures for computer vision: transformer and convolutional layers. We compare the performance of T2TViT, a fully transformer-base, EfficientNet, a fully convolutional-base, and ViTAE, a hybrid. We analyse the performance of the three models in identical conditions evaluating the performance per species, per morph together with sex, the inference time, and the overall performance with unbalanced datasets of images from smartphones. Although we observe high performances with all three families of models, our analysis shows that the hybrid model outperforms the fully convolutional-base and fully transformer-base models on accuracy performance and the fully transformer-base model outperforms the others on inference speed. These results prove the transformer to be robust to the shortage of samples and to be faster at inference time.
Submitted 20 July, 2023;
originally announced July 2023.
-
Few-shot bioacoustic event detection at the DCASE 2023 challenge
Authors:
Ines Nolasco,
Burooj Ghani,
Shubhr Singh,
Ester Vidaña-Vila,
Helen Whitehead,
Emily Grout,
Michael Emmerson,
Frants Jensen,
Ivan Kiskin,
Joe Morford,
Ariana Strandburg-Peshkin,
Lisa Gill,
Hanna Pamuła,
Vincent Lostanlen,
Dan Stowell
Abstract:
Few-shot bioacoustic event detection consists in detecting sound events of specified types, in varying soundscapes, while having access to only a few examples of the class of interest. This task ran as part of the DCASE challenge for the third time this year with an evaluation set expanded to include new animal species, and a new rule: ensemble models were no longer allowed. The 2023 few shot task received submissions from 6 different teams with F-scores reaching as high as 63% on the evaluation set. Here we describe the task, focusing on describing the elements that differed from previous years. We also take a look back at past editions to describe how the task has evolved. Not only have the F-score results steadily improved (40% to 60% to 63%), but the type of systems proposed have also become more complex. Sound event detection systems are no longer simple variations of the baselines provided: multiple few-shot learning methodologies are still strong contenders for the task.
Submitted 15 June, 2023;
originally announced June 2023.
-
Learning to detect an animal sound from five examples
Authors:
Inês Nolasco,
Shubhr Singh,
Veronica Morfi,
Vincent Lostanlen,
Ariana Strandburg-Peshkin,
Ester Vidaña-Vila,
Lisa Gill,
Hanna Pamuła,
Helen Whitehead,
Ivan Kiskin,
Frants H. Jensen,
Joe Morford,
Michael G. Emmerson,
Elisabetta Versace,
Emily Grout,
Haohe Liu,
Dan Stowell
Abstract:
Automatic detection and classification of animal sounds has many applications in biodiversity monitoring and animal behaviour. In the past twenty years, the volume of digitised wildlife sound available has massively increased, and automatic classification through deep learning now shows strong results. However, bioacoustics is not a single task but a vast range of small-scale tasks (such as individual ID, call type, emotional indication) with wide variety in data characteristics, and most bioacoustic tasks do not come with strongly-labelled training data. The standard paradigm of supervised learning, focussed on a single large-scale dataset and/or a generic pre-trained algorithm, is insufficient. In this work we recast bioacoustic sound event detection within the AI framework of few-shot learning. We adapt this framework to sound event detection, such that a system can be given the annotated start/end times of as few as 5 events, and can then detect events in long-duration audio -- even when the sound category was not known at the time of algorithm training. We introduce a collection of open datasets designed to strongly test a system's ability to perform few-shot sound event detections, and we present the results of a public contest to address the task. We show that prototypical networks are a strong-performing method, when enhanced with adaptations for general characteristics of animal sounds. We demonstrate that widely-varying sound event durations are an important factor in performance, as well as non-stationarity, i.e. gradual changes in conditions throughout the duration of a recording. For fine-grained bioacoustic recognition tasks without massive annotated training data, our results demonstrate that few-shot sound event detection is a powerful new method, strongly outperforming traditional signal-processing detection methods in the fully automated scenario.
Submitted 22 May, 2023;
originally announced May 2023.
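The abstract above names prototypical networks as a strong-performing few-shot method. A minimal sketch of the core idea follows, assuming embeddings have already been computed: each class prototype is the mean of its few support embeddings, and a query is assigned to the nearest prototype. The helper names and toy two-dimensional embeddings are hypothetical, and the sketch omits the animal-sound adaptations the abstract mentions.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def classify(query, support):
    """Assign `query` to the class whose prototype is nearest.

    `support` maps a class label to its few example embeddings
    (hypothetical data layout, chosen for illustration).
    """
    prototypes = {label: mean_vector(examples) for label, examples in support.items()}
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


# Toy usage: two classes with two support examples each.
support_set = {
    "call": [[1.0, 1.0], [1.2, 0.8]],
    "background": [[-1.0, -1.0], [-0.8, -1.2]],
}
predicted = classify([0.9, 1.1], support_set)
```

In a real system the embeddings would come from a trained network and the same nearest-prototype rule would be applied to successive frames of a long recording.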
-
Adaptive Representations of Sound for Automatic Insect Recognition
Authors:
Marius Faiß,
Dan Stowell
Abstract:
Insect population numbers and biodiversity have been rapidly declining with time, and monitoring these trends has become increasingly important for conservation measures to be effectively implemented. But monitoring methods are often invasive, time and resource intense, and prone to various biases. Many insect species produce characteristic sounds that can easily be detected and recorded without large cost or effort. Using deep learning methods, insect sounds from field recordings could be automatically detected and classified to monitor biodiversity and species distribution ranges. We implement this using recently published datasets of insect sounds (Orthoptera and Cicadidae) and machine learning methods and evaluate their potential for acoustic insect monitoring. We compare the performance of the conventional spectrogram-based audio representation against LEAF, a new adaptive and waveform-based frontend. LEAF achieved better classification performance than the mel-spectrogram frontend by adapting its feature extraction parameters during training. This result is encouraging for future implementations of deep learning technology for automatic insect sound recognition, especially as larger datasets become available.
Submitted 25 April, 2023;
originally announced April 2023.
-
Full-Stack Bioacoustics: Field Kit to AI to Action (Workshop report)
Authors:
Dan Stowell,
Caitlin Black,
Florencia Noriega,
Sarab S. Sethi
Abstract:
Acoustic data (sound recordings) are a vital source of evidence for detecting, counting, and distinguishing wildlife. This domain of "bioacoustics" has grown in the past decade due to the massive advances in signal processing and machine learning, recording devices, and the capacity of data processing and storage. Numerous research papers describe the use of Raspberry Pi or similar devices for acoustic monitoring, and other research papers describe automatic classification of animal sounds by machine learning. But for most ecologists, zoologists, conservationists, the pieces of the puzzle do not come together: the domain is fragmented. In this Lorentz workshop we bridge this gap by bringing together leading exponents of open hardware and open-source software for bioacoustic monitoring and machine learning, as well as ecologists and other field researchers. We share skills while also building a vision for the future development of "bioacoustic AI".
This report contains an overview of the workshop aims and structure, as well as reports from the six groups.
Submitted 14 October, 2022;
originally announced October 2022.
-
Few-shot bioacoustic event detection at the DCASE 2022 challenge
Authors:
I. Nolasco,
S. Singh,
E. Vidana-Villa,
E. Grout,
J. Morford,
M. Emmerson,
F. Jensens,
H. Whitehead,
I. Kiskin,
A. Strandburg-Peshkin,
L. Gill,
H. Pamula,
V. Lostanlen,
V. Morfi,
D. Stowell
Abstract:
Few-shot sound event detection is the task of detecting sound events, despite having only a few labelled examples of the class of interest. This framework is particularly useful in bioacoustics, where often there is a need to annotate very long recordings but the expert annotator time is limited. This paper presents an overview of the second edition of the few-shot bioacoustic sound event detection task included in the DCASE 2022 challenge. A detailed description of the task objectives, dataset, and baselines is presented, together with the main results obtained and characteristics of the submitted systems. This task received submissions from 15 different teams from which 13 scored higher than the baselines. The highest F-score was 60% on the evaluation set, which leads to a huge improvement over last year's edition. Highly-performing methods made use of prototypical networks, transductive learning, and addressed the variable length of events from all target classes. Furthermore, by analysing results on each of the subsets we can identify the main difficulties that the systems face, and conclude that few-shot bioacoustic sound event detection remains an open challenge.
Submitted 14 July, 2022;
originally announced July 2022.
-
Polyphonic sound event detection for highly dense birdsong scenes
Authors:
Alberto García Arroba Parrilla,
Dan Stowell
Abstract:
One hour before sunrise, one can experience the dawn chorus where birds from different species sing together. In this scenario, high levels of polyphony, as in the number of overlapping sound sources, are prone to happen resulting in a complex acoustic outcome. Sound Event Detection (SED) tasks analyze acoustic scenarios in order to identify the occurring events and their respective temporal information. However, highly dense scenarios can be hard to process and have not been studied in depth. Here we show, using a Convolutional Recurrent Neural Network (CRNN), how birdsong polyphonic scenarios can be detected when dealing with higher polyphony and how effectively this type of model can face a very dense scene with up to 10 overlapping birds. We found that models trained with denser examples (i.e., higher polyphony) learn at a similar rate as models that used simpler samples in their training set. Additionally, the model trained with the densest samples maintained a consistent score for all polyphonies, while the model trained with the least dense samples degraded as the polyphony increased. Our results demonstrate that highly dense acoustic scenarios can be dealt with using CRNNs. We expect that this study serves as a starting point for working on highly populated bird scenarios such as dawn chorus or other dense acoustic problems.
Submitted 13 July, 2022;
originally announced July 2022.
-
Computational bioacoustics with deep learning: a review and roadmap
Authors:
Dan Stowell
Abstract:
Animal vocalisations and natural soundscapes are fascinating objects of study, and contain valuable evidence about animal behaviours, populations and ecosystems. They are studied in bioacoustics and ecoacoustics, with signal processing and analysis an important component. Computational bioacoustics has accelerated in recent decades due to the growth of affordable digital sound recording devices, and to huge progress in informatics such as big data, signal processing and machine learning. Methods are inherited from the wider field of deep learning, including speech and image processing. However, the tasks, demands and data characteristics are often different from those addressed in speech or music analysis. There remain unsolved problems, and tasks for which evidence is surely present in many acoustic signals, but not yet realised. In this paper I perform a review of the state of the art in deep learning for computational bioacoustics, aiming to clarify key concepts and identify and analyse knowledge gaps. Based on this, I offer a subjective but principled roadmap for computational bioacoustics with deep learning: topics that the community should aim to address, in order to make the most of future developments in AI and informatics, and to use audio data in answering zoological and ecological questions.
Submitted 13 December, 2021;
originally announced December 2021.
-
Rank-based loss for learning hierarchical representations
Authors:
Ines Nolasco,
Dan Stowell
Abstract:
Hierarchical taxonomies are common in many contexts, and they are a very natural structure humans use to organise information. In machine learning, the family of methods that use the 'extra' information is called hierarchical classification. However, applied to audio classification, this remains relatively unexplored. Here we focus on how to integrate the hierarchical information of a problem to learn embeddings representative of the hierarchical relationships. Previously, triplet loss has been proposed to address this problem, however it presents some issues like requiring the careful construction of the triplets, and being limited in the extent of hierarchical information it uses at each iteration. In this work we propose a rank based loss function that uses hierarchical information and translates this into a rank ordering of target distances between the examples. We show that rank based loss is suitable to learn hierarchical representations of the data. By testing on unseen fine level classes we show that this method is also capable of learning hierarchically correct representations of the new classes. Rank based loss has two promising aspects, it is generalisable to hierarchies with any number of levels, and is capable of dealing with data with incomplete hierarchical labels.
Submitted 11 February, 2022; v1 submitted 11 October, 2021;
originally announced October 2021.
-
Guitar Effects Recognition and Parameter Estimation with Convolutional Neural Networks
Authors:
Marco Comunità,
Dan Stowell,
Joshua D. Reiss
Abstract:
Despite the popularity of guitar effects, there is very little existing research on classification and parameter estimation of specific plugins or effect units from guitar recordings. In this paper, convolutional neural networks were used for classification and parameter estimation for 13 overdrive, distortion and fuzz guitar effects. A novel dataset of processed electric guitar samples was assembled, with four sub-datasets consisting of monophonic or polyphonic samples and discrete or continuous settings values, for a total of about 250 hours of processed samples. Results were compared for networks trained and tested on the same or on a different sub-dataset. We found that discrete datasets could lead to equally high performance as continuous ones, whilst being easier to design, analyse and modify. Classification accuracy was above 80%, with confusion matrices reflecting similarities in the effects timbre and circuits design. With parameter values between 0.0 and 1.0, the mean absolute error is in most cases below 0.05, while the root mean square error is below 0.1 in all cases but one.
Submitted 6 December, 2020;
originally announced December 2020.
-
Short-term prediction of photovoltaic power generation using Gaussian process regression
Authors:
Yahya Al Lawati,
Jack Kelly,
Dan Stowell
Abstract:
Photovoltaic (PV) power is affected by weather conditions, making the power generated from the PV systems uncertain. Solving this problem would help improve the reliability and cost effectiveness of the grid, and could help reduce reliance on fossil fuel plants. The present paper focuses on evaluating predictions of the energy generated by PV systems in the United Kingdom using Gaussian process regression (GPR). Gaussian process regression is a Bayesian non-parametric model that can provide predictions along with the uncertainty in the predicted value, which can be very useful in applications with a high degree of uncertainty. The model is evaluated for short-term forecasts of 48 hours against three main factors -- training period, sky area coverage and kernel model selection -- and for very short-term forecasts of four hours against sky area. We also compare very short-term forecasts in terms of cloud coverage within the prediction period and only initial cloud coverage as a predictor.
Submitted 5 October, 2020;
originally announced October 2020.
-
Estimating & Mitigating the Impact of Acoustic Environments on Machine-to-Machine Signalling
Authors:
Amogh Matt,
Dan Stowell
Abstract:
The advance of technology for transmitting Data-over-Sound in various IoT and telecommunication applications has led to the concept of machine-to-machine over-the-air acoustic signalling. Reverberation can have a detrimental effect on such machine-to-machine signals while decoding. Various methods have been studied to combat the effects of reverberation in speech and audio signals, but it is not clear how well they generalise to other sound types. We look at extending these models to facilitate machine-to-machine acoustic signalling. This research investigates dereverberation techniques to shortlist a single-channel reverberation suppression method through a pilot test. In order to apply the chosen dereverberation method a novel method of estimating acoustic parameters governing reverberation is proposed. The performance of the final algorithm is evaluated on quality metrics as well as the performance of a real machine-to-machine decoder. We demonstrate a dramatic reduction in error rate for both audible and ultrasonic signals.
Submitted 13 August, 2019;
originally announced August 2019.
-
Efficient On-line Computation of Visibility Graphs
Authors:
Delia Fano Yela,
Florian Thalmann,
Vincenzo Nicosia,
Dan Stowell,
Mark Sandler
Abstract:
A visibility algorithm maps time series into complex networks following a simple criterion. The resulting visibility graph has recently proven to be a powerful tool for time series analysis. However its straightforward computation is time-consuming and rigid, motivating the development of more efficient algorithms. Here we present a highly efficient method to compute visibility graphs with the further benefit of flexibility: on-line computation. We propose an encoder/decoder approach, with an on-line adjustable binary search tree codec for time series as well as its corresponding decoder for visibility graphs. The empirical evidence suggests the proposed method for computation of visibility graphs offers an on-line computation solution at no additional computation time cost. The source code is available online.
Submitted 8 May, 2019;
originally announced May 2019.
-
Spectral Visibility Graphs: Application to Similarity of Harmonic Signals
Authors:
Delia Fano Yela,
Dan Stowell,
Mark Sandler
Abstract:
Graph theory is emerging as a new source of tools for time series analysis. One promising method is to transform a signal into its visibility graph, a representation which captures many interesting aspects of the signal. Here we introduce the visibility graph for audio spectra and propose a novel representation for audio analysis: the spectral visibility graph degree. Such representation inherently captures the harmonic content of the signal whilst being resilient to broadband noise. We present experiments demonstrating its utility to measure robust similarity between harmonic signals in real and synthesised audio data. The source code is available online.
Submitted 20 June, 2019; v1 submitted 5 March, 2019;
originally announced March 2019.
-
End-to-End Probabilistic Inference for Nonstationary Audio Analysis
Authors:
William J. Wilkinson,
Michael Riis Andersen,
Joshua D. Reiss,
Dan Stowell,
Arno Solin
Abstract:
A typical audio signal processing pipeline includes multiple disjoint analysis stages, including calculation of a time-frequency representation followed by spectrogram-based feature analysis. We show how time-frequency analysis and nonnegative matrix factorisation can be jointly formulated as a spectral mixture Gaussian process model with nonstationary priors over the amplitude variance parameters. Further, we formulate this nonlinear model's state space representation, making it amenable to infinite-horizon Gaussian process regression with approximate inference via expectation propagation, which scales linearly in the number of time steps and quadratically in the state dimensionality. By doing so, we are able to process audio signals with hundreds of thousands of data points. We demonstrate, on various tasks with empirical data, how this inference scheme outperforms more standard techniques that rely on extended Kalman filtering.
Submitted 27 April, 2019; v1 submitted 31 January, 2019;
originally announced January 2019.
-
Unifying Probabilistic Models for Time-Frequency Analysis
Authors:
William J. Wilkinson,
Michael Riis Andersen,
Joshua D. Reiss,
Dan Stowell,
Arno Solin
Abstract:
In audio signal processing, probabilistic time-frequency models have many benefits over their non-probabilistic counterparts. They adapt to the incoming signal, quantify uncertainty, and measure correlation between the signal's amplitude and phase information, making time domain resynthesis straightforward. However, these models are still not widely used since they come at a high computational cost, and because they are formulated in such a way that it can be difficult to interpret all the modelling assumptions. By showing their equivalence to Spectral Mixture Gaussian processes, we illuminate the underlying model assumptions and provide a general framework for constructing more complex models that better approximate real-world signals. Our interpretation makes it intuitive to inspect, compare, and alter the models since all prior knowledge is encoded in the Gaussian process kernel functions. We utilise a state space representation to perform efficient inference via Kalman smoothing, and we demonstrate how our interpretation allows for efficient parameter learning in the frequency domain.
Submitted 12 February, 2019; v1 submitted 6 November, 2018;
originally announced November 2018.
-
NIPS4Bplus: a richly annotated birdsong audio dataset
Authors:
Veronica Morfi,
Yves Bas,
Hanna Pamuła,
Hervé Glotin,
Dan Stowell
Abstract:
Recent advances in birdsong detection and classification have approached a limit due to the lack of fully annotated recordings. In this paper, we present NIPS4Bplus, the first richly annotated birdsong audio dataset, that is comprised of recordings containing bird vocalisations along with their active species tags plus the temporal annotations acquired for them. Statistical information about the recordings, their species specific tags and their temporal annotations are presented along with example uses. NIPS4Bplus could be used in various ecoacoustic tasks, such as training models for bird population monitoring, species classification, birdsong vocalisation detection and classification.
Submitted 14 November, 2018; v1 submitted 6 November, 2018;
originally announced November 2018.
-
Sparse Gaussian Process Audio Source Separation Using Spectrum Priors in the Time-Domain
Authors:
Pablo A. Alvarado,
Mauricio A. Álvarez,
Dan Stowell
Abstract:
Gaussian process (GP) audio source separation is a time-domain approach that circumvents the inherent phase approximation issue of spectrogram-based methods. Furthermore, through their kernels, GPs elegantly incorporate prior knowledge about the sources into the separation model. Despite these compelling advantages, the computational complexity of GP inference scales cubically with the number of audio samples. As a result, source separation GP models have been restricted to the analysis of short audio frames. We introduce an efficient application of GPs to time-domain audio source separation, without compromising performance. For this purpose, we used GP regression, together with spectral mixture kernels, and variational sparse GPs. We compared our method with LD-PSDTF (positive semi-definite tensor factorization), KL-NMF (Kullback-Leibler non-negative matrix factorization), and IS-NMF (Itakura-Saito NMF). Results show that the proposed method outperforms these techniques.
Submitted 21 November, 2018; v1 submitted 30 October, 2018;
originally announced October 2018.
-
Automatic acoustic identification of individual animals: Improving generalisation across species and recording conditions
Authors:
Dan Stowell,
Tereza Petrusková,
Martin Šálek,
Pavel Linhart
Abstract:
Many animals emit vocal sounds which, independently from the sounds' function, embed some individually-distinctive signature. Thus the automatic recognition of individuals by sound is a potentially powerful tool for zoology and ecology research and practical monitoring. Here we present a general automatic identification method, that can work across multiple animal species with various levels of complexity in their communication systems. We further introduce new analysis techniques based on dataset manipulations that can evaluate the robustness and generality of a classifier. By using these techniques we confirmed the presence of experimental confounds in situations resembling those from past studies. We introduce data manipulations that can reduce the impact of these confounds, compatible with any classifier. We suggest that assessment of confounds should become a standard part of future studies to ensure they do not report over-optimistic results. We provide annotated recordings used for analyses along with this study and we call for dataset sharing to be a common practice to enhance development of methods and comparisons of results.
Submitted 22 October, 2018;
originally announced October 2018.
-
Data-Efficient Weakly Supervised Learning for Low-Resource Audio Event Detection Using Deep Learning
Authors:
Veronica Morfi,
Dan Stowell
Abstract:
We propose a method to perform audio event detection under the common constraint that only limited training data are available. In training a deep learning system to perform audio event detection, two practical problems arise. Firstly, most datasets are "weakly labelled" having only a list of events present in each recording without any temporal information for training. Secondly, deep neural networks need a very large amount of labelled training data to achieve good quality performance, yet in practice it is difficult to collect enough samples for most classes of interest. In this paper, we propose a data-efficient training of a stacked convolutional and recurrent neural network. This neural network is trained in a multi instance learning setting for which we introduce a new loss function that leads to improved training compared to the usual approaches for weakly supervised learning. We successfully test our approach on two low-resource datasets that lack temporal labels.
Submitted 26 October, 2018; v1 submitted 17 July, 2018;
originally announced July 2018.
-
Automatic acoustic detection of birds through deep learning: the first Bird Audio Detection challenge
Authors:
Dan Stowell,
Yannis Stylianou,
Mike Wood,
Hanna Pamuła,
Hervé Glotin
Abstract:
Assessing the presence and abundance of birds is important for monitoring specific species as well as overall ecosystem health. Many birds are most readily detected by their sounds, and thus passive acoustic monitoring is highly appropriate. Yet acoustic monitoring is often held back by practical limitations such as the need for manual configuration, reliance on example sound libraries, low accuracy, low robustness, and limited ability to generalise to novel acoustic conditions. Here we report outcomes from a collaborative data challenge showing that with modern machine learning including deep learning, general-purpose acoustic bird detection can achieve very high retrieval rates in remote monitoring data --- with no manual recalibration, and no pre-training of the detector for the target species or the acoustic conditions in the target environment. Multiple methods were able to attain performance of around 88% AUC (area under the ROC curve), much higher performance than previous general-purpose methods. We present new acoustic monitoring datasets, summarise the machine learning techniques proposed by challenge teams, conduct detailed performance evaluation, and discuss how such approaches to detection can be integrated into remote monitoring projects.
Submitted 16 July, 2018;
originally announced July 2018.
-
Deep Learning for Audio Transcription on Low-Resource Datasets
Authors:
Veronica Morfi,
Dan Stowell
Abstract:
In training a deep learning system to perform audio transcription, two practical problems may arise. Firstly, most datasets are weakly labelled, having only a list of events present in each recording without any temporal information for training. Secondly, deep neural networks need a very large amount of labelled training data to achieve good quality performance, yet in practice it is difficult to collect enough samples for most classes of interest. In this paper, we propose factorising the final task of audio transcription into multiple intermediate tasks in order to improve the training performance when dealing with this kind of low-resource dataset. We evaluate three data-efficient approaches of training a stacked convolutional and recurrent neural network for the intermediate tasks. Our results show that different methods of training have different advantages and disadvantages.
Submitted 11 July, 2018; v1 submitted 10 July, 2018;
originally announced July 2018.
-
Does k Matter? k-NN Hubness Analysis for Kernel Additive Modelling Vocal Separation
Authors:
Delia Fano Yela,
Dan Stowell,
Mark Sandler
Abstract:
Kernel Additive Modelling (KAM) is a framework for source separation aiming to explicitly model inherent properties of sound sources to help with their identification and separation. KAM separates a given source by applying robust statistics on the selection of time-frequency bins obtained through a source-specific kernel, typically the k-NN function. Even though the parameter k appears to be key for a successful separation, little discussion on its influence or optimisation can be found in the literature. Here we propose a novel method, based on graph theory statistics, to automatically optimise k in a vocal separation task. We introduce the k-NN hubness as an indicator to find a tailored k at a low computational cost. Subsequently, we evaluate our method in comparison to the common approach to choose k. We further discuss the influence and importance of this parameter with illuminating results.
Submitted 6 April, 2018;
originally announced April 2018.
-
A Generative Model for Natural Sounds Based on Latent Force Modelling
Authors:
William J. Wilkinson,
Joshua D. Reiss,
Dan Stowell
Abstract:
Recent advances in analysis of subband amplitude envelopes of natural sounds have resulted in convincing synthesis, showing subband amplitudes to be a crucial component of perception. Probabilistic latent variable analysis is particularly revealing, but existing approaches don't incorporate prior knowledge about the physical behaviour of amplitude envelopes, such as exponential decay and feedback. We use latent force modelling, a probabilistic learning paradigm that incorporates physical knowledge into Gaussian process regression, to model correlation across spectral subband envelopes. We augment the standard latent force model approach by explicitly modelling correlations over multiple time steps. Incorporating this prior knowledge strengthens the interpretation of the latent functions as the source that generated the signal. We examine this interpretation via an experiment which shows that sounds generated by sampling from our probabilistic model are perceived to be more realistic than those generated by similar models based on nonnegative matrix factorisation, even in cases where our model is outperformed from a reconstruction error perspective.
Submitted 27 March, 2019; v1 submitted 2 February, 2018;
originally announced February 2018.
-
Efficient Learning of Harmonic Priors for Pitch Detection in Polyphonic Music
Authors:
Pablo A. Alvarado,
Dan Stowell
Abstract:
Automatic music transcription (AMT) aims to infer a latent symbolic representation of a piece of music (piano-roll), given a corresponding observed audio recording. Transcribing polyphonic music (when multiple notes are played simultaneously) is a challenging problem, due to highly structured overlapping between harmonics. We study whether the introduction of physically inspired Gaussian process (GP) priors into audio content analysis models improves the extraction of patterns required for AMT. Audio signals are described as a linear combination of sources. Each source is decomposed into the product of an amplitude-envelope, and a quasi-periodic component process. We introduce the Matérn spectral mixture (MSM) kernel for describing frequency content of single notes. We consider two different regression approaches. In the sigmoid model every pitch-activation is independently non-linearly transformed. In the softmax model several activation GPs are jointly non-linearly transformed. This introduces cross-correlation between activations. We use variational Bayes for approximate inference. We empirically evaluate how these models work in practice transcribing polyphonic music. We demonstrate that rather than encourage dependency between activations, what is relevant for improving pitch detection is to learn priors that fit the frequency content of the sound events to detect.
Submitted 16 November, 2018; v1 submitted 19 May, 2017;
originally announced May 2017.
-
On-bird Sound Recordings: Automatic Acoustic Recognition of Activities and Contexts
Authors:
Dan Stowell,
Emmanouil Benetos,
Lisa F. Gill
Abstract:
We introduce a novel approach to studying animal behaviour and the context in which it occurs, through the use of microphone backpacks carried on the backs of individual free-flying birds. These sensors are increasingly used by animal behaviour researchers to study individual vocalisations of freely behaving animals, even in the field. However such devices may record more than an animal's vocal behaviour, and have the potential to be used for investigating specific activities (movement) and context (background) within which vocalisations occur. To facilitate this approach, we investigate the automatic annotation of such recordings through two different sound scene analysis paradigms: a scene-classification method using feature learning, and an event-detection method using probabilistic latent component analysis (PLCA). We analyse recordings made with Eurasian jackdaws (Corvus monedula) in both captive and field settings. Results are comparable with the state of the art in sound scene analysis; we find that the current recognition quality level enables scalable automatic annotation of audio logger data, given partial annotation, but also find that individual differences between animals and/or their backpacks limit the generalisation from one individual to another. We consider the interrelation of 'scenes' and 'events' in this particular task, and issues of temporal resolution.
Submitted 16 December, 2016;
originally announced December 2016.
-
Bird detection in audio: a survey and a challenge
Authors:
Dan Stowell,
Mike Wood,
Yannis Stylianou,
Hervé Glotin
Abstract:
Many biological monitoring projects rely on acoustic detection of birds. Despite increasingly large datasets, this detection is often manual or semi-automatic, requiring manual tuning/postprocessing. We review the state of the art in automatic bird sound detection, and identify a widespread need for tuning-free and species-agnostic approaches. We introduce new datasets and an IEEE research challenge to address this need, to make possible the development of fully automatic algorithms for bird sound detection.
Submitted 11 August, 2016;
originally announced August 2016.
-
Gaussian Processes for Music Audio Modelling and Content Analysis
Authors:
Pablo A. Alvarado,
Dan Stowell
Abstract:
Real music signals are highly variable, yet they have strong statistical structure. Prior information about the underlying physical mechanisms by which sounds are generated and rules by which complex sound structure is constructed (notes, chords, a complete musical score), can be naturally unified using Bayesian modelling techniques. Typically algorithms for Automatic Music Transcription independently carry out individual tasks such as multiple-F0 detection and beat tracking. The challenge remains to perform joint estimation of all parameters. We present a Bayesian approach for modelling music audio, and content analysis. The proposed methodology based on Gaussian processes seeks joint estimation of multiple music concepts by incorporating into the kernel prior information about non-stationary behaviour, dynamics, and rich spectral content present in the modelled music signal. We illustrate the benefits of this approach via two tasks: pitch estimation, and inferring missing segments in a polyphonic audio recording.
Submitted 10 June, 2016; v1 submitted 3 June, 2016;
originally announced June 2016.
-
Individual identity in songbirds: signal representations and metric learning for locating the information in complex corvid calls
Authors:
Dan Stowell,
Veronica Morfi,
Lisa F. Gill
Abstract:
Bird calls range from simple tones to rich dynamic multi-harmonic structures. The more complex calls are very poorly understood at present, such as those of the scientifically important corvid family (jackdaws, crows, ravens, etc.). Individual birds can recognise familiar individuals from calls, but where in the signal is this identity encoded? We studied the question by applying a combination of feature representations to a dataset of jackdaw calls, including linear predictive coding (LPC) and adaptive discrete Fourier transform (aDFT). We demonstrate through a classification paradigm that we can strongly outperform a standard spectrogram representation for identifying individuals, and we apply metric learning to determine which time-frequency regions contribute most strongly to robust individual identification. Computational methods can help to direct our search for understanding of these complex biological signals.
Submitted 26 April, 2016; v1 submitted 23 March, 2016;
originally announced March 2016.
-
Deductive Refinement of Species Labelling in Weakly Labelled Birdsong Recordings
Authors:
Veronica Morfi,
Dan Stowell
Abstract:
Many approaches have been used in bird species classification from their sound in order to provide labels for the whole of a recording. However, a more precise classification of each bird vocalization would be of great importance to the use and management of sound archives and bird monitoring. In this work, we introduce a technique that, using a two-step process, can first automatically detect all bird vocalizations and then, with the use of 'weakly' labelled recordings, classify them. Evaluations of our proposed method show that it achieves a correct classification of 61% when used on a synthetic dataset, and up to 89% when the synthetic dataset only consists of vocalizations larger than 1000 pixels.
Submitted 23 March, 2016;
originally announced March 2016.
-
Detailed temporal structure of communication networks in groups of songbirds
Authors:
Dan Stowell,
Lisa Gill,
David Clayton
Abstract:
Animals in groups often exchange calls, in patterns whose temporal structure may be influenced by contextual factors such as physical location and the social network structure of the group. We introduce a model-based analysis for temporal patterns of animal call timing, originally developed for networks of firing neurons. This has advantages over cross-correlation analysis in that it can correctly handle common-cause confounds and provides a generative model of call patterns with explicit parameters for the influences between individuals. It also has advantages over standard Markovian analysis in that it incorporates detailed temporal interactions which affect timing as well as sequencing of calls. Further, a fitted model can be used to generate novel synthetic call sequences. We apply the method to calls recorded from groups of domesticated zebra finch (Taeniopygia guttata) individuals. We find that the communication network in these groups has stable structure that persists from one day to the next, and that "kernels" reflecting the temporal range of influence have a characteristic structure for a calling individual's effect on itself, its partner, and on others in the group. We further find characteristic patterns of influences by call type as well as by individual.
Submitted 20 January, 2016;
originally announced January 2016.
-
Denoising without access to clean data using a partitioned autoencoder
Authors:
Dan Stowell,
Richard E. Turner
Abstract:
Training a denoising autoencoder neural network requires access to truly clean data, a requirement which is often impractical. To remedy this, we introduce a method to train an autoencoder using only noisy data, having examples with and without the signal class of interest. The autoencoder learns a partitioned representation of signal and noise, learning to reconstruct each separately. We illustrate the method by denoising birdsong audio (available abundantly in uncontrolled noisy datasets) using a convolutional autoencoder.
Submitted 22 September, 2015; v1 submitted 20 September, 2015;
originally announced September 2015.
-
Acoustic event detection for multiple overlapping similar sources
Authors:
Dan Stowell,
David Clayton
Abstract:
Many current paradigms for acoustic event detection (AED) are not adapted to the organic variability of natural sounds, and/or they assume a limit on the number of simultaneous sources: often only one source, or one source of each type, may be active. These aspects are highly undesirable for applications such as bird population monitoring. We introduce a simple method modelling the onsets, durations and offsets of acoustic events to avoid intrinsic limits on polyphony or on inter-event temporal patterns. We evaluate the method in a case study with over 3000 zebra finch calls. In comparison against an HMM-based method, we find it more accurate at recovering acoustic events, and more robust for estimating calling rates.
Submitted 9 July, 2015; v1 submitted 24 March, 2015;
originally announced March 2015.
-
Acoustic Scene Classification
Authors:
Daniele Barchiesi,
Dimitrios Giannoulis,
Dan Stowell,
Mark D. Plumbley
Abstract:
In this article we present an account of the state of the art in acoustic scene classification (ASC), the task of classifying environments from the sounds they produce. Starting from a historical review of previous research in this area, we define a general framework for ASC and present different implementations of its components. We then describe a range of different algorithms submitted for a data challenge that was held to provide a general and fair benchmark for ASC techniques. The dataset recorded for this purpose is presented, along with the performance metrics that are used to evaluate the algorithms and statistical significance tests to compare the submitted methods. We use a baseline method that employs MFCCs, GMMs and a maximum likelihood criterion as a benchmark, and only find sufficient evidence to conclude that three algorithms significantly outperform it. We also evaluate the human classification accuracy in performing a similar classification task. The best performing algorithm achieves a mean accuracy that matches the median accuracy obtained by humans, and common pairs of classes are misclassified by both computers and humans. However, all acoustic scenes are correctly classified by at least some individuals, while there are scenes that are misclassified by all algorithms.
Submitted 13 November, 2014;
originally announced November 2014.
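The baseline named in the abstract above (MFCCs, GMMs and a maximum likelihood criterion) can be illustrated with a much-simplified sketch. It is not the authors' implementation: it assumes a single diagonal-covariance Gaussian per scene class (the one-component special case of a GMM), and the feature frames are made-up numbers standing in for MFCC vectors. The function names `fit_diag_gaussian`, `log_likelihood` and `classify` are hypothetical.

```python
import math


def fit_diag_gaussian(frames):
    """Fit a diagonal Gaussian (mean, variance per dimension) to feature frames."""
    n = len(frames)
    dims = len(frames[0])
    mean = [sum(f[i] for f in frames) / n for i in range(dims)]
    # Floor the variance so a degenerate dimension cannot cause division by zero.
    var = [
        max(sum((f[i] - mean[i]) ** 2 for f in frames) / n, 1e-6)
        for i in range(dims)
    ]
    return mean, var


def log_likelihood(frames, model):
    """Sum the log-density of every frame under one class model."""
    mean, var = model
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total


def classify(frames, models):
    """Maximum likelihood decision: pick the class whose model scores highest."""
    return max(models, key=lambda label: log_likelihood(frames, models[label]))


# Toy stand-ins for MFCC frames (assumed values, two dimensions per frame).
models = {
    "street": fit_diag_gaussian([[0.0, 0.1], [0.2, -0.1], [0.1, 0.0]]),
    "office": fit_diag_gaussian([[5.0, 5.1], [4.9, 5.2], [5.1, 4.8]]),
}
predicted = classify([[5.0, 5.0], [4.8, 5.1]], models)
```

A full GMM would replace each single Gaussian with a weighted mixture fitted by expectation-maximisation; the decision rule stays the same.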
-
Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning
Authors:
Dan Stowell,
Mark D. Plumbley
Abstract:
Automatic species classification of birds from their sound is a computational tool of increasing importance in ecology, conservation monitoring and vocal communication studies. To make classification useful in practice, it is crucial to improve its accuracy while ensuring that it can run at big data scales. Many approaches use acoustic measures based on spectrogram-type data, such as the Mel-frequency cepstral coefficient (MFCC) features which represent a manually-designed summary of spectral information. However, recent work in machine learning has demonstrated that features learnt automatically from data can often outperform manually-designed feature transforms. Feature learning can be performed at large scale and "unsupervised", meaning it requires no manual data labelling, yet it can improve performance on "supervised" tasks such as classification. In this work we introduce a technique for feature learning from large volumes of bird sound recordings, inspired by techniques that have proven useful in other domains. We experimentally compare twelve different feature representations derived from the Mel spectrum (of which six use this technique), using four large and diverse databases of bird vocalisations, with a random forest classifier. We demonstrate that MFCCs are of limited power in this context, leading to worse performance than the raw Mel spectral data. Conversely, we demonstrate that unsupervised feature learning provides a substantial boost over MFCCs and Mel spectra without adding computational complexity after the model has been trained. The boost is particularly notable for single-label classification tasks at large scale. The spectro-temporal activations learned through our procedure resemble spectro-temporal receptive fields calculated from avian primary auditory forebrain.
Submitted 26 May, 2014;
originally announced May 2014.
-
Large-scale analysis of frequency modulation in birdsong databases
Authors:
Dan Stowell,
Mark D. Plumbley
Abstract:
Birdsong often contains large amounts of rapid frequency modulation (FM). It is believed that the use or otherwise of FM is adaptive to the acoustic environment, and also that there are specific social uses of FM such as trills in aggressive territorial encounters. Yet temporal fine detail of FM is often absent or obscured in standard audio signal analysis methods such as Fourier analysis or linear prediction. Hence it is important to consider high resolution signal processing techniques for analysis of FM in bird vocalisations. If such methods can be applied at big data scales, this offers a further advantage as large datasets become available.
We introduce methods from the signal processing literature which can go beyond spectrogram representations to analyse the fine modulations present in a signal at very short timescales. Focusing primarily on the genus Phylloscopus, we investigate which of a set of four analysis methods most strongly captures the species signal encoded in birdsong. In order to find tools useful in practical analysis of large databases, we also study the computational time taken by the methods, and their robustness to additive noise and MP3 compression.
We find three methods which can robustly represent species-correlated FM attributes, and that the simplest method tested also appears to perform the best. We find that features representing the extremes of FM encode species identity supplementary to that captured in frequency features, whereas bandwidth features do not encode additional information.
Large-scale FM analysis can efficiently extract information useful for bioacoustic studies, in addition to measures more commonly used to characterise vocalisations.
Submitted 19 November, 2013;
originally announced November 2013.
-
An open dataset for research on audio field recording archives: freefield1010
Authors:
Dan Stowell,
Mark D. Plumbley
Abstract:
We introduce a free and open dataset of 7690 audio clips sampled from the field-recording tag in the Freesound audio archive. The dataset is designed for use in research related to data mining in audio archives of field recordings / soundscapes. Audio is standardised, and audio and metadata are Creative Commons licensed. We describe the data preparation process, characterise the dataset descriptively, and illustrate its use through an auto-tagging experiment.
Submitted 1 October, 2013; v1 submitted 20 September, 2013;
originally announced September 2013.
-
Improved multiple birdsong tracking with distribution derivative method and Markov renewal process clustering
Authors:
Dan Stowell,
Sašo Muševič,
Jordi Bonada,
Mark D. Plumbley
Abstract:
Segregating an audio mixture containing multiple simultaneous bird sounds is a challenging task. However, birdsong often contains rapid pitch modulations, and these modulations carry information which may be of use in automatic recognition. In this paper we demonstrate that an improved spectrogram representation, based on the distribution derivative method, leads to improved performance of a segregation algorithm which uses a Markov renewal process model to track vocalisation patterns consisting of singing and silences.
Submitted 15 February, 2013; v1 submitted 14 February, 2013;
originally announced February 2013.
-
Maximum a posteriori estimation of piecewise arcs in tempo time-series
Authors:
Dan Stowell,
Elaine Chew
Abstract:
In musical performances with expressive tempo modulation, the tempo variation can be modelled as a sequence of tempo arcs. Previous authors have used this idea to estimate series of piecewise arc segments from data. In this paper we describe a probabilistic model for a time-series process of this nature, and use this to perform inference of single- and multi-level arc processes from data. We describe an efficient Viterbi-like process for MAP inference of arcs. Our approach is score-agnostic, and together with efficient inference allows for online analysis of performances including improvisations, and can predict immediate future tempo trajectories.
Submitted 1 February, 2013;
originally announced February 2013.
-
Segregating event streams and noise with a Markov renewal process model
Authors:
Dan Stowell,
Mark D. Plumbley
Abstract:
We describe an inference task in which a set of timestamped event observations must be clustered into an unknown number of temporal sequences with independent and varying rates of observations. Various existing approaches to multi-object tracking assume a fixed number of sources and/or a fixed observation rate; we develop an approach to inferring structure in timestamped data produced by a mixture of an unknown and varying number of similar Markov renewal processes, plus independent clutter noise. The inference simultaneously distinguishes signal from noise as well as clustering signal observations into separate source streams. We illustrate the technique via a synthetic experiment as well as an experiment to track a mixture of singing birds.
Submitted 13 November, 2012;
originally announced November 2012.
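To make the data model in the abstract above concrete, here is a minimal sketch that generates the kind of observations it describes: timestamped events from several similar Markov renewal processes mixed with independent clutter noise. It covers only the generative side, not the paper's inference procedure. The two-state kernel, the exponential gap distributions, and all parameter values are assumptions chosen for illustration, and `simulate_mixture` is a hypothetical name.

```python
import random


def simulate_mixture(n_sources=2, duration=20.0, clutter_rate=0.5, seed=0):
    """Return time-sorted (time, state, source) tuples; source -1 marks clutter."""
    rng = random.Random(seed)
    # Assumed renewal kernel: the current state determines both the
    # distribution of the next state and the mean gap to the next event.
    next_state_probs = {0: [0.8, 0.2], 1: [0.3, 0.7]}
    mean_gap = {0: 0.4, 1: 1.5}
    events = []
    for source in range(n_sources):
        state = rng.choice([0, 1])
        t = rng.uniform(0.0, 1.0)  # each source starts at a random offset
        while t < duration:
            events.append((t, state, source))
            # Gap and next state are both drawn conditional on the current state.
            t += rng.expovariate(1.0 / mean_gap[state])
            state = rng.choices([0, 1], weights=next_state_probs[state])[0]
    if clutter_rate > 0:
        # Clutter: a homogeneous Poisson process with uninformative states.
        t = rng.expovariate(clutter_rate)
        while t < duration:
            events.append((t, rng.choice([0, 1]), -1))
            t += rng.expovariate(clutter_rate)
    events.sort()
    return events


events = simulate_mixture(n_sources=3, duration=10.0, clutter_rate=1.0, seed=42)
```

The inference task then amounts to recovering the `source` column from the `(time, state)` pairs alone, without knowing how many sources are present.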