
Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


IAL@ECML-PKDD’24: 8th Intl. Worksh. on Interactive Adaptive Learning


Towards Deep Active Learning in Avian Bioacoustics

Lukas Rauch IES, University of Kassel, Kassel, Germany IEE, Fraunhofer Institute, Kassel, Germany    Denis Huseljic    Moritz Wirth    Jens Decke    Bernhard Sick    Christoph Scholz
(2024)
Abstract

Passive acoustic monitoring (PAM) in avian bioacoustics enables cost-effective and extensive data collection with minimal disruption to natural habitats. Despite advancements in computational avian bioacoustics, deep learning models continue to encounter challenges in adapting to diverse environments in practical PAM scenarios. This is primarily due to the scarcity of annotations, which requires labor-intensive efforts from human experts. Active learning (AL) reduces annotation cost and speeds up adaptation to diverse scenarios by querying the most informative instances for labeling. This paper outlines a deep AL approach, introduces key challenges, and conducts a small-scale pilot study.

keywords:
Deep Active Learning, Avian Bioacoustics, Passive Acoustic Monitoring

1 Introduction

Avian diversity is a key indicator of environmental health. Passive acoustic monitoring (PAM) in avian bioacoustics leverages mobile autonomous recording units (ARUs) to gather large volumes of soundscape recordings with minimal disruption to avian habitats. While this method is cost-effective and minimally invasive, the analysis of these recordings is labor-intensive and requires expert annotation. Recent advancements in deep learning (DL) primarily process these passive recordings by classifying bird vocalizations. In particular, feature embeddings from large bird sound classification models (e.g., Google’s Perch [1] or BirdNET [2]) have effectively enabled few-shot learning in scenarios with limited training data [3]. These state-of-the-art (SOTA) models are trained using supervised learning on nearly 10,000 bird species from multi-class focal recordings that isolate individual bird sounds. However, practical PAM scenarios involve processing diverse multi-label soundscapes with overlapping sounds and varying background noise. Proper feature embeddings for edge deployment necessitate fine-tuning, which relies on labeled training data that is both time-consuming and costly to obtain for soundscapes.

Deep active learning (AL) addresses this challenge by actively querying the most informative instances to maximize performance gains [4]. However, research on deep AL in avian bioacoustics is still limited, and the problem needs to be contextualized with comparable datasets [5]. Additionally, the domain presents unique practical challenges, including adapting models from focals to soundscapes (i.e., multi-class to multi-label) in imbalanced and highly diverse scenarios [6]. Consequently, we introduce the problem of deep AL in avian bioacoustics and propose an efficient fine-tuning approach for model deployment. Our contributions are:

Contributions
1. We introduce deep active learning (AL) to avian bioacoustics, highlighting challenges and proposing a practical framework.
2. We conduct an initial feasibility study based on the dataset collection BirdSet [6], showcasing the benefits of deep AL. Additionally, we release the dataset and code.

2 Related Work

DL has enhanced bird species recognition from vocalizations in the context of biodiversity monitoring. Current SOTA approaches, such as BirdNET [2], Google’s Perch [7, 1], and BirdSet [6], have set benchmarks in bird sound classification. While initial studies focused on model performance on focal recordings, research is increasingly shifting towards practical PAM scenarios [6]. In such environments, ARUs are proving effective for edge deployment for continuous soundscape analysis [8]. Research indicates that pre-trained models facilitate few-shot and transfer learning in data-scarce environments by providing valuable feature embeddings for rapid prototyping and efficient inference [3]. While deep AL is suited for quick model adaptation, its application in avian bioacoustics is still emerging. Bellafkir et al. [9] have integrated AL into edge-based systems for bird species identification, employing reliability scores and ensemble predictions to refine misclassifications through human feedback. This approach highlights the necessity for research into the application of deep AL and multi-label classification in avian bioacoustics. However, comparing these results is challenging because they utilize test datasets that are not publicly available and employ custom AL strategies [9].

3 Active Learning in Bird Sound Classification

Challenges and Motivation. In PAM, a feature vector $\mathbf{x} \in \mathcal{X}$ represents a $D$-dimensional instance, originating from either a focal recording, where $\mathcal{X} = \mathcal{F}$, or a soundscape recording, where $\mathcal{X} = \mathcal{S}$. Focal recordings are extensively available on the citizen-science platform Xeno-Canto (XC) [10] with a global collection of over 800,000 recordings, making them particularly suitable for model training. Large-scale bird sound classification models (e.g., BirdNET [2]) are primarily trained on focals. These multi-class recordings feature isolated bird vocalizations where each instance $\mathbf{x}$ is associated with a class label $y \in \mathcal{Y}$, where $\mathcal{Y} = \{1, \ldots, C\}$. The focal data distribution is denoted as $p_{\texttt{Focal}}(\mathbf{x}, y)$. However, annotations from XC often come with weak labels, lacking precise vocalization timestamps. As noted by Van Merriënboer et al. [11], evaluating on focals does not adequately reflect a model’s generalization performance in real-world PAM scenarios, rendering them unsuitable for assessing deployment capabilities. Soundscape recordings are passively recorded in specific regions, capturing the entire acoustic environment for PAM projects using static ARUs over extended periods. For instance, the High Sierra Nevada (HSN) [2] dataset includes long-duration soundscapes with precise labels and timestamps from multiple sites. These recordings are treated as multi-label tasks and are valuable for assessing model deployment in real-world PAM.
Each instance $\mathbf{x}$ is associated with multiple class labels $y \in \mathcal{Y}$, represented by a one-hot encoded multi-label vector $\mathbf{y} = [y_1, \ldots, y_C] \in [0,1]^C$. An instance can contain no bird sounds, represented by a zero-vector $\mathbf{y} = \mathbf{0} \in \mathbb{R}^C$. Soundscapes’ limited scale and the extensive annotation effort make them less suitable for large-scale model training. Yet, we believe that they are ideal for fine-tuning and adaptation in practical environments. We denote the soundscape data distribution as $p_{\texttt{Scape}}(\mathbf{x}, \mathbf{y})$. The disparity in data distributions, $p_{\texttt{Scape}}(\mathbf{x}, \mathbf{y}) \neq p_{\texttt{Focal}}(\mathbf{x}, y)$, leads to a distribution shift that impacts the performance of SOTA bioacoustic models trained on focals when deployed in PAM, where only a few labeled soundscapes are available for training. Therefore, we propose using deep AL to efficiently adapt models to PAM scenarios.
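To make the label notation concrete, the following minimal Python sketch builds a label vector $\mathbf{y}$ for a single soundscape segment. The helper name `encode_labels` and the zero-based class indices are assumptions made for illustration; they are not taken from BirdSet or from the authors' code.

```python
def encode_labels(present_classes, num_classes):
    """Return a length-C binary label vector for one soundscape segment."""
    y = [0] * num_classes
    for c in present_classes:
        if not 0 <= c < num_classes:
            raise ValueError(f"class index {c} outside [0, {num_classes})")
        y[c] = 1
    return y


# Two overlapping species in one 5-second segment (hypothetical indices).
print(encode_labels([0, 2], num_classes=4))  # [1, 0, 1, 0]
# A segment without any bird vocalization maps to the zero vector.
print(encode_labels([], num_classes=4))      # [0, 0, 0, 0]
```

A focal recording would correspond to the special case of exactly one active entry, which is one way to see the multi-class to multi-label shift described above.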

Our approach. Our approach is detailed in Figure 1. We leverage the BirdSet dataset collection [6] to ensure comparability.

Figure 1: Proposed deep AL cycle in avian bioacoustics with exemplary tasks from BirdSet [6].

We consider a multi-label classification problem, where we equip a model with a pre-trained feature extractor $\mathbf{h}_{\boldsymbol{\omega}} : \mathcal{X} \to \mathbb{R}^D$ with parameters $\boldsymbol{\omega}$ that maps the inputs $\mathbf{x}$ to feature embeddings $\mathbf{h}_{\boldsymbol{\omega}}(\mathbf{x})$. Additionally, we utilize a classification head $\mathbf{f}_{\boldsymbol{\theta}_t} : \mathbb{R}^D \to \mathbb{R}^C$ with parameters $\boldsymbol{\theta}_t$ at cycle iteration $t$ that maps the feature embeddings $\mathbf{h}_{\boldsymbol{\omega}}(\mathbf{x})$ to class probabilities via the sigmoid function. The resulting class probabilities are denoted by $\hat{\mathbf{p}} = \sigma(\mathbf{f}_{\boldsymbol{\theta}_t}(\mathbf{h}_{\boldsymbol{\omega}}(\mathbf{x})))$, where $\hat{\mathbf{p}} \in \mathbb{R}^C$ holds one probability per class, each treated as a separate binary classification problem.
We introduce a pool-based AL setting with an unlabeled pool $\mathcal{U}(t) \subseteq \mathcal{S}$ and a labeled pool $\mathcal{L}(t) \subseteq \mathcal{S} \times \mathcal{Y}$. The pool consists of soundscapes from PAM projects, allowing the model to adapt to the unique acoustic features of new sites and improve performance across various scenarios. During each cycle iteration $t$, the query strategy compiles the most informative instances into a batch $\mathcal{B}(t) \subset \mathcal{U}(t)$ of size $b$. We represent an annotated batch as $\mathcal{B}^*(t) \subseteq \mathcal{S} \times \mathcal{Y}$. We update the unlabeled pool $\mathcal{U}(t+1) = \mathcal{U}(t) \setminus \mathcal{B}(t)$ by removing the queried batch and the labeled pool $\mathcal{L}(t+1) = \mathcal{L}(t) \cup \mathcal{B}^*(t)$ by adding the annotated batch. At each iteration $t$, the model $\boldsymbol{\theta}_t$ is retrained using the binary cross entropy loss $L_{BCE}(\mathbf{x}, \mathbf{y})$, resulting in the updated model parameters $\boldsymbol{\theta}_{t+1}$. The process continues until a budget $B$ is exhausted.
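The cycle described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the function `active_learning_cycle` and its arguments `oracle`, `train`, and `query` are hypothetical placeholders for the human annotator, the retraining of the classification head, and the query strategy, respectively.

```python
import random


def active_learning_cycle(pool, oracle, train, query, n_init, b, budget, seed=0):
    """Run a pool-based AL loop and return (model, labeled, unlabeled).

    pool   : iterable of unlabeled instances, playing the role of U(0)
    oracle : callable x -> y, stands in for the human annotator
    train  : callable labeled_pairs -> model, retrains from scratch
    query  : callable (model, unlabeled, b) -> b distinct indices into unlabeled
    """
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # Initial random batch, since no model exists yet to score instances.
    labeled = [(x, oracle(x)) for x in unlabeled[:n_init]]
    unlabeled = unlabeled[n_init:]
    model = train(labeled)

    # Continue while another full batch of size b still fits into the budget B.
    while len(labeled) + b <= budget and len(unlabeled) >= b:
        picked = set(query(model, unlabeled, b))  # batch B(t)
        # L(t+1): add the annotated batch B*(t).
        labeled += [(unlabeled[i], oracle(unlabeled[i])) for i in sorted(picked)]
        # U(t+1): remove the queried batch from the unlabeled pool.
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in picked]
        model = train(labeled)  # parameters theta_{t+1}
    return model, labeled, unlabeled


# Toy usage with placeholder components (all hypothetical).
model, labeled, unlabeled = active_learning_cycle(
    pool=range(600),
    oracle=lambda x: x % 2,                     # fake annotator
    train=lambda pairs: len(pairs),             # fake "model": number of labels seen
    query=lambda m, pool_u, b: list(range(b)),  # fake strategy: first b instances
    n_init=10, b=10, budget=510,
)
print(len(labeled), len(unlabeled))  # 510 90
```

Passing the query strategy as a callable keeps the loop itself unchanged when strategies are swapped, which mirrors how different selection strategies can be compared on the same pool.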

4 Experiments

Setup. We employ Google’s Perch as the pre-trained feature extractor with a feature dimensionality of $D = 1280$, following Ghani et al. [3]. Each iteration of the AL cycle involves initializing and training the last DNN layer for 200 epochs using the Rectified Adam optimizer [12] (batch size: 128, learning rate: 0.05, weight decay: 0.0001) with a cosine annealing scheduler [13]. The hyperparameters are empirically determined with convergence on random train samples as done in [14]. We utilize the HSN dataset [15] from BirdSet [6], consisting of 5,280 5-second soundscape segments from the initial three days of recordings for our unlabeled pool, and 6,720 segments from the last two days for testing. Initially, 10 instances are selected randomly, followed by 50 iterations of $b = 10$ acquisitions each, totaling a budget of $B = 510$. We benchmark against Random acquisitions and use Typiclust [16] and Badge [17] as diversity-based and hybrid strategies, respectively. As an uncertainty-based strategy, we employ the mean Entropy of all binary predictions. The effectiveness of each strategy is assessed by analyzing the learning curves through a collection of threshold-free metrics [6]: T1-Acc., class-based mean average precision (cmAP), and area under the receiver operating characteristic curve (AUROC). The metrics are computed on the test dataset post-training in each cycle, with learning curve improvements averaged over ten repetitions for consistency.
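One plausible reading of the Entropy strategy is sketched below: each instance is scored by the mean binary entropy of its C sigmoid outputs, and the b highest-scoring instances are queried. The function names, the clipping constant `eps`, and the tie-breaking by index are assumptions for illustration; the exact implementation used in the experiments may differ.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def mean_entropy_score(probs):
    """Average the per-class binary entropies of one instance."""
    return sum(binary_entropy(p) for p in probs) / len(probs)


def select_top_b(prob_matrix, b):
    """Return indices of the b instances with the highest mean entropy."""
    scores = [mean_entropy_score(row) for row in prob_matrix]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:b]


# Three instances with C = 2 predicted class probabilities each (made-up values).
probs = [
    [0.5, 0.5],    # maximally uncertain
    [0.99, 0.01],  # confident
    [0.6, 0.9],    # in between
]
print(select_top_b(probs, b=2))  # [0, 2]
```

With the setup above, the budget follows from 10 initial random labels plus 50 such queries of 10 instances each, i.e., 10 + 50 × 10 = 510.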

Results. We present the improvement curves for the metric collection in Figure 2. The results demonstrate that no single strategy is universally superior across all metrics. However, on nearly all metrics, the strategies improve over Random. Notably, Typiclust displays strong performance across all metrics at the start of the deep AL cycle, supporting the findings of [16] that a diverse selection is beneficial at the cycle’s onset. However, its effectiveness diminishes over time when diversity becomes less crucial. Conversely, Entropy outperforms the other strategies in all iterations for cmAP and T1-Acc, with a consistent improvement over Random of up to 15%. Only for AUROC does Entropy initially perform poorly before improving strongly over time.

Figure 2: Improvement curves of deep AL selection strategies Badge, Entropy, and Typiclust over Random with the metric collection a) AUROC, b) cmAP and c) T1-Acc. The results are averaged over ten randomly initialized repetitions to ensure consistency.
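As a rough illustration of how such an improvement curve could be obtained, the sketch below averages the per-cycle metric over repetitions and subtracts the averaged Random baseline. This is an assumption about the computation: the text does not state whether the reported improvement is an absolute or a relative difference, and the function name `improvement_curve` is hypothetical.

```python
def improvement_curve(strategy_runs, random_runs):
    """Mean per-cycle difference between a strategy and the Random baseline.

    Each argument is a list of repetitions; each repetition is a list with one
    metric value per AL cycle (all repetitions must have the same length).
    """
    def mean_curve(runs):
        n = len(runs)
        return [sum(values) / n for values in zip(*runs)]

    strategy_mean = mean_curve(strategy_runs)
    random_mean = mean_curve(random_runs)
    return [s - r for s, r in zip(strategy_mean, random_mean)]


# Two repetitions over two cycles with made-up metric values.
curve = improvement_curve(
    strategy_runs=[[0.2, 0.4], [0.4, 0.6]],
    random_runs=[[0.1, 0.3], [0.3, 0.3]],
)
print([round(v, 3) for v in curve])  # [0.1, 0.2]
```

Positive values indicate cycles in which the strategy is ahead of Random on the chosen metric.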

5 Conclusion

In this work, we demonstrated the potential of deep active learning (AL) in computational avian bioacoustics. We showed how deep AL can be integrated into real-world passive acoustic monitoring by utilizing BirdSet, where rapid model adaptation through fine-tuning on soundscape recordings is advantageous for the identification of bird species. Our results indicate that employing selection strategies in deep AL enhances model performance and accelerates adaptation compared to random sampling. For future work, we aim to expand the implementation of deep AL in avian bioacoustics utilizing all datasets from the BirdSet dataset collection to provide more robust performance insights and explore more advanced query strategies [13, 18].

References

  • Hamer et al. [2023] J. Hamer, E. Triantafillou, B. Van Merriënboer, S. Kahl, H. Klinck, T. Denton, V. Dumoulin, BIRB: A Generalization Benchmark for Information Retrieval in Bioacoustics, CoRR (2023). URL: https://doi.org/10.48550/arXiv.2312.07439.
  • Kahl et al. [2021] S. Kahl, C. M. Wood, M. Eibl, H. Klinck, BirdNET: A deep learning solution for avian diversity monitoring, Ecological Informatics 61 (2021) 101236. URL: https://doi.org/10.1016/j.ecoinf.2021.101236.
  • Ghani et al. [2023] B. Ghani, T. Denton, S. Kahl, H. Klinck, Feature Embeddings from Large-Scale Acoustic Bird Classifiers Enable Few-Shot Transfer Learning, CoRR (2023). URL: https://doi.org/10.48550/arXiv.2307.06292.
  • Decke et al. [2023] J. Decke, C. Gruhl, L. Rauch, B. Sick, DADO – Low-cost query strategies for deep active design optimization, in: 2023 International Conference on Machine Learning and Applications (ICMLA), IEEE, 2023, pp. 1611–1618.
  • Rauch et al. [2023] L. Rauch, R. Schwinger, M. Wirth, B. Sick, S. Tomforde, C. Scholz, Active Bird2Vec: Towards End-to-End Bird Sound Monitoring with Transformers, CoRR (2023). URL: https://doi.org/10.48550/arXiv.2308.07121.
  • Rauch et al. [2024] L. Rauch, R. Schwinger, M. Wirth, R. Heinrich, D. Huseljic, J. Lange, S. Kahl, B. Sick, S. Tomforde, C. Scholz, BirdSet: A dataset and benchmark for classification in avian bioacoustics, CoRR (2024). URL: https://doi.org/10.48550/arXiv.2403.10380.
  • Denton et al. [2022] T. Denton, S. Wisdom, J. R. Hershey, Improving Bird Classification with Unsupervised Sound Separation, in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 636–640. URL: https://doi.org/10.1109/ICASSP43922.2022.9747202.
  • Höchst et al. [2022] J. Höchst, H. Bellafkir, P. Lampe, M. Vogelbacher, M. Mühling, D. Schneider, K. Lindner, S. Rösner, D. G. Schabo, N. Farwig, B. Freisleben, Bird@Edge: Bird Species Recognition at the Edge, in: Networked Systems, volume 13464, Cham, 2022, pp. 69–86. URL: https://doi.org/10.1007/978-3-031-17436-0_6.
  • Bellafkir et al. [2023] H. Bellafkir, M. Vogelbacher, D. Schneider, M. Mühling, N. Korfhage, B. Freisleben, Edge-Based Bird Species Recognition via Active Learning, in: Networked Systems, volume 14067, Springer Nature Switzerland, Cham, 2023, pp. 17–34. doi:10.1007/978-3-031-37765-5_2.
  • Vellinga and Planqué [2015] W. Vellinga, R. Planqué, The xeno-canto collection and its relation to sound recognition and classification, CEUR-WS.org, 2015. URL: https://xeno-canto.org/.
  • Van Merriënboer et al. [2024] B. Van Merriënboer, J. Hamer, V. Dumoulin, E. Triantafillou, T. Denton, Birds, Bats and beyond: Evaluating generalization in bioacoustic models, CoRR (2024).
  • Liu et al. [2019] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, J. Han, On the variance of the adaptive learning rate and beyond, in: International Conference on Learning Representations, 2019.
  • Huseljic et al. [2024] D. Huseljic, P. Hahn, M. Herde, L. Rauch, B. Sick, Fast fishing: Approximating bait for efficient and scalable deep active image classification, CoRR (2024). URL: https://doi.org/10.48550/arXiv.2404.08981.
  • Huseljic et al. [2023] D. Huseljic, M. Herde, P. Hahn, B. Sick, Role of hyperparameters in deep active learning, in: Workshop on Interactive Adaptive Learning @ ECML PKDD, 2023, pp. 19–24.
  • Kahl et al. [2022] S. Kahl, C. M. Wood, P. Chaon, M. Z. Peery, H. Klinck, A collection of fully-annotated soundscape recordings from the Western United States, 2022. URL: https://doi.org/10.5281/zenodo.7050014.
  • Hacohen et al. [2022] G. Hacohen, A. Dekel, D. Weinshall, Active learning on a budget: Opposite strategies suit high and low budgets, in: International Conference on Machine Learning, 2022.
  • Ash et al. [2020] J. T. Ash, C. Zhang, A. Krishnamurthy, J. Langford, A. Agarwal, Deep batch active learning by diverse, uncertain gradient lower bounds, in: International Conference on Learning Representations, 2020.
  • Rauch et al. [2023] L. Rauch, M. Aßenmacher, D. Huseljic, M. Wirth, B. Bischl, B. Sick, ActiveGLAE: A benchmark for deep active learning with transformers, in: Machine Learning and Knowledge Discovery in Databases: Research Track, Springer Nature Switzerland, 2023, pp. 55–74. URL: https://doi.org/10.1007/978-3-031-43412-9_4.