On the Transferability of Large-Scale Self-Supervision to Few-Shot Audio Classification

Authors: Calum Heggan, Sam Budgett, Timothy Hospedales, Mehrdad Yaghoobi

Abstract: In recent years, self-supervised learning has excelled for its capacity to learn robust feature representations from unlabelled data. Networks pretrained through self-supervision serve as effective feature extractors for downstream tasks, including Few-Shot Learning. While the evaluation of unsupervised approaches for few-shot learning is well-established in imagery, it is notably absent in acoustics. This study addresses this gap by assessing large-scale self-supervised models' performance in few-shot audio classification. Additionally, we explore the relationship between a model's few-shot learning capability and other downstream task benchmarks. Our findings reveal state-of-the-art performance in some few-shot problems such as SpeechCommandsv2, as well as strong correlations between speech-based few-shot problems and various downstream audio tasks.

Submitted 13 February, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

Comments: Camera Ready version as submitted to ICASSP SASB Workshop 2024. 5 pages, 2 figures, 3 tables

arXiv:2305.17191 [pdf, ps, other]

MT-SLVR: Multi-Task Self-Supervised Learning for Transformation In(Variant) Representations

Authors: Calum Heggan, Tim Hospedales, Sam Budgett, Mehrdad Yaghoobi

Abstract: Contrastive self-supervised learning has gained attention for its ability to create high-quality representations from large unlabelled data sets. A key reason that these powerful features enable data-efficient learning of downstream tasks is that they provide augmentation invariance, which is often a useful inductive bias. However, the amount and type of invariances preferred are not known a priori, and vary across different downstream tasks. We therefore propose a multi-task self-supervised framework (MT-SLVR) that learns both variant and invariant features in a parameter-efficient manner. Our multi-task representation provides a strong and flexible feature that benefits diverse downstream tasks. We evaluate our approach on few-shot classification tasks drawn from a variety of audio domains and demonstrate improved classification performance on all of them.

Submitted 26 January, 2024; v1 submitted 29 May, 2023; originally announced May 2023.

Comments: Last author version accepted to InterSpeech23. 5 pages

arXiv:2210.01725 [pdf, other]

MEDFAIR: Benchmarking Fairness for Medical Imaging

Authors: Yongshuo Zong, Yongxin Yang, Timothy Hospedales

Abstract: A multitude of work has shown that machine learning-based medical diagnosis systems can be biased against certain subgroups of people. This has motivated a growing number of bias mitigation algorithms that aim to address fairness issues in machine learning. However, it is difficult to compare their effectiveness in medical imaging for two reasons. First, there is little consensus on the criteria to assess fairness. Second, existing bias mitigation algorithms are developed under different settings, e.g., datasets, model selection strategies, backbones, and fairness metrics, making a direct comparison and evaluation based on existing results impossible. In this work, we introduce MEDFAIR, a framework to benchmark the fairness of machine learning models for medical imaging. MEDFAIR covers eleven algorithms from various categories, nine datasets from different imaging modalities, and three model selection criteria. Through extensive experiments, we find that the under-studied issue of model selection criterion can have a significant impact on fairness outcomes; while in contrast, state-of-the-art bias mitigation algorithms do not significantly improve fairness outcomes over empirical risk minimization (ERM) in both in-distribution and out-of-distribution settings. We evaluate fairness from various perspectives and make recommendations for different medical application scenarios that require different ethical principles. Our framework provides a reproducible and easy-to-use entry point for the development and evaluation of future bias mitigation algorithms in deep learning. Code is available at https://github.com/ys-zong/MEDFAIR.

Submitted 17 February, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

Comments: Accepted to ICLR 2023

arXiv:2204.02121 [pdf, other]

MetaAudio: A Few-Shot Audio Classification Benchmark

Authors: Calum Heggan, Sam Budgett, Timothy Hospedales, Mehrdad Yaghoobi

Abstract: Currently available benchmarks for few-shot learning (machine learning with few training examples) are limited in the domains they cover, primarily focusing on image classification. This work aims to alleviate this reliance on image-based benchmarks by offering the first comprehensive, public and fully reproducible audio-based alternative, covering a variety of sound domains and experimental settings. We compare the few-shot classification performance of a variety of techniques on seven audio datasets (spanning environmental sounds to human speech). Extending this, we carry out in-depth analyses of joint training (where all datasets are used during training) and cross-dataset adaptation protocols, establishing the possibility of a generalised audio few-shot classification algorithm. Our experimentation shows gradient-based meta-learning methods such as MAML and Meta-Curvature consistently outperform both metric and baseline methods. We also demonstrate that the joint training routine helps overall generalisation for the environmental sound databases included, as well as being a somewhat effective method of tackling the cross-dataset/domain setting.

Submitted 10 April, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: 9 pages with 1 figure and 2 main results tables. V1 Preprint

arXiv:2007.02190 [pdf, other]

BézierSketch: A generative model for scalable vector sketches

Authors: Ayan Das, Yongxin Yang, Timothy Hospedales, Tao Xiang, Yi-Zhe Song

Abstract: The study of neural generative models of human sketches is a fascinating contemporary modeling problem due to the links between sketch image generation and the human drawing process. The landmark SketchRNN provided a breakthrough by sequentially generating sketches as a sequence of waypoints. However, this leads to low-resolution image generation, and failure to model long sketches. In this paper we present BézierSketch, a novel generative model for fully vector sketches that are automatically scalable and high-resolution. To this end, we first introduce a novel inverse graphics approach to stroke embedding that trains an encoder to embed each stroke to its best fit Bézier curve. This enables us to treat sketches as short sequences of parameterized strokes and thus train a recurrent sketch generator with greater capacity for longer sketches, while producing scalable high-resolution results. We report qualitative and quantitative results on the Quick, Draw! benchmark.

Submitted 14 July, 2020; v1 submitted 4 July, 2020; originally announced July 2020.

Comments: Accepted as poster at ECCV 2020

arXiv:2003.01063 [pdf, other]

Unlimited Resolution Image Generation with R2D2-GANs

Authors: Marija Jegorova, Antti Ilari Karjalainen, Jose Vazquez, Timothy M. Hospedales

Abstract: In this paper we present a novel simulation technique for generating high quality images of any predefined resolution. This method can be used to synthesize sonar scans of size equivalent to those collected during a full-length mission, with across track resolutions of any chosen magnitude. In essence, our model extends a Generative Adversarial Network (GAN) based architecture into a conditional recursive setting that facilitates the continuity of the generated images. The data produced is continuous, realistic-looking, and can also be generated at least two times faster than the real speed of acquisition for the sonars with higher resolutions, such as EdgeTech. The seabed topography can be fully controlled by the user. The visual assessment tests demonstrate that humans cannot distinguish the simulated images from real ones. Moreover, experimental results suggest that in the absence of real data the autonomous recognition systems can benefit greatly from training with the synthetic data produced by the R2D2-GANs.

Submitted 2 March, 2020; originally announced March 2020.

Comments: Accepted to 2020 IEEE OCEANS (Singapore)

arXiv:1910.06750 [pdf, other]

Full-Scale Continuous Synthetic Sonar Data Generation with Markov Conditional Generative Adversarial Networks

Authors: Marija Jegorova, Antti Ilari Karjalainen, Jose Vazquez, Timothy Hospedales

Abstract: Deployment and operation of autonomous underwater vehicles is expensive and time-consuming. High-quality realistic sonar data simulation could be of benefit to multiple applications, including training of human operators for post-mission analysis, as well as tuning and validation of autonomous target recognition (ATR) systems for underwater vehicles. Producing realistic synthetic sonar imagery is a challenging problem as the model has to account for specific artefacts of real acoustic sensors, vehicle altitude, and a variety of environmental factors. We propose a novel method for generating realistic-looking sonar side-scans of full-length missions, called Markov Conditional pix2pix (MC-pix2pix). Quantitative assessment results confirm that the quality of the produced data is almost indistinguishable from real. Furthermore, we show that bootstrapping ATR systems with MC-pix2pix data can improve the performance. Synthetic data is generated 18 times faster than real acquisition speed, with full user control over the topography of the generated data.

Submitted 18 February, 2020; v1 submitted 15 October, 2019; originally announced October 2019.

Comments: 6 pages, 6 figures. Accepted to ICRA2020. 2020 IEEE International Conference on Robotics and Automation

arXiv:1906.06196 [pdf, other]

Factorized Higher-Order CNNs with an Application to Spatio-Temporal Emotion Estimation

Authors: Jean Kossaifi, Antoine Toisoul, Adrian Bulat, Yannis Panagakis, Timothy Hospedales, Maja Pantic

Abstract: Training deep neural networks with spatio-temporal (i.e., 3D) or multidimensional convolutions of higher-order is computationally challenging due to millions of unknown parameters across dozens of layers. To alleviate this, one approach is to apply low-rank tensor decompositions to convolution kernels in order to compress the network and reduce its number of parameters. Alternatively, new convolutional blocks, such as MobileNet, can be directly designed for efficiency. In this paper, we unify these two approaches by proposing a tensor factorization framework for efficient multidimensional (separable) convolutions of higher-order. Interestingly, the proposed framework enables a novel higher-order transduction, making it possible to train a network on a given domain (e.g., 2D images or N-dimensional data in general) and use transduction to generalize to higher-order data such as videos (or (N+K)-dimensional data in general), capturing for instance temporal dynamics while preserving the learnt spatial information. We apply the proposed methodology, coined CP-Higher-Order Convolution (HO-CPConv), to spatio-temporal facial emotion analysis. Most existing facial affect models focus on static imagery and discard all temporal information. This is due to the above-mentioned burden of training 3D convolutional nets and the lack of large bodies of video data annotated by experts. We address both issues with our proposed framework. Initial training is first done on static imagery before using transduction to generalize to the temporal domain. We demonstrate superior performance on three challenging large scale affect estimation datasets, AffectNet, SEWA, and AFEW-VA.

Submitted 31 March, 2020; v1 submitted 14 June, 2019; originally announced June 2019.

Comments: IEEE CVPR 2020

Showing 1–8 of 8 results for author: Hospedales, T