MAST: Multiscale Audio Spectrogram Transformers

Ghosh, Sreyan; Seth, Ashish; Umesh, S.; Manocha, Dinesh

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2211.01515 (eess)

[Submitted on 2 Nov 2022 (v1), last revised 18 May 2023 (this version, v2)]

Abstract: We present the Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST). Given an input audio spectrogram, we first patchify and project it to an initial temporal resolution and embedding dimension, after which the multiple stages of MAST progressively expand the embedding dimension while reducing the temporal resolution of the input. This pyramid structure allows the early layers of MAST, which operate at a high temporal resolution but with a low-dimensional embedding, to model simple low-level acoustic information, and the deeper, temporally coarse layers to model high-level acoustic information with high-dimensional embeddings. We also extend our approach to a new Self-Supervised Learning (SSL) method called SS-MAST, which calculates a symmetric contrastive loss between latent representations from a student and a teacher encoder and leverages patch-drop, a novel audio augmentation approach that we introduce. In practice, MAST significantly outperforms AST, by 3.4% in average accuracy across 8 speech and non-speech tasks from the LAPE Benchmark, and achieves state-of-the-art results on keyword spotting in Speech Commands. Additionally, our proposed SS-MAST achieves an absolute average improvement of 2.6% over the previously proposed SSAST.
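The pyramid idea in the abstract (temporal resolution shrinking while the embedding dimension grows from stage to stage) can be illustrated with a small shape calculation. This is only a sketch under stated assumptions: the function name `pyramid_schedule`, the halving/doubling factors, and the starting values of 256 tokens and 96 dimensions are hypothetical choices for illustration, not values taken from the paper.

```python
from typing import List, Tuple


def pyramid_schedule(
    num_tokens: int,
    embed_dim: int,
    num_stages: int,
    pool_stride: int = 2,  # assumed temporal reduction factor per stage
    dim_mult: int = 2,     # assumed embedding expansion factor per stage
) -> List[Tuple[int, int]]:
    """Return (temporal tokens, embedding dim) for each stage of a pyramid."""
    schedule = [(num_tokens, embed_dim)]
    for _ in range(num_stages - 1):
        # Each later stage is temporally coarser but has wider embeddings.
        num_tokens = max(1, num_tokens // pool_stride)
        embed_dim = embed_dim * dim_mult
        schedule.append((num_tokens, embed_dim))
    return schedule


if __name__ == "__main__":
    # Hypothetical starting point: 256 temporal tokens, 96-dim embeddings.
    for stage, (tokens, dim) in enumerate(pyramid_schedule(256, 96, 4), start=1):
        print(f"stage {stage}: {tokens} tokens x {dim} dims")
```

With these assumed factors, early stages keep many tokens with narrow embeddings, while the last stage keeps few tokens with wide embeddings, which mirrors the low-level to high-level progression the abstract describes.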

Comments:	ICASSP 2023
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Cite as:	arXiv:2211.01515 [eess.AS]
	(or arXiv:2211.01515v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2211.01515

Submission history

From: Sreyan Ghosh
[v1] Wed, 2 Nov 2022 23:34:12 UTC (20,954 KB)
[v2] Thu, 18 May 2023 01:35:55 UTC (38,928 KB)
