Use of speaker recognition approaches for learning and evaluating embedding representations of musical instrument sounds

Shi, Xuan; Cooper, Erica; Yamagishi, Junichi

arXiv:2107.11506 (eess)

[Submitted on 24 Jul 2021 (v1), last revised 24 Dec 2021 (this version, v2)]

Abstract: Constructing an embedding space for musical instrument sounds that can meaningfully represent new and unseen instruments is important for downstream music generation tasks such as multi-instrument synthesis and timbre transfer. The framework of Automatic Speaker Verification (ASV) provides us with architectures and evaluation methodologies for verifying the identities of unseen speakers, and these can be repurposed for the task of learning and evaluating a musical instrument sound embedding space that can support unseen instruments. Borrowing from state-of-the-art ASV techniques, we construct a musical instrument recognition model that uses a SincNet front-end, a ResNet architecture, and an angular softmax objective function. Experiments on the NSynth and RWC datasets show our model's effectiveness in terms of equal error rate (EER) for unseen instruments, and ablation studies show the importance of data augmentation and the angular softmax objective. Experiments also show the benefit of using a CQT-based filterbank for initializing SincNet over a Mel filterbank initialization. Further complementary analysis of the learned embedding space is conducted with t-SNE visualizations and probing classification tasks, which show that including instrument family labels as a multi-task learning target can help to regularize the embedding space and incorporate useful structure, and that meaningful information such as playing style, which was not included during training, is contained in the embeddings of unseen instruments.
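
As a rough illustration of the EER metric named in the abstract, the sketch below scores verification trials and finds the operating point where the false acceptance and false rejection rates meet. It is not the authors' evaluation code. The function names `cosine_score` and `equal_error_rate`, the use of cosine similarity as the trial score, and the toy scores are all assumptions made for this example.

```python
import numpy as np


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two embedding vectors (assumed scoring rule)."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) for scores where higher means 'same instrument'.

    target_scores: scores of trials whose two sounds share an instrument.
    nontarget_scores: scores of trials whose two sounds come from different instruments.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([target, nontarget]))
    # False rejection rate: target trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials scored at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # The EER is read off where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, float(thresholds[idx])


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    same_instrument = [0.9, 0.6, 0.4, 0.8]
    different_instrument = [0.1, 0.5, 0.7, 0.2]
    eer, threshold = equal_error_rate(same_instrument, different_instrument)
    print(f"EER = {eer:.2f} at threshold {threshold:.2f}")
```

Because the trials can be built entirely from instruments held out of training, a metric of this kind measures how well the embedding space separates unseen instruments, which is the property the abstract emphasizes.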

Comments:	Accepted by the IEEE/ACM Transactions on Audio, Speech, and Language Processing
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2107.11506 [eess.AS]
	(or arXiv:2107.11506v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2107.11506

Submission history

From: Xuan Shi
[v1] Sat, 24 Jul 2021 01:41:45 UTC (1,796 KB)
[v2] Fri, 24 Dec 2021 05:40:04 UTC (1,785 KB)
