A Stem-Agnostic Single-Decoder System
for Music Source Separation Beyond Four Stems

Abstract

Despite significant recent progress across multiple subtasks of audio source separation, few music source separation systems support separation beyond the four-stem vocals, drums, bass, and other (vdbo) setup. Of the very few current systems that support source separation beyond this setup, most continue to rely on an inflexible decoder setup that can only support a fixed pre-defined set of stems. Increasing stem support in these inflexible systems correspondingly requires increasing computational complexity, rendering extensions of these systems computationally infeasible for long-tail instruments. In this work, we propose Banquet, a system that allows source separation of multiple stems using just one decoder. A bandsplit source separation model is extended to work in a query-based setup in tandem with a music instrument recognition PaSST model. On the MoisesDB dataset, Banquet, at only 24.9 M trainable parameters, approached the performance level of the significantly more complex 6-stem Hybrid Transformer Demucs on vdbo stems and outperformed it on guitar and piano. The query-based setup allows for the separation of narrow instrument classes such as clean acoustic guitars, and can be successfully applied to the extraction of less common stems such as reeds and organs.

1 Introduction

Music source separation (MSS) is the task of separating a musical audio mixture into its constituent components, commonly referred to as stems. The releases of DSD100 [1] and MUSDB18 [2, 3], both being four-stem MSS datasets, have defined a de-facto standard, with nearly every major work since relying on the four-stem setup [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. While this has significantly improved the comparability and reproducibility of the task, it has also disproportionately favored the four-stem vocals, bass, drum, and others (vdbo) setup. Very few works have tackled MSS beyond the vdbo setup, each relying on datasets with significant limitations: [20] relied on MedleyDB [21, 22], whose stem ontology is somewhat unfriendly to source separation; [23] relied on the synthetically generated Slakh dataset [24]; and [11, 18] relied on proprietary data inaccessible to other research groups, limiting reproducibility. The recently released MoisesDB [25], a multitrack source separation dataset, attempts to address these limitations, particularly in terms of stem availability and taxonomy. This aims at broadening the task beyond vdbo based on publicly available data. However, to the best of our knowledge, while MoisesDB was used in the 2023 Sound Demixing Challenge (SDX) [26], no published system has utilized MoisesDB for source separation beyond vdbo yet.

In this work, we propose Banquet, a query-based source separation model that can separate an arbitrary number of stems using just one set of stem-agnostic encoder and decoder, and a pretrained feature extractor [27]. Our model was adapted from the cinematic audio source separation Bandit model [28], which was itself adapted from the music source separation Bandsplit RNN model [17]. Bandit significantly reduces the complexity of Bandsplit RNN by adopting a common-encoder approach with stem-specific decoders. In this work, we take the complexity reduction further by switching to a query-based setup, using only one decoder shared amongst all possible stems. Performance evaluation on the MoisesDB dataset demonstrated separation performance above oracle for drum and bass, state-of-the-art for guitar and piano, and at least $7.4\text{\,}\mathrm{d}\mathrm{B}$ SNR for vocals. Our system additionally provided support for fine-level stem extraction currently available only in a few contemporary MSS systems.¹¹1Code and other supplementary materials can be found at github.com/kwatcharasupat/query-bandit.

2 Related Work

As mentioned earlier, nearly every major MSS work since 2017 have relied on the four-stem vdbo setup. Early systems [4, 29, 6], including Open-Unmix [8], were usually time-frequency (TF) masking models with LSTM forming the core of the systems, with some experimenting with densely-connected convolutional systems [5, 12]. Beginning with Wave-U-Net [7], the U-Net architecture became a popular choice for MSS, with notable models such as Demucs [9, 10, 14, 18], Spleeter [11], ByteSep [13], and KUIELab-MDX-Net [15] all being some variations of a U-Net. More recently, Bandsplit RNN [17] became one of the few state-of-the-art systems to not rely on a U-Net setup. This was followed by the Bandsplit RoPE Transformer model [19] top** the leaderboard of SDX 2023 [26].

Of these MSS systems, very few offer separation functionality beyond the vdbo setup. At the time of writing, a commercial system by Moises provides four possible 5-stem setups and one possible 6-stem setup. Their 5-stem setups consist of vdbo and either of guitar, piano, wind, or string, but not multiple at once. The 6-stem setup supports the separation of vdbo, acoustic guitar, and electric guitar. Spleeter [11], which is also used in multiple commercial systems, supports 5-stem separation with vdbo and piano. HT-Demucs [18] supports a 6-stem setup, with vdbo, piano, and guitar.

2.1 Conditional source separation

The systems mentioned above were mostly designed with either stem-specific models, stem-specific decoders, or a shared decoder with predetermined outputs. As a result, these systems are not particularly amenable to the addition of new stems, especially if these new stems have limited data availability. Below we review some of the common approaches for conditional source separation that may be useful for extending existing systems beyond vdbo.

Meseguer-Brocal and Peeters [30] were likely amongst the first to attempt a conditioned U-Net for source separation using a single decoder. They used multiple feature-wise linear modulation (FiLM) [31] layers within the encoder to perform MSS in a vdbo setup with one decoder. Slizovskaia et al. [32] used a similar setup with FiLM layers either throughout the encoder, at the bottleneck layer, or at the final decoder layer. The systems in [32] were tested on the 13-instrument URMP dataset [33], with up to 4 active instruments in any recording, but all performed poorly in terms of mean signal-to-distortion ratio (SDR). Lin et al. [34] proposed a joint separation-transcription U-Net system, which performed well for string and brass instruments in URMP, but struggled on woodwind instruments. The system in [34] used FiLM conditioning throughout the encoder with a query embedding from another convolutional model, and across all skip connections with transcription embeddings. Wang et al. [20] proposed a U-Net with FiLM conditioning only at the bottleneck layer. The system in [20] was able to support a substantial number of stems beyond vdbo with the caveat that its reported performance is significantly below contemporary models for vdbo stems. Liu et al. [35, 36] utilized a FiLM-conditioned version of a Res-U-Net[13] to perform language-queried source separation, but its performance was never evaluated on any commonly used MSS dataset. Liu et al. [35, 36] applied FiLM to nearly every single layer of the network, significantly increasing the computational complexity of the system. We surmise that the apparent need for multiple conditioning in a U-Net is probably due to the nature of its information flow [37], which may require a significant number of information streams to be conditioned to achieve acceptable performance.

In a different direction, source separation systems relying on audio embedding “distances” have also been developed, notably with Le Roux et al. in [38, 23, 39]. In 2018, Kumar et al. [40] presented an early work using Euclidean audio embedding distance from a “query” embedding to inform music source separation. A similar system using a Gaussian mixture model posterior in lieu of standard distance was proposed in [38]. Hierarchical masking [23] was later utilized to allow the extraction of stems at multiple levels of specificity. More recently, source separation systems with audio embedding in a low-dimensional hyperbolic space have been developed to allow music [39] and speech [41] source separation with some degrees of control on the specificity of the extraction.

3 Proposed system

The overview of the proposed Banquet system is shown in Fig. 1. The system is a single-encoder single-decoder system that takes in a mixture signal $\mathbf{x}$ and a query signal $\mathbf{q}$ , and extracts stem estimates $\hat{\mathbf{s}}$ from the mixture signal of the “same” stem type as the query signal using a complex-valued TF mask. This is done by (i) encoding the mixture into a subband-level time-varying embedding tensor $\mathbf{\Upsilon}$ , (ii) encoding the query into a single vector representation $\tilde{\mathbf{z}}$ , (iii) adapting the mixture embedding, conditioned on the query, into a stem-specific embedding $\mathbf{\Lambda}=\mathcal{Q}(\mathbf{\Upsilon};\tilde{\mathbf{z}})$ , then (iv) decoding the modulated embedding tensor to a TF mask $\mathbf{M}$ that is used to obtain the source estimate. The source separator in this work is an adaptation of Bandit [28], itself an adaptation of Bandsplit RNN [17]. There is only one decoder shared across all possible stems.

Refer to caption — Figure 1: Overview of the Banquet System.

3.1 Bandit encoder

The encoder module of the system used in this work is the musical variant of the Bandit encoder in [28], with $B=64$ bands. Specifically, given an input mixture $\mathbf{x}\in\mathbb{R}^{C\times N}$ with $C$ channels and $N$ samples, a short-time Fourier transform (STFT) of $\mathbf{x}$ is computed to obtain $\mathbf{X}\in\mathbb{C}^{C\times F\times T}$ with a frame size of $N_{\text{FFT}}=2F=2048$ and $75\text{\,}\mathrm{\char 37\relax}$ overlap. The STFT is then split into overlap** subbands as detailed in [28]. Each of the subbands is then viewed as a real-valued tensor in $\mathbb{R}^{2CF\times T}$ , passed through a layer norm and an affine transformation with $D=128$ neurons to obtain $\mathbf{V}_{b}\in\mathbb{R}^{D\times T}$ . These tensors are then stacked to obtain $\mathbf{V}\in\mathbb{R}^{D\times B\times T}$ . TF modeling is then applied on $\mathbf{V}$ to obtain $\mathbf{\Upsilon}$ using 8 pairs of residual gated recurrent units (GRUs), the first of each pair operating along the time axis and the second along the band axis.

Note that this TF modeling is the only part of the model that is recurrent across either the time or the subband axes. The rest of the encoder and the decoder operate in a subband-wise manner identically for any time frame.

3.2 Query encoding

To obtain the query embedding, a PaSST model [27] trained on the OpenMIC-2018 dataset [42] is used. The 20 instruments in the OpenMIC dataset span all except the other coarse-level classes of MoisesDB. For compatibility with the PaSST model, each query signal is downmixed to mono and downsampled to $32\text{\,}\mathrm{kHz}$ before being fed to PaSST. Although the query feature extractor could, in theory, be jointly trained with the rest of the system, preliminary experiments showed that this can result in considerable instability during training, especially if the query feature extractor is not at least pretrained. Due to the size and complexity of the PaSST model, the query feature extractor is fully frozen in this work. The embedding from the PaSST variant used is a time series with a feature dimension of $\tilde{D}=784$ . The embedding is averaged over time and linearly down-projected to obtain $\tilde{\mathbf{z}}\in\mathbb{R}^{D}$ .

3.3 Query-based adaptation

In the original Bandit system [28], each stem was estimated through a dedicated decoder. As a result, $\mathbf{\Upsilon}$ typically contains information from all stems, with most of the “separation” occurring within each of the decoders. This is evident in the fact that the encoder of a Bandit system trained on the cinematic audio Divide and Remaster (DnR) dataset [43] could be successfully used in a 4-stem MSS on the MUSDB18-HQ dataset [2] with separation quality on par with Open-Unmix [28].

In this work, however, only a single decoder is responsible for mask estimation for any stem. As a result, the query-based adaptation $\mathcal{Q}\colon\mathbb{R}^{D\times B\times T}\times\mathbb{R}^{D}\mapsto% \mathbb{R}^{D\times B\times T}$ has an important role in filtering out irrelevant information from $\mathbf{\Upsilon}$ , or at least “hinting” to the decoder the nature of the stem being queried. A single FiLM layer is used to map from the mixture embedding to the stem-specific embedding, that is,

\mathbf{\Lambda}[d,b,t]=\mbox{\boldmath{$\upgamma$}}[d]\cdot\mathbf{\Upsilon}[% d,b,t]+\mbox{\boldmath{$\upbeta$}}[d],\ \forall d,b,t.

(1)

where modulating variables $\mbox{\boldmath{$\upgamma$}},\mbox{\boldmath{$\beta$}}\in\mathbb{R}^{D}$ are obtained from a two-layer nonlinear affine map of $\mathbf{\tilde{z}}$ . This is similar to the conditioning method used in [20].

Crucially, note that the modulating variables are not subband-specific. Due to the nature of the TF modeling module within the encoder, features of $\mathbf{\Upsilon}$ are already aligned across subbands and time frames. Moreover, BSRNN-like models only contain one stream of information flow, with a clear bottleneck, thus lending itself to the global conditioning mechanism significantly more than, for example, U-Net-style models in [20, 35, 36].

3.4 Bandit decoder

The decoder used is identical in structure to that in [28]. The major difference is that there is only one stem-agnostic decoder. Given a conditioned embedding tensor $\mathbf{\Lambda}$ , the embedding tensor is split into subband-level representation $\mathbf{\Lambda}_{b}=\mathbf{\Lambda}[\colon,b,\colon]$ . Each $\mathbf{\Lambda}_{b}$ is passed through a layer norm and a gated linear unit (GLU) to obtain a real-valued tensor $\mathbb{R}^{2CF_{b}\times T}$ which is then viewed as a complex-valued tensor $\mathbf{M}_{b}\in\mathbb{C}^{C\times F_{b}\times T}$ . Frequency-domain overlap-add is then applied to obtain the full-band mask using

\displaystyle\mathbf{M}[c,f,t]=\sum_{b=0}^{B-1}\dfrac{\mathbf{W}[b,f]\cdot% \mathbf{M}_{b}[c,f-\min\mathfrak{F}_{b},t]}{\sum_{k=0}^{B-1}\mathbf{W}[k,f]}

(2)

Finally, the source estimates are then obtained using elementwise masking $\hat{\mathbf{S}}=\mathbf{X}\circ\mathbf{M}$ .

3.5 Loss function

The loss function used in this work is the multichannel version of the L1SNR loss proposed in [28]. The contribution for each sample of the loss function is given by

	$\displaystyle\mathcal{L}(\hat{\mathbf{s}};\mathbf{s})$	$\displaystyle=\mathcal{D}(\hat{\mathbf{s}};\mathbf{s})+\mathcal{D}(\Re\hat{% \mathbf{S}};\Re\mathbf{S})+\mathcal{D}(\Im\hat{\mathbf{S}};\mathbf{S}),$		(3)
	$\displaystyle\mathcal{D}(\hat{\mathbf{y}};\mathbf{y})$	$\displaystyle=10\log_{10}\dfrac{\\|\operatorname{vec}(\hat{\mathbf{y}}-\mathbf{% y})\\|_{1}+\epsilon}{\\|\operatorname{vec}(\mathbf{y})\\|_{1}+\epsilon},$		(4)

where $\hat{\mathbf{s}}=\operatorname{iSTFT}(\hat{\mathbf{S}})$ , $\mathbf{s}$ and $\mathbf{S}$ are defined similarly for the ground truth, $\operatorname{vec}(\cdot)$ is the vectorization operator, and $\epsilon=10^{-3}$ for stability.

4 Data and experimental setup

This work utilizes the MoisesDB dataset [25], which consists of 240 songs from 47 artists, in stereo format at $44.1\text{\,}\mathrm{kHz}$ . MoisesDB defined their stem ontology with more than 30 fine-level classes, which are then grouped into 11 coarse-level classes [25, Table 2]. Due to the lack of official splits for MoisesDB, we performed a five-fold split²²2The splits are available in the repository. Note that not all stems contain a sufficient number of data points to be split into a five-fold validation setup. As a result, some stems are only present in a subset of folds. on the dataset stratified by genres. The first three splits are used as the training set, the fourth as the validation set, and the last as the test set.

4.1 Query extraction

For each possible stem of each song, a 10-second chunk of the clean audio of the same stem is extracted as the query signal. This is done by computing a time series of onset strength for each stem and then aggregating the mean onset strength for each 10-second sliding window with a hop size of 512 samples. The 10-second window with the strongest average onset is taken as the query signal. A t-SNE plot of the query embedding is shown in Fig. 2. While clusters can be clearly seen amongst related stems, it can also be seen that there are varying degrees of non-separability of the embedding between fine-level stems.

4.2 Training

Each model was trained using one NVIDIA H100 Tensor Core GPU (80 GB) for up to 150 epochs, unless otherwise stated. A training epoch consists of 8192 mixture-query pairs with a batch size of 4. An Adam optimizer with an initial learning rate of ${10}^{-3}$ was used, with a decay factor of 0.98 per epoch.

In the default sampling strategy, a random song is chosen, a random trainable stem for that song is chosen as the target stem, then a random chunk of $6\text{\,}\mathrm{s}$ is chosen. If the current target chunk has an RMS below $-36\text{\,}\mathrm{d}\mathrm{B}\mathrm{F}\mathrm{S}$ , a new random chunk is chosen for up to 10 more trials. Otherwise, the threshold is dropped to $-48\text{\,}\mathrm{d}\mathrm{B}\mathrm{F}\mathrm{S}$ for another 10 trials. If a suitable chunk is still not found, the next random chunk is chosen regardless of RMS. A pre-extracted query of the same stem is then randomly chosen from the available pool of songs, including the song of the mixture.

4.3 Testing and inference

During testing and inference, each track is split into 6-s segments with a hop size of $0.5\text{\,}\mathrm{s}$ , as per [17]. The estimated stems were then reconstructed into a full track using time-domain overlap-add with a Hann window. The Banquet models are tested in two scenarios: one using a query from a different song, and another using a query from the same song (SSQ). In different-song querying, the query song for each stem is randomly chosen from another song within the test split that contains the stem. When possible, the query song is chosen so that it is from the same genre as the mixture song but from a different artist. Otherwise, a song from any genre with a different artist is chosen.

4.4 Evaluation metrics

In this work, we report the full-track multichannel signal-to-noise ratio (SNR)³³3Signal-to-interference ratio (SIR) and signal-to-artifact ratio (SAR) were not computed as the number of the constituent stems can be large, making the required subspace projection intractable and/or unreliable. It is also unclear if coarse-level ground truth or fine-level ground truth should be used for such a projection. See [44, 45, 46] for background. as the main metric. Specifically, for a test signal $\hat{\mathbf{s}}$ and a reference signal $\mathbf{s}$ , both in $\mathbb{R}^{C\times N}$ , the SNR is computed by

\displaystyle\operatorname{SNR}(\hat{\mathbf{y}};\mathbf{y})=10\log_{10}\left(% {\|\mathbf{s}\|_{F}^{2}}/{\|\hat{\mathbf{s}}-\mathbf{s}\|_{F}^{2}}\right).

(5)

5 Results and discussion

In this section, we provide the results and discussion of our experiments. Section 5.1 discusses pretraining of the Bandit/Banquet encoder. Section 5.2 trials the use of the query-based setup on a subset of vocals, drums, and bass stems. Section 5.3 extends the system to include fine-level stems from guitar and piano families. Finally, Section 5.4 attempts to perform extraction on all possible fine-level stems with sufficient data.

5.1 Encoder pretraining

Preliminary experiments indicated that encoder pretraining is an important step to stabilize the training of the query-based model, especially as the number of query stems grows. The encoder pretraining is done with a common-encoder multi-decoder setup similar to [28] with a vdbo setup for 100 epochs. The vdbo decoders were discarded and the encoder was used for subsequent experiments. The performance of the pretrained model is shown in Table 1, with performance above oracle ideal ratio mask (IRM) for drums and bass, and on par with HT-Demucs for vocals.⁴⁴4All coarse-level results for oracle methods, HT-Demucs, and Spleeter were recomputed only on the test set using song-wise results from github.com/moises-ai/moises-db. The song-wise results were missing for five of the songs (as of 6 April 2024), two of these belong in the test set, thus the aggregates were computed over 46 songs instead of 48 songs.

Table 1: Median SNR of the models trained on the vdbo setup, evaluated on the test set of MoisesDB.

Model Median SNR (dB):	Vocals	Drums	Bass	Other
Bandit [28]	$9.1$	$9.9$	$10.6$	$6.4$
HT-Demucs [18]	$9.1$	$11.0$	$12.2$	$7.3$
Spleeter [11]	$7.4$	$6.6$	$6.8$	$5.0$
Oracle IRM	$10.3$	$9.2$	$8.8$	$7.6$

5.2 Learning to separate from queries

As a first step to verify the ability of the model to perform in a query-based setup, a Banquet model is trained to extract only lead female singer, lead male singer, drums, and bass stems, referred to as the q:vdb setup. We experimented with training the entire model from scratch, using a frozen pretrained encoder (FE), and using a trainable pretrained encoder (TE). While the frozen-encoder setup did not demonstrate any sign of overfitting during the training, the trainable-encoder system demonstrated (very slight) overfitting. As a result, an additional setup with data augmentation (DA) was attempted with the trainable encoder setup. We used simple stem-wise within-song random gain (up to $\pm$ $6\text{\,}\mathrm{d}\mathrm{B}$ ), random time shifting, polarity inversion, and channel swap**.

The results are shown in Table 2. All three variants with pretrained encoder provided better performance than the model trained from scratch, except for drums in the trainable-encoder model without DA being $0.1\text{\,}\mathrm{d}\mathrm{B}$ lower. Thus, for all subsequent experiments, the encoder is always pretrained. Without DA, there was no clear benefit to unfreezing the encoder. However, in a trainable encoder system with DA, slight to moderate improvements were observed across all but the male vocal stem. Note, however, that allowing full-model training significantly increases the number of trainable parameters from 13.5 M to 24.9 M thus the computational cost and training time also increases accordingly. The performances of the drums and bass stems are on par or better than the dedicated-stem setup in Table 1. Generally, the models perform better on female vocals than on male vocals.

Table 2: Median SNR of Banquet models on the q:vdb setup, evaluated with different-song queries.⁶⁶6Median results for the same-song query and different-song query are within

0.2\text{\,}\mathrm{dB}

of each other.

Pretrained Enc.	FE	DA	Female Vox	Male Vox	Drums	Bass
N	N	N	$8.3$	$7.2$	$9.4$	$9.4$
Y	Y	N	$9.8$	$7.6$	$9.9$	$10.2$
Y	N	N	$9.8$	$8.0$	$9.3$	$9.8$
Y	N	Y	$10.2$	$8.0$	$10.1$	$10.8$

5.3 Extending to guitar and piano

Amongst systems that tackled MSS beyond four stems, the next two stems beyond vdbo are usually guitar and piano, due to their high prevalence within pop/rock music. The set of possible queries is thus extended from q:vdb to also include acoustic guitar, clean electric guitar, distorted electric guitar, grand piano, and electric piano. This is referred to as the q:vdbgp setup. Due to the significantly lower number of available training data for guitar and piano stems, we also experimented with a balanced sampling (BS) strategy. In this strategy, a random stem is first chosen as the target stem, then a random song containing that stem is chosen. The remainder of the sampling process is the same as the default. This strategy ensures that every stem has a similar number of training pairs, but distorts the “natural” distribution of stem occurrences.

For comparability with existing systems, the inference outputs of fine-level stems in this setup were added together to form their respective coarse-level predictions.⁷⁷7The ground truth signals for are the full coarse-level tracks, e.g. vocals ground truth include contributions from background vocals even if we do not have background vocals in the predictions. Coarse-level results are shown in Table 3. Fine-level results for trainable-encoder models are shown in Table 4.

At the coarse level, most variants of Banquet continue to perform above the oracle IRM for drums and bass. With the default-sampling trainable encoder systems, the Banquet performed better than HT-Demucs on guitar and piano. Without DA, balanced sampling generally did not lead to consistent improvements for guitar and piano. With balanced sampling and DA on a trainable-encoder model, however, slight gains in median SNRs of guitar and piano were observed, albeit at the cost of vocals and drum SNRs.

At the fine level, the model performance follows a similar trend to that of the coarse level. Drums and bass continue to perform above the oracle IRM, while both lead vocals performed close to the IRM. Guitar and piano performances are still well below IRM. Interestingly, it appears that querying with excerpts from the same or different track did not affect the model performance for most stems except for electric piano. This is likely due to both the small sample size of electric piano limiting generalizability, and the highly diverse set of possible timbres thus the intertwined nature of both the query embedding and the target audio with other keyboard instruments. The ability of the model to query with stems from different tracks is a double-edged sword, however, since this also means that the model is somewhat insensitive to fine differences in timbre between different renditions of the “same” instruments. This could potentially limit its usefulness when applied to a scenario where multiple target stems have very similar timbres.

Table 3: Coarse-level performance of the Banquet models with different-song queries on the q:vdbgp setup

Model	FE	DA	BS	Vox	Lead Vox	Drums	Bass	Guitar	Piano
Banquet	Y	N	N	$8.0$	$7.9$	$9.8$	$10.5$	$2.3$	$0.8$
			Y	$7.9$	$7.7$	$9.6$	$10.5$	$2.2$	$0.9$
	N	N	N	$7.4$	$8.0$	$9.6$	$10.6$	$3.0$	$2.3$
			Y	$7.6$	$7.7$	$9.3$	$10.2$	$2.9$	$2.5$
		Y	N	$7.8$	$7.9$	$10.1$	$10.9$	$3.2$	$2.2$
			Y	$7.6$	$7.9$	$9.5$	$11.0$	$3.3$	$2.5$
HT-Demucs (vdbgpo)				$8.9$	—	$11.6$	$12.4$	$2.4$	$1.7$
Spleeter (vdbpo)				$7.0$	—	$6.9$	$6.7$	—	$0.7$
Oracle IRM				$10.0$	—	$9.6$	$7.8$	$5.2$	$5.0$

Bold: best Banquet model and/or best non-oracle model.

Table 4: Model performance on the q:vdbgp setup fine-level stems.

				Female Vox			Male Vox			Drums			Bass			Acoust. Gtr.			Clean E. Gtr.			Dist. E. Gtr.			Grand Piano			E. Piano
FE	DA	BS	SSQ	Q1	Q2	Q3	Q1	Q2	Q3	Q1	Q2	Q3	Q1	Q2	Q3	Q1	Q2	Q3	Q1	Q2	Q3	Q1	Q2	Q3	Q1	Q2	Q3	Q1	Q2	Q3
N	N	N	N	$5.5$	$9.6$	$13.2$	$6.7$	$7.9$	$10.0$	$8.0$	$9.6$	$11.6$	$7.9$	$9.9$	$12.0$	$0.9$	$1.8$	$3.6$	$0.2$	$0.7$	$2.4$	$0.9$	$2.4$	$5.3$	$0.7$	$2.3$	$2.9$	$0.0$	$0.6$	$0.7$
			Y	$5.6$	$9.6$	$13.2$	$6.7$	$7.9$	$10.0$	$8.0$	$9.6$	$11.6$	$7.9$	$9.9$	$12.0$	$0.9$	$1.8$	$3.7$	$0.2$	$0.9$	$2.6$	$0.9$	$2.4$	$5.3$	$0.7$	$2.2$	$3.0$	$0.0$	$0.8$	$1.5$
		Y	N	$6.1$	$9.6$	$13.1$	$6.8$	$7.7$	$9.7$	$7.8$	$9.3$	$11.3$	$7.6$	$10.0$	$11.5$	$0.8$	$1.8$	$3.6$	$0.2$	$0.8$	$2.5$	$1.0$	$2.5$	$5.4$	$0.8$	$2.5$	$3.1$	$-0.1$	$0.7$	$0.8$
			Y	$6.1$	$9.6$	$13.1$	$6.8$	$7.7$	$9.7$	$7.8$	$9.3$	$11.3$	$7.6$	$10.0$	$11.5$	$0.8$	$1.8$	$3.7$	$-0.0$	$0.9$	$2.7$	$1.2$	$2.5$	$5.4$	$0.8$	$2.5$	$3.1$	$-0.6$	$0.8$	$1.8$
	Y	N	N	$5.5$	$10.1$	$13.0$	$6.9$	$7.9$	$10.2$	$8.5$	$10.1$	$12.3$	$8.4$	$10.7$	$13.2$	$1.2$	$1.7$	$4.5$	$0.2$	$0.9$	$3.0$	$0.9$	$2.8$	$4.7$	$0.8$	$2.8$	$3.2$	$0.1$	$0.5$	$0.9$
			Y	$5.5$	$10.1$	$13.1$	$6.9$	$7.9$	$10.2$	$8.5$	$10.1$	$12.3$	$8.4$	$10.7$	$13.2$	$1.2$	$1.7$	$4.6$	$0.2$	$1.1$	$2.7$	$0.9$	$2.8$	$4.7$	$0.8$	$2.4$	$3.1$	$-0.1$	$0.6$	$0.9$
		Y	N	$5.5$	$10.1$	$13.5$	$6.5$	$7.8$	$10.0$	$8.3$	$9.5$	$11.8$	$8.4$	$10.3$	$12.1$	$1.1$	$1.7$	$3.9$	$0.0$	$0.4$	$2.7$	$0.9$	$3.0$	$4.9$	$0.8$	$2.6$	$3.2$	$0.2$	$0.5$	$0.9$
			Y	$5.5$	$10.1$	$13.5$	$6.5$	$7.8$	$10.0$	$8.3$	$9.5$	$11.8$	$7.8$	$10.3$	$12.1$	$1.0$	$1.7$	$3.9$	$0.3$	$0.6$	$2.7$	$0.6$	$3.0$	$4.8$	$0.8$	$2.5$	$3.2$	$0.6$	$0.9$	$2.1$

FE: frozen encoder, DA: data augmentation, BS: balanced sampling, SSQ: same-song query, Q1: lower quartile, Q2: median, Q3: upper quartile

5.4 Extending beyond guitar and piano

The results for the q:vdbgp setup demonstrated that the model is able to learn to extract 5 additional stems. In this experiment, we extend the set of possible queries to include all remaining stems with at least one data point per fold: effects, pitched percussion, organs & electronic organs, synth pad, synth lead, string section, brass, and reeds. Additionally, bass is now broken up into bass guitar and bass synth. This is referred to as the q:all setup. Although these are all fine-level stems as defined by MoisesDB, some of these classes are more specific than others. For example, brass is a fine-level stem despite possibly including trumpets, trombones, horns, and tuba. The experimental setups are similar to that of Setup B.⁸⁸8It is perhaps worth noting that BS and DA models for q:all were significantly more unstable during training than for the q:vdbgp setup, despite being identical architecturally. When this happens, we discard the collapsed model and restart the training from scratch until we have a model that completes the entire training run with nonsilent output for most stems. No TE+DA+BS system was stable enough to finish the training run without collapse.

The same-song query results⁹⁹9Note that when the FE and the TE+DA systems have SNR concentrated at $0\text{\,}\mathrm{dB}$ for the long-tail stems, these are indicators of the model outputting very soft, practically silent output. In general, a model yielding negative SNR for a particular stem might be more desirable than a model that has collapsed for a particular stem. for the models trained in q:all are shown in Fig. 3. The performances of the model trained on q:all on the stems from the q:vdbgp setup are similar to those in Table 4, with the exception of the significant drop in performance for the balanced-sampled trainable-encoder model. Amongst the newly added stems, there are significant variations in performance, but they are all still very weak in terms of SNR, with no sample performing above $5\text{\,}\mathrm{d}\mathrm{B}$ SNR. For organs, background vocals, and both synth stems, the trainable-encoder model yielded the better upper quartile and maximum performance, but is also very unreliable. Unfortunately, balanced sampling on a trainable encoder model only worsened the performance. DA on a trainable-encoder model with default sampling slightly improved the lower quartile performance, but is also accompanied by lower maximum and upper quartile performance. Frozen-encoder system collapsed for most long-tail stems in default sampling, but balanced sampling interestingly was more stable and performed the best for bass synth, pitched percussion, reeds, and brass. Evidently, the classical tradeoffs are at play here; allowing the model more flexibility with a trainable encoder also comes with a higher risk of model collapse or unreliable performance. More surprisingly, the fact that even a frozen encoder trained on a vdbo setup was able to function at all beyond q:vdb indicates that the embedding space of a Bandit encoder already contains information that is partially generalizable beyond vdbo, as also observed in [28].

The results of the long-tail stems are somewhat unsurprising given that the genre distribution in MoisesDB skewed heavily toward pop, rock, and singer-songwriter. In addition to the low track counts, these long-tail instruments also tend to have infrequent active segments and relatively softer levels within a song. In fact, of the long-tail stems, reeds and pitched percussion are the only ones with median RMS above $-35\text{\,}\mathrm{d}\mathrm{B}\mathrm{F}\mathrm{S}$ . Analysis of the SNR distribution shows that the model performance is quite correlated to the track-level RMS of the target signal (Spearman’s $\rho$ between $0.78$ and $0.81$ ). This is likely due to a combination of low data availability and the inherent difficulty associated with cleanly extracting these “supporting” stems when there are significant spectral overlaps from more prominent co-occurring stems. In light of the recently published analysis in [47], we may have been too conservative with our DA setup. In particular, we made a conscious choice to only perform gain augmentation close to the original levels, instead of significantly amplifying softer stems. Whether the latter may improve the result at all will have to be addressed in future work. Moreover, given that [34] saw partial success with the predominantly classical instrumentation of URMP, there may also be an opportunity for a much more aggressive cross-dataset DA.

6 Conclusion

In this work, Banquet, a stem-agnostic single-decoder query-based source separation system was proposed to address MSS beyond the vdbo stems. At 24.9 M trainable parameters, this highly modularized model with a single stream of information flow provided strong performance for vocals, drums, and bass; outperformed significantly more complex HT-Demucs on guitar and piano; and provided a proof-of-concept for extractions of additional long-tail and/or fine-grained stems at no additional complexity. While there remains room for improvements for long-tail stems with low data availability, this work demonstrated the opportunity for further research on single-decoder systems toward supporting a large and diverse set of stems.

7 Acknowledgments

This work was supported by the Cyber-Infrastructure Resource Awards, the Institute for Data Engineering and Science (IDEaS), Georgia Institute of Technology.

K. N. Watcharasupat was separately supported by the American Association of University Women (AAUW) International Fellowship, and the IEEE Signal Processing Society Scholarship Program.

The authors would like to thank Yiwei Ding and Chih-Wei Wu for their assistance with the project.

References

[1] A. Liutkus, F.-R. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave, “The 2016 Signal Separation Evaluation Campaign,” in Proceedings of the 13th International Conference onLatent Variable Analysis and Signal Separation, P. Tichavský, M. Babaie-Zadeh, O. J. Michel, and N. Thirion-Moreau, Eds. Cham: Springer International Publishing, 2017, pp. 323–332.
[2] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, “The MUSDB18 corpus for music separation,” Dec. 2017.
[3] ——, “MUSDB18-HQ - an uncompressed version of MUSDB18,” Aug. 2019.
[4] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, “Improving music source separation based on deep neural networks through data augmentation and network blending,” in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). New Orleans, LA: IEEE, Mar. 2017, pp. 261–265.
[5] N. Takahashi and Y. Mitsufuji, “Multi-Scale multi-band densenets for audio source separation,” in Proceedings of the 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz, NY: IEEE, Oct. 2017, pp. 21–25.
[6] F. R. Stöter, A. Liutkus, and N. Ito, “The 2018 Signal Separation Evaluation Campaign,” in Proceedings of the 2018 International Conference on Latent Variable Analysis and Signal Separation, 2018, pp. 293–305.
[7] D. Stoller, S. Ewert, and S. Dixon, “Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation,” in Proceedings of the 19th International Society for Music Information Retrieval Conference, 2018, pp. 334–340.
[8] F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji, “Open-Unmix - A Reference Implementation for Music Source Separation,” Journal of Open Source Software, vol. 4, no. 41, p. 1667, Sep. 2019.
[9] A. Défossez, N. Usunier, L. Bottou, and F. Bach, “Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed,” 2019.
[10] ——, “Music Source Separation in the Waveform Domain,” Nov. 2019.
[11] R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, “Spleeter: A fast and efficient music source separation tool with pre-trained models,” Journal of Open Source Software, vol. 5, no. 50, p. 2154, Jun. 2020.
[12] N. Takahashi and Y. Mitsufuji, “D3Net: Densely connected multidilated DenseNet for music source separation,” Mar. 2021.
[13] Q. Kong, Y. Cao, H. Liu, K. Choi, and Y. Wang, “Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation,” in Proceedings of the 22nd International Society for Music Information Retrieval Conference, Online, 2021, pp. 342–349.
[14] A. Défossez, “Hybrid Spectrogram and Waveform Source Separation,” in Proceedings of the 2021 Music Demixing Workshop, Nov. 2021.
[15] M. Kim, W. Choi, J. Chung, D. Lee, and S. Jung, “KUIELab-MDX-Net: A Two-Stream Neural Network for Music Demixing,” in Proceedings of the Music Demixing Workshop, Nov. 2021.
[16] Y. Mitsufuji, G. Fabbro, S. Uhlich, F.-R. Stöter, A. Défossez, M. Kim, W. Choi, C.-Y. Yu, and K.-W. Cheuk, “Music Demixing Challenge 2021,” Frontiers in Signal Processing, vol. 1, p. 808395, Jan. 2022.
[17] Y. Luo and J. Yu, “Music Source Separation With Band-Split RNN,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1893–1901, 2023.
[18] S. Rouard, F. Massa, and A. Défossez, “Hybrid Transformers for Music Source Separation,” in Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Institute of Electrical and Electronics Engineers (IEEE), May 2023.
[19] W.-T. Lu, J.-C. Wang, Q. Kong, and Y.-N. Hung, “Music Source Separation with Band-Split RoPE Transformer,” Sep. 2023.
[20] Y. Wang, D. Stoller, R. M. Bittner, and J. Pablo Bello, “Few-Shot Musical Source Separation,” in Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore: IEEE, May 2022, pp. 121–125.
[21] R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. Bello, “MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research,” in Proceedings of the 15th Conference of the International Society for Music Information Retrieval, 2014.
[22] R. M. Bittner, J. Wilkins, H. Yip, and J. P. Bello, “MedleyDB 2.0 : New Data and a System for Sustainable Data Collection,” in Extended Abstracts for the Late-Breaking Demo Session of the 17th International Society for Music Information Retrieval Conference, 2016.
[23] E. Manilow, G. Wichern, and J. Le Roux, “Hierarchical Musical Instrument Separation,” in Proceedings of the 21st International Society for Music Information Retrieval Conference, 2020.
[24] E. Manilow, G. Wichern, P. Seetharaman, and J. Le Roux, “Cutting Music Source Separation Some Slakh: A Dataset to Study the Impact of Training Data Quality and Quantity,” in Proceedings of the 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz, NY, USA: IEEE, Oct. 2019, pp. 45–49.
[25] I. Pereira, F. Araújo, F. Korzeniowski, and R. Vogl, “MoisesDB: A Dataset for Source Separation Beyond 4-Stems,” in Proceedings of the 24th International Society for Music Information Retrieval Conference, Milan, Italy, 2023, pp. 619–626.
[26] G. Fabbro, S. Uhlich, C.-H. Lai, W. Choi, M. Martínez-Ramírez, W. Liao, I. Gadelha, G. Ramos, E. Hsu, H. Rodrigues, F.-R. Stöter, A. Défossez, Y. Luo, J. Yu, D. Chakraborty, S. Mohanty, R. Solovyev, A. Stempkovskiy, T. Habruseva, N. Goswami, T. Harada, M. Kim, J. H. Lee, Y. Dong, X. Zhang, J. Liu, and Y. Mitsufuji, “The Sound Demixing Challenge 2023 – Music Demixing Track,” Feb. 2024.
[27] K. Koutini, J. Schlüter, H. Eghbal-zadeh, and G. Widmer, “Efficient Training of Audio Transformers with Patchout,” in Proceedings of Interspeech 2022. ISCA, Sep. 2022, pp. 2753–2757.
[28] K. N. Watcharasupat, C.-W. Wu, Y. Ding, I. Orife, A. J. Hipple, P. A. Williams, S. Kramer, A. Lerch, and W. Wolcott, “A Generalized Bandsplit Neural Network for Cinematic Audio Source separation,” IEEE Open Journal of Signal Processing, vol. 5, pp. 73–81, 2023.
[29] N. Takahashi, N. Goswami, and Y. Mitsufuji, “MMDenseLSTM: An Efficient Combination of Convolutional and Recurrent Neural Networks for Audio Source Separation,” in Proceedings of the 16th International Workshop on Acoustic Signal Enhancement (IWAENC). Tokyo: IEEE, Sep. 2018, pp. 106–110.
[30] G. Meseguer-Brocal and G. Peeters, “Conditioned-U-Net: Introducing a Control Mechanism in the U-Net for Multiple Source Separations,” in Proceeedings of the 20th International Society for Music Information Retrieval Conference, Nov. 2019, pp. 159–165.
[31] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, “FiLM: Visual Reasoning with a General Conditioning Layer,” Dec. 2017.
[32] O. Slizovskaia, G. Haro, and E. Gómez, “Conditioned Source Separation for Music Instrument Performances,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2083–2095, 2021.
[33] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma, “Creating a Multitrack Classical Music Performance Dataset for Multimodal Music Analysis: Challenges, Insights, and Applications,” IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 522–535, Feb. 2019.
[34] L. Lin, Q. Kong, J. Jiang, and G. Xia, “A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis,” Aug. 2021.
[35] X. Liu, H. Liu, Q. Kong, X. Mei, J. Zhao, Q. Huang, M. D. Plumbley, and W. Wang, “Separate What You Describe: Language-Queried Audio Source Separation,” in Proceedings of Interspeech 2022. ISCA, Sep. 2022, pp. 1801–1805.
[36] X. Liu, Q. Kong, Y. Zhao, H. Liu, Y. Yuan, Y. Liu, R. Xia, Y. Wang, M. D. Plumbley, and W. Wang, “Separate Anything You Describe,” Oct. 2023.
[37] S. Lee and I. V. Bajic, “Information Flow Through U-Nets,” in 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). Nice, France: IEEE, Apr. 2021, pp. 812–816.
[38] P. Seetharaman, G. Wichern, S. Venkataramani, and J. Le Roux, “Class-conditional Embeddings for Music Source Separation,” in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, United Kingdom: IEEE, May 2019, pp. 301–305.
[39] D. Petermann, G. Wichern, A. Subramanian, and J. L. Roux, “Hyperbolic Audio Source Separation,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Rhodes Island, Greece: IEEE, Jun. 2023, pp. 1–5.
[40] R. Kumar, Y. Luo, and N. Mesgarani, “Music Source Activity Detection and Separation Using Deep Attractor Network,” in Proceedings of Interspeech 2018. ISCA, Sep. 2018, pp. 347–351.
[41] D. Petermann and M. Kim, “Hyperbolic Distance-Based Speech Separation,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Seoul, Korea, Republic of: IEEE, Apr. 2024, pp. 1191–1195.
[42] E. J. Humphrey, S. Durand, and B. McFee, “OpenMIC-2018: An open dataset for multiple instrument recognition,” in Proceedings of the 19th International Society for Music Information Retrieval Conference, 2018, pp. 438–444.
[43] D. Petermann, G. Wichern, Z.-Q. Wang, and J. Le Roux, “The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2023.
[44] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[45] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR - Half-baked or Well Done?” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 626–630.
[46] R. Scheibler, “SDR - Medium Rare With Fast Computations,” in Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 701–705.
[47] C.-B. Jeon, G. Wichern, F. G. Germain, and J. Le Roux, “Why does music source separation benefit from cacophony?” in Proceedings of the 1st Workshop on on Explainable Machine Learning for Speech and Audio, Feb. 2024.

A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems