Ambisonizer: Neural Upmixing as Spherical Harmonics Generation

Abstract

Neural upmixing, the task of generating immersive music with an increased number of channels from fewer input channels, has been an active research area, with mono-to-stereo and stereo-to-surround upmixing treated as separate problems. In this paper, we propose a unified approach to neural upmixing by formulating it as spherical harmonics - more specifically, Ambisonic generation. We explicitly formulate mono upmixing as unconditional generation and stereo upmixing as conditional generation, where the stereo signals serve as conditions. We provide evidence that our proposed methodology, when decoded to stereo, matches a strong commercial stereo widener in subjective ratings. Overall, our work presents direct upmixing to Ambisonic format as a strong and promising approach to neural upmixing. A discussion on limitations is also provided.

1 Introduction

Channel upmixing is a technique that enables better playback of audio with fewer channels on systems equipped with more channels. It includes two primary categories: converting mono to stereo and stereo to surround. Mono-to-stereo upmixing allows monophonic content to be enjoyed on stereo playback devices like headphones and stereo speakers, while stereo-to-surround upmixing enables stereo content to be experienced on home theater or surround sound systems.

Recent advancements in microcontroller production, head-tracking technology, and head-related transfer functions (HRTFs) have facilitated spatial content playback on consumer headphones using channel-agnostic playback formats [1, 2]. These formats encode audio information within a spatial audio field and utilize signal processing techniques on the playback device to convert the audio into the necessary channel format. This method provides a more adaptable and versatile audio playback experience, as the audio can be tailored to the specific playback system and the listener’s preferences. The trend of channel-agnostic formats is on the rise, with popular music and video streaming platforms like Apple Music and YouTube adopting this technology.

Despite the growing popularity of channel-agnostic playback, research on channel-agnostic upmixing is limited. A similar approach involves using source separation methods to isolate individual elements or tracks and position them at specific locations within the spatial audio field [3, 4, 5, 6]. However, the quality of separation in current models is a constraint, as they often can only separate a limited number of predetermined tracks [9] and introduce noticeable artifacts after upmixing [10, 11]. (Models such as [7, 8] can separate an undetermined number of tracks, yet their performance lags significantly behind that of their predetermined counterparts, making them currently unusable for neural upmixing.) Moreover, the manual placement of elements in the spatial audio field necessitates human expertise, limiting the practicality of this approach. This limitation highlights the necessity for further research and advancement in channel-agnostic upmixing to enhance the quality and effectiveness of spatial audio playback on consumer headphones.

In response to this need, we propose Ambisonizer, a novel paradigm that leverages spherical harmonics to create channel-agnostic neural upmixing. By leveraging the Ambisonic format, we directly generate an Ambisonic first-order upmix from a mono sound file. Under mono-to-any upmixing scenarios, we treat the problem as unconditional generation; under stereo-to-any upmixing scenarios, we downmix the stereo signal into mono, and treat the stereo signal as a spatial condition to the generation process. To the best of our knowledge, our work proposes the first framework that allows for such mono-to-any and stereo-to-any neural upmixing.

Through subjective evaluations, we demonstrate that both mono-to-any and stereo-to-any Ambisonizer generation results, when downmixed to stereo, match a strong commercial mono-to-stereo upmixing baseline, with the added benefit of being channel-agnostic. We also discuss the limitations of using the Ambisonic B-format as a middle format for channel-agnostic upmixing. To facilitate future research, our code, model artifacts, and data generation pipeline will be open-sourced soon at https://ambisonizer.netlify.app.

2 The Ambisonic format

2.1 First-Order Ambisonics

Ambisonics is a multichannel format designed to capture and reproduce the spatial characteristics of sound fields. The encoding of audio information into the Ambisonic format typically begins with first-order Ambisonics, which utilizes four spherical harmonic channels: $W$, $X$, $Y$, and $Z$ (known collectively as the Ambisonic B-format). The $W$ channel, also considered the zeroth-order Ambisonic component, represents the omnidirectional component of the sound field and is proportional to the acoustic pressure $p(t)$, similar to how an omnidirectional microphone captures sound from all directions. The $X$, $Y$, and $Z$ channels capture sound along the $X$-, $Y$-, and $Z$-axes, respectively, in a fashion similar to figure-8 microphones. Together, these channels encode both the magnitude and the directional information of the sound at any given point within the sound field. If elevation information is not needed during decoding, the $Z$ channel may be omitted in first-order Ambisonics [12].

Ambisonic decoders are used for rendering the Ambisonic format to specific speaker layouts. Accurate and efficient implementations of decoders are an active area of research [13]. However, it is currently hard for any Ambisonic decoder to recreate the sound field perfectly. The finite number of spherical harmonic coefficients used in Ambisonics commonly leads to truncation artifacts, affecting the accuracy of sound field recreation. As a result, many decoders prioritize either physical or perceptual accuracy. For instance, an In-Phase decoder could greatly reduce localization artifacts but may not provide the best physical accuracy compared to other methods [14].

2.2 Higher Order Ambisonics

The first-order Ambisonic B-format represents the sound field using only 4 channels, which provides limited spatial information. This limitation places a greater burden on decoders, which are responsible for rendering the audio to the desired playback channel configuration. As a result, perceptual deficiencies, such as poor localization accuracy and coloration, may occur [15]. To mitigate these issues, Higher-Order Ambisonics (HOA) [16] has been introduced, allowing for higher spatial resolution in Ambisonic data. In HOA, the spherical harmonics are arranged symmetrically, centering around the $z$-rotationally symmetric component for each order. The components to the left of the center represent sine-based horizontal components, while the components to the right represent cosine-based horizontal components. An example for $4^{th}$-order Ambisonics is shown in Figure 1.
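The symmetric layout described above can be made concrete with a small sketch. It assumes the standard indexing of spherical harmonics by order $n$ and degree $m \in [-n, n]$, and it uses the ACN channel index $n^2 + n + m$ purely as an illustrative convention; neither the function names nor the ACN ordering come from the paper.

```python
def hoa_channel_count(order: int) -> int:
    """Number of spherical-harmonic channels up to and including `order`."""
    return (order + 1) ** 2


def hoa_row(order: int) -> list:
    """Components of a single order n, listed left to right from m = -n to m = +n."""
    row = []
    for m in range(-order, order + 1):
        if m < 0:
            kind = "sine"          # left of the center
        elif m > 0:
            kind = "cosine"        # right of the center
        else:
            kind = "z-symmetric"   # center component, rotationally symmetric about z
        row.append((m, kind))
    return row


def acn_index(order: int, degree: int) -> int:
    """ACN channel index; an assumed convention, not one stated in the text."""
    return order * order + order + degree
```

Under these assumptions, `hoa_channel_count(1)` gives the 4 channels of the first-order B-format, and `hoa_channel_count(4)` gives 25 channels for the 4th-order case.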

Figure 1: Higher-Order Ambisonics ($4^{th}$ order) [17]

3 The Ambisonizer Model

The Ambisonizer model is based on the intuition that existing mono-to-stereo and stereo-to-surround upmixing models implicitly generate spherical harmonics. This is because these models require an inherent understanding of spherical harmonics to successfully upmix audio to a higher number of channels. Consequently, we propose directly upmixing the input signal to the Ambisonic B-format, which enables a more explicit way of generating spherical harmonics. This approach allows the utilization of existing Ambisonic decoders to render the upmixed audio to specific channel layouts, providing flexibility and compatibility with various playback systems.

Figure 2 illustrates the overall framework of the proposed Ambisonizer model, with input $Y = (Y_L, Y_R)$ and output $Y' = (Y_W, Y_X, Y_Y)$. We start by averaging the two channels to obtain $Y_{mono} = \frac{1}{2}(Y_L + Y_R)$, which we posit is equal to $Y_W$. $Y_W$ serves as input to the audio encoder, and the stereo signal $Y$ is used as input to the spatial information encoder, yielding $Z$ and $Z_C$, respectively. The results are combined and processed using transformer encoder layers, then passed through a decoder to obtain $Y_X$ and $Y_Y$, thereby obtaining the complete first-order Ambisonics (excluding the elevation information $Y_Z$). $Y_W$, $Y_X$, and $Y_Y$ directly form the Ambisonic B-format $W$, $X$, and $Y$ channels, which can be directly utilized for decoding.

Figure 2: The Ambisonizer model architecture. Blue blocks denote encoders, and green block denotes the decoder. Best viewed in color.

3.1 Audio and Spatial Encoders

The audio and spatial encoders in our model are inspired by the designs presented in [18] and [19]. Given a raw waveform as an input, the audio encoder first applies a 1D convolution with a kernel size of 7 and 32 output channels. The output then passes through a series of residual convolution blocks, each consisting of a residual-connected 1D convolution followed by a downsampling block. After each downsampling operation, the number of channels is doubled to capture more abstract features. The residual block is repeated four times with strides of $(2, 4, 5, 8)$. The final output of the audio encoder is obtained by passing the feature map through a two-layer LSTM and a 1D convolution with a kernel size of 7.
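As a rough sanity check of the encoder geometry described above, the following sketch tracks the channel count and temporal length through the four downsampling stages. It assumes each strided stage is padded so that the length becomes $\lceil T / \text{stride} \rceil$; the padding scheme and the helper name `encoder_shapes` are assumptions rather than details given in the text.

```python
import math

STRIDES = (2, 4, 5, 8)   # downsampling strides from the text
BASE_CHANNELS = 32       # output channels of the first convolution


def encoder_shapes(num_samples: int):
    """Return (channels, frames) after the stem and after each downsampling block."""
    channels, frames = BASE_CHANNELS, num_samples
    shapes = [(channels, frames)]
    for stride in STRIDES:
        channels *= 2                         # channels double after each downsampling
        frames = math.ceil(frames / stride)   # assumed padding keeps ceil(T / stride)
        shapes.append((channels, frames))
    return shapes
```

Under these assumptions the total downsampling factor is $2 \cdot 4 \cdot 5 \cdot 8 = 320$, so a 120K-sample training crop would be reduced to 375 frames with 512 channels before the LSTM and the final convolution.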

The spatial encoder is additionally treated as a variational autoencoder (VAE) [20], with its output reparameterized to follow a unit Gaussian distribution, $\mathcal{N}(0, 1)$. This reparameterization allows for direct sampling from the Gaussian distribution when stereo information is not provided, enabling the unconditional generation of Ambisonic audio from mono input.
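A minimal sketch of how the two generation modes could obtain the spatial embedding is given below. It assumes the spatial encoder predicts a mean and a log-variance per latent dimension (the usual VAE parameterization); the function name, argument names, and the default 64-dimensional shape are illustrative assumptions.

```python
import numpy as np


def sample_spatial_embedding(mu=None, log_var=None, shape=(64,), rng=None):
    """Draw a spatial embedding Z_C.

    Conditional mode: reparameterize with the encoder outputs (mu, log_var).
    Unconditional mode: no stereo input is available, so sample the prior N(0, 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    if mu is None or log_var is None:
        return rng.standard_normal(shape)          # sample from the unit Gaussian prior
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    eps = rng.standard_normal(mu.shape)            # noise independent of the encoder
    return mu + np.exp(0.5 * log_var) * eps        # z = mu + sigma * eps
```

Calling the function without `mu` and `log_var` corresponds to the mono (unconditional) case, while passing encoder outputs corresponds to the stereo (conditional) case.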

3.2 Bottleneck

The bottleneck of our model consists of a series of transformer encoder layers [21]. We concatenate the latent representation $Z$ obtained from the audio encoder and the spatial embedding $Z_C$ obtained from the spatial encoder. The concatenated vector is then fed into the transformer encoder layers. We use 8 attention heads and 8 transformer encoder layers to capture complex dependencies and learn a rich representation of the audio and spatial information.

3.3 Decoder

The decoder of our model mirrors the structure of the encoder but with transposed 1D convolutions instead of strided convolutions. This design choice allows the decoder to gradually upsample the bottleneck representation and generate the output Ambisonic audio.

To further enhance the decoder’s ability to capture cross-channel dependencies, we introduce channel attention mechanisms [22] before each residual block. Channel attention allows the model to adaptively weigh the importance of different channels at each stage of the decoding process. This is achieved by learning a set of channel-wise weights that are applied to the feature maps. The channel attention mechanism can be formulated as follows:

$$F_{att} = \sigma(W_2(\text{ReLU}(W_1(F_{avg})))) \odot F \tag{1}$$

where $F_{att}$ is the attended feature map, $F$ is the input feature map, $F_{avg}$ is the channel-wise average pooled feature map, $W_1$ and $W_2$ are learnable linear layers, $\sigma$ is the sigmoid function, and $\odot$ is element-wise multiplication.
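Equation 1 can be sketched in a few lines of NumPy. The version below assumes a feature map of shape (channels, time), average pooling over the time axis, and linear layers with bias terms; the shapes and the bias terms are assumptions made for illustration.

```python
import numpy as np


def channel_attention(F, W1, b1, W2, b2):
    """Apply Equation 1 to a feature map F of shape (channels, time)."""
    F_avg = F.mean(axis=1)                                 # one pooled value per channel
    hidden = np.maximum(W1 @ F_avg + b1, 0.0)              # ReLU(W1(F_avg))
    weights = 1.0 / (1.0 + np.exp(-(W2 @ hidden + b2)))    # sigmoid(W2(...))
    return weights[:, None] * F                            # broadcast channel weights over time
```

Each channel is scaled by a single weight in $(0, 1)$, so the operation re-weights channels without changing the shape of the feature map.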

3.4 Loss Function

We use the Evidence Lower Bound (ELBO) loss to train the Ambisonizer model, where the reconstruction loss is balanced with a regularization term to encourage the learned distribution of the latent variables produced by the spatial encoder to follow the unit Gaussian $\mathcal{N}(0, 1)$. It is defined as:

$$\begin{aligned}\mathcal{L}_{\text{ELBO}} = {} & -\mathbb{E}_{q(Z_C \mid Y)}[\log p(Y \mid Z_C)] \\ & + \mathbb{KL}\big(q(Z_C \mid Y) \,\|\, p(Z_C)\big)\end{aligned} \tag{2}$$

where $Y$ is the input stereo audio, $Z_C$ is the spatial embedding, $q(Z_C \mid Y)$ is the approximate posterior distribution learned by the encoder, $p(Y \mid Z_C)$ is the likelihood of the input given the spatial embedding, and $p(Z_C)$ is the prior distribution of the spatial embedding. $\mathbb{KL}(q(Z_C \mid Y) \,\|\, p(Z_C))$ is the Kullback-Leibler divergence between the approximate posterior $q(Z_C \mid Y)$ and the prior $p(Z_C)$.
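When the approximate posterior is a diagonal Gaussian and the prior is the unit Gaussian, the KL term has the well-known closed form $\frac{1}{2}\sum_i (\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2)$. The sketch below assumes that parameterization; the diagonal-Gaussian posterior and the function name are assumptions, as the text does not spell out the form of $q$.

```python
import numpy as np


def kl_to_unit_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var)
```

The term vanishes exactly when the posterior already equals the prior and grows as the mean moves away from zero or the variance departs from one.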

For the reconstruction term, we use a combination of a multi-resolution Short-Time Fourier Transform (STFT) loss and a mean squared error ($L_2$) loss on the waveform. The multi-resolution STFT loss [23] is used to capture the time-frequency characteristics of the generated audio. It is computed by taking the STFT of the generated and target Ambisonic audio at multiple resolutions and comparing their magnitudes. The STFT loss is defined as:

$$\mathcal{L}_{STFT} = \sum_{r=1}^{R} \left\| |\text{STFT}_r(\hat{y})| - |\text{STFT}_r(y)| \right\|_1 \tag{3}$$

where $\hat{y}$ denotes the generated Ambisonic channels, $y$ denotes the target Ambisonic channels, $\text{STFT}_r$ denotes the STFT operation at resolution $r$, and $R$ is the total number of resolutions. We follow [23] and select $R = 3$, with FFT sizes of $[512, 1024, 2048]$, window sizes of $[240, 600, 1200]$, and hop sizes of $[50, 120, 240]$.
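A minimal NumPy sketch of Equation 3 for a single channel follows. It assumes a Hann window, no padding at the signal edges, and an unnormalized $L_1$ sum; those choices, along with the helper names, are assumptions, and practical implementations often average rather than sum.

```python
import numpy as np

# (FFT size, window size, hop size) for the three resolutions given in the text
RESOLUTIONS = [(512, 240, 50), (1024, 600, 120), (2048, 1200, 240)]


def stft_magnitude(x, fft_size, win_length, hop):
    """Magnitude STFT of a 1-D signal; frames are windowed, then zero-padded to fft_size."""
    window = np.hanning(win_length)                # assumed window type
    num_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_length] * window for i in range(num_frames)]
    )
    return np.abs(np.fft.rfft(frames, n=fft_size, axis=-1))


def multi_resolution_stft_loss(y_hat, y):
    """Sum over resolutions of the L1 distance between magnitude spectrograms."""
    loss = 0.0
    for fft_size, win_length, hop in RESOLUTIONS:
        diff = stft_magnitude(y_hat, fft_size, win_length, hop) - stft_magnitude(
            y, fft_size, win_length, hop
        )
        loss += np.abs(diff).sum()
    return loss
```

For multichannel Ambisonic output, the same function could be applied per channel and the results summed. The signal must be at least as long as the largest window (1200 samples).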

The $L_2$ loss on the waveform is used to ensure that the generated audio closely matches the target audio in the time domain. It is defined as:

$$\mathcal{L}_2 = \left\| \hat{y} - y \right\|_2^2 \tag{4}$$

Using a combination of time-domain and time-frequency-domain losses allows the model to capture both magnitude and phase information effectively without overfitting to low frequencies [24, 19]. Since phase information is crucial for Ambisonic audio, we scale the $L_2$ loss by a factor of 10 to emphasize its importance. The final loss function is defined as:

$$\mathcal{L} = \mathcal{L}_{STFT} + 10 \cdot \mathcal{L}_2 + \mathbb{KL}\big(q(Z_C \mid Y) \,\|\, p(Z_C)\big) \tag{5}$$

4 Experiments

4.1 Synthesizing first-order Ambisonic data

To train the Ambisonizer model, we require data pairs consisting of $Y = (Y_L, Y_R)$ and $Y' = (Y_W, Y_X, Y_Y)$. We synthesize $Y'$ directly using Ambisonic impulse response (IR) datasets. Mono sound sources are encoded with the first-order Ambisonic IR without the $Z$ channel, denoted as $(IR_W, IR_X, IR_Y)$, to obtain $Y'$. For a given sound source $S$ at azimuth $\theta$, we convolve it with $IR_W$, $IR_X$, and $IR_Y$ and calculate its contribution to $Y'$ as follows:

$$S_W = IR_W \otimes S \tag{6}$$
$$S_X = IR_X \otimes S \odot \cos(\theta)\sqrt{3} \tag{7}$$
$$S_Y = IR_Y \otimes S \odot \sin(\theta)\sqrt{3} \tag{8}$$

where $\otimes$ represents the convolution operation. To form $Y$, we treat $Y_L$ and $Y_R$ as two virtual sound sources positioned at azimuths of $\theta_L$ and $\theta_R$, respectively. The signal $Y$ at azimuth $\theta$ is decoded using the polygon decoding method [25]:

$$Y = Y_W + Y_X \cos(\theta) + Y_Y \sin(\theta) \tag{9}$$

To satisfy $\frac{1}{2}(Y_L + Y_R) = Y_W$, $\theta_L$ and $\theta_R$ need to be $\pi$ radians apart, since the cosine and sine terms of Equation 9 then cancel when the two decoded signals are summed. We enforce this during training by setting $\theta_L = \frac{1}{8}\pi$ and $\theta_R = \theta_L + \pi = \frac{9}{8}\pi$. By synthesizing the Ambisonic data in this manner, we can generate a dataset of input-output pairs $(Y, Y')$ suitable for training the Ambisonizer model.
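The synthesis and decoding steps of this section can be sketched as follows. The code mirrors Equations 6 to 9 and numerically checks the constraint $\frac{1}{2}(Y_L + Y_R) = Y_W$; the function names are illustrative, and summing the contributions of several sources into one B-format signal is an assumption about how the mix is formed.

```python
import numpy as np


def encode_source(source, ir_w, ir_x, ir_y, azimuth):
    """Equations 6-8: convolve a mono source with each IR channel, then apply azimuth gains."""
    s_w = np.convolve(ir_w, source)
    s_x = np.convolve(ir_x, source) * np.cos(azimuth) * np.sqrt(3)
    s_y = np.convolve(ir_y, source) * np.sin(azimuth) * np.sqrt(3)
    return np.stack([s_w, s_x, s_y])


def decode_at(wxy, theta):
    """Equation 9: decoded signal for a virtual position at azimuth theta."""
    y_w, y_x, y_y = wxy
    return y_w + y_x * np.cos(theta) + y_y * np.sin(theta)


def decode_stereo(wxy, theta_l=np.pi / 8):
    """Decode the stereo pair with theta_R = theta_L + pi, as used during training."""
    theta_r = theta_l + np.pi
    return decode_at(wxy, theta_l), decode_at(wxy, theta_r)
```

Because $\cos(\theta + \pi) = -\cos(\theta)$ and $\sin(\theta + \pi) = -\sin(\theta)$, the directional terms cancel in $Y_L + Y_R$, leaving $2 Y_W$ regardless of the source positions.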

4.2 Source Datasets

4.2.1 Ambisonic IR Datasets

To generate synthetic spatial harmonics, we utilize Ambisonic IR datasets from well-established sources, namely OpenAIR [26], Motus [27], and the C4DM RIR database [28]. These datasets have been widely used in the research community for spatial audio applications and provide a diverse range of acoustic environments, including concert halls, studios, outdoor spaces, and indoor spaces with varying furniture layouts. To introduce variability and prevent overfitting to specific IR lengths, we randomly truncate each IR set to a duration between 0.3 and 1 second. This range is chosen based on the typical reverberation times encountered in real-world environments [29]. By incorporating this randomization step, we ensure that our synthetic spatial harmonics are representative of a wide range of realistic acoustic conditions that may be present in real-world recordings.

The truncation process is performed using a uniform random distribution to select the IR length within the specified range. This approach guarantees an unbiased sampling of IR durations while maintaining the integrity of the spatial information. Additionally, we apply a fadeout window to the truncated IRs to prevent abrupt endings and ensure a smooth transition. The fadeout length is randomly selected between 0.05 and 0.3 seconds.
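The truncation and fadeout steps could look roughly like the sketch below. It assumes a multichannel IR stored as (channels, samples), the same cut point for all channels of an IR set, and a linear fade; the fade shape and the function name are assumptions, since the text only specifies the length ranges.

```python
import numpy as np


def truncate_ir(ir, sample_rate=44_100, rng=None):
    """Randomly truncate an IR set of shape (channels, samples) and fade out its tail."""
    rng = np.random.default_rng() if rng is None else rng
    length = int(rng.uniform(0.3, 1.0) * sample_rate)    # IR duration: 0.3 to 1 s
    length = min(length, ir.shape[-1])                   # never extend a shorter IR
    fade = int(rng.uniform(0.05, 0.3) * sample_rate)     # fadeout: 0.05 to 0.3 s
    fade = min(fade, length)
    out = ir[..., :length].copy()                        # same cut for every channel
    out[..., length - fade:] *= np.linspace(1.0, 0.0, fade)   # assumed linear fade
    return out
```

Applying one cut point and one fade to all channels keeps the relative gains between the $W$, $X$, and $Y$ responses unchanged, which is one way to preserve the spatial information.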

4.2.2 Sound Sources Datasets

For the sound sources, we employ the MUSDB18-HQ dataset [30], which consists of high-quality, multi-track recordings at 44.1 kHz of various musical genres. This dataset provides a diverse set of instruments and vocals, making it well-suited for generating artificial spatial mixes. To create these mixes, we randomly place each track in a virtual space by assigning azimuth values drawn from a uniform distribution between $-\pi$ and $\pi$ radians. For stereo tracks, we introduce a source width parameter, which is randomly selected from a range of 0 to $\pi$ radians. The left and right channels of the stereo track are placed symmetrically around the assigned azimuth, with the source width determining their angular separation.
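The placement rule can be sketched as below. It assumes that the left channel sits at the assigned azimuth plus half the source width and the right channel at minus half the width; that sign convention and the function name are assumptions.

```python
import numpy as np


def sample_placement(is_stereo, rng=None):
    """Return the azimuth(s) in radians for one track: [center] or [left, right]."""
    rng = np.random.default_rng() if rng is None else rng
    azimuth = rng.uniform(-np.pi, np.pi)       # track position, uniform over the circle
    if not is_stereo:
        return [azimuth]
    width = rng.uniform(0.0, np.pi)            # source width for stereo tracks
    return [azimuth + width / 2, azimuth - width / 2]   # symmetric around the azimuth
```

Each returned azimuth can then be used as $\theta$ when encoding the corresponding channel as a mono source.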

To further enhance the variability and realism of the generated mixes, we apply a set of common audio augmentations to each track using the audiomentations library (https://github.com/iver56/audiomentations). These augmentations include:

  • Gain adjustment: The gain of each track is randomly adjusted within a range of -10 dB to +10 dB with a probability of 70%.

  • Air absorption: A random air absorption effect is applied to simulate the frequency-dependent attenuation of sound over distance, with distances ranging from 0.1 to 10 meters and a probability of 70%.

  • Seven-band parametric equalizer: A seven-band parametric equalizer is applied to each track with random gain values between -12 dB and 12 dB for each band and a probability of 70%.

  • Gain transition: Smooth gain transitions in the range of -24 dB to 6 dB are introduced within each track with a duration between 0.2 and 6.0 seconds and a probability of 70% to simulate dynamic volume automation in the mix.

4.3 Experimental Setup

We synthesized a training dataset consisting of 40 hours of audio at 44.1 kHz using the aforementioned synthesis pipeline. A random 10% of this dataset was selected as a validation set for checkpoint selection during training. For both training and validation, we randomly cropped 120K samples (2.72 seconds) from each song, applied random gain, and fed the resulting data pairs into the model. The Ambisonizer model was trained for 800K steps with a batch size of 32, using the Adam optimizer with a cosine annealing schedule. The maximum and minimum learning rates were set to 5e-5 and 1e-7, respectively. We specified the latent representations $Z$ and $Z_C$ to have 64 dimensions.
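For reference, a cosine annealing schedule between the stated learning rates can be written as below. The sketch assumes a single cosine cycle over all 800K steps with no warmup or restarts, which the text does not specify.

```python
import math

MAX_LR = 5e-5
MIN_LR = 1e-7
TOTAL_STEPS = 800_000
CROP_SAMPLES = 120_000
SAMPLE_RATE = 44_100


def cosine_lr(step):
    """Learning rate at a given step under single-cycle cosine annealing (assumed form)."""
    progress = min(step, TOTAL_STEPS) / TOTAL_STEPS
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


crop_seconds = CROP_SAMPLES / SAMPLE_RATE   # duration of one training crop in seconds
```

Under this assumed form, the rate starts at 5e-5, passes through the midpoint of the two rates halfway through training, and ends at 1e-7.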

5 Results

After careful consideration, we have concluded that there is currently no established objective evaluation framework for our proposed Ambisonizer model. To the best of our knowledge, we were unable to find any existing Ambisonic upmixing baselines against which we could compare our model’s performance, and there are no effective methods to objectively assess the plausibility of generated Ambisonic recordings. Works in the Ambisonic field also tend not to include objective evaluations [31, 32, 33], other than in the context of discussing decoding errors [14].

One potential approach to evaluation could be to decode the generated Ambisonic audio to stereo and then apply the objective evaluation methods available in the stereo audio domain [34]. However, this approach is not feasible for two reasons. Firstly, our proposed Ambisonizer model aims to generate a realistic Ambisonic sound field, which lacks the ability to encode creative yet unrealistic audio sources that are commonly found in popular stereo and surround recordings [1, 35]. Secondly, introducing additional encoding and decoding stages that are not part of the Ambisonizer model itself would confound any metrics derived from the audio obtained after decoder rendering, making it difficult to isolate the model’s performance [36, 37].

Given the creative and subjective nature of our task, we believe that the most appropriate way to measure the model’s performance is through subjective testing [38, 39, 40]. To make the evaluation more accessible, we decode the Ambisonic recordings to stereo and compare our model’s output against a strong commercial mono-to-stereo baseline. This approach allows for a more direct comparison of the perceived audio quality and spatial attributes, while still providing insight into the effectiveness of our Ambisonizer model in generating realistic Ambisonic signals. We detail our approach and findings below.

5.1 Baseline: Waves PS-22

The demand for mono-to-stereo conversion is commercially significant. Introduced in 2012, the Waves PS-22 Stereo Maker (interface shown in Figure 3) has become a popular choice for this purpose. Many products rely on the Haas effect [41], which employs a short delay between the left and right signals to simulate a stereo field. However, this approach can lead to phase issues, resulting in cancellations between the left and right signals when they are downmixed to mono. In contrast, the PS-22 algorithm relies on a frequency distribution approach, assigning different frequencies to various stereo panning positions to achieve the stereo effect. This method avoids the phase cancellation issues associated with the Haas effect. As our Ambisonizer models require no human intervention, we employ the default settings of the Waves PS-22 in our experiments to ensure a consistent and reproducible approach to mono-to-stereo conversion.

Figure 3: Waves PS-22 Stereo Maker [42]

5.2 Subjective Evaluation

Figure 4: Subjective rating results. The ’All’ setting is calculated by aggregating all individual sets; error bars are calculated with a 95% confidence interval.

To form a comprehensive evaluation, we randomly selected 10-second audio samples from YouTube videos representing six genres: African (https://www.youtube.com/watch?v=-1OmnxGo9pg), Classical (https://www.youtube.com/watch?v=Sm4JaV6Xz0M), Electro (https://www.youtube.com/watch?v=-GHtdr81VnE), Hip-hop (https://www.youtube.com/watch?v=u_wwfo4bs1o), Jazz (https://www.youtube.com/watch?v=T75eWVt2OHI), and Rock (https://www.youtube.com/watch?v=syem-TmPTSo). While the MUSDB18 dataset prominently features the last four genres, the first two are not well-represented. We compared the performance of conditional and unconditional generation against the baseline, original stereo, and mono downmixed from stereo by collecting mean opinion scores (MOS) from participants. Each participant rated the five settings within each set, enabling direct comparison of the audio clips. To control for inherent fluctuations in subjective ratings, we followed the approach of [38] and filtered out all set ratings where the mono downmix received a higher score than the original stereo recordings. The loudness of the audio clips was normalized to -24 LUFS to avoid loudness bias [43]. For the Ambisonizer model settings, we conducted a grid search at 1-degree intervals for the left speaker position and 10-degree intervals for the distance between the left and right speakers. As in training, we used the regular polygon decoding method [25] for a given speaker position. To account for the centered bias, as most stereo mixes have balanced left and right signals, we selected the two decoding speaker positions with minimal root-mean-squared (RMS) difference. A total of 25 participants provided 6 sets of ratings each. After applying the filtering rules, 13 sets were discarded from the analysis.
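The speaker-position search described above might be sketched as follows. The code interprets "RMS difference" as the absolute difference between the RMS levels of the two decoded channels, and it searches left positions over 0 to 359 degrees and separations over 10 to 350 degrees; both the interpretation and the search ranges are assumptions, as are the function names.

```python
import numpy as np


def rms(x):
    """Root-mean-squared level of a signal."""
    return float(np.sqrt(np.mean(np.square(x))))


def decode_at(wxy, theta):
    """Regular polygon decoding of (W, X, Y) at azimuth theta, as in Equation 9."""
    y_w, y_x, y_y = wxy
    return y_w + y_x * np.cos(theta) + y_y * np.sin(theta)


def pick_speaker_angles(wxy):
    """Grid search returning (rms_gap, left_deg, separation_deg) with the smallest RMS gap."""
    best = None
    for left_deg in range(0, 360):                 # 1-degree steps for the left speaker
        for sep_deg in range(10, 360, 10):         # 10-degree steps for the separation
            theta_l = np.deg2rad(left_deg)
            theta_r = np.deg2rad(left_deg + sep_deg)
            gap = abs(rms(decode_at(wxy, theta_l)) - rms(decode_at(wxy, theta_r)))
            if best is None or gap < best[0]:
                best = (gap, left_deg, sep_deg)
    return best
```

A separation of 0 degrees is excluded from the assumed range because it would make both channels identical and trivially minimize the gap.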

The results are illustrated in Figure 4. We additionally conducted a statistical significance test on all subjective ratings using pairwise Wilcoxon signed-rank tests with $p = 0.05$. The results of the statistical significance test are shown in Table 1.
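The filtering rule and the pairwise tests described above can be sketched as follows. This is an illustrative sketch, not the analysis code behind Table 1: the dictionary layout and function names are hypothetical, and the test uses the normal approximation with a tie correction rather than the exact distribution. In practice a library routine such as `scipy.stats.wilcoxon` would usually be preferred.

```python
import math
from itertools import combinations

SETTINGS = ["mono", "baseline", "cond", "uncond", "stereo"]  # hypothetical keys

def filter_sets(rating_sets):
    """Discard sets where the mono downmix was rated above the original stereo."""
    return [s for s in rating_sets if s["mono"] <= s["stereo"]]

def wilcoxon_p(x, y):
    """Two-sided Wilcoxon signed-rank p-value (normal approximation)."""
    d = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(d)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average rank within a tie group
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))

def pairwise_significance(rating_sets, alpha=0.05):
    """Map each pair of settings to True if their ratings differ significantly."""
    kept = filter_sets(rating_sets)
    return {
        (a, b): wilcoxon_p([s[a] for s in kept], [s[b] for s in kept]) < alpha
        for a, b in combinations(SETTINGS, 2)
    }
```

Each rating set is assumed to be a dictionary holding one participant's scores for the five settings of one genre, so the test is paired within sets, matching the study design in which every participant rates all five settings of a set together.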

          Baseline  Cond.  Uncond.  Stereo
Mono      Y         Y      Y        Y
Baseline  -         N      N        Y
Cond.     -         -      N        Y
Uncond.   -         -      -        Y
Table 1: Pairwise statistical significance tests on all individual sets aggregated. N indicates no significant difference, while Y indicates a significant difference. Cond. and Uncond. stand for the conditional and unconditional settings.

Our analysis reveals that, on the aggregated set, the baseline and the conditional and unconditional Ambisonizer models are all rated significantly higher than the mono downmix. However, all three are rated significantly lower than the original stereo. No significant difference is observed among the baseline, conditional, and unconditional methods. Furthermore, we notice substantial fluctuations across individual sets: the baseline method performs strongly on the African and Hip-hop sets but underperforms the Ambisonizer models on the Classical, Electro, Jazz, and Rock sets.

6 Discussions

The results above demonstrate that the proposed Ambisonizer and its underlying paradigm are comparable to a strong commercial baseline. However, there remains a gap between the proposed approaches and the original stereo mixes. We hypothesize that this gap is due to three factors: 1) inherent limitations of the Ambisonic format, 2) decoding artifacts, and 3) the difficulty of the task itself. In this section, we present a detailed discussion of these aspects.

Inherent limitations of the Ambisonic format. As mentioned in Section 5, the Ambisonic format encodes audio in a realistic acoustic space and, therefore, cannot recreate more creative stereo and spatial effects. Upon listening to the produced renderings of the Ambisonizer model’s output, we found that its perceived stereo field is narrower compared to the baseline and original stereo mix. We believe this issue may be addressed by developing additional decoder models, which take a reference input as a style guide during the decoding process.

Decoding artifacts. As Ambisonic decoding research progresses, we expect the decoding results to continue improving, thereby reducing the impact of decoding artifacts on the overall performance.

Difficulty of the task itself. The subjective nature of the task makes modeling what constitutes a plausible upmixing result challenging. By upmixing to the Ambisonic format, we implicitly enforce that a plausible result should, at the very least, be acoustically coherent, meaning that all audio sources are in the same acoustic space. While some literature supports this notion [44], it may not hold true for specific genres, such as electronic dance music (EDM).

Furthermore, we acknowledge two key limitations of our work. First, we posit that the weaker performance on the African and Hip-hop sets is likely because these genres are percussion-heavy: using a convolution-based audio decoder effectively sets a window, which makes modeling percussive attacks more difficult, as observed in source separation tasks [19]. Second, the lack of objective evaluation methods limits the strength of our conclusions, given the volatile nature of subjective ratings. By presenting our work and findings, we hope to inspire research into channel-agnostic upmixing paradigms, which would drive the development of objective evaluation methods and benchmarks. Despite these limitations, we believe the strong performance of the Ambisonizer model positions the Ambisonic format as a valid and promising intermediate representation for channel-agnostic upmixing.

7 Conclusions

We introduce Ambisonizer, a novel paradigm for channel-agnostic neural upmixing using spherical harmonics. By leveraging the Ambisonic format, our model enables the generation of first-order Ambisonic audio from mono or stereo input, allowing for mono-to-any and stereo-to-any upmixing. Through subjective evaluations, we demonstrated that the Ambisonizer model’s output, when downmixed to stereo, is comparable to a strong commercial mono-to-stereo baseline. We also identified limitations in the Ambisonic format itself, decoding artifacts, and the inherent difficulty of the task. We believe that the strong performance of the Ambisonizer model positions the Ambisonic format as a promising intermediate representation for channel-agnostic upmixing.

References

  • [1] F. Rumsey, Spatial audio.   Routledge, 2012.
  • [2] I. d. V. Bosman, O. O. Buruk, K. Jorgensen, and J. Hamari, “The effect of audio on the experience in virtual reality: a scoping review,” Behaviour & Information Technology, vol. 43, no. 1, pp. 165–199, 2024.
  • [3] M. Lagrange, L. G. Martins, and G. Tzanetakis, “Semi-automatic mono to stereo up-mixing using sound source formation,” in Audio Engineering Society Convention 122.   Audio Engineering Society, 2007.
  • [4] D. Fitzgerald, “Upmixing from mono-a source separation approach,” in 2011 17th International Conference on Digital Signal Processing (DSP).   IEEE, 2011, pp. 1–7.
  • [5] H. Shim, J. S. Abel, and K.-M. Sung, “Stereo music source separation for 3-d upmixing,” in Audio Engineering Society Convention 127.   Audio Engineering Society, 2009.
  • [6] K. M. Ibrahim and M. Allam, “Primary-ambient source separation for upmixing to surround sound systems,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 431–435.
  • [7] L. Lin, Q. Kong, J. Jiang, and G. Xia, “A unified model for zero-shot music source separation, transcription and synthesis,” arXiv preprint arXiv:2108.03456, 2021.
  • [8] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “Zero-shot audio source separation through query-based learning from weakly-labeled data,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 4, 2022, pp. 4441–4449.
  • [9] I. Pereira, F. Araújo, F. Korzeniowski, and R. Vogl, “Moisesdb: A dataset for source separation beyond 4-stems,” arXiv preprint arXiv:2307.15913, 2023.
  • [10] E. Cano, D. FitzGerald, and K. Brandenburg, “Evaluation of quality of sound source separation algorithms: Human perception vs quantitative metrics,” in 2016 24th European Signal Processing Conference (EUSIPCO).   IEEE, 2016, pp. 1758–1762.
  • [11] J. Pons, S. Pascual, G. Cengarle, and J. Serrà, “Upsampling artifacts in neural audio synthesis,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 3005–3009.
  • [12] D. Arteaga, “Introduction to ambisonics,” Escola Superior Politècnica Universitat Pompeu Fabra: Barcelona, Spain, pp. 6–8, 2015.
  • [13] B. Peters, B. Lappalainen, and A. Fiori, “Auralising acoustic architecture,” in Proceedings of Computer Aided Architectural Design in Europe Conference (eCAADe), Digital Perception of Space-Cyber-Physical Systems (VR, AR)-Design strategies, vol. 38, no. 1, 2020.
  • [14] D. Murillo, F. Fazi, and M. Shin, Evaluation of ambisonics decoding methods with experimental measurements, 2014.
  • [15] L. McCormack and S. Delikaris-Manias, “Parametric first-order ambisonic decoding for headphones utilising the cross-pattern coherence algorithm,” in EAA Spatial Audio Signal Processing Symposium, 2019, pp. 173–178.
  • [16] J. Daniel and S. Moreau, “Further study of sound field coding with higher order ambisonics,” in Audio Engineering Society Convention 116.   Audio Engineering Society, 2004.
  • [17] Eigenbeam Datasheet, MH Acoustics, Summit, NJ, January 2023. [Online]. Available: https://mhacoustics.com/sites/default/files/Eigenbeam%20Datasheet_R01A.pdf
  • [18] M. Tagliasacchi, Y. Li, K. Misiunas, and D. Roblek, “Seanet: A multi-modal speech enhancement network,” arXiv preprint arXiv:2009.02095, 2020.
  • [19] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.
  • [20] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [22] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  • [23] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 6199–6203.
  • [24] A. Défossez, N. Usunier, L. Bottou, and F. Bach, “Music source separation in the waveform domain,” arXiv preprint arXiv:1911.13254, 2019.
  • [25] F. Zotter and M. Frank, “Ambisonic amplitude panning and decoding in higher orders,” Ambisonics: A Practical 3D Audio Theory for Recording, Studio Production, Sound Reinforcement, and Virtual Reality, pp. 53–98, 2019.
  • [26] D. T. Murphy and S. Shelley, “Openair: An interactive auralization web resource and database,” in Audio Engineering Society Convention 129.   Audio Engineering Society, 2010.
  • [27] G. Götz, S. J. Schlecht, and V. Pulkki, “A dataset of higher-order ambisonic room impulse responses and 3d models measured in a room with varying furniture,” in 2021 Immersive and 3D Audio: from Architecture to Automotive (I3DA).   IEEE, 2021, pp. 1–8.
  • [28] R. Stewart and M. Sandler, “Database of omnidirectional and b-format room impulse responses,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2010, pp. 165–168.
  • [29] H. Kuttruff, Room acoustics.   CRC Press, 2016.
  • [30] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, “Musdb18-a corpus for music separation,” 2017.
  • [31] S. Moreau, J. Daniel, and S. Bertet, “3d sound field recording with higher order ambisonics–objective measurements and validation of a 4th order spherical microphone,” in 120th Convention of the AES, 2006, pp. 20–23.
  • [32] S. Berge and N. Barrett, “A new method for b-format to binaural transcoding,” in 40th Int. Conf. of AES, Tokyo, Japan, 2010.
  • [33] N. Barrett, “The perception, evaluation and creative application of high order ambisonics in contemporary music practice,” Ircam Musical Research Residency report, 2012.
  • [34] C. J. Steinmetz and J. D. Reiss, “auraloss: Audio focused loss functions in pytorch,” in Digital music research network one-day workshop (DMRN+ 15), 2020.
  • [35] D. Gibson, The art of mixing: a visual guide to recording, engineering, and production.   Routledge, 2019.
  • [36] E. Bates, M. Gorzel, L. Ferguson, H. O’Dwyer, and F. M. Boland, “Comparing ambisonic microphones–part 1,” in Audio Engineering Society Conference: 2016 AES International Conference on Sound Field Control.   Audio Engineering Society, 2016.
  • [37] M. Narbutt, A. Allen, J. Skoglund, M. Chinen, and A. Hines, “Ambiqual-a full reference objective quality metric for ambisonic spatial audio,” in 2018 tenth international conference on quality of multimedia experience (QoMEX).   IEEE, 2018, pp. 1–6.
  • [38] J. Serrà, D. Scaini, S. Pascual, D. Arteaga, J. Pons, J. Breebaart, and G. Cengarle, “Mono-to-stereo through parametric stereo generation,” arXiv preprint arXiv:2306.14647, 2023.
  • [39] S. Bech and N. Zacharov, Perceptual audio evaluation-Theory, method and application.   John Wiley & Sons, 2007.
  • [40] M. Schoeffler, F.-R. Stöter, B. Edler, and J. Herre, “Towards the next generation of web-based experiments: A case study assessing basic audio quality following the itu-r recommendation bs. 1534 (mushra),” in 1st Web Audio Conference, 2015, pp. 1–6.
  • [41] T. Liu and D. Yan, “Identification of fake stereo audio,” arXiv preprint arXiv:2104.09832, 2021.
  • [42] Waves Audio, “Ps22 stereo maker plugin,” https://www.waves.com/plugins/ps22-stereo-maker, 2023.
  • [43] E. Vickers, “The loudness war: Background, speculation, and recommendations,” in Audio Engineering Society Convention 129.   Audio Engineering Society, 2010.
  • [44] R. Hepworth-Sawyer and J. Hodgson, Mixing music.   Taylor & Francis, 2016.