Showing 1–2 of 2 results for author: Hogg, A O T

Search v0.5.6 released 2020-02-24

arXiv:2312.16763 [pdf, other]

eess.AS cs.SD

Uncertainty Quantification in Machine Learning for Joint Speaker Diarization and Identification

Authors: Simon W. McKnight, Aidan O. T. Hogg, Vincent W. Neo, Patrick A. Naylor

Abstract: This paper studies modulation spectrum features ($Φ$) and mel-frequency cepstral coefficients ($Ψ$) in joint speaker diarization and identification (JSID). JSID is important because speaker diarization on its own, which only distinguishes speakers, is insufficient for many applications; it is often necessary to identify speakers as well. Machine learning models are set up using convolutional neural networks (CNNs) on $Φ$ and long short-term memory recurrent neural networks (LSTMs) on $Ψ$, then concatenating into fully connected layers. Experiment 1 shows models on both $Φ$ and $Ψ$ have better diarization error rates (DERs) than models on either alone; a CNN on $Φ$ has DER 29.09%, compared to 27.78% for an LSTM on $Ψ$ and 19.44% for a model on both. Experiment 1 also investigates aleatoric uncertainties and shows the model on both $Φ$ and $Ψ$ has mean entropy 0.927 bits (out of 4 bits) for correct predictions compared to 1.896 bits for incorrect predictions which, along with entropy histogram shapes, shows the model helpfully indicates where it is uncertain. Experiment 2 investigates epistemic uncertainties as well as aleatoric using Monte Carlo dropout (MCD). It compares models on both $Φ$ and $Ψ$ with models trained on x-vectors ($X$), before applying Kalman filter smoothing on epistemic uncertainties for resegmentation and model ensembles. While the two models on $X$ (DERs 10.23% and 9.74%) outperform those on $Φ$ and $Ψ$ (DER 17.85%) after their individual Kalman filter smoothing, combining them using a Kalman filter smoothing method improves the DER to 9.29%. Aleatoric uncertainties are higher for incorrect predictions. Both experiments show models on $Φ$ do not distinguish overlapping speakers as well as anticipated. However, Experiment 2 shows model ensembles do better with overlapping speakers than individual models do.

Submitted 30 December, 2023; v1 submitted 27 December, 2023; originally announced December 2023.

Comments: 12 pages, 7 figures

arXiv:2306.05812 [pdf, other]

eess.AS cs.CV cs.HC cs.LG cs.SD eess.SP

HRTF upsampling with a generative adversarial network using a gnomonic equiangular projection

Authors: Aidan O. T. Hogg, Mads Jenkins, He Liu, Isaac Squires, Samuel J. Cooper, Lorenzo Picinali

Abstract: An individualised head-related transfer function (HRTF) is very important for creating realistic virtual reality (VR) and augmented reality (AR) environments. However, acoustically measuring high-quality HRTFs requires expensive equipment and an acoustic lab setting. To overcome these limitations and to make this measurement more efficient, HRTF upsampling has been exploited in the past, where a high-resolution HRTF is created from a low-resolution one. This paper demonstrates how generative adversarial networks (GANs) can be applied to HRTF upsampling. We propose a novel approach that transforms the HRTF data for direct use with a convolutional super-resolution generative adversarial network (SRGAN). This new approach is benchmarked against three baselines: barycentric upsampling, spherical harmonic (SH) upsampling and an HRTF selection approach. Experimental results show that the proposed method outperforms all three baselines in terms of log-spectral distortion (LSD) and localisation performance using perceptual models when the input HRTF is sparse (less than 20 measured positions).

Submitted 27 February, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

Comments: 15 pages, 9 figures, Preprint (Accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing on 15 Feb 2024)
