Ali Aliyev1

System Description for the Displace Speaker Diarization Challenge 2023

Abstract

This paper describes our solution for the Diarization of Speaker and Language in Conversational Environments Challenge (Displace 2023). We used a combination of VAD to find segments with speech, a Resnet-based CNN to extract features from these segments, and spectral clustering to cluster the features. Even though it was not trained on Hindi, the described algorithm achieves a DER of 27.1% and 27.4% on the development and phase-1 evaluation parts of the dataset, respectively.

Index Terms: speech recognition, speaker diarization, speaker verification

1 Introduction

Diarization is the process of separating speech belonging to different speakers. In diarization algorithms, we usually find segments with speech in the audio signals, then obtain a numerical representation (features) for each segment and then cluster the segments based on those features. We should also take into consideration that the error of each step directly affects the error of the next step.

The Diarization of Speaker and Language in Conversational Environments Challenge [1] addresses the problem of separating voices by speaker and by language. Unlike other diarization competitions, the same speakers speak two different languages, namely English and Hindi, which makes this challenge unique. In our solution, we did not use Hindi during training, but we still achieved good results. The challenge consists of two tracks:

  • Track-1: Speaker diarization in multilingual scenarios.

  • Track-2: Language diarization in multi-speaker settings.

In the following sections, we describe our solution for the speaker diarization track.

2 System description

Most speaker diarization algorithms consist of three parts:

  1. Voice activity detector

  2. Feature extractor

  3. Clustering algorithm

2.1 Voice activity detector

VAD is present in almost all diarization algorithms [2, 3], because segments with noise or other extraneous sounds can lead to an error in the following steps. The clustering algorithm, which independently selects the number of clusters (speakers) in the audio files, may make a false positive prediction and create a cluster for an extra speaker that is not actually there.

We selected the pre-trained Silero VAD v4 model (https://github.com/snakers4/silero-vad), which is one of the most accurate open-source solutions for the speech activity detection task. This model achieves a ROC-AUC score of 0.9 on the LibriParty dataset [4] and 0.99 on the AVA speech activity dataset [5]. Although v4 achieves better scores on these datasets than v3 (0.87 and 0.93, respectively), the switch from v3 to v4 did not significantly affect our metrics.

However, after our experiments, we were not completely satisfied with the results of Silero VAD on this task, so we decided to test another solution as well. We took WebRTC VAD (https://github.com/wiseman/py-webrtcvad), which achieves 0.81 on the LibriParty dataset and 0.66 on the AVA speech activity dataset.

2.2 Feature extractor for speaker recognition

Our feature extractor was originally trained for speaker verification, but it is just as well suited for the speaker diarization task.

2.2.1 Training data

Like most pipelines for speaker verification, our neural network uses the VoxCeleb2 [6] dataset, which contains 1,092,009 utterances and 5,994 speakers, as its main training dataset. However, since the main goal of this competition is to apply speaker diarization systems in a multilingual environment, the feature extractor should also be trained on two languages, and we did not have enough data in Hindi. To solve this problem, we took advantage of a property of CNNs trained for speaker verification: training on two languages improves metrics on most other languages. This ability to adapt to other languages is also supported by other work [7]. We therefore took the Russian part of Common Voice Corpus 12.0 [8] and combined it with VoxCeleb2.

In the end, we got a dataset with 2600 hours of speech in English and 229 hours of speech in Russian.

2.2.2 Model architecture and training

We used Resnet [9] models as the base architecture, namely Resnet-34 and Resnet-293. As input, we used fixed 2 s segments randomly cut from utterances in our dataset, from which we extracted 80-dimensional mel filter bank features with a window length of 25 ms and a stride of 10 ms. For data augmentation, we used the Music, Speech, and Noise Corpus (MUSAN) [10] to add noise, music, and other extraneous sounds, and reverberation from the Room Impulse Response and Noise Database (RIR) [11]. AAM-Softmax loss [12] was used to train the model. It took 18 hours to train Resnet-34 and 97 hours to train Resnet-293 on 8 NVIDIA Tesla A100 40 GB GPUs. Each model was trained for 150 epochs.
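To make the training objective concrete, the following is a minimal sketch of the AAM-Softmax (ArcFace) loss for a single sample. The margin and scale values are illustrative assumptions, since the text does not state the values we used, and the function name is hypothetical.

```python
import math


def aam_softmax_loss(cosines, target, margin=0.2, scale=30.0):
    """AAM-Softmax loss for one sample.

    cosines: cosine similarity between the embedding and each class weight.
    target:  index of the true speaker class.
    margin, scale: illustrative values (assumptions, not from the paper).
    """
    # The additive angular margin is applied only to the true class angle.
    theta = math.acos(max(-1.0, min(1.0, cosines[target])))
    logits = [scale * c for c in cosines]
    logits[target] = scale * math.cos(theta + margin)
    # Numerically stable softmax cross-entropy over the modified logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


cosines = [0.8, 0.1, -0.3]
plain = aam_softmax_loss(cosines, target=0, margin=0.0)
with_margin = aam_softmax_loss(cosines, target=0, margin=0.2)
```

With a zero margin the function reduces to ordinary softmax cross-entropy on scaled cosines; a positive margin raises the loss for the same embedding, which pushes the network towards more compact speaker clusters.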

2.3 Overlapped speech detection

A significant problem for all speaker diarization systems is speech segments that contain the voices of two or more speakers. In such segments, the feature extractor produces unreliable embeddings. Additional classifiers are usually used to detect such segments [13, 14], but we took a different approach: we divided every segment returned by the VAD into subsegments, extracted features from each subsegment, and only then clustered them. To obtain the subsegments, we used a sliding window with a length of 2 s and a stride of 0.4 s; these parameters gave us the best results.
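The subsegmentation step can be sketched as follows. This is a minimal illustration with the 2 s window and 0.4 s stride; the function name is hypothetical, and the handling of short segments and of the final, shorter window is an assumption rather than a description of our exact implementation.

```python
def sliding_subsegments(start, end, window=2.0, stride=0.4):
    """Split a VAD segment [start, end] (in seconds) into overlapping windows."""
    # Assumption: segments shorter than one window are kept as they are.
    if end - start <= window:
        return [(start, end)]
    subsegments = []
    index = 0
    while True:
        left = start + index * stride
        right = left + window
        if right >= end:
            # Assumption: the last window is clamped to the segment end.
            subsegments.append((round(left, 3), round(end, 3)))
            break
        subsegments.append((round(left, 3), round(right, 3)))
        index += 1
    return subsegments


subsegments = sliding_subsegments(10.0, 13.0)
```

For a 3 s segment this yields four overlapping subsegments, each of which would be passed to the feature extractor separately.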

2.4 Clustering

We used spectral clustering [15] as the clustering algorithm because it can work under more difficult conditions than other clustering methods, such as k-means. It works as follows:

  1. Compute a similarity matrix for the embeddings produced by the feature extractor. We use cosine similarity as the similarity measure.

  2. Compute the Laplacian matrix from the similarity matrix.

  3. Solve a standard eigenvalue problem for a real symmetric matrix to obtain the eigenvectors and eigenvalues of the Laplacian matrix.

  4. Determine k (the number of clusters) with a heuristic method [16, pp. 410–411] based on the eigenvalues.

  5. Apply k-means clustering to the first k eigenvectors from the previous steps.
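The five steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not our exact implementation: negative similarities are clipped to zero, the unnormalized Laplacian is used, k is chosen by the largest gap between consecutive eigenvalues, and k-means uses a simple farthest-point initialization. The function name and the `max_speakers` parameter are hypothetical.

```python
import numpy as np


def spectral_cluster(embeddings, max_speakers=4, n_iter=20):
    # Step 1: cosine similarity matrix (negative values clipped: an assumption).
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(unit @ unit.T, 0.0, None)
    np.fill_diagonal(affinity, 0.0)
    # Step 2: unnormalized graph Laplacian L = D - A (an assumption).
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    # Step 3: eigen-decomposition of a real symmetric matrix (ascending order).
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    # Step 4: eigengap heuristic, k is placed at the largest eigenvalue gap.
    limit = min(max_speakers, len(eigenvalues) - 1)
    gaps = np.diff(eigenvalues[: limit + 1])
    k = int(np.argmax(gaps)) + 1
    # Step 5: k-means on the rows of the first k eigenvectors.
    points = eigenvectors[:, :k]
    centers = [points[0]]
    for _ in range(1, k):  # farthest-point initialization
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(n_iter):
        distances = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = np.argmin(distances, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return k, labels


# Toy embeddings: two well-separated "speakers" with three subsegments each.
toy = np.array([
    [1.00, 0.05, 0.00], [1.00, 0.00, 0.05], [0.95, 0.02, 0.00],
    [0.00, 1.00, 0.05], [0.05, 1.00, 0.00], [0.02, 0.95, 0.00],
])
k, labels = spectral_cluster(toy)
```

On this toy input the largest eigenvalue gap appears after the second eigenvalue, so two clusters are selected and each group of three embeddings receives its own label.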

3 Challenge dataset

The dataset for this contest consists of 3 parts:

  • Development dataset with ground truth labels. This part contains 27 audio files in wav format and annotations in rttm format. The total duration of the recordings is 15 hours and 45 minutes; most of the files are 30 minutes long, and some are about an hour long. Almost all files are single-channel, but one file, M043.wav, was in stereo, so we converted it to a mono audio file. The maximum number of speakers found in the ground truth files was 4.

  • Phase 1 evaluation dataset. This part contains 20 audio files in wav format, without annotations. The total duration of the files is 11 hours and 24 minutes. Most of the files are also 30 minutes long, and some are about an hour long. As in the development part, one file (M053.wav) was recorded in stereo.

  • Phase 2 evaluation dataset. At the time of writing this article, it was not yet available to participants.

Since this contest contains two tracks, each track has its own annotations, but both tracks are using the same audio files.
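The stereo-to-mono conversion mentioned for the two stereo files can be illustrated with a few lines of Python. This is a minimal sketch that assumes interleaved integer samples and averages the two channels; the text does not specify how the channels were actually combined, and the function name is hypothetical.

```python
def stereo_to_mono(samples):
    """Average interleaved stereo samples [L0, R0, L1, R1, ...] into mono."""
    return [
        (samples[i] + samples[i + 1]) // 2
        for i in range(0, len(samples) - 1, 2)
    ]


mono = stereo_to_mono([100, 300, -200, 0])
```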

4 Experimental results

In this section, we will present the results of our tests with different parts of our algorithm.

The organizers use the diarization error rate (DER) as the error metric. This error rate is the sum of the following values:

  • Speaker error (SE) - percentage of scored time for which the wrong speaker ID is assigned to a speech segment.

  • False alarm speech (FA) - percentage of scored time where a non-speech segment was incorrectly marked as containing speech.

  • Missed speech (MS) - percentage of scored time where a segment with speech was incorrectly marked as a non-speech segment.

Based on the above, we can draw the following conclusions. The speaker error is directly affected by the feature extractor and the clustering algorithm, while false alarm speech and missed speech depend on the quality of the VAD. The closer the DER is to 0, the better. DER may also exceed 100%, since it is the sum of several errors.
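As a small worked example of this definition, the sketch below computes DER from error durations and shows that it can exceed 100%. The function name and the durations of the noisy recording are hypothetical; the component values 17.3, 5.4 and 4.1 are taken from our development set results.

```python
def der_percent(missed, false_alarm, speaker_error, scored_speech):
    """DER in percent from error durations and total scored speech (seconds)."""
    return 100.0 * (missed + false_alarm + speaker_error) / scored_speech


# Components that are already percentages of scored time simply add up.
der_from_components = round(17.3 + 5.4 + 4.1, 1)

# Hypothetical recording with 60 s of reference speech, 20 s of which is
# missed, plus 50 s of non-speech wrongly marked as speech.
noisy_der = der_percent(missed=20.0, false_alarm=50.0, speaker_error=0.0,
                        scored_speech=60.0)
```

In the second case the false alarm time alone is close to the amount of reference speech, so the resulting DER is above 100%.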

Note that the organizers of the competition use an implementation called dscore (https://github.com/nryant/dscore) to calculate the DER metric, so the numbers may differ slightly depending on the version of the implementation used.

4.1 Voice activity detector

The tables below show all three types of error from which the DER is formed. However, to evaluate the quality of VAD performance, we primarily need to look at false alarm speech (FA) and missed speech (MS). All VAD experiments were performed under identical conditions, with a window length of 2 s and a stride of 0.4 s for the feature extractor (Resnet-34) and with spectral clustering.

4.1.1 Silero VAD

As Table 1 shows, Silero VAD 4.0 performs slightly better than the previous version on this competition dataset. According to the authors of Silero VAD, the fourth version of the model is 3.4% better on the AVA speech activity dataset and 6.1% better on the LibriParty dataset. However, the gain on the displace2023_dev dataset was only 0.6%.

Table 1: The results of different versions of Silero VAD, with the same parameters on the displace2023_dev dataset.
VAD version      MS (%)  FA (%)  SE (%)  DER (%)
Silero-VAD 3.1   17.4    5.4     4.1     26.9
Silero-VAD 4.0   17.3    5.4     4.1     26.8

Table 2 shows that we had to lower the threshold considerably in order to improve our metrics. This is because Silero VAD is trained on a multilingual dataset that does not include Hindi, and since the speakers in this dataset speak Hindi in addition to English, the accuracy of the VAD degrades. Lowering the threshold, however, led to many false-positive and false-negative segments. In some examples there were so many of them that the algorithm returned only one segment, with timestamps at the beginning and end of the audio file. Apparently, the VAD contains some kind of algorithm that merges overlapping segments.

Table 2: Silero VAD 4.0 results with different thresholds on the displace2023_dev dataset.
Threshold  MS (%)  FA (%)  SE (%)  DER (%)
0.15       17.3    5.4     4.1     26.8
0.25       25.4    2.8     3.0     31.2
0.50       30.5    2.1     2.6     35.2
0.75       35.4    1.7     2.4     39.5

4.1.2 WebRTC VAD

WebRTC VAD uses an aggressiveness setting instead of a threshold. This parameter controls how strictly non-speech segments are filtered out. There are 4 levels of aggressiveness, from 0 to 3, where 0 is the least sensitive and 3 is the most sensitive. Table 3 shows that the more aggressive the WebRTC VAD is, the more speech is missed in the dataset.

WebRTC VAD accepts audio frames with durations of 10, 20 and 30 ms. As Table 4 shows, the best result is obtained if we divide the original audio signal into 20 ms frames. These tests were made using aggressiveness level 0.
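The framing described above can be sketched as follows: the signal is cut into fixed-length frames, each frame receives a speech or non-speech decision, and consecutive speech frames are merged into segments. This is an illustration that assumes 16-bit mono PCM at 16 kHz; the helper names are hypothetical, and a toy decision function stands in for the VAD so that the sketch runs without the library. With py-webrtcvad, `webrtcvad.Vad(0).is_speech` could be passed in its place.

```python
def split_frames(pcm, sample_rate=16000, frame_ms=20):
    """Split 16-bit mono PCM bytes into fixed-length frames (tail is dropped)."""
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    last_start = len(pcm) - frame_bytes
    return [pcm[i:i + frame_bytes] for i in range(0, last_start + 1, frame_bytes)]


def speech_segments(pcm, is_speech, sample_rate=16000, frame_ms=20):
    """Merge consecutive speech frames into (start, end) segments in seconds."""
    segments, start = [], None
    frames = split_frames(pcm, sample_rate, frame_ms)
    for index, frame in enumerate(frames):
        if is_speech(frame, sample_rate):
            if start is None:
                start = index  # a new speech segment begins here
        elif start is not None:
            segments.append((start * frame_ms / 1000, index * frame_ms / 1000))
            start = None
    if start is not None:  # speech runs until the end of the signal
        segments.append((start * frame_ms / 1000, len(frames) * frame_ms / 1000))
    return segments


def toy_is_speech(frame, sample_rate):
    """Stand-in for a real VAD decision: any non-zero byte counts as speech."""
    return any(frame)


silence = bytes(640)       # one 20 ms frame of zeros at 16 kHz
tone = b"\x10\x00" * 320   # one 20 ms frame of non-zero samples
segments = speech_segments(silence + tone * 3 + silence, toy_is_speech)
```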

Table 3: WebRTC VAD results with different levels of aggressiveness on the displace2023_dev dataset.
Aggressiveness  MS (%)  FA (%)  SE (%)  DER (%)
0               19.2    4.5     3.7     27.4
1               20.0    4.3     3.6     27.9
2               21.6    4.0     3.4     29.0
3               28.7    2.8     3.1     34.6

Table 4: WebRTC VAD results with different lengths of input segments on the displace2023_dev dataset.
Duration (ms)  MS (%)  FA (%)  SE (%)  DER (%)
10             19.3    4.5     3.7     27.5
20             19.1    4.5     3.7     27.3
30             19.2    4.5     3.7     27.4

On the development part of the dataset, the DER is 27.3% with WebRTC VAD and 26.8% with Silero VAD. One might expect the same ranking on the evaluation part of the dataset, but it turned out to be the opposite: with our final algorithm, the DER was 28.2% with Silero VAD and dropped to 27.4% with WebRTC VAD. These results were achieved using Resnet-293 and spectral clustering. Unfortunately, more detailed metrics are not available, since we do not have ground truth files for the evaluation part of the dataset and the first phase of the competition was already closed.

4.2 Feature extractor

We used Silero VAD and spectral clustering in all the following experiments. Table 5 shows that bilingual training does increase the accuracy of our neural network on other languages. Even though we trained on a combination of English and Russian datasets, it still showed some gain. The largest gain would most likely come from training on English and Hindi.

Table 5: Comparison of Resnet-34 trained on VoxCeleb2 alone and on VoxCeleb2 + Common Voice Russian corpus on the displace2023_dev dataset.
Dataset           MS (%)  FA (%)  SE (%)  DER (%)
VoxCeleb2         17.3    5.4     4.2     26.9
Combined dataset  17.3    5.4     4.1     26.8

One of the main datasets for diarization is VoxConverse [17], on which we achieved our best result (DER 7.2%) using a sliding window length of 1.5 s with a stride of 0.75 s. In this task, however, the best results were obtained with 2 s segments and 0.4 s steps, as Table 6 shows. Perhaps the best window length depends on the linguistic features of each language and on the speaking rate.

Table 6: Comparison of sliding window parameters for feeding data to the Resnet-34 input on the displace2023_dev dataset.
Parameters               MS (%)  FA (%)  SE (%)  DER (%)
size=1.5 s, step=0.40 s  17.3    5.4     4.8     27.5
size=1.5 s, step=0.50 s  17.3    5.4     4.9     27.6
size=1.5 s, step=0.75 s  17.3    5.4     5.1     27.8
size=2.0 s, step=0.40 s  17.3    5.4     4.1     26.8
size=2.0 s, step=0.50 s  17.3    5.4     4.2     26.9
size=2.0 s, step=0.75 s  17.3    5.4     4.3     27.0

Initially we used Resnet-34 in our solution, but then we increased the size of our model to Resnet-293 in order to get the most out of the feature extractor part of our algorithm. However, as Table 7 shows, we could not achieve a substantial improvement in the metrics. This is most likely because the accuracy of our VAD is insufficient for this dataset. The VAD we chose was trained on other languages, so when we used it with Hindi, the results were worse than with the languages used during training. Since we extract features from the segments returned by the VAD, this also affects the results of the feature extractor. Switching to a better VAD could therefore also improve the results of the feature extractor.

Table 7: Comparison of Resnet-34 and Resnet-293 trained on VoxCeleb2 + Common Voice Russian corpus on the displace2023_dev dataset.
Model       MS (%)  FA (%)  SE (%)  DER (%)
Resnet-34   17.3    5.4     4.1     26.8
Resnet-293  17.3    5.4     3.8     26.5

4.3 Clustering

As we can see in Table 8, spectral clustering works better than agglomerative hierarchical clustering (AHC) [18]. For AHC, we used cosine similarity with a threshold value equal to 0.5 to calculate the distance between samples, group average linkage as a linkage function, and silhouette score to determine the optimal number of clusters.

Table 8: Comparison of spectral clustering with agglomerative hierarchical clustering on the displace2023_dev dataset. Resnet-34 and Silero VAD were used.
Method  MS (%)  FA (%)  SE (%)  DER (%)
SC      17.3    5.4     4.1     26.8
AHC     17.3    5.4     16.2    38.9

4.4 Final submission

In our final submission, we used a combination of WebRTC VAD with aggressiveness 0 and a frame length of 20 ms, Resnet-293 with a sliding window length of 2 s and a stride of 0.4 s, and spectral clustering. Table 9 shows the results on the Displace 2023 development and phase-1 evaluation datasets.

Table 9: Final submission results for development and evaluation phase 1 datasets
Dataset              MS (%)  FA (%)  SE (%)  DER (%)
displace2023_dev     19.1    4.5     3.5     27.1
displace2023_eval_1  n/a     n/a     n/a     27.4

5 Conclusions

In this paper, we have described our approach to speaker diarization. Our proposed method, which combines WebRTC VAD, Resnet-293 and spectral clustering, achieves good results, but the VAD part of the algorithm needs further improvement. Our final submission achieved a DER of 27.1% and 27.4% on the development and phase-1 evaluation parts of the dataset, respectively.

References

  • [1] S. Baghel, S. Ramoji, Sidharth, R. H, P. Singh, S. Jain, P. R. Chowdhuri, K. Kulkarni, S. Padhi, D. Vijayasenan, and S. Ganapathy, “Displace challenge: Diarization of speaker and language in conversational environments,” 2023. [Online]. Available: https://arxiv.org/abs/2303.00830
  • [2] T. J. Park, N. R. Koluguri, F. Jia, J. Balam, and B. Ginsburg, “NeMo Open Source Speaker Diarization System,” in Proc. Interspeech 2022, 2022, pp. 853–854.
  • [3] Y. Dissen, F. Kreuk, and J. Keshet, “Self-supervised Speaker Diarization,” in Proc. Interspeech 2022, 2022, pp. 4013–4017.
  • [4] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, and Y. Bengio, “SpeechBrain: A general-purpose speech toolkit,” 2021, arXiv:2106.04624.
  • [5] S. Chaudhuri, J. Roth, D. P. W. Ellis, A. Gallagher, L. Kaver, R. Marvin, C. Pantofaru, N. Reale, L. Guarino Reid, K. Wilson, and Z. Xi, “AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies,” in Proc. Interspeech 2018, 2018, pp. 1239–1243.
  • [6] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep Speaker Recognition,” in Proc. Interspeech 2018, 2018, pp. 1086–1090.
  • [7] F. Tadele, J. Wei, K. Honda, R. Zhang, and W. Yang, “Effect of language mixture on speaker verification: An investigation with amharic, english, and mandarin chinese,” in Artificial Intelligence and Security, X. Sun, X. Zhang, Z. Xia, and E. Bertino, Eds.   Cham: Springer International Publishing, 2022, pp. 243–256.
  • [8] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020, pp. 4211–4215.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [10] D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,” 2015. [Online]. Available: https://arxiv.org/abs/1510.08484
  • [11] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2017, pp. 5220–5224.
  • [12] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4685–4694.
  • [13] L. Bullock, H. Bredin, and L. P. Garcia-Perera, “Overlap-aware diarization: Resegmentation using neural end-to-end overlapped speech detection,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7114–7118.
  • [14] K. Boakye, B. Trueba-Hornero, O. Vinyals, and G. Friedland, “Overlapped speech detection for improved speaker diarization in multiparty meetings,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 4353–4356.
  • [15] A. Ng, M. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” Advances in neural information processing systems, vol. 14, 2001.
  • [16] U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, Dec. 2007. [Online]. Available: https://doi.org/10.1007/s11222-007-9033-z
  • [17] J. S. Chung, J. Huh, A. Nagrani, T. Afouras, and A. Zisserman, “Spot the conversation: speaker diarisation in the wild,” in INTERSPEECH, 2020.
  • [18] F. Nielsen, Introduction to HPC with MPI for Data Science, ser. Undergraduate Topics in Computer Science.   Springer, 2016. [Online]. Available: https://doi.org/10.1007/978-3-319-21903-5