Speech Representation Analysis
based on Inter- and Intra-Model Similarities

Abstract

Self-supervised models have revolutionized speech processing, achieving new levels of performance in a wide variety of tasks with limited resources. However, the inner workings of these models are still opaque. In this paper, we analyze the encoded contextual representations of these foundation models based on their inter- and intra-model similarity, independent of any external annotation and task-specific constraint. We examine SSL models that differ in training paradigm, Contrastive (Wav2Vec2.0) and Predictive (HuBERT), and in model size (base and large). We explore these models at different levels of localization/distributivity of information, including (i) individual neurons; (ii) layer representations; and (iii) attention weights; and (iv) we compare the representations with those of their finetuned counterparts. Our results highlight that these models converge to similar representation subspaces but not to similar neuron-localized concepts (a concept represents a coherent fragment of knowledge, such as “a class containing certain objects as elements, where the objects have certain properties” [1]). To facilitate further research, we publicly release our code at https://github.com/QCRIVoice/XSSL_speech .

Index Terms— Self-Supervised Learning, Speech Models, Inter- and Intra- Similarities

1 Introduction

Self-supervised speech models like Wav2Vec2 [2] and HuBERT [3] have shown remarkable advancements in a variety of speech processing tasks, including speech recognition, emotion recognition, speaker verification, and language identification [4, 5, 6], among others. This significant advancement over supervised state-of-the-art methods, together with the opaqueness of these models, has sparked interest in understanding and exploring their internal mechanisms.

Several studies have aimed to understand the information these models capture about different properties such as speaker characteristics [7, 8, 9, 10], paralinguistic aspects [11, 12], articulatory features [13], acoustic-linguistic elements [14], as well as accent features [15] among others. Moreover, studies like [14, 16] have also shown how better model understanding can lead to efficient fine-tuning strategies for downstream tasks.

A widely used interpretation technique is to train supervised classifiers, also known as probing classifiers [17, 8, 18], on the learned representations of given models to predict various task properties. This methodology has been applied in various studies and has shown that representations from different models capture distinct properties. Additionally, similarity-based methods are used to find associations at the frame, phoneme, and word levels. These methods utilize metrics such as projected-weighted canonical correlation analysis ($pwcca$) [19] and mutual information, without training classifiers. However, the effectiveness of this approach is limited by the need for annotated data: it requires precise boundary alignment, accurate phoneme transcription, and consistent word alignment to ensure valid and reliable analysis and results.

In this study, we introduce inter- and intra-model similarity measures to understand contextual representations within speech models. Instead of focusing on a specific category or property of information, we focus on exploring both inter- and intra-similarity across a spectrum of models. We investigate localization/distributivity properties in these models, that is, whether every single neuron encodes a single concept or all concepts are spread across multiple neurons [20]. We adopt a set of $5$ distinct similarity measures to explore the localization/distributivity behavior of the SSL models at the level of individual neurons, layers, and attention mechanisms. This comprehensive range of metrics allows us to capture the nuances in the patterns embedded within the models, offering a granular view of their structural and functional dynamics.

Our in-depth analysis reaffirms prior discoveries without the need for external data or defined tasks. Moreover, our findings also reveal noteworthy insights: (i) Speech SSL model neurons exhibit higher intra-model similarity than inter-model similarity. (ii) Information encapsulated by neurons from one layer can be represented as a linear combination of other layers. Models have similar representation subspaces but different localized neuron concepts. (iii) Lower and adjacent layers demonstrate a high degree of similarity across diverse models. (iv) The training objective has a greater impact on representation similarity than the size of the model architecture. (v) Finally, we show how the similarity analysis can motivate efficient finetuning for ASR, where freezing the bottom layers of models still maintains comparable performance to finetuning the full network while reducing the finetuning time.

Fig. 1: Comparison of similarity heatmaps between HuBERT and Wav2Vec2.0 models: $\text{neu-neu}_{\text{sm}}$, $\text{neu-lay}_{\text{sm}}$, and $pwcca$ similarities. Model boundaries are highlighted in yellow, and noteworthy similarities are encircled in green.

2 Methodology

We analyzed $M$ pretrained speech SSL models for both localized and distributed information using various widely accepted similarity metrics [21], capturing different notions of similarity at the level of individual neurons, layers, and attention. We propose to remove any dependency on external labels or boundary annotation by using only frame-level representations for the study.

For each model $m$ ($m\in M$), we extracted frame-level representations $h^{(m)}_{l}\in\mathbb{R}^{d_{m}}$, where $d_{m}\in\{768,1024\}$ is the number of neurons, and attention weights $\alpha^{(m)}_{l}$ at each layer $l$. We then use the extracted neuron/layer-level representations and attention weights to find inter- and intra-similarities at various levels of granularity.

2.1 Neuron-level Similarity

We adopted two different similarity measures: (i) neuron-neuron similarity ( $\text{neu-neu}_{\text{sm}}$ ) – similarity between pairs of individual neurons, and (ii) neuron-layer similarity ( $\text{neu-lay}_{\text{sm}}$ ) – similarity between a neuron in one model with a layer in another.

For a given neuron $h^{(m)}_{l}[k]$ in layer $l$ of a model $m$, the per-neuron similarity $\widetilde{\text{neu-neu}_{\text{sm}}}$ is defined as the maximum correlation between $h^{(m)}_{l}[k]$ and any neuron $h^{(m^{\prime})}_{l^{\prime}}[k^{\prime}]$ in layer $l^{\prime}$ of another model $m^{\prime}$:

$$\widetilde{\text{neu-neu}_{\text{sm}}}(h^{(m)}_{l}[k],h^{(m^{\prime})}_{l^{\prime}})=\max_{k^{\prime}}\rho(h^{(m^{\prime})}_{l^{\prime}}[k^{\prime}],h^{(m)}_{l}[k]) \tag{1}$$

Then, we average over all neurons in layer $l$ of the model $m$ ,

$$\text{neu-neu}_{\text{sm}}(h^{(m)}_{l},h^{(m^{\prime})}_{l^{\prime}})=\frac{1}{d_{m}}\sum_{k}\widetilde{\text{neu-neu}_{\text{sm}}}(h^{(m)}_{l}[k],h^{(m^{\prime})}_{l^{\prime}}), \tag{2}$$

where $\rho$ is the Pearson correlation.

$\text{neu-neu}_{\text{sm}}$ is designed to assess the localization of information, reflecting higher values when two layers exhibit pairs of neurons that demonstrate similar behavioral patterns.
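As an illustration, Eqs. (1) and (2) can be sketched in a few lines of NumPy. This is a minimal sketch rather than the released implementation: the function name `neu_neu_sm` and the `(frames, neurons)` array layout are assumptions, and the maximum is taken over the signed correlation exactly as Eq. (1) is written.

```python
import numpy as np


def neu_neu_sm(h_a: np.ndarray, h_b: np.ndarray) -> float:
    """Sketch of Eqs. (1)-(2) for two layers given as (num_frames, num_neurons) arrays.

    Both arrays are assumed to hold the same frames in the same order.
    """
    n = h_a.shape[0]
    # Standardize every neuron over frames so that a dot product is a Pearson correlation.
    z_a = (h_a - h_a.mean(axis=0)) / (h_a.std(axis=0) + 1e-12)
    z_b = (h_b - h_b.mean(axis=0)) / (h_b.std(axis=0) + 1e-12)
    corr = z_a.T @ z_b / n  # shape (d_a, d_b): rho between every neuron pair
    # Eq. (1): best-matching neuron in the other layer; Eq. (2): average over neurons.
    return float(corr.max(axis=1).mean())


# Hypothetical usage with random stand-ins for two layers of activations.
rng = np.random.default_rng(0)
layer_a = rng.standard_normal((500, 4))
layer_b = layer_a[:, [2, 0, 3, 1]]  # same neurons, different order
print(neu_neu_sm(layer_a, layer_b))
```

Because every neuron of `layer_a` has an exact counterpart in `layer_b`, the score in this toy case is close to its maximum of 1, regardless of neuron ordering.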

In contrast, $\text{neu-lay}_{\text{sm}}$ assesses how well a neuron $h^{(m)}_{l}[k]$ can be expressed as a linear regression on the neurons of another layer $l^{\prime}$ of another model $m^{\prime}$. It measures the quality of the regression fit and is defined as:

$$\widetilde{\text{neu-lay}_{\text{sm}}}(h^{(m)}_{l}[k],h^{(m^{\prime})}_{l^{\prime}})=lstsq(h^{(m^{\prime})}_{l^{\prime}},h^{(m)}_{l}[k]).r \tag{3}$$

$lstsq$ denotes linear least-squares, and $r$ represents the associated r-value. As before, this is extended to the layer level:

$$\text{neu-lay}_{\text{sm}}(h^{(m)}_{l},h^{(m^{\prime})}_{l^{\prime}})=\frac{1}{d_{m}}\sum_{k}\widetilde{\text{neu-lay}_{\text{sm}}}(h^{(m)}_{l}[k],h^{(m^{\prime})}_{l^{\prime}}) \tag{4}$$

$\text{neu-lay}_{\text{sm}}$ reflects how the localized information in a particular neuron of $m$ is distributed across the layers of the model $m^{\prime}$.
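Eqs. (3) and (4) can be sketched with an ordinary least-squares fit. The sketch below makes two assumptions that the text does not spell out: an intercept column is added to the regressors, and the r-value is taken to be the Pearson correlation between each target neuron and its least-squares prediction. The function name `neu_lay_sm` is likewise illustrative.

```python
import numpy as np


def neu_lay_sm(h_a: np.ndarray, h_b: np.ndarray) -> float:
    """Sketch of Eqs. (3)-(4): regress each neuron of h_a on all neurons of h_b.

    h_a: (num_frames, d_a) target layer; h_b: (num_frames, d_b) regressor layer.
    """
    n = h_a.shape[0]
    design = np.hstack([h_b, np.ones((n, 1))])  # assumption: include an intercept
    # One least-squares solve fits all target neurons at once (one column each).
    coef, *_ = np.linalg.lstsq(design, h_a, rcond=None)
    pred = design @ coef
    # Assumed r-value: Pearson correlation between target and prediction, per neuron.
    t = h_a - h_a.mean(axis=0)
    p = pred - pred.mean(axis=0)
    r = (t * p).sum(axis=0) / (np.linalg.norm(t, axis=0) * np.linalg.norm(p, axis=0) + 1e-12)
    return float(r.mean())  # Eq. (4): average over the neurons of h_a


# Hypothetical usage: every neuron of layer_a is a linear mix of layer_b neurons.
rng = np.random.default_rng(0)
layer_b = rng.standard_normal((500, 6))
layer_a = layer_b @ rng.standard_normal((6, 3))
print(neu_lay_sm(layer_a, layer_b))
```

In this toy case no neuron of `layer_a` copies a single neuron of `layer_b`, yet the regression fit is near perfect, which is the distributed behavior this measure is meant to expose.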

2.2 Representation-level Similarity

For layer-level representation analysis, we focus on canonical correlation analysis (CCA) similarity measures, namely projected-weighted CCA ($pwcca$) and singular vector CCA ($svcca$) [22]. Unlike previous work using $pwcca$ [14, 16], we examine similarities among frame representations rather than between frame representations and other information such as phonemes, words, and boundaries, which typically require extensive annotation and linguistic expertise. These similarity measures underscore the distributed nature of information across layers: they highlight scenarios where two layers exhibit similar behavior across all their neurons, emphasizing collective patterns rather than relying solely on individual neuron matching.
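To make the representation-level measures concrete, the following is a minimal NumPy sketch of an $svcca$-style score, not the implementation used in the experiments. It assumes the common recipe of keeping the singular directions that explain $99\%$ of the variance and then averaging the canonical correlations; the helper names and the `keep_var` parameter are illustrative. $pwcca$ differs in that it reweights the canonical correlations instead of taking a plain mean, which is not shown here.

```python
import numpy as np


def top_subspace(h: np.ndarray, keep_var: float = 0.99) -> np.ndarray:
    """Orthonormal basis (over frames) of the leading singular directions of h."""
    centered = h - h.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest number of directions whose cumulative variance reaches keep_var.
    k = min(int(np.searchsorted(explained, keep_var)) + 1, len(s))
    return u[:, :k]


def svcca(h_a: np.ndarray, h_b: np.ndarray, keep_var: float = 0.99) -> float:
    """Mean canonical correlation between the reduced subspaces of two layers."""
    u_a = top_subspace(h_a, keep_var)
    u_b = top_subspace(h_b, keep_var)
    # With orthonormal bases, the canonical correlations are the singular
    # values of the cross-product of the two bases.
    rho = np.linalg.svd(u_a.T @ u_b, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())


# Hypothetical usage: a rotated copy spans the same subspace as the original.
rng = np.random.default_rng(0)
layer_a = rng.standard_normal((400, 5))
rotation, _ = np.linalg.qr(rng.standard_normal((5, 5)))
print(svcca(layer_a, layer_a @ rotation))
```

Rotating the neurons changes every individual activation but leaves the subspace untouched, so the score stays close to 1; this is the sense in which CCA-based measures ignore neuron-level alignment.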

2.3 Attention-level Similarity

Similar to $\text{neu-neu}_{\text{sm}}$, the $\text{attention}_{\text{sm}}$ similarity identifies, for each attention head, the most “correlated” other attention head within the model $m$ and across models $m^{\prime}$. This measure captures behavioral similarity, indicating how well the heads' focus aligns. Given two attention heads, $\alpha^{(m)}_{l}[k]$ and $\alpha^{(m^{\prime})}_{l^{\prime}}[k^{\prime}]$, we calculate their similarity as their Pearson correlation, then average over the heads in layer $l$ as in Section 2.1.
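A possible reading of this measure is sketched below. It assumes that each head is represented by its attention map of shape `(T, T)` on the same input, that maps are flattened before correlation, and that both layers see the same number of frames; the function name `attention_sm` and the array layout are illustrative rather than taken from the released code.

```python
import numpy as np


def attention_sm(att_a: np.ndarray, att_b: np.ndarray) -> float:
    """Sketch: mean over heads in att_a of the max Pearson correlation with any head in att_b.

    att_a: (heads_a, T, T) and att_b: (heads_b, T, T) attention weights on the same input.
    When comparing a layer with itself, the trivial self-match should be masked out first.
    """
    a = att_a.reshape(att_a.shape[0], -1)
    b = att_b.reshape(att_b.shape[0], -1)
    # Center and length-normalize each flattened map so dot products are correlations.
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    corr = a @ b.T  # shape (heads_a, heads_b)
    return float(corr.max(axis=1).mean())


# Hypothetical usage with random row-stochastic maps standing in for real attention.
rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 20, 20))
att = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(attention_sm(att, att[[3, 1, 0, 2]]))
```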

3 Experimental Setup

SSL Models

We adopt widely used self-supervised speech models, HuBERT (hub) and Wav2Vec2.0 (w2v), as reported in Table 1 (available at https://huggingface.co/collections/facebook). Both models share similar architectures. The encoder network consists of blocks of temporal convolution layers with $512$ channels, whose strides and kernel sizes map about $25$ ms of $16$ kHz audio to one frame every $20$ ms. The context network consists of $12$ (base) or $24$ (large) blocks, with model dimension $768$ (base) or $1024$ (large) and $12$ (base) or $16$ (large) attention heads. The underlying difference between the models lies in their training objectives: w2v is trained with a Contrastive Predictive Coding (CPC) loss combined with masking, which makes it a contrastive model. In contrast, hub predicts discrete targets for masked regions using a Cross-Entropy (CE) loss, which makes it a predictive model.
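The $25$ ms window and $20$ ms hop follow directly from the convolution settings. The short sketch below reproduces the arithmetic, assuming the seven-block kernel and stride configuration described in the wav2vec 2.0 paper [2]; these values are not listed in this paper, and the helper names are illustrative.

```python
# Assumed kernel widths and strides of the seven convolutional blocks,
# as described in the wav2vec 2.0 paper (not stated in this document).
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000


def receptive_field_and_hop(kernels, strides):
    """Return (samples seen by one output frame, samples between consecutive frames)."""
    field, hop = 1, 1
    for k, s in zip(kernels, strides):
        field += (k - 1) * hop  # each layer widens the window by (k - 1) input hops
        hop *= s                # and multiplies the spacing between outputs
    return field, hop


def num_frames(num_samples, kernels=KERNELS, strides=STRIDES):
    """Number of output frames for an unpadded waveform of num_samples samples."""
    n = num_samples
    for k, s in zip(kernels, strides):
        n = (n - k) // s + 1
    return n


field, hop = receptive_field_and_hop(KERNELS, STRIDES)
print(field * 1000 / SAMPLE_RATE, hop * 1000 / SAMPLE_RATE)  # 25.0 20.0
print(num_frames(SAMPLE_RATE))  # 49
```

Under these assumptions, one second of audio yields 49 frames, i.e. roughly one embedding per $20$ ms.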

Dataset

We use the TIMIT dataset [23], which is extensively employed in research on phone recognition, phone segmentation, and speaker recognition. TIMIT comprises $5.4$ hours of clean, manually transcribed data. Despite its limited size, the dataset features a diverse set of approximately $630$ speakers delivering phonetically rich sentences, making it well suited to our task. As our task focuses on studying similarities, we exclusively use the official training set. Because we employ frame-level embeddings, each of the selected models yields over $700K$ embeddings from each layer (one embedding per $20$ ms).

Models	(Abbreviation)	Training Data
HuBERT (BASE)	hub-base	Librispeech 960hrs
HuBERT (LARGE)	hub-large	Libri-Light
Wav2Vec 2.0 (BASE)	w2v-base	Librispeech 960hrs
Wav2Vec 2.0 (LARGE)	w2v-large	Libri-Light

Table 1: Examined Pretrained Speech SSL Models

4 Analysis & Discussion

A. Neuron Intra-Model Similarity.

Figure 1 shows heatmaps of the similarities between neurons and layers across models using the $\text{neu-neu}_{\text{sm}}$, $\text{neu-lay}_{\text{sm}}$, and $pwcca$ similarities. The $\text{neu-neu}_{\text{sm}}$ heatmap reveals a distinctive diagonal pattern within each model, indicating that neurons in a given layer $l$ of a model $m$ tend to be similar to their counterpart neurons in adjacent layers of the same model $m$. However, the individual neurons of a given model $m$ are very different from the neurons of other models. This observation suggests that neurons exhibit significantly higher intra-model than inter-model similarity. A similar pattern was found in contextualized language models [21]. Furthermore, the similarity pattern in lower-layer neurons is consistent across all examined models, potentially because of their proximity to the CNN feature extraction layers, which are equivalent to spectrogram features, as demonstrated in previous work [24].

B. Layer Inter-Model Similarity.

While individual neuron pairs differ across models, $\text{neu-lay}_{\text{sm}}$ in Figure 1 reveals strong inter-model similarity, suggesting that the representations of different models converge to similar subspaces. Furthermore, the results suggest that the individual neurons of a model can be represented as a linear combination of neurons from layers of other models. These cross-model similarities are also observed with the representation-level similarities $pwcca$ and $svcca$.

C. Models within the same family behave similarly.

Notably, the $\text{neu-lay}_{\text{sm}}$ similarity reveals that neuron concepts in the top layers of hub-based models (base and large) are less similar to those in the lower layers, and vice versa. This is consistent with the view that the lower layers capture fine-grained concepts, whereas the higher layers capture more abstract information, as seen in [14, 16, 8]. In contrast, w2v models show a different trend: the higher layers of both the base (layers L8-L10) and large (layers L20-L23) models are notably similar to all other layers within the model. These intra-model similarities are seen with both the $pwcca$ and $svcca$ similarity measures. Despite the high similarities in inter-model layer representations (as shown in Section 4.B), the final layers of hub and w2v (both base and large) are very distinct. Our observations indicate that models within the same family (base and large) exhibit similar representational behavior. We hypothesize that the difference in representation between the two families, hub vs. w2v, is more likely attributable to the training objective of the self-supervised models than to the number of layers in the architecture, which aligns with the findings reported in [17].

D. Adjacent and Lower Layers Similarity.

All heatmaps in Figure 1, as well as the $svcca$ similarity in Figure 2, show a bright diagonal and bright neighboring areas. This brightness suggests that neighboring layers share similar representations, indicating that adjacent layers in the models encapsulate similar information subspaces. Similar patterns are observed in both language models and vision networks [25]. Additionally, $pwcca$ shows that the subspaces of lower layers are similar across the studied models. This aligns with expectations and with the findings in Section 4.A, where lower layers closely resemble the CNN layers that function as feature extractors, and these features have equivalent representations across the considered models.

E. Attention Weights Similarity.

Examining Figure 4, we observe high similarities between the attention heads in the upper layers of the models with respect to the lower layers. These high similarities in the upper layer could indicate redundancy in design. However, it is important to note that attention-based similarity measures are not reliable and are harder to interpret, as fine-grained patterns are less noticeable in the similarity-based analysis.

F. ASR Finetuning Effect.

Figure 3 shows similarity heatmaps between w2v-large and hub-large and their counterparts fine-tuned for ASR on the Librispeech dataset, w2v-large-ft and hub-large-ft. The analysis uses the $pwcca$ and $\text{neu-neu}_{\text{sm}}$ similarities to explore changes in the information held across layers and in the information localized within neurons. The results indicate that hub-based models change significantly after fine-tuning only in the last few layers (the similarity between the foundation model and its fine-tuned counterpart is less than $0.5$ in the upper layers). For the w2v model, we observe that a large number of upper layers change significantly at both the neuron and layer levels, in alignment with the findings in [14]. Such findings indicate that fine-tuning only the upper layers can be as effective as fine-tuning the full model. Our observation is aligned with the findings in [16], where the same conclusion was drawn from phoneme-level $cca$ after fine-tuning only layer $16$ in w2v-large and layer $20$ in hub-large, which yielded results comparable to fine-tuning all parameters. Note that [16] used human annotation for phoneme boundaries, whereas our proposed method reaches the same conclusion without relying on any external annotation.
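One way to turn this observation into a fine-tuning recipe is to freeze every layer below the first one whose similarity to its fine-tuned counterpart falls under a threshold. The sketch below is a hypothetical heuristic, not a procedure described in this paper: the function name, the use of the $0.5$ value as a cut-off, and the example similarity values (made up purely for illustration, not measured results) are all assumptions.

```python
def split_frozen_trainable(layer_similarity, threshold=0.5):
    """Hypothetical heuristic: freeze layers below the first one that changed a lot.

    layer_similarity[i] is the similarity between layer i of the pretrained model
    and layer i of its fine-tuned counterpart (the heatmap diagonal).
    Returns (frozen_layer_indices, trainable_layer_indices).
    """
    num_layers = len(layer_similarity)
    for layer, sim in enumerate(layer_similarity):
        if sim < threshold:
            return list(range(layer)), list(range(layer, num_layers))
    # No layer dropped below the threshold: nothing needs to be unfrozen.
    return list(range(num_layers)), []


# Made-up similarity values for a 6-layer toy model (illustration only).
toy_similarity = [0.95, 0.92, 0.88, 0.70, 0.45, 0.30]
frozen, trainable = split_frozen_trainable(toy_similarity)
print(frozen, trainable)  # [0, 1, 2, 3] [4, 5]
```

The returned indices could then be used to disable gradient updates for the frozen layers in whichever training framework is used.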

Key Points.

Our study highlights how different models trained with distinct objectives can converge toward similar representations and concepts. We observed that neurons in one layer can be expressed as linear combinations of neurons from other layers in different models. Importantly, this convergence is driven more by the distributed nature of the representations than by neuron-level concept localization. In other words, individual neurons learn different localized concepts, but overall they contribute to similar subspaces across layers.

5 Conclusion

The paper introduces annotation- and task-independent approaches for analyzing various speech SSL models. Our in-depth analysis explores both the Wav2Vec2.0 and HuBERT model families, revealing intricate convergence patterns in inter- and intra-model similarities of neurons, layers, and attention weights. Our findings suggest that models share similar distributed representations but different localized concepts, and that the training objective emerges as a pivotal factor, outweighing the influence of model size. This signals how understanding the inner workings of such large models can facilitate effective and parameter-efficient design decisions for both foundation and downstream models.

References

[1] Stock Wolfgang G, “Concepts and semantic relations in information science,” Journal of the American Society for Information Science and Technology, vol. 61, 2010.
[2] Baevski Alexei et al., “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, 2020.
[3] Hsu Wei-Ning et al., “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
[4] Mohamed Abdelrahman et al., “Self-supervised speech representation learning: A review,” IEEE Journal of Selected Topics in Signal Processing, 2022.
[5] Shon Suwon et al., “Slue: New benchmark tasks for spoken language understanding evaluation on natural speech,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.
[6] Borgholt Lasse et al., “A brief overview of unsupervised neural speech representation learning,” arXiv preprint arXiv:2203.01829, 2022.
[7] Fan Zhiyun et al., “Exploring wav2vec 2.0 on speaker verification and language identification,” arXiv preprint arXiv:2012.06185, 2020.
[8] Chowdhury Shammur Absar et al., “What do end-to-end speech models learn about speaker, language and channel information? a layer-wise and neuron-level analysis,” Computer Speech & Language, 2023.
[9] Shammur A Chowdhury, Ahmed Ali, Suwon Shon, and James R Glass, “What does an end-to-end dialect identification model learn about non-dialectal information?,” in INTERSPEECH, 2020.
[10] Feng Chi-Luen et al., “Silence is sweeter than speech: Self-supervised model using silence to store speaker information,” arXiv preprint arXiv:2205.03759, 2022.
[11] Shah Jui et al., “What all do audio transformer models hear? probing acoustic representations for language delivery and its structure,” arXiv preprint arXiv:2101.00387, 2021.
[12] Li Yuanchao et al., “Exploration of a self-supervised speech model: A study on emotional corpora,” in Spoken Language Technology Workshop (SLT). IEEE, 2023.
[13] Ji Hang et al., “Predicting within and across language phoneme recognition performance of self-supervised learning speech pre-trained models,” arXiv preprint arXiv:2206.12489, 2022.
[14] Pasad Ankita et al., “Layer-wise analysis of a self-supervised speech representation model,” in Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021.
[15] Yang Mu et al., “What can an accent identifier learn? probing phonetic and prosodic information in a wav2vec2-based accent identification model,” arXiv preprint arXiv:2306.06524, 2023.
[16] Pasad Ankita et al., “Comparative layer-wise analysis of self-supervised speech models,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
[17] Chung Yu-An et al., “Similarity analysis of self-supervised speech representations,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
[18] Belinkov Yonatan and James Glass, “Analysis methods in neural language processing: A survey,” Transactions of the Association for Computational Linguistics, 2019.
[19] Morcos Ari et al., “Insights on representational similarity in neural networks with canonical correlation,” Advances in neural information processing systems, vol. 31, 2018.
[20] James L McClelland, David E Rumelhart, and Geoffrey E Hinton, “The appeal of parallel distributed processing,” MIT Press, Cambridge MA, 1986.
[21] Wu John et al., “Similarity analysis of contextual word representation models,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, July 2020, pp. 4638–4655, Association for Computational Linguistics.
[22] Raghu Maithra et al., “Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability,” Advances in neural information processing systems, vol. 30, 2017.
[23] Garofolo John S, “Timit acoustic phonetic continuous speech corpus,” Linguistic Data Consortium, 1993.
[24] Wu Felix et al., “Performance-efficiency trade-offs in unsupervised pre-training for speech recognition,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.
[25] Kornblith Simon et al., “Similarity of neural network representations revisited,” in International conference on machine learning. PMLR, 2019, pp. 3519–3529.
