Speech Representation Analysis
based on Inter- and Intra-Model Similarities

Abstract

Self-supervised models have revolutionized speech processing, achieving new levels of performance in a wide variety of tasks with limited resources. However, the inner workings of these models are still opaque. In this paper, we analyze the encoded contextual representations of these foundation models based on their inter- and intra-model similarity, independent of any external annotation or task-specific constraint. We examine SSL models that differ in training paradigm, contrastive (Wav2Vec2.0) and predictive (HuBERT), and in model size (base and large). We explore these models at different levels of localization/distributivity of information, including (i) individual neurons, (ii) layer representations, and (iii) attention weights, and (iv) compare the representations with their finetuned counterparts. Our results highlight that these models converge to similar representation subspaces but not to similar neuron-localized concepts (a concept represents a coherent fragment of knowledge, such as “a class containing certain objects as elements, where the objects have certain properties” [1]). To facilitate further research, we have publicly released our code at https://github.com/QCRIVoice/XSSL_speech.

Index Terms—  Self-Supervised Learning, Speech Models, Inter- and Intra- Similarities

1 Introduction

Self-supervised speech models like Wav2Vec2 [2] and HuBERT [3] have shown remarkable advancements in a variety of speech processing tasks, including speech recognition, emotion recognition, speaker verification, and language identification [4, 5, 6], among others. This significant advancement over supervised state-of-the-art methods, together with the opaqueness of these models, has sparked interest in understanding and exploring their internal mechanisms.

Several studies have aimed to understand the information these models capture about different properties such as speaker characteristics [7, 8, 9, 10], paralinguistic aspects [11, 12], articulatory features [13], acoustic-linguistic elements [14], as well as accent features [15] among others. Moreover, studies like [14, 16] have also shown how better model understanding can lead to efficient fine-tuning strategies for downstream tasks.

A widely used interpretation technique is to train supervised classifiers, also known as probing classifiers [17, 8, 18], on the learned representations of given models to predict various task properties. This methodology has been applied in various studies and has shown that representations from different models capture distinct properties. Additionally, similarity-based methods are used to find associations at the frame, phoneme, and word levels. These methods use metrics such as projection-weighted canonical correlation analysis ($pwcca$) [19] and mutual information, without training classifiers. However, the effectiveness of this approach is limited by the need for annotated data: it requires precise boundary alignment, accurate phoneme transcription, and consistent word alignment to ensure valid and reliable analysis and results.

In this study, we introduce inter- and intra-model similarity measures to understand contextual representations within speech models. Instead of focusing on a specific category or property of information, we explore both inter- and intra-model similarity across a spectrum of models. We investigate localization/distributivity properties in these models, that is, whether every single neuron encodes a single concept or whether all concepts are spread across multiple neurons [20]. We adopt a set of 5 distinct similarity measures to explore the localization/distributivity behavior of the SSL models at the level of individual neurons, layers, and attention mechanisms. This range of metrics allows us to capture the nuances in the patterns embedded within the models, offering a granular view of their structural and functional dynamics.

Our in-depth analysis reaffirms prior discoveries without the need for external data or defined tasks. Moreover, our findings also reveal noteworthy insights: (i) Speech SSL model neurons exhibit higher intra-model similarity than inter-model similarity. (ii) Information encapsulated by neurons from one layer can be represented as a linear combination of other layers. Models have similar representation subspaces but different localized neuron concepts. (iii) Lower and adjacent layers demonstrate a high degree of similarity across diverse models. (iv) The training objective has a greater impact on representation similarity than the size of the model architecture. (v) Finally, we show how the similarity analysis can motivate efficient finetuning for ASR, where freezing the bottom layers of models still maintains comparable performance to finetuning the full network while reducing the finetuning time.

Fig. 1: Comparison of Heatmap Similarities Between HuBERT and Wav2Vec2.0 Models: $\text{neu-neu}_{\text{sm}}$, $\text{neu-lay}_{\text{sm}}$, and $pwcca$ Similarities. Model boundaries are highlighted in yellow, and noteworthy similarities are encircled in green.

2 Methodology

We analyzed $M$ pretrained speech SSL models for both localized and distributed information using various widely accepted similarity metrics [21], capturing different notions of similarity at the level of individual neurons, layers, and attention. We propose to remove any dependency on external labels or boundary annotation by using only frame-level representations for the study.

For each model $m \in M$, we extracted frame-level representations $h^{(m)}_{l} \in \mathbb{R}^{d_m}$, where $d_m \in \{768, 1024\}$ is the number of neurons, and attention weights $\alpha^{(m)}_{l}$ at each layer $l$. We then use the extracted neuron- and layer-level representations and attention weights to find inter- and intra-model similarities at various levels of granularity.

2.1 Neuron-level Similarity

We adopted two different similarity measures: (i) neuron-neuron similarity ($\text{neu-neu}_{\text{sm}}$), the similarity between pairs of individual neurons, and (ii) neuron-layer similarity ($\text{neu-lay}_{\text{sm}}$), the similarity between a neuron in one model and a layer in another.

For a given neuron $h^{(m)}_{l}[k]$ in layer $l$ of a model $m$, $\text{neu-neu}_{\text{sm}}$ is defined as the maximum correlation between $h^{(m)}_{l}[k]$ and any neuron $h^{(m')}_{l'}[k']$ in layer $l'$ of another model $m'$:

$$\widetilde{\text{neu-neu}_{\text{sm}}}(h^{(m)}_{l}[k], h^{(m')}_{l'}) = \max_{k'} \rho(h^{(m')}_{l'}[k'], h^{(m)}_{l}[k]) \tag{1}$$

Then, we average over all neurons in layer $l$ of the model $m$,

$$\text{neu-neu}_{\text{sm}}(h^{(m)}_{l}, h^{(m')}_{l'}) = \frac{1}{d_m} \times \sum_{k} \widetilde{\text{neu-neu}_{\text{sm}}}(h^{(m)}_{l}[k], h^{(m')}_{l'}), \tag{2}$$

where $\rho$ is the Pearson correlation.

$\text{neu-neu}_{\text{sm}}$ is designed to assess the localization of information: it takes higher values when two layers contain pairs of neurons that exhibit similar behavioral patterns.
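As an illustration of Eqs. (1) and (2), the following is a minimal NumPy sketch, not the authors' released implementation. It assumes each layer is stored as a matrix of shape (frames, neurons) computed over the same frames; the function names `pearson_matrix` and `neu_neu_sim` are hypothetical. It follows the equations as written and takes the maximum of the signed correlation.

```python
import numpy as np

def pearson_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pearson correlation between every column of a (T x d_a) and of b (T x d_b)."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a = a / (np.linalg.norm(a, axis=0) + 1e-12)  # unit-norm, centered columns
    b = b / (np.linalg.norm(b, axis=0) + 1e-12)
    return a.T @ b  # shape (d_a, d_b)

def neu_neu_sim(h_l: np.ndarray, h_lp: np.ndarray) -> float:
    """Sketch of Eqs. (1)-(2): best-matching neuron per neuron, then layer average.

    h_l  : (T, d_m)  activations of layer l in model m
    h_lp : (T, d_m') activations of layer l' in model m'
    """
    corr = pearson_matrix(h_l, h_lp)
    best_match = corr.max(axis=1)    # Eq. (1): max over neurons k' of layer l'
    return float(best_match.mean())  # Eq. (2): average over neurons k of layer l

# Toy usage with random activations (illustrative data only).
rng = np.random.default_rng(0)
layer_a = rng.standard_normal((500, 4))
layer_b = layer_a[:, [2, 0, 3, 1]]  # same neurons in a different order
print(neu_neu_sim(layer_a, layer_b))
```

Because each neuron is matched to its single best counterpart, the measure does not depend on neuron ordering, but it is asymmetric in its two arguments.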

In contrast, $\text{neu-lay}_{\text{sm}}$ assesses how well a neuron $h^{(m)}_{l}[k]$ can be expressed as a linear regression on the neurons of another layer $l'$ of another model $m'$, and measures the quality of the regression fit, which is defined as:

$$\widetilde{\text{neu-lay}_{\text{sm}}}(h^{(m)}_{l}[k], h^{(m')}_{l'}) = lstsq(h^{(m')}_{l'}, h^{(m)}_{l}[k]).r \tag{3}$$

$lstsq$ denotes linear least-squares, and $r$ represents the associated r-value. As before, this is extended to the layer level:

$$\text{neu-lay}_{\text{sm}}(h^{(m)}_{l}, h^{(m')}_{l'}) = \frac{1}{d_m} \sum_{k} \widetilde{\text{neu-lay}_{\text{sm}}}(h^{(m)}_{l}[k], h^{(m')}_{l'}) \tag{4}$$

$\text{neu-lay}_{\text{sm}}$ reflects how the localized information in particular neurons of $m$ is distributed across the layers of model $m'$.
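Eqs. (3) and (4) can be sketched in NumPy as follows. This is a hedged illustration rather than the released code: it assumes the r-value is the Pearson correlation between the least-squares prediction and the actual neuron activation, and it adds an intercept column, which the text does not specify. The name `neu_lay_sim` is hypothetical.

```python
import numpy as np

def neu_lay_sim(h_l: np.ndarray, h_lp: np.ndarray) -> float:
    """Sketch of Eqs. (3)-(4): regress each neuron of layer l on all of layer l'.

    h_l  : (T, d_m)  target layer of model m
    h_lp : (T, d_m') predictor layer of model m'
    """
    # Assumption: include a bias column so the regression has an intercept.
    X = np.hstack([h_lp, np.ones((h_lp.shape[0], 1))])
    # Solve all d_m least-squares problems at once: X @ W ~ h_l.
    W, *_ = np.linalg.lstsq(X, h_l, rcond=None)
    pred = X @ W
    # Assumed r-value: correlation between fitted and actual activations (Eq. 3).
    pc = pred - pred.mean(axis=0)
    tc = h_l - h_l.mean(axis=0)
    denom = np.linalg.norm(pc, axis=0) * np.linalg.norm(tc, axis=0) + 1e-12
    r = (pc * tc).sum(axis=0) / denom
    return float(r.mean())  # Eq. (4): average over neurons of layer l

# Toy usage: layer_a is an exact linear mix of layer_b, so the fit is near perfect.
rng = np.random.default_rng(0)
layer_b = rng.standard_normal((300, 5))
layer_a = layer_b @ rng.standard_normal((5, 3))
print(neu_lay_sim(layer_a, layer_b))
```

Unlike the neuron-to-neuron match, this score is high whenever a neuron lies in the span of the other layer, even if no single neuron there resembles it.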

2.2 Representation-level Similarity

For layer-level representation analysis, we focus on canonical correlation analysis (CCA) similarity measures. Whereas previous work focuses on using projection-weighted CCA similarity ($pwcca$) [14, 16] and singular vector CCA ($svcca$) [22], we examine similarities among frame representations, rather than between frame representations and other information such as phonemes, words, and boundaries, which typically requires extensive annotation and linguistic expertise. These similarity measures underscore the distributive nature of information across layers: they highlight scenarios where two layers exhibit similar behaviors across all their neurons, emphasizing collective patterns rather than relying solely on individual neuron matching.
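To make the CCA family concrete, here is a minimal sketch of plain mean CCA similarity between two layers. It is a simplification and not $svcca$ or $pwcca$ themselves: $svcca$ additionally truncates each layer with an SVD before CCA [22], and $pwcca$ replaces the plain mean with a projection-weighted mean [19]. The sketch assumes more frames than neurons and full column rank; the function names are hypothetical.

```python
import numpy as np

def cca_correlations(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Canonical correlations between x (T x d_x) and y (T x d_y), with T > d_x, d_y."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    # Orthonormal bases of the two column spaces.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Singular values of Qx^T Qy are the canonical correlations.
    s = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def mean_cca_sim(x: np.ndarray, y: np.ndarray) -> float:
    """Unweighted mean of the canonical correlations."""
    return float(cca_correlations(x, y).mean())

# Toy usage: y is an invertible linear transform of x, so both span the same subspace.
rng = np.random.default_rng(0)
x = rng.standard_normal((400, 4))
y = x @ rng.standard_normal((4, 4))
print(mean_cca_sim(x, y))
```

The score is invariant to invertible linear transforms of either layer, which is why it can be high even when no individual neurons match.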

2.3 Attention-level Similarity

Similar to $\text{neu-neu}_{\text{sm}}$, $\text{attention}_{\text{sm}}$ similarity identifies the most “correlated” attention heads within the model $m$ and across $m'$. This measure captures behavioral similarity, indicating alignment of focus. Given two attention heads, $\alpha^{(m)}_{l}[k]$ and $\alpha^{(m')}_{l'}[k']$, we calculate their similarity as their Pearson correlation, and then average over the heads in layer $l$ as in Section 2.1.
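A possible reading of this measure is sketched below. It assumes each head's attention weights over the same inputs are flattened into one vector, so a layer becomes a (heads, values) matrix, and that the best-matching head is chosen by maximum Pearson correlation before averaging. Both the flattening and the name `attention_sim` are assumptions for illustration.

```python
import numpy as np

def attention_sim(att_l: np.ndarray, att_lp: np.ndarray) -> float:
    """Max-correlation match between attention heads, averaged over heads of layer l.

    att_l  : (H, N)  flattened attention weights of the H heads in layer l
    att_lp : (H', N) flattened attention weights of the H' heads in layer l'
    """
    a = att_l - att_l.mean(axis=1, keepdims=True)
    b = att_lp - att_lp.mean(axis=1, keepdims=True)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    corr = a @ b.T                         # (H, H') Pearson correlations
    return float(corr.max(axis=1).mean())  # best head per head, then average

# Toy usage: 3 heads attending over 10 frames (random softmax maps, illustrative only).
rng = np.random.default_rng(0)
logits = rng.standard_normal((3, 10, 10))
att = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
flat = att.reshape(3, -1)
print(attention_sim(flat, flat[[2, 0, 1]]))
```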

3 Experimental Setup

SSL Models

We adopt widely used self-supervised speech models, HuBERT (hub) and Wav2Vec2.0 (w2v) (available at https://huggingface.co/collections/facebook), as reported in Table 1. Both models share similar architectures. The encoder network consists of blocks of temporal convolution layers with 512 channels, and the convolutions in each block have strides and kernel sizes that compress about 25 ms of 16 kHz audio every 20 ms. The context network consists of 12 (base) and 24 (large) blocks with model dimension 768 (base) and 1024 (large) and 12 (base) and 16 (large) attention heads. The underlying difference between the models lies in their training objectives: w2v is trained with a Contrastive Predictive Coding (CPC) loss combined with masking, which classifies it as a contrastive model. In contrast, hub predicts discrete targets for masked regions using a Cross-Entropy (CE) loss, which classifies it as a predictive model.
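The 25 ms window and 20 ms hop can be checked with a short calculation. The kernel widths and strides below are not given in this text; they are the seven-layer convolutional configuration reported in the wav2vec 2.0 paper [2] and are assumed here to apply to both model families. The helper name `conv_stack_geometry` is hypothetical.

```python
def conv_stack_geometry(kernels, strides):
    """Receptive field and hop size (both in input samples) of stacked 1-D convolutions."""
    receptive_field, hop = 1, 1
    for k, s in zip(kernels, strides):
        receptive_field += (k - 1) * hop  # each layer widens the window by (k-1) hops
        hop *= s                          # each stride multiplies the hop
    return receptive_field, hop

# Assumed configuration, taken from the wav2vec 2.0 paper [2].
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # Hz

rf, hop = conv_stack_geometry(KERNELS, STRIDES)
rf_ms = 1000 * rf / SAMPLE_RATE
hop_ms = 1000 * hop / SAMPLE_RATE
print(f"receptive field: {rf} samples = {rf_ms} ms")  # 400 samples = 25.0 ms
print(f"hop: {hop} samples = {hop_ms} ms")            # 320 samples = 20.0 ms
```

Under these assumptions each frame-level embedding summarizes 400 samples and a new frame is produced every 320 samples, which matches the figures quoted above.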

Dataset

We use the TIMIT dataset [23], which is extensively employed in research on phone recognition, phone segmentation, and speaker recognition. TIMIT comprises 5.4 hours of clean, manually transcribed data. Despite its limited size, the dataset features a diverse set of approximately 630 speakers delivering phonetically rich sentences, making it well suited to our task. Since our task focuses on studying similarities, we exclusively use the official training set. Given that we employ frame-level embeddings, each of the selected models yields over 700K embeddings from each layer (each embedding corresponds to 20 ms).

| Models | Abbreviation | Training Data |
|---|---|---|
| HuBERT (BASE) | hub-base | Librispeech 960hrs |
| HuBERT (LARGE) | hub-large | Libri-Light |
| Wav2Vec 2.0 (BASE) | w2v-base | Librispeech 960hrs |
| Wav2Vec 2.0 (LARGE) | w2v-large | Libri-Light |

Table 1: Examined Pretrained Speech SSL Models

4 Analysis & Discussion

A. Neuron Intra-Model Similarity.

Figure 1 illustrates heatmaps showing similarities between different neurons and layers across various models using $\text{neu-neu}_{\text{sm}}$, $\text{neu-lay}_{\text{sm}}$, and $pwcca$ similarities. The $\text{neu-neu}_{\text{sm}}$ heatmap reveals a distinctive diagonal pattern within each model, indicating that neurons in a given layer $l$ of a model $m$ tend to be similar to their counterpart neurons in adjacent layers of the same model $m$. However, individual neurons differ markedly when the neurons of a given model $m$ are compared with those of other models. This observation suggests that neurons exhibit significantly higher intra-model than inter-model similarity. A similar pattern was found in contextualized language models [21]. Furthermore, the similarity pattern identified in lower-layer neurons is consistent across all examined models, potentially linked to their proximity to the CNN feature extraction layers, whose outputs are equivalent to spectrogram features, as demonstrated in previous work [24].

Fig. 2: Heatmap of $svcca$ similarity

Fig. 3: Comparing Heatmap Similarities between HuBERT and Wav2Vec2.0 Models in Relation to their ASR Finetuning Variations: $\text{neu-neu}_{\text{sm}}$ and $pwcca$ Similarities. Noteworthy similarities are encircled in green.

B. Layer Inter-Model Similarity.

While individual neuron pairs exhibit distinct characteristics across different models, $\text{neu-lay}_{\text{sm}}$ in Figure 1 reveals strong inter-model similarity, suggesting that the representations of different models converge to similar subspaces. Furthermore, the results also suggest that the individual neurons of a model can be represented as a linear combination of neurons from other layers of the model. These cross-model similarities are also observed using the representation-level similarities $pwcca$ and $svcca$.

C. Models within the same family behave similarly.

Notably, $\text{neu-lay}_{\text{sm}}$ similarity reveals that neuron concepts in the top layers of hub-based models (base and large) are less similar to those in lower layers, and vice versa, which supports the observation that the lower layers capture fine-grained concepts, whereas the higher layers capture more abstract information, as seen in [14, 16, 8]. In contrast, w2v models show a different trend. A notable similarity is observed between the higher layers of both the base (L8 - L10) and large (L20 - L23) models and all other layers within the same model. These intra-model similarities are seen using both $pwcca$ and $svcca$ similarity measures. Despite the high similarities in inter-model layer representation (as shown in Section 4.B), the final layers of hub and w2v (both base and large) are very distinct. Our observations indicate that models within the same family (base and large) exhibit behavioral similarities in representation. We hypothesize that the difference in representation between the families, hub vs. w2v, is more likely attributable to the training objective of the self-supervised models than to the number of layers in the architecture, which aligns with the findings reported in [17].

D. Adjacent and Lower Layers Similarity.

All the heatmaps in Figure 1, as well as the $svcca$ similarity in Figure 2, show a bright diagonal and bright neighboring areas. This brightness suggests that neighboring layers share similar representations, indicating that adjacent layers in the models encapsulate similar information subspaces. Similar patterns are observed in both language models and vision networks [25]. Additionally, $pwcca$ discloses that lower-layer subspaces are similar across the studied models. This aligns with expectations and with the findings in Section 4.A, where lower layers closely resemble the CNN layers functioning as feature extractors, and these features exhibit equivalent representations across the considered models.

E. Attention Weights Similarity.

Examining Figure 4, we observe high similarities between the attention heads in the upper layers of the models with respect to the lower layers. These high similarities in the upper layer could indicate redundancy in design. However, it is important to note that attention-based similarity measures are not reliable and are harder to interpret, as fine-grained patterns are less noticeable in the similarity-based analysis.

Fig. 4: Heatmap of $\text{attention}_{\text{sm}}$ similarity

F. ASR Finetuning Effect.

Figure 3 depicts the similarity heatmaps between w2v-large and hub-large, along with their counterparts finetuned for ASR on the Librispeech dataset, w2v-large-ft and hub-large-ft. The analysis uses $pwcca$ and $\text{neu-neu}_{\text{sm}}$ similarities to explore potential changes in information across different layers and in the localized information within neurons. Results indicate that hub-based models undergo significant changes after finetuning primarily in the last few layers (similarity between the foundation model and its finetuned counterpart is less than 0.5 in upper layers). For the w2v model, we observe that a large number of upper layers have changed significantly at both the neuron and layer levels, in alignment with the findings in [14]. Such findings indicate that finetuning exclusively the upper layers can be as effective as finetuning the full model. Our observation is aligned with the findings in [16], where this conclusion was drawn by examining phoneme-level $cca$ following the finetuning of only layer 16 in w2v-large and layer 20 in hub-large, which yielded results comparable to finetuning all parameters. Note that [16] used human annotation for phoneme boundaries, whereas our proposed method reached the same conclusion without relying on any external annotation.
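One way such a similarity profile could inform which layers to freeze is sketched below. This is an illustrative heuristic, not a procedure described in the paper: it takes the per-layer similarity between a pretrained model and its finetuned counterpart and freezes every layer below the first one whose similarity falls under a threshold (0.5, following the value quoted above). The helper `freeze_boundary` is hypothetical and the similarity values are invented toy numbers, not measured results.

```python
import numpy as np

def freeze_boundary(layer_similarity, threshold: float = 0.5) -> int:
    """Number of bottom layers to freeze.

    layer_similarity[i] is the similarity between layer i of the pretrained model
    and layer i of its finetuned counterpart (bottom layer first).
    Returns the index of the first layer below `threshold`, or the layer count
    if no layer changed that much.
    """
    sims = np.asarray(layer_similarity, dtype=float)
    changed = np.flatnonzero(sims < threshold)
    return int(changed[0]) if changed.size else len(sims)

# Toy profile (invented values): only the top two layers changed substantially.
toy_profile = [0.95, 0.90, 0.85, 0.70, 0.45, 0.30]
n_frozen = freeze_boundary(toy_profile)
print(f"freeze layers 0..{n_frozen - 1}, finetune layers {n_frozen}..{len(toy_profile) - 1}")
```

A single threshold is a crude criterion; in practice the boundary would need to be validated against downstream ASR performance.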

Key Points.

Our study highlights how different models trained with distinct objectives can converge toward similar representations and concepts. We observed that neurons in one layer can be expressed as linear combinations of neurons from other layers in different models. Importantly, this convergence is driven more by the distributive nature of the representations than by the localization of concepts in neurons. In other words, individual neurons learn different localized concepts, but overall, they contribute to similar subspaces across layers.

5 Conclusion

The paper introduces annotation- and task-independent approaches for analyzing various speech SSL models. Our in-depth analysis explores both the Wav2Vec2.0 and HuBERT model families, revealing intricate convergence patterns in inter- and intra-model similarities of neurons, layers, and attention weights. Our findings suggest that models share similar distributed representations but different localized concepts, and that the training objective is a pivotal factor, outweighing the influence of model size. This signals how understanding the inner workings of such large models can facilitate effective and parameter-efficient design decisions for both foundation and downstream models.

References

  • [1] Stock Wolfgang G, “Concepts and semantic relations in information science,” Journal of the American Society for Information Science and Technology, vol. 61, 2010.
  • [2] Baevski Alexei et al., “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, 2020.
  • [3] Hsu Wei-Ning et al., “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
  • [4] Mohamed Abdelrahman et al., “Self-supervised speech representation learning: A review,” IEEE Journal of Selected Topics in Signal Processing, 2022.
  • [5] Shon Suwon et al., “Slue: New benchmark tasks for spoken language understanding evaluation on natural speech,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.
  • [6] Borgholt Lasse et al., “A brief overview of unsupervised neural speech representation learning,” arXiv preprint arXiv:2203.01829, 2022.
  • [7] Fan Zhiyun et al., “Exploring wav2vec 2.0 on speaker verification and language identification,” arXiv preprint arXiv:2012.06185, 2020.
  • [8] Chowdhury Shammur Absar et al., “What do end-to-end speech models learn about speaker, language and channel information? a layer-wise and neuron-level analysis,” Computer Speech & Language, 2023.
  • [9] Shammur A Chowdhury, Ahmed Ali, Suwon Shon, and James R Glass, “What does an end-to-end dialect identification model learn about non-dialectal information?,” in INTERSPEECH, 2020.
  • [10] Feng Chi-Luen et al., “Silence is sweeter than speech: Self-supervised model using silence to store speaker information,” arXiv preprint arXiv:2205.03759, 2022.
  • [11] Shah Jui et al., “What all do audio transformer models hear? probing acoustic representations for language delivery and its structure,” arXiv preprint arXiv:2101.00387, 2021.
  • [12] Li Yuanchao et al., “Exploration of a self-supervised speech model: A study on emotional corpora,” in Spoken Language Technology Workshop (SLT). IEEE, 2023.
  • [13] Ji Hang et al., “Predicting within and across language phoneme recognition performance of self-supervised learning speech pre-trained models,” arXiv preprint arXiv:2206.12489, 2022.
  • [14] Pasad Ankita et al., “Layer-wise analysis of a self-supervised speech representation model,” in Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021.
  • [15] Yang Mu et al., “What can an accent identifier learn? probing phonetic and prosodic information in a wav2vec2-based accent identification model,” arXiv preprint arXiv:2306.06524, 2023.
  • [16] Pasad Ankita et al., “Comparative layer-wise analysis of self-supervised speech models,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
  • [17] Chung Yu-An et al., “Similarity analysis of self-supervised speech representations,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
  • [18] Belinkov Yonatan and James Glass, “Analysis methods in neural language processing: A survey,” Transactions of the Association for Computational Linguistics, 2019.
  • [19] Morcos Ari et al., “Insights on representational similarity in neural networks with canonical correlation,” Advances in neural information processing systems, vol. 31, 2018.
  • [20] James L McClelland, David E Rumelhart, and Geoffrey E Hinton, “The appeal of parallel distributed processing,” MIT Press, Cambridge MA, 1986.
  • [21] Wu John et al., “Similarity analysis of contextual word representation models,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, July 2020, pp. 4638–4655, Association for Computational Linguistics.
  • [22] Raghu Maithra et al., “Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability,” Advances in neural information processing systems, vol. 30, 2017.
  • [23] Garofolo John S, “Timit acoustic phonetic continuous speech corpus,” Linguistic Data Consortium, 1993.
  • [24] Wu Felix et al., “Performance-efficiency trade-offs in unsupervised pre-training for speech recognition,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.
  • [25] Kornblith Simon et al., “Similarity of neural network representations revisited,” in International conference on machine learning. PMLR, 2019, pp. 3519–3529.