Search | arXiv e-print repository

Homophone Disambiguation Reveals Patterns of Context Mixing in Speech Transformers

Authors: Hosein Mohebbi, Grzegorz Chrupała, Willem Zuidema, Afra Alishahi

Abstract: Transformers have become a key architecture in speech processing, but our understanding of how they build up representations of acoustic and linguistic structure is limited. In this study, we address this gap by investigating how measures of 'context-mixing' developed for text models can be adapted and applied to models of spoken language. We identify a linguistic phenomenon that is ideal for such a case study: homophony in French (e.g. livre vs livres), where a speech recognition model has to attend to syntactic cues such as determiners and pronouns in order to disambiguate spoken words with identical pronunciations and transcribe them while respecting grammatical agreement. We perform a series of controlled experiments and probing analyses on Transformer-based speech models. Our findings reveal that representations in encoder-only models effectively incorporate these cues to identify the correct transcription, whereas encoders in encoder-decoder models mainly relegate the task of capturing contextual dependencies to decoder modules.

Submitted 15 October, 2023; originally announced October 2023.

Comments: Accepted to EMNLP 2023 (main)

arXiv:2310.03686

DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers

Authors: Anna Langedijk, Hosein Mohebbi, Gabriele Sarti, Willem Zuidema, Jaap Jumelet

Abstract: In recent years, many interpretability methods have been proposed to help interpret the internal states of Transformer-models, at different levels of precision and complexity. Here, to analyze encoder-decoder Transformers, we propose a simple, new method: DecoderLens. Inspired by the LogitLens (for decoder-only Transformers), this method involves allowing the decoder to cross-attend representations of intermediate encoder layers instead of using the final encoder output, as is normally done in encoder-decoder models. The method thus maps previously uninterpretable vector representations to human-interpretable sequences of words or symbols. We report results from the DecoderLens applied to models trained on question answering, logical reasoning, speech recognition and machine translation. The DecoderLens reveals several specific subtasks that are solved at low or intermediate layers, shedding new light on the information flow inside the encoder component of this important class of models.

Submitted 3 April, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

Comments: Accepted to Findings of NAACL 2024

arXiv:2301.12971

Quantifying Context Mixing in Transformers

Authors: Hosein Mohebbi, Willem Zuidema, Grzegorz Chrupała, Afra Alishahi

Abstract: Self-attention weights and their transformed variants have been the main source of information for analyzing token-to-token interactions in Transformer-based models. But despite their ease of interpretation, these weights are not faithful to the models' decisions as they are only one part of an encoder, and other components in the encoder layer can have considerable impact on information mixing in the output representations. In this work, by expanding the scope of analysis to the whole encoder block, we propose Value Zeroing, a novel context mixing score customized for Transformers that provides us with a deeper understanding of how information is mixed at each encoder layer. We demonstrate the superiority of our context mixing score over other analysis methods through a series of complementary evaluations with different viewpoints based on linguistically informed rationales, probing, and faithfulness analysis.

Submitted 8 February, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

Comments: Accepted to EACL 2023 (main)

arXiv:2203.08991

AdapLeR: Speeding up Inference by Adaptive Length Reduction

Authors: Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar

Abstract: Pre-trained language models have shown stellar performance in various downstream tasks. But, this usually comes at the cost of high latency and computation, hindering their usage in resource-limited settings. In this work, we propose a novel approach for reducing the computational cost of BERT with minimal loss in downstream performance. Our method dynamically eliminates less contributing tokens through layers, resulting in shorter lengths and consequently lower computational cost. To determine the importance of each token representation, we train a Contribution Predictor for each layer using a gradient-based saliency method. Our experiments on several diverse classification tasks show speedups up to 22x during inference time without much sacrifice in performance. We also validate the quality of the selected tokens in our method using human annotations in the ERASER benchmark. In comparison to other widely used strategies for selecting important tokens, such as saliency and attention, our proposed method has a significantly lower false positive rate in generating rationales. Our code is freely available at https://github.com/amodaresi/AdapLeR.

Submitted 16 March, 2022; originally announced March 2022.

Comments: Accepted to ACL 2022 (main conference)

arXiv:2203.07445

doi 10.1103/PhysRevApplied.18.034009

Fluctuation Spectroscopy of Two-Level Systems in Superconducting Resonators

Authors: J. H. Béjanin, Y. Ayadi, X. Xu, C. Zhu, H. R. Mohebbi, M. Mariantoni

Abstract: Superconducting quantum computing is experiencing tremendous growth. Although major milestones have already been achieved, useful quantum-computing applications are hindered by a variety of decoherence phenomena. Decoherence due to two-level systems (TLSs) hosted by amorphous dielectric materials is ubiquitous in planar superconducting devices. We use high-quality quasilumped element resonators as quantum sensors to investigate TLS-induced loss and noise. We perform two-tone experiments with a probe and pump electric field; the pump is applied at different power levels and detunings. We measure and analyze time series of the quality factor and resonance frequency for very long time periods, up to 1000 h. We additionally carry out simulations based on the TLS interacting model in the presence of a pump field. We find that loss and noise are reduced at medium and high power, matching the simulations, but not at low power.

Submitted 14 March, 2022; originally announced March 2022.

Comments: 20 two-column pages (including App. and Supplement), 12 figures, 3 tables

Journal ref: Phys. Rev. Applied 18, 034009 (2022)

arXiv:2109.05958

Not All Models Localize Linguistic Knowledge in the Same Place: A Layer-wise Probing on BERToids' Representations

Authors: Mohsen Fayyaz, Ehsan Aghazadeh, Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar

Abstract: Most of the recent works on probing representations have focused on BERT, with the presumption that the findings might be similar to the other models. In this work, we extend the probing studies to two other models in the family, namely ELECTRA and XLNet, showing that variations in the pre-training objectives or architectural choices can result in different behaviors in encoding linguistic information in the representations. Most notably, we observe that ELECTRA tends to encode linguistic knowledge in the deeper layers, whereas XLNet instead concentrates that in the earlier layers. Also, the former model undergoes a slight change during fine-tuning, whereas the latter experiences significant adjustments. Moreover, we show that drawing conclusions based on the weight mixing evaluation strategy -- which is widely used in the context of layer-wise probing -- can be misleading given the norm disparity of the representations across different layers. Instead, we adopt an alternative information-theoretic probing with minimum description length, which has recently been proven to provide more reliable and informative results.

Submitted 15 September, 2021; v1 submitted 13 September, 2021; originally announced September 2021.

Comments: Accepted to BlackboxNLP Workshop at EMNLP 2021

arXiv:2104.01477

Exploring the Role of BERT Token Representations to Explain Sentence Probing Results

Authors: Hosein Mohebbi, Ali Modarressi, Mohammad Taher Pilehvar

Abstract: Several studies have been carried out on revealing linguistic features captured by BERT. This is usually achieved by training a diagnostic classifier on the representations obtained from different layers of BERT. The subsequent classification accuracy is then interpreted as the ability of the model in encoding the corresponding linguistic property. Despite providing insights, these studies have left out the potential role of token representations. In this paper, we provide a more in-depth analysis on the representation space of BERT in search for distinct and meaningful subspaces that can explain the reasons behind these probing results. Based on a set of probing tasks and with the help of attribution methods we show that BERT tends to encode meaningful knowledge in specific token representations (which are often ignored in standard classification setups), allowing the model to detect syntactic and semantic abnormalities, and to distinctively separate grammatical number and tense subspaces.

Submitted 11 September, 2021; v1 submitted 3 April, 2021; originally announced April 2021.

Comments: Accepted to EMNLP 2021 (main conference)

arXiv:1910.01165

Indicators of retention in remote digital health studies: A cross-study evaluation of 100,000 participants

Authors: Abhishek Pratap, Elias Chaibub Neto, Phil Snyder, Carl Stepnowsky, Noémie Elhadad, Daniel Grant, Matthew H. Mohebbi, Sean Mooney, Christine Suver, John Wilbanks, Lara Mangravite, Patrick Heagerty, Pat Arean, Larsson Omberg

Abstract: Digital technologies such as smartphones are transforming the way scientists conduct biomedical research using real-world data. Several remotely-conducted studies have recruited thousands of participants over a span of a few months. Unfortunately, these studies are hampered by substantial participant attrition, calling into question the representativeness of the collected data including generalizability of findings from these studies. We report the challenges in retention and recruitment in eight remote digital health studies comprising over 100,000 participants who participated for more than 850,000 days, completing close to 3.5 million remote health evaluations. Survival modeling surfaced several factors significantly associated (P < 1e-16) with increase in median retention time: i) Clinician referral (increase of 40 days), ii) Effect of compensation (22 days), iii) Clinical conditions of interest to the study (7 days) and iv) Older adults (4 days). Additionally, four distinct patterns of daily app usage behavior that were also associated (P < 1e-10) with participant demographics were identified. Most studies were not able to recruit a representative sample, either demographically or regionally. Combined, these findings can help inform recruitment and retention strategies to enable equitable participation of populations in future digital health research.

Submitted 2 October, 2019; originally announced October 2019.

arXiv:1812.03227

Magnetic hysteresis of a superconducting microstrip resonator with a high edge barrier

Authors: Sangil Kwon, Yong-Chao Tang, Hamid R. Mohebbi, David G. Cory, Guo-Xing Miao

Abstract: We investigate the magnetic hysteresis of a superconducting microstrip resonator with a high edge barrier. We measure the magnetic hysteresis while either sweeping a magnetic field or tuning the edge barrier by high microwave current. We show that the magnetic hysteresis of such a device is qualitatively different from that of one without an edge barrier and can be understood based on the generalized critical-state model. In particular, we propose and demonstrate a simple and intuitive method that relies on a plot of the quality factor versus the resonance frequency for revealing the physical processes behind those hysteretic behaviors. Based on this, we find that the interplay between the Meissner current and vortex pinning is essential for understanding the magnetic hysteresis of such a device.

Submitted 7 December, 2018; originally announced December 2018.

arXiv:1811.09170

doi 10.1063/1.5121758

Engineering Nonlinear Response of Superconducting Niobium Microstrip Resonators via Aluminum Cladding

Authors: Sangil Kwon, Yong-Chao Tang, Hamid R. Mohebbi, Olaf W. B. Benningshof, David G. Cory, Guo-Xing Miao

Abstract: In this work, we find that Al cladding on Nb microstrip resonators is an efficient way to suppress nonlinear responses induced by local Joule heating, resulting in improved microwave power handling capability. This improvement is likely due to the proximity effect between the Al and the Nb layers. The proximity effect is found to be controllable by tuning the thickness of the Al layer. We show that improving the film quality is also helpful as it enhances the microwave critical current density, but it cannot eliminate the local heating.

Submitted 15 October, 2019; v1 submitted 22 November, 2018; originally announced November 2018.

Journal ref: J. Appl. Phys. 126, 173906 (2019)

arXiv:1802.05183

doi 10.1063/1.5027003

Magnetic Field Dependent Microwave Losses in Superconducting Niobium Microstrip Resonators

Authors: Sangil Kwon, Anita Fadavi Roudsari, Olaf W. B. Benningshof, Yong-Chao Tang, Hamid R. Mohebbi, Ivar A. J. Taminiau, Deler Langenberg, Shinyoung Lee, George Nichols, David G. Cory, Guo-Xing Miao

Abstract: We describe an experimental protocol to characterize magnetic field dependent microwave losses in superconducting niobium microstrip resonators. Our approach provides a unified view that covers two well-known magnetic field dependent loss mechanisms: quasiparticle generation and vortex motion. We find that quasiparticle generation is the dominant loss mechanism for parallel magnetic fields. For perpendicular fields, the dominant loss mechanism is vortex motion or switches from quasiparticle generation to vortex motion, depending on cooling procedures. In particular, we introduce a plot of the quality factor versus the resonance frequency as a general method for identifying the dominant loss mechanism. We calculate the expected resonance frequency and the quality factor as a function of the magnetic field by modeling the complex resistivity. Key parameters characterizing microwave loss are estimated from comparisons of the observed and expected resonator properties. Based on these key parameters, we find a niobium resonator whose thickness is similar to its penetration depth is the best choice for X-band electron spin resonance applications. Finally, we detect partial release of the Meissner current at the vortex penetration field, suggesting that the interaction between vortices and the Meissner current near the edges is essential to understand the magnetic field dependence of the resonator properties.

Submitted 26 June, 2018; v1 submitted 14 February, 2018; originally announced February 2018.

Journal ref: Journal of Applied Physics 124, 033903 (2018)

arXiv:1711.01392

doi 10.1117/1.JBO.23.1.016013

A Learnable Despeckling Framework for Optical Coherence Tomography Images

Authors: Saba Adabi, Elaheh Rashedi, Hamed Mohebbi, Xue-wen Chen, Silvia Conforto, Mohammad. R. Nasiriavanaki

Abstract: Optical coherence tomography (OCT) is a prevalent, interferometric, high-resolution imaging method with broad biomedical applications. Nonetheless, OCT images suffer from an artifact called speckle, which degrades the image quality. Digital filters offer an opportunity for image improvement in clinical OCT devices where hardware modification to enhance images is expensive. To reduce speckle, a wide variety of digital filters have been proposed; selecting the most appropriate filter for each OCT image or image set is a challenging decision. To tackle this challenge, we propose an expandable learnable despeckling framework, which we call LDF. LDF decides which speckle reduction algorithm is most effective on a given image by learning a figure of merit (FOM) as a single quantitative image assessment measure. The architecture of LDF includes two main parts: (i) an autoencoder neural network and (ii) a filter classifier. The autoencoder learns the figure of merit based on the quality assessment measures obtained from the OCT image. Subsequently, the filter classifier identifies the most efficient filter from the following categories: (a) sliding window filters including median, mean, and symmetric nearest neighborhood, (b) adaptive statistical based filters including Wiener, homomorphic Lee, and Kuwahara, and (c) edge preserved patch or pixel correlation based filters including non-local mean, total variation, and block matching 3D filtering.

Submitted 4 November, 2017; originally announced November 2017.

Comments: under review

Journal ref: Journal of Biomedical Optics (2018)

Showing 1–12 of 12 results for author: Mohebbi, H