Showing 1–13 of 13 results for author: Kunzmann, S

Searching in archive cs.
  1. arXiv:2405.14377  [pdf, other]

    cs.LG cs.AI

    CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization

    Authors: Zi Yang, Samridhi Choudhary, Xinfeng Xie, Cao Gao, Siegfried Kunzmann, Zheng Zhang

    Abstract: Training large AI models such as deep learning recommendation systems and foundation language (or multi-modal) models costs massive GPUs and computing time. The high training cost has become only affordable to big tech companies, meanwhile also causing increasing concerns about the environmental impact. This paper presents CoMERA, a Computing- and Memory-Efficient training method via Rank-Adaptive…

    Submitted 23 May, 2024; originally announced May 2024.

  2. arXiv:2306.01076  [pdf, ps, other]

    cs.CL cs.AI

    Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding

    Authors: Zi Yang, Samridhi Choudhary, Siegfried Kunzmann, Zheng Zhang

    Abstract: Fine-tuned transformer models have shown superior performances in many natural language tasks. However, the large model size prohibits deploying high-performance transformer models on resource-constrained devices. This paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and ultimately runtime latency of transformer-based models.…

    Submitted 8 July, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

  3. arXiv:2304.01905  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Dual-Attention Neural Transducers for Efficient Wake Word Spotting in Speech Recognition

    Authors: Saumya Y. Sahai, Jing Liu, Thejaswi Muniyappa, Kanthashree M. Sathyendra, Anastasios Alexandridis, Grant P. Strimel, Ross McGowan, Ariya Rastrow, Feng-Ju Chang, Athanasios Mouchtaris, Siegfried Kunzmann

    Abstract: We present dual-attention neural biasing, an architecture designed to boost Wake Words (WW) recognition and improve inference time latency on speech recognition tasks. This architecture enables a dynamic switch for its runtime compute paths by exploiting WW spotting to select which branch of its attention networks to execute for an input audio frame. With this approach, we effectively improve WW s…

    Submitted 4 April, 2023; v1 submitted 2 April, 2023; originally announced April 2023.

    Comments: Accepted to Proc. IEEE ICASSP 2023

  4. arXiv:2211.06146  [pdf, other]

    eess.IV cs.CV

    An unobtrusive quality supervision approach for medical image annotation

    Authors: Sonja Kunzmann, Mathias Öttl, Prathmesh Madhu, Felix Denzinger, Andreas Maier

    Abstract: Image annotation is one essential prior step to enable data-driven algorithms. In medical imaging, having large and reliably annotated data sets is crucial to recognize various diseases robustly. However, annotator performance varies immensely, thus impacts model training. Therefore, often multiple annotators should be employed, which is however expensive and resource-intensive. Hence, it is desir…

    Submitted 22 November, 2022; v1 submitted 11 November, 2022; originally announced November 2022.

    Comments: 4 pages, 4 figures

  5. arXiv:2205.13660  [pdf, other]

    cs.CL cs.LG

    Contextual Adapters for Personalized Speech Recognition in Neural Transducers

    Authors: Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Feng-Ju Chang, Jing Liu, Jinru Su, Grant P. Strimel, Athanasios Mouchtaris, Siegfried Kunzmann

    Abstract: Personal rare word recognition in end-to-end Automatic Speech Recognition (E2E ASR) models is a challenge due to the lack of training data. A standard way to address this issue is with shallow fusion methods at inference time. However, due to their dependence on external language models and the deterministic approach to weight boosting, their performance is limited. In this paper, we propose train…

    Submitted 26 May, 2022; originally announced May 2022.

    Comments: Accepted at ICASSP 2022

  6. arXiv:2111.03663  [pdf]

    eess.IV cs.CV cs.LG

    First steps on Gamification of Lung Fluid Cells Annotations in the Flower Domain

    Authors: Sonja Kunzmann, Christian Marzahl, Felix Denzinger, Christof A. Bertram, Robert Klopfleisch, Katharina Breininger, Vincent Christlein, Andreas Maier

    Abstract: Annotating data, especially in the medical domain, requires expert knowledge and a lot of effort. This limits the amount and/or usefulness of available medical data sets for experimentation. Therefore, developing strategies to increase the number of annotations while lowering the needed domain knowledge is of interest. A possible strategy is the use of gamification, i.e. transforming the annotatio…

    Submitted 17 January, 2022; v1 submitted 5 November, 2021; originally announced November 2021.

    Comments: 6 pages, 4 figures

  7. arXiv:2111.03250  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Context-Aware Transformer Transducer for Speech Recognition

    Authors: Feng-Ju Chang, Jing Liu, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo, Ariya Rastrow, Siegfried Kunzmann

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) systems often have difficulty recognizing uncommon words, that appear infrequently in the training data. One promising method, to improve the recognition accuracy on such rare words, is to latch onto personalized/contextual information at inference. In this work, we present a novel context-aware transformer transducer (CATT) network that improves…

    Submitted 5 November, 2021; originally announced November 2021.

    Comments: Accepted to ASRU 2021

  8. arXiv:2111.00400  [pdf, other]

    cs.CL cs.SD eess.AS

    FANS: Fusing ASR and NLU for on-device SLU

    Authors: Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann, Ariya Rastrow

    Abstract: Spoken language understanding (SLU) systems translate voice input commands to semantics which are encoded as an intent and pairs of slot tags and values. Most current SLU systems deploy a cascade of two neural models where the first one maps the input audio to a transcript (ASR) and the second predicts the intent and slots from the transcript (NLU). In this paper, we introduce FANS, a new end-to-e…

    Submitted 30 October, 2021; originally announced November 2021.

    Comments: Published in Interspeech 2021

  9. arXiv:2106.06126  [pdf, other]

    cs.SD cs.LG eess.AS

    Exploiting Large-scale Teacher-Student Training for On-device Acoustic Models

    Authors: Jing Liu, Rupak Vignesh Swaminathan, Sree Hari Krishnan Parthasarathi, Chunchuan Lyu, Athanasios Mouchtaris, Siegfried Kunzmann

    Abstract: We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AM) with experiments spanning over 3000 hours of GPU time, making our study one of the largest of its kind. We discuss SSL for AMs in a small footprint setting, showing that a smaller capacity model trained with 1 million hours of unsupervised data can outperform a baseline supervised system by 14.3% w…

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: TSD2021

  10. arXiv:2102.03951  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Multi-Channel Transformer for Speech Recognition

    Authors: Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Brian King, Siegfried Kunzmann

    Abstract: Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage the neural transformer architectures for multi-channel speech recognition systems, where the spectral and spatial information collected from different microphones are integrated using attention layers. Our multi-channel transformer network mainly consist…

    Submitted 7 February, 2021; originally announced February 2021.

    Comments: Accepted by 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021)

  11. arXiv:2011.09044  [pdf, other]

    eess.AS cs.CL cs.SD

    Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end Spoken Language Understanding

    Authors: Bhuvan Agrawal, Markus Müller, Martin Radfar, Samridhi Choudhary, Athanasios Mouchtaris, Siegfried Kunzmann

    Abstract: End-to-end (E2E) spoken language understanding (SLU) systems can infer the semantics of a spoken utterance directly from an audio signal. However, training an E2E system remains a challenge, largely due to the scarcity of paired audio-semantics data. In this paper, we treat an E2E system as a multi-modal model, with audio and text functioning as its two modalities, and use a cross-modal latent spa…

    Submitted 15 April, 2021; v1 submitted 17 November, 2020; originally announced November 2020.

    Comments: 7 pages, 6 figures

  12. arXiv:2008.10984  [pdf, other]

    cs.CL cs.SD eess.AS

    End-to-End Neural Transformer Based Spoken Language Understanding

    Authors: Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann

    Abstract: Spoken language understanding (SLU) refers to the process of inferring the semantic information from audio signals. While the neural transformers consistently deliver the best performance among the state-of-the-art neural architectures in the field of natural language processing (NLP), their merits in a closely related field, i.e., spoken language understanding (SLU) have not been investigated. In thi…

    Submitted 12 August, 2020; originally announced August 2020.

    Comments: Interspeech 2020

  13. arXiv:2007.03900  [pdf, other]

    eess.AS cs.CL cs.SD

    Streaming End-to-End Bilingual ASR Systems with Joint Language Identification

    Authors: Surabhi Punjabi, Harish Arsikere, Zeynab Raeesy, Chander Chandak, Nikhil Bhave, Ankish Bansal, Markus Müller, Sergio Murillo, Ariya Rastrow, Sri Garimella, Roland Maas, Mat Hans, Athanasios Mouchtaris, Siegfried Kunzmann

    Abstract: Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream pr…

    Submitted 8 July, 2020; originally announced July 2020.