Search | arXiv e-print repository

Cooperative Gradient Coding for Collaborative Federated Learning

Authors: Shudi Weng, Chengxi Li, Ming Xiao, Mikael Skoglund

Abstract: We investigate federated learning (FL) in the presence of stragglers, with emphasis on wireless scenarios where the power-constrained edge devices collaboratively train a global model on their local datasets and transmit local model updates through fading channels. To tackle stragglers resulting from link disruptions without requiring accurate prior information on connectivity or dataset sharing, we propose a gradient coding (GC) scheme based on cooperative communication, which remains valid for general collaborative federated learning. Furthermore, we conduct an outage analysis of the proposed scheme, based on which we conduct the convergence analysis. The simulation results reveal the superiority of the proposed strategy in the presence of stragglers, especially under imbalanced data distribution.

Submitted 22 April, 2024; v1 submitted 31 March, 2024; originally announced April 2024.

arXiv:2402.18147 [pdf, other]

A Lightweight Low-Light Image Enhancement Network via Channel Prior and Gamma Correction

Authors: Shyang-En Weng, Shaou-Gang Miaou, Ricky Christanto

Abstract: Human vision relies heavily on available ambient light to perceive objects. Low-light scenes pose two distinct challenges: information loss due to insufficient illumination and undesirable brightness shifts. Low-light image enhancement (LLIE) refers to image enhancement technology tailored to handle this scenario. We introduce CPGA-Net, an innovative LLIE network that combines dark/bright channel priors and gamma correction via deep learning and integrates features inspired by the Atmospheric Scattering Model and the Retinex Theory. This approach combines the use of traditional and deep learning methodologies, designed within a simple yet efficient architectural framework that focuses on essential feature extraction. The resulting CPGA-Net is a lightweight network with only 0.025 million parameters and 0.030 seconds for inference time, yet it achieves superior performance over existing LLIE methods on both objective and subjective evaluation criteria. Furthermore, we utilized knowledge distillation with explainable factors and proposed an efficient version that achieves 0.018 million parameters and 0.006 seconds for inference time. The proposed approaches inject new solution ideas into LLIE, providing practical applications in challenging low-light scenarios.

Submitted 28 February, 2024; originally announced February 2024.

Comments: Preprint of an article submitted for consideration in [International Journal of Pattern Recognition and Artificial Intelligence] © [2024] [copyright World Scientific Publishing Company] [https://www.worldscientific.com/worldscinet/ijprai]

arXiv:2209.08606 [pdf, other]

Wideband mmWave Massive MIMO Channel Estimation and Localization

Authors: Shudi Weng, Fan Jiang, Henk Wymeersch

Abstract: Spatial wideband effects are known to affect channel estimation and localization performance in millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) systems. Based on perturbation analysis, we show that the spatial wideband effect is in fact more pronounced than previously thought and significantly degrades performance, even at moderate bandwidths, if it is not properly considered in the algorithm design. We propose a novel channel estimation method based on multidimensional ESPRIT per subcarrier, combined with unsupervised learning for pairing across subcarriers, which shows significant performance gain over existing schemes under wideband conditions.

Submitted 18 September, 2022; originally announced September 2022.

arXiv:2104.04221 [pdf]

The NTNU Taiwanese ASR System for Formosa Speech Recognition Challenge 2020

Authors: Fu-An Chao, Tien-Hong Lo, Shi-Yan Weng, Shih-Hsuan Chiu, Yao-Ting Sung, Berlin Chen

Abstract: This paper describes the NTNU ASR system participating in the Formosa Speech Recognition Challenge 2020 (FSR-2020) supported by the Formosa Speech in the Wild project (FSW). FSR-2020 aims at fostering the development of Taiwanese speech recognition. Apart from the issues on tonal and dialectical variations of the Taiwanese language, speech artificially contaminated with different types of real-world noise also has to be dealt with in the final test stage; all of these make FSR-2020 much more challenging than before. To work around the under-resourced issue, the main technical aspects of our ASR system include various deep learning techniques, such as transfer learning, semi-supervised learning, front-end speech enhancement and model ensemble, as well as data cleansing and data augmentation conducted on the training data. With the best configuration, our system obtains 13.1 % syllable error rate (SER) on the final-test set, achieving the first place among all participating systems on Track 3.

Submitted 9 July, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

Comments: 17 pages, 3 figures, Accepted for publication in IJCLCLP

arXiv:2010.14764

Effective Decoder Masking for Transformer Based End-to-End Speech Recognition

Authors: Shi-Yan Weng, Berlin Chen

Abstract: The attention-based encoder-decoder modeling paradigm has achieved promising results on a variety of speech processing tasks such as automatic speech recognition (ASR) and text-to-speech (TTS), among others. This paradigm takes advantage of the generalization ability of neural networks to learn a direct mapping from an input sequence to an output sequence, without recourse to prior knowledge such as audio-text alignments or pronunciation lexicons. However, ASR models stemming from this paradigm are prone to overfitting, especially when the training data is limited. Inspired by SpecAugment and BERT-like masked language modeling, we propose in the paper a decoder masking based training approach for end-to-end (E2E) ASR models. During the training phase we randomly replace some portions of the decoder's historical text input with the symbol [mask], in order to encourage the decoder to robustly output a correct token even when parts of its decoding history are masked or corrupted. The proposed approach is instantiated with the top-of-the-line transformer-based E2E ASR model. Extensive experiments on the Librispeech960h and TedLium2 benchmark datasets demonstrate the superior performance of our approach in comparison to some existing strong E2E ASR systems.

Submitted 21 July, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

Comments: More extensions and experiments are under exploration

arXiv:2005.08440 [pdf]

An Effective End-to-End Modeling Approach for Mispronunciation Detection

Authors: Tien-Hong Lo, Shi-Yan Weng, Hsiu-Jui Chang, Berlin Chen

Abstract: Recently, end-to-end (E2E) automatic speech recognition (ASR) systems have garnered tremendous attention because of their great success and unified modeling paradigms in comparison to conventional hybrid DNN-HMM ASR systems. Despite the widespread adoption of E2E modeling frameworks on ASR, there still is a dearth of work on investigating the E2E frameworks for use in computer-assisted pronunciation learning (CAPT), particularly for mispronunciation detection (MD). In response, we first present a novel use of the hybrid CTC-Attention approach to the MD task, taking advantage of the strengths of both CTC and the attention-based model while getting around the need for phone-level forced alignment. Second, we perform input augmentation with text prompt information to make the resulting E2E model more tailored for the MD task. On the other hand, we adopt two MD decision methods so as to better cooperate with the proposed framework: 1) decision-making based on a recognition confidence measure or 2) simply based on speech recognition results. A series of Mandarin MD experiments demonstrate that our approach not only simplifies the processing pipeline of existing hybrid DNN-HMM systems but also brings about systematic and substantial performance improvements. Furthermore, input augmentation with text prompts seems to hold excellent promise for the E2E-based MD approach.

Submitted 17 May, 2020; originally announced May 2020.

Comments: Submitted to Interspeech 2020

arXiv:2005.08433 [pdf, other]

The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge

Authors: Tien-Hong Lo, Fu-An Chao, Shi-Yan Weng, Berlin Chen

Abstract: This paper describes the NTNU ASR system participating in the Interspeech 2020 Non-Native Children's Speech ASR Challenge supported by the SIG-CHILD group of ISCA. This ASR shared task is made much more challenging due to the coexisting diversity of non-native and children speaking characteristics. In the setting of closed-track evaluation, all participants were restricted to develop their systems merely based on the speech and text corpora provided by the organizer. To work around this under-resourced issue, we built our ASR system on top of CNN-TDNNF-based acoustic models, meanwhile harnessing the synergistic power of various data augmentation strategies, including both utterance- and word-level speed perturbation and spectrogram augmentation, alongside a simple yet effective data-cleansing approach. All variants of our ASR system employed an RNN-based language model to rescore the first-pass recognition hypotheses, which was trained solely on the text dataset released by the organizer. Our system with the best configuration came out in second place, resulting in a word error rate (WER) of 17.59%, while those of the top-performing, second runner-up and official baseline systems are 15.67%, 18.71%, and 35.09%, respectively.

Submitted 2 June, 2020; v1 submitted 17 May, 2020; originally announced May 2020.

Comments: Submitted to Interspeech 2020 Special Session: Shared Task on Automatic Speech Recognition for Non-Native Children's Speech

Showing 1–7 of 7 results for author: Weng, S