\section{Introduction}

Advances in deep learning have brought unprecedented improvements to automatic speech recognition (ASR). Recent models such as Whisper and its variants \cite{radford2022robust,bain23_interspeech} have demonstrated remarkable accuracy and robustness in recognizing speech across languages, accents, and noisy environments. Despite these advances, significant challenges remain around privacy and speaker diversity when adapting and deploying these models in real-world applications.

The first concern when deploying ASR models is preserving the privacy of the data used in both training and inference. This requirement is crucial when the ASR system must transcribe and analyze audio from speakers who do not wish to disclose sensitive or private information \citep{ahmed2020preech,liu2022private}. Tackling this challenge usually involves cutting the ASR system’s connection to the cloud and deploying the model on local edge devices \citep{ahmed2020preech,liu2022private}. Such devices often have constrained computational power and resources, which makes it difficult to deploy large, resource-intensive ASR models. Smaller models can be trained and can perform real-time inference on the edge, but they are often not expressive enough to do so accurately. This presents the challenge of balancing model accuracy with deployment feasibility.

The second concern, which arises when deploying smaller ASR models, is service quality for a diverse population. Fine-tuning ASR models for specific downstream tasks is often necessary to ensure the best transcription quality, especially for smaller models. Traditional fine-tuning approaches \citep{survey2024,zaiem2023finetuning,attia2024kidwhisper} typically adapt the model to a specific task or population using \textit{one single} dataset from the downstream task, under the assumption that the task targets speakers who share similar characteristics. While these methods can enhance performance, they often fall short when dealing with the diverse speech characteristics of a broad population. This inability to adjust to a diverse distribution makes those models less practical and presents the challenge of making ASR inclusive and fair to a more diverse population.

Recent works have sought to improve ASR models’ ability to transcribe diverse speech accurately by incorporating additional speaker characteristics, mainly accent information, during training \citep{winata2020learning,prabhu2023accented}. Although these approaches improve representations through the additional accent information, they assume a static target distribution, that is, a fixed training set with a limited number of accents. When the target distribution evolves to include new accents, these frameworks cannot learn the new accents continually and must be retrained, as shown in Table~\ref{tab:p-whisper-comparison}. Compared to continual learning, retraining incurs a total training time that is quadratic in the data size, because each retraining pass must revisit all previously seen data to prevent catastrophic forgetting. Moreover, incorporating characteristics beyond accent, such as gender or age, into existing frameworks remains non-trivial.

To address these challenges, we propose P-Whisper, a novel lightweight ASR framework that employs multiple LoRA profiles in parallel to fine-tune ASR models and complete downstream tasks. P-Whisper constructs a library of LoRA profiles for each type of characteristic, with structurally identical profiles within each library. By dynamically selecting and combining multiple parallel profiles that accurately represent the speaker’s complex background, P-Whisper enables a more nuanced and effective adaptation process, enhancing transcription quality across a diverse population and improving fairness for marginalized groups. This approach keeps ASR models both high-performing and deployable on local edge devices, maintaining privacy, inclusivity, and operational feasibility. P-Whisper is particularly suited to scenarios where speakers present diverse speech characteristics and not all of those characteristics are represented in the training data at the outset. Ultimately, P-Whisper bridges the gap between advanced ASR capabilities and the practical requirements of real-world applications, offering a robust solution for inclusive and efficient speech recognition. We summarize our contributions as follows:

\begin{itemize}
  \item We introduce a lightweight ASR framework that captures diverse speaker characteristics and uses them to enhance transcription quality.
  \item We demonstrate the effectiveness of merging multiple LoRA profiles for ASR tasks.
  \item We empirically evaluate P-Whisper, showing improved performance and fairness over baselines.
  \item We measure the inference latency of the P-Whisper framework on edge devices, showing that off-the-shelf ASR models can achieve near-real-time inference even on resource-constrained devices such as the Raspberry Pi.
\end{itemize}

In addition to offering the features listed in Table~\ref{tab:p-whisper-comparison}, P-Whisper achieves, to the best of our knowledge, state-of-the-art performance on downstream transcription tasks among ASR models of similar size, owing to its use of speaker information. Compared to traditional fine-tuning with a single LoRA profile, P-Whisper’s parallel-profile framework yields up to a 13.7\% relative reduction in word error rate (WER) without additional training overhead.
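
For reference, a relative WER reduction of this kind is conventionally computed as follows. The symbols $\mathrm{WER}_{\mathrm{base}}$ (single-profile fine-tuning) and $\mathrm{WER}_{\mathrm{new}}$ (parallel profiles) are notation introduced here for illustration only, and the numbers in the example are hypothetical, not results from our experiments.
\[
  \mathrm{WER} = \frac{S + D + I}{N},
  \qquad
  \Delta_{\mathrm{rel}} = \frac{\mathrm{WER}_{\mathrm{base}} - \mathrm{WER}_{\mathrm{new}}}{\mathrm{WER}_{\mathrm{base}}} \times 100\%,
\]
where $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted words, and $N$ is the number of words in the reference transcript. For instance, a hypothetical baseline WER of $10.0\%$ that drops to $8.63\%$ gives $\Delta_{\mathrm{rel}} = (10.0 - 8.63)/10.0 \times 100\% = 13.7\%$. This is a relative figure; the corresponding absolute reduction in this example is only $1.37$ percentage points.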