ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs

Ahmed Heakl Youssef Zaghloul Mennatullah Ali Rania Hossam Walid Gomaa Egypt-Japan University of Science and Technology, New Borg El-Arab City, 21934, Alexandria, Egypt Mansoura University, Mansoura, 35511, Egypt Alexandria University, El-Shatby, 21526, Alexandria, Egypt
Abstract

Motivated by the growing prevalence of code-switching between Egyptian Arabic and English, this paper explores machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as LLaMA and Gemma. In the field of ASR, we explore the use of the Whisper model for code-switched Egyptian Arabic recognition, detailing our experimental procedures, including data preprocessing and training techniques. Through the implementation of a consecutive speech-to-text translation system that integrates ASR with MT, we aim to overcome challenges posed by limited resources and the unique characteristics of the Egyptian Arabic dialect. Evaluation against established metrics shows promising results, with our methodologies yielding a significant improvement of 56% in English translation over the state-of-the-art and 9.3% in Arabic translation. Since code-switching is deeply inherent in spoken languages, it is crucial that ASR systems can effectively handle this phenomenon. This capability is essential for enabling seamless interaction in various domains, including business negotiations, cultural exchanges, and academic discourse. Our models and code are available as open-source resources (code: http://github.com/ahmedheakl/arazn-llm; models: http://huggingface.co/collections/ahmedheakl/arazn-llm-662ceaf12777656607b9524e).

keywords:
Dialectal Egyptian Arabic , Code-Switching , Machine Translation , Automatic Speech Recognition , Large Language Models

1 Introduction

The term “code-switching” describes the phenomenon of a bilingual or multilingual speaker switching between two or more languages [17]. It has grown to be a prominent phenomenon in multilingual societies around the globe, particularly in the Arab world [16]. In Egypt, code-switching is a significant and common linguistic phenomenon. People’s code choices have been impacted by recent political and social changes in Egypt. As shown in Table 1, code-switching is evident in everyday conversations.

Addressing the complexities of code-switching presents a significant challenge due to the vast range of potential data combinations. Compounding this challenge is the scarcity of resources dedicated to training models on code-switched data. Additionally, the extent to which existing language models have encountered code-switched content during pre-training remains uncertain. Consequently, the ability of these models to effectively transfer knowledge to downstream code-switched tasks remains largely unexplored [16].

Machine translation approaches include direct-based, which uses dictionaries but lacks analysis [20]; rule-based, which leverages linguistic rules but requires manual effort [21]; corpus-based, which relies on data but struggles with low-resource languages [21]; knowledge-based, which incorporates explicit knowledge but struggles with ambiguity [22]; and hybrid, which combines approaches for better quality [23]. Arabic and English have different cultural backgrounds, which affects translation. ‘The news warms my heart’ becomes الخبر يثلج صدري in Arabic, where ‘warms’ is translated to ثلج (ices), due to the languages’ origins in different climates: English originated in a cold climate, where warmth is pleasant, whereas Arabic originated in a hot climate, where coolness is pleasant. Human translators can understand these cultural differences, but machine translators may struggle to capture them [19]. The ArzEn corpus serves as a valuable resource for linguistic research and the development of NLP systems capable of handling code-switched Egyptian Arabic-English while preserving cultural aspects [3].

| Code-switched | English | Egyptian Arabic |
| meeting فى الشركة | meeting at the company | اجتماع فى الشركة |
| نجرب اكل italian | try Italian food | نجرب اكل ايطالى |
| اعمل check للemail | I check the email | ببص على رسايلي |
Table 1: Examples of English and Egyptian Arabic human translations.

Our primary contributions are the following:

  1. Translation: Developing translation models using open-source models (LLaMA2, LLaMA3, and Gemma) for code-switched Egyptian Arabic-English, aiming to achieve translations that closely mimic human-generated outputs, from code-switched Egyptian Arabic to either English or Egyptian Arabic.

  2. ASR: Developing an Automatic Speech Recognition (ASR) system using Whisper as a crucial component of a complete pipeline, where spoken code-switched Egyptian Arabic-English utterances are transcribed into written text, which is then translated using machine translation.

  3. Quantization: Quantizing our models so that they can run on users' own CPUs/GPUs, ensuring efficient deployment and utilization of our models.

  4. Evaluation framework: Extending available metrics to enhance the reliability of our models, prioritizing evaluation accuracy and performance.

  5. Open-Sourcing: Making our models and code publicly available to encourage community engagement and further research.

The rest of this paper is organized as follows. Section 2 reviews related literature. Section 3 gives our methodology and experimental work. Section 4 presents our results and discussion, featuring evaluations across multiple metrics. Concluding remarks are provided in Section 5.

2 Related Works

2.1 Enhancements in Code-Switching Resources for Egyptian Arabic

The authors in [17] discussed the phenomenon of code-switching (CSW) in Egyptian movies, where it is prevalent due to the complex linguistic landscape and social variables [27] and where speakers seamlessly blend dialectal Egyptian Arabic with other languages such as English and French. The authors in [6] introduced ArzEn-ST, a three-way speech translation corpus for code-switched Egyptian Arabic-English that extends the ArzEn corpus [3]. They also presented benchmark baseline results for ASR, MT, and speech translation (ST) tasks. In addition, the authors in [4] expanded the existing Egyptian Arabic datasets by introducing a new dataset focused on daily life conversations from movies and songs. This dataset is designed for benchmarking new machine translation models, fine-tuning large language models in few-shot settings, and facilitating research in cross-linguistic analysis and lexical semantics. It also helps in capturing more cultural nuances related to Egyptian Arabic.

2.2 Code-switched corpora

The authors in [3] presented the ArzEn corpus, an Egyptian Arabic-English code-switching spontaneous speech corpus. The corpus comprises 12 hours of recorded interviews with 38 Egyptian bilingual university students and employees. The corpus is designed for Automatic Speech Recognition (ASR) systems and offers insights into linguistic, sociological, and psychological aspects of code-switching. The work done in [6] extends the ArzEn corpus with translation in both primary (Egyptian-Arabic) and secondary (English) languages. The authors in [4] presented ArzEn-MultiGenre corpus comprising 25,557 segment pairs of Egyptian Arabic song lyrics, novels, and TV show subtitles, all manually translated and aligned with their English counterparts.

2.3 The era of Large Language Models (LLMs)

The process of translation requires a complete understanding of linguistic conversion and of the syntactic, grammatical, and cultural dimensions involved. It is more than mapping words between languages [24]. Accurate translation requires a deep understanding of the cultural nuances inherent in both languages, ensuring the preservation of cultural sensitivity and local values [2]. The versatility of LLMs has enabled them to excel in numerous NLP applications, such as text generation (Llama2 [25], ChatGPT [26]) and machine translation (NLLB [31], SeamlessM4T [32], ArzEn-ST [6]). Recent advancements in Large Language Models (LLMs) have led to the development of powerful models like LLaMa2 [25], Gemma (2B, 7B) [1], and LLaMa3 8B, which have demonstrated impressive capabilities in NLP tasks. Notably, these models have been designed to be more computationally efficient, allowing them to be deployed on consumer-grade GPUs. This shift enables researchers and developers to harness the power of LLMs on local machines, facilitating faster experimentation, prototyping, and deployment of AI applications.

2.4 Code-switching Automatic Speech Recognition (CSW-ASR)

Researchers have explored acoustic, linguistic, and pronunciation modeling approaches, including language identification systems [28], parallel recognizers [29], and single-pass methods [30]. The authors in [5] presented Whisper, a speech recognition system trained on 680,000 hours of multilingual and multitask audio data, achieving zero-shot transfer capabilities and approaching human accuracy and robustness. The system’s architecture is based on an encoder-decoder transformer, leveraging a minimalist data processing approach and multitask training.

3 Methodology

In this section, we present the machine translation and automatic speech recognition systems we used.

3.1 Machine Translation (MT)

The task of machine translation is represented by a mapping $\mathcal{T}: X^{S} \rightarrow Y^{T}$, where $\mathcal{T}$ is the machine translation function, $X^{S}$ is the set of source sentences in the source language $S$, each represented as a sequence of tokens $x = (x_{1}, x_{2}, \ldots, x_{n})$, and $Y^{T}$ is the set of translated sentences in the target language $T$. The goal of machine translation is to find the optimal translation $\hat{y}$ that maximizes the likelihood of the target sentence given the source sentence, $\hat{y} = \arg\max_{y \in Y^{T}} P(y \mid x)$, where $P(y \mid x)$ is the conditional probability of the target sentence given the source sentence. Formally, we can define the machine translation problem as:

$$\mathcal{T}^{*} = \arg\min_{\mathcal{T}} \mathbb{E}_{x \sim \mathcal{X}^{\mathcal{S}}}\left[ d(\mathcal{T}(x), y^{*}) \right] \tag{1}$$

where $\mathcal{T}^{*}$ is the optimal machine translation function and $d(\cdot,\cdot)$ is a distance measure between the translated sentence and the reference translation $y^{*}$ (for example, one derived from a similarity score such as BLEU or METEOR). The goal is to find the optimal machine translation function $\mathcal{T}^{*}$ that minimizes the expected distance between the translated sentence and the reference translation. This definition provides a formal framework for the machine translation task and its optimization problem.

We used the well-known ArzEn-ST translation dataset to train all of our models [6]. We adhere to the same train and test splits as described in [3]. Specifically, we utilize the ArzEn-ST test set, comprising 1,402 sentences, and the train set, consisting of 3,344 sentences. To provide our models with a richer context, we also pre-train them on larger datasets, including the entire parallel corpora presented in [4]. This approach enables our models to leverage a broader range of linguistic patterns and cultural nuances.

Data pre-processing involves removing corpus-specific annotations, URLs, and emoticons, as well as converting all text to lowercase. This step is crucial in ensuring that our models focus on the underlying linguistic structures and cultural nuances of the Egyptian-Arabic language.
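
The cleaning steps above can be sketched as follows. This is a minimal illustration, not the exact pipeline used in our experiments: the regular expressions, the bracketed annotation style (e.g. `[noise]`), and the function name `clean_text` are assumptions introduced here for illustration.

```python
import re

# Assumed patterns; the exact rules used on the corpus are not listed in the text.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Hypothetical annotation style: bracketed or angle-bracketed tags such as [noise] or <laugh>.
ANNOTATION_RE = re.compile(r"\[[^\]]*\]|<[^>]*>")
# A few common emoji blocks and ASCII emoticons; illustrative, not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
EMOTICON_RE = re.compile(r"(?<!\S)[:;=][-^']?[)(DPp](?!\S)")


def clean_text(text: str) -> str:
    """Remove URLs, annotations, and emoticons, then lowercase and tidy spaces."""
    text = URL_RE.sub(" ", text)
    text = ANNOTATION_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    text = text.lower()  # only affects the Latin-script (English) tokens
    return " ".join(text.split())


cleaned = clean_text("اعمل Check للEmail [noise] :) https://example.com 😀")
```

Lowercasing leaves the Arabic tokens unchanged, so only the embedded English words are normalized.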

Given the sequential nature of the translation task and the need for culturally enriched translations, we opt for large language models (LLMs) as our primary approach. Specifically, we employ the latest LLMs that can be accommodated by consumer-grade RAM or GPU, including LLaMA3 8B, Gemma1.1 2B, and Gemma1.1 7B [1]. Notably, we utilize the chat version of each model, which has been trained to follow human instructions, thereby facilitating the training process. All models are trained using two T4 GPUs with 16 GB of VRAM each. It is worth noting that these models are decoder-only architectures, which are well suited for sequential generation tasks like machine translation. By leveraging the strengths of these models, we aim to produce culturally fitting translations that capture the nuances of Egyptian Arabic language and culture.

We employed the paged Adam optimizer with weight decay [12] in 32-bit precision for all models, except for LLaMa3, which required 8-bit precision due to its substantial size (8 billion parameters). To accommodate the memory demands of the Adam optimizer, which keeps additional per-parameter state (running moment estimates of the gradients), we trained our models using adapters for LLMs. Specifically, we explored the use of Quantized Low-Rank Adapters (QLoRA) [10] and Weight-Decomposed Low-Rank Adaptation (DoRA) [11], with the latter yielding the most promising results and behaving most similarly to full fine-tuning. We opted for 4-bit quantization with the NormalFloat (NF4) data type in each adapter configuration.

To mitigate memory constraints during training, we leveraged gradient checkpointing [13], which incurs only an additional forward pass per mini-batch, while reducing memory consumption to $O(\sqrt{n})$ for an $n$-layer network. Furthermore, to enable training with effectively large batch sizes while minimizing memory constraints, we used 4 gradient accumulation steps [14]. This approach accumulates the gradients of 4 mini-batches before performing a single optimizer update, which achieves accuracy comparable to updating on a batch 4 times larger at once, while reducing the activation memory required per step by roughly a factor of 4.
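
To illustrate why gradient accumulation is equivalent to a larger batch, the following minimal sketch uses a scalar linear model with a mean-squared-error loss. The toy data, the model, and the helper names are assumptions made for illustration only; they are not part of our training code.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error for the scalar model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size, accum_steps):
    """Sum micro-batch gradients, each scaled by 1/accum_steps, before one update."""
    grad = 0.0
    for step in range(accum_steps):
        start = step * micro_batch_size
        micro_batch = data[start:start + micro_batch_size]
        grad += mse_grad(w, micro_batch) / accum_steps
    return grad


# Toy data: 8 samples processed either at once or as 4 micro-batches of 2.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]
w = 0.5
full_batch_grad = mse_grad(w, data)
accum_grad = accumulated_grad(w, data, micro_batch_size=2, accum_steps=4)
```

With equally sized micro-batches, the two gradients agree up to floating-point rounding, while only one micro-batch of activations has to be held in memory at a time.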

Our experiments revealed that the optimal strategy involves training models for a single epoch with a constant learning rate schedule. Additionally, we configured the loss mask to exclude the input (prompt) tokens, so that the loss and gradients are computed only for the output translation. Lastly, to make our models available on a consumer CPU, we provide a quantized GGUF version of our best model. The quantization was done using the GGUF format as implemented in llama.cpp.
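
Computing the loss only on the output translation can be sketched as below. This is a simplified illustration under stated assumptions: the token ids are made up, the label value -100 follows the common PyTorch-style "ignore index" convention, and the usual one-position shift between inputs and labels in causal language modeling is omitted for brevity.

```python
import math

IGNORE_INDEX = -100  # conventional label value that PyTorch-style losses skip


def build_labels(prompt_ids, target_ids):
    """Concatenate prompt and target; mask the prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(target_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return input_ids, labels


def masked_nll(token_log_probs, labels):
    """Average negative log-likelihood over unmasked (target) positions only."""
    kept = [-lp for lp, lab in zip(token_log_probs, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Hypothetical ids: three prompt tokens followed by two translation tokens.
input_ids, labels = build_labels([5, 6, 7], [8, 9])
# Log-probability the model assigns to the correct token at each position.
log_probs = [math.log(0.1)] * 3 + [math.log(0.5)] * 2
loss = masked_nll(log_probs, labels)
```

The poorly predicted prompt tokens do not contribute to the loss, so the gradient signal comes entirely from the translation.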

3.2 Automatic Speech Recognition (ASR)

In the context of Automatic Speech Recognition (ASR), we aim to convert a speech signal into a sequence of words. Let us assume a speech signal $x = (x_{1}, x_{2}, \ldots, x_{T})$, where $x_{t} \in \mathbb{R}^{D}$ is the acoustic feature vector at time $t$ and $T$ is the length of the speech signal. The goal of ASR is to find the most likely sequence of words $w = (w_{1}, w_{2}, \ldots, w_{N})$, where $w_{n} \in \mathcal{V}$ is the $n^{th}$ word of the transcription, drawn from the vocabulary $\mathcal{V}$, and $N$ is the length of the transcription. The ASR problem can be formulated as $\hat{w} = \arg\max_{w \in \mathcal{V}^{*}} P(w \mid x)$, where $P(w \mid x)$ is the posterior probability of the word sequence $w$ given the speech signal $x$.

We propose a cascaded speech-to-text translation system, wherein an ASR system is trained to generate transcriptions, which are subsequently fed into a machine translation model. We opted for a cascaded architecture over an end-to-end approach due to the constraints imposed by limited resources, which rendered the development of an end-to-end system infeasible. Furthermore, previous research has demonstrated that cascaded systems can outperform end-to-end systems in low-resource settings, thereby motivating our design choice [15].
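
The cascaded design amounts to a function composition, sketched below. The type aliases, `make_cascade`, and the two stand-in stages are hypothetical names used only for illustration; in practice each stage would wrap a trained model.

```python
from typing import Callable

Transcriber = Callable[[bytes], str]  # audio bytes -> code-switched transcript
Translator = Callable[[str], str]     # transcript -> translation


def make_cascade(asr: Transcriber, mt: Translator) -> Callable[[bytes], str]:
    """Compose an ASR stage and an MT stage into one speech-to-text translator."""
    def speech_to_translation(audio: bytes) -> str:
        transcript = asr(audio)  # stage 1: speech -> code-switched text
        return mt(transcript)    # stage 2: code-switched text -> target language
    return speech_to_translation


# Stand-in stages for illustration only.
def fake_asr(audio: bytes) -> str:
    return "meeting فى الشركة"


def fake_mt(text: str) -> str:
    return {"meeting فى الشركة": "meeting at the company"}.get(text, text)


pipeline = make_cascade(fake_asr, fake_mt)
translation = pipeline(b"")
```

Because the stages only communicate through text, either one can be retrained or swapped without touching the other, at the cost of ASR errors propagating into the translation.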

We employed the Whisper model [5] to tackle the task of ASR for Egyptian Arabic. The Whisper model, trained on a large-scale dataset of 680,000 hours of multilingual and multitask supervision, demonstrated excellent generalizability to our specific use case. This is particularly valuable for our application, as we are dealing with a distinct dialect of Arabic, namely Egyptian Arabic. Whisper is an encoder-decoder architecture: the encoder takes the input signal in spectrogram format, and the decoder attends to the encoded audio through cross-attention. For our experiments, we leveraged the ArzEn-ST dataset [3], but restricted the output to transcription only, focusing on code-switched Egyptian Arabic.

Data preprocessing involved resampling all audio to 16 kHz, removing URLs and emoticons from the text, segmenting the speech into 30-second clips, and converting each clip into mel-spectrogram images. Training was conducted on 2 T4 GPUs, each equipped with 16 GB of VRAM. The training process was completed in approximately 5 hours.
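
The segmentation step can be sketched as follows for audio that has already been resampled to 16 kHz. Zero-padding the final clip to the full 30 seconds is an assumption of this sketch (it mirrors the fixed-length input that Whisper expects), and `split_into_clips` is a hypothetical helper name.

```python
SAMPLE_RATE = 16_000                        # Hz, after resampling
CLIP_SECONDS = 30
CLIP_SAMPLES = SAMPLE_RATE * CLIP_SECONDS   # 480,000 samples per clip


def split_into_clips(samples, pad_value=0.0):
    """Split a 16 kHz waveform into 30-second clips, zero-padding the last one."""
    clips = []
    for start in range(0, len(samples), CLIP_SAMPLES):
        clip = list(samples[start:start + CLIP_SAMPLES])
        clip.extend([pad_value] * (CLIP_SAMPLES - len(clip)))
        clips.append(clip)
    return clips


# A synthetic 70-second recording yields two full clips and one padded clip.
waveform = [0.1] * (70 * SAMPLE_RATE)
clips = split_into_clips(waveform)
```

Each resulting clip would then be converted to a mel-spectrogram before being passed to the model.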

| Model | BLEU ↑ | BERT-F1 ↑ | EED ↓ | METEOR ↑ | LLMG ↑ |
| Hamed et al., 2022 [6] | 8.6 | - | - | - | - |
| Hamed et al., 2022 + Extra | 34.3 | - | - | - | - |
| LLaMa2 7B | 26.2 | 42.9% | 0.68 | 0.12 | 48% |
| Gemma1.1 2B | 34.3 | 72.1% | 0.41 | 0.39 | 75.8% |
| Gemma1.1 2B + Extra | 37.5 | 75.8% | 0.37 | 0.56 | 79.6% |
| Gemma1.1 7B | 38 | 77.0% | 0.37 | 0.53 | 84.6% |
| Gemma1.1 7B + Extra | 38.2 | 77.6% | 0.37 | 0.56 | 84.3% |
| LLaMa3 8B GGUF Q5 | 53.01 | 80.8% | 0.31 | 0.58 | 86.2% |
| LLaMa3 8B | 53.64 | 81.1% | 0.31 | 0.62 | 86.4% |
| LLaMa3 8B + Extra | 52.27 | 80.1% | 0.30 | 0.59 | 85.8% |
Table 2: Summary results for the models trained on ArzEn-ST to generate English translations. We report BLEU score using SacreBLEU [7], BERT F1, Edit Distance (EED), METEOR, and LLaMa3 70B as an LLM Grader (LLMG). The lower section of the table represents our work.

| Model | BLEU ↑ | BERT-F1 ↑ | EED ↓ | METEOR ↑ | LLMG ↑ |
| Hamed et al., 2022 [6] | 48.0 | - | - | - | - |
| Hamed et al., 2022 + Extra | 79.8 | - | - | - | - |
| Gemma1.1 2B | 86.9 | 97.1% | 0.09 | 0.87 | 94% |
| Gemma1.1 7B | 83.7 | 95.9% | 0.12 | 0.84 | 92.6% |
| LLaMa3 8B GGUF Q5 | 86.3 | 96.2% | 0.09 | 0.76 | 94% |
| LLaMa3 8B | 87.2 | 98.8% | 0.07 | 0.88 | 96% |
Table 3: Summary results for the models trained on ArzEn-ST to generate Egyptian Arabic translations. We report BLEU score using SacreBLEU [7], BERT F1, Edit Distance (EED), METEOR, and LLaMa3 70B as an LLM Grader (LLMG). The lower section of the table represents our work.

4 Results and Discussion

We evaluated the machine translation models using five criteria: BLEU [7], BERT Score [8], edit distance (EED), METEOR [9], and LLaMa3-based grading, inspired by [33], as traditional metrics are limited in capturing semantic nuances. For ASR, we employed Word-Error Rate (WER) and Character-Error Rate (CER) as evaluation metrics. Our models are compared to the state-of-the-art results in [6], with a focus on BLEU for MT and WER and CER for ASR, as these are the only reported metrics.
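
For reference, WER and CER can both be computed from the Levenshtein edit distance, at word and character level respectively. The sketch below is a minimal standalone implementation for illustration; it is not the evaluation code used to produce the reported numbers, and it applies no text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word-error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character-error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


example_wer = wer("اعمل check لل email", "اعمل check email")
example_cer = cer("abcd", "abed")
```

In the first example one of four reference words is missing from the hypothesis, and in the second one of four characters is substituted.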

Figure 1(a) shows that LLaMa3 outperforms all other models on the ArzEn to English translation task. As shown in Table 2, LLaMa3 achieves a BLEU score of 53.64, which exceeds the SoTA [6] by 56%. Also, smaller models such as Gemma 2B and Gemma 7B achieved results comparable to LLaMa3 8B, with BERT-F1 scores lower by 9 and 4.1 percentage points, respectively. On the other hand, LLaMa2 performs worst, which can be attributed to the fact that its tokenizer does not support Arabic tokenization. In contrast to newer models such as Gemma and LLaMa3, which use Byte-Pair Encoding (BPE) [18] implemented with tiktoken, LLaMa2 simply breaks the Arabic sentence down into characters, as shown in Table 5.
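
As a worked check of the headline figures, the relative BLEU gains can be recomputed from Tables 2 and 3. The sketch assumes that the baseline is the strongest prior result, i.e. the "Hamed et al., 2022 + Extra" row in each table; `relative_gain` is a helper name introduced here.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# BLEU of LLaMa3 8B versus the "+ Extra" baseline rows of Tables 2 and 3.
english_gain = relative_gain(53.64, 34.3)
arabic_gain = relative_gain(87.2, 79.8)
```

Under this assumption the gains round to 56.4% for English and 9.3% for Arabic, in line with the reported 56% and 9.3%.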

Notably, models pre-trained on additional data (Hamed et al., 2022 [6] + Extra and Gemma1.1 2B + Extra) generally outperform their counterparts trained only on the ArzEn dataset, suggesting that extra pre-training data can effectively enhance machine translation model performance. However, this gain is marginal for larger models such as Gemma1.1 7B, and it can even be detrimental, as observed for LLaMa3 8B, whose BERT-F1 score decreases by 1 percentage point.

| Model | WER ↓ | CER ↓ | BLEU ↑ | LLMG ↑ | EED ↓ |
| Hamed et al., 2022 [6] | 57.9 | 36.2 | - | - | - |
| Hamed et al., 2022 + Extra | 34.7 | 20.0 | - | - | - |
| Whisper Small | 32.6 | 12.8 | 51.77 | 88.1% | 0.14 |
| Whisper Medium | 31.1 | 12.0 | 55.41 | 92.5% | 0.09 |
Table 4: Performance of automatic speech recognition from speech to code-switched Arabic-English task. Lower Word-Error Rate (WER), Character-Error Rate (CER), and Edit Distance (EED) scores indicate better quality. The lower section of the table represents our work.

As shown in Table 3, translating into Arabic yields significantly higher BLEU scores compared to translating into English, with our optimal Arabic model achieving a BLEU score of 87.2, whereas the best English model attains a BLEU score of 53.64, representing a notable difference of approximately 62%. This phenomenon is consistent with the linguistic characteristics of the source text, where a significant proportion (approximately 85%) of Arabic words remain largely unchanged, with only minor modifications required to accommodate the target language.

| Model | Tokens |
| LLaMa2 | [أ، ن، ا، أ، ح، ب، ا، ل، ت، ف، ا، ح] |
| Gemma | [أنا، أح، ب، التف، اح] |
| LLaMa3 | [أنا، أح، ب، التف، اح] |
Table 5: Tokenization results for the Arabic sentence أنا أحب التفاح (I love apples) using different tokenizers.

[Figure 1, four panels]
(a) Training curves for English translation training.
(b) Training curves for different QLoRA ranks on Gemma1.1 2B.
(c) Training curves for Arabic translation training.
(d) Training curves for Whisper.
Fig. 1: Training curves for various machine translation models. The x-axis represents the number of training steps, and the y-axis represents the loss.

As illustrated in figure 1(b), increasing the LoRA rank consistently yields better results. Our experiments reveal that the optimal parameters are a rank of 256 and an alpha value of 128. Furthermore, we observe that higher ranks require increased LoRA dropout to mitigate overfitting, with a dropout of 0.1 employed for ranks exceeding 32.
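
To make the cost of a higher rank concrete, the sketch below counts the trainable parameters a LoRA adapter adds to one weight matrix and computes the standard LoRA scaling factor alpha / r. The 2048 x 2048 projection size is a hypothetical example chosen for illustration, not a statement about any specific layer of the models we trained.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA scaling applied to the low-rank update B @ A."""
    return alpha / rank


d_in = d_out = 2048  # hypothetical square projection matrix
full_params = d_in * d_out
adapter_params = lora_trainable_params(d_in, d_out, rank=256)
fraction = adapter_params / full_params
scaling = lora_scaling(alpha=128, rank=256)
```

For this example, a rank of 256 trains a quarter as many parameters as the full matrix, and the reported setting of alpha = 128 scales the update by 0.5.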

As shown in Table 4, our trained Whisper models surpass the state-of-the-art results in [6] (+ Extra) by 11.6% in WER, despite being trained solely on the original data without additional pre-training. Furthermore, Figure 1(d) illustrates that the medium Whisper model marginally outperforms the small version, resulting in a 7.1% increase in BLEU score, as reflected in Table 4. Whisper can achieve real-time output, with a latency of 1.3 seconds for a single 30-second clip inference on a consumer-grade GPU with fp16 precision, and 18 seconds on a CPU.

Notably, for English models, we found that human evaluation is particularly well-suited. Therefore, we conducted a human evaluation study, where 65 university students were asked to assess the quality of 10 randomly selected generated sentences on a scale of 1-10, with 1 indicating an irrelevant translation and 10 representing a perfect translation that captures both meaning and cultural nuances. This approach was necessary, as traditional evaluation metrics such as BERTScore, METEOR, edit distance, and BLEU fail to adequately capture the nuances of meaning and cultural context. Our results show that, on average, the generated translations received a rating of 9.2 out of 10, which supports our claim that the translations capture both meaning and cultural nuances. For instance, when presented with the sentence “انا دخلت IG,” our model produced the translation “I entered IG school,” notwithstanding that “IG” signifies “Instagram” in contexts outside of Egyptian culture.

Finally, our top-performing model, LLaMa3, was quantized from bfloat16 to 5-bit Q5, achieving a 68.75% reduction in bits while maintaining performance, with only 1.2% and 1% degradation in English and Arabic versions, respectively. The quantized model can be deployed on a consumer-grade RAM with a modest 5.6 GB footprint, supporting a throughput of 7.2 tokens/sec, thereby enabling real-time speech translation and video dubbing applications.
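
The quantization figures above can be reproduced with simple arithmetic. The sketch below recomputes the bit reduction and the relative BLEU degradation from Tables 2 and 3; the raw-weight size estimate assumes exactly 8 billion parameters at 5 bits each and ignores the per-block metadata that quantized formats store, which is one plausible reason the reported footprint (5.6 GB) is somewhat larger.

```python
def bit_reduction_pct(src_bits: int, dst_bits: int) -> float:
    """Percentage reduction in bits per weight."""
    return 100.0 * (1 - dst_bits / src_bits)


def relative_drop_pct(original: float, quantized: float) -> float:
    """Relative degradation of a score after quantization, in percent."""
    return 100.0 * (original - quantized) / original


reduction = bit_reduction_pct(16, 5)            # bfloat16 -> 5-bit
english_drop = relative_drop_pct(53.64, 53.01)  # BLEU, Table 2
arabic_drop = relative_drop_pct(87.2, 86.3)     # BLEU, Table 3

# Rough lower bound on the weight storage, assuming 8e9 parameters.
raw_weight_gb = 8e9 * 5 / 8 / 1e9
```

These values round to a 68.75% bit reduction and BLEU drops of 1.2% (English) and 1.0% (Arabic), with about 5 GB of raw weight storage.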

5 Conclusion

This paper has provided insights into the methodologies employed in developing machine translation and automatic speech recognition systems for code-switched Egyptian Arabic. Through careful experimentation and rigorous evaluation, we have demonstrated the effectiveness of our approaches in achieving culturally fitting translations and accurate speech recognition.

Our findings emphasize the importance of using large language models and pre-training with additional data to enhance the performance of MT systems. Moreover, the success of our ASR models, particularly the Whisper architecture, highlights the potential of deep learning techniques in tackling speech recognition tasks, even in low-resource settings.

Looking ahead, further research could explore advanced optimization techniques and novel model architectures to push the boundaries of MT and ASR performance. Additionally, efforts to expand training data and refine models for specific dialects could result in even more precise translations and transcriptions, fostering greater linguistic accessibility in our globalized world.

References

  • [1] Team, G., Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., Tafti, P., Hussenot, L., Sessa, P. G., Chowdhery, A., Roberts, A., Barua, A., Botev, A., Castro-Ros, A., Slone, A., … Kenealy, K. (2024) “Gemma: Open Models Based on Gemini Research and Technology”, arXiv.org, https://arxiv.org/abs/2403.08295.
  • [2] Huang, H., Yu, F., Zhu, J., Sun, X., Cheng, H., Song, D., Chen, Z., Alharthi, A., An, B., He, J., Liu, Z., Zhang, Z., Chen, J., Li, J., Wang, B., Zhang, L., Sun, R., Wan, X., Li, H., Xu, J. (2023) “AceGPT, Localizing Large Language Models in Arabic”, arXiv.org, https://arxiv.org/abs/2309.12053.
  • [3] Hamed, I., Vu, N. T., Abdennadher, S. (2020) “Arzen: A speech corpus for code-switched Egyptian Arabic-English”, in Proceedings of the International Conference on Language Resources and Evaluation, pages 4237-4246.
  • [4] Al-Sabbagh, R. (2024) “ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations”, Data in Brief, 54, 110271.
  • [5] Radford, Alec, Kim, Jong Wook, Xu, Tao, Brockman, Greg, McLeavey, Christine, Sutskever, Ilya (2023) “Robust speech recognition via large-scale weak supervision”, in Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA.
  • [6] Hamed, I., Habash, N., Abdennadher, S., Vu, N. T. (2022) “ArzEn-ST: A Three-way Speech Translation Corpus for Code-Switched Egyptian Arabic-English”, in Bouamor, H., Al-Khalifa, H., Darwish, K., Rambow, O., Bougares, F., Abdelali, A., Tomeh, N., Khalifa, S., Zaghouani, W. (eds) Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), Abu Dhabi, United Arab Emirates (Hybrid).
  • [7] Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2002) “Bleu: a method for automatic evaluation of machine translation”, in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.
  • [8] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q. and Artzi, Y., 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.
  • [9] Banerjee, S. and Lavie, A., 2005, June. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization (pp. 65-72).
  • [10] Dettmers, T., Pagnoni, A., Holtzman, A. and Zettlemoyer, L., 2024. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36.
  • [11] Liu, S.Y., Wang, C.Y., Yin, H., Molchanov, P., Wang, Y.C.F., Cheng, K.T. and Chen, M.H., 2024. DoRA: Weight-Decomposed Low-Rank Adaptation. arXiv preprint arXiv:2402.09353.
  • [12] Loshchilov, I. and Hutter, F., 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • [13] Chen, T., Xu, B., Zhang, C. and Guestrin, C., 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.
  • [14] Lamy-Poirier, J., 2021. Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models. arXiv preprint arXiv:2106.02679.
  • [15] Denisov, P., Mager, M. and Vu, N. T. (2021) ’IMS’s systems for the IWSLT 2021 low-resource speech translation task’, Proceedings of the International Conference on Spoken Language Translation.
  • [16] Sitaram, S., Chandu, K.R., Rallabandi, S.K. and Black, A.W., 2019. A survey of code-switched speech and language processing. arXiv preprint arXiv:1904.00784.
  • [17] Hafez, R. (2015) Factors affecting code switching between Arabic and English. Master’s thesis, The American University in Cairo. Available at: https://fount.aucegypt.edu/etds/148
  • [18] Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T. and Arikawa, S., 1999. Byte pair encoding: A text compression scheme that accelerates pattern matching.
  • [19] Li, L., 2004. Corpus-based machine translation. Shanghai Journal of Translators for Science and Technology, 19(2), pp.59-62.
  • [20] Al-Taani, A.T. and Hailat, Z.M., 2005. A direct English-Arabic machine translation system. Information Technology Journal, 4(3), pp.256-261.
  • [21] Farhat, A. and Al-Taani, A., 2015. A rule-based English to Arabic machine translation approach. In international Arab conference on information technology (ACIT’2015).
  • [22] Carbonell, J.G., Cullingford, R.E. and Gershman, A.V., 1981. Steps toward knowledge-based machine translation. IEEE Transactions on Pattern Analysis and Machine Intelligence, (4), pp.376-392.
  • [23] Oladosu, J., Esan, A., Adeyanju, I., Adegoke, B., Olaniyan, O. and Omodunbi, B., 2016. Approaches to machine translation: a review. FUOYE Journal of Engineering and Technology, 1(1).
  • [24] Abiola, O.B., Adetunmbi, A.O. and Oguntimilehin, A., 2015. Using hybrid approach for English-to-Yoruba text to text machine translation system (proposed). International Journal of Computer Science and Mobile Computing, 4(8), pp.308-313.
  • [25] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S. and Bikel, D., 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • [26] OpenAI, 2022. ChatGPT. Available at: https://openai.com/blog/chatgpt.
  • [27] Jacobson, R. ed., 2001. Codeswitching worldwide II. Mouton de Gruyter.
  • [28] Chan, J.Y., Ching, P.C., Lee, T. and Meng, H.M., 2004, December. Detection of language boundary in code-switching utterances by bi-phone probabilities. In 2004 International Symposium on Chinese Spoken Language Processing (pp. 293-296). IEEE.
  • [29] Weiner, J., Vu, N.T., Telaar, D., Metze, F., Schultz, T., Lyu, D.C., Chng, E.S. and Li, H., 2012. Integration of language identification into a recognition system for spoken conversations containing code-switches. In Spoken Language Technologies for Under-Resourced Languages.
  • [30] Lyu, D.C., Lyu, R.Y., Chiang, Y.C. and Hsu, C.N., 2006, May. Speech recognition on code-switching among the Chinese dialects. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings (Vol. 1, pp. I-I). IEEE.
  • [31] Costa-jussà, M.R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K., Kalbassi, E., Lam, J., Licht, D., Maillard, J. and Sun, A., 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.
  • [32] Barrault, L., Chung, Y.A., Meglioli, M.C., Dale, D., Dong, N., Duppenthaler, M., Duquenne, P.A., Ellis, B., Elsahar, H., Haaheim, J. and Hoffman, J., 2023. Seamless: Multilingual Expressive and Streaming Speech Translation. arXiv preprint arXiv:2312.05187.
  • [33] Xiao, C., Ma, W., Xu, S.X., Zhang, K., Wang, Y. and Fu, Q., 2024. From Automation to Augmentation: Large Language Models Elevating Essay Scoring Landscape. arXiv preprint arXiv:2401.06431.