Soft Language Prompts for Language Transfer

Ivan Vykopal¹,², Simon Ostermann³ and Marián Šimko²
¹ Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
² Kempelen Institute of Intelligent Technologies, Bratislava, Slovakia
{name.surname}@kinit.sk
³ German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany
[email protected]

Abstract

Cross-lingual knowledge transfer, especially between high- and low-resource languages, remains a challenge in natural language processing (NLP). This study offers insights for improving cross-lingual NLP applications through the combination of parameter-efficient fine-tuning methods. We systematically explore strategies for enhancing cross-lingual transfer by incorporating language-specific and task-specific adapters and soft prompts. We present a detailed investigation of various combinations of these methods, exploring their efficiency across six languages with a focus on three low-resource languages, and including what is, to our knowledge, the first use of soft language prompts. Our findings demonstrate that, in contrast to claims of previous work, a combination of language and task adapters does not always work best; instead, combining a soft language prompt with a task adapter outperforms other configurations in many cases.


1 Introduction

Many multilingual large language models (LLMs) have been developed in recent years, demonstrating promising performance on various NLP tasks across multiple languages Xue et al. (2021); Workshop et al. (2023). These models are pre-trained on extensive corpora of unlabelled data in numerous languages, allowing them to adapt to linguistic characteristics and nuances. In addition, LLMs are often further trained on downstream tasks in a selected subset of languages Muennighoff et al. (2023). However, only a few LLMs focus on low-resource languages Üstün et al. (2024).

As the number of covered languages in a model increases, the issue of the curse of multilinguality arises. This problem occurs when the LLM’s capacity is limited, causing languages with less training data to perform poorly Conneau et al. (2020). Various approaches have been employed to address this limitation, primarily involving additional trainable parameters specific to individual languages Pfeiffer et al. (2020, 2023).

An alternative to language-specific tuning is cross-lingual transfer. LLMs have achieved superior results on many NLP tasks in high-resource languages, which has motivated researchers to transfer such capabilities to low-resource languages. In cross-lingual transfer methods, an LLM is trained on a downstream task in one language and evaluated in other languages Pikuliak et al. (2021). However, training only task-specific representations does not always capture the nuances of languages on which the LLM has not been trained, or has been trained only to a small extent. Therefore, incorporating language-specific features can enhance knowledge transfer across languages.

Previous work has primarily investigated language and task representations by training language and task-specific adapters Pfeiffer et al. (2020); Parović et al. (2022) or by employing language arithmetic Klimaszewski et al. (2024). Nonetheless, other approaches that involve adding additional parameters to the model for language representation have not been thoroughly explored. This brings the opportunity to explore the combination of language and task representations using other methods and their impact in cross-lingual settings.

To explore the utilization of language and task representation methods, we evaluated various configurations by combining two parameter-efficient fine-tuning (PEFT) methods that incorporate additional parameters into the LLM, namely adapters and prompt-tuning. By adding these language- and task-specific parameters, we increased the capacity of the mT0-Base model and improved cross-lingual performance. We evaluated the performance of each configuration by training on three high-resource languages and evaluating its effectiveness on three low-resource languages on four selected tasks.¹ Our main contributions are:

• We propose soft language prompts as an alternative method for cross-lingual transfer.

• We provide a comprehensive evaluation of combinations of adapters and soft prompts in cross-lingual transfer, and find that language prompts provide a viable alternative to language adapters, especially for some low-resource languages.

• In addition, we provide an exhaustive evaluation of both prompts and adapters for task transfer. We find that the best combination of adapters and prompts for task and language transfer depends highly on the task and language, respectively, and that there is no solution that clearly outperforms the others.

¹ Code is available at: https://github.com/ivanvykopal/adapter-prompt-evaluation

2 Related Work

Adapters and Soft Prompts.

PEFT methods are designed to address the growing number of trainable parameters in LLMs Dettmers et al. (2023); Zhang et al. (2023); Xu et al. (2023); Xie et al. (2024). These methods reduce the number of trained parameters and incorporate new parameters commonly used to train LLMs on other tasks. Adapters Houlsby et al. (2019a) and prompt-tuning Lester et al. (2021) represent two PEFT methods for adapting the LLM to different NLP domains. Adapters incorporate new parameters into the transformer architecture by including down- and up-projection layers along with a residual connection, while prompt-tuning introduces trainable soft prompts that are prepended to the input embedding to condition the LLM's generation.
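The adapter structure described above can be illustrated with a minimal, dependency-free sketch. It is not the implementation used in the experiments: the class name `BottleneckAdapter`, the ReLU non-linearity, and the zero initialisation of the up-projection are assumptions made for illustration only.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Element-wise ReLU (assumed non-linearity for this sketch)."""
    return [max(0.0, x) for x in vector]


class BottleneckAdapter:
    """Down-projection, non-linearity, up-projection, plus a residual connection."""

    def __init__(self, hidden_size, bottleneck_size, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(hidden_size)
        self.hidden_size = hidden_size
        self.bottleneck_size = bottleneck_size
        # Down-projection: hidden_size -> bottleneck_size.
        self.down = [[rng.gauss(0.0, scale) for _ in range(hidden_size)]
                     for _ in range(bottleneck_size)]
        # Up-projection: bottleneck_size -> hidden_size. Zero-initialised here
        # (an assumption) so that the untrained adapter is the identity map.
        self.up = [[0.0] * bottleneck_size for _ in range(hidden_size)]

    def num_parameters(self):
        """Number of weights in the two projections (biases omitted)."""
        return 2 * self.hidden_size * self.bottleneck_size

    def __call__(self, hidden_state):
        projected = matvec(self.up, relu(matvec(self.down, hidden_state)))
        # Residual connection: the adapter output is added to its input.
        return [h + p for h, p in zip(hidden_state, projected)]


adapter = BottleneckAdapter(hidden_size=8, bottleneck_size=2)
print(adapter.num_parameters())
```

Only the two small projection matrices would be trained in such a module, which is what keeps the number of trained parameters low relative to the frozen LLM.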

Limitations of Multilingual LLMs.

One major limitation of LLMs is catastrophic forgetting, which occurs when training the LLM on a new task, causing it to partially or entirely forget previously learned knowledge for other tasks McCloskey and Cohen (1989); Luo et al. (2024); Ren et al. (2024). This forgetting extends beyond task-specific knowledge to language-specific knowledge if the model is fine-tuned on a subset of the original languages Vu et al. (2022a); Liu and Huang (2023).

Another challenge with multilingual LLMs is associated with the number of languages on which these LLMs have been pre-trained Conneau et al. (2020); Pfeiffer et al. (2022). Previous research has shown that as the number of languages covered by LLMs increases, their performance on various NLP tasks degrades Hu et al. (2020); Ponti et al. (2020). Additionally, low-resource languages are often underrepresented during pre-training, resulting in poor performance in these languages Wu and Dredze (2020).

Cross-Lingual Transfer.

Given the many low-resource and underrepresented languages, cross-lingual transfer is crucial in training LLMs to address NLP tasks in various languages Pikuliak et al. (2021). A commonly employed approach is to train an LLM in one language and evaluate its performance in another. However, with the development of methods that incorporate additional parameters, researchers have used these methods to train language-specific representations, assisting LLMs in solving NLP tasks in low-resource languages as well Pfeiffer et al. (2020); Ansell et al. (2021); Parović et al. (2022); Lee et al. (2022); Pfeiffer et al. (2023); Kunz and Holmström (2024).

3 Methodology

We conduct a comprehensive study of combinations of language and task representations using adapters and soft prompts. We evaluate, for the first time, the capabilities of soft language prompts in a systematic manner and assess the performance of diverse combinations of prompts and adapters in cross-lingual settings. The pipeline, consisting of training and evaluation together with the languages and tasks that constitute each step, is illustrated in Figure 1.

Figure 1: The full pipeline consists of training language and task representations along with evaluation on four selected tasks. Selected languages are represented by red color, while used tasks in each step are colored in green.

In the following sections, we first give details on the different methods that we investigate for representing language information (Section 3.1) and task information (Section 3.2). We then explain the combinations of prompts and adapters that we evaluate (Section 3.3).

3.1 Language Representation

Language Adapters.

Previous work has investigated the effectiveness of training language-specific transformations using the adapter architecture Houlsby et al. (2019b). Pfeiffer et al. (2020) proposed the MAD-X framework, which includes training language adapters using the masked language modeling objective on unlabelled data. We build upon their adapter architecture and their approach to training language adapters. Language adapters in our setting are incorporated into each transformer layer of the LLM and trained using unlabelled data.

Soft Language Prompts.

Soft prompt tuning provides a promising alternative for parameter-efficient adaptation of LLMs. Previous work has primarily focused on task transferability and thus on training task-specific soft prompts, mainly for a single language Vu et al. (2022b); Asai et al. (2022). We go beyond this restriction and train language-specific soft prompts to specialize multilingual LLMs towards one language. Since multilingual LLMs are able to provide answers in various languages, we define a soft language prompt as a set of token embeddings that is prepended to the input embedding and fed into the LLM to condition it to answer in a specific language.

In existing work, authors have identified that soft prompt initialization is essential for the LLM’s performance. Lester et al. (2021) have defined three possible options: (1) random initialization from Gaussian distribution; (2) initialization from the model’s vocabulary; and (3) initialization with the embedding of output classes in the case of the classification task. Each type of initialization has its advantages and drawbacks. However, none is appropriate for our experiments because our primary focus is on multilingual LLMs. Therefore, we defined a text instruction specific to each language (see Appendix C) for the soft prompt initialization. When employing text instructions for the initialization, this text is first embedded. Then, its length is adjusted according to the desired soft prompt size, i.e., the text embedding is repeated in the case of insufficient size until the desired length of the soft prompt is obtained.
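The length adjustment of the instruction-based initialization can be sketched as follows. This is only an illustration under stated assumptions: the function name `init_soft_prompt` is hypothetical, the embeddings are toy lists rather than real model embeddings, and truncating an instruction that is longer than the desired prompt is our assumption (the text above only specifies repetition for instructions that are too short).

```python
def init_soft_prompt(instruction_embeddings, prompt_length):
    """Cycle through the instruction's token embeddings until the soft
    prompt has exactly `prompt_length` vectors."""
    if not instruction_embeddings:
        raise ValueError("the instruction must contain at least one token embedding")
    count = len(instruction_embeddings)
    # Index i of the soft prompt copies embedding (i mod count), which repeats
    # short instructions and truncates long ones (truncation is an assumption).
    return [list(instruction_embeddings[i % count]) for i in range(prompt_length)]


# Toy example: a three-token instruction expanded to a five-vector soft prompt.
toy_instruction = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
soft_prompt = init_soft_prompt(toy_instruction, prompt_length=5)
print(len(soft_prompt))
```

The copies are independent lists, so updating the soft prompt during training would not alter the embeddings it was initialized from.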

Language Modeling Objective.

To train language-specific representations, unlabelled data from selected languages is necessary, and choosing an appropriate training objective is integral. As we employ an encoder-decoder architecture, we chose span corruption as the objective, which has been shown to work well in previous work Raffel et al. (2020); Xue et al. (2021). Unlike the causal language modeling objective, where the LLM is trained to predict the next token in a sequence, in span corruption 15% of the tokens in the input text are randomly masked using sentinel tokens. The only purpose of sentinel tokens is to mask parts of the input, which the LLM is tasked to predict Raffel et al. (2020). Finally, the LLM is trained to reconstruct the masked tokens to obtain the original input text. In this manner, the LLM learns linguistic nuances and patterns that are beneficial in training task-specific adapters and soft prompts.
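A simplified sketch of how span corruption turns a token sequence into an input and a target is given below. It is not the preprocessing code used in the experiments: the helper names are hypothetical, masked positions are sampled independently and adjacent ones are merged into spans (the T5 recipe samples span lengths directly), and the closing sentinel that T5 appends to the target is omitted. The sentinel naming `<extra_id_N>` follows the convention of T5-style tokenizers.

```python
import random


def choose_masked_positions(num_tokens, mask_ratio=0.15, seed=0):
    """Randomly pick roughly `mask_ratio` of the token positions to mask."""
    rng = random.Random(seed)
    num_masked = max(1, round(num_tokens * mask_ratio))
    return set(rng.sample(range(num_tokens), num_masked))


def span_corrupt(tokens, masked_positions):
    """Replace each run of masked tokens with one sentinel in the input and
    emit the sentinel followed by the masked tokens in the target."""
    inputs, targets = [], []
    sentinel_id = 0
    previous_masked = False
    for position, token in enumerate(tokens):
        if position in masked_positions:
            if not previous_masked:
                # A new masked span starts: use the next sentinel token.
                sentinel = f"<extra_id_{sentinel_id}>"
                inputs.append(sentinel)
                targets.append(sentinel)
                sentinel_id += 1
            targets.append(token)
            previous_masked = True
        else:
            inputs.append(token)
            previous_masked = False
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, masked_positions={2, 3, 8})
print(" ".join(inputs))
print(" ".join(targets))
```

In this example the positions are fixed by hand for readability; `choose_masked_positions` shows how a random 15% selection could be drawn instead.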

3.2 Task Representation

Task Adapters.

Similarly to language adapters, we use task-specific adapters, represented by the same architecture, which are incorporated into each transformer layer of the LLM. However, when combining task representations with language representations, the final architecture differs across configurations and depends on the type of language representation used during the training and inference. Detailed information on the architecture for all combinations is in Section 3.3.

Task adapters are updated only during training on the desired downstream task, while the rest of the model, along with the language representation, is kept frozen. In the case of task-specific representations, LLMs learn knowledge that is characteristic of the specified tasks and that should be language-independent.

Soft Task Prompts.

In addition to task adapters, we also use soft task prompts that employ the same architecture and parameters used for soft language prompts. The difference when using a soft task prompt occurs in the configuration consisting of a soft language prompt and a soft task prompt. With this configuration, both soft prompts are combined using a concatenation operation and further fed into the model to condition the final generation.

3.3 Combining Adapters and Soft Prompts

Since our experiments are focused on evaluating language and task representations and their combination, we define six possible configurations: (1) only task adapter; (2) only soft task prompt; (3) the combination of language and task adapter; (4) the combination of language adapter and soft task prompt; (5) the combination of soft language prompt and task adapter; and (6) the combination of soft language prompt and soft task prompt. The position of task representations within the LLM highly depends on the type of language representation used in experiments. The architecture along with the form of the input for all configurations are illustrated in Figure 2.
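The six configurations form the product of three language-representation options and two task-representation options, which can be enumerated as in the following sketch. The identifiers are illustrative labels of ours, not names from the released code.

```python
from itertools import product

# Options for each kind of representation; None means no language representation.
LANGUAGE_REPRESENTATIONS = (None, "adapter", "soft_prompt")
TASK_REPRESENTATIONS = ("adapter", "soft_prompt")

# Enumerated in the same order as configurations (1) to (6) in the text.
CONFIGURATIONS = list(product(LANGUAGE_REPRESENTATIONS, TASK_REPRESENTATIONS))


def describe(language_repr, task_repr):
    """Summarize which component is frozen and which is trained."""
    if language_repr is None:
        language = "no language representation"
    else:
        language = f"frozen language {language_repr}"
    return f"{language} + trainable task {task_repr}"


for number, (language_repr, task_repr) in enumerate(CONFIGURATIONS, start=1):
    print(number, describe(language_repr, task_repr))
```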

Task Adapters & Soft Prompts.

The first two configurations aim to train task representations alone, without incorporating language-specific representation. We trained adapters and soft prompts on each selected dataset, evaluating the trained task representations in all defined languages. During the training, only adapters and soft prompts are trained while keeping the rest of the LLM frozen.

Language & Task Adapters.

In addition to training task representations alone, we trained a task adapter on top of the already trained language adapter, as proposed in MAD-X Pfeiffer et al. (2020). The task adapter receives the output from the language adapter as input and processes it further. In the training process, only the task adapter is trained, while the language adapter together with LLM is kept frozen.

Adapters and Soft Prompts Combinations.

In our study, we introduce two combinations of language and task representation using adapters and soft prompts. The first configuration involves soft task prompts along with a language adapter. This combination incorporates trained language-specific knowledge using a language adapter, and a soft task prompt trained on the desired downstream task.

The second combination includes training a task adapter with the trained soft language prompt. Soft language prompts condition LLMs to activate knowledge specific to the desired language, while task adapters learn task-specific knowledge.

Soft Language Prompts & Soft Task Prompts.

The last configuration includes soft language and soft task prompts. Inspired by stacking language and task adapters on top of each other, we concatenated embeddings of language and task prompts to a final soft prompt, with the LLM and soft language prompt being frozen during training.
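The concatenation in this configuration can be sketched with toy embeddings as follows. The function name is hypothetical, and placing the language prompt before the task prompt is an assumption of the sketch; the boolean mask only marks which positions would receive updates during task training.

```python
def concat_prompts(language_prompt, task_prompt, input_embeddings):
    """Prepend [language prompt; task prompt] to the embedded input sequence."""
    sequence = language_prompt + task_prompt + input_embeddings
    # Only the task-prompt vectors are trainable; the soft language prompt
    # and the input embeddings of the LLM stay frozen.
    trainable = ([False] * len(language_prompt)
                 + [True] * len(task_prompt)
                 + [False] * len(input_embeddings))
    return sequence, trainable


language_prompt = [[0.1, 0.1], [0.2, 0.2]]      # frozen, trained beforehand
task_prompt = [[0.0, 0.0]]                       # trained on the downstream task
input_embeddings = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

sequence, trainable = concat_prompts(language_prompt, task_prompt, input_embeddings)
print(len(sequence), trainable)
```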

4 Experiments

4.1 Model Selection

We selected an encoder-decoder architecture, specifically the mT0-Base model, to conduct a cross-lingual evaluation. The mT0 model is based on the pre-trained multilingual mT5 model, which has been further fine-tuned on a collection of 46 languages across 16 NLP tasks Muennighoff et al. (2023). The model selection played a crucial role in further experiments and we conducted several preliminary experiments with the original mT5-Base model. However, we observed that in the case of using the pre-trained model, which has not been further fine-tuned on downstream tasks, prompt-tuning is not enough to train the LLM to produce meaningful outputs.

4.2 Languages

The original mT5 pre-trained model covers more than 100 languages, while mT0 employed only 46 of them for further fine-tuning. Based on the list of languages initially supported by the mT5 model, we selected six languages and categorized them into high and low-resource. On the one hand, we consider English, German, and Spanish to be high-resource languages. On the other hand, Czech, Slovak, and Telugu are considered low-resource. Our distinction between these two groups is based on the number of resources available for each language (in terms of unlabelled and labelled data).

Additionally, we included languages from two families and script types in the low-resource category. Czech and Slovak represent the Indo-European language family with the Latin script, while the Telugu language represents the Dravidian language family with the Telugu script. The purpose of including Telugu in our cross-lingual evaluation is to investigate the ability of the mT0 model to transfer knowledge between more similar and more distant languages, both in the form of script and language features.

To train language representations on unlabelled data, we selected Wikipedia as a source that contains many articles in various languages, including low-resource ones. All the Wikipedia data are from the latest preprocessed dump from Hugging Face,² specifically from November 2023.

² https://huggingface.co/datasets/wikimedia/wikipedia

4.3 Tasks

In order to evaluate the capabilities of mT0-Base for cross-lingual transfer, we chose four distinct tasks involving various NLP areas to explore the model performance. These tasks differ in the type of the provided output and include question answering (QA), named-entity recognition (NER), natural language inference (NLI), and check-worthy claim detection (CWCD). They were selected based on the availability of datasets for selected languages and to include various NLP tasks related to reading comprehension, recognizing textual entailment, or fact-checking domains. Table 1 lists the datasets used in our experiments.

Due to the absence of datasets for some languages, we employed Google Translate to translate data for several languages. This concerns, in particular, the dataset for the Slovak NLI task and the dataset for check-worthy claim detection for German and Telugu. In the case of the missing Slovak NLI dataset, we utilized the CS ANLI dataset and translated it from Czech to Slovak. In contrast, for check-worthy claim detection, we translated the English dataset into German and Telugu to obtain results for the comparison.³

³ To evaluate the correctness of the translations, we conducted a manual verification of a subset of samples, focusing specifically on translations between Czech and Slovak (based on native speakers). Our findings indicate that the translations generated by Google Translate are correct for these particular languages.

Dataset	Task	Languages	Citation
SQuAD	QA	en	Rajpurkar et al. (2016)
MLQA	QA	ar, de, hi, zh, es, vi	Lewis et al. (2019)
SK-QuAD	QA	sk	Hládek et al. (2023)
Czech SQuAD	QA	cs	Macková and Straka (2020)
TeQuAD	QA	te	Vemula et al. (2022)
WikiANN	NER	en, de, es, cs, sk, te, 170 others	Rahimi et al. (2019)
XNLI	NLI	en, de, es, 12 others	Conneau et al. (2018)
IndicXNLI	NLI	te, 10 others	Aggarwal et al. (2022)
CS ANLI	NLI	cs, sk*	CS-ANLI
MultiClaim	CWCD	en, es, sk, cs, de*, te*, 6 others	Pikuliak et al. (2023); Hyben et al. (2023)

Table 1: The list of datasets used in our experiments. Languages marked with * represent language versions of datasets that are not original but were obtained by translating texts from Czech (CS ANLI) or English (MultiClaim).

4.4 Experimental Setup

Language Representations.

Language adapters and soft prompts were trained for 100,000 steps using a span corruption objective. We utilized the same batch size of 32 for all languages with a learning rate of $5e-5$ for training language adapters and $5e-1$ for soft language prompts. We identified these learning rates as the best for particular PEFT methods on English data. Detailed parameters are listed in Table 3 in Appendix D.

Task Representations.

When training task representations, for datasets that do not include a test set we split the original training set into training and validation splits, holding out 15% of the records for validation, and used the original validation split as the test set. This applies in particular to the question answering and check-worthy claim detection tasks. Secondly, we preprocessed each dataset by transforming each record into the text-to-text format, employing the prompt templates listed in Appendix B. Finally, we trained task representations using the same training parameters across all tasks, with differences only in learning rates and weight decay between adapters and soft prompts.

Besides distinctions in training parameters, the instruction used for training soft prompts differs across languages and tasks. These variations reflect the name of the language in which the answer is to be generated and the task that the LLM is solving.
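The data preparation described in this paragraph can be sketched as below. The function names, the fixed seed, and in particular the question answering template are hypothetical placeholders; the templates actually used are the ones in Appendix B.

```python
import random


def split_train_validation(records, validation_ratio=0.15, seed=42):
    """Hold out a share of the training records as a validation split."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    num_validation = round(len(shuffled) * validation_ratio)
    return shuffled[num_validation:], shuffled[:num_validation]


def to_text_to_text(record):
    """Turn a QA record into a (source text, target text) pair.
    Hypothetical template, used only for illustration."""
    source = f"Question: {record['question']} Context: {record['context']}"
    return source, record["answer"]


records = [
    {"question": f"Q{i}?", "context": f"Context {i}.", "answer": f"A{i}"}
    for i in range(100)
]
train, validation = split_train_validation(records)
print(len(train), len(validation))
print(to_text_to_text(train[0]))
```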

Task representations in all configurations were trained for 50,000 steps using only a single seed with a batch size of 32, saving and evaluating every 1,000 steps.⁴ The best model was chosen based on the loss on the validation split. For the classification tasks, we set the maximum number of generated tokens based on the predicted classes. This limits cases in which the LLM continues generating beyond the answer and allows us to evaluate the LLM's performance correctly. Table 3 in Appendix D shows the exact parameters for training language and task representations.

⁴ We employed only one seed due to computational and time limitations. However, we performed a check of the generalizability of the approach by training the task representation on the German version of the WikiANN dataset for NER using two additional seeds and evaluated cross-lingual transfer from German to all languages. The results are in Appendix F.
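Selecting the best checkpoint by validation loss amounts to the following small sketch. The function name and the loss values are made up for illustration; only the evaluation interval of 1,000 steps comes from the setup above.

```python
def select_best_checkpoint(validation_losses, eval_every=1000):
    """Return (training step, loss) of the checkpoint with the lowest
    validation loss; entry i was recorded at step (i + 1) * eval_every."""
    if not validation_losses:
        raise ValueError("no evaluations recorded")
    best_index = min(range(len(validation_losses)),
                     key=validation_losses.__getitem__)
    return (best_index + 1) * eval_every, validation_losses[best_index]


# Illustrative (made-up) losses for the first four evaluations.
step, loss = select_best_checkpoint([0.90, 0.70, 0.75, 0.80])
print(step, loss)
```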

Evaluation.

For evaluation, we selected several standard metrics employed for particular tasks. Specifically, we use the F1-Score and Accuracy for classification tasks and QA in the SQuAD format. Besides the F1-Score for QA, we also calculated Exact Match, assessing how many of the answers exactly match the ground truth.⁵ For the evaluation, we employed metrics implemented in the Hugging Face evaluate library.⁶

⁵ Exact Match tends to underestimate models' performance for low-resource languages, where LLMs are often not able to produce the exact answer with the correct grammar.

⁶ https://huggingface.co/docs/evaluate
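To make the two QA metrics concrete, the following standalone sketch computes Exact Match and token-level F1 in the style of the SQuAD evaluation for a single prediction and reference. It is a reimplementation for illustration, not the Hugging Face evaluate code used in the experiments, and its normalization (lowercasing, removing punctuation and English articles) is English-specific.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("in Bratislava", "Bratislava"))
print(token_f1("in Bratislava", "Bratislava"))
```

The example shows why Exact Match is the stricter metric: a prediction with one extra word scores zero on Exact Match but still receives partial credit under F1.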

We evaluated the results primarily on cross-lingual transfer from high-resource languages to low-resource ones, where task representations were trained on datasets in high-resource languages. We aim to assess the combination of language representations of low-resource languages with task representations trained on datasets from high-resource languages, i.e., high-resource language as source language and low-resource as target ones. We also explored the transfer between all possible language pairs, considering each language as the source language and every other language as the target language. Extended results are shown in Appendix E.
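The two evaluation settings correspond to two sets of (source, target) language pairs, sketched below with the language codes from Table 1. The variable names are ours.

```python
from itertools import permutations, product

HIGH_RESOURCE = ("en", "de", "es")
LOW_RESOURCE = ("cs", "sk", "te")

# Primary setting: high-resource source language, low-resource target language.
high_to_low_pairs = list(product(HIGH_RESOURCE, LOW_RESOURCE))

# Extended setting: every language as source, every other language as target.
all_pairs = list(permutations(HIGH_RESOURCE + LOW_RESOURCE, 2))

print(len(high_to_low_pairs), len(all_pairs))
```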

Baselines.

To evaluate the proposed methods, we employed several baseline approaches and configurations in our study. Baselines include task adapters, soft task prompts (prompt-tuning approach), and the combination of a language and task adapter, as proposed by Pfeiffer et al. (2020). These baselines provided a foundation for assessing the effectiveness of cross-lingual transfer in our experiments.

5 Results and Analysis

Overall Results.

Our investigation into cross-lingual transfer performance between high-resource and low-resource languages is summarized in Table 2. The table presents metrics averaged across the four defined tasks for the low-resource languages, namely Czech, Slovak and Telugu. The Task Language column specifies the language used for training the task representation, whereas the last three columns represent the low-resource languages employed for language representation in cross-lingual transfer.

Our results demonstrate that the selection of source languages plays an important role in the overall results, with distinct languages demonstrating different performance gains. Using English as a source language resulted in the highest performance for most low-resource languages when employing task representations alone. However, for Slovak, a combination of soft language prompts and task adapters proved more effective.

In contrast, when using German and Spanish as source languages, configurations combining language and task representations yielded superior scores. Specifically, transferring knowledge from Spanish using a combination of soft language prompts and task adapters resulted in the highest performance. Therefore, this configuration using Spanish enhanced the model’s performance, making Spanish the most effective high-resource language for cross-lingual transfer between languages with the same script.

Task Language	Language Representation	Task Representation	Czech	Slovak	Telugu
English	None	Adapter	49.64†	47.98‡	52.13‡
English	None	Soft Prompt	41.26	40.19	52.43†
English	Adapter	Adapter	46.50	43.30	38.47
English	Adapter	Soft Prompt	38.23	37.41	37.48
English	Soft Prompt	Adapter	48.37‡	50.88†	49.16
English	Soft Prompt	Soft Prompt	47.64	46.48	47.12
German	None	Adapter	51.72	51.27	57.50†
German	None	Soft Prompt	47.60	46.53	52.91
German	Adapter	Adapter	53.81†	54.56†	45.65
German	Adapter	Soft Prompt	47.34	46.63	45.88
German	Soft Prompt	Adapter	51.90‡	53.98‡	55.90‡
German	Soft Prompt	Soft Prompt	48.85	48.56	54.60
Spanish	None	Adapter	53.61‡	51.82	57.12†
Spanish	None	Soft Prompt	50.28	49.05	53.94
Spanish	Adapter	Adapter	50.03	53.04‡	45.45
Spanish	Adapter	Soft Prompt	50.55	51.52	39.71
Spanish	Soft Prompt	Adapter	55.24†	54.84†	56.90‡
Spanish	Soft Prompt	Soft Prompt	51.41	50.68	50.54

Table 2: Average scores for each configuration across all tasks for low-resource languages. The languages in rows represent the language in which the task representation was trained, and the languages in columns represent the language representation that was used, if any (except for configurations with None in the language representation). For each language pair, the best results are marked with † and the second best with ‡.

Question Answering.

Our experiments (see Table 4) revealed that the configuration of a soft language prompt and task adapter achieved the highest performance in many cases in the QA task when transferring to low-resource languages, with only small differences across languages. For Slovak, this configuration was particularly effective, while for Telugu, the task adapter without language representation outperformed other configurations. This suggests that the complexity of the target language cannot be sufficiently modeled based on the small number of Wikipedia articles in Telugu.

In addition to investigating the effects of individual configurations, we also evaluated the improvement of a soft language prompt combined with a task adapter over the original mT0-Base model without any language or task representations. Relative F1-Score improvements are illustrated in Figure 3 demonstrating that training and evaluating task adapters in the same language provides the most evident improvement. Furthermore, when transferring knowledge from high-resource to low-resource languages, the positive transfer was common, except for Telugu, which uses a different script. We conjecture that the cross-lingual transfer depends on the script used for the language.
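For reference, a relative improvement over the baseline can be computed as in the short sketch below. The exact definition behind Figure 3 is not spelled out here, so the percentage change with respect to the baseline score is an assumption, and the numbers in the example are made up.

```python
def relative_improvement(score, baseline):
    """Percentage change of `score` with respect to `baseline`
    (assumed definition; positive values mean positive transfer)."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (score - baseline) / baseline


# Made-up F1 scores, purely for illustration.
print(relative_improvement(score=60.0, baseline=50.0))
```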

Cross-lingual transfer from low-resource languages to all others did not improve the baseline performance of the mT0-Base model, with the overall performance on the QA task declining. There is only one exception: When employing a task adapter trained on the Czech version of the SQuAD dataset, the results for Slovak improved, probably due to their linguistic similarities.

Named-Entity Recognition.

In the case of the NER task, German and Spanish, among high-resource languages, performed best in cross-lingual transfer to low-resource languages, while English performed poorly. However, based on the results in Table 5, the best improvements were observed using a soft language prompt with a task adapter, outperforming the combination of language and task adapters for Spanish and German. This is especially the case for Telugu, where the difference between these two configurations is more than 33% in favor of the combination of soft language prompt and task adapter using Spanish training data.

Natural Language Inference.

The cross-lingual evaluation of the NLI task from Table 6 demonstrated the effectiveness of some proposed configurations for knowledge transfer. In particular, we mostly achieve superior results using the combination of language adapters along with soft task prompts in Czech and Slovak as target languages. This is not the case for Telugu, where without employing language representations, we consistently achieved the highest performance. This observation confirms previously identified findings on the question answering task. Furthermore, the high effect on the Telugu language observed in the cross-lingual evaluation is probably due to the fact that Telugu has been involved during instruction fine-tuning of mT5 to create the mT0 model and therefore the LLM is able to adapt better.

Examining the six proposed configurations for transferring knowledge from a low-resource language to all others, we observed similar findings as in the case of Telugu, where configurations without language representations have the best performance, while still not outperforming the results obtained without any language and task representations: Overall performance mostly decreased.

Check-Worthy Claim Detection.

For check-worthy claim detection, the configuration of soft language prompts and task adapters surpassed previous methods in many cases (see Table 7). This configuration proved to be effective across most language pairs, demonstrating the model’s improved ability also in some fact-checking tasks, such as check-worthy claim detection. However, cross-lingual transfer from Slovak to other languages did not yield improvements in using language representations, suggesting that training task representations in combination with Slovak language representation was less effective for this task, similar to natural language inference.

6 Discussion

Based on our experiments and the results obtained, we make several observations, which are summarized below.

Prompt Tuning Performs Better with Fine-Tuned Models.

During our preliminary experiments in model selection, we observed that prompt tuning is not able to improve the downstream-task results of pre-trained LLMs (e.g., mT5) that were trained only on unlabelled data. In contrast, prompt tuning can enhance the performance of LLMs that have already been fine-tuned on labelled data, even when the tasks we want to evaluate were not part of that fine-tuning. We confirmed this observation in our experiments with NER and check-worthy claim detection, where we obtained superior results even without the LLM being previously trained on them.

Soft Language Prompt with Task Adapter Outperformed Baselines in Many Cases.

Our proposed method of combining soft language prompts with task adapters demonstrated better performance in many cases, compared to the approach of combined language and task adapters, which has been shown to be very effective in previous work. Specifically, the combination of soft language prompts and task adapters is most effective on the classification tasks, achieving superior results most often. For languages with a different script (e.g., Spanish and Telugu), these differences were over 20%.

Language Representations are Unable to Capture Linguistic Characteristics from a Small Amount of Unlabelled Data.

Language representations have several limitations, which led to configurations without language representations performing consistently better on cross-lingual transfer to Telugu on the QA, NLI and check-worthy claim detection tasks. We postulate that the reason is the small number of Wikipedia articles on which the language representations were trained, which left them unable to adequately capture the linguistic characteristics of the language.

7 Conclusion

Our study provides a comprehensive evaluation of various configurations of adapters and soft prompts for cross-lingual transfer in low-resource languages. Through the systematic evaluation of task adapters, soft task prompts, and combinations of language and task representations, we identified configurations that positively affect an LLM’s performance across different tasks and languages. Our findings show that the combination of soft language prompts and task adapters is an effective alternative for transferring knowledge from high-resource to low-resource languages and between linguistically similar languages, such as Czech and Slovak. Furthermore, our findings provide valuable insights into combining PEFT methods for cross-lingual transfer, while highlighting the need to incorporate language-specific knowledge.

Limitations

Model Selection.

Our analysis of the effectiveness of the language and task representations focused on highly multilingual LLMs that include a wider variety of low-resource languages. From this perspective, few open-source multilingual LLMs offer language coverage as extensive as the mT5 or BLOOM models while having fewer than 1B parameters. Another aspect of the selection was that we involved only generative models with encoder-decoder or decoder-only architectures.

Other Languages.

In selecting appropriate languages, we were limited by the languages covered by the mT5 model. To select high-resource languages, we considered languages that have the most extensive available resources and are written in Latin script. When selecting low-resource languages, on the other hand, we also considered languages written in non-Latin scripts as well as the availability of datasets in those languages (both human-annotated and machine-translated).

Other Tasks.

The tasks in our experiments were selected based on the availability of datasets for each selected language and also to cover multiple areas of the NLP domain, i.e., reading comprehension, fact-checking, and recognizing textual entailment. We mostly considered tasks involved in the instruction fine-tuning of the mT0 model, but we also included tasks that were not originally used to train the mT0 model.

Acknowledgements

This research was partially supported by DisAI - Improving scientific excellence and creativity in combating disinformation with artificial intelligence and language technologies, a project funded by Horizon Europe under GA No. 101079164, and by MIMEDIS, a project funded by the Slovak Research and Development Agency under GA No. APVV-21-0114. This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90254).

References

(1) CS ANLI. https://huggingface.co/datasets/ctu-aic/anli_cs. Accessed: 2024-05-30.
Aggarwal et al. (2022) Divyanshu Aggarwal, Vivek Gupta, and Anoop Kunchukuttan. 2022. IndicXNLI: Evaluating multilingual inference for Indian languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10994–11006, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Ansell et al. (2021) Alan Ansell, Edoardo Maria Ponti, Jonas Pfeiffer, Sebastian Ruder, Goran Glavaš, Ivan Vulić, and Anna Korhonen. 2021. MAD-G: Multilingual adapter generation for efficient cross-lingual transfer. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4762–4781, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Asai et al. (2022) Akari Asai, Mohammadreza Salehi, Matthew Peters, and Hannaneh Hajishirzi. 2022. ATTEMPT: Parameter-efficient multi-task tuning via attentional mixtures of soft prompts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6655–6672, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
Conneau et al. (2018) Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. In Advances in Neural Information Processing Systems, volume 36, pages 10088–10115. Curran Associates, Inc.
Hládek et al. (2023) Daniel Hládek, Ján Staš, Jozef Juhár, and Tomáš Koctúr. 2023. Slovak dataset for multilingual question answering. IEEE Access, 11:32869–32881.
Houlsby et al. (2019a) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019a. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.
Houlsby et al. (2019b) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019b. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.
Hu et al. (2020) Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4411–4421. PMLR.
Hyben et al. (2023) Martin Hyben, Sebastian Kula, Ivan Srba, Robert Moro, and Jakub Simko. 2023. Is it indeed bigger better? The comprehensive study of claim detection LMs applied for disinformation tackling. Preprint, arXiv:2311.06121.
Klimaszewski et al. (2024) Mateusz Klimaszewski, Piotr Andruszkiewicz, and Alexandra Birch. 2024. No train but gain: Language arithmetic for training-free language adapters enhancement. Preprint, arXiv:2404.15737.
Kunz and Holmström (2024) Jenny Kunz and Oskar Holmström. 2024. The impact of language adapters in cross-lingual transfer for NLU. Preprint, arXiv:2402.00149.
Lee et al. (2022) Jaeseong Lee, Seung-won Hwang, and Taesup Kim. 2022. FAD-X: Fusing adapters for cross-lingual transfer to low-resource languages. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 57–64, Online only. Association for Computational Linguistics.
Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Lewis et al. (2019) Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. MLQA: Evaluating cross-lingual extractive question answering. Preprint, arXiv:1910.07475.
Liu and Huang (2023) Lei Liu and Jimmy Xiangji Huang. 2023. Prompt learning to mitigate catastrophic forgetting in cross-lingual transfer for open-domain dialogue generation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, page 2287–2292, New York, NY, USA. Association for Computing Machinery.
Luo et al. (2024) Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. 2024. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. Preprint, arXiv:2308.08747.
Macková and Straka (2020) Kateřina Macková and Milan Straka. 2020. Reading comprehension in Czech via machine translation and cross-lingual transfer. In Text, Speech, and Dialogue, pages 171–179, Cham. Springer International Publishing.
McCloskey and Cohen (1989) Michael McCloskey and Neal J. Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. volume 24 of Psychology of Learning and Motivation, pages 109–165. Academic Press.
Muennighoff et al. (2023) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2023. Crosslingual generalization through multitask finetuning. Preprint, arXiv:2211.01786.
Parović et al. (2022) Marinela Parović, Goran Glavaš, Ivan Vulić, and Anna Korhonen. 2022. BAD-X: Bilingual adapters improve zero-shot cross-lingual transfer. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1791–1799, Seattle, United States. Association for Computational Linguistics.
Pfeiffer et al. (2022) Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe. 2022. Lifting the curse of multilinguality by pre-training modular transformers. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3479–3495, Seattle, United States. Association for Computational Linguistics.
Pfeiffer et al. (2023) Jonas Pfeiffer, Francesco Piccinno, Massimo Nicosia, Xinyi Wang, Machel Reid, and Sebastian Ruder. 2023. mmT5: Modular multilingual pre-training solves source language hallucinations. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1978–2008, Singapore. Association for Computational Linguistics.
Pfeiffer et al. (2020) Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics.
Pikuliak et al. (2023) Matúš Pikuliak, Ivan Srba, Robert Moro, Timo Hromadka, Timotej Smolen, Martin Melisek, Ivan Vykopal, Jakub Simko, Juraj Podrouzek, and Maria Bielikova. 2023. Multilingual previously fact-checked claim retrieval. Preprint, arXiv:2305.07991.
Pikuliak et al. (2021) Matúš Pikuliak, Marián Šimko, and Mária Bieliková. 2021. Cross-lingual learning for text processing: A survey. Expert Systems with Applications, 165:113765.
Ponti et al. (2020) Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. Preprint, arXiv:2005.00333.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1).
Rahimi et al. (2019) Afshin Rahimi, Yuan Li, and Trevor Cohn. 2019. Massively multilingual transfer for NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 151–164, Florence, Italy. Association for Computational Linguistics.
Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Ren et al. (2024) Weijieying Ren, Xinlong Li, Lei Wang, Tianxiang Zhao, and Wei Qin. 2024. Analyzing and reducing catastrophic forgetting in parameter efficient tuning. Preprint, arXiv:2402.18865.
Vemula et al. (2022) Rakesh Vemula, Mani Nuthi, and Manish Srivastava. 2022. TeQuAD: Telugu question answering dataset. In Proceedings of the 19th International Conference on Natural Language Processing (ICON), pages 300–307, New Delhi, India. Association for Computational Linguistics.
Vu et al. (2022a) Tu Vu, Aditya Barua, Brian Lester, Daniel Cer, Mohit Iyyer, and Noah Constant. 2022a. Overcoming catastrophic forgetting in zero-shot cross-lingual generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9279–9300, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Vu et al. (2022b) Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou’, and Daniel Cer. 2022b. SPoT: Better frozen model adaptation through soft prompt transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5039–5059, Dublin, Ireland. Association for Computational Linguistics.
Workshop et al. (2023) BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco De Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro Von Werra, Leon Weber, Long Phan, Loubna Ben allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad A. Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor, Stanislav Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Davut Emre Taşar, Elizabeth Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-shaibani, Matteo Manica, Nihal Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, Stephen H. Bach, Taewoon Kim, Tali Bers, Thibault Fevry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiangru Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Yallow Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan Garrette, Deepak Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Genta Indra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar van der Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shachar Mirkin, Shani Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh HajiHosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos Muñoz Ferrandis, Daniel McDuff, Danish Contractor, David Lansky, Davis David, Douwe Kiela, Duong A. Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu, Habib Rezanejad, Hessie Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jesse Passmore, Josh Seltzer, Julio Bonis Sanz, Livia Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael McKenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Ran An, Rasmus Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas Wang, Sourav Roy, Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo Yang, Zach Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Singh, Benjamin Beilharz, Bo Wang, Caio Brito, Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas Golde, Jose David Posada, Karthik Rangasai Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, Maria A Castillo, Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael Cullan, Michael Weinberg, Michiel De Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter, Sushil Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada, and Thomas Wolf. 2023. BLOOM: A 176B-parameter open-access multilingual language model. Preprint, arXiv:2211.05100.
Wu and Dredze (2020) Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? Preprint, arXiv:2005.09093.
Xie et al. (2024) Zhihui Xie, Handong Zhao, Tong Yu, and Shuai Li. 2024. Discovering low-rank subspaces for language-agnostic multilingual representations. Preprint, arXiv:2401.05792.
Xu et al. (2023) Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, and Fu Lee Wang. 2023. Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. Preprint, arXiv:2312.12148.
Xue et al. (2021) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. Preprint, arXiv:2010.11934.
Zhang et al. (2023) Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. 2023. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. Preprint, arXiv:2303.10512.
Üstün et al. (2024) Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker. 2024. Aya model: An instruction finetuned open-access multilingual language model. Preprint, arXiv:2402.07827.

Appendix A Computational Resources

For our experiments, we used a computational infrastructure consisting of NVIDIA A10 and A40 GPUs, with experiments running in parallel on multiple GPUs.

Appendix B Prompts Used

For the encoder-decoder model, each record from every dataset needs to be transformed into a text-to-text format. To choose an appropriate prompt format, we experimented with all the prompts used in the mT0 paper Muennighoff et al. (2023) and with the prompts used in the T5 paper Raffel et al. (2020). The prompts that achieved the best performance during inference with the mT0-Base model were selected for transforming the records into a text-to-text format. The following paragraphs list the prompts used for the individual tasks.

B.1 Question Answering

Template: question: {question} context: {context}

B.2 Natural Language Inference

Template: {premise} \n\n Question: Does this imply that "{hypothesis}"? Yes, no, or maybe?

B.3 Named-Entity Recognition

Template: tag: {text}

B.4 Check-Worthy Claim Detection

Template: checkworthiness claim: {claim}
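As an illustration of how the templates in B.1 to B.4 could be applied, the following minimal sketch renders a dataset record into its text-to-text input. The task keys (`qa`, `nli`, `ner`, `checkworthy`), the function name `to_text_input`, and the record field names (taken from the placeholders in the templates) are assumptions made for this example, as is the reading of `\n\n` in the NLI template as two newline characters; this is not the code used in our experiments.

```python
# Hypothetical task keys mapped to the templates listed in Appendix B.
TEMPLATES = {
    "qa": "question: {question} context: {context}",
    # Assumption: "\n\n" in the NLI template denotes two newline characters.
    "nli": '{premise} \n\n Question: Does this imply that "{hypothesis}"? Yes, no, or maybe?',
    "ner": "tag: {text}",
    "checkworthy": "checkworthiness claim: {claim}",
}


def to_text_input(task: str, record: dict) -> str:
    """Fill the template of `task` with the fields of `record`.

    Extra fields in `record` (e.g. labels) are ignored by str.format.
    """
    return TEMPLATES[task].format(**record)


if __name__ == "__main__":
    example = {"question": "Where is KInIT located?", "context": "KInIT is in Bratislava."}
    print(to_text_input("qa", example))
```

Keeping the templates in a single mapping makes it easy to swap in a different prompt per task while the rest of the data pipeline stays unchanged.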

Appendix C Soft Prompt Initialization

This section lists the templates used to initialize the soft prompts for each language and each task. Templates are divided into language and task templates.

C.1 Language Templates

To train language representation using a language modeling objective, we employed a specific prompt that varied only based on the language present in the instruction, leaving the rest of the instruction the same.

The template we used for initialization is as follows: "Generate the output in {Language}:", where Language is replaced by the desired language.

C.2 Task Templates

The following are initialization prompt templates for each task, where the instruction depends not only on the task but also on the language.

Question Answering.

For the question answering task, we used "Answer the question in {Language} language:", where Language is replaced by the desired language.

Natural Language Inference.

Natural language inference is the task of assessing whether a hypothesis logically follows from the premise. It is defined as a classification with three possible classes: entailment, contradiction or neutral. However, following previous work and the instruction tuning of the mT0 model, we replaced the above-mentioned classes with Yes, No and Maybe, in line with the prompt template used.

According to the employed classes, we defined an initialization prompt as follows: "Select Yes, No or Maybe based on the implication of the premise on the hypothesis in {Language}:", where Language is replaced by the desired language.

Named-Entity Recognition.

The named-entity recognition task aims to identify named entities within the input text. While there are many possible categories, the WikiANN dataset focuses only on detecting three categories: location (LOC), person (PER) and organization (ORG). Based on the defined classes, we created the initialization prompt as follows: "Identify NER tags (ORG, PER, LOC) in the text in {Language}:", where Language is substituted with the specific language.

Check-Worthy Claim Detection.

The last task is check-worthy claim detection, which is a binary classification of whether the given claim is worthy of fact-checking or not. As text labels, we used Not checkworthy and Checkworthy. This is the initialization prompt for the check-worthy claim detection task: "Determine whether a given claim in {Language} is checkworthy:", where Language is replaced by the desired language.
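The initialization templates in this appendix can be collected in a small helper, sketched below. The names `init_text` and `fit_to_length` and the task keys are hypothetical. The second function illustrates one common way to fit an initialization text to a fixed soft prompt length (repeating the token ids and then truncating); we state this only as an assumption for illustration, not as a description of the exact procedure used in our experiments, and the token ids in the usage example are made up.

```python
import math

# Initialization templates from Appendix C.
LANGUAGE_INIT = "Generate the output in {Language}:"
TASK_INIT = {
    "qa": "Answer the question in {Language} language:",
    "nli": ("Select Yes, No or Maybe based on the implication of the premise "
            "on the hypothesis in {Language}:"),
    "ner": "Identify NER tags (ORG, PER, LOC) in the text in {Language}:",
    "checkworthy": "Determine whether a given claim in {Language} is checkworthy:",
}


def init_text(kind: str, language: str) -> str:
    """Return the initialization text for a language prompt or a task prompt."""
    template = LANGUAGE_INIT if kind == "language" else TASK_INIT[kind]
    return template.format(Language=language)


def fit_to_length(token_ids: list, num_virtual_tokens: int = 50) -> list:
    """Repeat and truncate token ids so that exactly `num_virtual_tokens` remain.

    Assumption: short texts are tiled and long texts are cut off.
    """
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    repeats = math.ceil(num_virtual_tokens / len(token_ids))
    return (token_ids * repeats)[:num_virtual_tokens]


if __name__ == "__main__":
    print(init_text("language", "Slovak"))
    print(fit_to_length([11, 12, 13], num_virtual_tokens=7))
```

The resulting token ids would then be looked up in the embedding matrix of the frozen model to obtain the initial soft prompt vectors.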

Appendix D Hyperparameters

Table 3 shows hyperparameters used for training language and task representations using adapters and soft prompts.

Hyperparameters	Language Modeling		Task Modeling
Hyperparameters	Language Adapter	Soft Language Prompt	Task Adapter	Soft Task Prompt
Learning rate	5e-5	5e-1	5e-5	5e-1
Weight decay	0	1e-5	0	1e-5
Batch size	32	32	32	32
No. Training steps	100,000	100,000	50,000	50,000
Optimizer	AdamW	Adafactor	AdamW	Adafactor
Evaluation steps	500	500	1000	1000
Max input length	256	256	256	256
Token size of soft prompt	NaN	50	NaN	50

Table 3: Final parameters employed to train language and task representation using adapters and soft prompts.
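For readers who prefer a machine-readable form, the values in Table 3 can be written down as a configuration object, as in the sketch below. The class name `TrainConfig`, its field names, and the dictionary keys are hypothetical and chosen only for this example; the numbers are copied from Table 3, with `None` standing in for the entries marked NaN.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float
    weight_decay: float
    batch_size: int
    training_steps: int
    optimizer: str
    eval_steps: int
    max_input_length: int
    soft_prompt_tokens: Optional[int]  # None for adapters (NaN in Table 3)


# One entry per column of Table 3.
CONFIGS = {
    "language_adapter": TrainConfig(5e-5, 0.0, 32, 100_000, "AdamW", 500, 256, None),
    "soft_language_prompt": TrainConfig(5e-1, 1e-5, 32, 100_000, "Adafactor", 500, 256, 50),
    "task_adapter": TrainConfig(5e-5, 0.0, 32, 50_000, "AdamW", 1000, 256, None),
    "soft_task_prompt": TrainConfig(5e-1, 1e-5, 32, 50_000, "Adafactor", 1000, 256, 50),
}

if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        # Number of training examples processed = steps * batch size.
        print(name, cfg.training_steps * cfg.batch_size)
```

Note the large gap between the learning rates: soft prompts are trained with 5e-1, four orders of magnitude above the 5e-5 used for adapters.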

Appendix E Cross-Lingual Evaluation For All Six Languages

Tables 4 to 7 present the results for all language pairs, i.e., each language serves as the source language and is paired with every language, including itself. The first row in each table represents the scores obtained by inference of the original mT0-Base model without additional training of language or task representations.

Task Language	Language Representation	Task Representation	English	German	Spanish	Czech	Slovak	Telugu
None			62.80 (55.88)	35.4 (27.82)	39.54 (27.48)	31.34 (24.78)	26.39 (9.78)	18.64 (12.10)
English	None	Adapter	\ul{66.48 (59.19)}	40.55 (30.98)	\ul{41.85 (28.04)}	36.95 (28.57)	\ul{30.11 (11.46)}	\ul{19.65 (12.70)}
English	None	Soft Prompt	66.09 (58.79)	39.22 (29.60)	41.12 (28.09)	33.59 (25.55)	27.76 (10.07)	19.38 (12.80)
English	Adapter	Adapter	65.96 (58.91)	34.88 (24.83)	39.18 (26.59)	33.75 (24.33)	28.98 (10.58)	9.85 (5.00)
English	Adapter	Soft Prompt	64.70 (57.86)	39.65 (30.16)	39.12 (27.10)	33.94 (26.22)	29.82 (11.54)	12.67 (6.60)
English	Soft Prompt	Adapter	68.12 (60.95)	39.23 (29.44)	42.86 (29.50)	\ul{35.52 (27.27)}	30.84 (11.73)	20.19 (13.50)
English	Soft Prompt	Soft Prompt	66.19 (58.90)	\ul{40.49 (30.98)}	41.15 (27.97)	35.35 (27.02)	29.89 (11.39)	11.71 (7.30)
German	None	Adapter	65.70 (58.52)	48.07 (38.14)	\ul{41.41 (29.28)}	35.37 (27.29)	27.51 (10.16)	18.81 (12.80)
German	None	Soft Prompt	63.15 (55.20)	44.53 (33.85)	38.34 (26.65)	28.56 (21.31)	24.54 (8.92)	12.46 (9.70)
German	Adapter	Adapter	61.52 (51.29)	\ul{48.00 (38.01)}	38.36 (25.78)	\ul{36.78 (27.82)}	31.43 (11.35)	11.32 (6.00)
German	Adapter	Soft Prompt	61.13 (52.17)	43.30 (33.49)	38.95 (27.24)	38.13 (31.20)	\ul{30.70 (11.98)}	16.05 (8.70)
German	Soft Prompt	Adapter	\ul{65.02 (57.44)}	46.78 (36.69)	41.49 (29.10)	31.81 (24.29)	27.42 (10.16)	\ul{17.13 (11.20)}
German	Soft Prompt	Soft Prompt	63.86 (56.48)	44.79 (35.07)	38.45 (27.19)	32.68 (24.65)	27.95 (10.30)	12.34 (9.50)
Spanish	None	Adapter	65.81 (58.73)	39.33 (29.90)	44.64 (32.11)	\ul{33.72 (25.43)}	27.24 (9.71)	19.10 (12.50)
Spanish	None	Soft Prompt	61.01 (53.41)	33.90 (25.69)	41.86 (29.15)	25.98 (19.06)	22.43 (7.62)	11.94 (9.00)
Spanish	Adapter	Adapter	58.59 (48.30)	37.98 (27.98)	44.49 (31.62)	33.98 (23.88)	28.66 (10.01)	10.76 (5.00)
Spanish	Adapter	Soft Prompt	59.41 (50.01)	\ul{38.20 (28.67)}	40.33 (28.18)	32.86 (24.45)	\ul{27.82 (10.35)}	10.89 (5.40)
Spanish	Soft Prompt	Adapter	\ul{64.97 (57.50)}	38.02 (29.29)	\ul{44.58 (32.01)}	32.43 (24.92)	27.95 (10.27)	\ul{17.64 (11.40)}
Spanish	Soft Prompt	Soft Prompt	60.51 (52.55)	33.71 (25.74)	42.16 (29.34)	27.21 (19.82)	21.49 (7.49)	9.28 (7.20)
Czech	None	Adapter	58.61 (53.43)	\ul{34.46 (27.64)}	\ul{36.06 (26.76)}	38.6 (32.96)	29.5 (11.51)	13.06 (10.20)
Czech	None	Soft Prompt	51.65 (47.37)	28.09 (22.94)	28.84 (22.48)	32.99 (28.35)	25.57 (10.59)	8.45 (6.70)
Czech	Adapter	Adapter	52.89 (46.49)	29.35 (24.39)	26.65 (20.50)	36.78 (31.69)	33.53 (13.34)	7.17 (4.90)
Czech	Adapter	Soft Prompt	39.55 (36.05)	35.59 (27.55)	20.21 (16.64)	29.53 (25.88)	26.64 (11.17)	7.31 (4.70)
Czech	Soft Prompt	Adapter	\ul{57.90 (52.96)}	33.69 (27.34)	37.05 (27.07)	\ul{38.15 (33.02)}	\ul{30.42 (12.13)}	\ul{12.84 (9.90)}
Czech	Soft Prompt	Soft Prompt	49.42 (45.53)	27.63 (22.90)	29.51 (23.32)	32.32 (27.78)	27.82 (11.48)	7.15 (6.40)
Slovak	None	Adapter	48.84 (34.31)	26.72 (15.70)	\ul{28.24 (13.96)}	26.65 (12.69)	39.13 (19.89)	13.46 (6.90)
Slovak	None	Soft Prompt	40.64 (27.59)	19.37 (10.49)	22.13 (10.38)	20.56 (8.39)	35.12 (16.53)	8.64 (5.80)
Slovak	Adapter	Adapter	33.25 (21.76)	20.47 (11.55)	12.63 (7.93)	17.56 (9.06)	38.43 (19.03)	7.16 (1.70)
Slovak	Adapter	Soft Prompt	9.06 (6.93)	21.49 (13.21)	6.80 (5.06)	16.06 (9.55)	33.93 (17.05)	1.95 (0.80)
Slovak	Soft Prompt	Adapter	\ul{47.31 (32.55)}	\ul{24.80 (14.99)}	28.45 (13.95)	\ul{23.84 (10.82)}	\ul{38.52 (19.50)}	\ul{10.90 (6.00)}
Slovak	Soft Prompt	Soft Prompt	39.06 (27.90)	15.72 (9.51)	21.51 (11.25)	19.09 (8.22)	34.84 (17.30)	6.71 (4.40)
Telugu	None	Adapter	60.78 (51.87)	32.43 (23.61)	\ul{37.35 (24.99)}	27.34 (20.47)	\ul{24.14 (8.77)}	21.12 (13.80)
Telugu	None	Soft Prompt	59.22 (49.80)	\ul{31.01 (21.92)}	35.69 (23.41)	23.49 (17.63)	22.60 (8.40)	19.73 (12.90)
Telugu	Adapter	Adapter	20.23 (10.33)	12.71 (5.55)	14.91 (6.29)	13.66 (5.82)	14.08 (3.59)	21.89 (14.80)
Telugu	Adapter	Soft Prompt	37.55 (20.31)	24.72 (7.09)	26.72 (12.79)	24.01 (13.31)	27.49 (0.09)	20.73 (14.80)
Telugu	Soft Prompt	Adapter	\ul{59.63 (50.76)}	30.22 (22.38)	38.13 (25.99)	\ul{26.64 (19.80)}	24.39 (9.08)	\ul{21.75 (14.30)}
Telugu	Soft Prompt	Soft Prompt	57.72 (47.93)	30.38 (21.49)	35.26 (22.74)	24.58 (18.14)	23.14 (8.60)	18.73 (11.60)

Table 4: Results for the question answering task for all language pairs. The reported results are in the format: F1-Score (Exact Match). The best results for each combination of source and target languages are in bold and the second best scores are underlined.
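To make the two numbers reported in each cell of Table 4 concrete, the sketch below computes a token-overlap F1-Score and Exact Match for a single prediction. It assumes SQuAD-style definitions with a simple normalization (lowercasing and whitespace collapsing); the exact normalization applied in our evaluation is not specified in this appendix, so the function names and normalization steps here are illustrative assumptions only.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed, simplified normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(f1_score("in Bratislava", "Bratislava"))
    print(exact_match("in Bratislava", "Bratislava"))
```

A prediction that contains the reference answer plus extra tokens receives partial F1 credit but no Exact Match credit, which is why the Exact Match values in Table 4 are consistently lower than the F1 values.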

Task Language	Language Representation	Task Representation	English	German	Spanish	Czech	Slovak	Telugu
None
English	None	Adapter	84.78	38.69	44.54	41.53	44.15	32.02
English	None	Soft Prompt	81.66	43.75	\ul{59.83}	\ul{42.31}	45.40	\ul{33.41}
English	Adapter	Adapter	\ul{82.78}	22.46	32.75	33.99	23.18	13.91
English	Adapter	Soft Prompt	78.62	47.56	60.21	40.62	35.97	19.98
English	Soft Prompt	Adapter	82.47	50.78	51.48	40.65	51.62	31.23
English	Soft Prompt	Soft Prompt	81.02	\ul{48.23}	56.74	44.33	\ul{48.64}	33.85
German	None	Adapter	27.79	\ul{82.58}	\ul{73.37}	\ul{66.61}	67.96	48.56
German	None	Soft Prompt	36.02	79.06	69.66	61.04	62.28	43.20
German	Adapter	Adapter	29.66	82.42	70.05	64.70	68.42	24.52
German	Adapter	Soft Prompt	38.96	73.06	68.86	54.79	51.17	27.96
German	Soft Prompt	Adapter	\ul{37.56}	83.95	75.23	67.12	71.04	49.70
German	Soft Prompt	Soft Prompt	36.52	78.71	69.85	62.84	\ul{70.80}	\ul{48.67}
Spanish	None	Adapter	44.82	63.55	88.60	68.66	70.46	\ul{46.24}
Spanish	None	Soft Prompt	44.38	62.65	86.23	63.95	63.25	47.81
Spanish	Adapter	Adapter	40.50	56.12	\ul{87.29}	58.05	66.00	17.84
Spanish	Adapter	Soft Prompt	40.43	\ul{64.47}	82.79	52.71	60.82	31.27
Spanish	Soft Prompt	Adapter	\ul{51.46}	63.59	89.05	\ul{67.45}	\ul{68.78}	50.91
Spanish	Soft Prompt	Soft Prompt	59.55	65.73	83.77	65.55	66.04	40.59
Czech	None	Adapter	\ul{40.14}	67.36	75.08	85.18	75.02	\ul{48.80}
Czech	None	Soft Prompt	37.61	63.85	65.80	81.50	72.02	44.67
Czech	Adapter	Adapter	33.68	\ul{70.50}	70.87	\ul{86.19}	75.76	23.98
Czech	Adapter	Soft Prompt	31.33	69.45	64.34	78.75	57.95	17.34
Czech	Soft Prompt	Adapter	42.74	70.94	73.11	86.67	79.74	49.90
Czech	Soft Prompt	Soft Prompt	36.96	67.94	\ul{75.04}	81.17	\ul{76.89}	48.29
Slovak	None	Adapter	\ul{39.96}	68.75	67.23	\ul{77.79}	\ul{88.68}	\ul{53.53}
Slovak	None	Soft Prompt	38.34	64.19	69.09	72.93	83.50	50.55
Slovak	Adapter	Adapter	35.70	66.58	69.08	75.51	88.36	22.37
Slovak	Adapter	Soft Prompt	37.21	64.83	65.96	70.12	79.49	30.93
Slovak	Soft Prompt	Adapter	45.37	72.88	75.57	79.55	89.01	54.20
Slovak	Soft Prompt	Soft Prompt	37.70	\ul{69.20}	\ul{70.15}	73.76	83.62	51.36
Telugu	None	Adapter	48.85	\ul{56.93}	\ul{56.52}	49.21	\ul{50.02}	71.05
Telugu	None	Soft Prompt	37.61	45.01	45.51	38.78	38.30	74.44
Telugu	Adapter	Adapter	32.16	41.52	34.16	35.73	36.74	69.48
Telugu	Adapter	Soft Prompt	33.57	38.76	47.26	43.49	19.21	72.04
Telugu	Soft Prompt	Adapter	\ul{46.60}	57.20	61.52	\ul{48.75}	53.31	71.97
Telugu	Soft Prompt	Soft Prompt	33.79	48.34	42.41	41.52	43.26	70.46

Table 5: Results (F1-Score) for the named-entity recognition task over the Cartesian product of languages, where each language serves as both the source and the target language. The best scores are boldfaced, and the second best are underlined.

Task Language

Language

Representation

Task

Representation

English

German

Spanish

Czech

Slovak

Telugu

None

44.31

43.49

43.07

35.50

36.42

39.58

English

None

Adapter

81.82

73.75

\ul76.07

35.00

34.58

\ul69.38

Soft Prompt

74.89

67.98

71.20

33.83

34.58

63.43

Adapter

\ul82.30

42.95

56.25

\ul35.33

35.08

44.11

Soft Prompt

68.30

52.71

66.29

37.75

\ul37.25

35.37

Soft Prompt

Adapter

82.34

\ul73.71

76.47

33.83

35.58

\ul67.88

Soft Prompt

73.19

66.97

69.00

35.17

37.33

63.35

German

None

Adapter

\ul79.12

76.19

75.59

35.17

35.92

68.90

Soft Prompt

68.90

69.76

67.78

35.50

\ul37.00

64.45

Adapter

67.37

\ul76.51

75.91

34.92

35.50

66.81

Soft Prompt

61.10

64.95

60.04

37.75

36.67

46.45

Soft Prompt

Adapter

79.52

76.99

\ul75.77

33.75

34.50

\ul67.13

Soft Prompt

70.32

67.68

68.88

\ul36.00

37.67

64.37

Spanish

None

Adapter

80.12

74.53

77.98

35.08

35.17

\ul70.00

Soft Prompt

71.98

68.52

70.74

34.83

35.75

64.11

Adapter

71.66

66.43

77.54

34.92

35.83

59.40

Soft Prompt

66.85

56.57

65.29

38.75

36.42

33.49

Soft Prompt

Adapter

\ul77.39

\ul71.22

\ul77.64

33.92

36.25

\ul67.33

Soft Prompt

69.42

66.33

69.12

\ul35.92

\ul36.33

63.19

Czech

None

Adapter

\ul41.32

37.74

41.82

34.00

34.83

\ul38.74

Soft Prompt

38.66

\ul36.49

36.57

39.00

36.58

37.54

Adapter

33.43

33.37

33.35

33.75

33.50

33.33

Soft Prompt

33.33

34.57

32.87

36.75

36.92

33.33

Soft Prompt

Adapter

33.57

34.31

34.61

33.12

\ul36.67

33.33

Soft Prompt

42.28

34.63

\ul37.33

\ul37.75

36.33

39.10

Slovak

None

Adapter

47.60

43.93

46.25

\ul35.08

34.67

42.32

Soft Prompt

38.66

35.85

\ul36.13

36.50

\ul35.92

36.03

Adapter

33.33

32.73

33.42

34.58

33.33

Soft Prompt

\ul39.52

\ul42.14

33.43

32.67

36.58

34.53

Soft Prompt

Adapter

33.57

33.53

33.57

33.42

33.83

33.33

Soft Prompt

37.09

33.23

32.57

33.58

35.00

\ul36.57

Telugu

None

Adapter

\ul76.63

71.76

73.41

35.58

\ul36.50

\ul72.67

Soft Prompt

70.34

67.90

68.18

\ul36.00

34.83

68.20

Adapter

63.49

60.14

72.77

35.17

36.00

72.69

Soft Prompt

52.44

49.88

49.96

38.17

37.75

59.66

Soft Prompt

Adapter

76.89

\ul71.20

\ul73.15

34.75

33.92

72.59

Soft Prompt

67.13

67.31

66.73

\ul37.08

\ul36.50

66.89

Table 6: Results for the NLI task for all language pairs, reported as accuracy. The best results for each language pair are highlighted in bold and the second best are underlined.

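| Task Language | Language Representation | Task Representation | English | German | Spanish | Czech | Slovak | Telugu |
|---|---|---|---|---|---|---|---|---|
| None | | | | | | | | |
| English | None | Adapter | 99.45 | 96.90 | \ul{97.07} | 85.10 | 83.09 | \ul{87.46} |
| English | None | Soft Prompt | 99.06 | 89.18 | 93.29 | 55.29 | 53.03 | 93.50 |
| English | Adapter | Adapter | \ul{99.59} | \ul{97.80} | 95.79 | 82.95 | 85.98 | 85.99 |
| English | Adapter | Soft Prompt | 99.10 | 95.40 | 94.75 | 40.60 | 46.60 | 81.91 |
| English | Soft Prompt | Adapter | 99.76 | 98.30 | 97.14 | \ul{83.47} | \ul{85.49} | 77.34 |
| English | Soft Prompt | Soft Prompt | 99.55 | 95.05 | 93.46 | 75.72 | 70.08 | 79.55 |
| German | None | Adapter | \ul{98.96} | | \ul{97.21} | 69.72 | 73.70 | 93.73 |
| German | None | Soft Prompt | 98.43 | 98.89 | 96.16 | 65.32 | 62.31 | 91.54 |
| German | Adapter | Adapter | 98.60 | 99.00 | 97.16 | 78.84 | \ul{82.91} | 79.96 |
| German | Adapter | Soft Prompt | 97.77 | 97.65 | 96.06 | 58.68 | 67.99 | \ul{93.06} |
| German | Soft Prompt | Adapter | 99.28 | 99.14 | 97.90 | \ul{74.94} | 82.96 | 89.64 |
| German | Soft Prompt | Soft Prompt | 87.91 | 98.08 | 93.23 | 63.87 | 57.84 | 93.02 |
| Spanish | None | Adapter | 98.12 | \ul{97.28} | 98.71 | 76.98 | 74.40 | \ul{93.15} |
| Spanish | None | Soft Prompt | 97.49 | 87.98 | 99.10 | 76.38 | 74.78 | 91.90 |
| Spanish | Adapter | Adapter | 97.15 | 95.49 | 98.99 | 73.15 | \ul{81.65} | 93.80 |
| Spanish | Adapter | Soft Prompt | 96.44 | 94.75 | \ul{99.20} | 77.88 | 81.02 | 83.19 |
| Spanish | Soft Prompt | Adapter | \ul{97.64} | 97.31 | 99.31 | 87.14 | 86.36 | 91.72 |
| Spanish | Soft Prompt | Soft Prompt | 97.30 | 95.87 | 99.03 | \ul{79.97} | 78.85 | 89.11 |
| Czech | None | Adapter | \ul{91.68} | 86.29 | 86.61 | 99.07 | 97.59 | 72.85 |
| Czech | None | Soft Prompt | 82.80 | 86.94 | 87.70 | 98.54 | 97.14 | 71.64 |
| Czech | Adapter | Adapter | 91.00 | \ul{91.87} | \ul{88.39} | 99.45 | \ul{98.65} | \ul{82.19} |
| Czech | Adapter | Soft Prompt | 50.54 | 65.33 | 74.34 | 97.78 | 95.93 | 72.53 |
| Czech | Soft Prompt | Adapter | 92.78 | 93.40 | 89.13 | \ul{99.38} | 98.71 | 85.96 |
| Czech | Soft Prompt | Soft Prompt | 81.87 | 76.13 | 85.01 | 98.17 | 92.57 | 75.41 |
| Slovak | None | Adapter | 92.92 | 89.76 | 96.44 | 98.97 | 99.17 | 87.82 |
| Slovak | None | Soft Prompt | 68.12 | 84.83 | \ul{91.94} | 97.45 | 98.39 | 68.92 |
| Slovak | Adapter | Adapter | 72.55 | 85.38 | 89.29 | 98.15 | 98.99 | \ul{85.71} |
| Slovak | Adapter | Soft Prompt | 69.55 | 81.88 | 87.89 | 95.85 | 97.32 | 51.13 |
| Slovak | Soft Prompt | Adapter | 78.85 | \ul{88.08} | 91.42 | \ul{98.55} | \ul{99.03} | 75.80 |
| Slovak | Soft Prompt | Soft Prompt | \ul{87.03} | 76.63 | 83.71 | 94.16 | 98.20 | 78.43 |
| Telugu | None | Adapter | \ul{97.72} | 96.87 | 96.92 | \ul{67.93} | 61.91 | 98.89 |
| Telugu | None | Soft Prompt | 95.98 | 89.50 | 93.90 | 65.27 | \ul{68.61} | 98.54 |
| Telugu | Adapter | Adapter | 98.04 | 96.10 | 96.45 | 52.68 | 55.44 | 99.17 |
| Telugu | Adapter | Soft Prompt | 96.28 | 93.45 | 94.26 | 60.72 | 34.00 | 97.15 |
| Telugu | Soft Prompt | Adapter | 96.84 | \ul{96.61} | \ul{96.91} | 68.81 | 69.78 | \ul{98.79} |
| Telugu | Soft Prompt | Soft Prompt | 95.78 | 96.13 | 94.80 | 53.40 | 49.78 | 97.93 |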

Table 7: Results for the check-worthy claim detection task with each language as the source and target language. Results are reported using F1-Score, with best scores in bold and the second best underlined.

Appendix F Evaluation with Multiple Training Seeds

In Table 8, we report the evaluation results of all configurations trained on the German version of the WikiANN dataset with three different seeds. Along with the mean values, we also report the standard deviation.

The results show that knowledge transfer from German to other languages works best when a soft language prompt is combined with a task adapter, which supports our observation that this configuration achieves superior results on the classification tasks.
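As an illustration of how a "mean ± standard deviation" entry over several training seeds can be computed, the following minimal sketch aggregates per-seed scores for one configuration and target language. The function name `summarize` and the example scores are hypothetical and invented for illustration only; they are not results from our experiments. The sketch assumes the sample standard deviation (denominator n − 1); the population variant would give slightly smaller values.

```python
from statistics import mean, stdev


def summarize(scores):
    """Format per-seed scores as 'mean ± std' with two decimals.

    Assumption: the sample standard deviation (n - 1 denominator) is used;
    swap in statistics.pstdev for the population variant.
    """
    return f"{mean(scores):.2f} ± {stdev(scores):.2f}"


# Hypothetical F1 scores from three seeds (made-up numbers, not paper results).
seed_scores = [70.0, 72.0, 74.0]
print(summarize(seed_scores))
```

With only three seeds the estimate of the spread is coarse, so large deviations mainly indicate that a configuration is unstable rather than give a precise variance.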

| Language Representation | Task Representation | English | German | Spanish | Czech | Slovak | Telugu |
|---|---|---|---|---|---|---|---|
| None | Adapter | 46.12 ± 22.46 | 82.51 ± 0.45 | 71.12 ± 3.41 | \ul{66.00 ± 0.77} | 67.55 ± 1.22 | 47.69 ± 2.35 |
| None | Soft Prompt | 48.61 ± 15.42 | 78.65 ± 0.58 | 70.87 ± 1.51 | 61.48 ± 0.69 | 62.42 ± 0.75 | 43.94 ± 1.64 |
| Adapter | Adapter | 44.28 ± 17.93 | \ul{82.76 ± 0.74} | 70.36 ± 1.68 | 64.78 ± 0.15 | 68.15 ± 1.19 | 25.96 ± 2.86 |
| Adapter | Soft Prompt | \ul{48.93 ± 12.23} | 73.57 ± 0.90 | 68.16 ± 3.87 | 50.76 ± 14.17 | 53.63 ± 4.44 | 26.83 ± 4.04 |
| Soft Prompt | Adapter | 50.95 ± 16.39 | 83.80 ± 0.20 | 76.06 ± 1.40 | 67.25 ± 0.23 | 72.95 ± 2.35 | 52.58 ± 5.43 |
| Soft Prompt | Soft Prompt | 48.85 ± 15.23 | 78.43 ± 0.48 | \ul{71.55 ± 2.44} | 64.00 ± 2.18 | \ul{69.87 ± 2.21} | \ul{48.35 ± 2.19} |

Table 8: Results of cross-lingual transfer from German to all languages for the NER task. We report the mean of three runs along with the standard deviation. The best results are bolded and the second best results are underlined.