Large Language Models Are Cross-Lingual Knowledge-Free Reasoners

Peng Hu♣*, Sizhe Liu♣*, Changjiang Gao♣, Xin Huang♢, Xue Han♢,
Junlan Feng♢, Chao Deng♢, Shujian Huang♣
♣National Key Laboratory for Novel Software Technology, Nanjing University
♢China Mobile Research, Beijing, China
{hup, liusz, gaocj}@smail.nju.edu.cn, [email protected]
{huangxinyjy, hanxueai, fengjunlan, dengchao}@chinamobile.com
*Equal contribution
Abstract

Large Language Models have demonstrated impressive reasoning capabilities across multiple languages. However, the relationship between capabilities in different languages is less explored. In this work, we decompose the process of reasoning tasks into two separate parts, knowledge retrieval and knowledge-free reasoning, and analyze their cross-lingual transferability. With adapted and constructed knowledge-free reasoning datasets, we show that the knowledge-free reasoning capability can be nearly perfectly transferred across various source-target language directions, despite the secondary impact of resources in some specific target languages, while cross-lingual knowledge retrieval significantly hinders the transfer. Moreover, by analyzing the hidden states and feed-forward network neuron activation during the reasoning tasks, we show that higher similarity of hidden representations and larger overlap of activated neurons could explain the better cross-lingual transferability of knowledge-free reasoning compared with knowledge retrieval. Thus, we hypothesize that knowledge-free reasoning is embedded in some language-shared mechanism, while knowledge is stored separately in different languages. Our code and data are available at https://github.com/NJUNLP/Knowledge-Free-Reasoning.



1 Introduction

Large language models (LLMs) today have shown strong multitask and multilingual performance in various domains Huang and Chang (2022), including robust reasoning capabilities across multiple languages Ranaldi and Zanzotto (2023), even for low-resource languages in the training corpus Shi et al. (2022).

Previous studies reveal that these multilingual LLMs possess a certain ability of multilingual transfer Ye et al. (2023), which means that the skills or knowledge learned in one language can be automatically transferred to another language without extra training. However, the effect of such cross-lingual transfer varies across tasks. In knowledge-related tasks, current LLMs show unsatisfactory cross-lingual transfer Gao et al. (2024), while in certain reasoning tasks, more effective transfer is observed Qi et al. (2023). Previous studies do not analyze the difference between these tasks, nor do they dig further into the specific factors affecting the transfer effectiveness.

In this study, we divide a reasoning task into two separate components, knowledge retrieval and knowledge-free reasoning, and investigate their transfer effectiveness in current LLMs. The former means recalling information encountered during training, while the latter refers to organizing the given knowledge in the context into a final answer.

Based on this definition, we perform evaluation and interpretability analysis on LLaMA-2 models Touvron et al. (2023) to compare the cross-lingual transferability of knowledge retrieval and knowledge-free reasoning. In the evaluation part, by controlling the requirement of knowledge retrieval in the StrategyQA (Geva et al., 2021) dataset and constructing a knowledge-free reasoning dataset, we compare the cross-lingual transfer effectiveness of knowledge-related and knowledge-free reasoning. In the interpretability analysis part, we assess the cross-lingual computational similarity of hidden states and Feed-Forward Network (FFN) neuron activation to trace the computational process of cross-lingual transfer of knowledge-related and knowledge-free reasoning in LLMs. Our main findings are:

  • Knowledge retrieval significantly hinders cross-lingual transfer. The more knowledge retrieval a task requires, the less effective cross-lingual transfer is.

  • The knowledge-free reasoning ability can be effectively transferred to other languages after fine-tuning in one, while the model’s proficiency in these languages is less important.

  • The overall cross-lingual computational similarity of knowledge-free reasoning is significantly higher than that of knowledge-related reasoning. The difference in similarity is most pronounced in the middle-high layers of the model. This suggests a language-shared mechanism of knowledge-free reasoning in multilingual LLMs.

2 Methods

The work in this paper consists of two parts: evaluation and interpretability analysis. For each part, we employ specific datasets and design certain metrics or measurements.

2.1 Datasets

This part introduces the datasets used in this paper for the analysis of the cross-lingual transfer of reasoning tasks. For datasets available only in English or part of the tested languages, we translate them into other languages with Google Translate.

2.1.1 Adapted StrategyQA dataset

The StrategyQA dataset is a multi-domain reasoning dataset including common sense, arithmetic, symbolic, and logical reasoning, where the model is given at least two pieces of evidence from Wikipedia and is required to give "yes" or "no" answers to the questions. An example can be found in Table 3 in Appendix B.

Since Wikipedia is a commonly used source in LLM pretraining, the model can either collect the required information from evidence in the context, or retrieve it from its internal knowledge storage. Thus, by masking out some of the evidence, we can control the need for knowledge retrieval in the reasoning process, which allows us to analyze the impact of knowledge retrieval on the cross-lingual transfer of reasoning skills. Namely, we design two kinds of scenarios in the experiments:

  • No Fact (NF): The model is given only the questions and must retrieve the necessary knowledge (evidence).

  • With Fact (WF): The model is provided with the questions and some of the evidence. To control the degree of knowledge retrieval needed, we further divide this scenario into the WF-1, WF-2 and WF-all settings, where one piece, two pieces, and all pieces of evidence are provided for each question, respectively.
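The NF and WF settings can be sketched as a small prompt builder. This is only an illustration under stated assumptions: the prompt wording, the function name `build_prompt`, and the choice to keep the first k facts are hypothetical and are not taken from the paper, and the toy question below is not a StrategyQA item.

```python
def build_prompt(question, facts, setting):
    """Build a prompt for one question under the NF / WF-k / WF-all settings.

    The prompt layout is a hypothetical placeholder; keeping the first k
    facts is an assumption made for this sketch.
    """
    if setting == "NF":
        shown = []                      # no evidence: the model must retrieve it
    elif setting == "WF-all":
        shown = list(facts)             # all evidence is given in context
    elif setting.startswith("WF-"):
        k = int(setting.split("-")[1])  # WF-1, WF-2, ...
        shown = list(facts[:k])
    else:
        raise ValueError(f"unknown setting: {setting}")
    lines = [f"Fact: {fact}" for fact in shown]
    lines.append(f"Question: {question}")
    lines.append("Answer (Yes/No):")
    return "\n".join(lines)


# Toy example (not from StrategyQA).
toy_question = "Is the Eiffel Tower taller than a giraffe?"
toy_facts = [
    "The Eiffel Tower is about 330 metres tall.",
    "An adult giraffe is about 5 metres tall.",
]
for name in ("NF", "WF-1", "WF-all"):
    print(f"--- {name} ---")
    print(build_prompt(toy_question, toy_facts, name))
```

The fewer facts are shown, the more of the evidence the model has to recall from its parameters, which is the quantity the NF and WF settings are designed to vary.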

2.1.2 Knowledge-free reasoning dataset

Although the need for knowledge retrieval can be controlled with the adapted StrategyQA dataset, the questions themselves may still contain an implicit requirement for knowledge retrieval. To rule out this effect and focus on the knowledge-free reasoning process, we construct a dataset that minimizes the need for knowledge retrieval. Inspired by Wei et al. (2022)’s taxonomy of reasoning tasks, we developed the Knowledge-Free Reasoning Dataset (KFRD), which consists of three fundamental reasoning tasks in the 4-choice format: arithmetic reasoning, symbolic reasoning, and logical reasoning. The dataset is constructed by filling entities generated by GPT-4 Achiam et al. (2023) or simple English words into a unified template with one input part, one transformation rule, and one options part. Then, using the translated templates, entities and words, the dataset is expanded to all the tested languages. The template and examples can be found in Figure 11 and Table 1 in Appendix A.

Arithmetic Reasoning:

transforming one or two input numbers through mathematical operations into one or two output numbers. The mathematical operations include: basic arithmetic operations (addition, subtraction, multiplication, division), equality, geometric progression, arithmetic progression, and sorting.

Symbolic Reasoning:

transforming 3-5 words from the respective language input through symbolic operations to obtain the output. The symbolic operations include: repetition, addition, deletion, reordering, and their combinations, with up to three operations applied in combination.

Logical Reasoning:

selecting propositions derived from eight given premises using provided logical rules. The logical rules include: Implication Elimination, Conjunction Introduction, Conjunction Elimination, Disjunction Introduction, Disjunction Elimination, and Proof by Contradiction.
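As a rough illustration of the template-filling idea, the sketch below generates one 4-choice arithmetic item with an input part, a transformation rule, and an options part. The field wording ("Input", "Rule"), the distractor strategy, and the function name `make_addition_item` are assumptions for this example; the actual template is the one given in Figure 11 of the paper.

```python
import random


def make_addition_item(rng):
    """Create one hypothetical 4-choice addition item.

    rng is a random.Random instance so that generation is reproducible.
    """
    a, b = rng.randint(1, 50), rng.randint(1, 50)
    target = a + b
    # Draw three distinct wrong answers close to the correct one.
    distractors = set()
    while len(distractors) < 3:
        candidate = target + rng.randint(-10, 10)
        if candidate != target and candidate > 0:
            distractors.add(candidate)
    options = [target] + sorted(distractors)
    rng.shuffle(options)
    labels = "ABCD"
    prompt = "\n".join(
        [f"Input: {a}, {b}", "Rule: add the two numbers"]
        + [f"{labels[i]}. {opt}" for i, opt in enumerate(options)]
    )
    return {
        "prompt": prompt,
        "options": options,
        "target": target,
        "answer": labels[options.index(target)],
    }


item = make_addition_item(random.Random(0))
print(item["prompt"])
print("Gold label:", item["answer"])
```

Because every quantity needed to answer is present in the prompt, an item of this kind requires no recall of world knowledge, which is the property the KFRD is built to isolate.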

2.1.3 Knowledge-related reasoning dataset

To further compare the cross-lingual transfer of knowledge retrieval and knowledge-free reasoning, three knowledge-related reasoning datasets are used in the interpretability analysis. The three datasets are:

MKQA:

The MKQA dataset Longpre et al. (2021) is an open-domain question-answering evaluation set designed to provide a challenging benchmark for multilingual question-answering. The dataset consists of 10,000 question-answer pairs and covers 26 different languages.

BoolQ:

BoolQ (Clark et al., 2019) is a question answering dataset for yes/no questions, comprising 15,942 examples. The questions originate from queries to the Google search engine with Wikipedia pages as answers.

AmbigQA:

AmbigQA (Min et al., 2020) is an open-domain dataset for question answering that includes 14,042 questions. It addresses the issue of ambiguous answers by rephrasing the related questions.

Some examples of these datasets are shown in Table 3 in Appendix B.

2.2 Evaluation metric

In order to measure the model’s cross-lingual transferability while mitigating the difference in difficulty between datasets, we calculate the Cross-lingual Transfer Ratio (XLTR) following prior work Gao et al. (2024). The formula is as follows:

$$\text{XLTR}(s,t)=\left(\frac{|C_{s}\cap C_{t}|}{|C_{s}|}-A_{r}\right)/(1-A_{r})$$

where $s$ and $t$ denote the source and target languages in the transfer, $C_{x}$ represents the set of correct answers in language $x$, and $A_{r}$ is the accuracy of random choices for the given task. If, after training on a task in one specific language (namely, English), the model shows an XLTR score close to 100% in other untrained languages, we say it achieves Fully Cross-Lingual Transfer (FCLT) in this task.
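A minimal sketch of the XLTR computation, assuming the correct answers in each language are represented as sets of question ids (the function name `xltr` and the toy ids are illustrative):

```python
def xltr(correct_source, correct_target, random_accuracy):
    """Cross-lingual Transfer Ratio from a source to a target language.

    correct_source / correct_target: sets of ids of questions answered
    correctly in each language (C_s and C_t).
    random_accuracy: A_r, e.g. 0.5 for yes/no and 0.25 for 4-choice tasks.
    """
    if not correct_source:
        raise ValueError("the source language has no correct answers")
    shared = len(correct_source & correct_target) / len(correct_source)
    return (shared - random_accuracy) / (1.0 - random_accuracy)


# Toy 4-choice example: 8 questions solved in English, 7 of them also in German.
c_en = {1, 2, 3, 4, 5, 6, 7, 8}
c_de = {1, 2, 3, 4, 5, 6, 7, 11}
score = xltr(c_en, c_de, random_accuracy=0.25)
print(f"XLTR(en, de) = {score:.3f}")
```

Subtracting $A_{r}$ and dividing by $1-A_{r}$ rescales the overlap so that chance-level agreement maps to 0 and complete agreement maps to 1, which makes yes/no and 4-choice tasks comparable.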

2.3 Interpretability measurements

We employ two metrics to evaluate the cross-lingual computational consistency of the LLM during knowledge-related and knowledge-free reasoning processes.

2.3.1 Cosine similarity of hidden states (CS)

We measure the cosine similarity of the hidden representations across multiple languages during the reasoning process for the same question, in order to observe how closely the semantic spaces of the tested languages approximate each other. The similarity is calculated by:

$$\text{CS}(x)=\frac{\sum_{n=1}^{N}\sum_{a,b\in\mathcal{L},\,a\neq b}\frac{\mathbf{h}_{n}^{a}(x)\cdot\mathbf{h}_{n}^{b}(x)}{\left|\mathbf{h}_{n}^{a}(x)\right|\cdot\left|\mathbf{h}_{n}^{b}(x)\right|}}{|\mathcal{L}|(|\mathcal{L}|-1)N}$$

where $x$ is a certain question sample, $N$ is the total number of model layers, $\mathcal{L}$ denotes the set of all tested languages, and $\mathbf{h}_{n}^{a}(x)$ is the output hidden state of the $n$-th layer for sample $x$ in language $a$. After that, the cosine similarities of all tested samples are averaged to report the final score.
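The CS formula can be sketched in plain Python as follows. The data layout (`hidden[lang][n]` holding the layer-n vector of one sample as a list of floats) and the function names are assumptions for this example; in practice the vectors would come from the model's per-layer outputs at the last question token.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cs_score(hidden):
    """CS(x) for one sample; hidden[lang][n] is the layer-n hidden state."""
    langs = list(hidden)
    num_layers = len(hidden[langs[0]])
    total = 0.0
    for n in range(num_layers):
        # Ordered pairs (a, b) with a != b, matching the double sum.
        for a, b in permutations(langs, 2):
            total += cosine(hidden[a][n], hidden[b][n])
    return total / (len(langs) * (len(langs) - 1) * num_layers)


# Toy sample with two layers and three languages (2-dimensional states).
toy_hidden = {
    "en": [[1.0, 0.0], [1.0, 1.0]],
    "de": [[1.0, 0.0], [1.0, 1.0]],
    "zh": [[0.0, 1.0], [1.0, 1.0]],
}
print(f"CS = {cs_score(toy_hidden):.3f}")
```

In the toy sample the second layer is identical across languages while the first layer differs for one language, so the score falls between 0 and 1.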

2.3.2 Neuron Activation Overlap (NAO)

The Neuron Activation Overlap refers to the extent of overlap between activated neurons across multiple languages for the same corpus. The calculation steps are as follows:

  1. Input a question in multiple languages into the model and extract the activation values of the last token in the question in each language.

  2. Identify activated neurons. A neuron is considered activated if the absolute value of its activation exceeds a certain threshold.

  3. Calculate the overlap of activated neurons in all the tested languages.

For a question sample $x$, the formula is as follows:

$$\text{NAO}(x)=\frac{\left|\mathcal{L}\right|\cdot\left|\bigcap_{l\in\mathcal{L}}S^{l}(x)\right|}{\sum_{l\in\mathcal{L}}\left|S^{l}(x)\right|}$$

where $\mathcal{L}$ is the set of languages, and $S^{l}(x)$ is the set of activated neurons on sample $x$ in language $l$. After that, the NAO values of all tested samples are averaged to report the final score.
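The three steps and the NAO formula can be sketched as below. The flat list of activation values per language and the function names are assumptions for illustration; with a real model, each neuron index would have to identify both the layer and the position inside the FFN.

```python
def activated(activations, threshold):
    """Indices of neurons whose absolute activation exceeds the threshold."""
    return {i for i, value in enumerate(activations) if abs(value) > threshold}


def nao(activations_by_lang, threshold):
    """NAO(x) for one sample.

    activations_by_lang maps a language code to the list of FFN activation
    values at the last question token (one value per neuron).
    """
    sets = [activated(acts, threshold) for acts in activations_by_lang.values()]
    total = sum(len(s) for s in sets)
    if total == 0:
        return 0.0  # no neuron is activated in any language
    shared = set.intersection(*sets)
    return len(sets) * len(shared) / total


# Toy sample with four neurons and two languages.
toy_activations = {
    "en": [0.9, 0.1, -0.6, 0.0],
    "de": [0.7, 0.5, -0.5, 0.2],
}
value = nao(toy_activations, threshold=0.4)
print(f"NAO = {value:.2f}")
```

With threshold 0.4, English activates neurons {0, 2} and German activates {0, 1, 2}, so the overlap is 2 · 2 / 5 = 0.8. The score reaches 1 only when every language activates exactly the same set of neurons.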

3 Experiment Settings

3.1 Language and model choice

Language choice

To capture a broad spectrum of linguistic diversity, we selected ten languages based on their extensive use and representation of diverse linguistic families, following Gao et al. (2024). The languages selected are English (en), German (de), French (fr), Italian (it), Russian (ru), Polish (pl), Arabic (ar), Hebrew (he), Chinese (zh), and Japanese (ja). Further details are provided in Appendix C.

Model choice

We selected several LLMs including LLaMA-2-7B-Chat (Touvron et al., 2023), BLOOMZ-MT-7B (Muennighoff et al., 2023), Mistral-7B-Instruct-v0.1 (Jiang et al., 2023), and Qwen-1.5-7B-Chat (Bai et al., 2023) to conduct transferability experiments on existing and synthetic datasets.

For experiments focusing on the impact of the training language (Section 4.3.1) and interpretability (Section 4.4), we used LLaMA-2-7B-Chat as a representative model for analysis.

For the experiments evaluating the impact of target language proficiency in specific directions (Section 4.3.2), we selected models derived from LLaMA-2-7B and Mistral-7B to conduct experiments on Arabic and Hebrew respectively, both of which are low-resource languages.

3.2 Fine-tuning settings

We perform LoRA fine-tuning (Hu et al., 2021) on all model blocks in all experiments due to the limited computational resources. More details about fine-tuning can be found in Appendix E.

3.3 Decoding settings

In all experiments, we perform constrained decoding to prevent the model from generating tokens other than the desired choices (Yes/No for StrategyQA, A/B/C/D for KFRD).
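One common way to realize such a constraint is to compare only the next-token scores of the permitted answer tokens. The sketch below assumes this approach and uses a plain dictionary of scores; the paper does not specify its implementation, and the function name `constrained_choice` is hypothetical.

```python
def constrained_choice(logits, allowed):
    """Return the allowed token with the highest next-token score.

    logits: mapping from token string to its score for the next position.
    allowed: the answer tokens the model may produce, e.g. A/B/C/D.
    """
    candidates = {tok: logits[tok] for tok in allowed if tok in logits}
    if not candidates:
        raise ValueError("none of the allowed tokens has a score")
    return max(candidates, key=candidates.get)


# Toy scores: an unconstrained argmax would pick "The", which is not a choice.
toy_logits = {"The": 5.1, "A": 2.0, "B": 3.5, "C": 0.1, "D": -1.0}
print(constrained_choice(toy_logits, ["A", "B", "C", "D"]))
```

Restricting the comparison to the answer tokens guarantees that every prediction is a valid choice, so accuracy and XLTR are never affected by off-format generations.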

In the interpretability analysis, we collect the hidden states and neuron activation values at the last token of the question. Because the last input token is used to predict the next token, it gradually incorporates the primary information of the entire sentence and reflects the overall thought process for the entire problem (Meng et al., 2022; Stolfo et al., 2023; Wu et al., 2024). By focusing on the model’s computational pathway for reasoning rather than calculating the similarity between multilingual sentences, we can better understand how the model processes complex queries. Calculating with an output token, on the other hand, would make it difficult to interpret the reasoning process. Additionally, token counts differ across languages, complicating direct comparisons. Therefore, using the last input token helps standardize the analysis across different languages.

To ensure consistency in the final token across different datasets, we added a language-specific ’?’ to certain datasets. For details on these modifications and metrics, please refer to Appendix D.

4 Results

Figure 1: XLTR of different models on StrategyQA. Solid lines: WF results; Dashed lines: NF results. The label of training language (en) is capitalized.
Figure 2: XLTR of LLaMA-2-7B-Chat on StrategyQA under different settings. NF stands for the No Fact setting, while WF-1 and WF-2 indicate that one or two facts are given for each question, respectively. WF-all signifies that all facts are provided.
Figure 3: XLTR on the different parts of KFRD
Figure 4: Accuracy of LLaMA-2-7B-Chat from different languages on three parts of KFRD
Figure 5: XLTR of LLaMA-2-7B-Chat from different languages on three parts of KFRD

4.1 Impact of knowledge retrieval on cross-lingual transfer

We analyze the impact of the amount of retrieved knowledge on cross-lingual transfer in different settings of the StrategyQA dataset. The results for cross-lingual transfer ratio are presented in Figure 1, while the accuracy results are detailed in Figure 12 in Appendix G.

Knowledge retrieval requirement harms cross-lingual transfer

The experimental results indicate that, for all languages, the cross-lingual transfer ratio of every model is significantly higher when the necessary knowledge for reasoning is provided than when it is not. This suggests that the requirement for knowledge retrieval significantly hinders the model’s cross-lingual transferability when solving reasoning problems.

Less knowledge retrieval makes better cross-lingual transfer

Additionally, we conduct a more detailed evaluation using the LLaMA-2-7B-Chat model to observe the changes in cross-lingual transfer ratio with varying amounts of knowledge entries. As shown in Figure 2, the experimental results demonstrate that the transfer rate increases as the number of provided knowledge entries increases from none to all required knowledge. This further validates the conclusion that the retrieval of more knowledge significantly impacts cross-lingual transferability.

4.2 The cross-lingual transfer of knowledge-free reasoning

We assess the cross-lingual transferability of the model’s knowledge-free reasoning capabilities. We evaluate performance on the KFRD by training in English and transferring to the other 9 languages. The resulting cross-lingual transfer ratios are shown in Figure 3, and accuracy results are shown in Figure 13 in Appendix G.

The results demonstrate that the KFRD exhibits extremely high cross-lingual transfer performance for most language pairs. For 7 out of the 9 languages, it can be observed that the cross-lingual transfer performance in knowledge-free reasoning tasks often exceeds 90%, with some instances approaching 100%, thereby achieving near FCLT.

However, for the two lowest-resource languages Arabic and Hebrew, the cross-lingual transferability is significantly poorer. We hypothesize that this may be due to the model’s weaker language proficiency in these two languages, which adversely affects its transferability. Further analysis on this issue is provided in the next section.

4.3 Impact of language proficiency on cross-lingual transfer

In this section, we explore the impact of language proficiency of the training and target language with additional experiments.

4.3.1 Training language proficiency

To test the impact of training language proficiency, we select German and Chinese as representatives of high-resource languages, and Arabic and Hebrew as representatives of low-resource languages, to serve as training languages. We then train models on the KFRD in these languages and evaluate their performance across the 10 languages. The accuracy results are shown in Figure 4, and the transfer ratios are presented in Figure 5.

The results show that, although training with low-resource languages results in slightly lower accuracy (which is particularly noticeable in the arithmetic dataset), the models show no significant differences in transfer ratio when trained with high-resource or low-resource languages. This indicates that the proficiency and resources of the training language have no significant effect on the cross-lingual transfer of knowledge-free reasoning.

4.3.2 Target language proficiency in specific directions

In the previous experiments, we observed strong cross-lingual transfer performance between most languages. However, the transferability from English to Arabic and Hebrew was significantly weaker. We hypothesize that this is related to the model’s weaker language proficiency in these two target languages.

In this section, we select models from Hugging Face that have undergone Continuous Pre-Training (CPT), Supervised Fine-Tuning (SFT), and a combination of both (CPT + SFT) on the LLaMA-2 or Mistral platforms. These adapted models have better proficiency in the respective languages. The selected models are listed in Table 2 in Appendix F.

We train a Vanilla model and the above fine-tuned models on the KFRD in English and test their transferability from English to the respective target languages. The transfer rate results are shown in Figure 6, and the accuracy results are provided in Figure 14 in Appendix G.

The experimental results indicate that the Vanilla model exhibits very low transfer rates for low-resource languages. However, after applying CPT, SFT, or CPT+SFT, the transfer rates increase significantly. Notably, for Hebrew, the transfer rate reached over 95%, approaching FCLT. This suggests that the proficiency in Arabic and Hebrew limits the cross-linguistic transfer of knowledge-free reasoning abilities from English to these languages. Improving proficiency in the target language can alleviate this limitation. This demonstrates that the cross-linguistic transfer of knowledge-free reasoning in specific directions is influenced by target language proficiency.

4.4 Analyzing differences in transfer characteristics

4.4.1 Overall computational similarity

In this section, we assess the LLaMA-2-7B-Chat model’s hidden states cosine similarity and Feed-Forward Network neuron activation overlap on knowledge-related tasks and knowledge-free reasoning tasks. We also evaluate changes in these metrics after fine-tuning with LoRA to understand the mechanisms supporting cross-lingual transfer of knowledge-free reasoning abilities. The experimental results are shown in Figures 7 and 8.

Figure 6: Averaged XLTR from English to Arabic/Hebrew on KFRD for models in different stages trained in Arabic/Hebrew
Figure 7: CS for different datasets in the LLaMA-2-7B-Chat model. The black lines on each bar indicate the 99% confidence intervals estimated with bootstrap sampling (Efron, 1992).
Figure 8: NAO for different datasets in the LLaMA-2-7B-Chat model at activation thresholds ranging from 0.1 to 0.9. Shaded areas: 99% confidence intervals estimated with bootstrap sampling; Solid lines: results of the original model; Dashed lines: results of the LoRA-tuned model. The meanings of the shaded areas and dashed lines in Figures 9 and 10 are consistent with those described here.

Internal representation of knowledge-free reasoning task is better aligned than retrieval

The results in Figure 7 indicate that the CS of the model on knowledge-free reasoning tasks is significantly higher than that on knowledge-related reasoning tasks both before and after training. Additionally, after training on the knowledge-free reasoning dataset, the model’s CS shows a significant increase. In contrast, after training on the knowledge-related reasoning dataset, the increase in CS is not significant and may even decrease. This suggests that adapting to knowledge-free reasoning tasks results in a more aligned hidden space processing across different languages, whereas adapting to knowledge-related reasoning tasks leads to relatively divergent hidden space processing.

Neuron activation pattern of knowledge-free reasoning task is more similar than retrieval

Neuron analysis further elucidates this phenomenon. The results in Figure 8 show that, across all threshold settings, NAO for knowledge-free reasoning tasks is significantly higher than for knowledge-related reasoning tasks. This indicates that the model tends to use similar neurons for processing knowledge-free reasoning tasks across different languages, resulting in similar neuron activation patterns and computational processes. Consistent with the hidden states results, after training on the knowledge-free reasoning dataset, the model’s NAO shows a significant increase, whereas there is no significant improvement and even a decline after training on the knowledge-related reasoning dataset. This suggests that training on knowledge-free reasoning tasks makes neuron activation characteristics across different languages more similar, whereas training on knowledge-related reasoning tasks makes these characteristics relatively divergent.

These results provide a comprehensive analysis of the differing cross-lingual transfer effectiveness between knowledge-free reasoning and knowledge-related reasoning from a computational similarity perspective. We hypothesize that this difference is because the model tends to store knowledge for different languages in different neurons, while using similar neuron groups for knowledge-free reasoning functions.

4.4.2 Layer-wise computational similarity

To analyze the relationship between hidden states and neuron activation characteristics with cross-lingual transfer of reasoning abilities more granularly, we performed a layer-wise analysis of the cross-lingual cosine similarity of hidden states and cross-lingual neural activation overlap. The experimental results are shown in Figures 9 and 10.
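A layer-wise variant of the hidden-state similarity simply omits the average over layers and reports one score per layer. The sketch below assumes the same hypothetical layout as before (`hidden[lang][n]` is the layer-n vector of one sample) and at least two languages; since cosine similarity is symmetric, averaging over unordered language pairs gives the same value as averaging over ordered pairs.

```python
import math
from itertools import combinations


def layerwise_cs(hidden):
    """Per-layer average cosine similarity across all language pairs."""
    langs = list(hidden)
    num_layers = len(hidden[langs[0]])
    scores = []
    for n in range(num_layers):
        sims = []
        for a, b in combinations(langs, 2):
            u, v = hidden[a][n], hidden[b][n]
            dot = sum(x * y for x, y in zip(u, v))
            sims.append(dot / (math.hypot(*u) * math.hypot(*v)))
        scores.append(sum(sims) / len(sims))
    return scores


# Toy sample: layer 0 differs across languages, layer 1 is shared.
toy_layers = {
    "en": [[1.0, 0.0], [1.0, 1.0]],
    "zh": [[0.0, 1.0], [1.0, 1.0]],
}
for layer, score in enumerate(layerwise_cs(toy_layers)):
    print(f"layer {layer}: CS = {score:.2f}")
```

Plotting these per-layer scores against the layer index gives a curve of the kind the layer-wise analysis examines.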

Figure 9: CS for different layers of the LLaMA-2-7B-Chat.
Figure 10: NAO for different layers of the LLaMA-2-7B-Chat at an activation threshold of 0.4.

It is observed that the phenomena identified in the previous section, specifically the significantly higher CS and NAO for knowledge-free reasoning tasks compared to knowledge tasks with more pronounced improvements after fine-tuning, are most evident in the middle layers (layers 6-25) of the model and less in the shallow layers (layers 1-6). Previous work (Zhao et al., 2024; Wendler et al., 2024) suggested that the middle layers of LLMs are primarily responsible for conceptual reasoning, which is cross-lingual. This hypothesis aligns with our findings and further supports the view that reasoning capabilities can transfer across languages.

Additionally, in the upper layers (layers 25-32) of the model, the CS and NAO for knowledge-free reasoning tasks are similar to those for knowledge tasks before training, but show significant improvement after training. This phenomenon is accompanied by a consistent performance improvement across different languages. The layered functionality theory of LLMs suggests that the upper layers are mainly responsible for generating the tokens of the target language responses. Therefore, we infer that task fine-tuning makes the model’s behavior more consistent during the decoding phase, resulting in better multilingual task performance.

5 Related Work

Multilingual reasoning evaluation

Laskar et al. (2023) evaluated the multilingual ability of ChatGPT. Shi et al. (2022) found that LLMs can perform reasoning in multiple languages using CoT, even for languages with very low resources.

Cross-lingual transfer

Gao et al. (2024) evaluated the cross-lingual transferability of models on multiple reasoning datasets, finding significant variations in transfer performance across different datasets. Additionally, Zhu et al. (2024) discovered that training on translated questions can enhance the cross-lingual transferability of reasoning tasks. Ye et al. (2023) assessed the imbalance of knowledge across different languages in LLMs, observing weak cross-lingual transferability of knowledge. Furthermore, Hu et al. (2024) found that knowledge transferability remains weak across various settings.

Analysis of multilingual internal representation

Zhao et al. (2024) analyzed the way LLMs handle multilingualism and suggested a three-phase working pattern, which includes understanding, task solving and generation. Wendler et al. (2024) also arrived at a similar conclusion.

6 Conclusion

In this study, we analyze the reasons behind the differing cross-lingual transfer abilities of LLMs on various reasoning tasks. We divide reasoning tasks into two components: knowledge retrieval and knowledge-free reasoning. We adapt several existing reasoning datasets and construct a knowledge-free reasoning dataset for analysis.

Experiments on these datasets demonstrate that the requirement of knowledge retrieval significantly lowers the cross-lingual transfer performance of reasoning tasks, while the knowledge-free reasoning ability can be nearly fully transferred between languages.

In addition, we analyze the difference of computational processes between knowledge-related and knowledge-free reasoning tasks to explain the gap in cross-lingual transfer. The results show that the overall cross-lingual computational similarity of knowledge-free reasoning is significantly higher than that of knowledge-related reasoning, particularly in the middle-high layers of the model. This suggests that knowledge-free reasoning benefits from a shared neural mechanism across languages, whereas knowledge storage is more language-specific.

Limitations

One key limitation of this paper is the model selection and language coverage. In our exploration of language proficiency and interpretability experiments, we primarily rely on the LLaMA-2 model. Additionally, other parts of our research utilize only a few models, which may oversimplify the descriptions of model performance and behavior. In terms of language coverage, although we included ten languages from different language families, this number is still insufficient compared to the thousands of languages globally. This limitation is partly due to our computational resource constraints. With adequate resources, the proposed methods could be extended to other models and languages to further validate our conclusions.

Another limitation of our study is the depth of the interpretability analysis. We aim to investigate whether different knowledge-free reasoning tasks utilize the same neurons and whether knowledge is stored in different neurons for different languages. However, our support for this hypothesis is primarily based on macro-level numerical analyses, without precisely identifying specific reasoning neurons and knowledge neurons. This limitation restricts our fine-grained understanding of the model’s internal mechanisms. Future research should conduct more detailed neuron-level analyses to verify these hypotheses.

Ethics Statement

The authors declare no competing interests. All datasets used in this study are sourced from publicly available repositories and do not contain sensitive information, such as personal data. The data generated by GPT-4 have been verified to be non-toxic and are used exclusively for research purposes. The use of LLaMA-2 models, as well as several other large language models, complies with their respective licenses.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Ansel et al. (2024) Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael Lazos, Mario Lezcano, Yanbo Liang, Jason Liang, Yinghai Lu, CK Luk, Bert Maher, Yunjie Pan, Christian Puhrsch, Matthias Reso, Mark Saroufim, Marcos Yukio Siraichi, Helen Suk, Michael Suo, Phil Tillet, Eikan Wang, Xiaodong Wang, William Wen, Shunting Zhang, Xu Zhao, Keren Zhou, Richard Zou, Ajit Mathews, Gregory Chanan, Peng Wu, and Soumith Chintala. 2024. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. In 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’24). ACM.
  • Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
  • Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.
  • Csaki et al. (2024) Zoltan Csaki, Bo Li, Jonathan Li, Qiantong Xu, Pian Pawakapan, Leon Zhang, Yun Du, Hengyu Zhao, Changran Hu, and Urmish Thakker. 2024. Sambalingo: Teaching large language models new languages. Preprint, arXiv:2404.05829.
  • DICTA (2024) DICTA. 2024. Dictalm-2.0. https://huggingface.co/dicta-il/dictalm2.0. Accessed: 2024-06-15.
  • Efron (1992) Bradley Efron. 1992. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics: Methodology and distribution, pages 569–593. Springer.
  • Gao et al. (2024) Changjiang Gao, Hongda Hu, Peng Hu, Jiajun Chen, Jixing Li, and Shujian Huang. 2024. Multilingual pretraining and instruction tuning improve cross-lingual knowledge alignment, but only shallowly. arXiv preprint arXiv:2404.04659.
  • Geva et al. (2021) Mor Geva, Yoav Goldberg, and Jonathan Berant. 2021. Strategyqa: A question answering benchmark for reasoning about strategies. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
  • Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. Preprint, arXiv:2106.09685.
  • Hu et al. (2024) Peng Hu, Changjiang Gao, Ruiqi Gao, Jiajun Chen, and Shujian Huang. 2024. Limited out-of-context knowledge reasoning in large language models. Preprint, arXiv:2406.07393.
  • Huang and Chang (2022) Jie Huang and Kevin Chen-Chuan Chang. 2022. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403.
  • Icebear-AI (2024) Icebear-AI. 2024. Llama-2-7b-chat-arabic-lora. https://huggingface.co/Icebear-AI/Llama-2-7b-chat-arabic-lora. Accessed: 2024-06-15.
  • Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
  • Laskar et al. (2023) Md Tahmid Rahman Laskar, M Saiful Bari, Mizanur Rahman, Md Amran Hossen Bhuiyan, Shafiq Joty, and Jimmy Xiangji Huang. 2023. A systematic study and comprehensive evaluation of chatgpt on benchmark datasets. Preprint, arXiv:2305.18486.
  • Longpre et al. (2021) Shayne Longpre, Yi Lu, and Joachim Daiber. 2021. Mkqa: A linguistically diverse benchmark for multilingual open domain question answering. Transactions of the Association for Computational Linguistics, 9:1389–1406.
  • Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372.
  • Min et al. (2020) Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering ambiguous open-domain questions. In EMNLP.
  • Muennighoff et al. (2023) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2023. Crosslingual generalization through multitask finetuning. Preprint, arXiv:2211.01786.
  • Qi et al. (2023) Jirui Qi, Raquel Fernández, and Arianna Bisazza. 2023. Cross-lingual consistency of factual knowledge in multilingual language models. Preprint, arXiv:2310.10378.
  • Ranaldi and Zanzotto (2023) Leonardo Ranaldi and Fabio Massimo Zanzotto. 2023. Empowering multi-step reasoning across languages via tree-of-thoughts. arXiv preprint arXiv:2311.08097.
  • Rasley et al. (2020) Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506.
  • Saparov et al. (2024) Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish Joshi, Mehran Kazemi, Najoung Kim, and He He. 2024. Testing the general deductive reasoning capacity of large language models using ood examples. Advances in Neural Information Processing Systems, 36.
  • Shi et al. (2022) Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2022. Language models are multilingual chain-of-thought reasoners. Preprint, arXiv:2210.03057.
  • Stolfo et al. (2023) Alessandro Stolfo, Yonatan Belinkov, and Mrinmaya Sachan. 2023. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7035–7052.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
  • Wendler et al. (2024) Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. 2024. Do llamas work in english? on the latent language of multilingual transformers. arXiv preprint arXiv:2402.10588.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  • Wu et al. (2024) Wilson Wu, John X Morris, and Lionel Levine. 2024. Do language models plan ahead for future tokens? arXiv preprint arXiv:2404.00859.
  • Ye et al. (2023) Jiacheng Ye, Xijia Tao, and Lingpeng Kong. 2023. Language versatilists vs. specialists: An empirical revisiting on multilingual transfer ability. arXiv preprint arXiv:2306.06688.
  • Zhao et al. (2024) Yiran Zhao, Wenxuan Zhang, Guizhen Chen, Kenji Kawaguchi, and Lidong Bing. 2024. How do large language models handle multilingualism? arXiv preprint arXiv:2402.18815.
  • Zheng et al. (2024) Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, and Yongqiang Ma. 2024. Llamafactory: Unified efficient fine-tuning of 100+ language models. arXiv preprint arXiv:2403.13372.
  • Zhu et al. (2024) Wenhao Zhu, Shujian Huang, Fei Yuan, Shuaijie She, Jiajun Chen, and Alexandra Birch. 2024. Question translation training for better multilingual reasoning. arXiv preprint arXiv:2401.07817.

Appendix A Detailed Description of Knowledge-Free Reasoning

The KFRD is generated using a unified template, consisting entirely of multiple-choice questions with four options. Each question is structured into three parts: input, output, and transformation rules. Specific examples can be seen in Table 1, and the templates used for these examples are shown in Figure 11.

Arithmetic Reasoning
Input 11, 645 (one or two numbers)
Transformation Rule Addition (a mathematical operation)
Output Options A) 595 B) 536 C) 771 D) 656 (one number, sometimes the result is two numbers)
Symbolic Reasoning
Input education, game, president, night, man (3-5 words in the corresponding language)
Transformation Rule Swap the positions of the 5th and 2nd words; Delete the 2nd word (symbolic operations within three layers)
Output Options A) education, president, night, game B) education, problem, night, game C) hand, president, night, game D) education, house, night, game
Logical Reasoning
Input Alex is Aurora Vale. Everything that is Aurora Vale is Omicron Delta. Stella is not Chronos Wasteland. Max is not Dreamweaver’s Haven. Suppose Sally is Whispering Meadows, then Sally is Chimerical Citadel. Everything that is Ebonwyrm Abyss is Phoenixfire Ridge. (8 propositions)
Transformation Rule Implication Elimination (a logical rule)
Output Options A) Alex is Seraphim Heights. B) Alex is Tempestwilds. C) Alex is Omicron Delta. D) Polly is Arcadia Reach.
Table 1: Examples of different tasks in the KFRD dataset

A.1 Arithmetic Reasoning

This dataset transforms two input numbers through mathematical operations into one or two output numbers. The mathematical operations include basic arithmetic operations (addition, subtraction, multiplication, division), equality, geometric progression, arithmetic progression, and sorting.

  • Input: Numbers are randomly generated within the range of 0-999.

  • Transformation rules: Each rule generates an equal number of samples.

  • Output: Generated through transformation rules, constrained within the range of 0-999. Other options are randomly generated, ensuring a single correct answer.
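The constraints above can be illustrated with a minimal sketch for the addition rule. The function name `make_addition_sample`, the rejection-sampling loop, and the option-shuffling strategy are illustrative assumptions, not the authors' generator.

```python
import random

def make_addition_sample(rng, low=0, high=999, n_options=4):
    """Draw two inputs whose sum stays in [low, high], then add unique distractors."""
    while True:
        a, b = rng.randint(low, high), rng.randint(low, high)
        if a + b <= high:  # the output must also lie in the allowed range
            break
    answer = a + b
    options = {answer}
    while len(options) < n_options:
        options.add(rng.randint(low, high))  # a set keeps the correct answer unique
    options = list(options)
    rng.shuffle(options)
    labels = "ABCD"
    return {
        "input": (a, b),
        "rule": "Addition",
        "options": dict(zip(labels, options)),
        "answer": labels[options.index(answer)],
    }

sample = make_addition_sample(random.Random(0))
```

Other rules (subtraction, sorting, progressions) would follow the same pattern with a different function computing `answer`.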

A.2 Symbolic Reasoning

This dataset transforms 3-5 input words from the corresponding language through symbolic operations to generate the output. Symbolic operations include repetition, addition, deletion, reordering, and their combinations. Considering that single-step symbolic operations are too simple, we chose up to three-step symbolic operations.

  • Input: Randomly select 3-5 words from a specific language. We chose 100 simple English words and translated them into other languages using Google Translate.

  • Transformation rules: The dataset includes equal amounts of single-step, two-step, and three-step symbolic operations. For single-step operations, each rule generates an equal number of samples. For two-step and three-step operations, rule combinations are randomly selected.

  • Output: Generated through transformation rules. Other options are partially randomly generated and partially based on random replacements from the original input, ensuring consistent length and a unique correct answer.
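As a sketch of how a multi-step symbolic rule can be applied, the snippet below reproduces the symbolic example from Table 1. The helper names `swap`, `delete`, and `apply_rules`, and the use of 1-based positions, are assumptions made for illustration.

```python
def swap(words, i, j):
    """Swap the words at 1-based positions i and j."""
    out = list(words)
    out[i - 1], out[j - 1] = out[j - 1], out[i - 1]
    return out

def delete(words, i):
    """Delete the word at 1-based position i."""
    return [w for k, w in enumerate(words, start=1) if k != i]

def apply_rules(words, rules):
    """Apply a sequence of (operation, args) steps from left to right."""
    for op, args in rules:
        words = op(words, *args)
    return words

words = ["education", "game", "president", "night", "man"]
result = apply_rules(words, [(swap, (5, 2)), (delete, (2,))])
print(result)  # ['education', 'president', 'night', 'game']
```

The result matches option A of the symbolic example in Table 1.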

A.3 Logical Reasoning

This dataset generates output from a subset of eight input propositions using logical rules. Logical rules include Implication Elimination, Conjunction Introduction, Conjunction Elimination, Disjunction Introduction, Disjunction Elimination, and Proof by Contradiction. The logical rules are taken from Saparov et al. (2024).

  • Input: Eight propositions are generated using proposition templates and randomly selected entities. The proposition templates are taken from Saparov et al. (2024), and the entities from Saparov et al. (2024) and Gao et al. (2024). Missing languages were supplemented using Google Translate.

  • Transformation rules: Each logical rule generates an equal number of samples.

  • Output: Generated through logical rules. Other options are partially based on entities appearing in the propositions and partially randomly generated, ensuring a unique correct answer.
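To make one of the logical rules concrete, here is a minimal sketch of a single Implication Elimination step on the logical example from Table 1. Representing propositions as `(entity, property)` and `(premise, conclusion)` tuples is an assumption for illustration; the dataset itself uses natural-language templates.

```python
def implication_elimination(facts, rules):
    """Derive new (entity, property) facts from 'everything that is P is Q' rules."""
    derived = []
    for entity, prop in facts:
        for premise, conclusion in rules:
            if prop == premise:
                derived.append((entity, conclusion))
    return derived

facts = [("Alex", "Aurora Vale")]
rules = [("Aurora Vale", "Omicron Delta"), ("Ebonwyrm Abyss", "Phoenixfire Ridge")]
conclusions = implication_elimination(facts, rules)
print(conclusions)  # [('Alex', 'Omicron Delta')]
```

The derived fact corresponds to option C, "Alex is Omicron Delta."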

Instruction: The output is the result of applying a specific transformation rule to the input. In this question, you will be given an input value and its corresponding transformation rule. Based on this information, determine the correct output from the options provided: A, B, C, or D. Please give the corresponding answer option directly.
Transformation Rule: {Transformation Rule}
Input: {Input}
Based on the above rule and input, choose the correct output from the following options:
A. Output: {Output1}
B. Output: {Output2}
C. Output: {Output3}
D. Output: {Output4}
Answer:
Figure 11: Example prompt template for our KFRD dataset
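A minimal sketch of filling the template in Figure 11 is given below. The placeholder names (`rule`, `input`, `a` to `d`), the line breaks, and the `build_prompt` helper are assumptions chosen so that Python's `str.format` can be used; they are not taken from the released code.

```python
PROMPT_TEMPLATE = """Instruction: The output is the result of applying a specific transformation rule to the input. In this question, you will be given an input value and its corresponding transformation rule. Based on this information, determine the correct output from the options provided: A, B, C, or D. Please give the corresponding answer option directly.
Transformation Rule: {rule}
Input: {input}
Based on the above rule and input, choose the correct output from the following options:
A. Output: {a}
B. Output: {b}
C. Output: {c}
D. Output: {d}
Answer:"""

def build_prompt(rule, input_text, options):
    """Fill the template with one rule, one input, and exactly four options."""
    if len(options) != 4:
        raise ValueError("KFRD questions have exactly four options")
    a, b, c, d = options
    return PROMPT_TEMPLATE.format(rule=rule, input=input_text, a=a, b=b, c=c, d=d)

prompt = build_prompt("Addition", "11, 645", ["595", "536", "771", "656"])
```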
Training | Arabic | Hebrew
Vanilla | LLaMA-2-7B-Chat | Mistral-7B-Instruct-v0.1
SFT | Llama-2-7b-chat-arabic-lora (Icebear-AI, 2024) | -
CPT | SambaLingo-Arabic-Base (Csaki et al., 2024) | DictaLM-2.0 (DICTA, 2024)
CPT+SFT | SambaLingo-Arabic-Chat | DictaLM-2.0-Instruct
Table 2: Training models for Arabic and Hebrew

Appendix B Dataset Examples

We provide some examples of adapted datasets used in this paper in Table 3.

StrategyQA
Question Are more people today related to Genghis Khan than Julius Caesar?
Facts 1. Julius Caesar had three children. 2. Genghis Khan had sixteen children. 3. Modern geneticists have determined that 1 out of every 200 men today has DNA that can be traced to Genghis Khan.
Answer True
MKQA
Query Who sings "I Hear You Knocking But You Can’t Come In"?
Answers Dave Edmunds
BoolQ
Question Do Iran and Afghanistan speak the same language?
Answer True
AmbigQA
Question How often does spermatogenesis—the production of sperm—occur?
Answer 74 days
Table 3: Examples of adapted datasets used in this paper

Appendix C Language Choice

This section provides an overview of the languages utilized in our research, highlighting the primary countries where they are spoken and their respective language families. Refer to Table 4 for detailed information.

ISO | Countries | Language Family
en | US, UK | Germanic
de | Germany, Austria | Germanic
fr | France, Canada | Romance
it | Italy | Romance
pl | Poland | Slavic
ru | Russia, Belarus | Slavic
ar | Egypt, Algeria | Afro-Asiatic
he | Israel | Afro-Asiatic
ja | Japan | Japonic
zh | China (Mainland) | Sino-Tibetan
Table 4: Correspondence between Languages, Countries, and Language Families

Appendix D Implementation Details for Interpretability

D.1 Dataset adjustments

We slightly modified the datasets so that every input ends with a "?" token. Because we analyze the internal representation of the last token, this removes interference from inconsistent final tokens, which could make the representations unreliable, especially in the bottom layers. Appending "?" also acts as a trigger for the model to begin preparing its answer, which is the process we analyze.

For the knowledge-free reasoning dataset, we appended the phrase "Which option should I choose?" in the corresponding language. For the MKQA and BoolQ datasets, where some questions did not end with a "?", we appended a "?". All other datasets already ended with a "?".
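The adjustment described above can be sketched as a small normalization helper. The name `ensure_question_mark` and its `suffix` parameter are hypothetical; the sketch only handles the ASCII "?" mentioned in the text.

```python
def ensure_question_mark(text, suffix=None):
    """Return text ending in '?', optionally appending a closing question first."""
    text = text.rstrip()
    if suffix is not None:
        text = f"{text} {suffix}"
    if not text.endswith("?"):
        text += "?"
    return text

# MKQA/BoolQ style: add the missing "?" only when needed.
print(ensure_question_mark("Do Iran and Afghanistan speak the same language"))
# KFRD style: append the closing question in the target language.
print(ensure_question_mark("Answer:", suffix="Which option should I choose?"))
```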

D.2 Calculation method for activation values

We use the output of the gate linear layer in the SwiGLU module of the LLaMA model, processed through the SiLU function, as the activation values.
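As a sketch, the activation value of each feed-forward neuron described above is SiLU applied to the gate projection of the hidden state. The pure-Python implementation and the toy dimensions and weights below are illustrative assumptions; in practice these values would be read from the model's gate layer during a forward pass.

```python
import math

def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def gate_activations(hidden, gate_weight):
    """Activation of each FFN neuron: SiLU(W_gate @ hidden).

    hidden: list of d_model floats; gate_weight: d_ff rows of d_model floats.
    """
    return [silu(sum(w * h for w, h in zip(row, hidden))) for row in gate_weight]

# Toy example with d_model = 2 and d_ff = 3 (values are made up).
hidden = [1.0, -2.0]
gate_weight = [[0.5, 0.0], [0.0, 1.0], [1.0, 0.5]]
acts = gate_activations(hidden, gate_weight)
```

In the full SwiGLU module these activations are multiplied elementwise by a second (up) projection before the down projection; only the gated term is used as the activation value here.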

Dataset | Samples | Epoch
StrategyQA | 2061 | 4
KFRD Arithmetic | 8000 | 4
KFRD Symbolic | 2000 | 1
KFRD Logical | 4000 | 1
Table 5: Training epoch and number of samples of fine-tuned datasets in the transferability experiments
Dataset | Samples
StrategyQA | 228
KFRD Arithmetic | 800
KFRD Symbolic | 500
KFRD Logical | 500
Table 6: The size of testset used in the transferability experiments

Appendix E Experiments Details

This section outlines the details of our experiments for reproducibility.

E.1 Infrastructure

We used the following scientific artifacts in our research:

  • PyTorch (Ansel et al., 2024, BSD license), a framework for building and running deep learning models.

  • Transformers (Wolf et al., 2020, Apache-2.0 license), a library providing a user-friendly interface for running and fine-tuning pre-trained models.

  • DeepSpeed (Rasley et al., 2020, Apache-2.0 license), a library for optimizing the parallel training of deep learning models.

  • LLaMA-Factory (Zheng et al., 2024, Apache-2.0 license), a library that provides a unified way to fine-tune large language models with parameter-efficient fine-tuning techniques such as LoRA.

E.2 Hyperparameters

In the fine-tuning of all models, we use a learning rate of 2e-4 with a cosine learning rate scheduler. We clip the gradient norm to 1.0, use a total batch size of 64, set the rank of LoRA to 128, and alpha to 16. The LoRA adapters are applied to all the linear layers within Transformer blocks.
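For reference, the stated hyperparameters can be collected in a small configuration object. The `FinetuneConfig` class and the `cosine_lr` helper are hypothetical names for illustration; the schedule below assumes no warmup, since the text does not mention one.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class FinetuneConfig:
    learning_rate: float = 2e-4
    lr_scheduler: str = "cosine"
    max_grad_norm: float = 1.0
    total_batch_size: int = 64
    lora_rank: int = 128
    lora_alpha: int = 16

    @property
    def lora_scaling(self):
        """Standard LoRA scaling factor alpha / r applied to the low-rank update."""
        return self.lora_alpha / self.lora_rank

def cosine_lr(step, total_steps, peak_lr):
    """Cosine decay from peak_lr to 0 (assumes no warmup phase)."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

cfg = FinetuneConfig()
print(cfg.lora_scaling)  # 0.125
```

With rank 128 and alpha 16, the low-rank update is scaled by 16 / 128 = 0.125 under the usual LoRA convention.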

The numbers of training epochs and samples used in the transferability experiments are listed in Table 5. These numbers are tuned so that LLaMA-2-7B-Chat achieves at least 85% accuracy on the corresponding tasks. The sizes of the test sets used in the transferability experiments are shown in Table 6.

In the interpretability experiments, we adjust the number of training epochs or the size of the synthetic datasets to keep the total number of update steps on all datasets around 150, which prevents differing numbers of update steps from interfering with the experimental results. We report the average cosine similarity and neuron activation overlap of 100 samples from each dataset.
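The bookkeeping in this paragraph can be sketched as follows. The step count assumes that a partial final batch still triggers an update and that the total batch size of 64 is the effective batch per step; the pairing of hidden states (presumably the same sample in two languages) and all function names are assumptions for illustration.

```python
import math

def update_steps(n_samples, epochs, batch_size=64):
    """Optimizer steps, assuming a partial final batch still triggers an update."""
    return epochs * math.ceil(n_samples / batch_size)

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_cosine_similarity(pairs):
    """Average similarity over (vector, vector) pairs, e.g. 100 samples per dataset."""
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)

print(update_steps(9600, 1))  # 150
print(update_steps(2400, 4))  # 152
```

Under these assumptions, roughly 9,600 sample passes (for example 2,400 samples for 4 epochs) give about 150 update steps.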

E.3 Computation resources

All the fine-tuning experiments can be run on 4 NVIDIA Tesla V100 32GB GPUs. Each fine-tuning run takes no more than 2 hours.

Appendix F Models Used in the Target Language Proficiency Experiment

The continually pre-trained or fine-tuned variants of LLaMA-2-7B and Mistral-7B used in the target language proficiency experiment in Section 4.3.2 are listed in Table 2.

Appendix G Additional Results of Experiment

Here we provide the accuracy of the above experiments in Figures 12, 13, and 14.

Figure 12: Left: Accuracy of different models on StrategyQA. Solid and dashed lines represent the results of the With Facts and No Facts settings, respectively. Middle: Accuracy of different models on StrategyQA before fine-tuning. Right: Accuracy of LLaMA-2-7B-Chat on StrategyQA under various settings. The translucent line represents the accuracy before fine-tuning on the specific tasks (which are all around 50%).
Figure 13: Accuracy of various models on different parts of KFRD. The translucent line represents the accuracy before finetuning on the specific tasks.
Figure 14: Averaged Accuracy on English and Arabic/Hebrew KFRD for models in different stages trained in Arabic/Hebrew