Search | arXiv e-print repository

Why Not Transform Chat Large Language Models to Non-English?

Authors: Xiang Geng, Ming Zhu, Jiahuan Li, Zhejian Lai, Wei Zou, Shuaijie She, Jiaxin Guo, Xiaofeng Zhao, Yinglu Li, Yuang Li, Chang Su, Yanqing Zhao, Xinglin Lyu, Min Zhang, Jiajun Chen, Hao Yang, Shujian Huang

Abstract: The scarcity of non-English data limits the development of non-English large language models (LLMs). Transforming English-centric LLMs to non-English has been identified as an effective and resource-efficient method. Previous works start from base LLMs and perform knowledge distillation (KD) with data generated by stronger LLMs, e.g. GPT-4. Compared to base LLMs, chat LLMs are further optimized fo… ▽ More The scarcity of non-English data limits the development of non-English large language models (LLMs). Transforming English-centric LLMs to non-English has been identified as an effective and resource-efficient method. Previous works start from base LLMs and perform knowledge distillation (KD) with data generated by stronger LLMs, e.g. GPT-4. Compared to base LLMs, chat LLMs are further optimized for advanced abilities, e.g. multi-turn conversation and human preference alignment, and thus more powerful in both helpfulness and safety. However, transforming a chat LLM involves two critical issues: (1) How can we effectively transfer advanced abilities without their supervised data? (2) How can we prevent the original knowledge from catastrophic forgetting during transformation? We target these issues by introducing a simple framework called TransLLM. For the first issue, TransLLM divides the transfer problem into some common sub-tasks with the translation chain-of-thought, which uses the translation as the bridge between English and non-English step-by-step. We further enhance the performance of sub-tasks with publicly available data. For the second issue, we propose a method comprising two synergistic components: low-rank adaptation for training to maintain the original LLM parameters, and recovery KD, which utilizes data generated by the chat LLM itself to recover the original knowledge from the frozen parameters. In the experiments, we transform the LLaMA-2-chat-7B to the Thai language. Our method, using only single-turn data, outperforms strong baselines and ChatGPT on multi-turn benchmark MT-bench. Furthermore, our method, without safety data, rejects more harmful queries of safety benchmark AdvBench than both ChatGPT and GPT-4. △ Less

Submitted 31 May, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

arXiv:2401.07817 [pdf, other]

Question Translation Training for Better Multilingual Reasoning

Authors: Wenhao Zhu, Shujian Huang, Fei Yuan, Shuaijie She, Jiajun Chen, Alexandra Birch

Abstract: Large language models show compelling performance on reasoning tasks but they tend to perform much worse in languages other than English. This is unsurprising given that their training data largely consists of English text and instructions. A typical solution is to translate instruction data into all languages of interest, and then train on the resulting multilingual data, which is called translat… ▽ More Large language models show compelling performance on reasoning tasks but they tend to perform much worse in languages other than English. This is unsurprising given that their training data largely consists of English text and instructions. A typical solution is to translate instruction data into all languages of interest, and then train on the resulting multilingual data, which is called translate-training. This approach not only incurs high cost, but also results in poorly translated data due to the non-standard formatting of mathematical chain-of-thought. In this paper, we explore the benefits of question alignment, where we train the model to translate reasoning questions into English by finetuning on X-English parallel question data. In this way we perform targeted, in-domain language alignment which makes best use of English instruction data to unlock the LLMs' multilingual reasoning abilities. Experimental results on LLaMA2-13B show that question alignment leads to consistent improvements over the translate-training approach: an average improvement of 11.3% and 16.1% accuracy across ten languages on the MGSM and MSVAMP multilingual reasoning benchmarks. The project will be available at: https://github.com/NJUNLP/QAlign. △ Less

Submitted 29 June, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

Comments: Accepted to Findings of ACL 2024

arXiv:2401.06838 [pdf, other]

MAPO: Advancing Multilingual Reasoning through Multilingual Alignment-as-Preference Optimization

Authors: Shuaijie She, Wei Zou, Shujian Huang, Wenhao Zhu, Xiang Liu, Xiang Geng, Jiajun Chen

Abstract: Though reasoning abilities are considered language-agnostic, existing LLMs exhibit inconsistent reasoning abilities across different languages, e.g., reasoning in the dominant language like English is superior to other languages due to the imbalance of multilingual training data. To enhance reasoning abilities in non-dominant languages, we propose a Multilingual-Alignment-as-Preference Optimizatio… ▽ More Though reasoning abilities are considered language-agnostic, existing LLMs exhibit inconsistent reasoning abilities across different languages, e.g., reasoning in the dominant language like English is superior to other languages due to the imbalance of multilingual training data. To enhance reasoning abilities in non-dominant languages, we propose a Multilingual-Alignment-as-Preference Optimization framework (MAPO), aiming to align the reasoning processes in other languages with the dominant language. Specifically, we harness an off-the-shelf translation model for the consistency between answers in non-dominant and dominant languages, which we adopt as the preference for optimization, e.g., Direct Preference Optimization (DPO) or Proximal Policy Optimization (PPO). Experiments show that MAPO stably achieves significant improvements in the multilingual reasoning of various models on all three benchmarks (MSVAMP +16.2%, MGSM +6.1%, and MNumGLUESub +13.3%), with improved reasoning consistency across languages. △ Less

Submitted 13 April, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

Comments: The project is available at https://github.com/NJUNLP/MAPO

arXiv:2311.07194 [pdf, other]

Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models

Authors: Shuaijie She, Shujian Huang, Xingyun Wang, Yanke Zhou, Jiajun Chen

Abstract: LLMs (Large Language Models) usually interact with users in the form of dialogue and generate responses following their instructions, which naturally require dialogue comprehension abilities. However, dialogue comprehension is a general language ability which is hard to be evaluated directly. In this work, we propose to perform the evaluation focusing on the factual consistency issue with the help… ▽ More LLMs (Large Language Models) usually interact with users in the form of dialogue and generate responses following their instructions, which naturally require dialogue comprehension abilities. However, dialogue comprehension is a general language ability which is hard to be evaluated directly. In this work, we propose to perform the evaluation focusing on the factual consistency issue with the help of the dialogue summarization task. Besides evaluating and analyzing the dialogue summarization performance (DIAC-Sum) of different LLMs, we also derive factual questions from the generated summaries and use them as a more flexible measurement of dialogue comprehension (DIAC-QA). Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency. Even ChatGPT, the strongest model evaluated, has such errors in 16% of its summaries. For answering the factual questions, which is more challenging, the average error rate of all evaluated LLMs is 36.1%. Both results indicate serious deficiencies. Detailed analysis shows that the understanding of subject/object of the conversation is still challenging for LLMs. Furthermore, to stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data, which achieved a relative error rate reduction of 11% on DIAC-QA. △ Less

Submitted 1 April, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

Comments: Accepted at NAACL2024 Main

arXiv:2305.19426 [pdf, other]

doi 10.18653/v1/2023.acl-short.154

ScoNe: Benchmarking Negation Reasoning in Language Models With Fine-Tuning and In-Context Learning

Authors: **gyuan Selena She, Christopher Potts, Samuel R. Bowman, Atticus Geiger

Abstract: A number of recent benchmarks seek to assess how well models handle natural language negation. However, these benchmarks lack the controlled example paradigms that would allow us to infer whether a model had learned how negation morphemes semantically scope. To fill these analytical gaps, we present the Scoped Negation NLI (ScoNe-NLI) benchmark, which contains contrast sets of six examples with up… ▽ More A number of recent benchmarks seek to assess how well models handle natural language negation. However, these benchmarks lack the controlled example paradigms that would allow us to infer whether a model had learned how negation morphemes semantically scope. To fill these analytical gaps, we present the Scoped Negation NLI (ScoNe-NLI) benchmark, which contains contrast sets of six examples with up to two negations where either zero, one, or both negative morphemes affect the NLI label. We use ScoNe-NLI to assess fine-tuning and in-context learning strategies. We find that RoBERTa and DeBERTa models solve ScoNe-NLI after many shot fine-tuning. For in-context learning, we test InstructGPT models and find that most prompt strategies are not successful, including those using step-by-step reasoning. To better understand this result, we extend ScoNe with ScoNe-NLG, a sentence completion test set that embeds negation reasoning in short narratives. Here, InstructGPT is successful, which reveals the model can correctly reason about negation, but struggles to do so on prompt-adapted NLI examples outside of its core pretraining regime. △ Less

Submitted 30 May, 2023; originally announced May 2023.

arXiv:2212.01611 [pdf, other]

CoP: Factual Inconsistency Detection by Controlling the Preference

Authors: Shuaijie She, Xiang Geng, Shujian Huang, Jiajun Chen

Abstract: Abstractive summarization is the process of generating a summary given a document as input. Although significant progress has been made, the factual inconsistency between the document and the generated summary still limits its practical applications. Previous work found that the probabilities assigned by the generation model reflect its preferences for the generated summary, including the preferen… ▽ More Abstractive summarization is the process of generating a summary given a document as input. Although significant progress has been made, the factual inconsistency between the document and the generated summary still limits its practical applications. Previous work found that the probabilities assigned by the generation model reflect its preferences for the generated summary, including the preference for factual consistency, and the preference for the language or knowledge prior as well. To separate the preference for factual consistency, we propose an unsupervised framework named CoP by controlling the preference of the generation model with the help of prompt. More specifically, the framework performs an extra inference step in which a text prompt is introduced as an additional input. In this way, another preference is described by the generation probability of this extra inference process. The difference between the above two preferences, i.e. the difference between the probabilities, could be used as measurements for detecting factual inconsistencies. Interestingly, we found that with the properly designed prompt, our framework could evaluate specific preferences and serve as measurements for fine-grained categories of inconsistency, such as entity-related inconsistency, coreference-related inconsistency, etc. Moreover, our framework could also be extended to the supervised setting to learn better prompt from the labeled data as well. Experiments show that our framework achieves new SOTA results on three factual inconsistency detection tasks. △ Less

Submitted 30 March, 2023; v1 submitted 3 December, 2022; originally announced December 2022.

Comments: Accepted to AAAI2023 regular paper

arXiv:2212.01488 [pdf]

Event knowledge in large language models: the gap between the impossible and the unlikely

Authors: Carina Kauf, Anna A. Ivanova, Giulia Rambelli, Emmanuele Chersoni, **gyuan Selena She, Zawad Chowdhury, Evelina Fedorenko, Alessandro Lenci

Abstract: Word co-occurrence patterns in language corpora contain a surprising amount of conceptual knowledge. Large language models (LLMs), trained to predict words in context, leverage these patterns to achieve impressive performance on diverse semantic tasks requiring world knowledge. An important but understudied question about LLMs' semantic abilities is whether they acquire generalized knowledge of co… ▽ More Word co-occurrence patterns in language corpora contain a surprising amount of conceptual knowledge. Large language models (LLMs), trained to predict words in context, leverage these patterns to achieve impressive performance on diverse semantic tasks requiring world knowledge. An important but understudied question about LLMs' semantic abilities is whether they acquire generalized knowledge of common events. Here, we test whether five pre-trained LLMs (from 2018's BERT to 2023's MPT) assign higher likelihood to plausible descriptions of agent-patient interactions than to minimally different implausible versions of the same event. Using three curated sets of minimal sentence pairs (total n=1,215), we found that pre-trained LLMs possess substantial event knowledge, outperforming other distributional language models. In particular, they almost always assign higher likelihood to possible vs. impossible events (The teacher bought the laptop vs. The laptop bought the teacher). However, LLMs show less consistent preferences for likely vs. unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLM scores generalize well across syntactic variants (active vs. passive constructions) but less well across semantic variants (synonymous sentences), (iii) some LLM errors mirror human judgment ambiguity, and (iv) sentence plausibility serves as an organizing dimension in internal LLM representations. Overall, our results show that important aspects of event knowledge naturally emerge from distributional linguistic patterns, but also highlight a gap between representations of possible/impossible and likely/unlikely events. △ Less

Submitted 26 October, 2023; v1 submitted 2 December, 2022; originally announced December 2022.

Comments: The two lead authors have contributed equally to this work

arXiv:2209.11633 [pdf, ps, other]

Formal Semantics of the CDL Language

Authors: Thorsten Berger, Steven She

Abstract: We reverse-engineer a formal semantics of the Component Definition Language (CDL), which is part of the highly configurable, embedded operating system eCos. This work provides the basis for an analysis and comparison of the two variability-modeling languages Kconfig and CDL. The semantics given in this document are based on analyzing the CDL documentation, inspecting the source code of the toolcha… ▽ More We reverse-engineer a formal semantics of the Component Definition Language (CDL), which is part of the highly configurable, embedded operating system eCos. This work provides the basis for an analysis and comparison of the two variability-modeling languages Kconfig and CDL. The semantics given in this document are based on analyzing the CDL documentation, inspecting the source code of the toolchain, as well as testing the tools on particular examples. △ Less

Submitted 23 September, 2022; originally announced September 2022.

Comments: Technical Note, Department of Computer Science, University of Leipzig, Germany

arXiv:2209.04916 [pdf, ps, other]

Formal Semantics of the Kconfig Language

Authors: Steven She, Thorsten Berger

Abstract: The Kconfig language defines a set of symbols that are assigned a value in a configuration. We describe the semantics of the Kconfig language according to the behavior exhibited in the xconfig configurator. We assume an abstract syntax representation for concepts in the Kconfig language and delegate the details of the translation from concrete to abstract syntaxes to a later document. The Kconfig language defines a set of symbols that are assigned a value in a configuration. We describe the semantics of the Kconfig language according to the behavior exhibited in the xconfig configurator. We assume an abstract syntax representation for concepts in the Kconfig language and delegate the details of the translation from concrete to abstract syntaxes to a later document. △ Less

Submitted 11 September, 2022; originally announced September 2022.

Comments: Technical Note, Department of Electrical and Computer Engineering, University of Waterloo, Canada

arXiv:1710.10523 [pdf, other]

Autonomous Mobile Robot Navigation in Uneven and Unstructured Indoor Environments

Authors: Chaoqun Wang, Lili Meng, Sizhen She, Ian M. Mitchell, Teng Li, Frederick Tung, Weiwei Wan, Max. Q. -H. Meng, Clarence W. de Silva

Abstract: Robots are increasingly operating in indoor environments designed for and shared with people. However, robots working safely and autonomously in uneven and unstructured environments still face great challenges. Many modern indoor environments are designed with wheelchair accessibility in mind. This presents an opportunity for wheeled robots to navigate through sloped areas while avoiding staircase… ▽ More Robots are increasingly operating in indoor environments designed for and shared with people. However, robots working safely and autonomously in uneven and unstructured environments still face great challenges. Many modern indoor environments are designed with wheelchair accessibility in mind. This presents an opportunity for wheeled robots to navigate through sloped areas while avoiding staircases. In this paper, we present an integrated software and hardware system for autonomous mobile robot navigation in uneven and unstructured indoor environments. This modular and reusable software framework incorporates capabilities of perception and navigation. Our robot first builds a 3D OctoMap representation for the uneven environment with the 3D map** using wheel odometry, 2D laser and RGB-D data. Then we project multilayer 2D occupancy maps from OctoMap to generate the the traversable map based on layer differences. The safe traversable map serves as the input for efficient autonomous navigation. Furthermore, we employ a variable step size Rapidly Exploring Random Trees that could adjust the step size automatically, eliminating tuning step sizes according to environments. We conduct extensive experiments in simulation and real-world, demonstrating the efficacy and efficiency of our system. △ Less

Submitted 28 October, 2017; originally announced October 2017.

Comments: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Showing 1–10 of 10 results for author: She, S