License: CC BY-SA 4.0
arXiv:2403.16444v1 [cs.CL] 25 Mar 2024

KIT-19: A Comprehensive Korean Instruction Toolkit on 19 Tasks for Fine-Tuning Korean Large Language Models

Abstract

Instruction tuning of large language models (LLMs) is an essential step for a model to function well and achieve high performance on specific tasks. Accordingly, for mainstream languages such as English, instruction-based datasets are being constructed and made publicly available. In the case of Korean, publicly available models and datasets all rely on the output of ChatGPT or on translations of datasets built in English. In this paper, we introduce KIT-19, an instruction dataset for the development of Korean LLMs. KIT-19 is a dataset in instruction format, built from 19 existing open-source datasets for Korean NLP tasks. We train a Korean pretrained LLM on KIT-19 to demonstrate its effectiveness. The experimental results show that the model trained on KIT-19 significantly outperforms existing Korean LLMs. Based on its quality and these empirical results, we propose that KIT-19 has the potential to contribute substantially to future improvements in the performance of Korean LLMs.

Keywords: Large Language Model, Korean Instruction Dataset, Korean LLM Toolkit, Instruction Tuning

Dongjun Jang, Sungjoo Byun, Hyemi Jo, Hyopil Shin
Department of Linguistics, Seoul National University
3-327, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea
{qwer4107, byunsj, huimei6361, hpshin}@snu.ac.kr

1.   Introduction

Pretrained large language models (LLMs; Shanahan, 2022; Brown et al., 2020; Taylor et al., 2022; Chowdhery et al., 2022; Touvron et al., 2023) are Transformer-based language models (Vaswani et al., 2017) with an extensive parameter count, typically in the hundreds of billions or beyond, trained on vast corpora. These models are typically put to their intended use after instruction tuning, a technique that optimizes a model to follow user-provided instructions.

Since the release of InstructGPT (Ouyang et al., 2022), instruction datasets in English have been widely developed and made publicly available across various domains (Ouyang et al., 2022; Taori et al., 2023a). For Korean, however, which is a relatively low-resource language, there is a notable absence of datasets created in the native language. The Korean instruction datasets currently available are either created by translating existing English datasets with DeepL (https://www.deepl.com/translator) or built from the outputs of large language models such as the ChatGPT API (https://openai.com/blog/chatgpt), and they do not adequately capture the cultural nuances of the Korean language (Table 1). Hence, to advance the performance of Korean large language models, it is imperative to have datasets constructed specifically in the Korean language.

We introduce KIT-19 (A Comprehensive Korean Instruction Toolkit on 19 Tasks), a universal dataset for Korean instruction tuning. KIT-19 is derived from 19 different Korean NLP datasets, each contributing 5,000 examples. It follows the methodology of Longpre et al. (2023) and Bach et al. (2022), which integrates existing NLP datasets into an instruction dataset, without relying on machine-translated outputs from other languages or using LLM output as training data. Compared with the datasets currently used for Korean LLM modeling, KIT-19 effectively captures the cultural features of the Korean language, and the breadth of its 19 distinct datasets contributes to the model's generalizability.
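The general idea of recasting an existing labeled NLP dataset as instruction data can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the record fields `instruction`, `input`, and `output` follow a common Alpaca-style layout and are an assumption here, as are the helper names `to_instruction` and `build_task_split`, the toy sentiment data, and the instruction wording. Only the cap of 5,000 examples per source dataset comes from the text.

```python
import random
from typing import Dict, List

Record = Dict[str, str]


def to_instruction(example: Dict[str, str], instruction: str,
                   input_key: str, output_key: str) -> Record:
    """Wrap one labeled example in an instruction-style record (hypothetical layout)."""
    return {
        "instruction": instruction,
        "input": example[input_key],
        "output": example[output_key],
    }


def build_task_split(examples: List[Dict[str, str]], instruction: str,
                     input_key: str, output_key: str,
                     n: int = 5000, seed: int = 0) -> List[Record]:
    """Sample at most n examples from one source dataset and convert each one."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    picked = examples if len(examples) <= n else rng.sample(examples, n)
    return [to_instruction(ex, instruction, input_key, output_key) for ex in picked]


# Toy stand-in for a Korean sentiment dataset (illustrative data, not from KIT-19).
toy_dataset = [
    {"text": f"리뷰 {i}", "label": "긍정" if i % 2 == 0 else "부정"}
    for i in range(12)
]

# Instruction text means "Classify the sentiment of the following review."
records = build_task_split(
    toy_dataset,
    instruction="다음 리뷰의 감성을 분류하세요.",
    input_key="text",
    output_key="label",
    n=5,
)
print(len(records), sorted(records[0].keys()))
```

Repeating this per source dataset, with a task-specific instruction for each, and concatenating the results would yield one combined instruction dataset; how KIT-19 actually words its instructions is not specified in this section.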

In this paper, we transparently disclose the construction process of KIT-19 and provide detailed information about the source of each of the 19 datasets. To assess the quality of KIT-19, we perform full fine-tuning of Polyglot-Ko-5.8b and Polyglot-Ko-1.3b (Ko et al., 2023), which are representative Korean pretrained LLMs, on our dataset.

Finally, we evaluate publicly available Korean large language models on a total of 6 benchmark datasets. The results show that the Polyglot-Ko-5.8b model trained with KIT-19 outperforms the other models, and that the Polyglot-Ko-1.3b model also exhibits higher performance than other Korean LLMs.

The contributions of our study are as follows:

  • We construct and release a 100K-scale Korean instruction dataset, addressing the data scarcity problem and reducing the reliance on translated or GPT-generated instructions for Korean LLMs.

  • We demonstrate the efficacy of KIT-19 by evaluating LLMs on various benchmarks, comparing models trained with KIT-19 against those trained without it.

Table 1: Existing Korean Instruction Datasets and Construction Methods: It is problematic that most of the existing instruction datasets rely on translation and ChatGPT.

 

| Dataset | Construction Method |
| --- | --- |
| | After translating the Alpaca instruction, the output is generated by ChatGPT. |
| KoAlpaca v1.1 (https://raw.githubusercontent.com/Beomi/KoAlpaca/main/KoAlpaca_v1.1.jsonl) | After collecting questions from JiSikIn (https://kin.naver.com/; Korean online knowledge-sharing platform), answers were generated using ChatGPT. |
| sharegpt_deepl_ko (https://huggingface.co/datasets/junelee/sharegpt_deepl_ko) | Translated the ShareGPT data using DeepL. |
| ShareGPT-74k-ko (https://huggingface.co/datasets/dbdu/ShareGPT-74k-ko) | Translated the cleaned version of ShareGPT with 90k using Google Translate. |
| KoChatGPT (https://github.com/airobotlab/KoChatGPT) | After collecting questions from the Korean question dataset, answers were generated using ChatGPT. |
| OIG-small-chip2-ko (https://huggingface.co/datasets/heegyu/OIG-small-chip2-ko) | Translated the OIG-smallchip-2 (https://github.com/LAION-AI/Open-Instruction-Generalist) English data from LAION AI using Google Translate. |
| Korquad-Chat (https://huggingface.co/datasets/heegyu/korquad-chat-v1) | Given the context (paragraphs from news and Wikipedia) of the KorQuAD v1 data, relevant conversations were generated using ChatGPT. |
| AIRC-KETI/kowow (https://github.com/AIRC-KETI/kowow) | The translated data of WoW (Wizard of Wikipedia), a knowledge-based conversation dataset. |
| CounselGPT (https://github.com/MrBananaHuman/CounselGPT) | Counseling data generated by ChatGPT API. |
| Evolve-instruct (https://github.com/lcw99/evolve-instruct/) | Instructions were augmented using evol-instruct from WizardLM, and then answers were generated by ChatGPT. |
| KULLM v2 (https://huggingface.co/datasets/nlpai-lab/kullm-v2) | Translated the GPT4ALL, Dolly, and Vicuna (ShareGPT) data using DeepL. |
| namuwiki_alpaca_dataset (https://huggingface.co/datasets/psymon/namuwiki_alpaca_dataset) | A dataset modified to fit Stanford Alpaca training from the NamuWiki (Korean wiki platform) dump file. |
| ko-lima-vicuna (https://huggingface.co/datasets/changpt/ko-lima-vicuna) | A dataset regenerated in Korean using the ChatGPT API, based on the lima_vicuna_format (https://huggingface.co/datasets/64bits/lima_vicuna_format) data. |
| ko-lima (https://huggingface.co/datasets/taeshahn/ko-lima) | A dataset translated into Korean from the training data of Zhou et al. (2023). |
| Ko-StrategyQA (https://huggingface.co/datasets/NomaDamas/Ko-StrategyQA) | This dataset is the Korean version of StrategyQA (https://allenai.org/data/strategyqa). All questions and paragraphs from the original dataset were translated using DeepL. |
| KOpen-platypus (https://huggingface.co/datasets/kyu**py/KOpen-platypus) | Translated Lee et al. (2023) with DeepL. |
| EverythingLM-data-V2-Ko (https://huggingface.co/datasets/ziozzang/EverythingLM-data-V2-Ko) | Translated EverythingLM V2 (https://huggingface.co/datasets/totally-not-an-llm/EverythingLM-data-V2) in DeepL. |
| human-rights-corpus/HRC/ | Using the precedents and consultation cases from the National Human Rights Commission of South Korea as a reference, GPT-3.5-turbo for one-shot learning was employed to generate question-answer pairs. |

 
Figure 1: A glance at the KIT for Korean LLM: We create instruction datasets by drawing from 19 Korean NLP datasets across 10 different categories. We utilize ‘kowiki_text’ as a source dataset for both Closed Book QA and Next Sentence Prediction tasks.