KIT-19: A Comprehensive Korean Instruction Toolkit on 19 Tasks for Fine-Tuning Korean Large Language Models

Abstract

Instruction tuning of Large Language Models (LLMs) is an essential process for a model to function well and achieve high performance on specific tasks. Accordingly, for mainstream languages such as English, instruction-based datasets are being constructed and made publicly available. In the case of Korean, publicly available models and datasets all rely on the output of ChatGPT or on translations of datasets built in English. In this paper, we introduce KIT-19, an instruction dataset for the development of LLMs in Korean. KIT-19 is a dataset in instruction format, built from 19 existing open-source datasets for Korean NLP tasks. We train a Korean pretrained LLM on KIT-19 to demonstrate its effectiveness. The experimental results show that the model trained on KIT-19 significantly outperforms existing Korean LLMs. Based on its quality and these empirical results, we propose that KIT-19 has the potential to make a substantial contribution to the future improvement of Korean LLMs' performance.

Keywords: Large Language Model, Korean Instruction Dataset, Korean LLM Toolkit, Instruction Tuning

Dongjun Jang, Sungjoo Byun, Hyemi Jo, Hyopil Shin

Department of Linguistics, Seoul National University

3-327, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea

{qwer4107, byunsj, huimei6361, hpshin}@snu.ac.kr

1. Introduction

Pretrained Large Language Models (LLMs; Shanahan, 2022; Brown et al., 2020; Taylor et al., 2022; Chowdhery et al., 2022; Touvron et al., 2023) are Transformer-based language models (Vaswani et al., 2017) with an extensive parameter count, typically in the hundreds of billions or beyond, trained on vast corpora. These models are typically adapted to their intended purposes through instruction tuning, a technique that optimizes a model to follow user-provided instructions.

Since the release of InstructGPT (Ouyang et al., 2022), English instruction datasets have been widely developed and made publicly available across various domains (Ouyang et al., 2022; Taori et al., 2023a). However, for Korean, a relatively low-resource language, there is a notable absence of datasets created natively in the language. The Korean instruction datasets currently available are created either by translating existing English datasets with DeepL (https://www.deepl.com/translator) or by relying on the outputs of large language models such as the ChatGPT API (https://openai.com/blog/chatgpt), and so they do not adequately capture the cultural nuances of the Korean language (Table 1). Hence, to advance the performance of Korean Large Language Models, it is imperative to have datasets constructed specifically in Korean.

We introduce KIT-19 (A Comprehensive Korean Instruction Toolkit on 19 Tasks), a universal dataset for Korean instruction tuning. KIT-19 is derived from 19 different Korean NLP datasets, each contributing 5,000 examples. It follows the methodology of Longpre et al. (2023) and Bach et al. (2022), which integrates existing NLP datasets into an instruction dataset, and it relies neither on machine-translated text from other languages nor on LLM output as training data. Compared with the datasets currently employed for Korean LLM modeling, KIT-19 effectively captures the cultural features of the Korean language and, because its 19 source datasets cover a broad range of tasks, contributes to the model's generalizability.
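As a minimal sketch of this kind of conversion, the snippet below wraps labeled examples from an existing task into instruction/output pairs and caps each source dataset at 5,000 examples. The template wording, the field names (`premise`, `hypothesis`, `label`), and the helper names are illustrative assumptions; the actual KIT-19 templates and sampling procedure are not specified in this section.

```python
import random
from typing import Dict, List

# Hypothetical template for an NLI-style task. In practice the prompt and the
# examples would be written in Korean; English is used here for readability.
NLI_TEMPLATE = (
    "Decide whether the premise entails, contradicts, or is neutral "
    "toward the hypothesis.\n"
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}"
)


def to_instruction(example: Dict[str, str], template: str,
                   label_key: str = "label") -> Dict[str, str]:
    """Wrap one labeled example as an instruction/output pair."""
    # Every field except the label is available to the template.
    fields = {k: v for k, v in example.items() if k != label_key}
    return {
        "instruction": template.format(**fields),
        "output": example[label_key],
    }


def build_task(examples: List[Dict[str, str]], template: str,
               per_task: int = 5000, seed: int = 0) -> List[Dict[str, str]]:
    """Convert one source dataset, keeping at most `per_task` examples."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    if len(examples) > per_task:
        examples = rng.sample(examples, per_task)
    return [to_instruction(ex, template) for ex in examples]


toy_examples = [
    {"premise": "A", "hypothesis": "B", "label": "entailment"},
    {"premise": "C", "hypothesis": "D", "label": "neutral"},
    {"premise": "E", "hypothesis": "F", "label": "contradiction"},
]
records = build_task(toy_examples, NLI_TEMPLATE, per_task=2)
```

Because the outputs are the gold labels of the source dataset rather than model generations, a pipeline of this shape needs neither translation nor LLM-produced answers.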

In this paper, we disclose the construction process of KIT-19 and provide detailed information about the source of each of the 19 datasets. To assess the quality of KIT-19, we fully fine-tune Polyglot-Ko-5.8b and Polyglot-Ko-1.3b (Ko et al., 2023), two representative Korean pretrained LLMs, on our dataset.

Finally, we evaluate publicly available Korean Large Language Models on a total of 6 benchmark datasets. The results show that the Polyglot-Ko-5.8b model trained with KIT-19 outperforms the others, and that the Polyglot-Ko-1.3b model also exhibits higher performance than other Korean LLMs.

The contributions of our study are as follows:

• We construct and release a 100K Korean instruction dataset, addressing the data scarcity problem and reducing the reliance on translated or GPT-generated instructions for Korean LLMs.

• We demonstrate the efficacy of KIT-19 by evaluating LLMs on various benchmarks, comparing those trained with KIT-19 and those trained without it.

Table 1: Existing Korean instruction datasets and their construction methods. Problematically, most of the existing instruction datasets rely on translation and ChatGPT.

Dataset	Construction Method
KoAlpaca v1.0 (https://huggingface.co/datasets/Bingsu/ko_alpaca_data)	After translating the Alpaca instructions, the outputs were generated by ChatGPT.
KoAlpaca v1.1 (https://raw.githubusercontent.com/Beomi/KoAlpaca/main/KoAlpaca_v1.1.jsonl)	After collecting questions from JiSikIn (https://kin.naver.com/, a Korean online knowledge-sharing platform), answers were generated using ChatGPT.
sharegpt_deepl_ko (https://huggingface.co/datasets/junelee/sharegpt_deepl_ko)	Translated the ShareGPT data using DeepL.
ShareGPT-74k-ko (https://huggingface.co/datasets/dbdu/ShareGPT-74k-ko)	Translated the cleaned 90k version of ShareGPT using Google Translate.
KoChatGPT (https://github.com/airobotlab/KoChatGPT)	After collecting questions from the Korean question dataset, answers were generated using ChatGPT.
OIG-small-chip2-ko (https://huggingface.co/datasets/heegyu/OIG-small-chip2-ko)	Translated the English OIG-smallchip-2 data (https://github.com/LAION-AI/Open-Instruction-Generalist) from LAION AI using Google Translate.
Korquad-Chat (https://huggingface.co/datasets/heegyu/korquad-chat-v1)	Given the context (paragraphs from news and Wikipedia) of the KorQuAD v1 data, relevant conversations were generated using ChatGPT.
AIRC-KETI/kowow (https://github.com/AIRC-KETI/kowow)	A translation of WoW (Wizard of Wikipedia), a knowledge-based conversation dataset.
CounselGPT (https://github.com/MrBananaHuman/CounselGPT)	Counseling data generated by the ChatGPT API.
Evolve-instruct (https://github.com/lcw99/evolve-instruct/)	Instructions were augmented using evol-instruct from WizardLM, and then answers were generated by ChatGPT.
KULLM v2 (https://huggingface.co/datasets/nlpai-lab/kullm-v2)	Translated the GPT4ALL, Dolly, and Vicuna (ShareGPT) data using DeepL.
namuwiki_alpaca_dataset (https://huggingface.co/datasets/psymon/namuwiki_alpaca_dataset)	A dataset built from the NamuWiki (Korean wiki platform) dump file, modified to fit Stanford Alpaca training.
ko-lima-vicuna (https://huggingface.co/datasets/changpt/ko-lima-vicuna)	A dataset regenerated in Korean using the ChatGPT API, based on the lima_vicuna_format data (https://huggingface.co/datasets/64bits/lima_vicuna_format).
ko-lima (https://huggingface.co/datasets/taeshahn/ko-lima)	A dataset translated into Korean from the training data of Zhou et al. (2023).
Ko-StrategyQA (https://huggingface.co/datasets/NomaDamas/Ko-StrategyQA)	The Korean version of StrategyQA (https://allenai.org/data/strategyqa). All questions and paragraphs from the original dataset were translated using DeepL.
KOpen-platypus (https://huggingface.co/datasets/kyu**py/KOpen-platypus)	Translated the data of Lee et al. (2023) using DeepL.
EverythingLM-data-V2-Ko (https://huggingface.co/datasets/ziozzang/EverythingLM-data-V2-Ko)	Translated EverythingLM V2 (https://huggingface.co/datasets/totally-not-an-llm/EverythingLM-data-V2) using DeepL.
human-rights-corpus/HRC/	Using the precedents and consultation cases from the National Human Rights Commission of South Korea as a reference, GPT-3.5-turbo with one-shot learning was employed to generate question-answer pairs.