Showing 1–3 of 3 results for author: Ohi, M

  1. arXiv:2404.17790  [pdf, other]

    cs.CL cs.AI

    Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities

    Authors: Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, Naoaki Okazaki

    Abstract: Cross-lingual continual pre-training of large language models (LLMs) initially trained on English corpus allows us to leverage the vast amount of English language resources and reduce the pre-training cost. In this study, we constructed Swallow, an LLM with enhanced Japanese capability, by extending the vocabulary of Llama 2 to include Japanese characters and conducting continual pre-training on a…

    Submitted 27 April, 2024; originally announced April 2024.

  2. arXiv:2404.17733  [pdf, other]

    cs.CL cs.AI

    Building a Large Japanese Web Corpus for Large Language Models

    Authors: Naoaki Okazaki, Kakeru Hattori, Hirai Shota, Hiroki Iida, Masanari Ohi, Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Rio Yokota, Sakae Mizuki

    Abstract: Open Japanese large language models (LLMs) have been trained on the Japanese portions of corpora such as CC-100, mC4, and OSCAR. However, these corpora were not created for the quality of Japanese texts. This study builds a large Japanese web corpus by extracting and refining text from the Common Crawl archive (21 snapshots of approximately 63.4 billion pages crawled between 2020 and 2023). This c…

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: 17 pages

  3. arXiv:2402.15987  [pdf, other]

    cs.CL cs.AI

    Likelihood-based Mitigation of Evaluation Bias in Large Language Models

    Authors: Masanari Ohi, Masahiro Kaneko, Ryuto Koike, Mengsay Loem, Naoaki Okazaki

    Abstract: Large Language Models (LLMs) are widely used to evaluate natural language generation tasks as automated metrics. However, the likelihood, a measure of LLM's plausibility for a sentence, can vary due to superficial differences in sentences, such as word order and sentence structure. It is therefore possible that there might be a likelihood bias if LLMs are used for evaluation: they might overrate s…

    Submitted 1 March, 2024; v1 submitted 24 February, 2024; originally announced February 2024.

    Comments: 4 main pages