-
LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs
Authors:
LLM-jp,
Akiko Aizawa,
Eiji Aramaki,
Bowen Chen,
Fei Cheng,
Hiroyuki Deguchi,
Rintaro Enomoto,
Kazuki Fujii,
Kensuke Fukumoto,
Takuya Fukushima,
Namgi Han,
Yuto Harada,
Chikara Hashimoto,
Tatsuya Hiraoka,
Shohei Hisada,
Sosuke Hosokawa,
Lu Jie,
Keisuke Kamata,
Teruhito Kanazawa,
Hiroki Kanezashi,
Hiroshi Kataoka,
Satoru Katsumata,
Daisuke Kawahara,
Seiya Kawano,
et al. (57 additional authors not shown)
Abstract:
This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit https://llm-jp.nii.ac.jp/en/.
Submitted 4 July, 2024;
originally announced July 2024.
-
A Dataset for Pharmacovigilance in German, French, and Japanese: Annotating Adverse Drug Reactions across Languages
Authors:
Lisa Raithel,
Hui-Syuan Yeh,
Shuntaro Yada,
Cyril Grouin,
Thomas Lavergne,
Aurélie Névéol,
Patrick Paroubek,
Philippe Thomas,
Tomohiro Nishiyama,
Sebastian Möller,
Eiji Aramaki,
Yuji Matsumoto,
Roland Roller,
Pierre Zweigenbaum
Abstract:
User-generated data sources have gained significance in uncovering Adverse Drug Reactions (ADRs), with an increasing number of discussions occurring in the digital world. However, the existing clinical corpora predominantly revolve around scientific articles in English. This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, four attribute types, and 13 relation types. It contributes to the development of real-world multilingual language models for healthcare. We provide statistics to highlight certain challenges associated with the corpus and conduct preliminary experiments resulting in strong baselines for extracting entities and relations between these entities, both within and across languages.
Submitted 27 March, 2024;
originally announced March 2024.
-
HeaRT: Health Record Timeliner to visualise patients' medical history from health record text
Authors:
Shuntaro Yada,
Eiji Aramaki
Abstract:
Electronic health records (EHRs), which contain patients' medical histories, tend to be written in freely formatted (unstructured) text because they are complicated by their nature. Quickly understanding a patient's history is challenging and critical because writing styles vary among doctors, which may even cause clinical incidents. This paper proposes a Health Record Timeliner system (HeaRT), which visualises patients' clinical histories directly from natural language text in EHRs. Unlike only a few previous attempts, our system achieved feasible and practical performance for the first time, by integrating a state-of-the-art language model that recognises clinical entities (e.g. diseases, medicines, and time expressions) and their temporal relations from the raw text in EHRs and radiology reports. By chronologically aligning the clinical entities to the clinical events extracted from a medical report, this web-based system visualises them in a Gantt chart-like format. Our novel evaluation method showed that the proposed system successfully generated coherent timelines from the two sets of radiology reports describing the same CT scan but written by different radiologists. Real-world assessments are planned to improve the remaining issues.
Submitted 25 June, 2023;
originally announced June 2023.
-
JaMIE: A Pipeline Japanese Medical Information Extraction System
Authors:
Fei Cheng,
Shuntaro Yada,
Ribeka Tanaka,
Eiji Aramaki,
Sadao Kurohashi
Abstract:
We present an open-access natural language processing toolkit for Japanese medical information extraction. We first propose a novel relation annotation schema for investigating the medical and temporal relations between medical entities in Japanese medical reports. We experiment with the practical annotation scenarios by separately annotating two different types of reports. We design a pipeline system with three components for recognizing medical entities, classifying entity modalities, and extracting relations. The empirical results show accurate analyzing performance and suggest the satisfactory annotation quality, the effective annotation strategy for targeting report types, and the superiority of the latest contextual embedding models.
Submitted 7 November, 2021;
originally announced November 2021.
-
End-to-end Biomedical Entity Linking with Span-based Dictionary Matching
Authors:
Shogo Ujiie,
Hayate Iso,
Shuntaro Yada,
Shoko Wakamiya,
Eiji Aramaki
Abstract:
Disease name recognition and normalization, which is generally called biomedical entity linking, is a fundamental process in biomedical text mining. Recently, neural joint learning of both tasks has been proposed to utilize the mutual benefits. While this approach achieves high performance, disease concepts that do not appear in the training dataset cannot be accurately predicted. This study introduces a novel end-to-end approach that combines span representations with dictionary-matching features to address this problem. Our model handles unseen concepts by referring to a dictionary while maintaining the performance of neural network-based models, in an end-to-end fashion. Experiments using two major datasets demonstrate that our model achieved competitive results with strong baselines, especially for unseen concepts during training.
Submitted 21 April, 2021;
originally announced April 2021.
-
KART: Parameterization of Privacy Leakage Scenarios from Pre-trained Language Models
Authors:
Yuta Nakamura,
Shouhei Hanaoka,
Yukihiro Nomura,
Naoto Hayashi,
Osamu Abe,
Shuntaro Yada,
Shoko Wakamiya,
Eiji Aramaki
Abstract:
For the safe sharing of pre-trained language models, no guidelines exist at present owing to the difficulty in estimating the upper bound of the risk of privacy leakage. One problem is that previous studies have assessed the risk for different real-world privacy leakage scenarios and attack methods, which reduces the portability of the findings. To tackle this problem, we represent complex real-world privacy leakage scenarios under a universal parameterization, Knowledge, Anonymization, Resource, and Target (KART). KART parameterization has two merits: (i) it clarifies the definition of privacy leakage in each experiment and (ii) it improves the comparability of the findings of risk assessments. We show that previous studies can be simply reviewed by parameterizing the scenarios with KART. We also demonstrate privacy risk assessments in different scenarios under the same attack method, which suggests that KART helps approximate the upper bound of risk under a specific attack or scenario. We believe that KART helps integrate past and future findings on privacy risk and will contribute to a standard for sharing language models.
Submitted 17 March, 2022; v1 submitted 31 December, 2020;
originally announced January 2021.
-
Syndromic surveillance using search query logs and user location information from smartphones against COVID-19 clusters in Japan
Authors:
Shohei Hisada,
Taichi Murayama,
Kota Tsubouchi,
Sumio Fujita,
Shuntaro Yada,
Shoko Wakamiya,
Eiji Aramaki
Abstract:
[Background] Two clusters of coronavirus disease 2019 (COVID-19) were confirmed in Hokkaido, Japan in February 2020. To capture the clusters, this study employs Web search query logs and user location information from smartphones. [Material and Methods] First, we anonymously identified smartphone users who used a Web search engine (Yahoo! JAPAN Search) to search for COVID-19 or its symptoms via its companion application for smartphones (Yahoo Japan App). We regard these searchers as Web searchers who are suspicious of their own COVID-19 infection (WSSCI). Second, we extracted the location of the WSSCI via the smartphone application. The spatio-temporal distribution of the number of WSSCI is compared with the actual locations of the two known clusters. [Result and Discussion] Before the early stage of the cluster development, we could confirm several WSSCI, which demonstrated the basic feasibility of our WSSCI-based approach. However, it is accurate only in the early stage, and it was biased after the public announcement of the cluster development. Where other cluster-related resources, such as fine-grained population statistics, are not available, the proposed metric would be helpful in catching hints of emerging clusters.
Submitted 21 April, 2020;
originally announced April 2020.
-
NAIST COVID: Multilingual COVID-19 Twitter and Weibo Dataset
Authors:
Zhiwei Gao,
Shuntaro Yada,
Shoko Wakamiya,
Eiji Aramaki
Abstract:
Since the outbreak of coronavirus disease 2019 (COVID-19) in late 2019, it has affected over 200 countries and billions of people worldwide. This has affected the social life of people owing to enforcements, such as "social distancing" and "stay at home." This has resulted in increasing interaction through social media. Given that social media can bring us valuable information about COVID-19 at a global scale, it is important to share the data and encourage social media studies against COVID-19 or other infectious diseases. Therefore, we have released a multilingual dataset of social media posts related to COVID-19, consisting of microblogs in English and Japanese from Twitter and those in Chinese from Weibo. The data cover microblogs from January 20, 2020, to March 24, 2020. This paper also provides a quantitative as well as qualitative analysis of these datasets by creating daily word clouds as an example of text-mining analysis. The dataset is now available on GitHub. This dataset can be analyzed in a multitude of ways and is expected to help in efficient communication of precautions related to COVID-19.
Submitted 17 April, 2020;
originally announced April 2020.