Data Augmentation Techniques for Chinese Disease Name Normalization

Wenqian Cui Xiangling Fu Shaohui Liu Mingjun Gu Xien Liu Ji Wu Irwin King

Abstract

Disease name normalization is an important task in the medical domain. It classifies disease names written in various formats into standardized names, serving as a fundamental component in smart healthcare systems for various disease-related functions. Nevertheless, the most significant obstacle to existing disease name normalization systems is the severe shortage of training data. Consequently, we present a novel data augmentation approach that includes a series of data augmentation techniques and some supporting modules to help mitigate the problem. Our proposed methods rely on the Structural Invariance property of disease names and the Hierarchy property of the disease classification system. The goal is to equip the models with extensive understanding of the disease names and the hierarchical structure of the disease name classification system. Through extensive experimentation, we illustrate that our proposed approach exhibits significant performance improvements across various baseline models and training objectives, particularly in scenarios with limited training data.

Keywords: Data Augmentation, Disease Name Normalization, Medical Natural Language Processing

Journal: Journal of Biomedical Informatics

Affiliations: [1] The Chinese University of Hong Kong, Hong Kong; [2] Beijing University of Posts and Telecommunications, Beijing; [3] Tsinghua University, Beijing


Footnote: The non-standard abbreviations used in this paper: • DDA: Disease Data Augmentation • AR: Axis-word Replacement • MGA: Multi-Granularity Aggregation • ngm: $n$-gram matching score • UDN: Unnormalized Disease Name • SDN: Standard Disease Name • NoDC: Number of Disease Concepts • EDA: Easy Data Augmentation • BT: Back Translation

Highlights

Disease name normalization is a crucial task that classifies disease names written in diverse formats into standard names.

The primary challenge in disease name normalization lies in the scarcity of annotated training data, hindering the development of robust models.

Our novel data augmentation approach, Disease Data Augmentation (DDA), focuses on manipulating both the key elements (axis words) and the hierarchical structures of the disease names.

Regular data augmentation methods fail to perform well on the disease name normalization task.

Our proposed DDA approach demonstrates remarkable efficacy in enhancing the performance of disease normalization tasks across various baseline models.

Particularly effective in scenarios with smaller datasets, our DDA approach achieves impressive results, recovering nearly 80% of the full performance for specific evaluation metrics.

By striking a perfect balance between model complexity and performance, our DDA approach outperforms various Large Language Model (LLM) baselines, showcasing its efficiency and effectiveness.

1 Introduction

Disease names play a pivotal role in modern intelligent healthcare systems, as they are involved in diverse tasks such as intelligent consultation [1], auxiliary diagnosis [2, 3], automated International Classification of Diseases (ICD) coding [4, 5, 6], and Diagnosis-Related Groups prediction [7, 8, 9]. However, in clinical settings, doctors often write disease names according to their own habits and preferences, leading to numerous variations for the same disease. Therefore, before any further operations can be carried out on disease names, they must be normalized into standard names. As a result, disease name normalization, which entails classifying the diagnosis terms in clinical documents into standard names or classifications, plays a critical role in the ecosystem. Figure 1 illustrates the disease name normalization task.

One of the main challenges in the disease normalization task is data scarcity. Specifically, a substantial proportion of disease names and concepts are typically not covered in the training set, leading to few-shot or zero-shot scenarios in the normalization process. For example, in the CHIP-CDN dataset [10], only about 25% of all the diseases are covered, and the scarcity is even more pronounced in the NCBI Disease Corpus [11] and BioCreative V [12] datasets, as indicated in Figure 2. In this case, it is extremely difficult for the models to gain comprehensive knowledge about the disease system. Although collecting more data seems to be a natural solution, it is especially difficult in the medical field due to privacy concerns and the requirement for expertise. Hence, in this work, we use data augmentation as a workaround to address the data scarcity problem.

Figure 1: Examples and illustrations of the disease name normalization task

We design a data augmentation approach for Chinese disease name normalization, called Disease Data Augmentation (DDA), which comprises a set of data augmentation methods and some supporting modules. Our data augmentation methods are based on the following two characteristics of disease names. Firstly, disease names exhibit Structural Invariance. Disease names consist of various key elements (axis words) such as anatomical region, clinical manifestations, etiology, and pathology. Replacing one of the elements with another of the same category typically still results in a valid disease name. For example, when the anatomical region “髂 (Iliac)” of the disease “髂总动脉夹层 (Common iliac artery dissection)” is replaced by another region “颈 (Carotid)”, we derive the name of the same type of disease located in another region, “颈总动脉夹层 (Common carotid artery dissection)”. Therefore, in disease name normalization, we can create new training data by replacing specific elements in pairs of clinical and standard disease names simultaneously. Secondly, the classification system of disease names exhibits the Hierarchy property, allowing more specific descriptions to be subsumed into larger, coarser groups. For instance, the more detailed disease definition “急性喉炎 (Acute Laryngitis)” can also be viewed as “喉炎 (Laryngitis)”. Hence, we can augment the training data by labeling a fine-grained disease with its parent disease in the classification system. By augmenting the training data under the above rules, we provide the models with an extensive understanding of disease names, particularly those that are absent from the original training set. Additionally, our methods enhance the models’ comprehension of the hierarchical structure and the relationships among different disease names.

Our experiments demonstrate that our DDA approach outperforms all other data augmentation counterparts and effectively enhances the performance of various disease name normalization baselines. Furthermore, our approach is particularly effective on smaller datasets and can achieve nearly 80% of the full performance even when no data from the training set is provided.

The Statement of Significance of this paper is as follows.

• Problem or Issue: The primary challenge in disease name normalization, which involves classifying variously formatted disease names into standardized terms, is the scarcity of annotated training data. This hinders the development of effective and robust models.
• What is Already Known: Disease names have two main characteristics: they are composed of several key elements, and their classification system exhibits a hierarchical structure.
• What this Paper Adds: This paper introduces a novel data augmentation approach named Disease Data Augmentation (DDA), which leverages the two previously mentioned characteristics of disease names. We demonstrate through experiments that DDA significantly enhances the performance of the Chinese disease name normalization task compared to baseline approaches and across various backbone models.

The subsequent sections of this paper are structured as follows. Section 2 provides an in-depth exploration of the background of the disease name normalization task and discusses related work pertinent to our research. In Section 3, we present a detailed explanation of our methodology. Section 4 outlines the experiments conducted, details the dataset utilized, and presents the results obtained. The final sections conclude the paper and discuss the limitations and future works of the study.

2 Background and Related Work

[Figure 3: tree diagram] Biomedical Entity Linking
• Disease Entity Linking: Rule-based Approach [26]; Machine Learning-based Approach [18, 19, 20, 21, 22, 23, 24, 25]; Language Model-based Approach [17]
• Disease Name Normalization: Task-specific Model Architecture [1, 10, 15, 16]; Data Augmentation (ours); Language Model-based Approach [13, 14]
• Other Biomedical Entity Linking Tasks

Figure 3: A taxonomy of biomedical entity linking methods. Our approach falls into the data augmentation category within the disease name normalization task.

2.1 Disease Name Normalization

Disease name normalization refers to the process of matching or classifying disease terms written by doctors in clinical documents to their standard names based on a given classification system. The term “disease name normalization” is widely used in the literature, but its meaning varies and can refer to different tasks. Some literature defines disease name normalization as the process of retrieving and matching disease terms in lengthy medical description texts, such as the medical literature abstracts in the NCBI Disease Corpus [11] and BioCreative-V-CDR-Corpus [12] datasets. We argue that this task does not align well with the name “normalization”, because the unnormalized and normalized entities should belong to the same concept, such as disease name. The task of retrieving from lengthy medical description texts should instead be categorized as Disease Entity Linking. This falls under the broader category of Biomedical Entity Linking (BEL), which, according to [27], is described as “the task of mapping of spans of text within biomedical documents to normalized, unique identifiers within an ontology”. We see Disease Entity Linking as an end-to-end approach to classifying disease names mentioned in the description text, and this larger task can be divided into two subtasks: identifying disease mentions in the lengthy description text and normalizing the identified mentions into standard names based on the classification system. In this work, we define the disease name normalization task as the second subtask of the Disease Entity Linking task, which is consistent with the definitions in [1, 13, 10], and we use the 10th version of the ICD system as the standard classification system. Figure 3 shows the taxonomy of the Biomedical Entity Linking task, including our work and the related works.

2.2 Data Augmentation on Text Data

Data augmentation is a technique that generates new data from existing datasets to increase data volume and help prevent model overfitting. While it is simpler to augment image data without losing meaning, text augmentation is more challenging due to its unstructured nature [28]. Some approaches, like those suggested by [29], apply character-level modifications such as replacement, insertion, swap, and deletion, though these can introduce grammatical errors. Back translation [30] maintains semantic integrity but lacks diversity and depends on the quality of the translation tools.

There are also more complex methods for text augmentation. [31] uses lexicalized probabilistic context-free grammars to capture the complex structure of natural language and replace words, with effective results. However, this grammar-based approach is challenging to apply to specialized domains like medicine. Pre-trained language models are also used for augmentation; for example, [28] and [32] utilize the MLM objective in BERT [32] to regenerate masked words, and [33] compares different pre-trained model methods. However, these methods can alter the original text’s meaning after several MLM replacements. Additionally, semi-supervised learning can augment data using the vast amount of unlabeled data. [34] uses MixUp to guess low-entropy labels of augmented data and combines labeled and unlabeled data to derive a loss term, while [35] performs data augmentation on unlabeled data for consistency training. However, our focus here is solely on the labeled data rather than the unlabeled data. For an extensive overview of text data augmentation methods, we refer the readers to [36].

2.3 Data Augmentation on Medical Data

While the majority of studies focus on the impact of data augmentation on general text data, some studies explore the potential of data augmentation operations on medical text data. Several works concentrate on synonym replacement in medical terms. [37] and [38] utilize the Unified Medical Language System (UMLS) [39] to identify medical synonyms for replacements in classification texts. [37] also replaces both medical terms in raw texts and the classification label to generate new training data, focusing on the ICD-coding task. While their work mainly centers on replacing the entire medical term, we investigate the possibility of replacing the components within the medical terms. Furthermore, [40] examines the performance of EDA, conditional pre-trained language models, and back translation for data augmentation on social media texts for mental health classification. [41] proposes Segment Reordering as a data augmentation technique to preserve the semantic meaning of medical texts. [42] use pre-trained language models fine-tuned on General Semantic Textual Similarity (STS-G) data to generate pseudo-labels on medical STS data and then undergo iterative training.

3 Proposed Methodology

This section introduces the details of the pipeline for our proposed data augmentation approach called Disease Data Augmentation (DDA), as depicted in Figure 4. Our approach consists of three main components: 1) a named entity recognition (NER) module, 2) a data augmentation module, and 3) a semantic filtering module. Specifically, all the inputs first go through an NER system that locates and identifies all the elements, and the results are then sent to the data augmentation (DA) module to generate new pairs of data. A semantic filtering module at the end filters out unwanted pairs.

We first define the concept of “axis word” and the type of axis words used in this work. We then introduce the three main modules of our approach. Finally, we illustrate the training paradigm of our approach. For clarification, we use the terms “unnormalized disease name” and “standard disease name” to denote the input and output of the disease normalization system, respectively.

3.1 Axis Word

Disease names are composed of several elements (axis words), which include but are not limited to etiology, pathology, clinical manifestations, anatomical region, chronicity, degree type, and characteristic [43, 44]. Therefore, we define axis words as the elements within the disease names. For ease of expression, we merge etiology and pathology into the disease center and select among the remaining elements, which yields three main categories: disease center, anatomical region, and disease characteristic. A large portion of disease names can be composed from these three axis words. Table 1 shows the definition of the three axis words with an example alongside them.

Table 1: Definition and an example of the axis words used in this work.

Axis Word	Definition	Example
Disease Center	The minimal term that describes the nature of a disease, which may include etiology and pathology. It defines the main category of the disease.	增生性毛发囊肿 (Proliferative Trichilemmal Cyst)
Anatomical Region	A part of the human body that has actual meaning in anatomy. This part of the disease name indicates which part of the human body is ill.	增生性毛发囊肿 (Proliferative Trichilemmal Cyst)
Disease Characteristic	The characteristic of a disease that indicates the subtype or the cause of the disease.	增生性毛发囊肿 (Proliferative Trichilemmal Cyst)

3.2 Named Entity Recognition Module

The first module of our approach is a named entity recognition (NER) system that locates and identifies the axis words of all the input disease names. To build the NER system, we select 5,000 diseases from the ICD system [45] based on its taxonomy and ask doctors to annotate the labels (i.e., the three axis words) in BIO format [46]. We use the traditional “BiLSTM + CRF” as the NER model architecture. Specifically, there are three BiLSTM layers [47] with a hidden dimension of 100, a fully connected layer, and a CRF layer [48]. The model achieves a micro F1 score of 0.794 in our final evaluation.
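To make the BIO annotation concrete, the following minimal sketch decodes character-level BIO tags into axis-word spans. The tag names (`CHAR`, `REGION`, `CENTER`) and the tagging of the example name are illustrative assumptions, not the annotation scheme or the labels produced by the doctors.

```python
from typing import List, Optional, Tuple


def decode_bio(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Turn character-level BIO tags into (axis_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_text = ""
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new span begins; close the one that was open, if any.
            if current_type is not None:
                spans.append((current_type, current_text))
            current_type, current_text = tag[2:], ch
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_text += ch
        else:
            # "O" or an inconsistent I- tag closes the open span.
            if current_type is not None:
                spans.append((current_type, current_text))
            current_type, current_text = None, ""
    if current_type is not None:
        spans.append((current_type, current_text))
    return spans


# Hypothetical tagging of 急性喉炎 (Acute Laryngitis).
chars = list("急性喉炎")
tags = ["B-CHAR", "I-CHAR", "B-REGION", "B-CENTER"]
spans = decode_bio(chars, tags)
```

In a pipeline of this kind, the decoded spans would be what the data augmentation module consumes; the BiLSTM + CRF model itself is not reproduced here.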

3.3 Data Augmentation Module

The data augmentation module consists of four data augmentation methods, which are divided into two main categories: Axis-word Replacement (AR) and Multi-Granularity Aggregation (MGA). The main purpose of our data augmentation methods is to provide the model with additional knowledge, so we focus on exploring the components and relationships within diseases to give the model a comprehensive understanding of the various components and the hierarchical classification system of disease names. Figure 5 illustrates the two categories and four types of data augmentation methods (footnote 1: We will open source the augmentation code and the augmented result on GitHub). We also present pseudo-code for all four data augmentation methods in Appendix D.

3.3.1 Axis-word Replacement (AR)

The Axis-word Replacement method is designed based on the assumption that disease names exhibit the Structural Invariance property. This means that replacing an axis word in a disease name with another word of the same type still results in a meaningful disease name. Since an unnormalized disease name and its standard disease name often share axis words in the disease name normalization task, simultaneously replacing the same axis word in both the unnormalized name and the standard name can typically ensure that the newly generated pair will still match. We leverage both the ICD and task data (data from the disease name normalization training set) to perform Axis-word Replacement. The detailed descriptions of each category of Axis-word Replacement are as follows:

• AR1: AR1 is illustrated in the top left corner of Figure 5. First, we select a pair of diseases (disease A and disease B) that share one or more axis words (axis1 in the figure) but differ in another axis word (axis2 in the figure). Then, we replace axis2 in disease A with the axis2 of disease B. (Note: disease A can be chosen from any source, but disease B can only be chosen from the standard ICD system, as it serves as the label of a disease name normalization pair.) Algorithm 1 provides the pseudo-code for AR1. We can choose any of the three axis words to perform the replacement:
  – AR1 - Region: Perform AR1 by fixing the disease center and replacing the anatomical region.
  – AR1 - Center: Perform AR1 by fixing the anatomical region and replacing the disease center.
  – AR1 - Characteristic: Perform AR1 by fixing both the disease center and the anatomical region and replacing the disease characteristic.
• AR2: AR2 is illustrated in the top right corner of Figure 5. First, we select an unnormalized-standard disease pair from the disease name normalization training set. Let the unnormalized disease be disease A and the standard disease be disease B. Then, we find a disease C from the ICD system that shares one or more axis words (axis1 in the figure) with them but differs in another axis word (axis2). Finally, we replace axis2 in disease A with the axis2 of disease C, so that the replaced disease A and disease C form a new disease name normalization pair. Algorithm 2 provides the pseudo-code for AR2. Similarly, we can choose any of the three axis words to perform the replacement:
  – AR2 - Region: Perform AR2 by fixing the disease center and replacing the anatomical region.
  – AR2 - Center: Perform AR2 by fixing the anatomical region and replacing the disease center.
  – AR2 - Characteristic: Perform AR2 by fixing both the disease center and the anatomical region and replacing the disease characteristic.
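The replacement step shared by AR1 and AR2 can be sketched as follows. This is a minimal illustration and not the authors' Algorithm 1 or 2: the `Disease` container, the `replace_axis` helper, the axis keys, and the unnormalized name 左髂总动脉夹层 are all hypothetical, and the axis words are assumed to have already been extracted by the NER module.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Disease:
    name: str
    axes: Dict[str, str]  # axis type -> axis word, e.g. {"region": "髂"}


def replace_axis(source: Disease, target: Disease,
                 fixed: str, swapped: str) -> Optional[Tuple[str, str]]:
    """Return (augmented unnormalized name, standard name), or None if not applicable.

    `target` must come from the standard ICD system because its name is the label.
    """
    shared = source.axes.get(fixed)
    if shared is None or shared != target.axes.get(fixed):
        return None  # the two names must share the fixed axis word
    old, new = source.axes.get(swapped), target.axes.get(swapped)
    if not old or not new or old == new:
        return None  # nothing to replace
    return source.name.replace(old, new, 1), target.name


# Hypothetical "AR2 - Region" example: the unnormalized name comes from a training
# pair, and the ICD entry differs from it only in the anatomical region.
unnormalized = Disease("左髂总动脉夹层", {"region": "髂", "center": "夹层"})
icd_entry = Disease("颈总动脉夹层", {"region": "颈", "center": "夹层"})
new_pair = replace_axis(unnormalized, icd_entry, fixed="center", swapped="region")
```

The same helper covers the Center and Characteristic variants by changing which axis is fixed and which is swapped; pairs produced this way would still need to pass the semantic filtering module.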

3.3.2 Multi-Granularity Aggregation (MGA)

The Multi-Granularity Aggregation (MGA) method is designed based on the hierarchical structure of the ICD system, whose granularity levels are organized by the length of the ICD codes. For example, in ICD-10 Beijing Clinical Version 601, the disease name of the four-digit code “A18.2” is “外周结核性淋巴结炎 (Peripheral Tuberculous Lymphadenitis)”, and it has 10 child diseases in total with finer-grained descriptions, with six-digit codes ranging from “A18.201” to “A18.210”, such as “A18.201: 腹股沟淋巴结结核 (Inguinal lymph node tuberculosis)” and “A18.202: 颌下淋巴结结核 (Submandibular lymph node tuberculosis)”. This shows that the ICD system exhibits a tree-like structure, where a coarsely defined disease can be associated with multiple fine-grained diseases.

The granularity levels of the hierarchical structure include the first 3, 4, and 6-digit codes. For example, “K81.0: 急性胆囊炎 (Acute Cholecystitis)” and “K81.1: 慢性胆囊炎 (Chronic Cholecystitis)” share the first 3-digit code but differ in the 4th digit. As a result, they are both from the Cholecystitis category but differ in the type of the disease. We observe that the meanings of disease names that share the first 3-digit code but differ in the 4th digit can be quite distinct, whereas the meanings are much more similar if the disease names share the first 4-digit code. Therefore, we implement MGA augmentation using the following methods:

• MGA - Code: We pair each 6-digit disease name with its corresponding 4-digit disease name, which serves as its label. We refer to this method as “aggregation” because a 4-digit disease name can typically be linked to several 6-digit disease names, allowing the model to learn which diseases are similar. MGA - Code is depicted in the bottom left part of Figure 5. Algorithm 3 provides the pseudo-code for MGA - Code. We can perform aggregation using disease names from different sources:
  – MGA - Code 1: The 6-digit diseases are obtained from the ICD system.
  – MGA - Code 2: The 6-digit diseases are obtained from the diseases in the task training set with 6-digit ICD disease labels.
• MGA - Region: In addition to the ICD system, anatomical regions also exhibit a tree-like hierarchical structure, where smaller regions can be grouped together to form a larger region. We identify disease names that share the same center but where the region of one disease is the larger region of another. We then label each smaller-region disease name with its corresponding larger-region disease name. The MGA - Region method is depicted in the bottom right part of Figure 5. In this method, the larger-region disease names, serving as the standard names, must be sourced from the standard ICD system. Algorithm 4 provides the pseudo-code for MGA - Region. Similarly, we can perform aggregation using disease names from different sources:
  – MGA - Region 1: The smaller-region disease names are obtained from the ICD system.
  – MGA - Region 2: The smaller-region disease names are obtained from the names in the task training set.
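A minimal sketch of MGA - Code 1 is given below. It assumes that ICD entries are available as a plain code-to-name dictionary and that 6-digit codes have the form `A18.201`, so the 4-digit parent is the first five characters of the code string; the function names are hypothetical and this is not the authors' Algorithm 3.

```python
from typing import Dict, List, Tuple


def parent_code(code: str) -> str:
    """Map a 6-digit code such as 'A18.201' to its 4-digit parent 'A18.2'."""
    return code[:5]


def mga_code(icd: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair every 6-digit disease name with the name of its 4-digit parent."""
    pairs: List[Tuple[str, str]] = []
    for code, name in icd.items():
        if len(code) == 7:  # letter + two digits + '.' + three digits
            parent_name = icd.get(parent_code(code))
            if parent_name is not None:
                pairs.append((name, parent_name))
    return pairs


# Entries quoted in the text above (ICD-10 Beijing Clinical Version 601).
icd = {
    "A18.2": "外周结核性淋巴结炎",
    "A18.201": "腹股沟淋巴结结核",
    "A18.202": "颌下淋巴结结核",
}
pairs = mga_code(icd)
```

MGA - Code 2 would follow the same pattern, except that the fine-grained names come from training pairs whose labels carry 6-digit codes.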

Remark 1.

In the human body, a region is considered the larger-region in relation to another if it covers a larger area. To determine the larger or smaller regions of a region, we create an expert-annotated region tree document that organizes anatomical regions into a tree data structure. This region tree is used to identify upper and lower relations. Similar results can be obtained using other sources containing knowledge bases of human anatomy.
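The role of the region tree in MGA - Region can be sketched as follows. The tree fragment, the `Entry` type, and the standard name 淋巴结结核 are hypothetical placeholders chosen for illustration; the expert-annotated region tree itself is not reproduced here, and this is not the authors' Algorithm 4.

```python
from typing import Dict, List, NamedTuple, Tuple


class Entry(NamedTuple):
    name: str
    center: str
    region: str


def ancestors(region: str, parent_of: Dict[str, str]) -> List[str]:
    """Walk up the region tree and collect every larger region."""
    chain: List[str] = []
    while region in parent_of and parent_of[region] not in chain:
        region = parent_of[region]
        chain.append(region)
    return chain


def mga_region(fine: List[Entry], standard: List[Entry],
               parent_of: Dict[str, str]) -> List[Tuple[str, str]]:
    """Label each smaller-region name with ICD names sharing its center and a larger region."""
    pairs: List[Tuple[str, str]] = []
    for f in fine:
        larger = set(ancestors(f.region, parent_of))
        for s in standard:
            if s.center == f.center and s.region in larger:
                pairs.append((f.name, s.name))
    return pairs


# Hypothetical region tree fragment: child region -> parent region.
region_parent = {"颌下淋巴结": "淋巴结", "腹股沟淋巴结": "淋巴结"}
fine_names = [Entry("颌下淋巴结结核", "结核", "颌下淋巴结")]
standard_names = [Entry("淋巴结结核", "结核", "淋巴结")]  # assumed ICD entry
region_pairs = mga_region(fine_names, standard_names, region_parent)
```

Storing the tree as a child-to-parent map keeps the upward lookup simple; any other anatomy knowledge base exposing the same relation could be substituted.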

Remark 2.

Before performing the data augmentation methods, we exclude all the diseases whose ICD code starts with P, Q, or any letter after T, because those diseases are mainly related to pregnancy, giving birth, and long description texts, which we found do not follow the assumptions made above. We then perform the four data augmentation methods on the remaining disease names to form the augmented dataset.

3.4 Semantic Filtering Module

We design a filtering module to remove generated disease pairs with low confidence. As discussed in previous sections, we perform data augmentation by replacing the axis words within a disease name (AR) and manipulating matching relationships by aggregation (MGA). However, although our methods follow the nature and characteristics of the disease names, replacing an axis word of an unnormalized disease with another does not always result in an authentic disease, and the aggregation operation is not always accurate. Therefore, to ensure the quality of the generated data, we propose a semantic filtering module as a post-processing step. Our filtering method rests on the assumption that an unnormalized name should not deviate too much from its standard name; if it does, the generated disease name or the matching relationship has a high probability of being inauthentic.

We measure the level of deviation or similarity in the semantic filtering module based on two criteria. The first part is a normalized $n$-gram matching (ngm) score between an unnormalized disease name (UDN) and a standard disease name (SDN):

$$\text{ngm}(UDN,SDN)=\frac{\sum_{n=1}^{\min(j,k)}\left|n\text{-gram}(UDN)\cap n\text{-gram}(SDN)\right|}{\min(j,k)}, \tag{1}$$

where $j$ and $k$ are the lengths of UDN and SDN, respectively. Specifically, for each pair, we generate $n$-grams for $n$ from 1 to the length of the shorter name in the pair. We then count the matched $n$-grams and divide the count by the length of the shorter name. This score penalizes pairs that differ greatly in length and share few common characters. It measures similarity at the character level. The second part is a cosine similarity score between the contextual embeddings of UDN and SDN output by BERT [49], i.e.,

$$\text{similarity}(UDN,SDN)=\text{cosine}(\text{BERT}(UDN),\text{BERT}(SDN)). \tag{2}$$

It measures similarity at the contextual semantic level. The final dataset is derived by filtering out generated data pairs that fall below the threshold of either the normalized $n$-gram score or the cosine similarity score,

$$\text{Final Dataset}=\{(UDN,SDN)\in\text{GeneratedPairs}\mid\text{ngm}(UDN,SDN)>\alpha\land\text{similarity}(UDN,SDN)>\beta\}, \tag{3}$$

where $\alpha$ and $\beta$ are the thresholds for the normalized $n$-gram score and the cosine similarity score, respectively. In this work, we set $\alpha$ and $\beta$ to 0.7 and 0.8, respectively. The final number of paired disease names generated by each data augmentation method is shown in Appendix A.
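The filtering rule in Eqs. (1) to (3) can be sketched as follows. Two assumptions are made and should be read as such: the $n$-gram intersection is taken over sets of distinct character $n$-grams, and the embedding function is passed in as a parameter, with a toy bag-of-characters embedding standing in for BERT so that the example runs without any model. All function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Set, Tuple


def char_ngrams(text: str, n: int) -> Set[str]:
    """Distinct character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngm(udn: str, sdn: str) -> float:
    """Eq. (1): matched n-grams for n = 1..min(j, k), divided by min(j, k)."""
    shorter = min(len(udn), len(sdn))
    if shorter == 0:
        return 0.0
    matched = sum(len(char_ngrams(udn, n) & char_ngrams(sdn, n))
                  for n in range(1, shorter + 1))
    return matched / shorter


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_filter(pairs: List[Tuple[str, str]],
                    embed: Callable[[str], Sequence[float]],
                    alpha: float = 0.7, beta: float = 0.8) -> List[Tuple[str, str]]:
    """Eq. (3): keep a pair only if both scores exceed their thresholds."""
    return [(u, s) for u, s in pairs
            if ngm(u, s) > alpha and cosine(embed(u), embed(s)) > beta]


def toy_embed(text: str, vocab: str = "左颈总动脉夹层肺炎") -> List[float]:
    """Stand-in for BERT: character counts over a tiny fixed vocabulary."""
    return [float(text.count(ch)) for ch in vocab]


generated = [("左颈总动脉夹层", "颈总动脉夹层"), ("肺炎", "颈总动脉夹层")]
kept = semantic_filter(generated, toy_embed)
```

Under this reading of Eq. (1), the score is not bounded above by 1: two identical three-character names share 3 + 2 + 1 = 6 $n$-grams and score 6 / 3 = 2. A real implementation would replace `toy_embed` with sentence embeddings from the BERT model used in the paper.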

3.5 Training Paradigm

We train the models in a two-step fashion: the augmented data is used in the pre-training phase, and the original task data is then used in the fine-tuning phase. The reason is that, although the semantic filtering module helps eliminate fictitious disease names produced by the data augmentation module, it does not ensure that the remaining generated names are all genuine, and these fictitious disease names can negatively impact the overall task performance. Since our primary objective is to use a large volume of data to equip the model with extensive knowledge, we leverage the generated disease pairs in the pre-training stage. After that, we fine-tune the models with the original task data to get the final results. Because we evaluate our approach on several baselines with different training objectives, we make the pre-training objective exactly the same as the fine-tuning objective for each baseline method.

4 Experiments

In this section, we conduct experiments to answer the following four research questions (RQs).

• RQ1: How does the proposed approach compare in effectiveness to different data augmentation baselines?
• RQ2: What is the individual contribution of each component in the proposed approach to the final outcome?
• RQ3: Considering our focus on tackling data scarcity, will the proposed approach demonstrate greater effectiveness on smaller datasets?
• RQ4: In the age of Large Language Models (LLMs), how does our proposed approach perform in comparison to LLM baselines?

4.1 Dataset

4.1.1 CHIP-CDN

We evaluate the effectiveness of our data augmentation approach on a Chinese disease name normalization dataset called CHIP-CDN. CHIP-CDN originates from the CHIP-2019 competition and was collected in CBLUE (A Chinese Biomedical Language Understanding Evaluation Benchmark) [10]. The dataset contains 6,000 unnormalized-standard disease pairs in the training set, 1,000 pairs in the validation set, and 2,000 pairs in the test set. In this dataset, the data pairs are not strictly a one-to-one mapping: some unnormalized names are matched to several different standard names.

4.1.2 Other Related Datasets

We also looked for English datasets on which to perform the experiments. In previous sections, we mentioned the inconsistency of the disease name normalization concept across the literature. Most works use two main datasets to perform the task, namely NCBI Disease Corpus [11] and BioCreative-V-CDR-Corpus [12]. Both contain a certain number of PubMed abstracts written in English, and the task is to identify the disease concepts within the texts. However, as mentioned earlier, this task falls under Biomedical Entity Linking rather than Disease Name Normalization, so we are not able to utilize them. Table 2 summarizes the abovementioned datasets.

Table 2: Summary of different disease name normalization datasets, where “Train/Val/Test” is the dataset split and “NoDC” represents “Number of Disease Concepts”.

Dataset Name	Language	Train/Val/Test	NoDC	Source
CHIP-CDN	Chinese	6,000/2,000/10,000	10,325	Electronic Medical Records
NCBI Disease Corpus	English	593/100/100	790	PubMed Abstracts
BioCreative-V-CDR-Corpus	English	500/500/500	1,082	PubMed Articles

4.2 Experimental Setup

We assess our approach using four baseline models: BiLSTM [47], BERT-base [49], CDN-Baseline (from CBLUE) [10], and Bi-hardNCE [1]. For the BiLSTM model, we employ two BiLSTM layers with a hidden dimension of 256, followed by an MLP layer for classification. For the BERT-base model, we utilize the CLS vector [49] within the BERT architecture for classification. CDN-Baseline is the baseline method presented in the CBLUE paper [10], which introduces the CHIP-CDN dataset. It is based on the BERT-base model and follows a “recall-match” approach: all relevant standard disease names for an unnormalized disease are recalled first, and the unnormalized disease is then matched against them to reach the final decision. Bi-hardNCE is a contrastive learning-based method that has demonstrated effectiveness in symptom detection tasks and is also based on the BERT-base model. It treats disease name normalization as a retrieval problem. The selection of these baseline models aims to showcase the effectiveness of our approach across different model types and training objectives. Specifically, we validate the effectiveness of DDA on both non-pretrained (BiLSTM) and pre-trained models (the other three), and on models with simple (BERT-base) and complex (CDN-Baseline and Bi-hardNCE) pre-training objectives.

We report all the metrics on the validation set. For the BiLSTM and BERT-base models, we assess performance using accuracy. For these two models, we treat disease name normalization as a multi-class rather than a multi-label classification task. Therefore, if an unnormalized disease is matched to several standard diseases, we consider the data sample correctly predicted as long as one of the standard diseases is correctly predicted. We design the experiments in this way to simplify the model as much as possible and to illustrate the effectiveness of DDA more clearly. For CDN-Baseline, we adhere to the settings in CBLUE [10], which uses F1 as the evaluation metric tailored to the multi-label setting. F1 is calculated from precision (P) and recall (R), which are defined over the number of “unnormalized-standard” disease pairs (if an unnormalized disease name is matched to three standard disease names, we count three disease pairs):

$$P=\frac{k}{n},\quad R=\frac{k}{m},\quad F1=\frac{2\times P\times R}{P+R}, \tag{4}$$

where $m$ is the total number of data pairs in the evaluation dataset, $n$ is the number of predicted pairs, and $k$ is the number of correctly predicted pairs. As for Bi-hardNCE, it is structured as a retrieval problem, so we report the RECALL@5 and NDCG@5 metrics following the original paper [1].
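The two evaluation rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `pair_f1` and `lenient_accuracy`, the set-of-pairs representation, and the toy labels are hypothetical and are not taken from the CBLUE evaluation code.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (unnormalized name, standard name)


def pair_f1(gold: Set[Pair], pred: Set[Pair]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over disease pairs, following Eq. (4)."""
    k = len(gold & pred)  # correctly predicted pairs
    n = len(pred)         # predicted pairs
    m = len(gold)         # pairs in the evaluation dataset
    p = k / n if n else 0.0
    r = k / m if m else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


def lenient_accuracy(gold: Dict[str, Set[str]], pred: Dict[str, str]) -> float:
    """Multi-class accuracy: a sample counts as correct when the single
    predicted standard name is any one of its gold standard names."""
    if not gold:
        return 0.0
    hits = sum(1 for u, names in gold.items() if pred.get(u) in names)
    return hits / len(gold)


# Toy data (made up): u1 has two gold standard names, u2 has one.
gold_pairs = {("u1", "A"), ("u1", "B"), ("u2", "C")}
pred_pairs = {("u1", "A"), ("u2", "D")}
p, r, f1 = pair_f1(gold_pairs, pred_pairs)

gold_map = {"u1": {"A", "B"}, "u2": {"C"}}
acc = lenient_accuracy(gold_map, {"u1": "B", "u2": "D"})
```

In this toy case $k=1$, $n=2$ and $m=3$, so P is 0.5, R is 1/3 and F1 is 0.4, while the lenient accuracy is 0.5 because only `u1` receives one of its gold names.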

4.3 Comparison with Different Data Augmentation Approaches (RQ1)

We evaluate the effectiveness of our data augmentation approach by comparing it to two baseline approaches: EDA [29] and Back Translation (BT) [30], for which we use the Youdao translation tool at https://fanyi.youdao.com/. We select EDA and BT as our benchmarks because they are commonly employed in various studies and represent the two primary categories of NLP data augmentation approaches, noise-based and paraphrase-based, as outlined in [36]. Furthermore, the decision is supported by the work of [40], who also used EDA and Back Translation as baseline approaches for their medical data augmentation approach.

Table 3: Comparison for the choice of different data augmentation approaches across multiple baseline models using the CHIP-CDN dataset.

DA Approaches	BiLSTM	BERT-base	CDN-Baseline	Bi-hardNCE	Bi-hardNCE
(Metric)	(Acc)	(Acc)	(F1)	(RECALL@5)	(NDCG@5)
None	0.455	0.558	0.554	0.857	0.816
EDA	0.451	0.519	0.561	0.795	0.798
BT	0.466	0.556	0.578	0.845	0.828
DDA (ours)	0.518	0.579	0.592	0.866	0.840

As shown in Table 3, both EDA and back-translation have a detrimental impact on performance in certain scenarios (especially EDA), but DDA enhances performance across all scenarios. An intuitive explanation of this phenomenon is that general data augmentation methods have the potential to alter the meaning of disease names significantly. For example, if random deletion [29] is applied to “阻塞性睡眠呼吸暂停 (Obstructive Sleep Apnoea)”, it can result in “阻塞性睡眠 (Obstructive Sleep)”, which represents a completely different disease. As a result, the matching relationship between the unnormalized and standard disease names is lost. In contrast, our data augmentation method maintains this matching relationship, leading to enhanced performance.
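To make the failure mode concrete, the following is a minimal sketch of random deletion in the spirit of EDA [29]. It is not the EDA reference implementation: applying the operation at the character level to a Chinese disease name, the deletion probability, and the function name `random_deletion` are all assumptions made for illustration.

```python
import random
from typing import List


def random_deletion(tokens: List[str], p: float, rng: random.Random) -> List[str]:
    """Drop each token independently with probability p.

    Assumption: tokens are single Chinese characters here, whereas EDA
    was originally described over words.
    """
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() > p]
    # Never return an empty name: fall back to one random token.
    return kept if kept else [rng.choice(tokens)]


name = list("阻塞性睡眠呼吸暂停")
rng = random.Random(0)  # fixed seed so the sketch is repeatable
variants = {"".join(random_deletion(name, 0.3, rng)) for _ in range(5)}
print(variants)
```

Nothing in this procedure knows which characters carry the disease meaning, so any variant that loses “呼吸暂停” would still be paired with the original standard name.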

We notice that the performance improvement is more pronounced in the BiLSTM model compared to the BERT-based models. This could be attributed to the fact that the pre-trained language models already contain some similar knowledge, but our proposed approach can further enhance their performance, demonstrating the effectiveness of DDA. Furthermore, the consistent performance improvement across all scenarios indicates that DDA is well-suited for the task and can serve as a plug-and-play module, offering benefits to various baseline models with different training objectives.

4.4 Ablation Study (RQ2)

We conduct further assessments to illustrate the effectiveness of each category of data augmentation method on all the baseline models. We assess their impact by removing each type of method in turn and observing the resulting performance. Similarly, we assess the impact of the semantic filtering module by removing the filtering rules one at a time. As shown in Table 4, removing the data generated by either type of method leads to a decline in performance, as does removing either of the filtering methods. This shows that all the data augmentation and filtering methods contribute to the final performance.

Table 4: Ablation study for the DDA approach. We remove our proposed data augmentation and semantic filtering methods one by one and evaluate the results.

Settings	BiLSTM	BERT-base	CDN-Baseline	Bi-hardNCE	Bi-hardNCE
(Metric)	(Acc)	(Acc)	(F1)	(RECALL@5)	(NDCG@5)
DDA (full)	0.518	0.579	0.592	0.866	0.840
- AR	0.487	0.568	0.588	0.861	0.833
- MGA	0.455	0.558	0.554	0.857	0.816
- ngm	0.505	0.572	0.581	0.858	0.830
- similarity	0.485	0.560	0.574	0.857	0.826

4.5 Smaller Datasets Experiments (RQ3)

We are particularly interested in evaluating the performance improvements over smaller datasets derived from CHIP-CDN since the data scarcity problem is more severe in smaller datasets, and it can further validate our assumption that our approach gives the model comprehensive knowledge about the disease names and the classification system. Therefore, we conduct experiments to evaluate the scenario in which the training set size is restricted (from 5% to 100% of the original training set size). For the convenience of training, we only leverage standard disease names in the ICD system during data augmentation. No data from the disease name normalization training set is used.

Table 5: Comparison between the performance of zero-shot inference and full fine-tuning over various baseline models.

Settings	BiLSTM	BERT-base	CDN-Baseline	Bi-hardNCE	Bi-hardNCE
(Metric)	(Acc)	(Acc)	(F1)	(RECALL@5)	(NDCG@5)
DDA (full)	0.518	0.579	0.592	0.866	0.840
Zero-Shot	0.034	0.073	0.113	0.672	0.670

As depicted in Figure 6, the performance gap between training with and without our data augmentation is significantly larger when less training data is used. As the size of the training set increases, both curves rise steadily. We further perform zero-shot inference for all four baseline models, where the inference is conducted without fine-tuning the models. The comparison between zero-shot inference and full fine-tuning is shown in Table 5. It is particularly evident that Bi-hardNCE is able to recover nearly 80% of the full performance for RECALL@5 and NDCG@5 in the zero-shot setting. All the results above show that the model has learned general knowledge about the disease name system as expected.
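The recovery figure quoted above can be checked directly from Table 5. The short Python sketch below divides each zero-shot score by the corresponding full fine-tuning score; the dictionary layout and variable names are illustrative only, while the numbers are copied from the table.

```python
# Scores copied from Table 5 (full fine-tuning with DDA vs. zero-shot).
full = {
    "BiLSTM (Acc)": 0.518,
    "BERT-base (Acc)": 0.579,
    "CDN-Baseline (F1)": 0.592,
    "Bi-hardNCE (RECALL@5)": 0.866,
    "Bi-hardNCE (NDCG@5)": 0.840,
}
zero_shot = {
    "BiLSTM (Acc)": 0.034,
    "BERT-base (Acc)": 0.073,
    "CDN-Baseline (F1)": 0.113,
    "Bi-hardNCE (RECALL@5)": 0.672,
    "Bi-hardNCE (NDCG@5)": 0.670,
}

# Fraction of the fully fine-tuned score that is already reached zero-shot.
recovered = {name: zero_shot[name] / full[name] for name in full}
for name, ratio in recovered.items():
    print(f"{name}: {ratio:.1%}")
```

With these values the two Bi-hardNCE ratios come out at roughly 78% and 80%, whereas the three classification-style settings each stay below 20%.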

4.6 Comparisons with LLM Baselines (RQ4)

In the field of Natural Language Processing, Large Language Models (LLMs) have been developing very fast. LLMs can solve various problems by generating natural language and have demonstrated the ability to perform incredibly well on a wide range of tasks. This raises a crucial question: Can the existing approach for the disease name normalization task be replaced by LLMs? To answer this question, we compare the task performance between LLM-based approaches and our proposed approach.

We compare CDN-Baseline with DDA against three LLM baselines: ChatGPT, GPT-4, and ChatGLM. We use the results reported in [13] for the LLM baselines. They also address the task in the “recall-match” two-step fashion, where BM25 [50] is used to recall all the relevant disease names, and the LLMs are then prompted to select (match) the final answer from the retrieved names. They likely use such a pipeline because it is hard to make LLMs directly perform classification over a large label space (in this case, 40,474). We compare them with CDN-Baseline with DDA because they share not only the pipeline but also the F1 evaluation metric.

We visualize the results by plotting task performance against model size in Figure 7. The sizes of ChatGPT and GPT-4 are estimated based on online sources [51, 52]. The results demonstrate that our approach has the best tradeoff between model performance and model size. Specifically, our proposed approach achieves performance on par with ChatGPT, despite our model being over 3,000 times smaller. Our approach also significantly outperforms a model over 50 times larger in size (ChatGLM 6B vs. CDN-Baseline with DDA 110M).

5 Conclusion

In this work, we investigate the critical task of disease name normalization in Chinese, a process essential for intelligent healthcare applications such as consultation, auxiliary diagnosis, and ICD coding. We identify the primary challenge of this task to be the scarcity of labeled data for model training. To address the issue, we introduce a novel data augmentation approach comprising two categories of data augmentation methods and some supporting modules. Our data augmentation methods involve Axis-word Replacement (AR) and Multi-Granularity Aggregation (MGA), which generate new training pairs by manipulating key elements of disease names and aggregating based on the hierarchical nature of disease classifications in the ICD system. We demonstrate through experiments that, unlike general text augmentation approaches, our approach significantly enhances performance across various baseline models for the Chinese disease name normalization task. It also achieves the best tradeoff between performance and model size when compared with LLM baselines. Our findings suggest that our data augmentation approach can serve as a robust tool to mitigate data scarcity of the disease name normalization task.

6 Limitation and Future Work

While the proposed approach has shown effectiveness, it still has some limitations. Firstly, there is no guarantee that the generated disease names are authentic, which could introduce biases to the model due to misinformation. We attempt to address this issue by utilizing the pretrain-finetune paradigm, but this approach does not fully resolve the problem. Secondly, although we believe it is possible, it remains unclear whether the approach can be effectively applied to English disease names. We have observed that in English disease names, a single word may represent multiple types of axis words, making it challenging to adapt the concept in English. An example is that “Pylephlebitis” incorporates not only the meaning of portal vein but also the meaning of inflammation. Furthermore, conducting such experiments requires a high-quality disease name normalization dataset in English.

In this work, we have demonstrated the effectiveness of our DDA approach. However, we have not conducted a theoretical analysis to elucidate the underlying mechanisms that contribute to its effectiveness. Therefore, our future research will focus on exploring these mechanisms. Additionally, to further prevent the injection of misinformation, we plan to develop loss function terms in our future work that will enable the effective selection of more valuable data from the results of the data augmentation module.

References

[1] S. Zhang, J. Sun, Y. Huang, X. Ding, Y. Zheng, Medical symptom detection in intelligent pre-consultation using bi-directional hard-negative noise contrastive estimation, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 4551–4559.
[2] F. Zhang, Z. Li, B. Zhang, H. Du, B. Wang, X. Zhang, Multi-modal deep learning model for auxiliary diagnosis of alzheimer’s disease, Neurocomputing 361 (2019) 185–195.
[3] K. Yu, L. Tan, L. Lin, X. Cheng, Z. Yi, T. Sato, Deep-learning-empowered breast cancer auxiliary diagnosis for 5gb remote e-health, IEEE Wireless Communications 28 (3) (2021) 54–61.
[4] F. Duarte, B. Martins, C. S. Pinto, M. J. Silva, Deep neural models for icd-10 coding of death certificates and autopsy reports in free-text, Journal of biomedical informatics 80 (2018) 64–77.
[5] Y. Yu, M. Li, L. Liu, Z. Fei, F.-X. Wu, J. Wang, Automatic icd code assignment of chinese clinical notes based on multilayer attention birnn, Journal of biomedical informatics 91 (2019) 103114.
[6] L. Liu, O. Perez-Concha, A. Nguyen, V. Bennett, L. Jorm, Hierarchical label-wise attention transformer model for explainable icd coding, Journal of biomedical informatics 133 (2022) 104161.
[7] H. Shin, J. Lee, Y. An, S. Cho, A scoring model to detect abusive medical institutions based on patient classification system: Diagnosis-related group and ambulatory patient group, Journal of Biomedical Informatics 117 (2021) 103752.
[8] M. M. Islam, G.-H. Li, T. N. Poly, Y.-C. Li, Deepdrg: Performance of artificial intelligence model for real-time prediction of diagnosis-related groups, in: Healthcare, Vol. 9, MDPI, 2021, p. 1632.
[9] W. E. Lowell, G. E. Davis, Predicting length of stay for psychiatric diagnosis-related groups using neural networks, Journal of the American Medical Informatics Association 1 (6) (1994) 459–466.
[10] N. Zhang, M. Chen, Z. Bi, X. Liang, L. Li, X. Shang, K. Yin, C. Tan, J. Xu, F. Huang, L. Si, Y. Ni, G. Xie, Z. Sui, B. Chang, H. Zong, Z. Yuan, L. Li, J. Yan, H. Zan, K. Zhang, B. Tang, Q. Chen, CBLUE: A Chinese biomedical language understanding evaluation benchmark, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 7888–7915.
[11] R. I. Doğan, R. Leaman, Z. Lu, Ncbi disease corpus: a resource for disease name recognition and concept normalization, Journal of biomedical informatics 47 (2014) 1–10.
[12] J. Li, Y. Sun, R. J. Johnson, D. Sciaky, C.-H. Wei, R. Leaman, A. P. Davis, C. J. Mattingly, T. C. Wiegers, Z. Lu, Biocreative v cdr task corpus: a resource for chemical disease relation extraction, Database 2016 (2016).
[13] W. Zhu, X. Wang, H. Zheng, M. Chen, B. Tang, Promptcblue: A chinese prompt tuning benchmark for the medical domain, arXiv preprint arXiv:2310.14151 (2023).
[14] Q. Liu, X. Wu, X. Zhao, Y. Zhu, D. Xu, F. Tian, Y. Zheng, Moelora: An moe-based parameter efficient fine-tuning method for multi-task medical applications, arXiv preprint arXiv:2310.18339 (2023).
[15] Y. Zhang, K. Zhong, G. Liu, A novel method for medical semantic word sense disambiguation by using graph neural network, in: 2023 9th International Symposium on System Security, Safety, and Reliability (ISSSR), IEEE, 2023, pp. 263–272.
[16] S. Jiang, H. Wu, L. Luo, Infusing biomedical knowledge into bert for chinese biomedical nlp tasks with adversarial training, in: 2022 3rd Asia Service Sciences and Software Engineering Conference, 2022, pp. 108–114.
[17] Q. Wang, Z. Gao, R. Xu, Exploring the in-context learning ability of large language model for biomedical concept linking, arXiv preprint arXiv:2307.01137 (2023).
[18] R. Leaman, R. Islamaj Doğan, Z. Lu, Dnorm: disease name normalization with pairwise learning to rank, Bioinformatics 29 (22) (2013) 2909–2917.
[19] Y. Lou, T. Qian, F. Li, J. Zhou, D. Ji, M. Cheng, Investigating of disease name normalization using neural network and pre-training, IEEE Access 8 (2020) 85729–85739.
[20] D. Pujary, C. Thorne, W. Aziz, Disease normalization with graph embeddings, in: Intelligent Systems and Applications: Proceedings of the 2020 Intelligent Systems Conference (IntelliSys) Volume 2, Springer, 2021, pp. 209–217.
[21] Z. Miftahutdinov, A. Kadurin, R. Kudrin, E. Tutubalina, Medical concept normalization in clinical trials with drug and disease representation learning, Bioinformatics 37 (21) (2021) 3856–3864.
[22] M. Sung, M. Jeong, Y. Choi, D. Kim, J. Lee, J. Kang, Bern2: an advanced neural biomedical named entity recognition and normalization tool, Bioinformatics 38 (20) (2022) 4837–4839.
[23] M. Sung, H. Jeon, J. Lee, J. Kang, Biomedical entity representations with synonym marginalization, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 3641–3650.
[24] H. Liu, Y. Xu, A deep learning way for disease name representation and normalization, in: Natural Language Processing and Chinese Computing: 6th CCF International Conference, NLPCC 2017, Dalian, China, November 8–12, 2017, Proceedings 6, Springer, 2018, pp. 151–157.
[25] D. Wright, NormCo: Deep disease normalization for biomedical knowledge base construction, University of California, San Diego, 2019.
[26] R. I. Dogan, Z. Lu, An inference method for disease name normalization, in: 2012 AAAI Fall Symposium Series, 2012.
[27] E. French, B. T. McInnes, An overview of biomedical entity linking throughout the years, Journal of biomedical informatics 137 (2023) 104252.
[28] N. Ng, K. Cho, M. Ghassemi, SSMBA: Self-supervised manifold based data augmentation for improving out-of-domain robustness, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 1268–1283.
[29] J. Wei, K. Zou, EDA: Easy data augmentation techniques for boosting performance on text classification tasks, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 6382–6388.
[30] R. Sennrich, B. Haddow, A. Birch, Improving neural machine translation models with monolingual data, in: K. Erk, N. A. Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 86–96.
[31] H. H. Kim, D. Woo, S. J. Oh, J.-W. Cha, Y.-S. Han, Alp: Data augmentation using lexicalized pcfgs for few-shot text classification, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 10894–10902.
[32] X. Wu, S. Lv, L. Zang, J. Han, S. Hu, Conditional bert contextual augmentation, in: International conference on computational science, Springer, 2019, pp. 84–95.
[33] V. Kumar, A. Choudhary, E. Cho, Data augmentation using pre-trained transformer models, in: W. M. Campbell, A. Waibel, D. Hakkani-Tur, T. J. Hazen, K. Kilgour, E. Cho, V. Kumar, H. Glaude (Eds.), Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, Association for Computational Linguistics, Suzhou, China, 2020, pp. 18–26.
[34] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, C. A. Raffel, Mixmatch: A holistic approach to semi-supervised learning, Advances in neural information processing systems 32 (2019).
[35] Q. Xie, Z. Dai, E. Hovy, T. Luong, Q. Le, Unsupervised data augmentation for consistency training, Advances in Neural Information Processing Systems 33 (2020) 6256–6268.
[36] B. Li, Y. Hou, W. Che, Data augmentation approaches in natural language processing: A survey, AI Open 3 (2022) 71–90.
[37] M. Falis, H. Dong, A. Birch, B. Alex, Horses to zebras: Ontology-guided data augmentation and synthesis for icd-9 coding, in: Proceedings of the 21st Workshop on Biomedical Language Processing, 2022, pp. 389–401.
[38] M. Abdollahi, X. Gao, Y. Mei, S. Ghosh, J. Li, M. Narag, Substituting clinical features using synthetic medical phrases: Medical text data augmentation techniques, Artificial Intelligence in Medicine 120 (2021) 102167.
[39] O. Bodenreider, The unified medical language system (umls): integrating biomedical terminology, Nucleic acids research 32 (suppl_1) (2004) D267–D270.
[40] G. Ansari, M. Garg, C. Saxena, Data augmentation for mental health classification on social media, in: S. Bandyopadhyay, S. L. Devi, P. Bhattacharyya (Eds.), Proceedings of the 18th International Conference on Natural Language Processing (ICON), NLP Association of India (NLPAI), National Institute of Technology Silchar, Silchar, India, 2021, pp. 152–161.
[41] Y. Wang, F. Liu, K. Verspoor, T. Baldwin, Evaluating the utility of model configurations and data augmentation on clinical semantic textual similarity, in: Proceedings of the 19th SIGBioMed workshop on biomedical language processing, 2020, pp. 105–111.
[42] Y. Wang, K. Verspoor, T. Baldwin, Learning from unlabelled data for clinical semantic textual similarity, in: Proceedings of the 3rd Clinical Natural Language Processing Workshop, 2020, pp. 227–233.
[43] W. K. Funkhouser, Pathology: the clinical description of human disease, in: Essential Concepts in Molecular Pathology, Elsevier, 2020, pp. 177–190.
[44] H. Ashraf, J. P. Colombo, V. Marcucci, J. Rhoton, O. Olowoyo, A clinical overview of acute and chronic pancreatitis: the medical and surgical management, Cureus 13 (11) (2021).
[45] World Health Organization, The ICD-10 classification of mental and behavioural disorders: clinical descriptions and diagnostic guidelines, Vol. 1, World Health Organization, 1992.
[46] L. A. Ramshaw, M. P. Marcus, Text chunking using transformation-based learning, in: Natural language processing using very large corpora, Springer, 1999, pp. 157–176.
[47] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8) (1997) 1735–1780.
[48] C. Sutton, A. McCallum, et al., An introduction to conditional random fields, Foundations and Trends in Machine Learning 4 (4) (2012) 267–373.
[49] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186.
[50] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: Bm25 and beyond, Foundations and Trends® in Information Retrieval 3 (4) (2009) 333–389.
[51] C. Emmanuel, GPT-3.5 and GPT-4 Comparison: — chudeemmanuel3, https://medium.com/@chudeemmanuel3/gpt-3-5-and-gpt-4-comparison-47d837de2226, [Accessed 04-02-2024].
[52] GPT 3 vs. GPT 4: What You Need to Know — getgenie.ai, https://getgenie.ai/gpt-3-vs-gpt-4/, [Accessed 04-02-2024].

Appendix A Data Augmentation Result Statistics

The numbers of data pairs generated by each data augmentation method in our proposed approach are as follows:

• AR1: 332,231
• AR2: 48,857
• MGA - Code: 32,145
• MGA - Region: 6,239

Appendix B Hyperparameter Settings

Table 6 shows our hyperparameter settings. For models that randomly initialize their parameters, such as BiLSTM [47], it is possible to set a large learning rate and a large number of iterations to ensure adequate training. However, for models that rely on a pre-trained model checkpoint such as BERT [49] as the backbone, we observe that setting a small learning rate and a small number of training iterations can lead to improved performance, likely because this mitigates catastrophic forgetting of the knowledge within the original checkpoint.

Table 6: Hyperparameter settings for all the baseline models

Model	Stage	Batch Size	Learning Rate	Epoch
BiLSTM	Pre-training	256	1e-3	100
BiLSTM	Fine-tuning	64	1e-3	100
BERT	Pre-training	256	1e-5	10
BERT	Fine-tuning	64	1e-4	100
CDN-Baseline	Pre-training	256	5e-6	1
CDN-Baseline	Fine-tuning	64	5e-5	3
Bi-hardNCE	Pre-training	16	3e-5	1
Bi-hardNCE	Fine-tuning	16	3e-5	10

Appendix C Examples for each data augmentation technique

Table 7 gives an example for every data augmentation technique.

Table 7: Examples of the data generated by each data augmentation technique. The “find” and “replace” operations correspond to the operation illustrated in Figure 5.

Technique	Example
AR1 - Region	Find: 踝关节骨折脱位 $\rightarrow$ 腰椎骨折
AR1 - Region	Replace: 腰椎骨折脱位 $\rightarrow$ 腰椎骨折
AR1 - Center	Find: 左内踝关节囊肿 $\rightarrow$ 踝关节骨折
AR1 - Center	Replace: 左内踝关节骨折 $\rightarrow$ 踝关节骨折
AR1 - Characteristic	Find: 重度慢性牙周炎 $\rightarrow$ 急性牙周炎
AR1 - Characteristic	Replace: 重度急性牙周炎 $\rightarrow$ 急性牙周炎
AR2 - Region	Find: 踝关节骨折脱位 $\rightarrow$ 踝关节骨折
AR2 - Region	Replace: 腰椎骨折脱位 $\rightarrow$ 腰椎骨折
AR2 - Center	Find: 左内踝关节囊肿 $\rightarrow$ 踝关节囊肿
AR2 - Center	Replace: 左内踝关节骨折 $\rightarrow$ 踝关节骨折
AR2 - Characteristic	Find: 重度慢性牙周炎 $\rightarrow$ 慢性牙周炎
AR2 - Characteristic	Replace: 重度急性牙周炎 $\rightarrow$ 急性牙周炎
MGA - Code 1	急性脑膜炎症 $\rightarrow$ 脑膜炎
MGA - Code 2	急性脑膜炎 $\rightarrow$ 脑膜炎
MGA - Region 1	副乳腺恶性肿瘤 $\rightarrow$ 乳腺恶性肿瘤
MGA - Region 2	右乳房乳腺恶性肿瘤 $\rightarrow$ 乳腺恶性肿瘤

Appendix D Pseudo-code

In this section, we present the pseudo-code for the four proposed data augmentation methods.

Table 8: Notations used in the algorithms. Note that “disease names” can represent either unnormalized disease names or standard disease names.

Descriptions	Notations
Axis word	$a1$ , $a2$ , $a3$ , etc.
List of axis words	$A1$ , $A2$ , $A3$ , etc.
Axis type - Disease Center	$dce$
Axis type - Anatomical Region	$al$
Axis type - Disease Characteristic	$dch$
Larger region	$lar$
List of shared axis words between two diseases	$SA$
List of differing axis words between two diseases	$DiA$
List of differing axis words in the first disease when comparing two diseases	$DiA1$
List of differing axis words in the second disease when comparing two diseases	$DiA2$
Unnormalized disease names (UDN)	$u1$ , $u2$ , $u3$ , etc.
Standard disease names (SDN)	$s1$ , $s2$ , $s3$ , etc.
Disease names (can be either a UDN or an SDN)	$d1$ , $d2$ , $d3$ , etc.

Algorithm 1 Axis-word Replacement 1 (AR1)

Input:
    training_set - List of disease pairs from the disease name normalization training set.
    ICD_list - The standard ICD system.
Output:
    augmented_pairs - List of augmented disease pairs.

procedure AR1(training_set, ICD_list)
    augmented_pairs ← []
    for each d1 ∈ (training_set ∪ ICD_list)
        A1 ← NER(d1)
        for each s1 ∈ ICD_list
            A2 ← NER(s1)
            SA, DiA1, DiA2 ← comparing_axis_words(A1, A2)
            if len(SA) ≠ 0 and len(DiA1) = len(DiA2) = 1 then
                d2 ← d1.replace_axis(DiA1[0], DiA2[0])
                augmented_pairs.append((d2, s1))
            end if
        end for
    end for
    return augmented_pairs
end procedure
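Algorithm 1 can be turned into a small runnable Python sketch as follows. Several parts are stand-ins rather than the paper's components: `toy_ner` replaces the NER model with a greedy lookup in a tiny hypothetical lexicon (`AXIS_WORDS`), `compare_axis_words` is a simple list comparison, plain string replacement plays the role of `replace_axis`, and the input is assumed to be a flat list of disease names.

```python
from typing import List, Set, Tuple

# Hypothetical axis-word lexicon; the paper uses an NER model instead.
AXIS_WORDS: Set[str] = {"踝关节", "腰椎", "骨折"}


def toy_ner(name: str) -> List[str]:
    """Greedy longest-match lookup of known axis words, left to right."""
    found, i = [], 0
    candidates = sorted(AXIS_WORDS, key=len, reverse=True)
    while i < len(name):
        match = next((w for w in candidates if name.startswith(w, i)), None)
        if match is None:
            i += 1  # character not covered by the toy lexicon
        else:
            found.append(match)
            i += len(match)
    return found


def compare_axis_words(a1: List[str], a2: List[str]):
    """Return (shared words, words only in a1, words only in a2)."""
    shared = [w for w in a1 if w in a2]
    only_1 = [w for w in a1 if w not in a2]
    only_2 = [w for w in a2 if w not in a1]
    return shared, only_1, only_2


def ar1(names: List[str], icd_list: List[str]) -> List[Tuple[str, str]]:
    augmented = []
    for d1 in names:
        a1 = toy_ner(d1)
        for s1 in icd_list:
            shared, di1, di2 = compare_axis_words(a1, toy_ner(s1))
            # At least one shared axis word and exactly one differing word each.
            if shared and len(di1) == len(di2) == 1:
                d2 = d1.replace(di1[0], di2[0])  # stand-in for replace_axis
                augmented.append((d2, s1))
    return augmented


print(ar1(["左踝关节骨折"], ["腰椎骨折", "踝关节骨折"]))
# → [('左腰椎骨折', '腰椎骨折')]
```

The second standard name produces no pair because it shares every axis word with the input, so there is nothing to replace.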

Algorithm 2 Axis-word Replacement 2 (AR2)

Input:
    training_set - List of disease pairs from the disease name normalization training set.
    ICD_list - The standard ICD system.
Output:
    augmented_pairs - List of augmented disease pairs.

procedure AR2(training_set, ICD_list)
    augmented_pairs ← []
    for each (u1, s1) ∈ training_set
        A1 ← NER(u1)
        A2 ← NER(s1)
        if A1 = A2 then
            for each s2 ∈ ICD_list
                A3 ← NER(s2)
                SA, DiA1, DiA2 ← comparing_axis_words(A2, A3)
                if len(A2) = len(A3) and len(DiA1) = len(DiA2) = 1 then
                    s3 ← s1.replace_axis(DiA1[0], DiA2[0])
                    u2 ← u1.replace_axis(DiA1[0], DiA2[0])
                    augmented_pairs.append((u2, s3))
                end if
            end for
        end if
    end for
    return augmented_pairs
end procedure
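A runnable Python sketch of Algorithm 2 is given below, using the characteristic-axis example from Table 7. As before, the NER model is replaced by a hypothetical lexicon lookup (`AXIS_WORDS`, `toy_ner`), and plain string replacement stands in for `replace_axis`; these are assumptions for illustration, not the components used in the experiments.

```python
from typing import List, Set, Tuple

# Hypothetical axis-word lexicon; the paper uses an NER model instead.
AXIS_WORDS: Set[str] = {"慢性", "急性", "牙周炎"}


def toy_ner(name: str) -> List[str]:
    """Greedy longest-match lookup of known axis words, left to right."""
    found, i = [], 0
    candidates = sorted(AXIS_WORDS, key=len, reverse=True)
    while i < len(name):
        match = next((w for w in candidates if name.startswith(w, i)), None)
        if match is None:
            i += 1  # character not covered by the toy lexicon
        else:
            found.append(match)
            i += len(match)
    return found


def compare_axis_words(a1: List[str], a2: List[str]):
    """Return (shared words, words only in a1, words only in a2)."""
    shared = [w for w in a1 if w in a2]
    return shared, [w for w in a1 if w not in a2], [w for w in a2 if w not in a1]


def ar2(training_set: List[Tuple[str, str]], icd_list: List[str]) -> List[Tuple[str, str]]:
    augmented = []
    for u1, s1 in training_set:
        a2 = toy_ner(s1)
        if toy_ner(u1) != a2:
            continue  # only pairs whose two names carry identical axis words
        for s2 in icd_list:
            a3 = toy_ner(s2)
            _, di1, di2 = compare_axis_words(a2, a3)
            if len(a2) == len(a3) and len(di1) == len(di2) == 1:
                # Apply the same replacement to both sides of the pair.
                s3 = s1.replace(di1[0], di2[0])
                u2 = u1.replace(di1[0], di2[0])
                augmented.append((u2, s3))
    return augmented


pairs = ar2([("重度慢性牙周炎", "慢性牙周炎")], ["急性牙周炎", "慢性牙周炎"])
print(pairs)
# → [('重度急性牙周炎', '急性牙周炎')]
```

Because the replacement is applied to the unnormalized and the standard name together, the generated pair keeps the same matching relationship as the original one.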

Algorithm 3 Multi-Granularity Aggregation - Code (MGA-Code)

Input:
    training_set - List of disease pairs from the disease name normalization training set.
    ICD_list - The standard ICD system.
Output:
    augmented_pairs - List of augmented disease pairs.

procedure MGA-Code(training_set, ICD_list)
    augmented_pairs ← []
    for each d1 ∈ (training_set ∪ ICD_list)
        if len(ICD-code(d1)) = 6 then
            code_6 = ICD-code(d1)
            code_4 = code_6[0:3]    // Extract the first four digits
            s1 = map_disease(code_4)
            augmented_pairs.append((d1, s1))
        end if
    end for
    return augmented_pairs
end procedure
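Algorithm 3 can be sketched in Python as shown below. The sketch makes several assumptions: each disease name is already associated with its code through a dictionary, codes are stored as plain strings without a dot, the index range in the pseudo-code is read as covering the first four characters (as its comment states), and names whose four-character parent is unknown are skipped. The codes in the example are made-up placeholders, not real ICD codes.

```python
from typing import Dict, List, Tuple


def mga_code(name_to_code: Dict[str, str],
             code_to_name: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair each 6-character-coded disease with its 4-character parent."""
    augmented = []
    for d1, code6 in name_to_code.items():
        if len(code6) != 6:
            continue
        code4 = code6[:4]               # first four characters of the code
        s1 = code_to_name.get(code4)    # plays the role of map_disease
        if s1 is not None:              # assumption: skip unknown parents
            augmented.append((d1, s1))
    return augmented


# Placeholder codes, invented for this example only.
name_to_code = {"急性脑膜炎": "X12345", "脑膜炎": "X123"}
code_to_name = {"X123": "脑膜炎"}

pairs = mga_code(name_to_code, code_to_name)
print(pairs)
# → [('急性脑膜炎', '脑膜炎')]
```

The resulting pair has the same shape as the “MGA - Code 2” row of Table 7, where a fine-grained name is aggregated to its coarser category.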

Algorithm 4 Multi-Granularity Aggregation - Region (MGA-Region)

Input:
    training_set - List of disease pairs from the disease name normalization training set.
    ICD_list - The standard ICD system.
Output:
    augmented_pairs - List of augmented disease pairs.

procedure MGA-Region(training_set, ICD_list)
    augmented_pairs ← []
    for each d1 ∈ (training_set ∪ ICD_list)
        A1 ← NER(d1)
        for each s1 ∈ ICD_list
            A2 ← NER(s1)
            SA, DiA1, DiA2 ← comparing_axis_words(A1, A2)
            if len(SA) ≥ 1 and len(DiA1) = len(DiA2) = 1 then
                if type(DiA1) = al and DiA2[0] ∈ DiA1[0].lar then
                    augmented_pairs.append((d1, s1))
                end if
            end if
        end for
    end for
    return augmented_pairs
end procedure
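The following Python sketch illustrates Algorithm 4 on the “MGA - Region 1” example from Table 7. The typed lexicon `AXIS_TYPE`, the `LARGER_REGIONS` map, and the lookup-based `toy_ner` are hypothetical stand-ins for the NER model and the anatomical knowledge used in the paper; the containment check is one reading of the condition on the larger region.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical typed lexicon: "al" = anatomical region, "dce" = disease center.
AXIS_TYPE: Dict[str, str] = {"副乳腺": "al", "乳腺": "al", "恶性肿瘤": "dce"}
# Hypothetical map from a region to the larger regions that contain it.
LARGER_REGIONS: Dict[str, Set[str]] = {"副乳腺": {"乳腺"}}


def toy_ner(name: str) -> List[str]:
    """Greedy longest-match lookup of known axis words, left to right."""
    found, i = [], 0
    candidates = sorted(AXIS_TYPE, key=len, reverse=True)
    while i < len(name):
        match = next((w for w in candidates if name.startswith(w, i)), None)
        if match is None:
            i += 1  # character not covered by the toy lexicon
        else:
            found.append(match)
            i += len(match)
    return found


def compare_axis_words(a1: List[str], a2: List[str]):
    """Return (shared words, words only in a1, words only in a2)."""
    shared = [w for w in a1 if w in a2]
    return shared, [w for w in a1 if w not in a2], [w for w in a2 if w not in a1]


def mga_region(names: List[str], icd_list: List[str]) -> List[Tuple[str, str]]:
    augmented = []
    for d1 in names:
        a1 = toy_ner(d1)
        for s1 in icd_list:
            shared, di1, di2 = compare_axis_words(a1, toy_ner(s1))
            if len(shared) >= 1 and len(di1) == len(di2) == 1:
                is_region = AXIS_TYPE[di1[0]] == "al"
                is_larger = di2[0] in LARGER_REGIONS.get(di1[0], set())
                if is_region and is_larger:
                    augmented.append((d1, s1))  # names are kept unchanged
    return augmented


pairs = mga_region(["副乳腺恶性肿瘤"], ["乳腺恶性肿瘤", "恶性肿瘤"])
print(pairs)
# → [('副乳腺恶性肿瘤', '乳腺恶性肿瘤')]
```

Unlike the two replacement methods, no new surface form is generated here: the name with the smaller region is simply linked to the standard name with the larger region, and the reverse direction yields no pair.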