License: arXiv.org perpetual non-exclusive license
arXiv:2310.00100v4 [cs.CL] 13 Jan 2024

Multilingual Natural Language Processing Model for Radiology Reports
The Summary is all you need!

Mariana Lindo$^{1,2}$, Ana Sofia Santos$^{1,2}$, André Ferreira$^{1,2}$, Jianning Li$^{2}$, Gijs Luijten$^{2,4}$, Gustavo Correia$^{1}$, Moon Kim$^{2}$, Benedikt Michael Schaarschmidt$^{3}$, Cornelius Deuschl$^{3}$, Johannes Haubold$^{2,3}$, Jens Kleesiek$^{2}$, Jan Egger$^{2}$, Victor Alves$^{1}$

$^{1}$ALGORITMI Research Centre/LASI, University of Minho, Braga, Portugal
$^{2}$Institute for Artificial Intelligence in Medicine, University Medicine Essen (AöR), Essen, Germany
$^{3}$Department of Diagnostic and Interventional Radiology and Neuroradiology, University Medicine Essen (AöR), Essen, Germany
$^{4}$Institute for Computer Graphics and Vision, Graz University of Technology, Graz, Austria

Abstract

The impression section of a radiology report summarizes important radiology findings and plays a critical role in communicating these findings to physicians. However, the preparation of these summaries is time-consuming and error-prone for radiologists. Recently, numerous models for radiology report summarization have been developed. Nevertheless, there is currently no model that can summarize these reports in multiple languages. Such a model could greatly improve future research and the development of Deep Learning models that incorporate data from patients with different ethnic backgrounds. In this study, the generation of radiology impressions in different languages was automated by fine-tuning a model, publicly available, based on a multilingual text-to-text Transformer to summarize findings available in English, Portuguese, and German radiology reports. In a blind test, two board-certified radiologists indicated that for at least 70% of the system-generated summaries, the quality matched or exceeded the corresponding human-written summaries, suggesting substantial clinical reliability. Furthermore, this study showed that the multilingual model outperformed other models that specialized in summarizing radiology reports in only one language, as well as models that were not specifically designed for summarizing radiology reports, such as ChatGPT.

Keywords: Multilingual, Natural Language Processing, Radiology Report, Summarization

 

 

1 Introduction

Medical imaging techniques like X-rays and Computed Tomography (CT) are widely employed for disease diagnosis, treatment planning, and guidance Hill et al. [2001]. In routine clinical practice, these imaging studies are conveyed through a written radiology report (RR), which serves as the official record documenting diagnostic, interventional, or therapeutic examinations or procedures utilizing medical imaging data Liang et al. [2022]. Within the standard clinical workflow, a radiologist initially records detailed findings in the report and subsequently provides a summary of the most significant observations. The summary, also known as the impression, holds paramount importance in an RR, as referring physicians focus their attention on it when reviewing the report Bosmans et al. [2011]. However, radiologists often encounter difficulties in accurately documenting critical findings in the summary, which may result in forgetting information or omitting findings they consider less important. Therefore, report writing is prone to errors and can lead to misunderstandings. In addition, writing the summary is time-consuming and quite tedious, and dictating the findings can be highly repetitive Weber et al. [2020].

Recently, the rapid advancements in Deep Learning (DL) techniques have paved the way for the introduction of summary models in the field of Natural Language Processing (NLP) Narayan et al. [2020]. However, to the best of our knowledge, no multilingual summary models have been developed specifically for summarizing RRs across different languages. Such a model could greatly facilitate future research, utilization of Data Mining techniques, and the development of DL models aiming to incorporate data from patients of diverse ethnic backgrounds. Furthermore, a summarization model could alleviate the stress and workload on radiologists by automatically generating succinct summaries that retain essential information, thus enhancing communication between radiologists and referring physicians Kahn Jr et al. [2009]. The primary objective of this project is to develop a multilingual NLP model dedicated to summarizing RRs in various languages.

2 Related Work

The study by Zhang et al. [2018] was one of the pioneers in developing models for summarizing RRs. They automated the generation of impressions by using a Recurrent Neural Network-based pointer generator Vinyals et al. [2015], which was responsible for the summary of the findings, and a Bidirectional Long Short-Term Memory encoder Zhang et al. [2015], which was used to separately encode background information and findings of an RR. After training and testing the model with 60,990 and 17,425 RRs, respectively, the results showed that the model outperformed the existing non-neural and neural baselines under the ROUGE metrics Lin [2004]. In addition, experienced radiologists confirmed that 67% of the system summaries tested were at least as good as the corresponding human-written summaries.

Cai et al. [2021] developed a pre-trained language model for thoracic radiology, called ChestXRay Bidirectional Encoder Representations from Transformers (ChestXRayBERT), to automatically summarize RRs. They started by collecting 85 radiology-related scientific articles from PubMed Central Roberts [2001] and pre-trained ChestXRayBERT on them. Then, a Transformer decoder was added to ChestXRayBERT to create an abstractive summarization model, which was fine-tuned using RRs. For training and evaluation of model performance, the Indiana University Chest X-Ray (IU X-ray) Demner-Fushman et al. [2016] and MIMIC Chest X-ray (MIMIC-CXR) Johnson et al. [2019] datasets were combined and then divided into training/validation and test sets. The results showed that ChestXRayBERT achieved a significant improvement over other neural network-based abstractive summarization models.

3 Model Architecture

To develop a model capable of summarizing RRs in multiple languages, the Multilingual Text-to-Text Transfer Transformer (mT5) Xue et al. [2020], which can be accessed on the Hugging Face Hub, was used as a starting point. The mT5 model represents a multilingual variant of the Text-to-Text Transfer Transformer (T5) Raffel et al. [2020], which is a pre-trained language model specifically designed for a broad range of text-based NLP tasks, including abstractive summarization and translation. The T5 model employs a uniform Sequence-to-sequence (Seq2seq) structure to address these NLP challenges Chen and Bansal [2018]. The T5 architecture shares similarities with the original Transformer model Vaswani et al. [2017]. However, there are a few distinctions: the layer normalization operates outside the residual path, biases have been eliminated, and a different positional embedding scheme is used.

The mT5 architecture and training procedure are very similar to those of T5. In particular, mT5 is based on the T5.1.1 checkpoint, which improves upon T5 by using Gaussian Error Gated Linear Units nonlinearities Shazeer [2020] and by scaling the output dimensionality of the Feed-Forward Neural Network, other sublayers, and embeddings in the larger models. Furthermore, mT5 is pretrained exclusively on unlabeled data without utilizing dropout. The pretraining dataset for mT5 is mC4, a multilingual variant of the Colossal Clean Crawled Corpus dataset. This dataset encompasses natural text in 101 languages gathered from the publicly available Common Crawl Web Scrape. The mT5 model is offered in five sizes; the base size, with 580M parameters, was used in this project. One difference between T5 and mT5 is that mT5 was pre-trained on mC4 only, without any supervised training. As a result, this model needs to be fine-tuned before it can be used for a downstream task Xue et al. [2020].

4 Material and Methods

This study was approved by the ethics committee (22-10997-BO) and only fully anonymized data have been used.

To train the multilingual final model, we started a local open-source network and used the processing pipeline shown in Figure 1, which illustrates the use of 5 datasets and 7 models. Each dataset is formally denoted as $D^{n}_{l}$, where n is the name of the dataset and l is the language in which the dataset was used. The value of l can be EN (English), PT (Portuguese), and/or GE (German). Similarly, each model is formally represented as $M^{p}_{tl}$, where p is the purpose for which the model was developed, and tl is the target language(s) for which the model was trained. The value of p can be base if the model used was the base version; summaries if the model was trained to summarize texts; rr-1000 if the model was tuned to summarize RRs with a Max New Tokens Parameter (MNTP) of 1,000; and translation if the model was trained specifically to translate RRs.

Since the mT5 model was not specifically fine-tuned for a downstream task, the initial step involved fine-tuning the model for English text summarization. Only the English instances from the Multilingual Amazon Reviews Corpus (MARC) dataset were employed, resulting in 125,893, 3,166, and 3,146 reviews for the training, validation, and test splits, respectively, in the $D^{MARC}_{EN}$ dataset. The review_body columns served as inputs, while the review_title columns were used as targets. The optimizer AdamW and a linear decay learning rate scheduler, decreasing from the maximum value of $2 \times 10^{-5}$ to 0, were employed for training. To ensure concise review summaries, the MNTP was set to 50 tokens. The fine-tuned model checkpoint derived from the $M^{base}_{EN}$ model for English text summarization was defined as $M^{summaries}_{EN}$, trained for 10 epochs with a Batch Size (BS) of 8. The workstation used has 64 GB of RAM, the CPU is an Intel Xeon E5-1650, and the GPU is an NVIDIA P6000 with 24 GB of memory.
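The linear decay schedule described above can be illustrated with a short calculation. The following is a minimal sketch in plain Python, not the training code used in this study. It assumes that no warm-up steps are used and that the last, smaller batch of each epoch is kept; neither detail is stated in the text. The function name `linear_decay_lr` is hypothetical.

```python
import math

MAX_LR = 2e-5             # peak learning rate reported in the text
TRAIN_EXAMPLES = 125_893  # English MARC training reviews
BATCH_SIZE = 8
EPOCHS = 10

# Assumption: the final, smaller batch of each epoch is not dropped.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def linear_decay_lr(step: int, total: int = total_steps, max_lr: float = MAX_LR) -> float:
    """Learning rate after `step` optimizer updates, decaying linearly to 0."""
    remaining = max(0, total - step)
    return max_lr * remaining / total


if __name__ == "__main__":
    # Learning rate at the start, the midpoint, and the end of training.
    for step in (0, total_steps // 2, total_steps):
        print(f"step {step:>7}: lr = {linear_decay_lr(step):.2e}")
```

Under these assumptions the learning rate halves by the midpoint of training and reaches 0 at the last update.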

Subsequently, the $M^{summaries}_{EN}$ model was further fine-tuned for summarizing English RRs using the MIMIC-CXR dataset. The RRs in this dataset consisted of semi-structured text, with various sections such as patient history, findings, and impressions. To address imbalanced data issues, some of the most frequent impressions were removed, so that each impression occurred in less than 2% of the instances. The balanced dataset, $D^{MIMIC-CXR}_{EN}$, comprised 5,816, 1,455, and 1,818 instances for training, validation, and testing, respectively. The resulting model, $M^{rr-1000}_{EN}$, was trained with an MNTP value of 1,000 and a BS of 1, due to the limited GPU memory of the workstation used.
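One possible way to enforce such a frequency cap is sketched below. The text does not describe the exact balancing procedure, so this is an illustration under assumptions: reports are represented as hypothetical `(findings, impression)` tuples, impressions are compared as exact strings, and surplus reports are dropped at random from the most frequent impression until its share falls below the threshold. The function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def largest_share(reports):
    """Fraction of reports that carry the single most frequent impression."""
    counts = Counter(impression for _, impression in reports)
    return max(counts.values()) / len(reports)


def cap_frequent_impressions(reports, max_share=0.02, seed=0):
    """Drop reports until no impression reaches `max_share` of the dataset."""
    if not reports:
        return []
    rng = random.Random(seed)
    groups = defaultdict(list)
    for findings, impression in reports:
        groups[impression].append((findings, impression))
    for group in groups.values():
        rng.shuffle(group)  # so that the dropped reports are a random subset

    while True:
        total = sum(len(group) for group in groups.values())
        largest = max(groups.values(), key=len)
        # Stop once the threshold is met, or when it cannot be met
        # because the most frequent impression already occurs only once.
        if len(largest) < max_share * total or len(largest) == 1:
            break
        largest.pop()

    return [report for group in groups.values() for report in group]


if __name__ == "__main__":
    # Toy data: 100 distinct impressions plus one impression repeated 50 times.
    toy = [(f"findings {i}", f"impression {i}") for i in range(100)]
    toy += [(f"normal {i}", "No acute cardiopulmonary process.") for i in range(50)]
    balanced = cap_frequent_impressions(toy)
    print(len(toy), "->", len(balanced), f"(largest share {largest_share(balanced):.3f})")
```

The loop recomputes the total after each removal because dropping reports also shrinks the denominator of the share.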

To develop a model capable of summarizing RRs in other languages, starting with Portuguese, reports in Portuguese were introduced. Initially, RRs from a Portuguese hospital were employed, but this led to overfitting as most of the reports were identical. Consequently, the generated summaries became repetitive, irrespective of the input findings. To overcome this issue, an alternative approach was adopted, which involved translating the $D^{MIMIC-CXR}_{EN}$ dataset into Portuguese. This resulted in the creation of a new dataset defined as $D^{MIMIC-CXR}_{PT}$, comprising 10,369, 2,593, and 3,241 instances for training, validation, and testing, respectively. The translation of the $D^{MIMIC-CXR}_{EN}$ dataset was performed using the GoogleTranslator library. However, this process proved to be time-consuming and unsuitable for private datasets due to patient confidentiality concerns. Moreover, with the increasing international mobility of people, specialized and automatic translation of RRs is becoming increasingly attractive. Thus, an alternative method was sought to translate the RRs. Considering that the mT5 model is proficient in Seq2seq tasks, including translation, it was fine-tuned to specifically translate impressions and findings from English to Portuguese. The $D^{MIMIC-CXR}_{PT}$ dataset served as the input to develop the $M^{translation}_{EN-PT}$ model, which was trained for 20 epochs to reach stability, with BS 1 and MNTP 1,000.

With the $M^{translation}_{EN-PT}$ model proficient in translating English to Portuguese impressions and findings, the same sections of the English public IU X-Ray dataset were translated. However, the translated dataset exhibited imbalanced data, necessitating the implementation of a solution. To address this issue, the dataset was balanced, resulting in the creation of the $D^{IU X-Ray}_{PT}$ dataset, which comprised 1,273, 319, and 399 instances for training, validation, and testing, respectively. Using this dataset, the $M^{rr-1000}_{PT}$ model was developed and fine-tuned specifically for summarizing Portuguese RRs. The training process lasted for 10 epochs to achieve stability, with a BS of 1 and an MNTP of 1,000 being employed.

The next step was to train a model capable of summarizing RRs in German. For this purpose, RRs related to CTs from a private dataset were utilized to construct the $D^{GermanRRs}_{GE}$ dataset. This dataset consisted of 34,166, 8,542, and 10,677 instances allocated for training, validation, and testing, respectively. Using the German RRs dataset, the $M^{rr-1000}_{GE}$ model was developed and fine-tuned specifically for summarizing German RRs. The training process was carried out for 17 epochs to ensure stability, employing a BS of 1 and setting the MNTP to 1,000.

Up until now, various summary models have been created to summarize radiology reports in English, Portuguese, and German individually. However, these existing models do not align with the objective of this project, which aims to develop a single model capable of summarizing reports in multiple languages. To achieve this goal, a final step was undertaken, involving the training of a multilingual model, the $M^{rr-1000}_{EN,PT,GE}$. To do so, radiology reports from the datasets $D^{MIMIC-CXR}_{EN}$, $D^{IU X-Ray}_{PT}$, and $D^{GermanRRs}_{GE}$ were utilized. Nonetheless, the number of German reports exceeded those in English by approximately 6 times, and English reports were around 5 times more abundant than Portuguese ones. Training the multilingual model with reports in the same ratio as their occurrence would lead to overfitting towards generating German summaries, given the significantly higher number of German instances. To counter this, only as many reports as those in Portuguese (1,991) were used for training the multilingual model in each language. The training, testing, and validation splits consisted of 1,591, 200, and 200 instances, respectively, for each language.
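The ratios and split sizes quoted above can be checked with a few lines of arithmetic, and the per-language subsampling can be sketched as follows. This is an illustration only: the text does not state how the 1,991 reports per language were selected, so uniform random sampling with a fixed seed is an assumption, and the function name `balanced_splits` and the placeholder report identifiers are hypothetical.

```python
import random

# Total instances per language (training + validation + test), as reported in the text.
TOTALS = {
    "EN": 5_816 + 1_455 + 1_818,    # D^{MIMIC-CXR}_{EN}
    "PT": 1_273 + 319 + 399,        # D^{IU X-Ray}_{PT}
    "GE": 34_166 + 8_542 + 10_677,  # D^{GermanRRs}_{GE}
}
# Per-language split sizes of the multilingual dataset, as reported in the text.
SPLIT_SIZES = {"train": 1_591, "validation": 200, "test": 200}


def balanced_splits(reports_by_language, split_sizes=SPLIT_SIZES, seed=0):
    """Draw the same number of reports from every language and split them."""
    rng = random.Random(seed)
    per_language = sum(split_sizes.values())
    splits = {name: [] for name in split_sizes}
    for language, reports in reports_by_language.items():
        # Assumption: uniform sampling without replacement.
        sample = rng.sample(reports, per_language)
        start = 0
        for name, size in split_sizes.items():
            splits[name].extend((language, report) for report in sample[start:start + size])
            start += size
    return splits


if __name__ == "__main__":
    print(f"GE / EN = {TOTALS['GE'] / TOTALS['EN']:.2f}")
    print(f"EN / PT = {TOTALS['EN'] / TOTALS['PT']:.2f}")
    # Placeholder identifiers stand in for the actual reports.
    fake = {lang: [f"{lang}-{i}" for i in range(n)] for lang, n in TOTALS.items()}
    splits = balanced_splits(fake)
    print({name: len(items) for name, items in splits.items()})
```

With three languages, the resulting multilingual splits contain 4,773, 600, and 600 instances for training, validation, and testing, respectively.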

The $M^{rr-1000}_{GE}$ model was chosen as the checkpoint model for fine-tuning due to the favorable language transfer learning abilities of mT5-based models. Language transfer learning involves leveraging knowledge from one language to enhance performance in another. The mT5 model is specifically designed to comprehend shared representations across multiple languages during its pre-training phase. This facilitates the model's understanding of the underlying structure and common patterns across different languages, making the $M^{rr-1000}_{EN,PT,GE}$ model suitable for summarizing RRs in English, Portuguese, and German after fine-tuning.

Figure 1: Processing steps followed to train the final model.

5 Results

Different variants of the ROUGE metric were used to quantitatively evaluate the results obtained with the different summarization models, as this is the most commonly used metric to evaluate summaries. Table 1 summarizes the results.
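To make the metric concrete, the sketch below computes simplified ROUGE-N and ROUGE-L F1 scores in plain Python. It is not the implementation used in this study, which the text does not name. It assumes lowercased whitespace tokenization with no stemming, so punctuation stays attached to words, and it omits ROUGE-Lsum, which additionally splits summaries into sentences. The function names are hypothetical, and scores are returned in the range 0 to 1, whereas Table 1 reports them multiplied by 100.

```python
from collections import Counter


def _tokens(text):
    # Assumption: lowercase whitespace tokenization, no stemming.
    return text.lower().split()


def _f1(overlap, n_candidate, n_reference):
    if overlap == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """F1 over clipped n-gram matches between candidate and reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(_tokens(candidate)), ngrams(_tokens(reference))
    overlap = sum((cand & ref).values())  # each n-gram counts at most min(cand, ref) times
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate, reference):
    """F1 based on the longest common subsequence of tokens."""
    a, b = _tokens(candidate), _tokens(reference)
    previous = [0] * (len(b) + 1)  # LCS lengths for the previous row of the table
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return _f1(previous[-1], len(a), len(b))


if __name__ == "__main__":
    reference = ("Subtle opacities in the upper lungs is concerning for an early "
                 "pneumonia. Mild pulmonary vascular congestion also noted.")
    generated = ("Subtle opacity in the upper lungs may represent an early "
                 "pneumonia. Mild vascular congestion.")
    print(f"ROUGE-1: {rouge_n(generated, reference, 1):.4f}")
    print(f"ROUGE-2: {rouge_n(generated, reference, 2):.4f}")
    print(f"ROUGE-L: {rouge_l(generated, reference):.4f}")
```

The example pair is the first original and model-generated summary from Table 3; scores from this simplified version will generally differ from those of a full ROUGE implementation.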

A potential shortcoming of the ROUGE metrics is that they only measure the similarity between the Generated Summary (GS) and the Reference Summary (RS) but do not adequately reflect the overall grammar or utility of the predictions. Therefore, evaluations were also performed with a radiologist to understand the clinical validity of the summaries generated by the models. In this evaluation, 30 examples were randomly selected from the test sets. All models were applied to these examples, and the GSs were presented to the radiologist along with the corresponding human-written RSs. The radiologist was asked to decide which of the summaries was better, or if they were of roughly the same quality. The expert was also asked to rate the GSs on a scale of 5 (very good) to 1 (very poor) in terms of Readability (R), Factual Correctness and Completeness (FCC), and Overall Quality (OQ). The average results of these metrics are shown in Table 2, as well as the percentage of GSs that were rated as better than or equal to the RSs.
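The aggregation of such ratings can be sketched as follows. The record layout, the field names, and the example ratings are hypothetical and serve only to show how the percentage of GSs rated at least as good as the RSs and the mean scores are obtained; they are not the ratings collected in this study.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Rating:
    preference: str   # "GS", "RS", or "equal": which summary the rater preferred
    readability: int  # R, from 1 (very poor) to 5 (very good)
    fcc: int          # Factual Correctness and Completeness, 1 to 5
    overall: int      # Overall Quality, 1 to 5


def summarize(ratings):
    """Aggregate one model's ratings into the quantities reported per row."""
    at_least_as_good = sum(r.preference in ("GS", "equal") for r in ratings)
    return {
        "GS >= RS (%)": round(100 * at_least_as_good / len(ratings), 2),
        "R": round(mean(r.readability for r in ratings), 2),
        "FCC": round(mean(r.fcc for r in ratings), 2),
        "OQ": round(mean(r.overall for r in ratings), 2),
    }


if __name__ == "__main__":
    # Made-up ratings for 30 examples: 24 where the GS was preferred or tied.
    example = [Rating("GS", 5, 4, 4)] * 14 + [Rating("equal", 5, 4, 4)] * 10
    example += [Rating("RS", 4, 4, 4)] * 6
    print(summarize(example))
```

With 30 examples, each summary contributes about 3.33 percentage points, so the percentage column can only take values that are multiples of 100/30.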

In addition to the qualitative analysis performed by the radiologist, a second analysis was performed by presenting ChatGPT (Shahriar and Hayawi [2023]; Li et al. [2023]) with the findings sections of various reports and asking it to summarize them. Some examples of summaries obtained with ChatGPT and with the models are shown in Table 3.

6 Discussion and Conclusion

Based on the analysis of the results presented in Table 1, it can be concluded that the ROUGE values obtained fall within the range reported in existing literature for the task of summarizing radiology reports Cai et al. [2021]. By examining Table 2, it becomes apparent that the multilingual $M^{rr-1000}_{EN,PT,GE}$ generally has the best qualitative results compared to the other checkpoint models that summarize radiology reports in only one language. While the $M^{rr-1000}_{EN,PT,GE}$ produced summaries in English with slightly less factual accuracy, completeness, and overall quality than the $M^{rr-1000}_{EN}$ and summaries in Portuguese with slightly less readability than the $M^{rr-1000}_{PT}$, the summaries produced by the multilingual model in German were better than the summaries produced by the $M^{rr-1000}_{GE}$ in all metrics. Moreover, the percentage of summaries generated that are equal to or better than the reference summaries is always higher for the $M^{rr-1000}_{EN,PT,GE}$ than for the best model summarizing RRs in English, Portuguese, or German. This indicates that GSs produced by the multilingual model are more realistic and, in most cases, at least as good as those produced by a radiologist. It is also important to note that, for the multilingual model, all values for the metrics of readability, factual correctness and completeness, and overall quality are between 4 and 5, i.e., from good to very good, while some of the same values for the monolingual summarization models are lower than 4.

Training a single model such as $M^{rr-1000}_{EN,PT,GE}$ to summarize radiology reports in multiple languages produces better results than training separate models for each language. One of the reasons that could lead to better performance is that the $M^{base}_{EN}$, which served as the basis for developing the final $M^{rr-1000}_{EN,PT,GE}$, has shared representations for all languages. This means that it can learn multilingual embeddings by leveraging the similarities and common structures in different languages, allowing it to better generalize across languages. By jointly fine-tuning on English, Portuguese, and German radiology reports, the model can capture cross-linguistic information, which improves its ability to produce summaries in all three languages. In addition, the model is also able to apply language transfer learning in the summarization task, meaning that knowledge acquired in one language can improve performance in another. For example, the model can learn effective summarization strategies in English, and some of this knowledge can be transferred to Portuguese and German, resulting in improved performance in those languages as well.

The information presented in Table 3 further supports the ability of the developed models to summarize RRs. The results show that the summaries generated by the developed models are more similar to the original summaries than those generated by ChatGPT, which is expected given that ChatGPT is not specialized in summarizing RRs. It was also observed that ChatGPT produces summaries that are either the same length as or longer than the original findings. Consequently, it can be concluded that ChatGPT does not synthesize information but rather rephrases the information contained in the findings, lacking the capability to discern the relative importance of different pieces of information. Therefore, these results confirm that employing specialized models for summarizing RRs is more effective than using a nonspecialized model for this task.
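The length observation can be quantified with a simple ratio of summary length to findings length. The sketch below uses whitespace-separated word counts, which is an assumption (the text does not say how length was compared), and the function name `length_ratio` is hypothetical. The three texts are the first English example from Table 3.

```python
def length_ratio(summary: str, findings: str) -> float:
    """Number of words in the summary divided by the number of words in the findings."""
    n_findings = len(findings.split())
    if n_findings == 0:
        raise ValueError("findings must contain at least one word")
    return len(summary.split()) / n_findings


FINDINGS = (
    "PA and lateral views of the chest provided. Subtle opacity in the upper lungs "
    "may represent an early pneumonia. The hila appear somewhat prominent which may "
    "indicate mild vascular congestion. No overt edema or large effusion is seen. "
    "Dense atherosclerotic calcification along the aorta is noted. The heart is "
    "top-normal in size. Severe degenerative disease at the right shoulder is again "
    "seen. There is a dextroscoliosis of the T-spine."
)
CHATGPT = (
    "The chest X-ray includes PA and lateral views. There is a subtle opacity in the "
    "upper lungs, indicating a possible early pneumonia. The hila (structures near "
    "the lungs’ root) appear slightly prominent, suggesting mild vascular congestion. "
    "No significant edema or large effusion is observed. Dense atherosclerotic "
    "calcification is noted along the aorta. The heart size is within normal limits. "
    "Severe degenerative disease is visible in the right shoulder, and there is a "
    "dextroscoliosis of the thoracic spine."
)
MODEL = (
    "Subtle opacity in the upper lungs may represent an early pneumonia. "
    "Mild vascular congestion."
)

if __name__ == "__main__":
    print(f"ChatGPT / findings: {length_ratio(CHATGPT, FINDINGS):.2f}")
    print(f"Model / findings:   {length_ratio(MODEL, FINDINGS):.2f}")
```

A ratio at or above 1 means that the text was not shortened at all, while a ratio well below 1 indicates that the findings were condensed.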

In conclusion, this research resulted in the development of the $M^{rr-1000}_{EN,PT,GE}$ model, publicly available at https://huggingface.co/MarianaLC/mt5-rr-1000, which can generate summaries of RRs in English, Portuguese, and German that are as good as or even superior to those produced by radiologists. The favorable outcomes suggest that the GSs can be utilized in conjunction with medical imaging and other data to develop multimodal DL models, such as those used for diagnosis, prediction, or treatment, without relying solely on data from patients of a single ethnicity. Moreover, the generated summaries can alleviate the workload of radiologists, who would only need to validate the accuracy of the generated summaries and provide additional information if necessary, rather than creating summaries from scratch. The radiologist who evaluated the English and Portuguese summaries even expressed a preference for this approach, stating, "I would rather do it this way than write a summary from scratch!". Future work could assess whether the RECIST (Response Evaluation Criteria in Solid Tumours) categories can be derived from the summary, e.g., whether a case is a progressive disease (PD), stable disease (SD), stable response (SR), or treatment response (TR).

Table 1: Quantitative results of the summarization models.

| Model Name | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|---|---|---|---|---|
| $M^{summaries}_{EN}$ | 26.33 | 16.94 | 25.61 | 25.68 |
| $M^{rr-1000}_{EN}$ | 40.56 | 26.96 | 37.72 | 39.04 |
| $M^{rr-1000}_{PT}$ | 42.48 | 32.39 | 41.27 | 41.71 |
| $M^{rr-1000}_{GE}$ | 41.50 | 28.08 | 38.26 | 40.37 |
| $M^{rr-1000}_{EN,PT,GE}$ | 46.11 | 32.31 | 43.54 | 44.93 |

Table 2: Results of the qualitative radiologist evaluation.
| Model Name | Language | GS ≥ RS (%) | R (1-5) | FCC (1-5) | OQ (1-5) |
|---|---|---|---|---|---|
| $M^{rr-1000}_{EN}$ | - | 66.67 | 4.73 | 4.10 | 4.20 |
| $M^{rr-1000}_{PT}$ | - | 63.33 | 4.80 | 4.03 | 3.93 |
| $M^{rr-1000}_{GE}$ | - | 50.00 | 4.33 | 4.20 | 4.00 |
| $M^{rr-1000}_{EN,PT,GE}$ | English | 80.00 | 4.73 | 4.07 | 4.07 |
| $M^{rr-1000}_{EN,PT,GE}$ | Portuguese | 73.33 | 4.70 | 4.13 | 4.10 |
| $M^{rr-1000}_{EN,PT,GE}$ | German | 93.33 | 4.90 | 4.23 | 4.40 |

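As an illustration of how one row of Table 2 could be derived from individual ratings, the sketch below aggregates a list of per-report ratings into the share of cases with GS ≥ RS and the mean R, FCC and OQ scores. The record layout (the `gs_at_least_rs` flag and the score keys) and the three example ratings are hypothetical and are not the study data. All percentages in the table are multiples of 1/30, which is consistent with 30 rated reports per row, although the sample size is not stated in this section.

```python
from statistics import mean


def aggregate_ratings(ratings):
    """Summarize per-report ratings into one table row.

    Each rating is a dict with a boolean 'gs_at_least_rs' and integer
    scores 'R', 'FCC' and 'OQ' on a 1-5 scale (hypothetical layout).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    preferred = sum(1 for r in ratings if r["gs_at_least_rs"])
    row = {"GS >= RS (%)": round(100.0 * preferred / len(ratings), 2)}
    for key in ("R", "FCC", "OQ"):
        row[key] = round(mean(r[key] for r in ratings), 2)
    return row


if __name__ == "__main__":
    # Made-up ratings for three reports, for illustration only.
    example = [
        {"gs_at_least_rs": True, "R": 5, "FCC": 4, "OQ": 4},
        {"gs_at_least_rs": True, "R": 5, "FCC": 4, "OQ": 5},
        {"gs_at_least_rs": False, "R": 4, "FCC": 5, "OQ": 4},
    ]
    print(aggregate_ratings(example))
```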
Table 3: Results of the qualitative radiologist evaluation of the summarization models.
Example 1 (English)

Findings: PA and lateral views of the chest provided. Subtle opacity in the upper lungs may represent an early pneumonia. The hila appear somewhat prominent which may indicate mild vascular congestion. No overt edema or large effusion is seen. Dense atherosclerotic calcification along the aorta is noted. The heart is top-normal in size. Severe degenerative disease at the right shoulder is again seen. There is a dextroscoliosis of the T-spine.

Original summary: Subtle opacities in the upper lungs is concerning for an early pneumonia. Mild pulmonary vascular congestion also noted.

ChatGPT summary: The chest X-ray includes PA and lateral views. There is a subtle opacity in the upper lungs, indicating a possible early pneumonia. The hila (structures near the lungs’ root) appear slightly prominent, suggesting mild vascular congestion. No significant edema or large effusion is observed. Dense atherosclerotic calcification is noted along the aorta. The heart size is within normal limits. Severe degenerative disease is visible in the right shoulder, and there is a dextroscoliosis of the thoracic spine.

Model summary: Subtle opacity in the upper lungs may represent an early pneumonia. Mild vascular congestion.

Example 2 (Portuguese)

Findings: Contornos mediastinais de tamanho normal do coração. Atelectasia subsegmentar versus cicatriz no meio do pulmão direito e no lobo inferior esquerdo. Sem doença focal do espaço aéreo. Sem derrame pleural ou pneumotórax. Baixos volumes pulmonares. As estruturas ósseas visualizadas não têm aparência digna de nota.

Original summary: Baixos volumes pulmonares com cicatriz bilateral irregular versus atelectasia. Caso contrário, nenhuma anormalidade pulmonar aguda ou pulmonar.

ChatGPT summary: O relatório indica que o coração possui contornos mediastinais normais. Observou-se a presença de atelectasia subsegmentar ou cicatriz no meio do pulmão direito e no lobo inferior esquerdo. Não foram encontradas doenças focais no espaço aéreo, nem derrame pleural ou pneumotórax. Os volumes pulmonares estão baixos. Quanto às estruturas ósseas visualizadas, nenhuma apresentou características dignas de nota.

Model summary: Atelectasia subsegmentar versus cicatriz no pulmão direito e no lobo inferior esquerdo. Sem doença focal do espaço aéreo. Baixos volumes pulmonares.

Acknowledgement
We acknowledge the Plattform für KI-Translation Essen (KITE) project from the REACT-EU initiative (https://kite.ikim.nrw/, FRE-0801977). This work was also supported by the FCT within the R&D Units Project Scope: UIDB/00319/2020.

Conflicts of interest
All authors have read and approved the manuscript. The authors also declare no conflicts of interest regarding the publication of this paper.
