TocBERT: Medical Document Structure Extraction Using Bidirectional Transformers

Sarra Baghdadi (1, 2), Majd Saleh (1, 2) and Stéphane Paquelet (1)
(1) Institute of Research and Technology b-com, Rennes, France
(2) Equal contribution
(June 27, 2024)
Abstract

Text segmentation holds paramount importance in the field of Natural Language Processing (NLP). It plays an important role in several NLP downstream tasks like information retrieval and document summarization. In this work, we propose a new solution, namely TocBERT, for segmenting texts using bidirectional transformers. TocBERT represents a supervised solution trained on the detection of titles and sub-titles from their semantic representations. This task was formulated as a named entity recognition (NER) problem. The solution has been applied on a medical text segmentation use-case where the Bio-ClinicalBERT model is fine-tuned to segment discharge summaries of the MIMIC-III dataset. The performance of TocBERT has been evaluated on a human-labeled ground truth corpus of 250 notes. It achieved an F1-score of 84.6% when evaluated on a linear text segmentation problem and 72.8% on a hierarchical text segmentation problem. It outperformed a carefully designed rule-based solution, particularly in distinguishing titles from subtitles.

Keywords:
Title detection; text segmentation; NLP; language models; transformers; BERT; information retrieval; medical text cleaning.

I Introduction

Text segmentation is the task of partitioning a document into topically coherent segments [1, 2, 3, 4]. It represents an important pre-processing step in several Natural Language Processing (NLP) downstream tasks including information retrieval and summarization [1, 2, 3, 4, 5, 6]. Two types of text segmentation can be distinguished: 1) linear segmentation where a document is divided into sequential contiguous segments, and 2) hierarchical segmentation where higher level segments are further segmented into smaller sub-segments [3].

Existing solutions to text segmentation are applied either to unformatted, i.e. plain-text, documents like [3, 4] or to formatted ones like [7, 8, 9]. In the first family, recent text segmentation methods are based on the semantic representation of words and/or of sentences. For instance, word embeddings and sentence embeddings are used to detect topic changes in a sequence of sentences in [3, 4]. In the second family, document layout, orthographic features, geometric features and stylistic properties, e.g. font properties and line spaces, are exploited to extract the document structure. Some of the aforementioned features can be extracted using the Document Image Analysis (DIA) [10, 9] while others can be extracted using regex matches of predefined patterns [8]. In both families, titles and subtitles play an important role in text segmentation. For example, in semantic-aspects-based methods, supervised solutions like [3] and [4] are trained on a large corpus named WIKI-727K [4] which is automatically labeled such that topic-change cutoff points are completely dependent on titles. Similarly, visual-aspects-based methods like [9] are mainly based on title detection.

Figure 1: An extract from a discharge summary from the MIMIC-III database annotated with titles (yellow) and subtitles (blue)

In this article, the focus is set on the hierarchical segmentation of unformatted medical documents by detecting titles and subtitles using their semantic vector representations. Particularly, the considered use-case is the segmentation of the discharge summaries of the MIMIC-III database [11] which represents an unformatted free-text corpus. Figure 1 shows an extract of a discharge summary where we highlighted titles in yellow and subtitles in blue. The motivation underlying the aforementioned segmentation task is two-fold:

  • To clean the MIMIC-III discharge summaries corpus such that it can be efficiently used to domain-adapt pretrained language models. For example, noisy sections like ”Discharge medications” or ”Admission labs” can be detected and removed since their content is not fluent natural language and can harm the domain adaptation task.

  • To enable the extraction of specialized sub-corpora and facilitate the information retrieval tasks. For instance, sub-corpora about radiology reports or cardiac diseases reports can be extracted and used to train relevant information extraction systems.

While the aforementioned advantages have been discussed on the considered use-case, the proposed method is general and can be extended to solve many similar tasks as will be shown in Section IV.

The rest of the paper is organized as follows: we start by exploring related works in Section II. Training and test corpora are described in Section III. The proposed method is presented in Section IV. Experimental results are discussed in Section V while Section VI concludes the paper and discusses future work.

II Related work

Figure 2: Two-level neural networks for text segmentation: the first-level network builds sentence embeddings from token embeddings while the second-level network labels the sequence of sentence representations as: topic-change (1) or no-topic-change (0)

One remarkable family of solutions to topical text segmentation was proposed in [4, 3]. As depicted in Figure 2, these methods are based on two-level neural networks that divide the segmentation problem into two sub-problems: 1) starting from token embeddings, build sentence embeddings, and 2) based on the sentence representations, annotate the sequence of sentences with binary labels, i.e. 0 for no-topic-change and 1 for topic-change. Specifically, these methods start by dividing the input document into sentences (sentence tokenization) using the NLTK PUNKT tokenizer [12]. Then, they use pretrained token embeddings to represent input tokens $\mathbf{w}_i$. Concretely, word2vec embeddings [13, 14] are used in [4] while FastText embeddings [15] are used in [3]. Token embeddings of each sentence are fed to a sentence embedding model in order to create sentence representations, i.e. each sentence is represented by a dense vector $\mathbf{s}_j$. This sentence embedding model is a 2-layer bidirectional LSTM [16] in the method proposed in [4] while it is a transformer encoder [17] in the method proposed in [3]. The sequence of sentence embeddings is then fed to a sentence labeling model which assigns a binary label $l_i \in \{0, 1\}$ to each sentence, where $l_i = 1$ encodes the existence of a topic change. This sentence labeling model is a 2-layer bidirectional LSTM in [4] while it is a transformer encoder in [3]. Both of the aforementioned solutions are trained in a supervised manner where the labeled corpus WIKI-727K [4] is used for training. Note that the transformer-based solution adds another feed-forward head besides the sentence labeling model. This head is responsible for explicit coherence modeling [3]. A similar work was proposed in [18] where the authors tested a solution based on a transformer to encode sentences (first level) and a bidirectional LSTM to detect topic changes (second level). For deeper details about how transformers and LSTMs work, interested readers are referred to the tutorial [19].
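To make the output of such sentence labeling models concrete, the following minimal sketch turns a sequence of binary labels into segments. It assumes, as a convention chosen here for illustration only, that a label of 1 marks the last sentence of a segment; the function name `labels_to_segments` is hypothetical and is not taken from [3] or [4].

```python
from typing import List


def labels_to_segments(sentences: List[str], labels: List[int]) -> List[List[str]]:
    """Group sentences into segments from binary topic-change labels.

    Assumption (illustrative): labels[i] == 1 means that sentence i is the
    last sentence of its segment, i.e. a topic change occurs right after it.
    """
    if len(sentences) != len(labels):
        raise ValueError("sentences and labels must have the same length")
    segments: List[List[str]] = []
    current: List[str] = []
    for sentence, label in zip(sentences, labels):
        current.append(sentence)
        if label == 1:
            segments.append(current)
            current = []
    if current:  # trailing sentences without a closing topic-change label
        segments.append(current)
    return segments


sentences = ["s1", "s2", "s3", "s4", "s5"]
labels = [0, 1, 0, 0, 1]
segments = labels_to_segments(sentences, labels)
print(segments)
```

With this convention, a linear segmentation is fully described by one label per sentence, which is what allows the task to be trained as a sequence labeling problem.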

While the aforementioned solutions are supervised, many solutions based on unsupervised-learning have been proposed. For instance, the method proposed in [20] uses RoBERTa embeddings [21] to calculate semantic similarity scores between sentences. These scores are then used to perform segmentation. Particularly, the variation in similarity scores over time is leveraged to recognize changes in topics. The problem of representing sentences from token embeddings was addressed in two ways: 1) to use sentence-BERT [22], a version of BERT [23] fine-tuned on sentence representation tasks, and 2) to aggregate token embeddings from the 2nd to the last layer of RoBERTa using max pooling.
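The idea of detecting topic changes from similarity scores can be sketched as follows. This is a deliberately simplified illustration and not the method of [20]: it compares only adjacent sentence embeddings and uses a fixed threshold, whereas [20] exploits the variation of the scores over time. The toy two-dimensional vectors, the threshold value and the function names are assumptions made for the example.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_boundaries(embeddings: List[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Return every index i where a boundary is placed between sentence i and i + 1."""
    boundaries = []
    for i in range(len(embeddings) - 1):
        if cosine(embeddings[i], embeddings[i + 1]) < threshold:
            boundaries.append(i)
    return boundaries


# Toy sentence embeddings: two sentences on one topic, then two on another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
boundaries = topic_boundaries(toy_embeddings, threshold=0.5)
print(boundaries)
```

In practice the embeddings would come from a sentence encoder such as those discussed above, and the threshold would have to be tuned on held-out data.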

Before the deep-learning era, several interesting statistical solutions were proposed. Interested readers are referred for example to 1) TextTiling [24] which captures semantic similarity between sentences using word frequency vectors, 2) TopicTiling [25] an algorithm based on TextTiling [24] and Latent Dirichlet Allocation (LDA) [26], 3) LDA-based topic modeling [27], and 4) the statistical models for text segmentation proposed in [2].

The solutions discussed so far represent the semantic-aspects-based text segmentation family. Let us now explore some visual-aspects-based text segmentation methods, i.e. segmentation applied to formatted documents. In the financial domain, several interesting solutions to the document-structure-extraction problem have been proposed thanks to the FinTOC shared tasks [8, 7, 9]. These tasks aim at extracting the structure of formatted documents with complex layouts stored in PDF format. The objective is to detect titles and to hierarchically organize them into a Table of Contents (ToC). For instance, the authors of [9] proposed a title detection algorithm based on Document Image Analysis (DIA). Particularly, they started from a Faster R-CNN [28] model pretrained on the PubLayNet dataset [29] and fine-tuned it on the FinTOC-2022 training set. In order to recognize the hierarchical levels of the detected titles, they extracted orthographic and layout features from the detected titles and proposed a supervised random forest model to predict the title level.

III Data analysis and annotation

The experimental work has been performed on the discharge summaries corpus, a sub-corpus from the MIMIC-III [11] database. MIMIC-III contains more than two million reports including discharge summaries, EEG reports, radiology reports, and many others. Figure 3-a shows the distribution of MIMIC-III reports’ families. The discharge summaries corpus contains around 60 000 reports. Figure 3-b shows the distribution of discharge summaries’ lengths given in number of tokens. The average discharge summary length is 1435 tokens (roughly 3 pages).

We divided the discharge summaries corpus into a test corpus of 250 randomly selected reports and a training corpus containing the remaining reports. The test corpus has been manually annotated using the web application ACUITEE [30] as illustrated in Figure 1. Note that two hierarchical levels of titles are considered: titles (highlighted in yellow) and sub-titles (highlighted in blue).

In order to annotate the training corpus, we conducted a deep analysis of the discharge summaries structure and we proposed a rule-based title detection system, namely TocRegex. This latter has been implemented using regular expressions. For example, the first considered pattern was any character sequence that:

  • begins with a newline ”\n” followed by a valid title content and terminates with a colon followed by newline ”:\n”; or

  • begins with a double newline ”\n\n” followed by a valid title content and terminates with a colon ”:”; or

  • begins with a beginning-of-document character followed by a valid title content and terminates with a colon ”:”
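The three alternatives above can be sketched as a single regular expression. Since the actual definition of ”valid title content” is only given in Appendix A, the `TITLE_CONTENT` sub-pattern below is a hypothetical stand-in (a letter followed by letters, spaces and a few punctuation marks), and the sample text is invented for illustration; this is not the TocRegex implementation.

```python
import re

# Hypothetical stand-in for "valid title content": a letter followed by
# letters, spaces and a few punctuation marks (no colon, no newline).
TITLE_CONTENT = r"[A-Za-z][A-Za-z /()&-]{1,60}"

# Lookarounds are used so that newlines are checked but not consumed,
# which lets consecutive titles share the newline between them.
FIRST_PATTERN = re.compile(
    rf"(?<=\n)(?P<a>{TITLE_CONTENT}):(?=\n)"  # "\n" + title + ":\n"
    rf"|(?<=\n\n)(?P<b>{TITLE_CONTENT}):"     # "\n\n" + title + ":"
    rf"|\A(?P<c>{TITLE_CONTENT}):"            # start of document + title + ":"
)


def find_candidate_titles(text: str) -> list:
    """Return the candidate titles matched by the first pattern, in order."""
    return [
        m.group("a") or m.group("b") or m.group("c")
        for m in FIRST_PATTERN.finditer(text)
    ]


sample = (
    "Admission Date: 2101-01-01\n"
    "\n"
    "History of Present Illness:\n"
    "The patient is a 60 year old man.\n"
    "\n"
    "Past Medical History: hypertension\n"
)
candidates = find_candidate_titles(sample)
print(candidates)
```

Each of the three candidates in the sample is caught by a different alternative, which shows why the pattern needs all three cases.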

The definition of ”valid title content” as well as the full list of the 12 considered regular expressions are given in Appendix A.
Figure 4 shows the frequencies of the top 35 matches detected using the above-described pattern. For instance, the title ”history of present illness” has been detected 10e5 times in the corpus. We note that several false positives also match the considered pattern, e.g. ”tablet(s)*refills”, which appears frequently in medication lists. In order to filter out such false positives, a manual curation has been applied to the detected titles. As a result, the total number of remaining unique titles is 56 357. They appear 2 930 725 times in the corpus, i.e. the average number of titles is around 49 per discharge summary.
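The counting and curation step, together with the reported average, can be illustrated with a short sketch. The candidate list and the exclusion set are toy data, lowercasing the candidates is an assumption made here, and the corpus size is taken as the approximate figure of 60 000 reports quoted earlier in this section.

```python
from collections import Counter
from typing import Iterable, Set


def curate(candidates: Iterable[str], false_positives: Set[str]) -> Counter:
    """Count normalized candidate titles and drop manually rejected ones."""
    counts = Counter(c.strip().lower() for c in candidates)
    for rejected in false_positives:
        counts.pop(rejected, None)
    return counts


# Toy candidates, as a pattern matcher might return them.
candidates = [
    "History of Present Illness",
    "history of present illness",
    "Tablet(s)*Refills",
    "Allergies",
]
curated = curate(candidates, false_positives={"tablet(s)*refills"})
print(curated.most_common())

# Average number of titles per discharge summary, from the reported totals.
total_title_occurrences = 2_930_725
approx_num_summaries = 60_000  # "around 60 000" reports
average_titles = total_title_occurrences / approx_num_summaries
print(round(average_titles, 1))
```

The division gives roughly 48.8, which is consistent with the reported average of around 49 titles per discharge summary.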

Figure 3: a) distribution of sub-corpora sizes of the MIMIC-III database. b) distribution of reports’ lengths of the discharge summaries corpus.
Figure 4: Top 35 ”candidate” titles extracted using the first pattern of the TocRegex solution including true positives (blue) and false positives (light brown)

IV The proposed solution: TocBERT

In this section, we explore the proposed solution TocBERT (Table of Content BERT). The hierarchical title detection task is formulated as a sequence labeling problem i.e. token classification problem. Particularly, it is considered as a named entity recognition (NER) problem with three entity types: ”I-title” for titles, ”I-Stitle” for subtitles and ”O” for other tokens. Note that we adopt the ”Inside–outside–beginning” (IOB) standard labeling format. Figure 5 shows two examples of labeled sequences where tokens are tagged with title, subtitle or outside labels. For instance, the second sequence has one title ”Physical exam” and four subtitles ”HEENT”, ”Neck”, ”Lungs” and ”Extremities”.

Figure 5: Labeling token sequences in TocBERT: Example 1) two titles ”Past Medical History” and ”Family History” are labeled with ”I-title” while other tokens are labeled with ”O”; Example 2) one title ”Physical exam” is labeled with ”I-title” and four subtitles ”HEENT”, ”Neck”, ”Lungs” and ”Extremities” are labeled with ”I-Stitle”, while other (outside) tokens are labeled with ”O”.
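The labeling scheme can be illustrated with a small sketch that assigns one of the three tags to every token of a sequence. The token sequence below is invented in the spirit of the second example, labeling the colons as ”O” is an assumption, and the helper name `label_tokens` is hypothetical.

```python
from typing import List, Tuple

ENTITY_LABELS = ("I-title", "I-Stitle")


def label_tokens(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Build a label sequence from (start, end, label) spans, end exclusive.

    Every token outside the given spans receives the outside label "O".
    """
    labels = ["O"] * n_tokens
    for start, end, label in spans:
        if label not in ENTITY_LABELS:
            raise ValueError(f"unknown label: {label}")
        for i in range(start, end):
            labels[i] = label
    return labels


tokens = ["Physical", "exam", ":", "HEENT", ":", "normal", "Neck", ":", "supple"]
spans = [(0, 2, "I-title"), (3, 4, "I-Stitle"), (6, 7, "I-Stitle")]
labels = label_tokens(len(tokens), spans)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

A token classifier trained on such sequences predicts one of the three labels per token, so titles and subtitles are recovered as maximal runs of identical non-”O” labels.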

TocBERT is based on fine-tuning the pretrained model Bio-ClinicalBERT [31] on the above-mentioned NER task. Note that bidirectional transformers, like the BERT family, are more convenient for tackling sequence labeling tasks, like NER, compared to generative, i.e. auto-regressive, transformers like GPT [32]. Bio-ClinicalBERT is a variant of BERT adapted to the biomedical and clinical domains. We fine-tuned this pretrained model using the discharge summaries training corpus that has been semi-automatically labeled using TocRegex followed by a manual curation, as described in Section III. Bio-ClinicalBERT, like BERT, is pretrained using a fixed-size vocabulary of 28 996 tokens. This vocabulary was created using the word-piece algorithm [33]. Some of the resultant tokens represent sub-words. For this reason, the first step of TocBERT is to project the labels of the full-word tokens (extracted by a simple pre-tokenization procedure) onto sub-word tokens (extracted using the Bio-ClinicalBERT tokenizer). The second step consists in preparing convenient training windows from the training corpus. This is important because the maximum window size of BERT is 512 (sub-word) tokens while the average length of a discharge summary is 1435 (full-word) tokens. To this end, discharge summaries have been segmented into windows of 384 words, using the approximation 1 token = 0.75 words (512 × 0.75 = 384). The total size of the resultant training set is 144 000 labeled windows. Finally, the last step consists of training the TocBERT model.
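The first two steps can be sketched as follows. Several choices here are assumptions rather than details given in the text: every sub-word inherits the label of its word, windows do not overlap, and `toy_wordpiece` is a toy stand-in that merely cuts words into four-character pieces in place of the Bio-ClinicalBERT tokenizer.

```python
from typing import Callable, List, Tuple

MAX_SUBWORD_TOKENS = 512
WORDS_PER_TOKEN = 0.75
WINDOW_WORDS = int(MAX_SUBWORD_TOKENS * WORDS_PER_TOKEN)  # 384


def toy_wordpiece(word: str, piece_len: int = 4) -> List[str]:
    """Toy sub-word splitter: fixed-size pieces, continuations prefixed by '##'."""
    if not word:
        return []
    pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def project_labels(
    words: List[str], labels: List[str], tokenize: Callable[[str], List[str]]
) -> Tuple[List[str], List[str]]:
    """Project word-level labels onto sub-word tokens (each piece inherits the label)."""
    sub_tokens: List[str] = []
    sub_labels: List[str] = []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        sub_tokens.extend(pieces)
        sub_labels.extend([label] * len(pieces))
    return sub_tokens, sub_labels


def make_windows(
    words: List[str], labels: List[str], window_words: int = WINDOW_WORDS
) -> List[Tuple[List[str], List[str]]]:
    """Cut a labeled document into consecutive, non-overlapping windows of words."""
    return [
        (words[i:i + window_words], labels[i:i + window_words])
        for i in range(0, len(words), window_words)
    ]


sub_tokens, sub_labels = project_labels(
    ["Extremities", ":", "warm"], ["I-Stitle", "O", "O"], toy_wordpiece
)
print(sub_tokens, sub_labels)

# A document of average length (1435 words) yields four windows.
doc_words = [f"w{i}" for i in range(1435)]
doc_labels = ["O"] * len(doc_words)
windows = make_windows(doc_words, doc_labels)
print([len(w) for w, _ in windows])
```

An alternative to label inheritance is to label only the first sub-word of each word and mask the others in the loss; the text does not say which variant was used.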

V Experimental results

V-A Experimental configurations

TocBERT was trained on one NVIDIA A100 GPU equipped with 80 GB of RAM. It was trained for 20 epochs with a batch size of 16 training samples. The training time was around 17 hours. The inference hardware was an NVIDIA RTX A3000 laptop GPU equipped with 6 GB of dedicated RAM.

V-B Results

Table I shows the experimental results of the proposed text segmentation solutions, TocBERT and TocRegex, in the hierarchical segmentation configurations, i.e. detecting both titles and sub-titles. While TocBERT is trained on a corpus labeled using TocRegex, the former considerably outperforms the latter in all the considered performance criteria, i.e. precision, recall and F1-score.

TABLE I: Experimental results: hierarchical text segmentation
Method     Precision   Recall   F1-score
TocBERT    0.714       0.754    0.728
TocRegex   0.667       0.563    0.606

Table II shows the results of the aforementioned solutions in the linear text segmentation configuration. TocBERT and TocRegex show comparable overall performance as measured by the F1-score. TocBERT achieves higher recall (sensitivity) but lower precision. Note that TocBERT relies entirely on the semantic representation of titles and their context; it does not depend on visual cues such as newlines and colons. This explains why it detects more titles (higher recall). TocRegex, on the other hand, is more precise since it relies entirely on patterns that occur almost exclusively in titles.

Comparing the results of hierarchical and linear segmentation, the value of TocBERT is clearly its strong capacity to exploit the context and the semantic aspects to discriminate between titles and sub-titles.
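The difference between the two evaluation settings can be illustrated with a small sketch that scores predicted spans against gold spans. The exact matching and aggregation scheme behind Tables I and II is not detailed in the text, so the choices below are assumptions: spans must match exactly, and the linear setting is obtained by ignoring the title level. The spans are toy data.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end, level)


def precision_recall_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1-score between two sets of spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def to_linear(spans: Set[Span]) -> Set[Span]:
    """Ignore the title level so that titles and subtitles are treated alike."""
    return {(start, end, "title") for start, end, _ in spans}


gold = {(0, 2, "title"), (3, 4, "subtitle"), (6, 7, "subtitle")}
pred = {(0, 2, "title"), (3, 4, "title"), (6, 7, "subtitle"), (9, 10, "subtitle")}

hierarchical = precision_recall_f1(gold, pred)
linear = precision_recall_f1(to_linear(gold), to_linear(pred))
print("hierarchical:", hierarchical)
print("linear:", linear)
```

In this toy case the span (3, 4) is found but assigned the wrong level, so it counts as an error only in the hierarchical setting, which is why hierarchical scores cannot exceed the linear ones under these assumptions.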

TABLE II: Experimental results: linear text segmentation
Method     Precision   Recall   F1-score
TocBERT    0.829       0.878    0.846
TocRegex   0.932       0.784    0.845

TABLE III: Regular expressions used to detect titles
[table image not available]

In terms of inference execution time, TocRegex takes 87 ms while TocBERT takes 195 ms, on average, to segment a discharge summary. While TocRegex is faster, both solutions satisfy near-real-time requirements.
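An average per-document latency of this kind can be measured with a simple wall-clock loop such as the one below. This is only a sketch of one possible protocol: the text does not describe how the timings were obtained, and the trivial line-splitting function stands in for an actual segmenter.

```python
import time
from typing import Callable, Iterable


def mean_latency_ms(segment: Callable[[str], object], documents: Iterable[str]) -> float:
    """Average wall-clock time, in milliseconds, spent segmenting one document."""
    docs = list(documents)
    if not docs:
        raise ValueError("at least one document is required")
    start = time.perf_counter()
    for doc in docs:
        segment(doc)
    elapsed_seconds = time.perf_counter() - start
    return 1000.0 * elapsed_seconds / len(docs)


def toy_segmenter(doc: str) -> list:
    """Placeholder segmenter: splits a document into lines."""
    return doc.split("\n")


toy_documents = ["Allergies:\nNone\n" * 50 for _ in range(20)]
latency = mean_latency_ms(toy_segmenter, toy_documents)
print(f"{latency:.3f} ms per document")
```

For a GPU model, a few warm-up calls before timing usually give more stable figures.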

Figure 6: Grouping of some key concepts towards constructing topics’ ontology

Finally, we observed that the detected groups of titles and subtitles can be exploited to automatically construct an ontology of topics. Figure 6 shows an example of an initial manual grouping of key concepts that helps construct such an ontology. The semantic vector representations of the detected titles can play an important role in organizing the topics’ ontology. The interest of building this ontology is to facilitate tasks like text segmentation and information retrieval. This idea will be investigated in future work.

VI Conclusion

In this paper, we proposed a new solution, TocBERT, for the hierarchical segmentation of medical reports. The segmentation task was formulated as a named entity recognition problem. TocBERT was initialized from a pretrained model, Bio-ClinicalBERT, and fine-tuned on the MIMIC-III discharge summaries corpus. The latter was semi-automatically labeled with titles and sub-titles. TocBERT showed very good results, considerably outperforming a carefully designed rule-based system in hierarchical segmentation. Particularly, it showed good performance in discriminating between titles and subtitles by leveraging their semantic representations and employing their context.

The semantic representations of the extracted titles can be exploited to automatically construct an ontology of topics. Such an ontology can further facilitate tasks like text segmentation and information retrieval. The investigation of this idea is left for future work.

Appendix A TocRegex

Table III shows the regular expressions list used in the proposed rule-based solution, TocRegex. In the first line, the four supported title-content patterns are defined while the remaining 12 lines list the full-title patterns.

References

  • [1] A. A. Alemi and P. Ginsparg, “Text segmentation based on semantic word embeddings,” arXiv, 2015. https://arxiv.longhoe.net/abs/1503.05543.
  • [2] D. Beeferman, A. Berger, and J. Lafferty, “Statistical models for text segmentation,” Machine Learning, vol. 34, pp. 177–210, 1999. https://link.springer.com/article/10.1023/A:1007506220214.
  • [3] G. Glavas and S. Somasundaran, “Two-level transformer and auxiliary coherence modeling for improved text segmentation,” arXiv, 2020. http://arxiv.longhoe.net/abs/2001.00891.
  • [4] O. Koshorek, A. Cohen, N. Mor, M. Rotman, and J. Berant, “Text segmentation as a supervised learning task,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (M. Walker, H. Ji, and A. Stent, eds.), (New Orleans, Louisiana), pp. 469–473, Association for Computational Linguistics, June 2018. https://aclanthology.org/N18-2075.
  • [5] L. Miculicich and B. Han, “Document summarization with text segmentation,” arXiv, 2023. https://arxiv.longhoe.net/abs/2301.08817.
  • [6] Y. Zhu, H. Yuan, S. Wang, J. Liu, W. Liu, C. Deng, H. Chen, Z. Dou, and J.-R. Wen, “Large language models for information retrieval: A survey,” arXiv, 2024. https://arxiv.longhoe.net/abs/2308.07107.
  • [7] E. Giguet, G. Lejeune, and J.-B. Tanguy, “Daniel@FinTOC’2 shared task: Title detection and structure extraction,” in Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation, 2020. https://aclanthology.org/2020.fnp-1.30/.
  • [8] I. Maarouf, J. Kang, A. Azzi, S. Bellato, M. Gan, and M. El-Haj, “The financial document structure extraction shared task (FinTOC2021): Fintoc 2021,” in 3rd Financial Narrative Processing Workshop (FNP 2021), pp. 111–119, Oct. 2021. https://aclanthology.org/2021.fnp-1.21.pdf.
  • [9] P. Cassotti, C. Musto, M. DeGemmis, G. Lekkas, and G. Semeraro, “swapuniba@fintoc2022: Fine-tuning pre-trained document image analysis model for title detection on the financial domain,” in Proceedings of the 4th Financial Narrative Processing Workshop, 2022. http://www.lrec-conf.org/proceedings/lrec2022/workshops/FNP/index.html.
  • [10] L. O’Gorman and R. Kasturi, eds., Document image analysis. Washington, DC, USA: IEEE Computer Society Press, 1997. https://cse.usf.edu/~r1k/DocumentImageAnalysis/DIA.pdf.
  • [11] A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, and R. G. Mark, “MIMIC-III, a freely accessible critical care database,” Scientific Data, vol. 3, no. 1, 2016. https://doi.org/10.1038/sdata.2016.35.
  • [12] S. Bird, E. Klein, and E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc., 2009. https://www.nltk.org/book/.
  • [13] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in Proceedings of the International Conference on Learning Representations ICLR 2013, (Scottsdale, Arizona, USA), Association for Computational Linguistics, July 2013. https://doi.org/10.48550/arXiv.1301.3781.
  • [14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proceedings of the Advances in Neural Information Processing Systems 26 (NIPS 2013), (Stateline, Nevada, USA), Association for Computational Linguistics, July 2013. https://proceedings.neurips.cc/paper_files/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf.
  • [15] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching Word Vectors with Subword Information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 06 2017. https://doi.org/10.1162/tacl_a_00051.
  • [16] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, p. 1735–1780, nov 1997. https://doi.org/10.1162/neco.1997.9.8.1735.
  • [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, (Red Hook, NY, USA), p. 6000–6010, Curran Associates Inc., 2017. https://dl.acm.org/doi/pdf/10.5555/3295222.3295349.
  • [18] M. Lukasik, B. Dadachev, K. Papineni, and G. Simões, “Text segmentation by cross segment attention,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. https://aclanthology.org/2020.emnlp-main.380.
  • [19] M. Saleh and S. Paquelet, “Anatomy of neural language models,” arXiv, 2024. https://arxiv.longhoe.net/abs/2401.03797.
  • [20] A. Solbiati, K. Heffernan, G. Damaskinos, S. Poddar, S. Modi, and J. Cali, “Unsupervised topic segmentation of meetings with BERT embeddings,” arXiv, 2021. https://arxiv.longhoe.net/abs/2106.12978.
  • [21] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” 2020. https://openreview.net/forum?id=SyxS0T4tvS.
  • [22] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (K. Inui, J. Jiang, V. Ng, and X. Wan, eds.), (Hong Kong, China), pp. 3982–3992, Association for Computational Linguistics, Nov. 2019. https://aclanthology.org/D19-1410.
  • [23] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv, 2019. https://arxiv.longhoe.net/abs/1810.04805.
  • [24] M. A. Hearst, “TextTiling: segmenting text into multi-paragraph subtopic passages,” Comput. Linguist., vol. 23, p. 33–64, mar 1997. https://dl.acm.org/doi/10.5555/972684.972687.
  • [25] M. Riedl and C. Biemann, “TopicTiling: A text segmentation algorithm based on LDA,” in Proceedings of ACL 2012 Student Research Workshop (J. C. K. Cheung, J. Hatori, C. Henriquez, and A. Irvine, eds.), (Jeju Island, Korea), pp. 37–42, Association for Computational Linguistics, July 2012. https://aclanthology.org/W12-3307.
  • [26] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” J. Mach. Learn. Res., vol. 3, p. 993–1022, mar 2003. https://dl.acm.org/doi/pdf/10.5555/944919.944937.
  • [27] M. Riedl and C. Biemann, “Text segmentation with topic models,” Journal for Language Technology and Computational Linguistics (JLCL), vol. 27, no. 47-69, pp. 13–24, 2012. https://www.inf.uni-hamburg.de/en/inst/ab/lt/publications/2012-riedletal-jlcl.pdf.
  • [28] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, eds.), vol. 28, Curran Associates, Inc., 2015. https://proceedings.neurips.cc/paper_files/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf.
  • [29] X. Zhong, J. Tang, and A. J. Yepes, “PubLayNet: Largest dataset ever for document layout analysis,” in 2019 International Conference on Document Analysis and Recognition (ICDAR), (Los Alamitos, CA, USA), pp. 1015–1022, IEEE Computer Society, sep 2019. https://doi.ieeecomputersociety.org/10.1109/ICDAR.2019.00166.
  • [30] M. Saleh, “ACUITEE : Annotation and curation user interface for terms extraction engines,” 2021. https://acuitee.labs.b-com.com/.
  • [31] E. Alsentzer, J. Murphy, W. Boag, W.-H. Weng, D. Jindi, T. Naumann, and M. McDermott, “Publicly available clinical BERT embeddings,” in Proceedings of the 2nd Clinical Natural Language Processing Workshop (A. Rumshisky, K. Roberts, S. Bethard, and T. Naumann, eds.), (Minneapolis, Minnesota, USA), pp. 72–78, Association for Computational Linguistics, June 2019. https://aclanthology.org/W19-1909.
  • [32] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” Technical report, vol. 3, no. 1, 2018. https://www.mikecaptain.com/resources/pdf/GPT-1.pdf.
  • [33] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. R. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. S. Corrado, M. Hughes, and J. Dean, “Google’s neural machine translation system: Bridging the gap between human and machine translation,” ArXiv, vol. abs/1609.08144, 2016. https://api.semanticscholar.org/CorpusID:3603249.