-
Adaptation of Biomedical and Clinical Pretrained Models to French Long Documents: A Comparative Study
Authors:
Adrien Bazoge,
Emmanuel Morin,
Beatrice Daille,
Pierre-Antoine Gourraud
Abstract:
Recently, pretrained language models based on BERT have been introduced for the French biomedical domain. Although these models have achieved state-of-the-art results on biomedical and clinical NLP tasks, they are constrained by a limited input sequence length of 512 tokens, which poses challenges when applied to clinical notes. In this paper, we present a comparative study of three adaptation str…
▽ More
Recently, pretrained language models based on BERT have been introduced for the French biomedical domain. Although these models have achieved state-of-the-art results on biomedical and clinical NLP tasks, they are constrained by a limited input sequence length of 512 tokens, which poses challenges when applied to clinical notes. In this paper, we present a comparative study of three adaptation strategies for long-sequence models, leveraging the Longformer architecture. We conducted evaluations of these models on 16 downstream tasks spanning both biomedical and clinical domains. Our findings reveal that further pre-training an English clinical model with French biomedical texts can outperform both converting a French biomedical BERT to the Longformer architecture and pre-training a French biomedical Longformer from scratch. The results underscore that long-sequence French biomedical models improve performance across most downstream tasks regardless of sequence length, but BERT based models remain the most efficient for named entity recognition tasks.
△ Less
Submitted 26 February, 2024;
originally announced February 2024.
-
DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain
Authors:
Yanis Labrak,
Adrien Bazoge,
Oumaima El Khettari,
Mickael Rouvier,
Pacome Constant dit Beaufils,
Natalia Grabar,
Beatrice Daille,
Solen Quiniou,
Emmanuel Morin,
Pierre-Antoine Gourraud,
Richard Dufour
Abstract:
The biomedical domain has sparked a significant interest in the field of Natural Language Processing (NLP), which has seen substantial advancements with pre-trained language models (PLMs). However, comparing these models has proven challenging due to variations in evaluation protocols across different models. A fair solution is to aggregate diverse downstream tasks into a benchmark, allowing for t…
▽ More
The biomedical domain has sparked a significant interest in the field of Natural Language Processing (NLP), which has seen substantial advancements with pre-trained language models (PLMs). However, comparing these models has proven challenging due to variations in evaluation protocols across different models. A fair solution is to aggregate diverse downstream tasks into a benchmark, allowing for the assessment of intrinsic PLMs qualities from various perspectives. Although still limited to few languages, this initiative has been undertaken in the biomedical field, notably English and Chinese. This limitation hampers the evaluation of the latest French biomedical models, as they are either assessed on a minimal number of tasks with non-standardized protocols or evaluated using general downstream tasks. To bridge this research gap and account for the unique sensitivities of French, we present the first-ever publicly available French biomedical language understanding benchmark called DrBenchmark. It encompasses 20 diversified tasks, including named-entity recognition, part-of-speech tagging, question-answering, semantic textual similarity, and classification. We evaluate 8 state-of-the-art pre-trained masked language models (MLMs) on general and biomedical-specific data, as well as English specific MLMs to assess their cross-lingual capabilities. Our experiments reveal that no single model excels across all tasks, while generalist models are sometimes still competitive.
△ Less
Submitted 20 February, 2024;
originally announced February 2024.
-
BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains
Authors:
Yanis Labrak,
Adrien Bazoge,
Emmanuel Morin,
Pierre-Antoine Gourraud,
Mickael Rouvier,
Richard Dufour
Abstract:
Large Language Models (LLMs) have demonstrated remarkable versatility in recent years, offering potential applications across specialized domains such as healthcare and medicine. Despite the availability of various open-source LLMs tailored for health contexts, adapting general-purpose LLMs to the medical domain presents significant challenges. In this paper, we introduce BioMistral, an open-sourc…
▽ More
Large Language Models (LLMs) have demonstrated remarkable versatility in recent years, offering potential applications across specialized domains such as healthcare and medicine. Despite the availability of various open-source LLMs tailored for health contexts, adapting general-purpose LLMs to the medical domain presents significant challenges. In this paper, we introduce BioMistral, an open-source LLM tailored for the biomedical domain, utilizing Mistral as its foundation model and further pre-trained on PubMed Central. We conduct a comprehensive evaluation of BioMistral on a benchmark comprising 10 established medical question-answering (QA) tasks in English. We also explore lightweight models obtained through quantization and model merging approaches. Our results demonstrate BioMistral's superior performance compared to existing open-source medical models and its competitive edge against proprietary counterparts. Finally, to address the limited availability of data beyond English and to assess the multilingual generalization of medical LLMs, we automatically translated and evaluated this benchmark into 7 other languages. This marks the first large-scale multilingual evaluation of LLMs in the medical domain. Datasets, multilingual evaluation benchmarks, scripts, and all the models obtained during our experiments are freely released.
△ Less
Submitted 9 June, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
FrenchMedMCQA: A French Multiple-Choice Question Answering Dataset for Medical domain
Authors:
Yanis Labrak,
Adrien Bazoge,
Richard Dufour,
Mickael Rouvier,
Emmanuel Morin,
Béatrice Daille,
Pierre-Antoine Gourraud
Abstract:
This paper introduces FrenchMedMCQA, the first publicly available Multiple-Choice Question Answering (MCQA) dataset in French for medical domain. It is composed of 3,105 questions taken from real exams of the French medical specialization diploma in pharmacy, mixing single and multiple answers. Each instance of the dataset contains an identifier, a question, five possible answers and their manual…
▽ More
This paper introduces FrenchMedMCQA, the first publicly available Multiple-Choice Question Answering (MCQA) dataset in French for medical domain. It is composed of 3,105 questions taken from real exams of the French medical specialization diploma in pharmacy, mixing single and multiple answers. Each instance of the dataset contains an identifier, a question, five possible answers and their manual correction(s). We also propose first baseline models to automatically process this MCQA task in order to report on the current performances and to highlight the difficulty of the task. A detailed analysis of the results showed that it is necessary to have representations adapted to the medical domain or to the MCQA task: in our case, English specialized models yielded better results than generic French ones, even though FrenchMedMCQA is in French. Corpus, models and tools are available online.
△ Less
Submitted 9 April, 2023;
originally announced April 2023.
-
DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains
Authors:
Yanis Labrak,
Adrien Bazoge,
Richard Dufour,
Mickael Rouvier,
Emmanuel Morin,
Béatrice Daille,
Pierre-Antoine Gourraud
Abstract:
In recent years, pre-trained language models (PLMs) achieve the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general domain data, specialized ones have emerged to more effectively treat specific domains. In this paper, we propose an original study of PLMs in the medical domain on French language. We compare, for the first time,…
▽ More
In recent years, pre-trained language models (PLMs) achieve the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general domain data, specialized ones have emerged to more effectively treat specific domains. In this paper, we propose an original study of PLMs in the medical domain on French language. We compare, for the first time, the performance of PLMs trained on both public data from the web and private data from healthcare establishments. We also evaluate different learning strategies on a set of biomedical tasks. In particular, we show that we can take advantage of already existing biomedical PLMs in a foreign language by further pre-train it on our targeted data. Finally, we release the first specialized PLMs for the biomedical field in French, called DrBERT, as well as the largest corpus of medical data under free license on which these models are trained.
△ Less
Submitted 4 May, 2023; v1 submitted 3 April, 2023;
originally announced April 2023.
-
Flood forecasting with machine learning models in an operational framework
Authors:
Sella Nevo,
Efrat Morin,
Adi Gerzi Rosenthal,
Asher Metzger,
Chen Barshai,
Dana Weitzner,
Dafi Voloshin,
Frederik Kratzert,
Gal Elidan,
Gideon Dror,
Gregory Begelman,
Grey Nearing,
Guy Shalev,
Hila Noga,
Ira Shavitt,
Liora Yuklea,
Moriah Royz,
Niv Giladi,
Nofar Peled Levi,
Ofir Reich,
Oren Gilon,
Ronnie Maor,
Shahar Timnat,
Tal Shechter,
Vladimir Anisimov
, et al. (6 additional authors not shown)
Abstract:
The operational flood forecasting system by Google was developed to provide accurate real-time flood warnings to agencies and the public, with a focus on riverine floods in large, gauged rivers. It became operational in 2018 and has since expanded geographically. This forecasting system consists of four subsystems: data validation, stage forecasting, inundation modeling, and alert distribution. Ma…
▽ More
The operational flood forecasting system by Google was developed to provide accurate real-time flood warnings to agencies and the public, with a focus on riverine floods in large, gauged rivers. It became operational in 2018 and has since expanded geographically. This forecasting system consists of four subsystems: data validation, stage forecasting, inundation modeling, and alert distribution. Machine learning is used for two of the subsystems. Stage forecasting is modeled with the Long Short-Term Memory (LSTM) networks and the Linear models. Flood inundation is computed with the Thresholding and the Manifold models, where the former computes inundation extent and the latter computes both inundation extent and depth. The Manifold model, presented here for the first time, provides a machine-learning alternative to hydraulic modeling of flood inundation. When evaluated on historical data, all models achieve sufficiently high-performance metrics for operational use. The LSTM showed higher skills than the Linear model, while the Thresholding and Manifold models achieved similar performance metrics for modeling inundation extent. During the 2021 monsoon season, the flood warning system was operational in India and Bangladesh, covering flood-prone regions around rivers with a total area of 287,000 km2, home to more than 350M people. More than 100M flood alerts were sent to affected populations, to relevant authorities, and to emergency organizations. Current and future work on the system includes extending coverage to additional flood-prone locations, as well as improving modeling capabilities and accuracy.
△ Less
Submitted 4 November, 2021;
originally announced November 2021.
-
Conditional Frechet Inception Distance
Authors:
Michael Soloveitchik,
Tzvi Diskin,
Efrat Morin,
Ami Wiesel
Abstract:
We consider distance functions between conditional distributions. We focus on the Wasserstein metric and its Gaussian case known as the Frechet Inception Distance (FID). We develop conditional versions of these metrics, analyze their relations and provide a closed form solution to the conditional FID (CFID) metric. We numerically compare the metrics in the context of performance evaluation of mode…
▽ More
We consider distance functions between conditional distributions. We focus on the Wasserstein metric and its Gaussian case known as the Frechet Inception Distance (FID). We develop conditional versions of these metrics, analyze their relations and provide a closed form solution to the conditional FID (CFID) metric. We numerically compare the metrics in the context of performance evaluation of modern conditional generative models. Our results show the advantages of CFID compared to the classical FID and mean squared error (MSE) measures. In contrast to FID, CFID is useful in identifying failures where realistic outputs which are not related to their inputs are generated. On the other hand, compared to MSE, CFID is useful in identifying failures where a single realistic output is generated even though there is a diverse set of equally probable outputs.
△ Less
Submitted 28 February, 2022; v1 submitted 21 March, 2021;
originally announced March 2021.
-
Recent Advances in End-to-End Spoken Language Understanding
Authors:
Natalia Tomashenko,
Antoine Caubriere,
Yannick Esteve,
Antoine Laurent,
Emmanuel Morin
Abstract:
This work investigates spoken language understanding (SLU) systems in the scenario when the semantic information is extracted directly from the speech signal by means of a single end-to-end neural network model. Two SLU tasks are considered: named entity recognition (NER) and semantic slot filling (SF). For these tasks, in order to improve the model performance, we explore various techniques inclu…
▽ More
This work investigates spoken language understanding (SLU) systems in the scenario when the semantic information is extracted directly from the speech signal by means of a single end-to-end neural network model. Two SLU tasks are considered: named entity recognition (NER) and semantic slot filling (SF). For these tasks, in order to improve the model performance, we explore various techniques including speaker adaptation, a modification of the connectionist temporal classification (CTC) training criterion, and sequential pretraining.
△ Less
Submitted 29 September, 2019;
originally announced September 2019.
-
Deep Retrieval-Based Dialogue Systems: A Short Review
Authors:
Basma El Amel Boussaha,
Nicolas Hernandez,
Christine Jacquin,
Emmanuel Morin
Abstract:
Building dialogue systems that naturally converse with humans is being an attractive and an active research domain. Multiple systems are being designed everyday and several datasets are being available. For this reason, it is being hard to keep an up-to-date state-of-the-art. In this work, we present the latest and most relevant retrieval-based dialogue systems and the available datasets used to b…
▽ More
Building dialogue systems that naturally converse with humans is being an attractive and an active research domain. Multiple systems are being designed everyday and several datasets are being available. For this reason, it is being hard to keep an up-to-date state-of-the-art. In this work, we present the latest and most relevant retrieval-based dialogue systems and the available datasets used to build and evaluate them. We discuss their limitations and provide insights and guidelines for future work.
△ Less
Submitted 30 July, 2019;
originally announced July 2019.
-
Curriculum-based transfer learning for an effective end-to-end spoken language understanding and domain portability
Authors:
Antoine Caubrière,
Natalia Tomashenko,
Antoine Laurent,
Emmanuel Morin,
Nathalie Camelin,
Yannick Estève
Abstract:
We present an end-to-end approach to extract semantic concepts directly from the speech audio signal. To overcome the lack of data available for this spoken language understanding approach, we investigate the use of a transfer learning strategy based on the principles of curriculum learning. This approach allows us to exploit out-of-domain data that can help to prepare a fully neural architecture.…
▽ More
We present an end-to-end approach to extract semantic concepts directly from the speech audio signal. To overcome the lack of data available for this spoken language understanding approach, we investigate the use of a transfer learning strategy based on the principles of curriculum learning. This approach allows us to exploit out-of-domain data that can help to prepare a fully neural architecture. Experiments are carried out on the French MEDIA and PORTMEDIA corpora and show that this end-to-end SLU approach reaches the best results ever published on this task. We compare our approach to a classical pipeline approach that uses ASR, POS tagging, lemmatizer, chunker... and other NLP tools that aim to enrich ASR outputs that feed an SLU text to concepts system. Last, we explore the promising capacity of our end-to-end SLU approach to address the problem of domain portability.
△ Less
Submitted 18 June, 2019;
originally announced June 2019.
-
End-to-end named entity extraction from speech
Authors:
Sahar Ghannay,
Antoine Caubrière,
Yannick Estève,
Antoine Laurent,
Emmanuel Morin
Abstract:
Named entity recognition (NER) is among SLU tasks that usually extract semantic information from textual documents. Until now, NER from speech is made through a pipeline process that consists in processing first an automatic speech recognition (ASR) on the audio and then processing a NER on the ASR outputs. Such approach has some disadvantages (error propagation, metric to tune ASR systems sub-opt…
▽ More
Named entity recognition (NER) is among SLU tasks that usually extract semantic information from textual documents. Until now, NER from speech is made through a pipeline process that consists in processing first an automatic speech recognition (ASR) on the audio and then processing a NER on the ASR outputs. Such approach has some disadvantages (error propagation, metric to tune ASR systems sub-optimal in regards to the final task, reduced space search at the ASR output level...) and it is known that more integrated approaches outperform sequential ones, when they can be applied. In this paper, we present a first study of end-to-end approach that directly extracts named entities from speech, though a unique neural architecture. On a such way, a joint optimization is able for both ASR and NER. Experiments are carried on French data easily accessible, composed of data distributed in several evaluation campaign. Experimental results show that this end-to-end approach provides better results (F-measure=0.69 on test data) than a classical pipeline approach to detect named entity categories (F-measure=0.65).
△ Less
Submitted 30 May, 2018;
originally announced May 2018.
-
Tools for Terminology Processing
Authors:
C. Enguehard,
B. Daille,
E. Morin
Abstract:
Automatic terminology processing appeared 10 years ago when electronic corpora became widely available. Such processing may be statistically or linguistically based and produces terminology resources that can be used in a number of applications : indexing, information retrieval, technology watch, etc. We present the tools that have been developed in the IRIN Institute. They all take as input texts…
▽ More
Automatic terminology processing appeared 10 years ago when electronic corpora became widely available. Such processing may be statistically or linguistically based and produces terminology resources that can be used in a number of applications : indexing, information retrieval, technology watch, etc. We present the tools that have been developed in the IRIN Institute. They all take as input texts (or collection of texts) and reflect different states of terminology processing: term acquisition, term recognition and term structuring.
△ Less
Submitted 14 December, 2014;
originally announced December 2014.
-
Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking
Authors:
Estelle Delpech,
Béatrice Daille,
Emmanuel Morin,
Claire Lemaire
Abstract:
This paper proposes a method for extracting translations of morphologically constructed terms from comparable corpora. The method is based on compositional translation and exploits translation equivalences at the morpheme-level, which allows for the generation of "fertile" translations (translation pairs in which the target term has more words than the source term). Ranking methods relying on corp…
▽ More
This paper proposes a method for extracting translations of morphologically constructed terms from comparable corpora. The method is based on compositional translation and exploits translation equivalences at the morpheme-level, which allows for the generation of "fertile" translations (translation pairs in which the target term has more words than the source term). Ranking methods relying on corpus-based and translation-based features are used to select the best candidate translation. We obtain an average precision of 91% on the Top1 candidate translation. The method was tested on two language pairs (English-French and English-German) and with a small specialized comparable corpora (400k words per language).
△ Less
Submitted 21 October, 2012;
originally announced October 2012.
-
Identification of Fertile Translations in Medical Comparable Corpora: a Morpho-Compositional Approach
Authors:
Estelle Delpech,
Béatrice Daille,
Emmanuel Morin,
Claire Lemaire
Abstract:
This paper defines a method for lexicon in the biomedical domain from comparable corpora. The method is based on compositional translation and exploits morpheme-level translation equivalences. It can generate translations for a large variety of morphologically constructed words and can also generate 'fertile' translations. We show that fertile translations increase the overall quality of the extra…
▽ More
This paper defines a method for lexicon in the biomedical domain from comparable corpora. The method is based on compositional translation and exploits morpheme-level translation equivalences. It can generate translations for a large variety of morphologically constructed words and can also generate 'fertile' translations. We show that fertile translations increase the overall quality of the extracted lexicon for English to French translation.
△ Less
Submitted 11 September, 2012;
originally announced September 2012.
-
Vers la reconnaissance de mini-messages manuscrits
Authors:
Emmanuel Prochasson,
Emmanuel Morin,
Christian Viard-Gaudin
Abstract:
Handwriting is an alternative method for entering texts which composed Short Message Services. However, a whole new language features the texts which are produced. They include for instance abbreviations and other consonantal writing which sprung up for time saving and fashion. We have collected and processed a significant number of such handwritten SMS, and used various strategies to tackle thi…
▽ More
Handwriting is an alternative method for entering texts which composed Short Message Services. However, a whole new language features the texts which are produced. They include for instance abbreviations and other consonantal writing which sprung up for time saving and fashion. We have collected and processed a significant number of such handwritten SMS, and used various strategies to tackle this challenging area of handwriting recognition. We proposed to study more specifically three different phenomena: consonant skeleton, rebus, and phonetic writing. For each of them, we compare the rough results produced by a standard recognition system with those obtained when using a specific language model to take care of them.
△ Less
Submitted 16 September, 2009;
originally announced September 2009.
-
Language Models for Handwritten Short Message Services
Authors:
Emmanuel Ep Prochasson,
Christian Viard-Gaudin,
Emmanuel Morin
Abstract:
Handwriting is an alternative method for entering texts composing Short Message Services. However, a whole new language features the texts which are produced. They include for instance abbreviations and other consonantal writing which sprung up for time saving and fashion. We have collected and processed a significant number of such handwriting SMS, and used various strategies to tackle this cha…
▽ More
Handwriting is an alternative method for entering texts composing Short Message Services. However, a whole new language features the texts which are produced. They include for instance abbreviations and other consonantal writing which sprung up for time saving and fashion. We have collected and processed a significant number of such handwriting SMS, and used various strategies to tackle this challenging area of handwriting recognition. We proposed to study more specifically three different phenomena: consonant skeleton, rebus, and phonetic writing. For each of them, we compare the rough results produced by a standard recognition system with those obtained when using a specific language model.
△ Less
Submitted 16 September, 2009;
originally announced September 2009.
-
Restricted Complexity, General Complexity
Authors:
Edgar Morin
Abstract:
Why has the problematic of complexity appeared so late? And why would it be justified?
Why has the problematic of complexity appeared so late? And why would it be justified?
△ Less
Submitted 10 October, 2006;
originally announced October 2006.