Search | arXiv e-print repository

MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

Authors: Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, Antoine Bosselut

Abstract: Large language models (LLMs) can potentially democratize access to medical knowledge. While many efforts have been made to harness and improve LLMs' medical knowledge and reasoning capacities, the resulting models are either closed-source (e.g., PaLM, GPT-4) or limited in scale (<= 13B parameters), which restricts their abilities. In this work, we improve access to large-scale medical LLMs by releasing MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain. MEDITRON builds on Llama-2 (through our adaptation of Nvidia's Megatron-LM distributed trainer), and extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally-recognized medical guidelines. Evaluations using four major medical benchmarks show significant performance gains over several state-of-the-art baselines before and after task-specific finetuning. Overall, MEDITRON achieves a 6% absolute performance gain over the best public baseline in its parameter class and 3% over the strongest baseline we finetuned from Llama-2. Compared to closed-source LLMs, MEDITRON-70B outperforms GPT-3.5 and Med-PaLM and is within 5% of GPT-4 and 10% of Med-PaLM-2. We release our code for curating the medical pretraining corpus and the MEDITRON model weights to drive open-source development of more capable medical LLMs.

Submitted 27 November, 2023; originally announced November 2023.

arXiv:2210.11269 [pdf, other]

Inference from Real-World Sparse Measurements

Authors: Arnaud Pannatier, Kyle Matoba, François Fleuret

Abstract: Real-world problems often involve complex and unstructured sets of measurements, which occur when sensors are sparsely placed in either space or time. Being able to model this irregular spatiotemporal data and extract meaningful forecasts is crucial. Deep learning architectures capable of processing sets of measurements with positions varying from set to set, and extracting readouts anywhere are methodologically difficult. Current state-of-the-art models are graph neural networks and require domain-specific knowledge for proper setup. We propose an attention-based model focused on robustness and practical applicability, with two key design contributions. First, we adopt a ViT-like transformer that takes both context points and read-out positions as inputs, eliminating the need for an encoder-decoder structure. Second, we use a unified method for encoding both context and read-out positions. This approach is intentionally straightforward and integrates well with other systems. Compared to existing approaches, our model is simpler, requires less specialized knowledge, and does not suffer from a problematic bottleneck effect, all of which contribute to superior performance. We conduct in-depth ablation studies that characterize this problematic bottleneck in the latent representations of alternative models that inhibit information utilization and impede training efficiency. We also perform experiments across various problem domains, including high-altitude wind nowcasting, two-day weather forecasting, fluid dynamics, and heat diffusion. Our attention-based model consistently outperforms state-of-the-art models in handling irregularly sampled data. Notably, our model reduces the root mean square error (RMSE) for wind nowcasting from 9.24 to 7.98 and for heat diffusion tasks from 0.126 to 0.084.

Submitted 15 April, 2024; v1 submitted 20 October, 2022; originally announced October 2022.

Comments: 27 pages, 12 figures, Published at TMLR https://openreview.net/forum?id=y9IDfODRns

arXiv:2206.07144 [pdf, other]

Efficiently Training Low-Curvature Neural Networks

Authors: Suraj Srinivas, Kyle Matoba, Himabindu Lakkaraju, Francois Fleuret

Abstract: The highly non-linear nature of deep neural networks causes them to be susceptible to adversarial examples and have unstable gradients which hinders interpretability. However, existing methods to solve these issues, such as adversarial training, are expensive and often sacrifice predictive accuracy. In this work, we consider curvature, which is a mathematical quantity which encodes the degree of non-linearity. Using this, we demonstrate low-curvature neural networks (LCNNs) that obtain drastically lower curvature than standard models while exhibiting similar predictive performance, which leads to improved robustness and stable gradients, with only a marginally increased training time. To achieve this, we minimize a data-independent upper bound on the curvature of a neural network, which decomposes overall curvature in terms of curvatures and slopes of its constituent layers. To efficiently minimize this bound, we introduce two novel architectural components: first, a non-linearity called centered-softplus that is a stable variant of the softplus non-linearity, and second, a Lipschitz-constrained batch normalization layer. Our experiments show that LCNNs have lower curvature, more stable gradients and increased off-the-shelf adversarial robustness when compared to their standard high-curvature counterparts, all without affecting predictive performance. Our approach is easy to use and can be readily incorporated into existing neural network models.

Submitted 10 January, 2023; v1 submitted 14 June, 2022; originally announced June 2022.

Comments: NeurIPS 2022

arXiv:2203.01016 [pdf, other]

The Theoretical Expressiveness of Maxpooling

Authors: Kyle Matoba, Nikolaos Dimitriadis, François Fleuret

Abstract: Over the decade since deep neural networks became state of the art image classifiers there has been a tendency towards less use of max pooling: the function that takes the largest of nearby pixels in an image. Since max pooling featured prominently in earlier generations of image classifiers, we wish to understand this trend, and whether it is justified. We develop a theoretical framework analyzing ReLU based approximations to max pooling, and prove a sense in which max pooling cannot be efficiently replicated using ReLU activations. We analyze the error of a class of optimal approximations, and find that whilst the error can be made exponentially small in the kernel size, doing so requires an exponentially complex approximation. Our work gives a theoretical basis for understanding the trend away from max pooling in newer architectures. We conclude that the main cause of a difference between max pooling and an optimal approximation, a prevalent large difference between the max and other values within pools, can be overcome with other architectural decisions, or is not prevalent in natural images.

Submitted 2 March, 2022; originally announced March 2022.

Comments: 31 pages, 6 figures
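
The abstract above describes max pooling as the function that takes the largest of nearby pixels in an image. As a minimal sketch of that operation only (not the paper's ReLU-based approximations), the following plain-Python function applies non-overlapping pooling to a small grid. The function name `max_pool_2d`, the list-of-lists image representation, and the choice of non-overlapping windows are assumptions made for illustration.

```python
def max_pool_2d(image, k):
    """Non-overlapping k x k max pooling over a list-of-lists image.

    Hypothetical helper for illustration: each output value is the
    largest pixel inside one k x k window of the input.
    """
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r + dr][c + dc] for dr in range(k) for dc in range(k))
            for c in range(0, cols - k + 1, k)
        ]
        for r in range(0, rows - k + 1, k)
    ]


image = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 1],
]
pooled = max_pool_2d(image, 2)
print(pooled)  # [[4, 5], [3, 7]]
```

Each 2 x 2 window collapses to its single largest value, so a 4 x 4 input yields a 2 x 2 output.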

arXiv:2101.12509 [pdf, ps, other]

Challenges for Using Impact Regularizers to Avoid Negative Side Effects

Authors: David Lindner, Kyle Matoba, Alexander Meulemans

Abstract: Designing reward functions for reinforcement learning is difficult: besides specifying which behavior is rewarded for a task, the reward also has to discourage undesired outcomes. Misspecified reward functions can lead to unintended negative side effects, and overall unsafe behavior. To overcome this problem, recent work proposed to augment the specified reward function with an impact regularizer that discourages behavior that has a big impact on the environment. Although initial results with impact regularizers seem promising in mitigating some types of side effects, important challenges remain. In this paper, we examine the main current challenges of impact regularizers and relate them to fundamental design decisions. We discuss in detail which challenges recent approaches address and which remain unsolved. Finally, we explore promising directions to overcome the unsolved challenges in preventing negative side effects with impact regularizers.

Submitted 23 February, 2021; v1 submitted 29 January, 2021; originally announced January 2021.

Comments: Presented at the SafeAI workshop at AAAI 2021

arXiv:1109.3873 [pdf, other]

A Computable Figure of Merit for Quasi-Monte Carlo Point Sets

Authors: Makoto Matsumoto, Mutsuo Saito, Kyle Matoba

Abstract: Let $\mathcal{P} \subset [0,1)^S$ be a finite point set of cardinality $N$ in an $S$-dimensional cube, and let $f:[0,1)^S \to \mathbb{R}$ be an integrable function. A QMC integration of $f$ by $\mathcal{P}$ is the average of values of $f$ at each point in $\mathcal{P}$, which approximates the integration of $f$ over the cube. Assume that $\mathcal{P}$ is constructed from an $\mathbb{F}_2$-vector space $P \subset (\mathbb{F}_2^n)^S$ by means of a digital net with $n$-digit precision. As an $n$-digit discretized version of Josef Dick's method, we introduce Walsh figure of merit (WAFOM) $\textnormal{WF}(P)$ of $P$, which satisfies a Koksma-Hlawka type inequality, namely, QMC integration error is bounded by $C_{S,n}||f||_n \textnormal{WF}(P)$ under $n$-smoothness of $f$, where $C_{S,n}$ is a constant depending only on $S,n$. We show a Fourier inversion formula for $\textnormal{WF}(P)$ which is computable in $O(n SN)$ steps. This effectiveness enables us a random search for $P$ with small value of $\textnormal{WF}(P)$, which would be difficult for other figures of merit such as discrepancy. From an analogy to coding theory, we expect that random search may find better point sets than mathematical constructions. In fact, a naïve search finds point sets $P$ with small $\textnormal{WF}(P)$. In experiments, we show better performance of these point sets in QMC integration than widely used QMC rules. We show some experimental evidence on the effectiveness of our point sets to even non-smooth integrands appearing in finance.

Submitted 20 February, 2012; v1 submitted 18 September, 2011; originally announced September 2011.
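
The abstract above defines QMC integration of $f$ by $\mathcal{P}$ as the average of the values of $f$ over the points of $\mathcal{P}$. A minimal sketch of that averaging step follows. The point set used here is a two-dimensional Hammersley set built from the base-2 van der Corput sequence, chosen only as a familiar stand-in; it is not one of the WAFOM-searched point sets from the paper, and the names `van_der_corput`, `hammersley_2d`, and `qmc_integrate` are hypothetical.

```python
def van_der_corput(i, base=2):
    """Radical inverse of the integer i in the given base, a value in [0, 1)."""
    x, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        x += digit / denom
    return x


def hammersley_2d(n_points):
    """Illustrative 2D point set in [0, 1)^2 (an assumption, not the paper's)."""
    return [(i / n_points, van_der_corput(i)) for i in range(n_points)]


def qmc_integrate(f, points):
    """QMC estimate: the plain average of f over the point set."""
    return sum(f(*p) for p in points) / len(points)


points = hammersley_2d(2 ** 10)
estimate = qmc_integrate(lambda x, y: x * y, points)
print(estimate)  # should be close to the exact integral, 0.25
```

The estimator itself is just an unweighted mean, so the quality of the approximation depends entirely on how the point set is chosen, which is the quantity a figure of merit such as WAFOM is meant to score.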

Showing 1–6 of 6 results for author: Matoba, K