
Showing 1–12 of 12 results for author: Pagliardini, M

  1. arXiv:2402.02622  [pdf, other]

    cs.CL cs.LG

    DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

    Authors: Matteo Pagliardini, Amirkeivan Mohtashami, Francois Fleuret, Martin Jaggi

    Abstract: The transformer architecture by Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding. We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size -- adding a few thousand parameters for large-scale models in the 100B param…

    Submitted 21 March, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

  2. arXiv:2311.16079  [pdf, other]

    cs.CL cs.AI cs.LG

    MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

    Authors: Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, Antoine Bosselut

    Abstract: Large language models (LLMs) can potentially democratize access to medical knowledge. While many efforts have been made to harness and improve LLMs' medical knowledge and reasoning capacities, the resulting models are either closed-source (e.g., PaLM, GPT-4) or limited in scale (<= 13B parameters), which restricts their abilities. In this work, we improve access to large-scale medical LLMs by rele…

    Submitted 27 November, 2023; originally announced November 2023.

  3. arXiv:2310.15393  [pdf, other]

    cs.LG cs.AI cs.CL

    DoGE: Domain Reweighting with Generalization Estimation

    Authors: Simin Fan, Matteo Pagliardini, Martin Jaggi

    Abstract: The coverage and composition of the pretraining data significantly impacts the generalization ability of Large Language Models (LLMs). Despite its importance, recent LLMs still rely on heuristics and trial and error to increase or reduce the influence of data-domains. We propose DOmain reweighting with Generalization Estimation (DoGE), which optimizes the probability of sampling from each domain (…

    Submitted 5 February, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

  4. arXiv:2310.10845  [pdf, other]

    cs.CL cs.LG

    CoTFormer: More Tokens With Attention Make Up For Less Depth

    Authors: Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi

    Abstract: The race to continually develop ever larger and deeper foundational models is underway. However, techniques like the Chain-of-Thought (CoT) method continue to play a pivotal role in achieving optimal downstream performance. In this work, we establish an approximate parallel between using chain-of-thought and employing a deeper transformer. Building on this insight, we introduce CoTFormer, a transf…

    Submitted 16 October, 2023; originally announced October 2023.

  5. arXiv:2306.01160  [pdf, other]

    cs.LG cs.AI cs.CL

    Faster Causal Attention Over Large Sequences Through Sparse Flash Attention

    Authors: Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret

    Abstract: Transformer-based language models have found many diverse applications requiring them to process sequences of increasing length. For these applications, the causal self-attention -- which is the only component scaling quadratically w.r.t. the sequence length -- becomes a central concern. While many works have proposed schemes to sparsify the attention patterns and reduce the computational overhead…

    Submitted 1 June, 2023; originally announced June 2023.

  6. arXiv:2210.15659  [pdf, other]

    stat.ML cs.LG

    A Primal-dual Approach for Solving Variational Inequalities with General-form Constraints

    Authors: Tatjana Chavdarova, Matteo Pagliardini, Tong Yang, Michael I. Jordan

    Abstract: Yang et al. (2023) recently addressed the open problem of solving Variational Inequalities (VIs) with equality and inequality constraints through a first-order gradient method. However, the proposed primal-dual method called ACVI is applicable when we can compute analytic solutions of its subproblems; thus, the general case remains an open problem. In this paper, we adopt a warm-starting technique…

    Submitted 29 March, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: arXiv admin note: text overlap with arXiv:2206.10575

  7. arXiv:2202.05737  [pdf, other]

    cs.LG

    Improving Generalization via Uncertainty Driven Perturbations

    Authors: Matteo Pagliardini, Gilberto Manunza, Martin Jaggi, Michael I. Jordan, Tatjana Chavdarova

    Abstract: Recently Shah et al., 2020 pointed out the pitfalls of the simplicity bias - the tendency of gradient-based algorithms to learn simple models - which include the model's high sensitivity to small input perturbations, as well as sub-optimal margins. In particular, while Stochastic Gradient Descent yields max-margin boundary on linear models, such guarantee does not extend to non-linear models. To m…

    Submitted 28 February, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

  8. arXiv:2202.04414  [pdf, other]

    cs.LG

    Agree to Disagree: Diversity through Disagreement for Better Transferability

    Authors: Matteo Pagliardini, Martin Jaggi, François Fleuret, Sai Praneeth Karimireddy

    Abstract: Gradient-based learning algorithms have an implicit simplicity bias which in effect can limit the diversity of predictors being sampled by the learning procedure. This behavior can hinder the transferability of trained models by (i) favoring the learning of simpler but spurious features -- present in the training data but absent from the test data -- and (ii) by only leveraging a small subset of p…

    Submitted 23 November, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

    Comments: 23 pages, 17 figures

  9. arXiv:2112.05000  [pdf, other]

    cs.LG stat.ML

    The Peril of Popular Deep Learning Uncertainty Estimation Methods

    Authors: Yehao Liu, Matteo Pagliardini, Tatjana Chavdarova, Sebastian U. Stich

    Abstract: Uncertainty estimation (UE) techniques -- such as the Gaussian process (GP), Bayesian neural networks (BNN), Monte Carlo dropout (MCDropout) -- aim to improve the interpretability of machine learning models by assigning an estimated uncertainty value to each of their prediction outputs. However, since too high uncertainty estimates can have fatal consequences in practice, this paper analyzes the a…

    Submitted 9 December, 2021; originally announced December 2021.

    Comments: Presented at the Bayesian Deep Learning Workshop at NeurIPS 2021

  10. arXiv:2006.14567  [pdf, other]

    stat.ML cs.LG

    Taming GANs with Lookahead-Minmax

    Authors: Tatjana Chavdarova, Matteo Pagliardini, Sebastian U. Stich, Francois Fleuret, Martin Jaggi

    Abstract: Generative Adversarial Networks are notoriously challenging to train. The underlying minmax optimization is highly susceptible to the variance of the stochastic gradient and the rotational component of the associated game vector field. To tackle these challenges, we propose the Lookahead algorithm for minmax optimization, originally developed for single objective minimization only. The backtrackin…

    Submitted 23 June, 2021; v1 submitted 25 June, 2020; originally announced June 2020.

    Journal ref: ICLR 2021

  11. arXiv:1904.05033  [pdf, ps, other]

    cs.CL cs.AI cs.IR cs.LG

    Better Word Embeddings by Disentangling Contextual n-Gram Information

    Authors: Prakhar Gupta, Matteo Pagliardini, Martin Jaggi

    Abstract: Pre-trained word vectors are ubiquitous in Natural Language Processing applications. In this paper, we show how training word embeddings jointly with bigram and even trigram embeddings, results in improved unigram embeddings. We claim that training word embeddings along with higher n-gram embeddings helps in the removal of the contextual information from the unigrams, resulting in better stand-alo…

    Submitted 10 April, 2019; originally announced April 2019.

    Comments: NAACL 2019

  12. arXiv:1703.02507  [pdf, other]

    cs.CL cs.AI cs.IR

    Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features

    Authors: Matteo Pagliardini, Prakhar Gupta, Martin Jaggi

    Abstract: The recent tremendous success of unsupervised word embeddings in a multitude of applications raises the obvious question if similar methods could be derived to improve embeddings (i.e. semantic representations) of word sequences as well. We present a simple but efficient unsupervised objective to train distributed representations of sentences. Our method outperforms the state-of-the-art unsupervis…

    Submitted 28 December, 2018; v1 submitted 7 March, 2017; originally announced March 2017.

    Comments: NAACL 2018

    ACM Class: I.2.7

    Journal ref: NAACL 2018 - Conference of the North American Chapter of the Association for Computational Linguistics, pages 528-540