Showing 1–10 of 10 results for author: Simig, D

  1. arXiv:2308.12284

    cs.CL cs.AI cs.LG

    D4: Improving LLM Pretraining via Document De-Duplication and Diversification

    Authors: Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos

    Abstract: Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While training on ever-larger portions of the internet leads to consistent performance improvements, the size of these improvements diminishes with scale, and there ha…

    Submitted 23 August, 2023; originally announced August 2023.

  2. arXiv:2306.15091

    cs.CL

    Understanding In-Context Learning via Supportive Pretraining Data

    Authors: Xiaochuang Han, Daniel Simig, Todor Mihaylov, Yulia Tsvetkov, Asli Celikyilmaz, Tianlu Wang

    Abstract: In-context learning (ICL) improves language models' performance on a variety of NLP tasks by simply demonstrating a handful of examples at inference time. It is not well understood why ICL ability emerges, as the model has never been specifically trained on such demonstrations. Unlike prior work that explores implicit mechanisms behind ICL, we study ICL via investigating the pretraining data. Spec…

    Submitted 26 June, 2023; originally announced June 2023.

    Comments: ACL 2023

  3. arXiv:2305.14588

    cs.CL cs.LG

    Evaluating end-to-end entity linking on domain-specific knowledge bases: Learning about ancient technologies from museum collections

    Authors: Sebastian Cadavid-Sanchez, Khalil Kacem, Rafael Aparecido Martins Frade, Johannes Boehm, Thomas Chaney, Danial Lashkari, Daniel Simig

    Abstract: To study social, economic, and historical questions, researchers in the social sciences and humanities have started to use increasingly large unstructured textual datasets. While recent advances in NLP provide many tools to efficiently process such data, most existing approaches rely on generic solutions whose performance and suitability for domain-specific tasks is not well understood. This work…

    Submitted 23 May, 2023; originally announced May 2023.

  4. arXiv:2305.07185

    cs.LG

    MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

    Authors: Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis

    Abstract: Autoregressive transformers are spectacular models for short sequences but scale poorly to long sequences such as high-resolution images, podcasts, code, or books. We propose Megabyte, a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes. Megabyte segments sequences into patches and uses a local submodel within patches and a glo…

    Submitted 19 May, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

  5. arXiv:2303.09540

    cs.LG cs.AI cs.CV

    SemDeDup: Data-efficient learning at web-scale through semantic deduplication

    Authors: Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, Ari S. Morcos

    Abstract: Progress in machine learning has been driven in large part by massive increases in data. However, large web-scale datasets such as LAION are largely uncurated beyond searches for exact duplicates, potentially leaving much redundancy. Here, we introduce SemDeDup, a method which leverages embeddings from pre-trained models to identify and remove semantic duplicates: data pairs which are semantically…

    Submitted 22 March, 2023; v1 submitted 16 March, 2023; originally announced March 2023.

  6. arXiv:2212.12017

    cs.CL

    OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

    Authors: Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, Xian Li, Brian O'Horo, Gabriel Pereyra, Jeff Wang, Christopher Dewan, Asli Celikyilmaz, Luke Zettlemoyer, Ves Stoyanov

    Abstract: Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diver…

    Submitted 30 January, 2023; v1 submitted 22 December, 2022; originally announced December 2022.

    Comments: 56 pages. v2->v3: fix OPT-30B evaluation results across benchmarks (previously we reported lower performance of this model due to an evaluation pipeline bug)

  7. arXiv:2210.01734

    cs.CL cs.LG

    Text Characterization Toolkit

    Authors: Daniel Simig, Tianlu Wang, Verna Dankers, Peter Henderson, Khuyagbaatar Batsuren, Dieuwke Hupkes, Mona Diab

    Abstract: In NLP, models are usually evaluated by reporting single-number performance scores on a number of readily available benchmarks, without much deeper analysis. Here, we argue that - especially given the well-known fact that benchmarks often contain biases, artefacts, and spurious correlations - deeper results analysis should become the de-facto standard when presenting new models or benchmarks. We p…

    Submitted 4 October, 2022; originally announced October 2022.

  8. arXiv:2205.05812

    cs.CL cs.LG

    Open Vocabulary Extreme Classification Using Generative Models

    Authors: Daniel Simig, Fabio Petroni, Pouya Yanki, Kashyap Popat, Christina Du, Sebastian Riedel, Majid Yazdani

    Abstract: The extreme multi-label classification (XMC) task aims at tagging content with a subset of labels from an extremely large label set. The label vocabulary is typically defined in advance by domain experts and assumed to capture all necessary tags. However in real world scenarios this label set, although large, is often incomplete and experts frequently need to refine it. To develop systems that sim…

    Submitted 11 May, 2022; originally announced May 2022.

  9. arXiv:2205.01068

    cs.CL cs.LG

    OPT: Open Pre-trained Transformer Language Models

    Authors: Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer

    Abstract: Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open…

    Submitted 21 June, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

  10. arXiv:2112.10668

    cs.CL cs.AI

    Few-shot Learning with Multilingual Language Models

    Authors: Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li

    Abstract: Large-scale generative language models such as GPT-3 are competitive few-shot learners. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual generative language models on a corpus covering a diverse set of languages, and study t…

    Submitted 10 November, 2022; v1 submitted 20 December, 2021; originally announced December 2021.

    Comments: Accepted to EMNLP 2022; 34 pages