Showing 1–18 of 18 results for author: Piktus, A

  1. arXiv:2404.18796

    cs.CL cs.AI

    Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

    Authors: Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis

    Abstract: As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is finding data to adequately probe particular model properties difficult, but evaluating the correctness of a model's freeform generation alone is a challenge. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality o…

    Submitted 1 May, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

  2. arXiv:2311.05640

    cs.CL

    FinGPT: Large Generative Models for a Small Language

    Authors: Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, Thomas Wang, Nouamane Tazi, Teven Le Scao, Thomas Wolf, Osma Suominen, Samuli Sairanen, Mikko Merioksa, Jyrki Heinonen, Aija Vahtola, Samuel Antao, Sampo Pyysalo

    Abstract: Large language models (LLMs) excel in many tasks in NLP and beyond, but most open models have very limited coverage of smaller languages and LLM work tends to focus on languages where nearly unlimited data is available for pretraining. In this work, we study the challenges of creating LLMs for Finnish, a language spoken by less than 0.1% of the world population. We compile an extensive dataset of…

    Submitted 3 November, 2023; originally announced November 2023.

    Comments: 17 pages (10 main), 7 figures, 5 tables

  3. arXiv:2306.01481

    cs.CL

    GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration

    Authors: Aleksandra Piktus, Odunayo Ogundepo, Christopher Akiki, Akintunde Oladipo, Xinyu Zhang, Hailey Schoelkopf, Stella Biderman, Martin Potthast, Jimmy Lin

    Abstract: Noticing the urgent need to provide tools for fast and user-friendly qualitative analysis of large-scale textual corpora of the modern NLP, we propose to turn to the mature and well-tested methods from the domain of Information Retrieval (IR) - a research field with a long history of tackling TB-scale document collections. We discuss how Pyserini - a widely used toolkit for reproducible IR researc…

    Submitted 2 June, 2023; originally announced June 2023.

  4. arXiv:2305.16264

    cs.CL cs.AI cs.LG

    Scaling Data-Constrained Language Models

    Authors: Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel

    Abstract: The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the…

    Submitted 25 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: 50 pages (9 main), 39 figures, 15 tables

  5. arXiv:2302.14534

    cs.IR cs.CL

    Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face

    Authors: Christopher Akiki, Odunayo Ogundepo, Aleksandra Piktus, Xinyu Zhang, Akintunde Oladipo, Jimmy Lin, Martin Potthast

    Abstract: We present Spacerini, a tool that integrates the Pyserini toolkit for reproducible information retrieval research with Hugging Face to enable the seamless construction and deployment of interactive search engines. Spacerini makes state-of-the-art sparse and dense retrieval models more accessible to non-IR practitioners while minimizing deployment effort. This is useful for NLP researchers who want…

    Submitted 24 March, 2024; v1 submitted 28 February, 2023; originally announced February 2023.

  6. arXiv:2302.14035

    cs.CL cs.AI

    The ROOTS Search Tool: Data Transparency for LLMs

    Authors: Aleksandra Piktus, Christopher Akiki, Paulo Villegas, Hugo Laurençon, Gérard Dupont, Alexandra Sasha Luccioni, Yacine Jernite, Anna Rogers

    Abstract: ROOTS is a 1.6TB multilingual text corpus developed for the training of BLOOM, currently the largest language model explicitly accompanied by commensurate data governance efforts. In continuation of these efforts, we present the ROOTS Search Tool: a search engine over the entire ROOTS corpus offering both fuzzy and exact search capabilities. ROOTS is the largest corpus to date that can be investig…

    Submitted 27 February, 2023; originally announced February 2023.

  7. arXiv:2210.01970

    cs.LG

    Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements

    Authors: Leandro von Werra, Lewis Tunstall, Abhishek Thakur, Alexandra Sasha Luccioni, Tristan Thrush, Aleksandra Piktus, Felix Marty, Nazneen Rajani, Victor Mustar, Helen Ngo, Omar Sanseviero, Mario Šaško, Albert Villanova, Quentin Lhoest, Julien Chaumond, Margaret Mitchell, Alexander M. Rush, Thomas Wolf, Douwe Kiela

    Abstract: Evaluation is a key part of machine learning (ML), yet there is a lack of support and tooling to enable its informed and systematic practice. We introduce Evaluate and Evaluation on the Hub --a set of tools to facilitate the evaluation of models and datasets in ML. Evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models. Its goal is to support…

    Submitted 6 October, 2022; v1 submitted 30 September, 2022; originally announced October 2022.

  8. arXiv:2207.06220

    cs.IR cs.AI

    Improving Wikipedia Verifiability with AI

    Authors: Fabio Petroni, Samuel Broscheit, Aleksandra Piktus, Patrick Lewis, Gautier Izacard, Lucas Hosseini, Jane Dwivedi-Yu, Maria Lomeli, Timo Schick, Pierre-Emmanuel Mazaré, Armand Joulin, Edouard Grave, Sebastian Riedel

    Abstract: Verifiability is a core content policy of Wikipedia: claims that are likely to be challenged need to be backed by citations. There are millions of articles available online and thousands of new articles are released each month. For this reason, finding relevant sources is a difficult task: many claims do not have any references that support them. Furthermore, even existing citations might not supp…

    Submitted 8 July, 2022; originally announced July 2022.

  9. arXiv:2112.09924

    cs.CL cs.AI cs.IR cs.LG

    The Web Is Your Oyster - Knowledge-Intensive NLP against a Very Large Web Corpus

    Authors: Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Dmytro Okhonko, Samuel Broscheit, Gautier Izacard, Patrick Lewis, Barlas Oğuz, Edouard Grave, Wen-tau Yih, Sebastian Riedel

    Abstract: In order to address increasing demands of real-world applications, the research for knowledge-intensive NLP (KI-NLP) should advance by capturing the challenges of a truly open-domain environment: web-scale knowledge, lack of structure, inconsistent quality and noise. To this end, we propose a new setup for evaluating existing knowledge intensive tasks in which we generalize the background corpus t…

    Submitted 24 May, 2022; v1 submitted 18 December, 2021; originally announced December 2021.

  10. arXiv:2107.13602

    cs.CL cs.IR

    Domain-matched Pre-training Tasks for Dense Retrieval

    Authors: Barlas Oğuz, Kushal Lakhotia, Anchit Gupta, Patrick Lewis, Vladimir Karpukhin, Aleksandra Piktus, Xilun Chen, Sebastian Riedel, Wen-tau Yih, Sonal Gupta, Yashar Mehdad

    Abstract: Pre-training on larger datasets with ever increasing model size is now a proven recipe for increased performance across almost all NLP tasks. A notable exception is information retrieval, where additional pre-training has so far failed to produce convincing results. We show that, with the right pre-training setup, this barrier can be overcome. We demonstrate this by pre-training large bi-encoder m…

    Submitted 28 July, 2021; originally announced July 2021.

  11. arXiv:2102.07033

    cs.CL cs.AI cs.LG

    PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them

    Authors: Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus, Pontus Stenetorp, Sebastian Riedel

    Abstract: Open-domain Question Answering models which directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared to conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowl…

    Submitted 13 February, 2021; originally announced February 2021.

  12. arXiv:2011.05448

    cs.CL

    Generating Fact Checking Briefs

    Authors: Angela Fan, Aleksandra Piktus, Fabio Petroni, Guillaume Wenzek, Marzieh Saeidi, Andreas Vlachos, Antoine Bordes, Sebastian Riedel

    Abstract: Fact checking at scale is difficult -- while the number of active fact checking websites is growing, it remains too small for the needs of the contemporary media ecosystem. However, despite good intentions, contributions from volunteers are often error-prone, and thus in practice restricted to claim detection. We investigate how to increase the accuracy and efficiency of fact checking by providing…

    Submitted 10 November, 2020; originally announced November 2020.

  13. arXiv:2009.02252

    cs.CL cs.AI cs.IR cs.LG

    KILT: a Benchmark for Knowledge Intensive Language Tasks

    Authors: Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, Sebastian Riedel

    Abstract: Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure. To catalyze research…

    Submitted 27 May, 2021; v1 submitted 4 September, 2020; originally announced September 2020.

    Comments: accepted at NAACL 2021

  14. arXiv:2005.11401

    cs.CL cs.LG

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Authors: Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela

    Abstract: Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for…

    Submitted 12 April, 2021; v1 submitted 22 May, 2020; originally announced May 2020.

    Comments: Accepted at NeurIPS 2020

  15. arXiv:2005.04611

    cs.CL

    How Context Affects Language Models' Factual Predictions

    Authors: Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, Sebastian Riedel

    Abstract: When pre-trained on large unsupervised textual corpora, language models are able to store and retrieve factual knowledge to some extent, making it possible to use them directly for zero-shot cloze-style question answering. However, storing factual knowledge in a fixed number of weights of a language model clearly has limitations. Previous approaches have successfully provided access to information…

    Submitted 10 May, 2020; originally announced May 2020.

    Comments: accepted at AKBC 2020

  16. arXiv:2001.07524

    cs.LG cs.AI stat.ML

    Node Masking: Making Graph Neural Networks Generalize and Scale Better

    Authors: Pushkar Mishra, Aleksandra Piktus, Gerard Goossen, Fabrizio Silvestri

    Abstract: Graph Neural Networks (GNNs) have received a lot of interest in the recent times. From the early spectral architectures that could only operate on undirected graphs per a transductive learning paradigm to the current state of the art spatial ones that can apply inductively to arbitrary graphs, GNNs have seen significant contributions from the research community. In this paper, we utilize some theo…

    Submitted 16 May, 2021; v1 submitted 17 January, 2020; originally announced January 2020.

  17. arXiv:1911.03587

    cs.CL

    How Decoding Strategies Affect the Verifiability of Generated Text

    Authors: Luca Massarelli, Fabio Petroni, Aleksandra Piktus, Myle Ott, Tim Rocktäschel, Vassilis Plachouras, Fabrizio Silvestri, Sebastian Riedel

    Abstract: Recent progress in pre-trained language models led to systems that are able to generate text of an increasingly high quality. While several works have investigated the fluency and grammatical correctness of such models, it is still unclear to which extent the generated text is consistent with factual world knowledge. Here, we go beyond fluency and also investigate the verifiability of text generat…

    Submitted 29 September, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

    Comments: accepted at Findings of EMNLP 2020

  18. arXiv:1905.09755

    cs.CL cs.LG

    Misspelling Oblivious Word Embeddings

    Authors: Bora Edizel, Aleksandra Piktus, Piotr Bojanowski, Rui Ferreira, Edouard Grave, Fabrizio Silvestri

    Abstract: In this paper we present a method to learn word embeddings that are resilient to misspellings. Existing word embeddings have limited applicability to malformed texts, which contain a non-negligible amount of out-of-vocabulary words. We propose a method combining FastText with subwords and a supervised task of learning misspelling patterns. In our method, misspellings of each word are embedded clos…

    Submitted 23 May, 2019; originally announced May 2019.

    Comments: 9 Pages