Showing 1–21 of 21 results for author: Pyysalo, S

  1. arXiv:2405.15290  [pdf, other]

    cond-mat.mtrl-sci

    Question Answering models for information extraction from perovskite materials science literature

    Authors: M. Sipilä, F. Mehryary, S. Pyysalo, F. Ginter, Milica Todorović

    Abstract: Scientific text is a promising source of data in materials science, with ongoing research into utilising textual data for materials discovery. In this study, we developed and tested a novel approach to extract material-property relationships from scientific publications using the Question Answering (QA) method. QA performance was evaluated for information extraction of perovskite bandgaps based on…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: The following article has been submitted to npj Computational Materials

  2. arXiv:2404.01856  [pdf, other]

    cs.CL

    Poro 34B and the Blessing of Multilinguality

    Authors: Risto Luukkonen, Jonathan Burdge, Elaine Zosa, Aarne Talman, Ville Komulainen, Väinö Hatanpää, Peter Sarlin, Sampo Pyysalo

    Abstract: The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on indiv…

    Submitted 24 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

  3. arXiv:2404.00399  [pdf, other]

    cs.CL cs.AI cs.LG

    Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order

    Authors: Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, et al. (20 additional authors not shown)

    Abstract: Pretrained language models underpin several AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, such existing models face challenges: limited multilingual capabilities, continual pretraining causing catastrophic forgetting, where…

    Submitted 23 April, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

    Comments: Preprint

  4. arXiv:2403.14009  [pdf, other]

    cs.CL

    A New Massive Multilingual Dataset for High-Performance Language Technologies

    Authors: Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann

    Abstract: We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performa…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: LREC-COLING 2024

  5. arXiv:2311.05640  [pdf, other]

    cs.CL

    FinGPT: Large Generative Models for a Small Language

    Authors: Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, Thomas Wang, Nouamane Tazi, Teven Le Scao, Thomas Wolf, Osma Suominen, Samuli Sairanen, Mikko Merioksa, Jyrki Heinonen, Aija Vahtola, Samuel Antao, Sampo Pyysalo

    Abstract: Large language models (LLMs) excel in many tasks in NLP and beyond, but most open models have very limited coverage of smaller languages and LLM work tends to focus on languages where nearly unlimited data is available for pretraining. In this work, we study the challenges of creating LLMs for Finnish, a language spoken by less than 0.1% of the world population. We compile an extensive dataset of…

    Submitted 3 November, 2023; originally announced November 2023.

    Comments: 17 pages (10 main), 7 figures, 5 tables

  6. arXiv:2305.16264  [pdf, other]

    cs.CL cs.AI cs.LG

    Scaling Data-Constrained Language Models

    Authors: Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel

    Abstract: The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the…

    Submitted 25 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: 50 pages (9 main), 39 figures, 15 tables

  7. arXiv:2305.11016  [pdf, other]

    cs.CL

    Silver Syntax Pre-training for Cross-Domain Relation Extraction

    Authors: Elisa Bassignana, Filip Ginter, Sampo Pyysalo, Rob van der Goot, Barbara Plank

    Abstract: Relation Extraction (RE) remains a challenging task, especially when considering realistic out-of-domain evaluations. One of the main reasons for this is the limited training size of current RE datasets: obtaining high-quality (manually annotated) data is extremely expensive and cannot realistically be repeated for each new domain. An intermediate training step on data from related tasks has shown…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted in Findings of the Association for Computational Linguistics: ACL 2023

  8. arXiv:2305.10985  [pdf, other]

    cs.CL

    Multi-CrossRE A Multi-Lingual Multi-Domain Dataset for Relation Extraction

    Authors: Elisa Bassignana, Filip Ginter, Sampo Pyysalo, Rob van der Goot, Barbara Plank

    Abstract: Most research in Relation Extraction (RE) involves the English language, mainly due to the lack of multi-lingual resources. We propose Multi-CrossRE, the broadest multi-lingual dataset for RE, including 26 languages in addition to English, and covering six text domains. Multi-CrossRE is a machine translated version of CrossRE (Bassignana and Plank, 2022), with a sub-portion including more than 200…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted at NoDaLiDa 2023

  9. arXiv:2211.05100  [pdf, other]

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  10. arXiv:2108.13653  [pdf, other]

    cs.CL cs.AI

    Explaining Classes through Word Attribution

    Authors: Samuel Rönnqvist, Amanda Myntti, Aki-Juhani Kyröläinen, Sampo Pyysalo, Veronika Laippala, Filip Ginter

    Abstract: In recent years, several methods have been proposed for explaining individual predictions of deep learning models, yet there has been little study of how to aggregate these predictions to explain how such models view classes as a whole in text classification tasks. In this work, we propose a method for explaining classes using deep learning models and the Integrated Gradients feature attribution t…

    Submitted 31 August, 2021; originally announced August 2021.

  11. arXiv:2105.02477  [pdf, other]

    cs.CL

    Quantitative Evaluation of Alternative Translations in a Corpus of Highly Dissimilar Finnish Paraphrases

    Authors: Li-Hsin Chang, Sampo Pyysalo, Jenna Kanerva, Filip Ginter

    Abstract: In this paper, we present a quantitative evaluation of differences between alternative translations in a large recently released Finnish paraphrase corpus focusing in particular on non-trivial variation in translation. We combine a series of automatic steps detecting systematic variation with manual analysis to reveal regularities and identify categories of translation differences. We find the par…

    Submitted 6 May, 2021; originally announced May 2021.

    Comments: Accepted to Workshop on MOdelling TRAnslation: Translatology in the Digital Age

  12. arXiv:2104.11556  [pdf, other]

    cs.CL

    Deep learning for sentence clustering in essay grading support

    Authors: Li-Hsin Chang, Iiro Rastas, Sampo Pyysalo, Filip Ginter

    Abstract: Essays as a form of assessment test student knowledge on a deeper level than short answer and multiple-choice questions. However, the manual evaluation of essays is time- and labor-consuming. Automatic clustering of essays, or their fragments, prior to manual evaluation presents a possible solution to reducing the effort required in the evaluation process. Such clustering presents numerous challen…

    Submitted 23 April, 2021; originally announced April 2021.

    Comments: Accepted to EDM 2021

  13. arXiv:2102.07396  [pdf, other]

    cs.CL

    Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers

    Authors: Liina Repo, Valtteri Skantsi, Samuel Rönnqvist, Saara Hellström, Miika Oinonen, Anna Salmela, Douglas Biber, Jesse Egbert, Sampo Pyysalo, Veronika Laippala

    Abstract: We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perf…

    Submitted 15 February, 2021; originally announced February 2021.

  14. arXiv:2010.11639  [pdf, ps, other]

    cs.CL

    Towards Fully Bilingual Deep Language Modeling

    Authors: Li-Hsin Chang, Sampo Pyysalo, Jenna Kanerva, Filip Ginter

    Abstract: Language models based on deep neural networks have facilitated great advances in natural language processing and understanding tasks in recent years. While models covering a large number of languages have been introduced, their multilinguality has come at a cost in terms of monolingual performance, and the best-performing models at most tasks not involving cross-lingual transfer remain monolingual…

    Submitted 22 October, 2020; originally announced October 2020.

  15. arXiv:2009.08712  [pdf, other]

    cs.CL

    The birth of Romanian BERT

    Authors: Stefan Daniel Dumitrescu, Andrei-Marius Avram, Sampo Pyysalo

    Abstract: Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either in high-resource languages, in particular English, or as multilingual models that compromise performance on individual languages for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a…

    Submitted 18 September, 2020; originally announced September 2020.

    Comments: 5 pages (4 + reference page), accepted in Findings of EMNLP 2020

  16. arXiv:2006.01563  [pdf, other]

    cs.CL

    Exploring Cross-sentence Contexts for Named Entity Recognition with BERT

    Authors: Jouni Luoma, Sampo Pyysalo

    Abstract: Named entity recognition (NER) is frequently addressed as a sequence classification task where each input consists of one sentence of text. It is nevertheless clear that useful information for the task can often be found outside of the scope of a single-sentence context. Recently proposed self-attention models such as BERT can both efficiently capture long-distance relationships in input as well a…

    Submitted 17 December, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

    Journal ref: Proceedings of the 28th International Conference on Computational Linguistics, Dec 2020, Barcelona, Spain (Online), International Committee on Computational Linguistics, pages 904-914

  17. arXiv:2006.01538  [pdf, other]

    cs.CL cs.LG

    WikiBERT models: deep transfer learning for many languages

    Authors: Sampo Pyysalo, Jenna Kanerva, Antti Virtanen, Filip Ginter

    Abstract: Deep neural language models such as BERT have enabled substantial recent advances in many natural language processing tasks. Due to the effort and computational cost involved in their pre-training, language-specific models are typically introduced only for a small number of high-resource languages such as English. While multilingual models covering large numbers of languages are available, recent…

    Submitted 2 June, 2020; originally announced June 2020.

    Comments: 7 pages, 1 figure

  18. arXiv:2004.10643  [pdf, other]

    cs.CL

    Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection

    Authors: Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, Daniel Zeman

    Abstract: Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on…

    Submitted 22 April, 2020; originally announced April 2020.

    Comments: LREC 2020

  19. arXiv:1912.07076  [pdf, other]

    cs.CL

    Multilingual is not enough: BERT for Finnish

    Authors: Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, Sampo Pyysalo

    Abstract: Deep learning-based language models pretrained on large unannotated text corpora have been demonstrated to allow efficient transfer learning for natural language processing, with recent approaches such as the transformer-based BERT model advancing the state of the art across a variety of tasks. While most work on these models has focused on high-resource languages, in particular English, a number…

    Submitted 15 December, 2019; originally announced December 2019.

  20. arXiv:1611.04361  [pdf, other]

    cs.CL cs.LG cs.NE

    Attending to Characters in Neural Sequence Labeling Models

    Authors: Marek Rei, Gamal K. O. Crichton, Sampo Pyysalo

    Abstract: Sequence labeling architectures use word embeddings for capturing similarity, but suffer when handling previously unseen or rare words. We investigate character-level extensions to such models and propose a novel architecture for combining alternative word representations. By using an attention mechanism, the model is able to dynamically decide how much information to use from a word- or character…

    Submitted 14 November, 2016; originally announced November 2016.

    Comments: Proceedings of COLING 2016

    ACM Class: I.5.1; I.2.6; I.2.7

  21. arXiv:cs/0606119  [pdf, ps, other]

    cs.CL cs.IR

    Lexical Adaptation of Link Grammar to the Biomedical Sublanguage: a Comparative Evaluation of Three Approaches

    Authors: Sampo Pyysalo, Tapio Salakoski, Sophie Aubin, Adeline Nazarenko

    Abstract: We study the adaptation of Link Grammar Parser to the biomedical sublanguage with a focus on domain terms not found in a general parser lexicon. Using two biomedical corpora, we implement and evaluate three approaches to addressing unknown words: automatic lexicon expansion, the use of morphological clues, and disambiguation using a part-of-speech tagger. We evaluate each approach separately for…

    Submitted 28 June, 2006; originally announced June 2006.

    ACM Class: H.4

    Journal ref: Proceedings of the Second International Symposium on Semantic Mining in Biomedicine (SMBM 2006) (2006) 60-67