Showing 1–18 of 18 results for author: Sahlgren, M

  1. arXiv:2305.12987  [pdf, other]

    cs.CL cs.AI

    GPT-SW3: An Autoregressive Language Model for the Nordic Languages

    Authors: Ariel Ekgren, Amaru Cuba Gyllensten, Felix Stollenwerk, Joey Öhman, Tim Isbister, Evangelia Gogoulou, Fredrik Carlsson, Alice Heiman, Judit Casademont, Magnus Sahlgren

    Abstract: This paper details the process of developing the first native large generative language model for the Nordic languages, GPT-SW3. We cover all parts of the development process, from data collection and processing, training configuration and instruction finetuning, to evaluation and considerations for release strategies. We hope that this paper can serve as a guide and reference for other researcher…

    Submitted 23 May, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

  2. arXiv:2303.17183  [pdf, other]

    cs.CL cs.AI

    The Nordic Pile: A 1.2TB Nordic Dataset for Language Modeling

    Authors: Joey Öhman, Severine Verlinden, Ariel Ekgren, Amaru Cuba Gyllensten, Tim Isbister, Evangelia Gogoulou, Fredrik Carlsson, Magnus Sahlgren

    Abstract: Pre-training Large Language Models (LLMs) requires massive amounts of text data, and the performance of the LLMs typically correlates with the scale and quality of the datasets. This means that it may be challenging to build LLMs for smaller languages such as Nordic ones, where the availability of text corpora is limited. In order to facilitate the development of the LLMs in the Nordic languages, w…

    Submitted 30 March, 2023; originally announced March 2023.

  3. arXiv:2110.05464  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG

    We Need to Talk About Data: The Importance of Data Readiness in Natural Language Processing

    Authors: Fredrik Olsson, Magnus Sahlgren

    Abstract: In this paper, we identify the state of data as being an important reason for failure in applied Natural Language Processing (NLP) projects. We argue that there is a gap between academic research in NLP and its application to problems outside academia, and that this gap is rooted in poor mutual understanding between academic researchers and their non-academic peers who seek to apply research resul…

    Submitted 11 October, 2021; originally announced October 2021.

  4. arXiv:2109.07348  [pdf, other]

    cs.CL cs.LG

    Cross-lingual Transfer of Monolingual Models

    Authors: Evangelia Gogoulou, Ariel Ekgren, Tim Isbister, Magnus Sahlgren

    Abstract: Recent studies in zero-shot cross-lingual learning using multilingual models have falsified the previous hypothesis that shared vocabulary and joint pre-training are the keys to cross-lingual generalization. Inspired by this advancement, we introduce a cross-lingual transfer method for monolingual models based on domain adaptation. We study the effects of such transfer from four different language…

    Submitted 19 May, 2022; v1 submitted 15 September, 2021; originally announced September 2021.

    Comments: Accepted to LREC 2022

  5. arXiv:2105.09825  [pdf]

    cs.CL cs.AI

    A comparative evaluation and analysis of three generations of Distributional Semantic Models

    Authors: Alessandro Lenci, Magnus Sahlgren, Patrick Jeuniaux, Amaru Cuba Gyllensten, Martina Miliani

    Abstract: Distributional semantics has deeply changed in the last decades. First, predict models stole the thunder from traditional count ones, and more recently both of them were replaced in many NLP applications by contextualized vectors produced by Transformer neural language models. Although an extensive body of research has been devoted to Distributional Semantic Model (DSM) evaluation, we still lack a…

    Submitted 1 April, 2022; v1 submitted 20 May, 2021; originally announced May 2021.

    Comments: Language Resources and Evaluation

  6. arXiv:2105.00831  [pdf, other]

    cs.CL cs.DC cs.LG

    Federated Word2Vec: Leveraging Federated Learning to Encourage Collaborative Representation Learning

    Authors: Daniel Garcia Bernal, Lodovico Giaretta, Sarunas Girdzijauskas, Magnus Sahlgren

    Abstract: Large scale contextual representation models have significantly advanced NLP in recent years, understanding the semantics of text to a degree never seen before. However, they need to process large amounts of data to achieve high-quality results. Joining and accessing all these data from multiple sources can be extremely challenging due to privacy and regulatory reasons. Federated Learning can solv…

    Submitted 19 April, 2021; originally announced May 2021.

  7. arXiv:2104.10441  [pdf, ps, other]

    cs.CL cs.LG

    Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead?

    Authors: Tim Isbister, Fredrik Carlsson, Magnus Sahlgren

    Abstract: Most work in NLP makes the assumption that it is desirable to develop solutions in the native language in question. There is consequently a strong trend towards building native language models even for low-resource languages. This paper questions this development, and explores the idea of simply translating the data into English, thereby enabling the use of pretrained, and large-scale, English lan…

    Submitted 21 April, 2021; originally announced April 2021.

  8. arXiv:2104.06719  [pdf, ps, other]

    cs.CL cs.LG

    Sentence Embeddings by Ensemble Distillation

    Authors: Fredrik Carlsson, Magnus Sahlgren

    Abstract: This paper contributes a new State Of The Art (SOTA) for Semantic Textual Similarity (STS). We compare and combine a number of recently proposed sentence embedding methods for STS, and propose a novel and simple ensemble knowledge distillation scheme that improves on previous approaches. Our experiments demonstrate that a model trained to learn the average embedding space from multiple ensemble st…

    Submitted 14 April, 2021; originally announced April 2021.

  9. arXiv:2102.04310  [pdf, ps, other]

    cs.CL

    The Singleton Fallacy: Why Current Critiques of Language Models Miss the Point

    Authors: Magnus Sahlgren, Fredrik Carlsson

    Abstract: This paper discusses the current critique against neural network-based Natural Language Understanding (NLU) solutions known as language models. We argue that much of the current debate rests on an argumentation error that we will refer to as the singleton fallacy: the assumption that language, meaning, and understanding are single and uniform phenomena that are unobtainable by (current) language m…

    Submitted 8 February, 2021; originally announced February 2021.

  10. arXiv:2009.03116  [pdf, other]

    cs.CL

    Why Not Simply Translate? A First Swedish Evaluation Benchmark for Semantic Similarity

    Authors: Tim Isbister, Magnus Sahlgren

    Abstract: This paper presents the first Swedish evaluation benchmark for textual semantic similarity. The benchmark is compiled by simply running the English STS-B dataset through the Google machine translation API. This paper discusses potential problems with using such a simple approach to compile a Swedish evaluation benchmark, including translation errors, vocabulary variation, and productive compoundin…

    Submitted 29 November, 2020; v1 submitted 7 September, 2020; originally announced September 2020.

    Comments: SLTC 2020

  11. arXiv:2009.02043  [pdf, other]

    cs.CY cs.AI cs.CL cs.DB cs.LG

    Data Readiness for Natural Language Processing

    Authors: Fredrik Olsson, Magnus Sahlgren

    Abstract: This document concerns data readiness in the context of machine learning and Natural Language Processing. It describes how an organization may proceed to identify, make available, validate, and prepare data to facilitate automated analysis methods. The contents of the document are based on the practical challenges and frequently asked questions we have encountered in our work as an applied research…

    Submitted 30 September, 2020; v1 submitted 4 September, 2020; originally announced September 2020.

  12. arXiv:1811.00127  [pdf, ps, other]

    cs.CL

    Measuring Issue Ownership using Word Embeddings

    Authors: Amaru Cuba Gyllensten, Magnus Sahlgren

    Abstract: Sentiment and topic analysis are common methods used for social media monitoring. Essentially, these methods answer questions such as "what is being talked about, regarding X" and "what do people feel, regarding X". In this paper, we investigate another venue for social media monitoring, namely issue ownership and agenda setting, which are concepts from political science that have been used to…

    Submitted 31 October, 2018; originally announced November 2018.

    Comments: Accepted to the 9th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA), held in conjunction with the EMNLP 2018 conference

  13. arXiv:1808.04670  [pdf, other]

    cs.CL cs.LG

    R-grams: Unsupervised Learning of Semantic Units in Natural Language

    Authors: Ariel Ekgren, Amaru Cuba Gyllensten, Magnus Sahlgren

    Abstract: This paper investigates data-driven segmentation using Re-Pair or Byte Pair Encoding-techniques. In contrast to previous work which has primarily been focused on subword units for machine translation, we are interested in the general properties of such segments above the word level. We call these segments r-grams, and discuss their properties and the effect they have on the token frequency distrib…

    Submitted 3 April, 2019; v1 submitted 14 August, 2018; originally announced August 2018.

  14. arXiv:1803.04757  [pdf, other]

    cs.CL

    Monitoring Targeted Hate in Online Environments

    Authors: Tim Isbister, Magnus Sahlgren, Lisa Kaati, Milan Obaidi, Nazar Akrami

    Abstract: Hateful comments, swearwords and sometimes even death threats are becoming a reality for many people today in online environments. This is especially true for journalists, politicians, artists, and other public figures. This paper describes how hate directed towards individuals can be measured in online environments using a simple dictionary-based approach. We present a case study on Swedish polit…

    Submitted 13 March, 2018; originally announced March 2018.

    Comments: Accepted for publication at the second workshop on Text Analytics for Cybersecurity and Online Safety (TA-COS)

  15. arXiv:1802.05014  [pdf, ps, other]

    cs.CL

    Distributional Term Set Expansion

    Authors: Amaru Cuba Gyllensten, Magnus Sahlgren

    Abstract: This paper is a short empirical study of the performance of centrality and classification based iterative term set expansion methods for distributional semantic models. Iterative term set expansion is an interactive process using distributional semantics models where a user labels terms as belonging to some sought after term set, and a system uses this labeling to supply the user with new, candida…

    Submitted 14 February, 2018; originally announced February 2018.

  16. arXiv:1609.08293  [pdf, ps, other]

    cs.CL

    The Effects of Data Size and Frequency Range on Distributional Semantic Models

    Authors: Magnus Sahlgren, Alessandro Lenci

    Abstract: This paper investigates the effects of data size and frequency range on distributional semantic models. We compare the performance of a number of representative models for several test settings over data of varying sizes, and over test items of various frequency. Our results show that neural network-based models underperform when the data is small, and that the most reliable model over data of var…

    Submitted 27 September, 2016; originally announced September 2016.

    Comments: Accepted at EMNLP 2016

  17. arXiv:1501.02670  [pdf, other]

    cs.CL

    Navigating the Semantic Horizon using Relative Neighborhood Graphs

    Authors: Amaru Cuba Gyllensten, Magnus Sahlgren

    Abstract: This paper is concerned with nearest neighbor search in distributional semantic models. A normal nearest neighbor search only returns a ranked list of neighbors, with no information about the structure or topology of the local neighborhood. This is a potentially serious shortcoming of the mode of querying a distributional semantic model, since a ranked list of neighbors may conflate several differ…

    Submitted 12 January, 2015; originally announced January 2015.

  18. arXiv:1103.3585  [pdf, other]

    cs.DS cs.CL cs.IR

    Incremental dimension reduction of tensors with random index

    Authors: Fredrik Sandin, Blerim Emruli, Magnus Sahlgren

    Abstract: We present an incremental, scalable and efficient dimension reduction technique for tensors that is based on sparse random linear coding. Data is stored in a compactified representation with fixed size, which makes memory requirements low and predictable. Component encoding and decoding are performed on-line without computationally expensive re-analysis of the data set. The range of tensor indices…

    Submitted 18 March, 2011; originally announced March 2011.

    Comments: 36 pages, 9 figures

    Journal ref: Revised version published in Knowl. Inf. Syst. 2016 (Open Access)