Showing 1–7 of 7 results for author: Gyllensten, A C

  1. arXiv:2305.12987  [pdf, other]

    cs.CL cs.AI

    GPT-SW3: An Autoregressive Language Model for the Nordic Languages

    Authors: Ariel Ekgren, Amaru Cuba Gyllensten, Felix Stollenwerk, Joey Öhman, Tim Isbister, Evangelia Gogoulou, Fredrik Carlsson, Alice Heiman, Judit Casademont, Magnus Sahlgren

    Abstract: This paper details the process of developing the first native large generative language model for the Nordic languages, GPT-SW3. We cover all parts of the development process, from data collection and processing, training configuration and instruction finetuning, to evaluation and considerations for release strategies. We hope that this paper can serve as a guide and reference for other researcher…

    Submitted 23 May, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

  2. arXiv:2303.17183  [pdf, other]

    cs.CL cs.AI

    The Nordic Pile: A 1.2TB Nordic Dataset for Language Modeling

    Authors: Joey Öhman, Severine Verlinden, Ariel Ekgren, Amaru Cuba Gyllensten, Tim Isbister, Evangelia Gogoulou, Fredrik Carlsson, Magnus Sahlgren

    Abstract: Pre-training Large Language Models (LLMs) requires massive amounts of text data, and the performance of the LLMs typically correlates with the scale and quality of the datasets. This means that it may be challenging to build LLMs for smaller languages such as Nordic ones, where the availability of text corpora is limited. In order to facilitate the development of the LLMs in the Nordic languages, w…

    Submitted 30 March, 2023; originally announced March 2023.

  3. arXiv:2105.09825  [pdf]

    cs.CL cs.AI

    A comparative evaluation and analysis of three generations of Distributional Semantic Models

    Authors: Alessandro Lenci, Magnus Sahlgren, Patrick Jeuniaux, Amaru Cuba Gyllensten, Martina Miliani

    Abstract: Distributional semantics has deeply changed in the last decades. First, predict models stole the thunder from traditional count ones, and more recently both of them were replaced in many NLP applications by contextualized vectors produced by Transformer neural language models. Although an extensive body of research has been devoted to Distributional Semantic Model (DSM) evaluation, we still lack a…

    Submitted 1 April, 2022; v1 submitted 20 May, 2021; originally announced May 2021.

    Comments: Language Resources and Evaluation

  4. arXiv:1811.00127  [pdf, ps, other]

    cs.CL

    Measuring Issue Ownership using Word Embeddings

    Authors: Amaru Cuba Gyllensten, Magnus Sahlgren

    Abstract: Sentiment and topic analysis are common methods used for social media monitoring. Essentially, these methods answer questions such as, "what is being talked about, regarding X", and "what do people feel, regarding X". In this paper, we investigate another venue for social media monitoring, namely issue ownership and agenda setting, which are concepts from political science that have been used to…

    Submitted 31 October, 2018; originally announced November 2018.

    Comments: Accepted to the 9th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA), held in conjunction with the EMNLP 2018 conference

  5. arXiv:1808.04670  [pdf, other]

    cs.CL cs.LG

    R-grams: Unsupervised Learning of Semantic Units in Natural Language

    Authors: Ariel Ekgren, Amaru Cuba Gyllensten, Magnus Sahlgren

    Abstract: This paper investigates data-driven segmentation using Re-Pair or Byte Pair Encoding-techniques. In contrast to previous work which has primarily been focused on subword units for machine translation, we are interested in the general properties of such segments above the word level. We call these segments r-grams, and discuss their properties and the effect they have on the token frequency distrib…

    Submitted 3 April, 2019; v1 submitted 14 August, 2018; originally announced August 2018.

  6. arXiv:1802.05014  [pdf, ps, other]

    cs.CL

    Distributional Term Set Expansion

    Authors: Amaru Cuba Gyllensten, Magnus Sahlgren

    Abstract: This paper is a short empirical study of the performance of centrality and classification based iterative term set expansion methods for distributional semantic models. Iterative term set expansion is an interactive process using distributional semantics models where a user labels terms as belonging to some sought after term set, and a system uses this labeling to supply the user with new, candida…

    Submitted 14 February, 2018; originally announced February 2018.

  7. arXiv:1501.02670  [pdf, other]

    cs.CL

    Navigating the Semantic Horizon using Relative Neighborhood Graphs

    Authors: Amaru Cuba Gyllensten, Magnus Sahlgren

    Abstract: This paper is concerned with nearest neighbor search in distributional semantic models. A normal nearest neighbor search only returns a ranked list of neighbors, with no information about the structure or topology of the local neighborhood. This is a potentially serious shortcoming of the mode of querying a distributional semantic model, since a ranked list of neighbors may conflate several differ…

    Submitted 12 January, 2015; originally announced January 2015.