Showing 1–19 of 19 results for author: Karlgren, J

  1. arXiv:2402.00053  [pdf, other]

    cs.AI cs.LG

    Are We Wasting Time? A Fast, Accurate Performance Evaluation Framework for Knowledge Graph Link Predictors

    Authors: Filip Cornell, Yifei **, Jussi Karlgren, Sarunas Girdzijauskas

    Abstract: The standard evaluation protocol for measuring the quality of Knowledge Graph Completion methods - the task of inferring new links to be added to a graph - typically involves a step which ranks every entity of a Knowledge Graph to assess their fit as a head or tail of a candidate link to be added. In Knowledge Graphs on a larger scale, this task rapidly becomes prohibitively heavy. Previous approa…

    Submitted 25 January, 2024; originally announced February 2024.

  2. Cem Mil Podcasts: A Spoken Portuguese Document Corpus For Multi-modal, Multi-lingual and Multi-Dialect Information Access Research

    Authors: Ekaterina Garmash, Edgar Tanaka, Ann Clifton, Joana Correia, Sharmistha Jat, Winstead Zhu, Rosie Jones, Jussi Karlgren

    Abstract: In this paper we describe the Portuguese-language podcast dataset we have released for academic research purposes. We give an overview of how the data was sampled, descriptive statistics over the collection, as well as information about the distribution over Brazilian and Portuguese dialects. We give results from experiments on multi-lingual summarization, showing that summarizing podcast transcri…

    Submitted 13 December, 2023; v1 submitted 23 September, 2022; originally announced September 2022.

    Comments: 12 pages, 1 figure

    Journal ref: Volume 14163 of Lecture Notes in Computer Science, pages 48-59, Springer, 2023

  3. arXiv:2207.12504  [pdf, other]

    cs.CL

    Unsupervised Speaker Diarization that is Agnostic to Language, Overlap-Aware, and Tuning Free

    Authors: M. Iftekhar Tanveer, Diego Casabuena, Jussi Karlgren, Rosie Jones

    Abstract: Podcasts are conversational in nature and speaker changes are frequent -- requiring speaker diarization for content understanding. We propose an unsupervised technique for speaker diarization without relying on language-specific components. The algorithm is overlap-aware and does not require information about the number of speakers. Our approach shows 79% improvement on purity scores (34% on F-sco…

    Submitted 25 July, 2022; originally announced July 2022.

    Comments: Published at Interspeech 2022

  4. arXiv:2207.05680  [pdf, other]

    cs.MM cs.AI cs.CL cs.SD eess.AS

    The Contribution of Lyrics and Acoustics to Collaborative Understanding of Mood

    Authors: Shahrzad Naseri, Sravana Reddy, Joana Correia, Jussi Karlgren, Rosie Jones

    Abstract: In this work, we study the association between song lyrics and mood through a data-driven analysis. Our data set consists of nearly one million songs, with song-mood associations derived from user playlists on the Spotify streaming platform. We take advantage of state-of-the-art natural language processing models based on transformers to learn the association between the lyrics and moods. We find…

    Submitted 31 May, 2022; originally announced July 2022.

  5. Conventions and Mutual Expectations -- understanding sources for web genres

    Authors: Jussi Karlgren

    Abstract: Genres can be understood in many different ways. They are often perceived as a primarily sociological construction, or, alternatively, as a stylostatistically observable objective characteristic of texts. The latter view is more common in the research field of information and language technology. These two views can be quite compatible and can inform each other; this present investigation discusse…

    Submitted 1 May, 2022; originally announced May 2022.

    Journal ref: GENRES ON THE WEB: COMPUTATIONAL MODELS AND EMPIRICAL STUDIES, edited by Alexander Mehler, Serge Sharoff, and Marina Santini. Springer. 2010

  6. Textual Stylistic Variation: Choices, Genres and Individuals

    Authors: Jussi Karlgren

    Abstract: This chapter argues for more informed target metrics for the statistical processing of stylistic variation in text collections. Much as operationalised relevance proved a useful goal to strive for in information retrieval, research in textual stylistics, whether application oriented or philologically inclined, needs goals formulated in terms of pertinence, relevance, and utility - notions that agr…

    Submitted 1 May, 2022; originally announced May 2022.

  7. Podcast Metadata and Content: Episode Relevance and Attractiveness in Ad Hoc Search

    Authors: Ben Carterette, Rosie Jones, Gareth F. Jones, Maria Eskevich, Sravana Reddy, Ann Clifton, Yongze Yu, Jussi Karlgren, Ian Soboroff

    Abstract: Rapidly growing online podcast archives contain diverse content on a wide range of topics. These archives form an important resource for entertainment and professional use, but their value can only be realized if users can rapidly and reliably locate content of interest. Search for relevant content can be based on metadata provided by content creators, but also on transcripts of the spoken content…

    Submitted 25 August, 2021; originally announced August 2021.

  8. Socially Intelligent Interfaces for Increased Energy Awareness in the Home

    Authors: Jussi Karlgren, Lennart E. Fahlén, Anders Wallberg, Pär Hansson, Olov Ståhl, Jonas Söderberg, Karl-Petter Åkesson

    Abstract: This paper describes how home appliances might be enhanced to improve user awareness of energy usage. Households wish to lead comfortable and manageable lives. Balancing this reasonable desire with the environmental and political goal of reducing electricity usage is a challenge that we claim is best met through the design of interfaces that allow users better control of their usage and unobtrusi…

    Submitted 29 June, 2021; originally announced June 2021.

  9. arXiv:2106.09227  [pdf, other]

    cs.IR

    Current Challenges and Future Directions in Podcast Information Access

    Authors: Rosie Jones, Hamed Zamani, Markus Schedl, Ching-Wei Chen, Sravana Reddy, Ann Clifton, Jussi Karlgren, Helia Hashemi, Aasish Pappu, Zahra Nazari, Longqi Yang, Oguz Semerci, Hugues Bouchard, Ben Carterette

    Abstract: Podcasts are spoken documents across a wide range of genres and styles, with growing listenership across the world, and a rapidly lowering barrier to entry for both listeners and creators. The great strides in search and recommendation in research and industry have yet to see impact in the podcast space, where recommendations are still largely driven by word of mouth. In this perspective paper, we…

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: SIGIR 2021

  10. arXiv:2105.14921  [pdf, ps, other]

    cs.CL

    How Lexical Gold Standards Have Effects On The Usefulness Of Text Analysis Tools For Digital Scholarship

    Authors: Jussi Karlgren

    Abstract: This paper describes how the current lexical similarity and analogy gold standards are built to conform to certain ideas about what the models they are designed to evaluate are used for. Topical relevance has always been the most important target notion for information access tools and related language technologies, and while this has proven a useful starting point for much of what info…

    Submitted 31 May, 2021; originally announced May 2021.

  11. arXiv:2104.00424  [pdf, ps, other]

    cs.CL

    High-dimensional distributed semantic spaces for utterances

    Authors: Jussi Karlgren, Pentti Kanerva

    Abstract: High-dimensional distributed semantic spaces have proven useful and effective for aggregating and processing visual, auditory, and lexical information for many tasks related to human-generated data. Human language makes use of a large and varying number of features, lexical and constructional items as well as contextual and discourse-specific data of various types, which all interact to represent…

    Submitted 1 April, 2021; originally announced April 2021.

    Journal ref: Natural Language Engineering 25, no. 4 (2019): 503-517

  12. arXiv:2103.15953  [pdf, other]

    cs.IR cs.CL

    TREC 2020 Podcasts Track Overview

    Authors: Rosie Jones, Ben Carterette, Ann Clifton, Maria Eskevich, Gareth J. F. Jones, Jussi Karlgren, Aasish Pappu, Sravana Reddy, Yongze Yu

    Abstract: The Podcast Track is new at the Text Retrieval Conference (TREC) in 2020. The podcast track was designed to encourage research into podcasts in the information retrieval and NLP research communities. The track consisted of two shared tasks: segment retrieval and summarization, both based on a dataset of over 100,000 podcast episodes (metadata, audio, and automatic transcripts) which was released c…

    Submitted 29 March, 2021; originally announced March 2021.

    Journal ref: The Proceedings of the Twenty-Ninth Text REtrieval Conference Proceedings (TREC 2020)

  13. arXiv:2011.14037  [pdf, ps, other]

    cs.CL

    Text Mining for Processing Interview Data in Computational Social Science

    Authors: Jussi Karlgren, Renee Li, Eva M Meyersson Milgrom

    Abstract: We use commercially available text analysis technology to process interview text data from a computational social science study. We find that topical clustering and terminological enrichment provide for convenient exploration and quantification of the responses. This makes it possible to generate and test hypotheses and to compare textual and non-textual variables, and saves analyst effort. We enc…

    Submitted 27 November, 2020; originally announced November 2020.

  14. arXiv:2004.04270  [pdf, other]

    cs.CL

    The Spotify Podcast Dataset

    Authors: Ann Clifton, Aasish Pappu, Sravana Reddy, Yongze Yu, Jussi Karlgren, Ben Carterette, Rosie Jones

    Abstract: Podcasts are a relatively new form of audio media. Episodes appear on a regular cadence, and come in many different formats and levels of formality. They can be formal news journalism or conversational chat; fiction or non-fiction. They are rapidly growing in popularity and yet have been relatively little studied. As an audio format, podcasts are more varied in style and production types than, say…

    Submitted 5 December, 2020; v1 submitted 8 April, 2020; originally announced April 2020.

    Comments: 4 pages, 3 figures

  15. arXiv:1612.06671  [pdf, other]

    cs.CL

    Inferring the location of authors from words in their texts

    Authors: Max Berggren, Jussi Karlgren, Robert Östling, Mikael Parkvall

    Abstract: For the purposes of computational dialectology or other geographically bound text analysis tasks, texts must be annotated with their or their authors' location. Many texts are locatable through explicit labels but most have no explicit annotation of place. This paper describes a series of experiments to determine how positionally annotated microblog posts can be used to learn location-indicating w…

    Submitted 20 December, 2016; originally announced December 2016.

    Comments: 8 pages. Presented at NoDaLiDa: the 2015 Nordic Conference on Computational Linguistics

    ACM Class: H.3.1; I.2.7

  16. arXiv:1608.04089  [pdf, ps, other]

    cs.CL cs.IR stat.ML

    Viewpoint and Topic Modeling of Current Events

    Authors: Kerry Zhang, Jussi Karlgren, Cheng Zhang, Jens Lagergren

    Abstract: There are multiple sides to every story, and while statistical topic models have been highly successful at topically summarizing the stories in corpora of text documents, they do not explicitly address the issue of learning the different sides, the viewpoints, expressed in the documents. In this paper, we show how these viewpoints can be learned completely unsupervised and represented in a human i…

    Submitted 14 August, 2016; originally announced August 2016.

    Comments: 16 pages, 4 figures, 4 tables

  17. Stylistic Variation in an Information Retrieval Experiment

    Authors: Jussi Karlgren

    Abstract: Texts exhibit considerable stylistic variation. This paper reports an experiment where a corpus of documents (N = 75 000) is analyzed using various simple stylistic metrics. A subset (n = 1000) of the corpus has been previously assessed to be relevant for answering given information retrieval queries. The experiment shows that this subset differs significantly from the rest of the corpus in terms…

    Submitted 8 August, 1996; originally announced August 1996.

    Comments: Proceedings of NEMLAP-2

  18. Dilemma - An Instant Lexicographer

    Authors: Hans Karlgren, Jussi Karlgren, Magnus Nordström, Paul Pettersson, Bengt Wahrolén

    Abstract: Dilemma is intended to enhance quality and increase productivity of expert human translators by presenting to the writer relevant lexical information mechanically extracted from comparable existing translations, thus replacing - or compensating for the absence of - a lexicographer and stand-by terminologist rather than the translator. Using statistics and crude surface analysis and a minimum of…

    Submitted 21 October, 1994; originally announced October 1994.

    Comments: 3 pages, LaTeX, in proceedings of COLING 94

  19. Recognizing Text Genres with Simple Metrics Using Discriminant Analysis

    Authors: Jussi Karlgren, Douglass Cutting

    Abstract: A simple method for categorizing texts into predetermined text genre categories using the statistical standard technique of discriminant analysis is demonstrated with application to the Brown corpus. Discriminant analysis makes it possible to use a large number of parameters that may be specific for a certain corpus or information stream, and combine them into a small number of functions, with the…

    Submitted 20 October, 1994; originally announced October 1994.

    Comments: 6 pages, LaTeX, In proceedings of COLING 94