-
Are We Wasting Time? A Fast, Accurate Performance Evaluation Framework for Knowledge Graph Link Predictors
Authors:
Filip Cornell,
Yifei Jin,
Jussi Karlgren,
Sarunas Girdzijauskas
Abstract:
The standard evaluation protocol for measuring the quality of Knowledge Graph Completion methods - the task of inferring new links to be added to a graph - typically involves a step which ranks every entity of a Knowledge Graph to assess their fit as a head or tail of a candidate link to be added. In Knowledge Graphs on a larger scale, this task rapidly becomes prohibitively heavy. Previous approaches mitigate this problem by using random sampling of entities to assess the quality of links predicted or suggested by a method. However, we show that this approach has serious limitations since the ranking metrics produced do not properly reflect true outcomes. In this paper, we present a thorough analysis of these effects along with the following findings. First, we empirically find and theoretically motivate why sampling uniformly at random vastly overestimates the ranking performance of a method. We show that this can be attributed to the effect of easy versus hard negative candidates. Second, we propose a framework that uses relational recommenders to guide the selection of candidates for evaluation. We provide both theoretical and empirical justification of our methodology, and find that simple and fast methods can work extremely well, and that they match advanced neural approaches. Even when a large portion of true candidates for a property are missed, the estimation barely deteriorates. With our proposed framework, we can reduce the time and computation needed to a level similar to that of random sampling strategies while vastly improving the estimation; on ogbl-wikikg2, we show that accurate estimations of the full, filtered ranking can be obtained in 20 seconds instead of 30 minutes. We conclude that effective preprocessing and sampling methods can save considerable computational effort while still reliably and accurately estimating the true performance of the entire ranking procedure.
Submitted 25 January, 2024;
originally announced February 2024.
-
Cem Mil Podcasts: A Spoken Portuguese Document Corpus For Multi-modal, Multi-lingual and Multi-Dialect Information Access Research
Authors:
Ekaterina Garmash,
Edgar Tanaka,
Ann Clifton,
Joana Correia,
Sharmistha Jat,
Winstead Zhu,
Rosie Jones,
Jussi Karlgren
Abstract:
In this paper we describe the Portuguese-language podcast dataset we have released for academic research purposes. We give an overview of how the data was sampled, descriptive statistics over the collection, as well as information about the distribution over Brazilian and Portuguese dialects. We give results from experiments on multi-lingual summarization, showing that summarizing podcast transcripts can be performed well by a system supporting both English and Portuguese. We also show experiments on Portuguese podcast genre classification using text metadata. Combining this collection with the previously released English-language collection opens up the potential for multi-modal, multi-lingual and multi-dialect podcast information access research.
Submitted 13 December, 2023; v1 submitted 23 September, 2022;
originally announced September 2022.
-
Unsupervised Speaker Diarization that is Agnostic to Language, Overlap-Aware, and Tuning Free
Authors:
M. Iftekhar Tanveer,
Diego Casabuena,
Jussi Karlgren,
Rosie Jones
Abstract:
Podcasts are conversational in nature and speaker changes are frequent -- requiring speaker diarization for content understanding. We propose an unsupervised technique for speaker diarization without relying on language-specific components. The algorithm is overlap-aware and does not require information about the number of speakers. Our approach shows 79% improvement on purity scores (34% on F-score) against the Google Cloud Platform solution on podcast data.
Submitted 25 July, 2022;
originally announced July 2022.
-
The Contribution of Lyrics and Acoustics to Collaborative Understanding of Mood
Authors:
Shahrzad Naseri,
Sravana Reddy,
Joana Correia,
Jussi Karlgren,
Rosie Jones
Abstract:
In this work, we study the association between song lyrics and mood through a data-driven analysis. Our data set consists of nearly one million songs, with song-mood associations derived from user playlists on the Spotify streaming platform. We take advantage of state-of-the-art natural language processing models based on transformers to learn the association between the lyrics and moods. We find that a pretrained transformer-based language model in a zero-shot setting -- i.e., out of the box with no further training on our data -- is powerful for capturing song-mood associations. Moreover, we illustrate that training on song-mood associations results in a highly accurate model that predicts these associations for unseen songs. Furthermore, by comparing the prediction of a model using lyrics with one using acoustic features, we observe that the relative importance of lyrics for mood prediction in comparison with acoustics depends on the specific mood. Finally, we verify if the models are capturing the same information about lyrics and acoustics as humans through an annotation task where we obtain human judgments of mood-song relevance based on lyrics and acoustics.
Submitted 31 May, 2022;
originally announced July 2022.
-
Conventions and Mutual Expectations -- understanding sources for web genres
Authors:
Jussi Karlgren
Abstract:
Genres can be understood in many different ways. They are often perceived as a primarily sociological construction, or, alternatively, as a stylostatistically observable objective characteristic of texts. The latter view is more common in the research field of information and language technology. These two views can be quite compatible and can inform each other; this present investigation discusses knowledge sources for studying genre variation and change by observing reader and author behaviour rather than performing analyses on the information objects themselves.
Submitted 1 May, 2022;
originally announced May 2022.
-
Textual Stylistic Variation: Choices, Genres and Individuals
Authors:
Jussi Karlgren
Abstract:
This chapter argues for more informed target metrics for the statistical processing of stylistic variation in text collections. Much as operationalised relevance proved a useful goal to strive for in information retrieval, research in textual stylistics, whether application oriented or philologically inclined, needs goals formulated in terms of pertinence, relevance, and utility - notions that agree with reader experience of text. Differences readers are aware of are mostly based on utility - not on textual characteristics per se. Mostly, readers report stylistic differences in terms of genres. Genres, while vague and undefined, are well-established and talked about: very early on, readers learn to distinguish genres. This chapter discusses variation given by genre, and contrasts it to variation occasioned by individual choice.
Submitted 1 May, 2022;
originally announced May 2022.
-
Podcast Metadata and Content: Episode Relevance and Attractiveness in Ad Hoc Search
Authors:
Ben Carterette,
Rosie Jones,
Gareth F. Jones,
Maria Eskevich,
Sravana Reddy,
Ann Clifton,
Yongze Yu,
Jussi Karlgren,
Ian Soboroff
Abstract:
Rapidly growing online podcast archives contain diverse content on a wide range of topics. These archives form an important resource for entertainment and professional use, but their value can only be realized if users can rapidly and reliably locate content of interest. Search for relevant content can be based on metadata provided by content creators, but also on transcripts of the spoken content itself. Excavating relevant content from deep within these audio streams for diverse types of information needs requires varying the approach to systems prototyping. We describe a set of diverse podcast information needs and different approaches to assessing retrieved content for relevance. We use these information needs in an investigation of the utility and effectiveness of these information sources. Based on our analysis, we recommend approaches for indexing and retrieving podcast content for ad hoc search.
Submitted 25 August, 2021;
originally announced August 2021.
-
Socially Intelligent Interfaces for Increased Energy Awareness in the Home
Authors:
Jussi Karlgren,
Lennart E. Fahlén,
Anders Wallberg,
Pär Hansson,
Olov Ståhl,
Jonas Söderberg,
Karl-Petter Åkesson
Abstract:
This paper describes how home appliances might be enhanced to improve user awareness of energy usage. Households wish to lead comfortable and manageable lives. Balancing this reasonable desire with the environmental and political goal of reducing electricity usage is a challenge that we claim is best met through the design of interfaces that allow users better control of their usage and unobtrusively inform them of the actions of their peers. A set of design principles along these lines is formulated in this paper. We have built a fully functional prototype home appliance with a socially aware interface to signal the aggregate usage of the user's peer group according to these principles, and present the prototype in the paper.
Submitted 29 June, 2021;
originally announced June 2021.
-
Current Challenges and Future Directions in Podcast Information Access
Authors:
Rosie Jones,
Hamed Zamani,
Markus Schedl,
Ching-Wei Chen,
Sravana Reddy,
Ann Clifton,
Jussi Karlgren,
Helia Hashemi,
Aasish Pappu,
Zahra Nazari,
Longqi Yang,
Oguz Semerci,
Hugues Bouchard,
Ben Carterette
Abstract:
Podcasts are spoken documents across a wide range of genres and styles, with growing listenership across the world, and a rapidly lowering barrier to entry for both listeners and creators. The great strides in search and recommendation in research and industry have yet to see impact in the podcast space, where recommendations are still largely driven by word of mouth. In this perspective paper, we highlight the many differences between podcasts and other media, and discuss our perspective on challenges and future research directions in the domain of podcast information access.
Submitted 16 June, 2021;
originally announced June 2021.
-
How Lexical Gold Standards Have Effects On The Usefulness Of Text Analysis Tools For Digital Scholarship
Authors:
Jussi Karlgren
Abstract:
This paper describes how the current lexical similarity and analogy gold standards are built to conform to certain ideas about what the models they are designed to evaluate are used for. Topical relevance has always been the most important target notion for information access tools and related language technologies, and while this has proven a useful starting point for much of what information technology is used for, it does not always align well with other uses to which technologies are being put, most notably use cases from digital scholarship in the humanities or social sciences. This paper argues for more systematic formulation of requirements from the digital humanities and social sciences and more explicit description of the assumptions underlying model design.
Submitted 31 May, 2021;
originally announced May 2021.
-
High-dimensional distributed semantic spaces for utterances
Authors:
Jussi Karlgren,
Pentti Kanerva
Abstract:
High-dimensional distributed semantic spaces have proven useful and effective for aggregating and processing visual, auditory, and lexical information for many tasks related to human-generated data. Human language makes use of a large and varying number of features, lexical and constructional items as well as contextual and discourse-specific data of various types, which all interact to represent various aspects of communicative information. Some of these features are mostly local and useful for the organisation of e.g. argument structure of a predication; others are persistent over the course of a discourse and necessary for achieving a reasonable level of understanding of the content. This paper describes a model for high-dimensional representation for utterance and text level data including features such as constructions or contextual data, based on a mathematically principled and behaviourally plausible approach to representing linguistic information. The implementation of the representation is a straightforward extension of Random Indexing models previously used for lexical linguistic items. The paper shows how the implemented model is able to represent a broad range of linguistic features in a common integral framework of fixed dimensionality, which is computationally habitable, and which is suitable as a bridge between symbolic representations such as dependency analysis and continuous representations used e.g. in classifiers or further machine-learning approaches. This is achieved with operations on vectors that constitute a powerful computational algebra, accompanied by an associative memory for the vectors. The paper provides a technical overview of the framework and a worked-through implemented example of how it can be applied to various types of linguistic features.
Submitted 1 April, 2021;
originally announced April 2021.
-
TREC 2020 Podcasts Track Overview
Authors:
Rosie Jones,
Ben Carterette,
Ann Clifton,
Maria Eskevich,
Gareth J. F. Jones,
Jussi Karlgren,
Aasish Pappu,
Sravana Reddy,
Yongze Yu
Abstract:
The Podcast Track is new at the Text Retrieval Conference (TREC) in 2020. The podcast track was designed to encourage research into podcasts in the information retrieval and NLP research communities. The track consisted of two shared tasks: segment retrieval and summarization, both based on a dataset of over 100,000 podcast episodes (metadata, audio, and automatic transcripts) which was released concurrently with the track. The track generated considerable interest and attracted hundreds of new registrations to TREC, and fifteen teams, mostly disjoint between search and summarization, made final submissions for assessment. Deep learning was the dominant experimental approach for both search experiments and summarization. This paper gives an overview of the tasks and the results of the participants' experiments. The track will return to TREC 2021 with the same two tasks, incorporating slight modifications in response to participant feedback.
Submitted 29 March, 2021;
originally announced March 2021.
-
Text Mining for Processing Interview Data in Computational Social Science
Authors:
Jussi Karlgren,
Renee Li,
Eva M Meyersson Milgrom
Abstract:
We use commercially available text analysis technology to process interview text data from a computational social science study. We find that topical clustering and terminological enrichment provide for convenient exploration and quantification of the responses. This makes it possible to generate and test hypotheses and to compare textual and non-textual variables, and saves analyst effort. We encourage studies in social science to use text analysis, especially for exploratory open-ended studies. We discuss how replicability requirements are met by text analysis technology. We note that the most recent learning models are not designed with transparency in mind, and that research requires a model to be editable and its decisions to be explainable. The tools available today, such as the one used in the present study, are not built for processing interview texts. While many of the variables under consideration are quantifiable using lexical statistics, we find that some interesting and potentially valuable features are difficult or impossible to automatise reliably at present. We note that there are some potentially interesting applications for traditional natural language processing mechanisms such as named entity recognition and anaphora resolution in this application area. We conclude with a suggestion for language technologists to investigate the challenge of processing interview data comprehensively, especially the interplay between question and response, and we encourage social science researchers not to hesitate to use text analysis tools, especially for the exploratory phase of processing interview data.
Submitted 27 November, 2020;
originally announced November 2020.
-
The Spotify Podcast Dataset
Authors:
Ann Clifton,
Aasish Pappu,
Sravana Reddy,
Yongze Yu,
Jussi Karlgren,
Ben Carterette,
Rosie Jones
Abstract:
Podcasts are a relatively new form of audio media. Episodes appear on a regular cadence, and come in many different formats and levels of formality. They can be formal news journalism or conversational chat; fiction or non-fiction. They are rapidly growing in popularity and yet have been relatively little studied. As an audio format, podcasts are more varied in style and production types than, say, broadcast news, and contain many more genres than typically studied in video research. The medium is therefore a rich domain with many research avenues for the IR and NLP communities. We present the Spotify Podcast Dataset, a set of approximately 100K podcast episodes comprised of raw audio files along with accompanying ASR transcripts. This represents over 47,000 hours of transcribed audio, and is an order of magnitude larger than previous speech-to-text corpora.
Submitted 5 December, 2020; v1 submitted 8 April, 2020;
originally announced April 2020.
-
Inferring the location of authors from words in their texts
Authors:
Max Berggren,
Jussi Karlgren,
Robert Östling,
Mikael Parkvall
Abstract:
For the purposes of computational dialectology or other geographically bound text analysis tasks, texts must be annotated with their or their authors' location. Many texts are locatable through explicit labels but most have no explicit annotation of place. This paper describes a series of experiments to determine how positionally annotated microblog posts can be used to learn location-indicating words which then can be used to locate blog texts and their authors. A Gaussian distribution is used to model the locational qualities of words. We introduce the notion of placeness to describe how locational words are.
We find that modelling word distributions to account for several locations and thus several Gaussian distributions per word, defining a filter which picks out words with high placeness based on their local distributional context, and aggregating locational information in a centroid for each text gives the most useful results. The results are applied to data in the Swedish language.
Submitted 20 December, 2016;
originally announced December 2016.
-
Viewpoint and Topic Modeling of Current Events
Authors:
Kerry Zhang,
Jussi Karlgren,
Cheng Zhang,
Jens Lagergren
Abstract:
There are multiple sides to every story, and while statistical topic models have been highly successful at topically summarizing the stories in corpora of text documents, they do not explicitly address the issue of learning the different sides, the viewpoints, expressed in the documents. In this paper, we show how these viewpoints can be learned completely unsupervised and represented in a human interpretable form. We use a novel approach of applying CorrLDA2 for this purpose, which learns topic-viewpoint relations that can be used to form groups of topics, where each group represents a viewpoint. A corpus of documents about the Israeli-Palestinian conflict is then used to demonstrate how a Palestinian and an Israeli viewpoint can be learned. By leveraging the magnitudes and signs of the feature weights of a linear SVM, we introduce a principled method to evaluate associations between topics and viewpoints. With this, we demonstrate, both quantitatively and qualitatively, that the learned topic groups are contextually coherent, and form consistently correct topic-viewpoint associations.
Submitted 14 August, 2016;
originally announced August 2016.
-
Stylistic Variation in an Information Retrieval Experiment
Authors:
Jussi Karlgren
Abstract:
Texts exhibit considerable stylistic variation. This paper reports an experiment where a corpus of documents (N = 75 000) is analyzed using various simple stylistic metrics. A subset (n = 1000) of the corpus has been previously assessed to be relevant for answering given information retrieval queries. The experiment shows that this subset differs significantly from the rest of the corpus in terms of the stylistic metrics studied.
Submitted 8 August, 1996;
originally announced August 1996.
-
Dilemma - An Instant Lexicographer
Authors:
Hans Karlgren,
Jussi Karlgren,
Magnus Nordström,
Paul Pettersson,
Bengt Wahrolén
Abstract:
Dilemma is intended to enhance quality and increase productivity of expert human translators by presenting to the writer relevant lexical information mechanically extracted from comparable existing translations, thus replacing - or compensating for the absence of - a lexicographer and stand-by terminologist rather than the translator. Using statistics and crude surface analysis and a minimum of prior information, Dilemma identifies instances and suggests their counterparts in parallel source and target texts, on all levels down to individual words. Dilemma forms part of a tool kit for translation where focus is on text structure and over-all consistency in large text volumes rather than on framing sentences, on interaction between many actors in a large project rather than on retrieval of machine-stored data and on decision making rather than on application of given rules. In particular, the system has been tuned to the needs of the ongoing translation of European Community legislation into the languages of candidate member countries. The system has been demonstrated to and used by professional translators with promising results.
Submitted 21 October, 1994;
originally announced October 1994.
-
Recognizing Text Genres with Simple Metrics Using Discriminant Analysis
Authors:
Jussi Karlgren,
Douglass Cutting
Abstract:
A simple method for categorizing texts into predetermined text genre categories using the statistical standard technique of discriminant analysis is demonstrated with application to the Brown corpus. Discriminant analysis makes it possible to use a large number of parameters that may be specific for a certain corpus or information stream, and combine them into a small number of functions, with the parameters weighted on the basis of how useful they are for discriminating text genres. An application to information retrieval is discussed.
Submitted 20 October, 1994;
originally announced October 1994.