-
A General Model for Aggregating Annotations Across Simple, Complex, and Multi-Object Annotation Tasks
Authors:
Alexander Braylan,
Madalyn Marabella,
Omar Alonso,
Matthew Lease
Abstract:
Human annotations are vital to supervised learning, yet annotators often disagree on the correct label, especially as annotation tasks increase in complexity. A strategy to improve label quality is to ask multiple annotators to label the same item and aggregate their labels. Many aggregation models have been proposed for categorical or numerical annotation tasks, but far less work has considered more complex annotation tasks involving open-ended, multivariate, or structured responses. While a variety of bespoke models have been proposed for specific tasks, our work is the first to introduce aggregation methods that generalize across many diverse complex tasks, including sequence labeling, translation, syntactic parsing, ranking, bounding boxes, and keypoints. This generality is achieved by devising a task-agnostic method to model distances between labels rather than the labels themselves.
This article extends our prior work with investigation of three new research questions. First, how do complex annotation properties impact aggregation accuracy? Second, how should a task owner navigate the many modeling choices to maximize aggregation accuracy? Finally, what diagnoses can verify that aggregation models are specified correctly for the given data? To understand how various factors impact accuracy and to inform model selection, we conduct simulation studies and experiments on real, complex datasets. Regarding testing, we introduce unit tests for aggregation models and present a suite of such tests to ensure that a given model is not mis-specified and exhibits expected behavior.
Beyond investigating the research questions above, we discuss the foundational concept of annotation complexity, present a new aggregation model as a bridge between traditional models and our own, and contribute a new semi-supervised learning method for complex label aggregation that outperforms prior work.
Submitted 20 December, 2023;
originally announced December 2023.
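To make the idea of modeling distances between labels concrete, here is a minimal sketch of a simple distance-based baseline: for one item, pick the annotation with the smallest average distance to the other annotations. This is an illustrative assumption, not the authors' model; the function names and the choice of 1 - IoU as the bounding-box distance are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou_distance(a: Box, b: Box) -> float:
    """Distance between two boxes, defined here as 1 - intersection-over-union."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return 1.0 - inter / union if union > 0 else 0.0


def smallest_average_distance(annotations: Sequence, distance: Callable) -> object:
    """Return the annotation whose mean distance to all others is smallest.

    Only the distance function knows about the label type, so the same
    selection logic applies to boxes, rankings, sequences, and so on.
    """
    if not annotations:
        raise ValueError("need at least one annotation")
    if len(annotations) == 1:
        return annotations[0]
    best, best_score = annotations[0], float("inf")
    for i, candidate in enumerate(annotations):
        others = [a for j, a in enumerate(annotations) if j != i]
        score = sum(distance(candidate, o) for o in others) / len(others)
        if score < best_score:
            best, best_score = candidate, score
    return best


# Three roughly agreeing boxes and one outlier for the same object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (2, 2, 12, 12), (50, 50, 60, 60)]
chosen = smallest_average_distance(boxes, iou_distance)
```

Swapping in a different distance function (for example, an edit distance for translations) would reuse the same selection code unchanged, which is the sense in which such a method is task-agnostic.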
-
Measuring Annotator Agreement Generally across Complex Structured, Multi-object, and Free-text Annotation Tasks
Authors:
Alexander Braylan,
Omar Alonso,
Matthew Lease
Abstract:
When annotators label data, a key metric for quality assurance is inter-annotator agreement (IAA): the extent to which annotators agree on their labels. Though many IAA measures exist for simple categorical and ordinal labeling tasks, relatively little work has considered more complex labeling tasks, such as structured, multi-object, and free-text annotations. Krippendorff's alpha, best known for use with simpler labeling tasks, does have a distance-based formulation with broader applicability, but little work has studied its efficacy and consistency across complex annotation tasks.
We investigate the design and evaluation of IAA measures for complex annotation tasks, with evaluation spanning seven diverse tasks: image bounding boxes, image keypoints, text sequence tagging, ranked lists, free text translations, numeric vectors, and syntax trees. We identify the difficulty of interpretability and the complexity of choosing a distance function as key obstacles in applying Krippendorff's alpha generally across these tasks. We propose two novel, more interpretable measures, showing they yield more consistent IAA measures across tasks and annotation distance functions.
Submitted 15 December, 2022;
originally announced December 2022.
-
Local versus Global Strategies in Social Query Expansion
Authors:
Omar Alonso,
Vasileios Kandylas,
Serge-Eric Tremblay
Abstract:
Link sharing in social media can be seen as a collaboratively retrieved set of documents for a query or topic expressed by a hashtag. Temporal information plays an important role for identifying the correct context for which such annotations are valid for retrieval purposes. We investigate how social data as temporal context can be used for query expansion and compare global versus local strategies for computing such contextual information for a set of hashtags.
Submitted 5 August, 2019;
originally announced August 2019.
-
Scalable Knowledge Graph Construction from Twitter
Authors:
Omar Alonso,
Vasileios Kandylas,
Serge-Eric Tremblay
Abstract:
We describe a knowledge graph derived from Twitter data with the goal of discovering relationships between people, links, and topics. The goal is to filter out noise from Twitter and surface an inside-out view that relies on high quality content. The generated graph contains many relationships where the user can query and traverse the structure from different angles allowing the development of new applications.
Submitted 13 June, 2019;
originally announced June 2019.
-
Label Visualization and Exploration in IR
Authors:
Omar Alonso
Abstract:
There is a renaissance in visual analytics systems for data analysis and sharing, in particular, in the current wave of big data applications. We introduce RAVE, a prototype that automates the generation of an interface that uses facets and visualization techniques for exploring and analyzing relevance assessments data sets collected via crowdsourcing. We present a technical description of the main components and demonstrate its use.
Submitted 10 December, 2016;
originally announced December 2016.
-
How Many Workers to Ask? Adaptive Exploration for Collecting High Quality Labels
Authors:
Ittai Abraham,
Omar Alonso,
Vasilis Kandylas,
Rajesh Patel,
Steven Shelford,
Aleksandrs Slivkins
Abstract:
Crowdsourcing has been part of the IR toolbox as a cheap and fast mechanism to obtain labels for system development and evaluation. Successful deployment of crowdsourcing at scale involves adjusting many variables, a very important one being the number of workers needed per human intelligence task (HIT). We consider the crowdsourcing task of learning the answer to simple multiple-choice HITs, which are representative of many relevance experiments. In order to provide statistically significant results, one often needs to ask multiple workers to answer the same HIT. A stopping rule is an algorithm that, given a HIT, decides for any given set of worker answers if the system should stop and output an answer or iterate and ask one more worker. Knowing the historic performance of a worker in the form of a quality score can be beneficial in such a scenario. In this paper we investigate how to devise better stopping rules given such quality scores. We also suggest adaptive exploration as a promising approach for scalable and automatic creation of ground truth. We conduct a data analysis on an industrial crowdsourcing platform, and use the observations from this analysis to design new stopping rules that use the workers' quality scores in a non-trivial manner. We then perform a simulation based on a real-world workload, showing that our algorithm performs better than the more naive approaches.
Submitted 19 May, 2016; v1 submitted 1 November, 2014;
originally announced November 2014.
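As a rough illustration of what a stopping rule for a multiple-choice HIT looks like, the sketch below weights each answer by the worker's quality score and stops once the leading choice is ahead by a fixed margin, or once a worker budget is exhausted. This is a deliberately simple rule written under assumed names and parameters (`should_stop`, `threshold`, `max_workers`); it is not one of the rules proposed in the paper.

```python
from collections import defaultdict
from typing import List, Optional, Tuple


def should_stop(
    answers: List[Tuple[str, float]],
    threshold: float,
    max_workers: int,
) -> Tuple[bool, Optional[str]]:
    """Decide whether to stop asking workers about one HIT.

    answers: (choice, worker_quality) pairs collected so far.
    threshold: required lead, in summed quality, of the top choice
        over the runner-up (hypothetical tuning parameter).
    max_workers: hard cap on the number of workers per HIT.
    Returns (stop, current_leading_choice).
    """
    if not answers:
        return False, None

    # Sum quality scores per choice, so reliable workers count for more.
    totals = defaultdict(float)
    for choice, quality in answers:
        totals[choice] += quality

    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    leader, top_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0

    stop = (top_score - runner_up_score) >= threshold or len(answers) >= max_workers
    return stop, leader


# One confident worker is not enough; a second agreeing worker is.
first = should_stop([("A", 0.9)], threshold=1.5, max_workers=5)
second = should_stop([("A", 0.9), ("A", 0.8)], threshold=1.5, max_workers=5)
```

In use, the caller would loop: collect one answer, call `should_stop`, and either output the leading choice or request another worker.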
-
A Study on Placement of Social Buttons in Web Pages
Authors:
Omar Alonso,
Vasilis Kandylas
Abstract:
With the explosion of social media in the last few years, web pages nowadays include different social network buttons where users can express whether they support or recommend content. Those social buttons are very visual and their presentations, along with the counters, mark the importance of the social network and the interest in the content. In this paper, we analyze the presence of four types of social buttons (Facebook, Twitter, Google+1, and LinkedIn) in a large collection of web pages that we tracked over a period of time. We report on the distribution and counts along with some characteristics per domain. Finally, we outline some research directions.
Submitted 10 October, 2014;
originally announced October 2014.
-
CrowdSTAR: A Social Task Routing Framework for Online Communities
Authors:
Besmira Nushi,
Omar Alonso,
Martin Hentschel,
Vasileios Kandylas
Abstract:
The online communities available on the Web have been shown to be significantly interactive and capable of collectively solving difficult tasks. Nevertheless, it is still a challenge to decide how a task should be dispatched through the network due to the high diversity of the communities and the dynamically changing expertise and social availability of their members. We introduce CrowdSTAR, a framework designed to route tasks across and within online crowds. CrowdSTAR indexes the topic-specific expertise and social features of the crowd contributors and then uses a routing algorithm that suggests the best sources to ask based on the knowledge vs. availability trade-offs. We experimented with the proposed framework for question answering scenarios by using two popular social networks as crowd candidates: Twitter and Quora.
Submitted 24 July, 2014;
originally announced July 2014.
-
A Data Management Approach for Dataset Selection Using Human Computation
Authors:
Alexandros Ntoulas,
Omar Alonso,
Vasilis Kandylas
Abstract:
As the number of applications that use machine learning algorithms increases, the need for labeled data useful for training such algorithms intensifies.
Getting labels typically involves employing humans to do the annotation, which directly translates to training and working costs. Crowdsourcing platforms have made labeling cheaper and faster, but they still involve significant costs, especially for the cases where the potential set of candidate data to be labeled is large. In this paper we describe a methodology and a prototype system aiming at addressing this challenge for Web-scale problems in an industrial setting. We discuss ideas on how to efficiently select the data to use for training of machine learning algorithms in an attempt to reduce cost. We show results achieving good performance with reduced cost by carefully selecting which instances to label. Our proposed algorithm is presented as part of a framework for managing and generating training datasets, which includes, among other components, a human computation element.
Submitted 13 July, 2013;
originally announced July 2013.
-
Adaptive Crowdsourcing Algorithms for the Bandit Survey Problem
Authors:
Ittai Abraham,
Omar Alonso,
Vasilis Kandylas,
Aleksandrs Slivkins
Abstract:
Very recently crowdsourcing has become the de facto platform for distributing and collecting human computation for a wide range of tasks and applications such as information retrieval, natural language processing and machine learning. Current crowdsourcing platforms have some limitations in the area of quality control. Most of the effort to ensure good quality has to be done by the experimenter who has to manage the number of workers needed to reach good results.
We propose a simple model for adaptive quality control in crowdsourced multiple-choice tasks, which we call the bandit survey problem. This model is related to, but technically different from, the well-known multi-armed bandit problem. We present several algorithms for this problem, and support them with analysis and simulations. Our approach is based on our experience conducting relevance evaluation for a large commercial search engine.
Submitted 20 May, 2013; v1 submitted 13 February, 2013;
originally announced February 2013.
-
User Taglines: Alternative Presentations of Expertise and Interest in Social Media
Authors:
Hemant Purohit,
Alex Dow,
Omar Alonso,
Lei Duan,
Kevin Haas
Abstract:
Web applications increasingly show recommended users from social media along with short descriptions that attempt to convey relevance, that is, why those users are being shown. For example, a Twitter search for a topical keyword shows expert twitterers on the side under 'whom to follow'. Google+ and Facebook also recommend users to follow or add to a friend circle. The Huffington Post, a popular Internet newspaper, shows Twitter influencers and experts beside an article for authoritative, relevant tweets. The state of the art uses profile bios as summaries for Twitter experts, but this approach suffers from length constraints imposed by user interface (UI) design, missing bios, and sometimes funny bios. Alternatively, applications can use human-generated user summaries, but this does not scale. Therefore, we study the problem of automatically generating informative expertise summaries, or taglines, for Twitter experts within the space constraints imposed by UI design. We propose three methods for expertise summary generation (Occupation-Pattern based, Link-Triangulation based, and User-Classification based) that use knowledge-enhanced computing approaches. We also propose methods for final summary selection for users with multiple candidate summaries. We evaluate the proposed approaches through a user study comprising a number of experiments. Our results show promising quality: 92.8% good summaries with majority agreement in the best case and 70% with majority agreement in the worst case. Our approaches also outperform the state of the art by up to 88%. This study has implications for expert profiling, user presentation, and application design for engaging user experience.
△ Less
Submitted 9 December, 2012;
originally announced December 2012.