Language-Agnostic Modeling of Wikipedia Articles for Content Quality Assessment across Languages

Authors: Paramita Das, Isaac Johnson, Diego Saez-Trumper, Pablo Aragón

Abstract: Wikipedia is the largest web repository of free knowledge. Volunteer editors devote time and effort to creating and expanding articles in more than 300 language editions. As content quality varies from article to article, editors also spend substantial time rating articles with specific criteria. However, keeping these assessments complete and up-to-date is largely impossible given the ever-changing nature of Wikipedia. To overcome this limitation, we propose a novel computational framework for modeling the quality of Wikipedia articles. State-of-the-art approaches to model Wikipedia article quality have leveraged machine learning techniques with language-specific features. In contrast, our framework is based on language-agnostic structural features extracted from the articles, a set of universal weights, and a language version-specific normalization criterion. Therefore, we ensure that all language editions of Wikipedia can benefit from our framework, even those that do not have their own quality assessment scheme. Using this framework, we have built datasets with the feature values and quality scores of all revisions of all articles in the existing language versions of Wikipedia. We provide a descriptive analysis of these resources and a benchmark of our framework. In addition, we discuss possible downstream tasks to be addressed with these datasets, which are released for public use.

Submitted 15 April, 2024; originally announced April 2024.

Comments: Accepted at ICWSM-24

arXiv:2309.00196 [pdf, other]

doi 10.1145/3583780.3615254

A Comparative Study of Reference Reliability in Multiple Language Editions of Wikipedia

Authors: Aitolkyn Baigutanova, Diego Saez-Trumper, Miriam Redi, Meeyoung Cha, Pablo Aragón

Abstract: Information presented in Wikipedia articles must be attributable to reliable published sources in the form of references. This study examines over 5 million Wikipedia articles to assess the reliability of references in multiple language editions. We quantify the cross-lingual patterns of the perennial sources list, a collection of reliability labels for web domains identified and collaboratively agreed upon by Wikipedia editors. We discover that some sources (or web domains) deemed untrustworthy in one language (i.e., English) continue to appear in articles in other languages. This trend is especially evident with sources tailored for smaller communities. Furthermore, non-authoritative sources found in the English version of a page tend to persist in other language versions of that page. We finally present a case study on the Chinese, Russian, and Swedish Wikipedias to demonstrate a discrepancy in reference reliability across cultures. Our finding highlights future challenges in coordinating global knowledge on source reliability.

Submitted 4 September, 2023; v1 submitted 31 August, 2023; originally announced September 2023.

Comments: Conference on Information & Knowledge Management (CIKM '23)

arXiv:2306.01650 [pdf, other]

Fair multilingual vandalism detection system for Wikipedia

Authors: Mykola Trokhymovych, Muniza Aslam, Ai-Jou Chou, Ricardo Baeza-Yates, Diego Saez-Trumper

Abstract: This paper presents a novel design of the system aimed at supporting the Wikipedia community in addressing vandalism on the platform. To achieve this, we collected a massive dataset of 47 languages, and applied advanced filtering and feature engineering techniques, including multilingual masked language modeling to build the training dataset from human-generated data. The performance of the system was evaluated through comparison with the one used in production in Wikipedia, known as ORES. Our research results in a significant increase in the number of languages covered, making Wikipedia patrolling more efficient to a wider range of communities. Furthermore, our model outperforms ORES, ensuring that the results provided are not only more accurate but also less biased against certain groups of contributors.

Submitted 2 June, 2023; originally announced June 2023.

arXiv:2303.05227 [pdf, other]

doi 10.1145/3543507.3583218

Longitudinal Assessment of Reference Quality on Wikipedia

Authors: Aitolkyn Baigutanova, Jaehyeon Myung, Diego Saez-Trumper, Ai-Jou Chou, Miriam Redi, Changwook Jung, Meeyoung Cha

Abstract: Wikipedia plays a crucial role in the integrity of the Web. This work analyzes the reliability of this global encyclopedia through the lens of its references. We operationalize the notion of reference quality by defining reference need (RN), i.e., the percentage of sentences missing a citation, and reference risk (RR), i.e., the proportion of non-authoritative references. We release Citation Detective, a tool for automatically calculating the RN score, and discover that the RN score has dropped by 20 percent point in the last decade, with more than half of verifiable statements now accompanying references. The RR score has remained below 1% over the years as a result of the efforts of the community to eliminate unreliable references. We propose pairing novice and experienced editors on the same Wikipedia article as a strategy to enhance reference quality. Our quasi-experiment indicates that such a co-editing experience can result in a lasting advantage in identifying unreliable sources in future edits. As Wikipedia is frequently used as the ground truth for numerous Web applications, our findings and suggestions on its reliability can have a far-reaching impact. We discuss the possibility of other Web services adopting Wiki-style user collaboration to eliminate unreliable content.

Submitted 9 March, 2023; originally announced March 2023.

Comments: Published at the Web Conference 2023 (WWW '23)

Journal ref: Proceedings of the ACM Web Conference 2023 (WWW '23), May 1-5, 2023, Austin, TX, USA. ACM

arXiv:2111.08543 [pdf, other]

WikiContradiction: Detecting Self-Contradiction Articles on Wikipedia

Authors: Cheng Hsu, Cheng-Te Li, Diego Saez-Trumper, Yi-Zhan Hsu

Abstract: While Wikipedia has been utilized for fact-checking and claim verification to debunk misinformation and disinformation, it is essential to either improve article quality and rule out noisy articles. Self-contradiction is one of the low-quality article types in Wikipedia. In this work, we propose a task of detecting self-contradiction articles in Wikipedia. Based on the "self-contradictory" template, we create a novel dataset for the self-contradiction detection task. Conventional contradiction detection focuses on comparing pairs of sentences or claims, but self-contradiction detection needs to further reason the semantics of an article and simultaneously learn the contradiction-aware comparison from all pairs of sentences. Therefore, we present the first model, Pairwise Contradiction Neural Network (PCNN), to not only effectively identify self-contradiction articles, but also highlight the most contradiction pairs of contradiction sentences. The main idea of PCNN is two-fold. First, to mitigate the effect of data scarcity on self-contradiction articles, we pre-train the module of pairwise contradiction learning using SNLI and MNLI benchmarks. Second, we select top-K sentence pairs with the highest contradiction probability values and model their correlation to determine whether the corresponding article belongs to self-contradiction. Experiments conducted on the proposed WikiContradiction dataset exhibit that PCNN can generate promising performance and comprehensively highlight the sentence pairs the contradiction locates.

Submitted 16 November, 2021; originally announced November 2021.

Comments: Published at IEEE BigData 2021 (regular paper). Data and code can be access via: https://github.com/Wiki-Contradictory/Wiki-Self-Contradictory/

arXiv:2109.00835 [pdf, other]

WikiCheck: An end-to-end open source Automatic Fact-Checking API based on Wikipedia

Authors: Mykola Trokhymovych, Diego Saez-Trumper

Abstract: With the growth of fake news and disinformation, the NLP community has been working to assist humans in fact-checking. However, most academic research has focused on model accuracy without paying attention to resource efficiency, which is crucial in real-life scenarios. In this work, we review the State-of-the-Art datasets and solutions for Automatic Fact-checking and test their applicability in production environments. We discover overfitting issues in those models, and we propose a data filtering method that improves the model's performance and generalization. Then, we design an unsupervised fine-tuning of the Masked Language models to improve its accuracy working with Wikipedia. We also propose a novel query enhancing method to improve evidence discovery using the Wikipedia Search API. Finally, we present a new fact-checking system, the WikiCheck API that automatically performs a facts validation process based on the Wikipedia knowledge base. It is comparable to SOTA solutions in terms of accuracy and can be used on low-memory CPU instances.

Submitted 2 September, 2021; originally announced September 2021.

arXiv:2106.15940 [pdf, other]

A preliminary approach to knowledge integrity risk assessment in Wikipedia projects

Authors: Pablo Aragón, Diego Sáez-Trumper

Abstract: Wikipedia is one of the main repositories of free knowledge available today, with a central role in the Web ecosystem. For this reason, it can also be a battleground for actors trying to impose specific points of view or even spreading disinformation online. There is a growing need to monitor its "health" but this is not an easy task. Wikipedia exists in over 300 language editions and each project is maintained by a different community, with their own strengths, weaknesses and limitations. In this paper, we introduce a taxonomy of knowledge integrity risks across Wikipedia projects and a first set of indicators to assess internal risks related to community and content issues, as well as external threats such as the geopolitical and media landscape. On top of this taxonomy, we offer a preliminary analysis illustrating how the lack of editors' geographical diversity might represent a knowledge integrity risk. These are the first steps of a research project to build a Wikipedia Knowledge Integrity Risk Observatory.

Submitted 30 June, 2021; originally announced June 2021.

Comments: Accepted at MIS2'21: Misinformation and Misbehavior Mining on the Web Workshop held in conjunction with KDD 2021

arXiv:2105.04117 [pdf, other]

doi 10.1145/3404835.3463253

Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia

Authors: KayYen Wong, Miriam Redi, Diego Saez-Trumper

Abstract: Wikipedia is the largest online encyclopedia, used by algorithms and web users as a central hub of reliable information on the web. The quality and reliability of Wikipedia content is maintained by a community of volunteer editors. Machine learning and information retrieval algorithms could help scale up editors' manual efforts around Wikipedia content reliability. However, there is a lack of large-scale data to support the development of such research. To fill this gap, in this paper, we propose Wiki-Reliability, the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues. To build this dataset, we rely on Wikipedia "templates". Templates are tags used by expert Wikipedia editors to indicate content issues, such as the presence of "non-neutral point of view" or "contradictory articles", and serve as a strong signal for detecting reliability issues in a revision. We select the 10 most popular reliability-related templates on Wikipedia, and propose an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative with respect to each template. Each positive/negative example in the dataset comes with the full article text and 20 features from the revision's metadata. We provide an overview of the possible downstream tasks enabled by such data, and show that Wiki-Reliability can be used to train large-scale models for content reliability prediction. We release all data and code for public use.

Submitted 1 June, 2021; v1 submitted 10 May, 2021; originally announced May 2021.

Comments: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '21), 2021

arXiv:2103.16613 [pdf, other]

Tracking Knowledge Propagation Across Wikipedia Languages

Authors: Roldolfo Valentim, Giovanni Comarela, Souneil Park, Diego Saez-Trumper

Abstract: In this paper, we present a dataset of inter-language knowledge propagation in Wikipedia. Covering the entire 309 language editions and 33M articles, the dataset aims to track the full propagation history of Wikipedia concepts, and allow follow up research on building predictive models of them. For this purpose, we align all the Wikipedia articles in a language-agnostic manner according to the concept they cover, which results in 13M propagation instances. To the best of our knowledge, this dataset is the first to explore the full inter-language propagation at a large scale. Together with the dataset, a holistic overview of the propagation and key insights about the underlying structural factors are provided to aid future research. For example, we find that although long cascades are unusual, the propagation tends to continue further once it reaches more than four language editions. We also find that the size of language editions is associated with the speed of propagation. We believe the dataset not only contributes to the prior literature on Wikipedia growth but also enables new use cases such as edit recommendation for addressing knowledge gaps, detection of disinformation, and cultural relationship analysis.

Submitted 30 March, 2021; originally announced March 2021.

Journal ref: 15th International Conference on Web and Social Media (ICWSM-21), 2021

arXiv:2103.00068 [pdf, other]

Language-agnostic Topic Classification for Wikipedia

Authors: Isaac Johnson, Martin Gerlach, Diego Sáez-Trumper

Abstract: A major challenge for many analyses of Wikipedia dynamics -- e.g., imbalances in content quality, geographic differences in what content is popular, what types of articles attract more editor discussion -- is grouping the very diverse range of Wikipedia articles into coherent, consistent topics. This problem has been addressed using various approaches based on Wikipedia's category network, WikiProjects, and external taxonomies. However, these approaches have always been limited in their coverage: typically, only a small subset of articles can be classified, or the method cannot be applied across (the more than 300) languages on Wikipedia. In this paper, we propose a language-agnostic approach based on the links in an article for classifying articles into a taxonomy of topics that can be easily applied to (almost) any language and article on Wikipedia. We show that it matches the performance of a language-dependent approach while being simpler and having much greater coverage.

Submitted 26 February, 2021; originally announced March 2021.

Comments: Accepted to WikiWorkshop at The Web Conference 2021

arXiv:2009.11771 [pdf, other]

Scalable Recommendation of Wikipedia Articles to Editors Using Representation Learning

Authors: Oleksii Moskalenko, Denis Parra, Diego Saez-Trumper

Abstract: Wikipedia is edited by volunteer editors around the world. Considering the large amount of existing content (e.g. over 5M articles in English Wikipedia), deciding what to edit next can be difficult, both for experienced users that usually have a huge backlog of articles to prioritize, as well as for newcomers who that might need guidance in selecting the next article to contribute. Therefore, helping editors to find relevant articles should improve their performance and help in the retention of new editors. In this paper, we address the problem of recommending relevant articles to editors. To do this, we develop a scalable system on top of Graph Convolutional Networks and Doc2Vec, learning how to represent Wikipedia articles and deliver personalized recommendations for editors. We test our model on editors' histories, predicting their most recent edits based on their prior edits. We outperform competitive implicit-feedback collaborative-filtering methods such as WMRF based on ALS, as well as a traditional IR-method such as content-based filtering based on BM25. All of the data used on this paper is publicly available, including graph embeddings for Wikipedia articles, and we release our code to support replication of our experiments. Moreover, we contribute with a scalable implementation of a state-of-art graph embedding algorithm as current ones cannot efficiently handle the sheer size of the Wikipedia graph.

Submitted 24 September, 2020; originally announced September 2020.

Journal ref: ComplexRec 2020, Workshop on Recommendation in Complex Scenarios at the ACM RecSys Conference on Recommender Systems (RecSys 2020)

arXiv:2007.10403 [pdf, other]

Global gender differences in Wikipedia readership

Authors: Isaac Johnson, Florian Lemmerich, Diego Sáez-Trumper, Robert West, Markus Strohmaier, Leila Zia

Abstract: Wikipedia represents the largest and most popular source of encyclopedic knowledge in the world today, aiming to provide equal access to information worldwide. From a global online survey of 65,031 readers of Wikipedia and their corresponding reading logs, we present novel evidence of gender differences in Wikipedia readership and how they manifest in records of user behavior. More specifically we report that (1) women are underrepresented among readers of Wikipedia, (2) women view fewer pages per reading session than men do, (3) men and women visit Wikipedia for similar reasons, and (4) men and women exhibit specific topical preferences. Our findings lay the foundation for identifying pathways toward knowledge equity in the usage of online encyclopedic knowledge.

Submitted 20 July, 2020; originally announced July 2020.

arXiv:2001.08810 [pdf, other]

Uneven Coverage of Natural Disasters in Wikipedia: the Case of Flood

Authors: Valerio Lorini, Javier Rando, Diego Saez-Trumper, Carlos Castillo

Abstract: The usage of non-authoritative data for disaster management presents the opportunity of accessing timely information that might not be available through other means, as well as the challenge of dealing with several layers of biases. Wikipedia, a collaboratively-produced encyclopedia, includes in-depth information about many natural and human-made disasters, and its editors are particularly good at adding information in real-time as a crisis unfolds. In this study, we focus on the English version of Wikipedia, that is by far the most comprehensive version of this encyclopedia. Wikipedia tends to have good coverage of disasters, particularly those having a large number of fatalities. However, we also show that a tendency to cover events in wealthy countries and not cover events in poorer ones permeates Wikipedia as a source for disaster-related information. By performing careful automatic content analysis at a large scale, we show how the coverage of floods in Wikipedia is skewed towards rich, English-speaking countries, in particular the US and Canada. We also note how coverage of floods in countries with the lowest income, as well as countries in South America, is substantially lower than the coverage of floods in middle-income countries. These results have implications for systems using Wikipedia or similar collaborative media platforms as an information source for detecting emergencies or for gathering valuable information for disaster response.

Submitted 23 January, 2020; originally announced January 2020.

Comments: 17 pages, submitted to ISCRAM 2020 conference

arXiv:1910.12596 [pdf, other]

Online Disinformation and the Role of Wikipedia

Authors: Diego Saez-Trumper

Abstract: The aim of this study is to find key areas of research that can be useful to fight against disinformation on Wikipedia. To address this problem we perform a literature review trying to answer three main questions: (i) What is disinformation? (ii) What are the most popular mechanisms to spread online disinformation? and (iii) Which are the mechanisms that are currently being used to fight against disinformation?. In all these three questions we take first a general approach, considering studies from different areas such as journalism and communications, sociology, philosophy, information and political sciences. And comparing those studies with the current situation on the Wikipedia ecosystem. We conclude that in order to keep Wikipedia as free as possible from disinformation, it is necessary to help patrollers to early detect disinformation and assess the credibility of external sources. More research is needed to develop tools that use state-of-the-art machine learning techniques to detect potentially dangerous content, empowering patrollers to deal with attacks that are becoming more complex and sophisticated.

Submitted 14 October, 2019; originally announced October 2019.

arXiv:1812.00474 [pdf, other]

Why the World Reads Wikipedia: Beyond English Speakers

Authors: Florian Lemmerich, Diego Sáez-Trumper, Robert West, Leila Zia

Abstract: As one of the Web's primary multilingual knowledge sources, Wikipedia is read by millions of people across the globe every day. Despite this global readership, little is known about why users read Wikipedia's various language editions. To bridge this gap, we conduct a comparative study by combining a large-scale survey of Wikipedia readers across 14 language editions with a log-based analysis of user activity. We proceed in three steps. First, we analyze the survey results to compare the prevalence of Wikipedia use cases across languages, discovering commonalities, but also substantial differences, among Wikipedia languages with respect to their usage. Second, we match survey responses to the respondents' traces in Wikipedia's server logs to characterize behavioral patterns associated with specific use cases, finding that distinctive patterns consistently mark certain use cases across language editions. Third, we show that certain Wikipedia use cases are more common in countries with certain socio-economic characteristics; e.g., in-depth reading of Wikipedia articles is substantially more common in countries with a low Human Development Index. These findings advance our understanding of reader motivations and behaviors across Wikipedia languages and have implications for Wikipedia editors and developers of Wikipedia and other Web technologies.

Submitted 2 December, 2018; originally announced December 2018.

arXiv:1806.08282 [pdf, other]

Online Petitioning Through Data Exploration and What We Found There: A Dataset of Petitions from Avaaz.org

Authors: Pablo Aragón, Diego Sáez-Trumper, Miriam Redi, Scott A. Hale, Vicenç Gómez, Andreas Kaltenbrunner

Abstract: The Internet has become a fundamental resource for activism as it facilitates political mobilization at a global scale. Petition platforms are a clear example of how thousands of people around the world can contribute to social change. Avaaz.org, with a presence in over 200 countries, is one of the most popular of this type. However, little research has focused on this platform, probably due to a lack of available data. In this work we retrieved more than 350K petitions, standardized their field values, and added new information using language detection and named-entity recognition. To motivate future research with this unique repository of global protest, we present a first exploration of the dataset. In particular, we examine how social media campaigning is related to the success of petitions, as well as some geographic and linguistic findings about the worldwide community of Avaaz.org. We conclude with example research questions that could be addressed with our dataset.

Submitted 21 June, 2018; originally announced June 2018.

Comments: Accepted as a dataset paper at the 12th International AAAI Conference on Web and Social Media (ICWSM-18). This preprint includes an additional appendix with the reasons, provided by Avaaz.org, about the anomalies detected when exploring the dataset. For academic purposes, please cite the ICWSM version

arXiv:1604.03044 [pdf, other]

doi 10.1145/2700171.2791056

Wisdom of the Crowd or Wisdom of a Few? An Analysis of Users' Content Generation

Authors: Ricardo Baeza-Yates, Diego Saez-Trumper

Abstract: In this paper we analyze how user generated content (UGC) is created, challenging the well known wisdom of crowds concept. Although it is known that user activity in most settings follow a power law, that is, few people do a lot, while most do nothing, there are few studies that characterize well this activity. In our analysis of datasets from two different social networks, Facebook and Twitter, we find that a small percentage of active users and much less of all users represent 50% of the UGC. We also analyze the dynamic behavior of the generation of this content to find that the set of most active users is quite stable in time. Moreover, we study the social graph, finding that those active users are highly connected among them. This implies that most of the wisdom comes from a few users, challenging the independence assumption needed to have a wisdom of crowds. We also address the content that is never seen by any people, which we call digital desert, that challenges the assumption that the content of every person should be taken in account in a collective decision. We also compare our results with Wikipedia data and we address the quality of UGC content using an Amazon dataset. At the end our results are not surprising, as the Web is a reflection of our own society, where economical or political power also is in the hands of minorities.

Submitted 11 April, 2016; originally announced April 2016.

ACM Class: H.2.8; J.4

Journal ref: Proceedings of the 26th ACM Conference on Hypertext & Social Media, 2015

arXiv:1602.09000 [pdf, other]

A Day of Your Days: Estimating Individual Daily Journeys Using Mobile Data to Understand Urban Flow

Authors: Eduardo Graells-Garrido, Diego Saez-Trumper

Abstract: Nowadays, travel surveys provide rich information about urban mobility and commuting patterns. But, at the same time, they have drawbacks: they are static pictures of a dynamic phenomenon, are expensive to make, and take prolonged periods of time to finish. However, the availability of mobile usage data (Call Detail Records) makes the study of urban mobility possible at levels not known before. This has been done in the past with good results: mobile data makes it possible to find and understand aggregated mobility patterns. In this paper, we propose to analyze mobile data at the individual level by estimating daily journeys, and use those journeys to build Origin-Destiny matrices to understand urban flow. We evaluate this approach with large anonymized CDRs from Santiago, Chile, and find that our method has a high correlation (ρ = 0.89) with the current travel survey, and that it captures external anomalies in daily travel patterns, making our method suitable for inclusion into urban computing applications.

Submitted 29 February, 2016; originally announced February 2016.

Comments: Submitted for review - please contact authors before citing. 6 pages

arXiv:1411.5204 [pdf, other]

doi 10.1145/2675133.2675233

Measuring Urban Deprivation from User Generated Content

Authors: Alessandro Venerandi, Giovanni Quattrone, Licia Capra, Daniele Quercia, Diego Saez-Trumper

Abstract: Measuring socioeconomic deprivation of cities in an accurate and timely fashion has become a priority for governments around the world, as the massive urbanization process we are witnessing is causing high levels of inequalities which require intervention. Traditionally, deprivation indexes have been derived from census data, which is however very expensive to obtain, and thus acquired only every few years. Alternative computational methods have been proposed in recent years to automatically extract proxies of deprivation at a fine spatio-temporal level of granularity; however, they usually require access to datasets (e.g., call detail records) that are not publicly available to governments and agencies. To remedy this, we propose a new method to automatically mine deprivation at a fine level of spatio-temporal granularity that only requires access to freely available user-generated content. More precisely, the method needs access to datasets describing what urban elements are present in the physical environment; examples of such datasets are Foursquare and OpenStreetMap. Using these datasets, we quantitatively describe neighborhoods by means of a metric, called Offering Advantage, that reflects which urban elements are distinctive features of each neighborhood. We then use that metric to (i) build accurate classifiers of urban deprivation and (ii) interpret the outcomes through thematic analysis. We apply the method to three UK urban areas of different scale and elaborate on the results in terms of precision and recall.

Submitted 19 November, 2014; originally announced November 2014.

Comments: CSCW'15, March 14 - 18 2015, Vancouver, BC, Canada

Showing 1–19 of 19 results for author: Sáez-Trumper, D