Showing 1–19 of 19 results for author: Sáez-Trumper, D

  1. arXiv:2404.09764  [pdf, other]

    cs.CY

    Language-Agnostic Modeling of Wikipedia Articles for Content Quality Assessment across Languages

    Authors: Paramita Das, Isaac Johnson, Diego Saez-Trumper, Pablo Aragón

    Abstract: Wikipedia is the largest web repository of free knowledge. Volunteer editors devote time and effort to creating and expanding articles in more than 300 language editions. As content quality varies from article to article, editors also spend substantial time rating articles with specific criteria. However, keeping these assessments complete and up-to-date is largely impossible given the ever-changi…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: Accepted at ICWSM-24

  2. A Comparative Study of Reference Reliability in Multiple Language Editions of Wikipedia

    Authors: Aitolkyn Baigutanova, Diego Saez-Trumper, Miriam Redi, Meeyoung Cha, Pablo Aragón

    Abstract: Information presented in Wikipedia articles must be attributable to reliable published sources in the form of references. This study examines over 5 million Wikipedia articles to assess the reliability of references in multiple language editions. We quantify the cross-lingual patterns of the perennial sources list, a collection of reliability labels for web domains identified and collaboratively a…

    Submitted 4 September, 2023; v1 submitted 31 August, 2023; originally announced September 2023.

    Comments: Conference on Information & Knowledge Management (CIKM '23)

  3. arXiv:2306.01650  [pdf, other]

    cs.LG

    Fair multilingual vandalism detection system for Wikipedia

    Authors: Mykola Trokhymovych, Muniza Aslam, Ai-Jou Chou, Ricardo Baeza-Yates, Diego Saez-Trumper

    Abstract: This paper presents a novel design of the system aimed at supporting the Wikipedia community in addressing vandalism on the platform. To achieve this, we collected a massive dataset of 47 languages, and applied advanced filtering and feature engineering techniques, including multilingual masked language modeling to build the training dataset from human-generated data. The performance of the system…

    Submitted 2 June, 2023; originally announced June 2023.

  4. Longitudinal Assessment of Reference Quality on Wikipedia

    Authors: Aitolkyn Baigutanova, Jaehyeon Myung, Diego Saez-Trumper, Ai-Jou Chou, Miriam Redi, Changwook Jung, Meeyoung Cha

    Abstract: Wikipedia plays a crucial role in the integrity of the Web. This work analyzes the reliability of this global encyclopedia through the lens of its references. We operationalize the notion of reference quality by defining reference need (RN), i.e., the percentage of sentences missing a citation, and reference risk (RR), i.e., the proportion of non-authoritative references. We release Citation Detec…

    Submitted 9 March, 2023; originally announced March 2023.

    Comments: Published at the Web Conference 2023 (WWW '23)

    Journal ref: Proceedings of the ACM Web Conference 2023 (WWW '23), May 1-5, 2023, Austin, TX, USA. ACM

  5. arXiv:2111.08543  [pdf, other]

    cs.CL cs.IR cs.LG

    WikiContradiction: Detecting Self-Contradiction Articles on Wikipedia

    Authors: Cheng Hsu, Cheng-Te Li, Diego Saez-Trumper, Yi-Zhan Hsu

    Abstract: While Wikipedia has been utilized for fact-checking and claim verification to debunk misinformation and disinformation, it is essential to both improve article quality and rule out noisy articles. Self-contradiction is one of the low-quality article types in Wikipedia. In this work, we propose a task of detecting self-contradiction articles in Wikipedia. Based on the "self-contradictory" templat…

    Submitted 16 November, 2021; originally announced November 2021.

    Comments: Published at IEEE BigData 2021 (regular paper). Data and code can be accessed via: https://github.com/Wiki-Contradictory/Wiki-Self-Contradictory/

  6. arXiv:2109.00835  [pdf, other]

    cs.CY

    WikiCheck: An end-to-end open source Automatic Fact-Checking API based on Wikipedia

    Authors: Mykola Trokhymovych, Diego Saez-Trumper

    Abstract: With the growth of fake news and disinformation, the NLP community has been working to assist humans in fact-checking. However, most academic research has focused on model accuracy without paying attention to resource efficiency, which is crucial in real-life scenarios. In this work, we review the State-of-the-Art datasets and solutions for Automatic Fact-checking and test their applicability in p…

    Submitted 2 September, 2021; originally announced September 2021.

  7. arXiv:2106.15940  [pdf, other]

    cs.CY

    A preliminary approach to knowledge integrity risk assessment in Wikipedia projects

    Authors: Pablo Aragón, Diego Sáez-Trumper

    Abstract: Wikipedia is one of the main repositories of free knowledge available today, with a central role in the Web ecosystem. For this reason, it can also be a battleground for actors trying to impose specific points of view or even spreading disinformation online. There is a growing need to monitor its "health" but this is not an easy task. Wikipedia exists in over 300 language editions and each project…

    Submitted 30 June, 2021; originally announced June 2021.

    Comments: Accepted at MIS2'21: Misinformation and Misbehavior Mining on the Web Workshop held in conjunction with KDD 2021

  8. arXiv:2105.04117  [pdf, other]

    cs.IR cs.CL cs.LG

    Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia

    Authors: KayYen Wong, Miriam Redi, Diego Saez-Trumper

    Abstract: Wikipedia is the largest online encyclopedia, used by algorithms and web users as a central hub of reliable information on the web. The quality and reliability of Wikipedia content is maintained by a community of volunteer editors. Machine learning and information retrieval algorithms could help scale up editors' manual efforts around Wikipedia content reliability. However, there is a lack of larg…

    Submitted 1 June, 2021; v1 submitted 10 May, 2021; originally announced May 2021.

    Comments: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '21), 2021

  9. arXiv:2103.16613  [pdf, other]

    cs.CY

    Tracking Knowledge Propagation Across Wikipedia Languages

    Authors: Roldolfo Valentim, Giovanni Comarela, Souneil Park, Diego Saez-Trumper

    Abstract: In this paper, we present a dataset of inter-language knowledge propagation in Wikipedia. Covering the entire 309 language editions and 33M articles, the dataset aims to track the full propagation history of Wikipedia concepts, and allow follow up research on building predictive models of them. For this purpose, we align all the Wikipedia articles in a language-agnostic manner according to the con…

    Submitted 30 March, 2021; originally announced March 2021.

    Journal ref: 15th International Conference on Web and Social Media (ICWSM-21), 2021

  10. arXiv:2103.00068  [pdf, other]

    cs.CY

    Language-agnostic Topic Classification for Wikipedia

    Authors: Isaac Johnson, Martin Gerlach, Diego Sáez-Trumper

    Abstract: A major challenge for many analyses of Wikipedia dynamics -- e.g., imbalances in content quality, geographic differences in what content is popular, what types of articles attract more editor discussion -- is grouping the very diverse range of Wikipedia articles into coherent, consistent topics. This problem has been addressed using various approaches based on Wikipedia's category network, WikiPro…

    Submitted 26 February, 2021; originally announced March 2021.

    Comments: Accepted to WikiWorkshop at The Web Conference 2021

  11. arXiv:2009.11771  [pdf, other]

    cs.IR cs.CY

    Scalable Recommendation of Wikipedia Articles to Editors Using Representation Learning

    Authors: Oleksii Moskalenko, Denis Parra, Diego Saez-Trumper

    Abstract: Wikipedia is edited by volunteer editors around the world. Considering the large amount of existing content (e.g. over 5M articles in English Wikipedia), deciding what to edit next can be difficult, both for experienced users that usually have a huge backlog of articles to prioritize, as well as for newcomers who might need guidance in selecting the next article to contribute. Therefore, help…

    Submitted 24 September, 2020; originally announced September 2020.

    Journal ref: ComplexRec 2020, Workshop on Recommendation in Complex Scenarios at the ACM RecSys Conference on Recommender Systems (RecSys 2020)

  12. arXiv:2007.10403  [pdf, other]

    cs.CY

    Global gender differences in Wikipedia readership

    Authors: Isaac Johnson, Florian Lemmerich, Diego Sáez-Trumper, Robert West, Markus Strohmaier, Leila Zia

    Abstract: Wikipedia represents the largest and most popular source of encyclopedic knowledge in the world today, aiming to provide equal access to information worldwide. From a global online survey of 65,031 readers of Wikipedia and their corresponding reading logs, we present novel evidence of gender differences in Wikipedia readership and how they manifest in records of user behavior. More specifically we…

    Submitted 20 July, 2020; originally announced July 2020.

  13. arXiv:2001.08810  [pdf, other]

    cs.IR cs.CY

    Uneven Coverage of Natural Disasters in Wikipedia: the Case of Flood

    Authors: Valerio Lorini, Javier Rando, Diego Saez-Trumper, Carlos Castillo

    Abstract: The usage of non-authoritative data for disaster management presents the opportunity of accessing timely information that might not be available through other means, as well as the challenge of dealing with several layers of biases. Wikipedia, a collaboratively-produced encyclopedia, includes in-depth information about many natural and human-made disasters, and its editors are particularly good at…

    Submitted 23 January, 2020; originally announced January 2020.

    Comments: 17 pages, submitted to ISCRAM 2020 conference

  14. arXiv:1910.12596  [pdf, other]

    cs.CY cs.DL cs.SI

    Online Disinformation and the Role of Wikipedia

    Authors: Diego Saez-Trumper

    Abstract: The aim of this study is to find key areas of research that can be useful to fight against disinformation on Wikipedia. To address this problem we perform a literature review trying to answer three main questions: (i) What is disinformation? (ii) What are the most popular mechanisms to spread online disinformation? and (iii) Which are the mechanisms that are currently being used to fight against d…

    Submitted 14 October, 2019; originally announced October 2019.

  15. arXiv:1812.00474  [pdf, other]

    cs.CY cs.HC cs.SI

    Why the World Reads Wikipedia: Beyond English Speakers

    Authors: Florian Lemmerich, Diego Sáez-Trumper, Robert West, Leila Zia

    Abstract: As one of the Web's primary multilingual knowledge sources, Wikipedia is read by millions of people across the globe every day. Despite this global readership, little is known about why users read Wikipedia's various language editions. To bridge this gap, we conduct a comparative study by combining a large-scale survey of Wikipedia readers across 14 language editions with a log-based analysis of u…

    Submitted 2 December, 2018; originally announced December 2018.

  16. arXiv:1806.08282  [pdf, other]

    cs.SI

    Online Petitioning Through Data Exploration and What We Found There: A Dataset of Petitions from Avaaz.org

    Authors: Pablo Aragón, Diego Sáez-Trumper, Miriam Redi, Scott A. Hale, Vicenç Gómez, Andreas Kaltenbrunner

    Abstract: The Internet has become a fundamental resource for activism as it facilitates political mobilization at a global scale. Petition platforms are a clear example of how thousands of people around the world can contribute to social change. Avaaz.org, with a presence in over 200 countries, is one of the most popular of this type. However, little research has focused on this platform, probably due to a…

    Submitted 21 June, 2018; originally announced June 2018.

    Comments: Accepted as a dataset paper at the 12th International AAAI Conference on Web and Social Media (ICWSM-18). This preprint includes an additional appendix with the reasons, provided by Avaaz.org, about the anomalies detected when exploring the dataset. For academic purposes, please cite the ICWSM version

  17. arXiv:1604.03044  [pdf, other]

    cs.CY cs.SI physics.soc-ph

    Wisdom of the Crowd or Wisdom of a Few? An Analysis of Users' Content Generation

    Authors: Ricardo Baeza-Yates, Diego Saez-Trumper

    Abstract: In this paper we analyze how user generated content (UGC) is created, challenging the well known wisdom of crowds concept. Although it is known that user activity in most settings follow a power law, that is, few people do a lot, while most do nothing, there are few studies that characterize well this activity. In our analysis of datasets from two different social networks, Facebook and Twit…

    Submitted 11 April, 2016; originally announced April 2016.

    ACM Class: H.2.8; J.4

    Journal ref: Proceedings of the 26th ACM Conference on Hypertext & Social Media, 2015

  18. arXiv:1602.09000  [pdf, other]

    cs.SI physics.soc-ph

    A Day of Your Days: Estimating Individual Daily Journeys Using Mobile Data to Understand Urban Flow

    Authors: Eduardo Graells-Garrido, Diego Saez-Trumper

    Abstract: Nowadays, travel surveys provide rich information about urban mobility and commuting patterns. But, at the same time, they have drawbacks: they are static pictures of a dynamic phenomenon, are expensive to make, and take prolonged periods of time to finish. However, the availability of mobile usage data (Call Detail Records) makes the study of urban mobility possible at levels not known before. Thi…

    Submitted 29 February, 2016; originally announced February 2016.

    Comments: Submitted for review - please contact authors before citing. 6 pages

  19. Measuring Urban Deprivation from User Generated Content

    Authors: Alessandro Venerandi, Giovanni Quattrone, Licia Capra, Daniele Quercia, Diego Saez-Trumper

    Abstract: Measuring socioeconomic deprivation of cities in an accurate and timely fashion has become a priority for governments around the world, as the massive urbanization process we are witnessing is causing high levels of inequalities which require intervention. Traditionally, deprivation indexes have been derived from census data, which is however very expensive to obtain, and thus acquired only every…

    Submitted 19 November, 2014; originally announced November 2014.

    Comments: CSCW'15, March 14 - 18 2015, Vancouver, BC, Canada