Search | arXiv e-print repository

Push and Pull: A Framework for Measuring Attentional Agency

Authors: Zachary Wojtowicz, Shrey Jain, Nicholas Vincent

Abstract: We propose a framework for measuring attentional agency - the ability to allocate one's attention according to personal desires, goals, and intentions - on digital platforms. Platforms extend people's limited powers of attention by extrapolating their preferences to large collections of previously unconsidered informational objects. However, platforms typically also allow people to influence one another's attention. We introduce a formal framework for measuring how much a given platform empowers people to both pull information into their own attentional field and push information into the attentional fields of others. We also use these definitions to shed light on the implications of generative foundation models, which enable users to bypass the implicit "attentional bargain" that underlies embedded advertising and other methods for capturing economic value from informational goods. We conclude with a set of policy strategies that can be used to understand and reshape the distribution of attentional agency online.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2403.13073 [pdf, other]

doi 10.1145/3613904.3642749

A Canary in the AI Coal Mine: American Jews May Be Disproportionately Harmed by Intellectual Property Dispossession in Large Language Model Training

Authors: Heila Precel, Allison McDonald, Brent Hecht, Nicholas Vincent

Abstract: Systemic property dispossession from minority groups has often been carried out in the name of technological progress. In this paper, we identify evidence that the current paradigm of large language models (LLMs) likely continues this long history. Examining common LLM training datasets, we find that a disproportionate amount of content authored by Jewish Americans is used for training without their consent. The degree of over-representation ranges from around 2x to around 6.5x. Given that LLMs may substitute for the paid labor of those who produced their training data, they have the potential to cause even more substantial and disproportionate economic harm to Jewish Americans in the coming years. This paper focuses on Jewish Americans as a case study, but it is probable that other minority communities (e.g., Asian Americans, Hindu Americans) may be similarly affected and, most importantly, the results should likely be interpreted as a "canary in the coal mine" that highlights deep structural concerns about the current LLM paradigm whose harms could soon affect nearly everyone. We discuss the implications of these results for the policymakers thinking about how to regulate LLMs as well as for those in the AI field who are working to advance LLMs. Our findings stress the importance of working together towards alternative LLM paradigms that avoid both disparate impacts and widespread societal harms.

Submitted 19 March, 2024; originally announced March 2024.

Comments: Preprint, to appear in CHI 2024 proceedings

arXiv:2311.11350 [pdf, ps, other]

An Alternative to Regulation: The Case for Public AI

Authors: Nicholas Vincent, David Bau, Sarah Schwettmann, Joshua Tan

Abstract: Can governments build AI? In this paper, we describe an ongoing effort to develop "public AI" -- publicly accessible AI models funded, provisioned, and governed by governments or other public bodies. Public AI presents both an alternative and a complement to standard regulatory approaches to AI, but it also suggests new technical and policy challenges. We present a roadmap for how the ML research community can help shape this initiative and support its implementation, and how public AI can complement other responsible AI initiatives.

Submitted 19 November, 2023; originally announced November 2023.

Comments: To be presented at Regulatable ML @ NeurIPS2023 workshop

arXiv:2310.04329 [pdf, other]

Pika: Empowering Non-Programmers to Author Executable Governance Policies in Online Communities

Authors: Leijie Wang, Nicholas Vincent, Julija Rukanskaitė, Amy X. Zhang

Abstract: Internet users have formed a wide array of online communities with nuanced and diverse community goals and norms. However, most online platforms only offer a limited set of governance models in their software infrastructure and leave little room for customization. Consequently, technical proficiency becomes a prerequisite for online communities to build governance policies in code, excluding non-programmers from participation in designing community governance. In this paper, we present Pika, a system that empowers non-programmers to author a wide range of executable governance policies. At its core, Pika incorporates a declarative language that decomposes governance policies into modular components, thereby facilitating expressive policy authoring through a user-friendly, form-based web interface. Our user studies with 17 participants show that Pika can empower non-programmers to author governance policies approximately 2.5 times faster than programmers who author in code. We also provide insights about Pika's expressivity in supporting diverse policies that online communities want.

Submitted 27 February, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

Comments: Conditionally accepted by CHI'2024

arXiv:2305.13238 [pdf]

doi 10.1145/3593013.3594070

The Dimensions of Data Labor: A Road Map for Researchers, Activists, and Policymakers to Empower Data Producers

Authors: Hanlin Li, Nicholas Vincent, Stevie Chancellor, Brent Hecht

Abstract: Many recent technological advances (e.g. ChatGPT and search engines) are possible only because of massive amounts of user-generated data produced through user interactions with computing systems or scraped from the web (e.g. behavior logs, user-generated content, and artwork). However, data producers have little say in what data is captured, how it is used, or who it benefits. Organizations with the ability to access and process this data, e.g. OpenAI and Google, possess immense power in shaping the technology landscape. By synthesizing related literature that reconceptualizes the production of data for computing as "data labor", we outline opportunities for researchers, policymakers, and activists to empower data producers in their relationship with tech companies, e.g. advocating for transparency about data reuse, creating feedback channels between data producers and companies, and potentially developing mechanisms to share data's revenue more broadly. In doing so, we characterize data labor with six important dimensions - legibility, end-use awareness, collaboration requirement, openness, replaceability, and livelihood overlap - based on the parallels between data labor and various other types of labor in the computing literature.

Submitted 22 May, 2023; originally announced May 2023.

Comments: To appear at the 2023 ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT)

arXiv:2303.16302 [pdf]

Retracted Articles about COVID-19 Vaccines Enable Vaccine Misinformation on Twitter

Authors: Rod Abhari, Esteban Villa-Turek, Nicholas Vincent, Henry Dambanemuya, Emőke-Ágnes Horvát

Abstract: Retracted scientific articles about COVID-19 vaccines have proliferated false claims about vaccination harms and discouraged vaccine acceptance. Our study analyzed the topical content of 4,876 English-language tweets about retracted COVID-19 vaccine research and found that 27.4% of tweets contained retraction-related misinformation. Misinformed tweets either ignored the retraction, or less commonly, politicized the retraction using conspiratorial rhetoric. To address this, Twitter and other social media platforms should expand their efforts to address retraction-related misinformation.

Submitted 28 March, 2023; originally announced March 2023.

arXiv:2203.04228 [pdf, other]

Online Engagement with Retracted Articles: Who, When, and How?

Authors: Henry K. Dambanemuya, Rod Abhari, Nicholas Vincent, Emőke-Ágnes Horvát

Abstract: Retracted research discussed on social media can spread misinformation. Yet we lack an understanding of how retracted articles are mentioned by academic and non-academic users. This is especially relevant on Twitter due to the platform's prominent role in science communication. Here, we analyze the pre- and post-retraction differences in Twitter attention and engagement metrics for over 3,800 retracted English-language articles alongside comparable non-retracted articles. We subset these findings according to five user types detected by our supervised learning classifier: members of the public, academics, bots, science practitioners, and science communicators. We find that retracted articles receive greater user attention (tweet count) and engagement (likes, retweets, and replies) than non-retracted articles, especially among members of the public and bots, with the majority of user engagement happening before retraction. Our results highlight the prominent role of non-experts in discussions of retracted research and suggest an opportunity for social media platforms to contribute towards early detection of problematic scientific research online.

Submitted 29 January, 2024; v1 submitted 8 March, 2022; originally announced March 2022.

ACM Class: K.4.0

arXiv:2105.05241 [pdf, ps, other]

Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus

Authors: Jack Bandy, Nicholas Vincent

Abstract: Recent literature has underscored the importance of dataset documentation work for machine learning, and part of this work involves addressing "documentation debt" for datasets that have been used widely but documented sparsely. This paper aims to help address documentation debt for BookCorpus, a popular text dataset for training large language models. Notably, researchers have used BookCorpus to train OpenAI's GPT-N models and Google's BERT models, even though little to no documentation exists about the dataset's motivation, composition, collection process, etc. We offer a preliminary datasheet that provides key context and information about BookCorpus, highlighting several notable deficiencies. In particular, we find evidence that (1) BookCorpus likely violates copyright restrictions for many books, (2) BookCorpus contains thousands of duplicated books, and (3) BookCorpus exhibits significant skews in genre representation. We also find hints of other potential deficiencies that call for future research, including problematic content, potential skews in religious representation, and lopsided author contributions. While more work remains, this initial effort to provide a datasheet for BookCorpus adds to growing literature that urges more careful and systematic documentation for machine learning datasets.

Submitted 11 May, 2021; originally announced May 2021.

Comments: Working paper

arXiv:2012.09995 [pdf, other]

Data Leverage: A Framework for Empowering the Public in its Relationship with Technology Companies

Authors: Nicholas Vincent, Hanlin Li, Nicole Tilly, Stevie Chancellor, Brent Hecht

Abstract: Many powerful computing technologies rely on implicit and explicit data contributions from the public. This dependency suggests a potential source of leverage for the public in its relationship with technology companies: by reducing, stopping, redirecting, or otherwise manipulating data contributions, the public can reduce the effectiveness of many lucrative technologies. In this paper, we synthesize emerging research that seeks to better understand and help people action this data leverage. Drawing on prior work in areas including machine learning, human-computer interaction, and fairness and accountability in computing, we present a framework for understanding data leverage that highlights new opportunities to change technology company behavior related to privacy, economic inequality, content moderation and other areas of societal concern. Our framework also points towards ways that policymakers can bolster data leverage as a means of changing the balance of power between the public and tech companies.

Submitted 17 February, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

Comments: This is a preprint. The paper will be presented at the 2021 Conference on Fairness, Accountability, and Transparency (FAccT 2021)

arXiv:2011.03116 [pdf, ps, other]

doi 10.1145/3531146.3533143

Behavioral Use Licensing for Responsible AI

Authors: Danish Contractor, Daniel McDuff, Julia Haines, Jenny Lee, Christopher Hines, Brent Hecht, Nicholas Vincent, Hanlin Li

Abstract: With the growing reliance on artificial intelligence (AI) for many different applications, the sharing of code, data, and models is important to ensure the replicability and democratization of scientific knowledge. Many high-profile academic publishing venues expect code and models to be submitted and released with papers. Furthermore, developers often want to release these assets to encourage development of technology that leverages their frameworks and services. A number of organizations have expressed concerns about the inappropriate or irresponsible use of AI and have proposed ethical guidelines around the application of such systems. While such guidelines can help set norms and shape policy, they are not easily enforceable. In this paper, we advocate the use of licensing to enable legally enforceable behavioral use conditions on software and code and provide several case studies that demonstrate the feasibility of behavioral use licensing. We envision how licensing may be implemented in accordance with existing responsible AI guidelines.

Submitted 20 October, 2022; v1 submitted 4 November, 2020; originally announced November 2020.

Comments: Paper published at ACM FAccT 2022

arXiv:2004.10265 [pdf]

A Deeper Investigation of the Importance of Wikipedia Links to the Success of Search Engines

Authors: Nicholas Vincent, Brent Hecht

Abstract: A growing body of work has highlighted the important role that Wikipedia's volunteer-created content plays in helping search engines achieve their core goal of addressing the information needs of millions of people. In this paper, we report the results of an investigation into the incidence of Wikipedia links in search engine results pages (SERPs). Our results extend prior work by considering three U.S. search engines, simulating both mobile and desktop devices, and using a spatial analysis approach designed to study modern SERPs that are no longer just "ten blue links". We find that Wikipedia links are extremely common in important search contexts, appearing in 67-84% of all SERPs for common and trending queries, but less often for medical queries. Furthermore, we observe that Wikipedia links often appear in "Knowledge Panel" SERP elements and are in positions visible to users without scrolling, although Wikipedia appears less in prominent positions on mobile devices. Our findings reinforce the complementary notions that (1) Wikipedia content and research has major impact outside of the Wikipedia domain and (2) powerful technologies like search engines are highly reliant on free content created by volunteers.

Submitted 21 April, 2020; originally announced April 2020.

Comments: This is a pre-print of a paper accepted to the non-archival track of the WikiWorkshop at the Web Conference 2020

arXiv:1912.00757 [pdf]

Mapping the Potential and Pitfalls of "Data Dividends" as a Means of Sharing the Profits of Artificial Intelligence

Authors: Nicholas Vincent, Yichun Li, Renee Zha, Brent Hecht

Abstract: Identifying strategies to more broadly distribute the economic winnings of AI technologies is a growing priority in HCI and other fields. One idea gaining prominence centers on "data dividends", or sharing the profits of AI technologies with the people who generated the data on which these technologies rely. Despite the rapidly growing discussion around data dividends - including backing by prominent politicians - there exists little guidance about how data dividends might be designed and little information about if they will work. In this paper, we begin the process of developing a concrete design space for data dividends. We additionally simulate the effects of a variety of important design decisions using well-known datasets and algorithms. We find that seemingly innocuous decisions can create counterproductive effects, e.g. severely concentrated dividends and demographic disparities. Overall, the outcomes we observe -- both desirable and undesirable -- highlight the need for dividend implementers to make design decisions cautiously.

Submitted 18 November, 2019; originally announced December 2019.

Comments: This is a working draft. It has not been peer-reviewed and is intended for internal discussion in the computing community

arXiv:1906.08576 [pdf]

Measuring the Importance of User-Generated Content to Search Engines

Authors: Nicholas Vincent, Isaac Johnson, Patrick Sheehan, Brent Hecht

Abstract: Search engines are some of the most popular and profitable intelligent technologies in existence. Recent research, however, has suggested that search engines may be surprisingly dependent on user-created content like Wikipedia articles to address user information needs. In this paper, we perform a rigorous audit of the extent to which Google leverages Wikipedia and other user-generated content to respond to queries. Analyzing results for six types of important queries (e.g. most popular, trending, expensive advertising), we observe that Wikipedia appears in over 80% of results pages for some query types and is by far the most prevalent individual content source across all query types. More generally, our results provide empirical information to inform a nascent but rapidly-growing debate surrounding a highly-consequential question: Do users provide enough value to intelligent technologies that they should receive more of the economic benefits from intelligent technologies?

Submitted 20 June, 2019; originally announced June 2019.

Comments: This version includes a bibliography entry that was missing from the first version of the text due to a processing error. This is a preprint of a paper accepted at ICWSM 2019. Please cite that version instead

Showing 1–13 of 13 results for author: Vincent, N