-
Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them?
Authors:
Shayne Longpre,
Robert Mahari,
Naana Obeng-Marnu,
William Brannon,
Tobin South,
Katy Gero,
Sandy Pentland,
Jad Kabbara
Abstract:
New capabilities in foundation models are owed in large part to massive, widely-sourced, and under-documented training data collections. Existing practices in data collection have led to challenges in documenting data transparency, tracing authenticity, verifying consent, privacy, representation, bias, copyright infringement, and the overall development of ethical and trustworthy foundation models. In response, regulation is emphasizing the need for training data transparency to understand foundation models' limitations. Based on a large-scale analysis of the foundation model training data landscape and existing solutions, we identify the missing infrastructure to facilitate responsible foundation model development practices. We examine the current shortcomings of common tools for tracing data authenticity, consent, and documentation, and outline how policymakers, developers, and data creators can facilitate responsible foundation model development by adopting universal data provenance standards.
Submitted 19 April, 2024;
originally announced April 2024.
-
Verifiable evaluations of machine learning models using zkSNARKs
Authors:
Tobin South,
Alexander Camuto,
Shrey Jain,
Shayla Nguyen,
Robert Mahari,
Christian Paquin,
Jason Morton,
Alex 'Sandy' Pentland
Abstract:
In a world of increasing closed-source commercial machine learning models, model evaluations from developers must be taken at face value. These benchmark results, whether over task accuracy, bias evaluations, or safety checks, are traditionally impossible to verify by a model end-user without the costly or impossible process of re-performing the benchmark on black-box model outputs. This work presents a method of verifiable model evaluation using model inference through zkSNARKs. The resulting zero-knowledge computational proofs of model outputs over datasets can be packaged into verifiable evaluation attestations showing that models with fixed private weights achieve stated performance or fairness metrics over public inputs. We present a flexible proving system that enables verifiable attestations to be performed on any standard neural network model with varying compute requirements. For the first time, we demonstrate this across a sample of real-world models and highlight key challenges and design solutions. This presents a new transparency paradigm in the verifiable evaluation of private models.
Submitted 22 May, 2024; v1 submitted 4 February, 2024;
originally announced February 2024.
-
zkTax: A pragmatic way to support zero-knowledge tax disclosures
Authors:
Alex Berke,
Tobin South,
Robert Mahari,
Kent Larson,
Alex Pentland
Abstract:
Tax returns contain key financial information of interest to third parties: public officials are asked to share financial data for transparency, companies seek to assess the financial status of business partners, and individuals need to prove their income to landlords or to receive benefits. Tax returns also contain sensitive data such that sharing them in their entirety undermines privacy. We introduce a zero-knowledge tax disclosure system (zkTax) that allows individuals and organizations to make provable claims about select information in their tax returns without revealing additional information, which can be independently verified by third parties. The system consists of three distinct services that can be distributed: a tax authority provides tax documents signed with a public key; a Redact & Prove Service enables users to produce a redacted version of the tax documents with a zero-knowledge proof attesting the provenance of the redacted data; a Verify Service enables anyone to verify the proof. We implement a prototype with a user interface, compatible with U.S. tax forms, and demonstrate how this design could be implemented with minimal changes to existing tax infrastructure. Our system is designed to be extensible to other contexts and jurisdictions. This work provides a practical example of how distributed tools leveraging cryptography can enhance existing government or financial infrastructures, providing immediate transparency alongside privacy without system overhauls.
Submitted 24 March, 2024; v1 submitted 21 November, 2023;
originally announced November 2023.
-
Don't forget private retrieval: distributed private similarity search for large language models
Authors:
Guy Zyskind,
Tobin South,
Alex Pentland
Abstract:
While the flexible capabilities of large language models (LLMs) allow them to answer a range of queries based on existing learned knowledge, information retrieval to augment generation is an important tool to allow LLMs to answer questions on information not included in pre-training data. Such private information is increasingly being generated in a wide array of distributed contexts by organizations and individuals. Performing such information retrieval using neural embeddings of queries and documents always leaked information about queries and database content unless both were stored locally. We present Private Retrieval Augmented Generation (PRAG), an approach that uses multi-party computation (MPC) to securely transmit queries to a distributed set of servers containing a privately constructed database to return top-k and approximate top-k documents. This is a first-of-its-kind approach to dense information retrieval that ensures no server observes a client's query or can see the database content. The approach introduces a novel MPC friendly protocol for inverted file approximate search (IVF) that allows for fast document search over distributed and private data in sublinear communication complexity. This work presents new avenues through which data for use in LLMs can be accessed and used without needing to centralize or forgo privacy.
Submitted 21 November, 2023;
originally announced November 2023.
-
Open Problems in DAOs
Authors:
Joshua Tan,
Tara Merk,
Sarah Hubbard,
Eliza R. Oak,
Helena Rong,
Joni Pirovich,
Ellie Rennie,
Rolf Hoefer,
Michael Zargham,
Jason Potts,
Chris Berg,
Reuben Youngblom,
Primavera De Filippi,
Seth Frey,
Jeff Strnad,
Morshed Mannan,
Kelsie Nabben,
Silke Noa Elrifai,
Jake Hartnell,
Benjamin Mako Hill,
Tobin South,
Ryan L. Thomas,
Jonathan Dotan,
Ariana Spring,
Alexia Maddox
, et al. (4 additional authors not shown)
Abstract:
Decentralized autonomous organizations (DAOs) are a new, rapidly-growing class of organizations governed by smart contracts. Here we describe how researchers can contribute to the emerging science of DAOs and other digitally-constituted organizations. From granular privacy primitives to mechanism designs to model laws, we identify high-impact problems in the DAO ecosystem where existing gaps might be tackled through a new data set or by applying tools and ideas from existing research fields such as political science, computer science, economics, law, and organizational science. Our recommendations encompass exciting research questions as well as promising business opportunities. We call on the wider research community to join the global effort to invent the next generation of organizations.
Submitted 12 June, 2024; v1 submitted 29 October, 2023;
originally announced October 2023.
-
Building a healthier feed: Private location trace intersection driven feed recommendations
Authors:
Tobin South,
Nick Lothian,
Alex "Sandy" Pentland
Abstract:
The physical environment you navigate strongly determines which communities and people matter most to individuals. These effects drive both personal access to opportunities and the social capital of communities, and can often be observed in the personal mobility traces of individuals. Traditional social media feeds underutilize these mobility-based features, or do so in a privacy exploitative manner. Here we propose a consent-first private information sharing paradigm for driving social feeds from users' personal private data, specifically using mobility traces. This approach designs the feed to explicitly optimize for integrating the user into the local community and for social capital building through leveraging mobility trace overlaps as a proxy for existing or potential real-world social connections, creating proportionality between whom a user sees in their feed, and whom the user is likely to see in person. These claims are validated against existing social-mobility data, and a reference implementation of the proposed algorithm is built for demonstration. In total, this work presents a novel technique for designing feeds that represent real offline social connections through private set intersections requiring no third party, or public data exposure.
Submitted 20 September, 2023; v1 submitted 4 October, 2022;
originally announced October 2022.
-
Information flow estimation: a study of news on Twitter
Authors:
Tobin South,
Bridget Smart,
Matthew Roughan,
Lewis Mitchell
Abstract:
News media has long been an ecosystem of creation, reproduction, and critique, where news outlets report on current events and add commentary to ongoing stories. Understanding the dynamics of news information creation and dispersion is important to accurately ascribe credit to influential work and understand how societal narratives develop. These dynamics can be modelled through a combination of information-theoretic natural language processing and networks; and can be parameterised using large quantities of textual data. However, it is challenging to see "the wood for the trees", i.e., to detect small but important flows of information in a sea of noise. Here we develop new comparative techniques to estimate temporal information flow between pairs of text producers. Using both simulated and real text data we compare the reliability and sensitivity of methods for estimating textual information flow, showing that a metric that normalises by local neighbourhood structure provides a robust estimate of information flow in large networks. We apply this metric to a large corpus of news organisations on Twitter and demonstrate its usefulness in identifying influence within an information ecosystem, finding that average information contribution to the network is not correlated with the number of followers or the number of tweets. This suggests that small local organisations and right-wing organisations which have lower average follower counts still contribute significant information to the ecosystem. Further, the methods are applied to smaller full-text datasets of specific news events across news sites and Russian troll accounts on Twitter. The information flow estimation reveals and quantifies features of how these events develop and the role of groups of trolls in setting disinformation narratives.
Submitted 28 September, 2022; v1 submitted 12 May, 2022;
originally announced May 2022.
-
Are we always in strife? A longitudinal study of the echo chamber effect in the Australian Twittersphere
Authors:
Mehwish Nasim,
Derek Weber,
Tobin South,
Jonathan Tuke,
Nigel Bean,
Lucia Falzon,
Lewis Mitchell
Abstract:
Contrary to expectations that the increased connectivity offered by the internet and particularly Online Social Networks (OSNs) would result in broad consensus on contentious issues, we instead frequently observe the formation of polarised echo chambers, in which only one side of an argument is entertained. These can progress to filter bubbles, actively filtering contrasting opinions, resulting in vulnerability to misinformation and increased polarisation on social and political issues. These have real-world effects when they spread offline, such as vaccine hesitation and violence. This work seeks to develop a better understanding of how echo chambers manifest in different discussions dealing with different issues over an extended period of time. We explore the activities of two groups of polarised accounts across three Twitter discussions in the Australian context. We found Australian Twitter accounts arguing against marriage equality in 2017 were more likely to support the notion that arsonists were the primary cause of the 2019/2020 Australian bushfires, and those supporting marriage equality argued against that arson narrative. We also found strong evidence that the stance people took on marriage equality in 2017 did not predict their political stance in discussions around the Australian federal election two years later. Although mostly isolated from each other, we observe that in certain situations the polarised groups may interact with the broader community, which offers hope that the echo chambers may be reduced with concerted outreach to members.
Submitted 22 January, 2022;
originally announced January 2022.
-
Popularity and Centrality in Spotify Networks: Critical transitions in eigenvector centrality
Authors:
Tobin South,
Matthew Roughan,
Lewis Mitchell
Abstract:
The modern age of digital music access has increased the availability of data about music consumption and creation, facilitating the large-scale analysis of the complex networks that connect music together. Data about user streaming behaviour, and the musical collaboration networks are particularly important with new data-driven recommendation systems. Without thorough analysis, such collaboration graphs can lead to false or misleading conclusions. Here we present a new collaboration network of artists from the online music streaming service Spotify, and demonstrate a critical change in the eigenvector centrality of artists, as low popularity artists are removed. The critical change in centrality, from classical artists to rap artists, demonstrates deeper structural properties of the network. A Social Group Centrality model is presented to simulate this critical transition behaviour, and switching between dominant eigenvectors is observed. This model presents a novel investigation of the effect of popularity bias on how centrality and importance are measured, and provides a new tool for examining such flaws in networks.
Submitted 29 August, 2021; v1 submitted 26 August, 2020;
originally announced August 2020.
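Illustrative note: the abstract above turns on eigenvector centrality, which scores a node by the scores of its neighbours. The sketch below computes it by power iteration on a tiny hypothetical star graph. It is not the authors' code or data; the function name, the graph, and the choice to iterate with the adjacency matrix plus the identity (to avoid oscillation on bipartite graphs) are all assumptions made for illustration.

```python
def eigenvector_centrality(adj, iterations=200):
    """Power iteration on (A + I); returns scores scaled so the maximum is 1.

    adj is a symmetric 0/1 adjacency matrix given as a list of lists.
    Adding the identity leaves the leading eigenvector unchanged but
    keeps the iteration from oscillating on bipartite graphs.
    """
    n = len(adj)
    x = [1.0] * n
    for _ in range(iterations):
        nxt = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = max(nxt)
        x = [v / norm for v in nxt]
    return x


# Hypothetical collaboration graph: artist 0 has worked with artists 1, 2 and 3,
# who have not worked with each other.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
scores = eigenvector_centrality(star)
```

Here the hub receives the top score and each leaf receives the hub's score divided by the square root of three, the leading eigenvalue of this graph.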
-
Generation of nitrogen-vacancy ensembles in diamond for quantum sensors: Optimization and scalability of CVD processes
Authors:
Andrew M. Edmonds,
Connor A. Hart,
Matthew J. Turner,
Pierre-Olivier Colard,
Jennifer M. Schloss,
Kevin Olsson,
Raisa Trubko,
Matthew L. Markham,
Adam Rathmill,
Ben Horne-Smith,
Wilbur Lew,
Arul Manickam,
Scott Bruce,
Peter G. Kaup,
Jon C. Russo,
Michael J. DiMario,
Joseph T. South,
Jay T. Hansen,
Daniel J. Twitchen,
Ronald L. Walsworth
Abstract:
Ensembles of nitrogen-vacancy (NV) centers in diamond are a leading platform for practical quantum sensors. Reproducible and scalable fabrication of NV-ensembles with desired properties is crucial. This work addresses these challenges by developing a chemical vapor deposition (CVD) synthesis process to produce diamond material at scale with improved NV-ensemble properties for a target NV density. The material reported in this work enables immediate sensitivity improvements for current devices. In addition, techniques established in this work for material and sensor characterization at different stages of the CVD synthesis process provide metrics for future efforts targeting other NV densities or sample geometries.
Submitted 3 April, 2020;
originally announced April 2020.
-
Complex contagion features without social reinforcement in a model of social information flow
Authors:
Tyson Pond,
Saranzaya Magsarjav,
Tobin South,
Lewis Mitchell,
James P. Bagrow
Abstract:
Contagion models are a primary lens through which we understand the spread of information over social networks. However, simple contagion models cannot reproduce the complex features observed in real-world data, leading to research on more complicated complex contagion models. A noted feature of complex contagion is social reinforcement: individuals require multiple exposures to information before they begin to spread it themselves. Here we show that the quoter model, a model of the social flow of written information over a network, displays features of complex contagion, including the weakness of long ties and that increased density inhibits rather than promotes information flow. Interestingly, the quoter model exhibits these features despite having no explicit social reinforcement mechanism, unlike complex contagion models. Our results highlight the need to complement contagion models with an information-theoretic view of information spreading to better understand how network properties affect information flow and what the most necessary ingredients are when modeling social behavior.
Submitted 26 February, 2020; v1 submitted 12 February, 2020;
originally announced February 2020.
-
How the Avengers assemble: Ecological modelling of effective cast sizes for movies
Authors:
Matthew Roughan,
Lewis Mitchell,
Tobin South
Abstract:
The number of characters in a movie is an interesting feature. However, it is non-trivial to measure directly. Naive metrics such as the number of credited characters vary wildly. Here, we show that a metric based on the notion of "ecological diversity" as expressed through a Shannon-entropy based metric can characterise the number of characters in a movie, and is useful in taxonomic classification. We also show how the metric can be generalised using Jensen-Shannon divergence to provide a measure of the similarity of characters appearing in different movies, for instance of use in recommender systems, e.g., Netflix. We apply our measures to the Marvel Cinematic Universe (MCU), and show what they teach us about this highly successful franchise of movies. In particular, these measures provide a useful predictor of "success" for films in the MCU, as well as a natural means to understand the relationships between the stories in the overall film arc.
Submitted 19 June, 2019;
originally announced June 2019.
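Illustrative note: the abstract above describes a Shannon-entropy based diversity metric for cast size. One common ecological form of such a metric is the exponential of the Shannon entropy of the characters' shares, sometimes called the effective number of species. The sketch below assumes that form and assumes the shares come from counting appearances; the paper's exact definition and inputs may differ, and the function name and sample data are hypothetical.

```python
import math
from collections import Counter


def effective_cast_size(appearances):
    """Exponential of the Shannon entropy (natural log) of character shares.

    appearances is any iterable of character labels, one per appearance
    (for example, one entry per scene or per line of dialogue).
    """
    counts = Counter(appearances)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


# Four characters with equal presence behave like a cast of four.
even = effective_cast_size(["a", "b", "c", "d"])

# One dominant character pulls the effective size far below the raw count of four.
skewed = effective_cast_size(["a"] * 97 + ["b", "c", "d"])
```

The raw credited count is four in both cases, while the entropy-based measure separates an ensemble piece from a film carried by a single lead.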