Search | arXiv e-print repository

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Authors: Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf

Abstract: The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produ… ▽ More The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments. △ Less

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2405.13974 [pdf, other]

CIVICS: Building a Dataset for Examining Culturally-Informed Values in Large Language Models

Authors: Giada Pistilli, Alina Leidinger, Yacine Jernite, Atoosa Kasirzadeh, Alexandra Sasha Luccioni, Margaret Mitchell

Abstract: This paper introduces the "CIVICS: Culturally-Informed & Values-Inclusive Corpus for Societal impacts" dataset, designed to evaluate the social and cultural variation of Large Language Models (LLMs) across multiple languages and value-sensitive topics. We create a hand-crafted, multilingual dataset of value-laden prompts which address specific socially sensitive topics, including LGBTQI rights, so… ▽ More This paper introduces the "CIVICS: Culturally-Informed & Values-Inclusive Corpus for Societal impacts" dataset, designed to evaluate the social and cultural variation of Large Language Models (LLMs) across multiple languages and value-sensitive topics. We create a hand-crafted, multilingual dataset of value-laden prompts which address specific socially sensitive topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy. CIVICS is designed to generate responses showing LLMs' encoded and implicit values. Through our dynamic annotation processes, tailored prompt design, and experiments, we investigate how open-weight LLMs respond to value-sensitive issues, exploring their behavior across diverse linguistic and cultural contexts. Using two experimental set-ups based on log-probabilities and long-form responses, we show social and cultural variability across different LLMs. Specifically, experiments involving long-form responses demonstrate that refusals are triggered disparately across models, but consistently and more frequently in English or translated statements. Moreover, specific topics and sources lead to more pronounced differences across model answers, particularly on immigration, LGBTQI rights, and social welfare. As shown by our experiments, the CIVICS dataset aims to serve as a tool for future research, promoting reproducibility and transparency across broader linguistic settings, and furthering the development of AI technologies that respect and reflect global cultural diversities and value pluralism. The CIVICS dataset and tools will be made available upon publication under open licenses; an anonymized version is currently available at https://huggingface.co/CIVICS-dataset. △ Less

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2402.08955 [pdf, other]

Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models

Authors: Martha Lewis, Melanie Mitchell

Abstract: Large language models (LLMs) have performed well on several reasoning benchmarks, including ones that test analogical reasoning abilities. However, it has been debated whether they are actually performing humanlike abstract reasoning or instead employing less general processes that rely on similarity to what has been seen in their training data. Here we investigate the generality of analogy-making… ▽ More Large language models (LLMs) have performed well on several reasoning benchmarks, including ones that test analogical reasoning abilities. However, it has been debated whether they are actually performing humanlike abstract reasoning or instead employing less general processes that rely on similarity to what has been seen in their training data. Here we investigate the generality of analogy-making abilities previously claimed for LLMs (Webb, Holyoak, & Lu, 2023). We take one set of analogy problems used to evaluate LLMs and create a set of "counterfactual" variants-versions that test the same abstract reasoning abilities but that are likely dissimilar from any pre-training data. We test humans and three GPT models on both the original and counterfactual problems, and show that, while the performance of humans remains high for all the problems, the GPT models' performance declines sharply on the counterfactual set. This work provides evidence that, despite previously reported successes of LLMs on analogical reasoning, these models lack the robustness and generality of human analogy-making. △ Less

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2401.10376 [pdf, other]

PAC Code Rate-Profile Design Using Search-Constrained Optimization Algorithms

Authors: Mohsen Moradi, David G. M. Mitchell

Abstract: In this paper, we introduce a novel rate-profile design based on search-constrained optimization techniques to assess the performance of polarization-adjusted convolutional (PAC) codes under Fano (sequential) decoding. The results demonstrate that the resulting PAC code offers much reduced computational complexity compared to a construction based on a conventional genetic algorithm without a perfo… ▽ More In this paper, we introduce a novel rate-profile design based on search-constrained optimization techniques to assess the performance of polarization-adjusted convolutional (PAC) codes under Fano (sequential) decoding. The results demonstrate that the resulting PAC code offers much reduced computational complexity compared to a construction based on a conventional genetic algorithm without a performance loss in error-correction performance. As the fitness function of our algorithm, we propose an adaptive successive cancellation list decoding algorithm to determine the weight distribution of the rate profiles. The simulation results indicate that, for a PAC(256, 128) code, only 8% of the population requires that their fitness function be evaluated with a large list size. This represents an improvement of almost 92% over a conventional evolutionary algorithm. For a PAC(64, 32) code, this improvement is about 99%. We also plotted the performance of the high-rate PAC(128, 105) and PAC(64, 51) codes, and the results show that they exhibit superior performance compared to other algorithms. △ Less

Submitted 18 January, 2024; originally announced January 2024.

arXiv:2401.10284 [pdf, other]

MorpheusNet: Resource efficient sleep stage classifier for embedded on-line systems

Authors: Ali Kavoosi, Morgan P. Mitchell, Raveen Kariyawasam, John E. Fleming, Penny Lewis, Heidi Johansen-Berg, Hayriye Cagnan, Timothy Denison

Abstract: Sleep Stage Classification (SSC) is a labor-intensive task, requiring experts to examine hours of electrophysiological recordings for manual classification. This is a limiting factor when it comes to leveraging sleep stages for therapeutic purposes. With increasing affordability and expansion of wearable devices, automating SSC may enable deployment of sleep-based therapies at scale. Deep Learning… ▽ More Sleep Stage Classification (SSC) is a labor-intensive task, requiring experts to examine hours of electrophysiological recordings for manual classification. This is a limiting factor when it comes to leveraging sleep stages for therapeutic purposes. With increasing affordability and expansion of wearable devices, automating SSC may enable deployment of sleep-based therapies at scale. Deep Learning has gained increasing attention as a potential method to automate this process. Previous research has shown accuracy comparable to manual expert scores. However, previous approaches require sizable amount of memory and computational resources. This constrains the ability to classify in real time and deploy models on the edge. To address this gap, we aim to provide a model capable of predicting sleep stages in real-time, without requiring access to external computational sources (e.g., mobile phone, cloud). The algorithm is power efficient to enable use on embedded battery powered systems. Our compact sleep stage classifier can be deployed on most off-the-shelf microcontrollers (MCU) with constrained hardware settings. This is due to the memory footprint of our approach requiring significantly fewer operations. The model was tested on three publicly available data bases and achieved performance comparable to the state of the art, whilst reducing model complexity by orders of magnitude (up to 280 times smaller compared to state of the art). We further optimized the model with quantization of parameters to 8 bits with only an average drop of 0.95% in accuracy. When implemented in firmware, the quantized model achieves a latency of 1.6 seconds on an Arm CortexM4 processor, allowing its use for on-line SSC-based therapies. △ Less

Submitted 14 January, 2024; originally announced January 2024.

Comments: This paper was presented at the 2023 IEEE conference on Systems, Man, and Cybernetics (SMC)

arXiv:2312.09323 [pdf, other]

Perspectives on the State and Future of Deep Learning - 2023

Authors: Micah Goldblum, Anima Anandkumar, Richard Baraniuk, Tom Goldstein, Kyunghyun Cho, Zachary C Lipton, Melanie Mitchell, Preetum Nakkiran, Max Welling, Andrew Gordon Wilson

Abstract: The goal of this series is to chronicle opinions and issues in the field of machine learning as they stand today and as they change over time. The plan is to host this survey periodically until the AI singularity paperclip-frenzy-driven doomsday, kee** an updated list of topical questions and interviewing new community members for each edition. In this issue, we probed people's opinions on inter… ▽ More The goal of this series is to chronicle opinions and issues in the field of machine learning as they stand today and as they change over time. The plan is to host this survey periodically until the AI singularity paperclip-frenzy-driven doomsday, kee** an updated list of topical questions and interviewing new community members for each edition. In this issue, we probed people's opinions on interpretable AI, the value of benchmarking in modern NLP, the state of progress towards understanding deep learning, and the future of academia. △ Less

Submitted 18 December, 2023; v1 submitted 7 December, 2023; originally announced December 2023.

arXiv:2311.16896 [pdf, other]

65 GOPS/neuron Photonic Tensor Core with Thin-film Lithium Niobate Photonics

Authors: Zhong** Lin, Bhavin J. Shastri, Shangxuan Yu, **gxiang Song, Yuntao Zhu, Arman Safarnejadian, Wangning Cai, Yanmei Lin, Wei Ke, Mustafa Hammood, Tianye Wang, Mengyue Xu, Zibo Zheng, Mohammed Al-Qadasi, Omid Esmaeeli, Mohamed Rahim, Grzegorz Pakulski, Jens Schmid, Pedro Barrios, Weihong Jiang, Hugh Morison, Matthew Mitchell, Xiaogang Qiang, Xun Guan, Nicolas A. F. Jaeger , et al. (6 additional authors not shown)

Abstract: Photonics offers a transformative approach to artificial intelligence (AI) and neuromorphic computing by providing low latency, high bandwidth, and energy-efficient computations. Here, we introduce a photonic tensor core processor enabled by time-multiplexed inputs and charge-integrated outputs. This fully integrated processor, comprising only two thin-film lithium niobate (TFLN) modulators, a III… ▽ More Photonics offers a transformative approach to artificial intelligence (AI) and neuromorphic computing by providing low latency, high bandwidth, and energy-efficient computations. Here, we introduce a photonic tensor core processor enabled by time-multiplexed inputs and charge-integrated outputs. This fully integrated processor, comprising only two thin-film lithium niobate (TFLN) modulators, a III-V laser, and a charge-integration photoreceiver, can implement an entire layer of a neural network. It can execute 65 billion operations per second (GOPS) per neuron, including simultaneous weight updates-a hitherto unachieved speed. Our processor stands out from conventional photonic processors, which have static weights set during training, as it supports fast "hardware-in-the-loop" training, and can dynamically adjust the inputs (fan-in) and outputs (fan-out) within a layer, thereby enhancing its versatility. Our processor can perform large-scale dot-product operations with vector dimensions up to 131,072. Furthermore, it successfully classifies (supervised learning) and clusters (unsupervised learning) 112*112-pixel images after "hardware-in-the-loop" training. To handle "hardware-in-the-loop" training for clustering AI tasks, we provide a solution for multiplications involving two negative numbers based on our processor. △ Less

Submitted 30 November, 2023; v1 submitted 28 November, 2023; originally announced November 2023.

Comments: 19 pages, 6 figures

MSC Class: 78A05

arXiv:2311.09247 [pdf, other]

Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks

Authors: Melanie Mitchell, Alessandro B. Palmarini, Arseny Moskvichev

Abstract: We explore the abstract reasoning abilities of text-only and multimodal versions of GPT-4, using the ConceptARC benchmark [10], which is designed to evaluate robust understanding and reasoning with core-knowledge concepts. We extend the work of Moskvichev et al. [10] by evaluating GPT-4 on more detailed, one-shot prompting (rather than simple, zero-shot prompts) with text versions of ConceptARC ta… ▽ More We explore the abstract reasoning abilities of text-only and multimodal versions of GPT-4, using the ConceptARC benchmark [10], which is designed to evaluate robust understanding and reasoning with core-knowledge concepts. We extend the work of Moskvichev et al. [10] by evaluating GPT-4 on more detailed, one-shot prompting (rather than simple, zero-shot prompts) with text versions of ConceptARC tasks, and by evaluating GPT-4V, the multimodal version of GPT-4, on zero- and one-shot prompts using image versions of the simplest tasks. Our experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels. △ Less

Submitted 11 December, 2023; v1 submitted 13 November, 2023; originally announced November 2023.

Comments: Corrected Figure 3 (extra spaces were replaced by commas, which were lost in original formatting)

Journal ref: Proceedings of the LLM-CP Workshop, AAAI 2024

arXiv:2310.01557 [pdf, other]

SmartPlay: A Benchmark for LLMs as Intelligent Agents

Authors: Yue Wu, Xuan Tang, Tom M. Mitchell, Yuanzhi Li

Abstract: Recent large language models (LLMs) have demonstrated great potential toward intelligent agents and next-gen automation, but there currently lacks a systematic benchmark for evaluating LLMs' abilities as agents. We introduce SmartPlay: both a challenging benchmark and a methodology for evaluating LLMs as agents. SmartPlay consists of 6 different games, including Rock-Paper-Scissors, Tower of Hanoi… ▽ More Recent large language models (LLMs) have demonstrated great potential toward intelligent agents and next-gen automation, but there currently lacks a systematic benchmark for evaluating LLMs' abilities as agents. We introduce SmartPlay: both a challenging benchmark and a methodology for evaluating LLMs as agents. SmartPlay consists of 6 different games, including Rock-Paper-Scissors, Tower of Hanoi, Minecraft. Each game features a unique setting, providing up to 20 evaluation settings and infinite environment variations. Each game in SmartPlay uniquely challenges a subset of 9 important capabilities of an intelligent LLM agent, including reasoning with object dependencies, planning ahead, spatial reasoning, learning from history, and understanding randomness. The distinction between the set of capabilities each game test allows us to analyze each capability separately. SmartPlay serves not only as a rigorous testing ground for evaluating the overall performance of LLM agents but also as a road-map for identifying gaps in current methodologies. We release our benchmark at github.com/Microsoft/SmartPlay △ Less

Submitted 17 March, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

arXiv:2307.14213 [pdf, other]

Soft Air Pocket Force Sensors for Large Scale Flexible Robots

Authors: Michael R. Mitchell, Ciera McFarland, Margaret M. Coad

Abstract: Flexible robots have advantages over rigid robots in their ability to conform physically to their environment and to form a wide variety of shapes. Sensing the force applied by or to flexible robots is useful for both navigation and manipulation tasks, but it is challenging due to the need for the sensors to withstand the robots' shape change without encumbering their functionality. Also, for robo… ▽ More Flexible robots have advantages over rigid robots in their ability to conform physically to their environment and to form a wide variety of shapes. Sensing the force applied by or to flexible robots is useful for both navigation and manipulation tasks, but it is challenging due to the need for the sensors to withstand the robots' shape change without encumbering their functionality. Also, for robots with long or large bodies, the number of sensors required to cover the entire surface area of the robot body can be prohibitive due to high cost and complexity. We present a novel soft air pocket force sensor that is highly flexible, lightweight, relatively inexpensive, and easily scalable to various sizes. Our sensor produces a change in internal pressure that is linear with the applied force. We present results of experimental testing of how uncontrollable factors (contact location and contact area) and controllable factors (initial internal pressure, thickness, size, and number of interior seals) affect the sensitivity. We demonstrate our sensor applied to a vine robot-a soft inflatable robot that "grows" from the tip via eversion-and we show that the robot can successfully grow and steer towards an object with which it senses contact. △ Less

Submitted 26 July, 2023; originally announced July 2023.

Comments: M. R. Mitchell, C. McFarland, and M. M. Coad, "Soft Air Pocket Force Sensors for Large Scale Flexible Robots," in IEEE International Conference on Soft Robotics, 2023, pp. 1-8. Video: https://youtu.be/2De0htilW74

arXiv:2307.13905 [pdf, other]

Reinforcement Learning for Sequential Decoding of Generalized LDPC Codes

Authors: Salman Habib, David G. M. Mitchell

Abstract: In this work, we propose reinforcement learning (RL) for sequential decoding of moderate length generalized low-density parity-check (GLDPC) codes. Here, sequential decoding refers to scheduling all the generalized constraint nodes (GCNs) and single parity-check nodes (SPCNs) of a GLDPC code serially in each iteration. A GLDPC decoding environment is modeled as a finite Markov decision process (MD… ▽ More In this work, we propose reinforcement learning (RL) for sequential decoding of moderate length generalized low-density parity-check (GLDPC) codes. Here, sequential decoding refers to scheduling all the generalized constraint nodes (GCNs) and single parity-check nodes (SPCNs) of a GLDPC code serially in each iteration. A GLDPC decoding environment is modeled as a finite Markov decision process (MDP) in which the state-space comprises of all possible sequences of hard-decision values of the variables nodes (VNs) connected to the scheduled GCN or SPCN, and the action-space of the MDP consists of all possible actions (GCN and SPCN scheduling). The goal of RL is to determine an optimized scheduling policy, i.e., one that results in a decoded codeword by minimizing the complexity of the belief propagation (BP) decoder. For training, we consider the proportion of correct bits at the output of the GCN or SPCN as a reward once it is scheduled. The expected rewards for scheduling all the GCNs/SPCNs in the code's Tanner graph are earned via BP decoding during the RL phase. The proposed RL-based decoding scheme is shown to significantly outperform the standard BP flooding decoder, as well as a sequential decoder in which the GCNs/SPCNs are scheduled randomly. △ Less

Submitted 25 July, 2023; originally announced July 2023.

Comments: accepted for publication at ISTC 2023. arXiv admin note: text overlap with arXiv:2112.13934

arXiv:2306.05949 [pdf, other]

Evaluating the Social Impact of Generative AI Systems in Systems and Society

Authors: Irene Solaiman, Zeerak Talat, William Agnew, Lama Ahmad, Dylan Baker, Su Lin Blodgett, Canyu Chen, Hal Daumé III, Jesse Dodge, Isabella Duan, Ellie Evans, Felix Friedrich, Avijit Ghosh, Usman Gohar, Sara Hooker, Yacine Jernite, Ria Kalluri, Alberto Lusoli, Alina Leidinger, Michelle Lin, Xiuzhu Lin, Sasha Luccioni, Jennifer Mickel, Margaret Mitchell, Jessica Newman , et al. (6 additional authors not shown)

Abstract: Generative AI systems across modalities, ranging from text (including code), image, audio, and video, have broad social impacts, but there is no official standard for means of evaluating those impacts or for which impacts should be evaluated. In this paper, we present a guide that moves toward a standard approach in evaluating a base generative AI system for any modality in two overarching categor… ▽ More Generative AI systems across modalities, ranging from text (including code), image, audio, and video, have broad social impacts, but there is no official standard for means of evaluating those impacts or for which impacts should be evaluated. In this paper, we present a guide that moves toward a standard approach in evaluating a base generative AI system for any modality in two overarching categories: what can be evaluated in a base system independent of context and what can be evaluated in a societal context. Importantly, this refers to base systems that have no predetermined application or deployment context, including a model itself, as well as system components, such as training data. Our framework for a base system defines seven categories of social impact: bias, stereotypes, and representational harms; cultural values and sensitive content; disparate performance; privacy and data protection; financial costs; environmental costs; and data and content moderation labor costs. Suggested methods for evaluation apply to listed generative modalities and analyses of the limitations of existing evaluations serve as a starting point for necessary investment in future evaluations. We offer five overarching categories for what can be evaluated in a broader societal context, each with its own subcategories: trustworthiness and autonomy; inequality, marginalization, and violence; concentration of authority; labor and creativity; and ecosystem and environment. Each subcategory includes recommendations for mitigating harm. △ Less

Submitted 28 June, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

Comments: Forthcoming in Hacker, Engel, Hammer, Mittelstadt (eds), Oxford Handbook on the Foundations and Regulation of Generative AI. Oxford University Press

arXiv:2305.18615 [pdf, other]

doi 10.1145/3593013.3594002

Stronger Together: on the Articulation of Ethical Charters, Legal Tools, and Technical Documentation in ML

Authors: Giada Pistilli, Carlos Munoz Ferrandis, Yacine Jernite, Margaret Mitchell

Abstract: The growing need for accountability of the people behind AI systems can be addressed by leveraging processes in three fields of study: ethics, law, and computer science. While these fields are often considered in isolation, they rely on complementary notions in their interpretation and implementation. In this work, we detail this interdependence and motivate the necessary role of collaborative gov… ▽ More The growing need for accountability of the people behind AI systems can be addressed by leveraging processes in three fields of study: ethics, law, and computer science. While these fields are often considered in isolation, they rely on complementary notions in their interpretation and implementation. In this work, we detail this interdependence and motivate the necessary role of collaborative governance tools in sha** a positive evolution of AI. We first contrast notions of compliance in the ethical, legal, and technical fields; we outline both their differences and where they complement each other, with a particular focus on the roles of ethical charters, licenses, and technical documentation in these interactions. We then focus on the role of values in articulating the synergies between the fields and outline specific mechanisms of interaction between them in practice. We identify how these mechanisms have played out in several open governance fora: an open collaborative workshop, a responsible licensing initiative, and a proposed regulatory framework. By leveraging complementary notions of compliance in these three domains, we can create a more comprehensive framework for governing AI systems that jointly takes into account their technical capabilities, their impact on society, and how technical specifications can inform relevant regulations. Our analysis thus underlines the necessity of joint consideration of the ethical, legal, and technical in AI ethics frameworks to be used on a larger scale to govern AI systems and how the thinking in each of these areas can inform the others. △ Less

Submitted 9 May, 2023; originally announced May 2023.

arXiv:2305.07141 [pdf, other]

The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain

Authors: Arseny Moskvichev, Victor Vikram Odouard, Melanie Mitchell

Abstract: The abilities to form and abstract concepts is key to human intelligence, but such abilities remain lacking in state-of-the-art AI systems. There has been substantial research on conceptual abstraction in AI, particularly using idealized domains such as Raven's Progressive Matrices and Bongard problems, but even when AI systems succeed on such problems, the systems are rarely evaluated in depth to… ▽ More The abilities to form and abstract concepts is key to human intelligence, but such abilities remain lacking in state-of-the-art AI systems. There has been substantial research on conceptual abstraction in AI, particularly using idealized domains such as Raven's Progressive Matrices and Bongard problems, but even when AI systems succeed on such problems, the systems are rarely evaluated in depth to see if they have actually grasped the concepts they are meant to capture. In this paper we describe an in-depth evaluation benchmark for the Abstraction and Reasoning Corpus (ARC), a collection of few-shot abstraction and analogy problems developed by Chollet [2019]. In particular, we describe ConceptARC, a new, publicly available benchmark in the ARC domain that systematically assesses abstraction and generalization abilities on a number of basic spatial and semantic concepts. ConceptARC differs from the original ARC dataset in that it is specifically organized around "concept groups" -- sets of problems that focus on specific concepts and that are vary in complexity and level of abstraction. We report results on testing humans on this benchmark as well as three machine solvers: the top two programs from a 2021 ARC competition and OpenAI's GPT-4. Our results show that humans substantially outperform the machine solvers on this benchmark, showing abilities to abstract and generalize concepts that are not yet captured by AI systems. We believe that this benchmark will spur improvements in the development of AI systems for conceptual abstraction and in the effective evaluation of such systems. △ Less

Submitted 11 May, 2023; originally announced May 2023.

Journal ref: Transactions on Machine Learning Research, 8/2023

arXiv:2304.13626 [pdf, other]

The Roles of Symbols in Neural-based AI: They are Not What You Think!

Authors: Daniel L. Silver, Tom M. Mitchell

Abstract: We propose that symbols are first and foremost external communication tools used between intelligent agents that allow knowledge to be transferred in a more efficient and effective manner than having to experience the world directly. But, they are also used internally within an agent through a form of self-communication to help formulate, describe and justify subsymbolic patterns of neural activit… ▽ More We propose that symbols are first and foremost external communication tools used between intelligent agents that allow knowledge to be transferred in a more efficient and effective manner than having to experience the world directly. But, they are also used internally within an agent through a form of self-communication to help formulate, describe and justify subsymbolic patterns of neural activity that truly implement thinking. Symbols, and our languages that make use of them, not only allow us to explain our thinking to others and ourselves, but also provide beneficial constraints (inductive bias) on learning about the world. In this paper we present relevant insights from neuroscience and cognitive science, about how the human brain represents symbols and the concepts they refer to, and how today's artificial neural networks can do the same. We then present a novel neuro-symbolic hypothesis and a plausible architecture for intelligent agents that combines subsymbolic representations for symbols and concepts for learning and reasoning. Our hypothesis and associated architecture imply that symbols will remain critical to the future of intelligent systems NOT because they are the fundamental building blocks of thought, but because they are characterizations of subsymbolic processes that constitute thought. △ Less

Submitted 26 April, 2023; originally announced April 2023.

Comments: 28 pages

arXiv:2303.17853 [pdf, other]

Can AI Put Gamma-Ray Astrophysicists Out of a Job?

Authors: Samuel T. Spencer, Vikas Joshi, Alison M. W. Mitchell

Abstract: In what will likely be a litany of generative-model-themed arXiv submissions celebrating April the 1st, we evaluate the capacity of state-of-the-art transformer models to create a paper detailing the detection of a Pulsar Wind Nebula with a non-existent Imaging Atmospheric Cherenkov Telescope (IACT) Array. We do this to evaluate the ability of such models to interpret astronomical observations and… ▽ More In what will likely be a litany of generative-model-themed arXiv submissions celebrating April the 1st, we evaluate the capacity of state-of-the-art transformer models to create a paper detailing the detection of a Pulsar Wind Nebula with a non-existent Imaging Atmospheric Cherenkov Telescope (IACT) Array. We do this to evaluate the ability of such models to interpret astronomical observations and sources based on language information alone, and to assess potential means by which fraudulently generated scientific papers could be identified during peer review (given that reliable generative model watermarking has yet to be deployed for these tools). We conclude that our jobs as astronomers are safe for the time being. From this point on, prompts given to ChatGPT and Stable Diffusion are shown in orange, text generated by ChatGPT is shown in black, whereas analysis by the (human) authors is in blue. △ Less

Submitted 4 April, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

arXiv:2303.11408 [pdf, other]

Stable Bias: Analyzing Societal Representations in Diffusion Models

Authors: Alexandra Sasha Luccioni, Christopher Akiki, Margaret Mitchell, Yacine Jernite

Abstract: As machine learning-enabled Text-to-Image (TTI) systems are becoming increasingly prevalent and seeing growing adoption as commercial services, characterizing the social biases they exhibit is a necessary first step to lowering their risk of discriminatory outcomes. This evaluation, however, is made more difficult by the synthetic nature of these systems' outputs: common definitions of diversity a… ▽ More As machine learning-enabled Text-to-Image (TTI) systems are becoming increasingly prevalent and seeing growing adoption as commercial services, characterizing the social biases they exhibit is a necessary first step to lowering their risk of discriminatory outcomes. This evaluation, however, is made more difficult by the synthetic nature of these systems' outputs: common definitions of diversity are grounded in social categories of people living in the world, whereas the artificial depictions of fictive humans created by these systems have no inherent gender or ethnicity. To address this need, we propose a new method for exploring the social biases in TTI systems. Our approach relies on characterizing the variation in generated images triggered by enumerating gender and ethnicity markers in the prompts, and comparing it to the variation engendered by spanning different professions. This allows us to (1) identify specific bias trends, (2) provide targeted scores to directly compare models in terms of diversity and representation, and (3) jointly model interdependent social variables to support a multidimensional analysis. We leverage this method to analyze images generated by 3 popular TTI systems (Dall-E 2, Stable Diffusion v 1.4 and 2) and find that while all of their outputs show correlations with US labor demographics, they also consistently under-represent marginalized identities to different extents. We also release the datasets and low-code interactive bias exploration platforms developed for this work, as well as the necessary tools to similarly evaluate additional TTI systems. △ Less

Submitted 9 November, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

Comments: Accepted to NeurIPS Datasets and Benchmarks 2023 (spotlight)

arXiv:2303.03915 [pdf, other]

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Authors: Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, Jörg Frohberg, Mario Šaško, Quentin Lhoest, Angelina McMillan-Major, Gerard Dupont, Stella Biderman, Anna Rogers, Loubna Ben allal, Francesco De Toni, Giada Pistilli, Olivier Nguyen, Somaieh Nikpoor, Maraim Masoud, Pierre Colombo, Javier de la Rosa , et al. (29 additional authors not shown)

Abstract: As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the f… ▽ More As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus. △ Less

Submitted 7 March, 2023; originally announced March 2023.

Comments: NeurIPS 2022, Datasets and Benchmarks Track

ACM Class: I.2.7

arXiv:2302.04449 [pdf, other]

Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals

Authors: Yue Wu, Yewen Fan, Paul Pu Liang, Amos Azaria, Yuanzhi Li, Tom M. Mitchell

Abstract: High sample complexity has long been a challenge for RL. On the other hand, humans learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents, e.g., instruction manuals. Instruction manuals and wiki pages are among the most abundant data that could inform agents of valuable features and policies or task-specific environmental dynamics and re… ▽ More High sample complexity has long been a challenge for RL. On the other hand, humans learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents, e.g., instruction manuals. Instruction manuals and wiki pages are among the most abundant data that could inform agents of valuable features and policies or task-specific environmental dynamics and reward structures. Therefore, we hypothesize that the ability to utilize human-written instruction manuals to assist learning policies for specific tasks should lead to a more efficient and better-performing agent. We propose the Read and Reward framework. Read and Reward speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers. Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual and a Reasoning module that evaluates object-agent interactions based on information from the manual. An auxiliary reward is then provided to a standard A2C RL agent, when interaction is detected. Experimentally, various RL algorithms obtain significant improvement in performance and training speed when assisted by our design. △ Less

Submitted 26 October, 2023; v1 submitted 9 February, 2023; originally announced February 2023.

arXiv:2212.05129 [pdf, other]

Measuring Data

Authors: Margaret Mitchell, Alexandra Sasha Luccioni, Nathan Lambert, Marissa Gerchick, Angelina McMillan-Major, Ezinwanne Ozoani, Nazneen Rajani, Tristan Thrush, Yacine Jernite, Douwe Kiela

Abstract: We identify the task of measuring data to quantitatively characterize the composition of machine learning data and datasets. Similar to an object's height, width, and volume, data measurements quantify different attributes of data along common dimensions that support comparison. Several lines of research have proposed what we refer to as measurements, with differing terminology; we bring some of t… ▽ More We identify the task of measuring data to quantitatively characterize the composition of machine learning data and datasets. Similar to an object's height, width, and volume, data measurements quantify different attributes of data along common dimensions that support comparison. Several lines of research have proposed what we refer to as measurements, with differing terminology; we bring some of this work together, particularly in fields of computer vision and language, and build from it to motivate measuring data as a critical component of responsible AI development. Measuring data aids in systematically building and analyzing machine learning (ML) data towards specific goals and gaining better control of what modern ML systems will learn. We conclude with a discussion of the many avenues of future work, the limitations of data measurements, and how to leverage these measurement approaches in research and practice. △ Less

Submitted 13 February, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

arXiv:2211.15533 [pdf, other]

The Stack: 3 TB of permissively licensed source code

Authors: Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries

Abstract: Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect t… ▽ More Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at https://hf.co/BigCode, provide a tool called "Am I in The Stack" (https://hf.co/spaces/bigcode/in-the-stack) for developers to search The Stack for copies of their code, and provide a process for code to be removed from the dataset by following the instructions at https://www.bigcode-project.org/docs/about/the-stack/. △ Less

Submitted 20 November, 2022; originally announced November 2022.

arXiv:2211.05100 [pdf, other]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Authors: BigScience Workshop, :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major , et al. (369 additional authors not shown)

Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access… ▽ More Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License. △ Less

Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

arXiv:2210.15767 [pdf]

Gathering Strength, Gathering Storms: The One Hundred Year Study on Artificial Intelligence (AI100) 2021 Study Panel Report

Authors: Michael L. Littman, Ifeoma Ajunwa, Guy Berger, Craig Boutilier, Morgan Currie, Finale Doshi-Velez, Gillian Hadfield, Michael C. Horowitz, Charles Isbell, Hiroaki Kitano, Karen Levy, Terah Lyons, Melanie Mitchell, Julie Shah, Steven Sloman, Shannon Vallor, Toby Walsh

Abstract: In September 2021, the "One Hundred Year Study on Artificial Intelligence" project (AI100) issued the second report of its planned long-term periodic assessment of artificial intelligence (AI) and its impact on society. It was written by a panel of 17 study authors, each of whom is deeply rooted in AI research, chaired by Michael Littman of Brown University. The report, entitled "Gathering Strengt… ▽ More In September 2021, the "One Hundred Year Study on Artificial Intelligence" project (AI100) issued the second report of its planned long-term periodic assessment of artificial intelligence (AI) and its impact on society. It was written by a panel of 17 study authors, each of whom is deeply rooted in AI research, chaired by Michael Littman of Brown University. The report, entitled "Gathering Strength, Gathering Storms," answers a set of 14 questions probing critical areas of AI development addressing the major risks and dangers of AI, its effects on society, its public perception and the future of the field. The report concludes that AI has made a major leap from the lab to people's lives in recent years, which increases the urgency to understand its potential negative effects. The questions were developed by the AI100 Standing Committee, chaired by Peter Stone of the University of Texas at Austin, consisting of a group of AI leaders with expertise in computer science, sociology, ethics, economics, and other disciplines. △ Less

Submitted 27 October, 2022; originally announced October 2022.

Comments: 82 pages, https://ai100.stanford.edu/gathering-strength-gathering-storms-one-hundred-year-study-artificial-intelligence-ai100-2021-study

arXiv:2210.13966 [pdf, ps, other]

doi 10.1073/pnas.2215907120

The Debate Over Understanding in AI's Large Language Models

Authors: Melanie Mitchell, David C. Krakauer

Abstract: We survey a current, heated debate in the AI research community on whether large pre-trained language models can be said to "understand" language -- and the physical and social situations language encodes -- in any important sense. We describe arguments that have been made for and against such understanding, and key questions for the broader sciences of intelligence that have arisen in light of th… ▽ More We survey a current, heated debate in the AI research community on whether large pre-trained language models can be said to "understand" language -- and the physical and social situations language encodes -- in any important sense. We describe arguments that have been made for and against such understanding, and key questions for the broader sciences of intelligence that have arisen in light of these arguments. We contend that a new science of intelligence can be developed that will provide insight into distinct modes of understanding, their strengths and limitations, and the challenge of integrating diverse forms of cognition. △ Less

Submitted 10 February, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

Comments: Under submission as a Perspective article. Updated with additional discussion and citations

Journal ref: Proceedings of the National Academy of Sciences 120 (13), 2023

arXiv:2210.13589 [pdf, ps, other]

Embodied, Situated, and Grounded Intelligence: Implications for AI

Authors: Tyler Millhouse, Melanie Moses, Melanie Mitchell

Abstract: In April of 2022, the Santa Fe Institute hosted a workshop on embodied, situated, and grounded intelligence as part of the Institute's Foundations of Intelligence project. The workshop brought together computer scientists, psychologists, philosophers, social scientists, and others to discuss the science of embodiment and related issues in human intelligence, and its implications for building robus… ▽ More In April of 2022, the Santa Fe Institute hosted a workshop on embodied, situated, and grounded intelligence as part of the Institute's Foundations of Intelligence project. The workshop brought together computer scientists, psychologists, philosophers, social scientists, and others to discuss the science of embodiment and related issues in human intelligence, and its implications for building robust, human-level AI. In this report, we summarize each of the talks and the subsequent discussions. We also draw out a number of key themes and identify important frontiers for future research. △ Less

Submitted 24 October, 2022; originally announced October 2022.

Comments: 38 pages, workshop report

arXiv:2210.05839 [pdf, other]

SEAL : Interactive Tool for Systematic Error Analysis and Labeling

Authors: Nazneen Rajani, Weixin Liang, Lingjiao Chen, Meg Mitchell, James Zou

Abstract: With the advent of Transformers, large language models (LLMs) have saturated well-known NLP benchmarks and leaderboards with high aggregate performance. However, many times these models systematically fail on tail data or rare groups not obvious in aggregate evaluation. Identifying such problematic data groups is even more challenging when there are no explicit labels (e.g., ethnicity, gender, etc… ▽ More With the advent of Transformers, large language models (LLMs) have saturated well-known NLP benchmarks and leaderboards with high aggregate performance. However, many times these models systematically fail on tail data or rare groups not obvious in aggregate evaluation. Identifying such problematic data groups is even more challenging when there are no explicit labels (e.g., ethnicity, gender, etc.) and further compounded for NLP datasets due to the lack of visual features to characterize failure modes (e.g., Asian males, animals indoors, waterbirds on land, etc.). This paper introduces an interactive Systematic Error Analysis and Labeling (\seal) tool that uses a two-step approach to first identify high error slices of data and then, in the second step, introduce methods to give human-understandable semantics to those underperforming slices. We explore a variety of methods for coming up with coherent semantics for the error groups using language models for semantic labeling and a text-to-image model for generating visual features. SEAL toolkit and demo screencast is available at https://huggingface.co/spaces/nazneen/seal. △ Less

Submitted 11 October, 2022; originally announced October 2022.

Comments: Accepted at EMNLP 2022 demo track

arXiv:2210.02667 [pdf, ps, other]

A Human Rights-Based Approach to Responsible AI

Authors: Vinodkumar Prabhakaran, Margaret Mitchell, Timnit Gebru, Iason Gabriel

Abstract: Research on fairness, accountability, transparency and ethics of AI-based interventions in society has gained much-needed momentum in recent years. However it lacks an explicit alignment with a set of normative values and principles that guide this research and interventions. Rather, an implicit consensus is often assumed to hold for the values we impart into our models - something that is at odds… ▽ More Research on fairness, accountability, transparency and ethics of AI-based interventions in society has gained much-needed momentum in recent years. However it lacks an explicit alignment with a set of normative values and principles that guide this research and interventions. Rather, an implicit consensus is often assumed to hold for the values we impart into our models - something that is at odds with the pluralistic world we live in. In this paper, we put forth the doctrine of universal human rights as a set of globally salient and cross-culturally recognized set of values that can serve as a grounding framework for explicit value alignment in responsible AI - and discuss its efficacy as a framework for civil society partnership and participation. We argue that a human rights framework orients the research in this space away from the machines and the risks of their biases, and towards humans and the risks to their rights, essentially hel** to center the conversation around who is harmed, what harms they face, and how those harms may be mitigated. △ Less

Submitted 6 October, 2022; originally announced October 2022.

Comments: Presented as a (non-archival) poster at the 2022 ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization or (EAAMO '22)

arXiv:2210.01970 [pdf, other]

Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements

Authors: Leandro von Werra, Lewis Tunstall, Abhishek Thakur, Alexandra Sasha Luccioni, Tristan Thrush, Aleksandra Piktus, Felix Marty, Nazneen Rajani, Victor Mustar, Helen Ngo, Omar Sanseviero, Mario Šaško, Albert Villanova, Quentin Lhoest, Julien Chaumond, Margaret Mitchell, Alexander M. Rush, Thomas Wolf, Douwe Kiela

Abstract: Evaluation is a key part of machine learning (ML), yet there is a lack of support and tooling to enable its informed and systematic practice. We introduce Evaluate and Evaluation on the Hub --a set of tools to facilitate the evaluation of models and datasets in ML. Evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models. Its goal is to support… ▽ More Evaluation is a key part of machine learning (ML), yet there is a lack of support and tooling to enable its informed and systematic practice. We introduce Evaluate and Evaluation on the Hub --a set of tools to facilitate the evaluation of models and datasets in ML. Evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models. Its goal is to support reproducibility of evaluation, centralize and document the evaluation process, and broaden evaluation to cover more facets of model performance. It includes over 50 efficient canonical implementations for a variety of domains and scenarios, interactive documentation, and the ability to easily share implementations and outcomes. The library is available at https://github.com/huggingface/evaluate. In addition, we introduce Evaluation on the Hub, a platform that enables the large-scale evaluation of over 75,000 models and 11,000 datasets on the Hugging Face Hub, for free, at the click of a button. Evaluation on the Hub is available at https://huggingface.co/autoevaluate. △ Less

Submitted 6 October, 2022; v1 submitted 30 September, 2022; originally announced October 2022.

arXiv:2207.08939 [pdf, other]

Learning Sparsity-Promoting Regularizers using Bilevel Optimization

Authors: Avrajit Ghosh, Michael T. McCann, Madeline Mitchell, Saiprasad Ravishankar

Abstract: We present a method for supervised learning of sparsity-promoting regularizers for denoising signals and images. Sparsity-promoting regularization is a key ingredient in solving modern signal reconstruction problems; however, the operators underlying these regularizers are usually either designed by hand or learned from data in an unsupervised way. The recent success of supervised learning (mainly… ▽ More We present a method for supervised learning of sparsity-promoting regularizers for denoising signals and images. Sparsity-promoting regularization is a key ingredient in solving modern signal reconstruction problems; however, the operators underlying these regularizers are usually either designed by hand or learned from data in an unsupervised way. The recent success of supervised learning (mainly convolutional neural networks) in solving image reconstruction problems suggests that it could be a fruitful approach to designing regularizers. Towards this end, we propose to denoise signals using a variational formulation with a parametric, sparsity-promoting regularizer, where the parameters of the regularizer are learned to minimize the mean squared error of reconstructions on a training set of ground truth image and measurement pairs. Training involves solving a challenging bilievel optimization problem; we derive an expression for the gradient of the training loss using the closed-form solution of the denoising problem and provide an accompanying gradient descent algorithm to minimize it. Our experiments with structured 1D signals and natural images show that the proposed method can learn an operator that outperforms well-known regularizers (total variation, DCT-sparsity, and unsupervised dictionary learning) and collaborative filtering for denoising. While the approach we present is specific to denoising, we believe that it could be adapted to the larger class of inverse problems with linear measurement models, giving it applicability in a wide range of signal reconstruction settings. △ Less

Submitted 5 September, 2023; v1 submitted 18 July, 2022; originally announced July 2022.

Journal ref: SIAM Journal on Imaging Sciences (SIIMS-2023)

arXiv:2206.14187 [pdf, other]

Evaluating Understanding on Conceptual Abstraction Benchmarks

Authors: Victor Vikram Odouard, Melanie Mitchell

Abstract: A long-held objective in AI is to build systems that understand concepts in a humanlike way. Setting aside the difficulty of building such a system, even trying to evaluate one is a challenge, due to present-day AI's relative opacity and its proclivity for finding shortcut solutions. This is exacerbated by humans' tendency to anthropomorphize, assuming that a system that can recognize one instance… ▽ More A long-held objective in AI is to build systems that understand concepts in a humanlike way. Setting aside the difficulty of building such a system, even trying to evaluate one is a challenge, due to present-day AI's relative opacity and its proclivity for finding shortcut solutions. This is exacerbated by humans' tendency to anthropomorphize, assuming that a system that can recognize one instance of a concept must also understand other instances, as a human would. In this paper, we argue that understanding a concept requires the ability to use it in varied contexts. Accordingly, we propose systematic evaluations centered around concepts, by probing a system's ability to use a given concept in many different instantiations. We present case studies of such an evaluations on two domains -- RAVEN (inspired by Raven's Progressive Matrices) and the Abstraction and Reasoning Corpus (ARC) -- that have been used to develop and assess abstraction abilities in AI systems. Our concept-based approach to evaluation reveals information about AI systems that conventional test sets would have left hidden. △ Less

Submitted 28 June, 2022; originally announced June 2022.

Comments: EBeM'22: AI Evaluation Beyond Metrics, July 24, 2022, Vienna, Austria

arXiv:2206.03216 [pdf, other]

doi 10.1145/3531146.3534637

Data Governance in the Age of Large-Scale Data-Driven Language Technology

Authors: Yacine Jernite, Huu Nguyen, Stella Biderman, Anna Rogers, Maraim Masoud, Valentin Danchev, Samson Tan, Alexandra Sasha Luccioni, Nishant Subramani, Gérard Dupont, Jesse Dodge, Kyle Lo, Zeerak Talat, Isaac Johnson, Dragomir Radev, Somaieh Nikpoor, Jörg Frohberg, Aaron Gokaslan, Peter Henderson, Rishi Bommasani, Margaret Mitchell

Abstract: The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights. Our proposal is informed by prior work on distrib… ▽ More The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights. Our proposal is informed by prior work on distributed governance that accounts for human values and grounded by an international research collaboration that brings together researchers and practitioners from 60 countries. The framework we present is a multi-party international governance structure focused on language data, and incorporating technical and organizational tools needed to support its work. △ Less

Submitted 2 November, 2022; v1 submitted 3 May, 2022; originally announced June 2022.

Comments: 32 pages: Full paper and Appendices; Association for Computing Machinery, New York, NY, USA, 2206-2222

Journal ref: Proceedings of 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT '22)

arXiv:2202.05839 [pdf, other]

Abstraction for Deep Reinforcement Learning

Authors: Murray Shanahan, Melanie Mitchell

Abstract: We characterise the problem of abstraction in the context of deep reinforcement learning. Various well established approaches to analogical reasoning and associative memory might be brought to bear on this issue, but they present difficulties because of the need for end-to-end differentiability. We review developments in AI and machine learning that could facilitate their adoption. We characterise the problem of abstraction in the context of deep reinforcement learning. Various well established approaches to analogical reasoning and associative memory might be brought to bear on this issue, but they present difficulties because of the need for end-to-end differentiability. We review developments in AI and machine learning that could facilitate their adoption. △ Less

Submitted 29 April, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

Comments: To appear in Proceedings IJCAI 2022

arXiv:2202.03980 [pdf, other]

Transferable Student Performance Modeling for Intelligent Tutoring Systems

Authors: Robin Schmucker, Tom M. Mitchell

Abstract: Millions of learners worldwide are now using intelligent tutoring systems (ITSs). At their core, ITSs rely on machine learning algorithms to track each user's changing performance level over time to provide personalized instruction. Crucially, student performance models are trained using interaction sequence data of previous learners to analyse data generated by future learners. This induces a col… ▽ More Millions of learners worldwide are now using intelligent tutoring systems (ITSs). At their core, ITSs rely on machine learning algorithms to track each user's changing performance level over time to provide personalized instruction. Crucially, student performance models are trained using interaction sequence data of previous learners to analyse data generated by future learners. This induces a cold-start problem when a new course is introduced for which no training data is available. Here, we consider transfer learning techniques as a way to provide accurate performance predictions for new courses by leveraging log data from existing courses. We study two settings: (i) In the naive transfer setting, we propose course-agnostic performance models that can be applied to any course. (ii) In the inductive transfer setting, we tune pre-trained course-agnostic performance models to new courses using small-scale target course data (e.g., collected during a pilot study). We evaluate the proposed techniques using student interaction sequence data from 5 different mathematics courses containing data from over 47,000 students in a real world large-scale ITS. The course-agnostic models that use additional features provided by human domain experts (e.g, difficulty ratings for questions in the new course) but no student interaction training data for the new course, achieve prediction accuracy on par with standard BKT and PFA models that use training data from thousands of students in the new course. In the inductive setting our transfer learning approach yields more accurate predictions than conventional performance models when only limited student interaction training data (<100 students) is available to both. △ Less

Submitted 8 February, 2022; originally announced February 2022.

arXiv:2112.06864 [pdf, ps, other]

Frontiers in Collective Intelligence: A Workshop Report

Authors: Tyler Millhouse, Melanie Moses, Melanie Mitchell

Abstract: In August of 2021, the Santa Fe Institute hosted a workshop on collective intelligence as part of its Foundations of Intelligence project. This project seeks to advance the field of artificial intelligence by promoting interdisciplinary research on the nature of intelligence. The workshop brought together computer scientists, biologists, philosophers, social scientists, and others to share their i… ▽ More In August of 2021, the Santa Fe Institute hosted a workshop on collective intelligence as part of its Foundations of Intelligence project. This project seeks to advance the field of artificial intelligence by promoting interdisciplinary research on the nature of intelligence. The workshop brought together computer scientists, biologists, philosophers, social scientists, and others to share their insights about how intelligence can emerge from interactions among multiple agents--whether those agents be machines, animals, or human beings. In this report, we summarize each of the talks and the subsequent discussions. We also draw out a number of key themes and identify important frontiers for future research. △ Less

Submitted 10 October, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

Comments: acknowledgments added

arXiv:2110.10320 [pdf, ps, other]

Frontiers in Evolutionary Computation: A Workshop Report

Authors: Tyler Millhouse, Melanie Moses, Melanie Mitchell

Abstract: In July of 2021, the Santa Fe Institute hosted a workshop on evolutionary computation as part of its Foundations of Intelligence in Natural and Artificial Systems project. This project seeks to advance the field of artificial intelligence by promoting interdisciplinary research on the nature of intelligence. The workshop brought together computer scientists and biologists to share their insights a… ▽ More In July of 2021, the Santa Fe Institute hosted a workshop on evolutionary computation as part of its Foundations of Intelligence in Natural and Artificial Systems project. This project seeks to advance the field of artificial intelligence by promoting interdisciplinary research on the nature of intelligence. The workshop brought together computer scientists and biologists to share their insights about the nature of evolution and the future of evolutionary computation. In this report, we summarize each of the talks and the subsequent discussions. We also draw out a number of key themes and identify important frontiers for future research. △ Less

Submitted 19 October, 2021; originally announced October 2021.

arXiv:2109.07703 [pdf, other]

ROS-X-Habitat: Bridging the ROS Ecosystem with Embodied AI

Authors: Guanxiong Chen, Haoyu Yang, Ian M. Mitchell

Abstract: We introduce ROS-X-Habitat, a software interface that bridges the AI Habitat platform for embodied learning-based agents with other robotics resources via ROS. This interface not only offers standardized communication protocols between embodied agents and simulators, but also enables physically and photorealistic simulation that benefits the training and/or testing of vision-based embodied agents.… ▽ More We introduce ROS-X-Habitat, a software interface that bridges the AI Habitat platform for embodied learning-based agents with other robotics resources via ROS. This interface not only offers standardized communication protocols between embodied agents and simulators, but also enables physically and photorealistic simulation that benefits the training and/or testing of vision-based embodied agents. With this interface, roboticists can evaluate their own Habitat RL agents in another ROS-based simulator or use Habitat Sim v2 as the test bed for their own robotic algorithms. Through in silico experiments, we demonstrate that ROS-X-Habitat has minimal impact on the navigation performance and simulation speed of a Habitat RGBD agent; that a standard set of ROS map**, planning and navigation tools can run in Habitat Sim v2; and that a Habitat agent can run in the standard ROS simulator Gazebo. △ Less

Submitted 29 April, 2022; v1 submitted 15 September, 2021; originally announced September 2021.

Comments: Camera-ready version submitted to Canadian Conference on Computer and Robot Vision (CRV) 2022

arXiv:2109.01955 [pdf, other]

Iterative Threshold Decoding of Spatially Coupled, Parallel-Concatenated Codes

Authors: Andrew D. Cummins, David G. M. Mitchell, Daniel J. Costello, Jr

Abstract: Spatially coupled, parallel concatenated codes (SC-PCCs) have been shown to approach channel capacity when decoded using optimal iterative methods. However, under complexity constraints such decoding strategies can result in unacceptable power and latency costs. In this work, we employ convolutional self-orthogonal component codes along with low-complexity, suboptimal a posteriori probability (APP… ▽ More Spatially coupled, parallel concatenated codes (SC-PCCs) have been shown to approach channel capacity when decoded using optimal iterative methods. However, under complexity constraints such decoding strategies can result in unacceptable power and latency costs. In this work, we employ convolutional self-orthogonal component codes along with low-complexity, suboptimal a posteriori probability (APP) threshold decoders with SC-PCCs to reduce decoding complexity. The proposed code design is faster, more energy efficient, and easier to implement than optimal methods, while offering significant coding gain over existing threshold decodable, turbo-like constructions of similar latency and complexity. The design also serves to further illustrate the advantages spatial coupling can provide to existing code constructions and decoder implementations. △ Less

Submitted 4 September, 2021; originally announced September 2021.

Comments: 5 pages, 6 figures, for inclusion in the 2021 International Symposium on Topics in Coding

arXiv:2109.01753 [pdf, other]

Assessing the Performance of Online Students -- New Data, New Approaches, Improved Accuracy

Authors: Robin Schmucker, **gbo Wang, Shijia Hu, Tom M. Mitchell

Abstract: We consider the problem of assessing the changing performance levels of individual students as they go through online courses. This student performance (SP) modeling problem is a critical step for building adaptive online teaching systems. Specifically, we conduct a study of how to utilize various types and large amounts of student log data to train accurate machine learning (ML) models that predi… ▽ More We consider the problem of assessing the changing performance levels of individual students as they go through online courses. This student performance (SP) modeling problem is a critical step for building adaptive online teaching systems. Specifically, we conduct a study of how to utilize various types and large amounts of student log data to train accurate machine learning (ML) models that predict the performance of future students. This study is the first to use four very large sets of student data made available recently from four distinct intelligent tutoring systems. Our results include a new ML approach that defines a new state of the art for logistic regression based SP modeling, improving over earlier methods in several ways: First, we achieve improved accuracy by introducing new features that can be easily computed from conventional question-response logs (e.g., the pattern in the student 's most recent answers). Second, we take advantage of features of the student history that go beyond question-response pairs (e.g., features such as which video segments the student watched, or skipped) as well as information about prerequisite structure in the curriculum. Third, we train multiple specialized SP models for different aspects of the curriculum (e.g., specializing in early versus later segments of the student history), then combine these specialized models to create a group prediction of the SP. Taken together, these innovations yield an average AUC score across these four datasets of 0.808 compared to the previous best logistic regression approach score of 0.767, and also outperforming state-of-the-art deep neural net approaches. Importantly, we observe consistent improvements from each of our three methodological innovations, in each dataset, suggesting that our methods are of general utility and likely to produce improvements for other online tutoring systems as well. △ Less

Submitted 8 February, 2022; v1 submitted 3 September, 2021; originally announced September 2021.

arXiv:2108.02753 [pdf, other]

doi 10.1109/LCSYS.2021.3089641

Safe Motion Planning against Multimodal Distributions based on a Scenario Approach

Authors: Hee** Ahn, Colin Chen, Ian M. Mitchell, Maryam Kamgarpour

Abstract: We present the design of a motion planning algorithm that ensures safety for an autonomous vehicle. In particular, we consider a multimodal distribution over uncertainties; for example, the uncertain predictions of future trajectories of surrounding vehicles reflect discrete decisions, such as turning or going straight at intersections. We develop a computationally efficient, scenario-based approa… ▽ More We present the design of a motion planning algorithm that ensures safety for an autonomous vehicle. In particular, we consider a multimodal distribution over uncertainties; for example, the uncertain predictions of future trajectories of surrounding vehicles reflect discrete decisions, such as turning or going straight at intersections. We develop a computationally efficient, scenario-based approach that solves the motion planning problem with high confidence given a quantifiable number of samples from the multimodal distribution. Our approach is based on two preprocessing steps, which 1) separate the samples into distinct clusters and 2) compute a bounding polytope for each cluster. Then, we rewrite the motion planning problem approximately as a mixed-integer problem using the polytopes. We demonstrate via simulation on the nuScenes dataset that our approach ensures safety with high probability in the presence of multimodal uncertainties, and is computationally more efficient and less conservative than a conventional scenario approach. △ Less

Submitted 5 August, 2021; originally announced August 2021.

Comments: Published in IEEE Control Systems Letters

Journal ref: in IEEE Control Systems Letters, vol. 6, pp. 1142-1147, 2022

arXiv:2108.01637 [pdf, ps, other]

A Unifying Framework to Construct QC-LDPC Tanner Graphs of Desired Girth

Authors: Roxana Smarandache, David G. M. Mitchell

Abstract: This paper presents a unifying framework to construct low-density parity-check (LDPC) codes with associated Tanner graphs of desired girth. Towards this goal, we highlight the role that a certain square matrix that appears in the product of the parity-check matrix with its transpose has in the construction of codes with graphs of desired girth and further explore it in order to generate the set of… ▽ More This paper presents a unifying framework to construct low-density parity-check (LDPC) codes with associated Tanner graphs of desired girth. Towards this goal, we highlight the role that a certain square matrix that appears in the product of the parity-check matrix with its transpose has in the construction of codes with graphs of desired girth and further explore it in order to generate the set of necessary and sufficient conditions for a Tanner graph to have a given girth between 6 and 12. For each such girth, we present algorithms to construct codes of the desired girth and we show how to use them to compute the minimum necessary value of the lifting factor. For girth larger than 12, we show how to use multi-step graph lifting methods to deterministically modify codes in order to increase their girth. We also give a new perspective on LDPC protograph-based parity-check matrices by viewing them as rows of a parity-check matrix equal to the sum of certain permutation matrices and obtain an important connection between all protographs and those with variable nodes of degree 2. We also show that the results and methodology that we develop for the all-one protograph can be used and adapted to analyze the girth of the Tanner graph of any parity-check matrix and demonstrate how this can be done using a well-known irregular, multi-edge protograph specified by the NASA Consultative Committee for Space Data Systems (CCSDS). Throughout the paper, we exemplify our theoretical results with constructions of LDPC codes with Tanner graphs of any girth between 6 and 14 and give sufficient conditions for a multi-step lifted parity-check matrix to have girth between 14 and 22. △ Less

Submitted 3 August, 2021; originally announced August 2021.

Comments: Submitted to the IEEE Transactions on Information Theory

arXiv:2106.04072 [pdf, ps, other]

Coarse-to-Fine Curriculum Learning

Authors: Otilia Stretcu, Emmanouil Antonios Platanios, Tom M. Mitchell, Barnabás Póczos

Abstract: When faced with learning challenging new tasks, humans often follow sequences of steps that allow them to incrementally build up the necessary skills for performing these new tasks. However, in machine learning, models are most often trained to solve the target tasks directly.Inspired by human learning, we propose a novel curriculum learning approach which decomposes challenging tasks into sequenc… ▽ More When faced with learning challenging new tasks, humans often follow sequences of steps that allow them to incrementally build up the necessary skills for performing these new tasks. However, in machine learning, models are most often trained to solve the target tasks directly.Inspired by human learning, we propose a novel curriculum learning approach which decomposes challenging tasks into sequences of easier intermediate goals that are used to pre-train a model before tackling the target task. We focus on classification tasks, and design the intermediate tasks using an automatically constructed label hierarchy. We train the model at each level of the hierarchy, from coarse labels to fine labels, transferring acquired knowledge across these levels. For instance, the model will first learn to distinguish animals from objects, and then use this acquired knowledge when learning to classify among more fine-grained classes such as cat, dog, car, and truck. Most existing curriculum learning algorithms for supervised learning consist of scheduling the order in which the training examples are presented to the model. In contrast, our approach focuses on the output space of the model. We evaluate our method on several established datasets and show significant performance gains especially on classification problems with many labels. We also evaluate on a new synthetic dataset which allows us to study multiple aspects of our method. △ Less

Submitted 7 June, 2021; originally announced June 2021.

arXiv:2106.02105 [pdf, other]

A Little Robustness Goes a Long Way: Leveraging Robust Features for Targeted Transfer Attacks

Authors: Jacob M. Springer, Melanie Mitchell, Garrett T. Kenyon

Abstract: Adversarial examples for neural network image classifiers are known to be transferable: examples optimized to be misclassified by a source classifier are often misclassified as well by classifiers with different architectures. However, targeted adversarial examples -- optimized to be classified as a chosen target class -- tend to be less transferable between architectures. While prior research on… ▽ More Adversarial examples for neural network image classifiers are known to be transferable: examples optimized to be misclassified by a source classifier are often misclassified as well by classifiers with different architectures. However, targeted adversarial examples -- optimized to be classified as a chosen target class -- tend to be less transferable between architectures. While prior research on constructing transferable targeted attacks has focused on improving the optimization procedure, in this work we examine the role of the source classifier. Here, we show that training the source classifier to be "slightly robust" -- that is, robust to small-magnitude adversarial examples -- substantially improves the transferability of class-targeted and representation-targeted adversarial attacks, even between architectures as different as convolutional neural networks and transformers. The results we present provide insight into the nature of adversarial examples as well as the mechanisms underlying so-called "robust" classifiers. △ Less

Submitted 25 October, 2021; v1 submitted 3 June, 2021; originally announced June 2021.

Comments: NeurIPS '21

arXiv:2106.00861 [pdf, ps, other]

Necessary and Sufficient Girth Conditions for LDPC Tanner Graphs with Denser Protographs

Authors: Anthony Gómez-Fonseca, Roxana Smarandache, David G. M. Mitchell

Abstract: This paper gives necessary and sufficient conditions for the Tanner graph of a quasi-cyclic (QC) low-density parity-check (LDPC) code based on the all-one protograph to have girth 6, 8, 10, and 12, respectively, in the case of parity-check matrices with column weight 4. These results are a natural extension of the girth results of the already-studied cases of column weight 2 and 3, and it is based… ▽ More This paper gives necessary and sufficient conditions for the Tanner graph of a quasi-cyclic (QC) low-density parity-check (LDPC) code based on the all-one protograph to have girth 6, 8, 10, and 12, respectively, in the case of parity-check matrices with column weight 4. These results are a natural extension of the girth results of the already-studied cases of column weight 2 and 3, and it is based on the connection between the girth of a Tanner graph given by a parity-check matrix and the properties of powers of the product between the matrix and its transpose. The girth conditions can be easily incorporated into fast algorithms that construct codes of desired girth between 6 and 12; our own algorithms are presented for each girth, together with constructions obtained from them and corresponding computer simulations. More importantly, this paper emphasizes how the girth conditions of the Tanner graph corresponding to a parity-check matrix composed of circulants relate to the matrix obtained by adding (over the integers) the circulant columns of the parity-check matrix. In particular, we show that imposing girth conditions on a parity-check matrix is equivalent to imposing conditions on a square circulant submatrix of size 4 obtained from it. △ Less

Submitted 1 June, 2021; originally announced June 2021.

Comments: Submitted to the International Symposium on Topics in Coding 2021. arXiv admin note: text overlap with arXiv:2105.03462

arXiv:2105.03462 [pdf, ps, other]

Necessary and Sufficient Girth Conditions for Tanner Graphs of Quasi-Cyclic LDPC Codes

Authors: Roxana Smarandache, David G. M. Mitchell

Abstract: This paper revisits the connection between the girth of a protograph-based LDPC code given by a parity-check matrix and the properties of powers of the product between the matrix and its transpose in order to obtain the necessary and sufficient conditions for a code to have given girth between 6 and 12, and to show how these conditions can be incorporated into simple algorithms to construct codes… ▽ More This paper revisits the connection between the girth of a protograph-based LDPC code given by a parity-check matrix and the properties of powers of the product between the matrix and its transpose in order to obtain the necessary and sufficient conditions for a code to have given girth between 6 and 12, and to show how these conditions can be incorporated into simple algorithms to construct codes of that girth. To this end, we highlight the role that certain submatrices that appear in these products have in the construction of codes of desired girth. In particular, we show that imposing girth conditions on a parity-check matrix is equivalent to imposing conditions on a square submatrix obtained from it and we show how this equivalence is particularly strong for a protograph based parity-check matrix of variable node degree 2, where the cycles in its Tanner graph correspond one-to-one to the cycles in the Tanner graph of a square submatrix obtained by adding the permutation matrices (or products of these) in the composition of the parity-check matrix. We end the paper with exemplary constructions of codes with various girths and computer simulations. Although, we mostly assume the case of fully connected protographs of variable node degree 2 and 3, the results can be used for any parity-check matrix/protograph-based Tanner graph. △ Less

Submitted 7 May, 2021; originally announced May 2021.

Comments: Submitted to the 2021 IEEE International Symposium on Information Theory

arXiv:2105.02486 [pdf, other]

Towards General Natural Language Understanding with Probabilistic Worldbuilding

Authors: Abulhair Saparov, Tom M. Mitchell

Abstract: We introduce the Probabilistic Worldbuilding Model (PWM), a new fully-symbolic Bayesian model of semantic parsing and reasoning, as a first step in a research program toward more domain- and task-general NLU and AI. Humans create internal mental models of their observations which greatly aid in their ability to understand and reason about a large variety of problems. In PWM, the meanings of senten… ▽ More We introduce the Probabilistic Worldbuilding Model (PWM), a new fully-symbolic Bayesian model of semantic parsing and reasoning, as a first step in a research program toward more domain- and task-general NLU and AI. Humans create internal mental models of their observations which greatly aid in their ability to understand and reason about a large variety of problems. In PWM, the meanings of sentences, acquired facts about the world, and intermediate steps in reasoning are all expressed in a human-readable formal language, with the design goal of interpretability. PWM is Bayesian, designed specifically to be able to generalize to new domains and new tasks. We derive and implement an inference algorithm that reads sentences by parsing and abducing updates to its latent world model that capture the semantics of those sentences, and evaluate it on two out-of-domain question-answering datasets: (1) ProofWriter and (2) a new dataset we call FictionalGeoQA, designed to be more representative of real language but still simple enough to focus on evaluating reasoning ability, while being robust against heuristics. Our method outperforms baselines on both, thereby demonstrating its value as a proof-of-concept. △ Less

Submitted 20 December, 2021; v1 submitted 6 May, 2021; originally announced May 2021.

Comments: Accepted to TACL; pre-MIT Press publication version

arXiv:2105.02198 [pdf, ps, other]

Foundations of Intelligence in Natural and Artificial Systems: A Workshop Report

Authors: Tyler Millhouse, Melanie Moses, Melanie Mitchell

Abstract: In March of 2021, the Santa Fe Institute hosted a workshop as part of its Foundations of Intelligence in Natural and Artificial Systems project. This project seeks to advance the field of artificial intelligence by promoting interdisciplinary research on the nature of intelligence. During the workshop, speakers from diverse disciplines gathered to develop a taxonomy of intelligence, articulating t… ▽ More In March of 2021, the Santa Fe Institute hosted a workshop as part of its Foundations of Intelligence in Natural and Artificial Systems project. This project seeks to advance the field of artificial intelligence by promoting interdisciplinary research on the nature of intelligence. During the workshop, speakers from diverse disciplines gathered to develop a taxonomy of intelligence, articulating their own understanding of intelligence and how their research has furthered that understanding. In this report, we summarize the insights offered by each speaker and identify the themes that emerged during the talks and subsequent discussions. △ Less

Submitted 5 May, 2021; originally announced May 2021.

Comments: 30 pages, 0 figures, workshop report

arXiv:2104.12871 [pdf, ps, other]

Why AI is Harder Than We Think

Authors: Melanie Mitchell

Abstract: Since its beginning in the 1950s, the field of artificial intelligence has cycled several times between periods of optimistic predictions and massive investment ("AI spring") and periods of disappointment, loss of confidence, and reduced funding ("AI winter"). Even with today's seemingly fast pace of AI breakthroughs, the development of long-promised technologies such as self-driving cars, houseke… ▽ More Since its beginning in the 1950s, the field of artificial intelligence has cycled several times between periods of optimistic predictions and massive investment ("AI spring") and periods of disappointment, loss of confidence, and reduced funding ("AI winter"). Even with today's seemingly fast pace of AI breakthroughs, the development of long-promised technologies such as self-driving cars, housekee** robots, and conversational companions has turned out to be much harder than many people expected. One reason for these repeating cycles is our limited understanding of the nature and complexity of intelligence itself. In this paper I describe four fallacies in common assumptions made by AI researchers, which can lead to overconfident predictions about the field. I conclude by discussing the open questions spurred by these fallacies, including the age-old challenge of imbuing machines with humanlike common sense. △ Less

Submitted 28 April, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

Comments: 12 pages; typos corrected in newest version

arXiv:2104.08758 [pdf, other]

Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Authors: Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner

Abstract: Large language models have led to remarkable progress on many NLP tasks, and researchers are turning to ever-larger text corpora to train them. Some of the largest corpora available are made by scra** significant portions of the internet, and are frequently introduced with only minimal documentation. In this work we provide some of the first documentation for the Colossal Clean Crawled Corpus (C… ▽ More Large language models have led to remarkable progress on many NLP tasks, and researchers are turning to ever-larger text corpora to train them. Some of the largest corpora available are made by scra** significant portions of the internet, and are frequently introduced with only minimal documentation. In this work we provide some of the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl. We begin by investigating where the data came from, and find a significant amount of text from unexpected sources like patents and US military websites. Then we explore the content of the text itself, and find machine-generated text (e.g., from machine translation systems) and evaluation examples from other benchmark NLP datasets. To understand the impact of the filters applied to create this dataset, we evaluate the text that was removed, and show that blocklist filtering disproportionately removes text from and about minority individuals. Finally, we conclude with some recommendations for how to created and document web-scale datasets from a scrape of the internet. △ Less

Submitted 30 September, 2021; v1 submitted 18 April, 2021; originally announced April 2021.

Comments: EMNLP 2021 accepted paper camera ready version

arXiv:2103.03417 [pdf, other]

doi 10.1145/3461702.3462557

Measuring Model Biases in the Absence of Ground Truth

Authors: Osman Aka, Ken Burke, Alex Bäuerle, Christina Greer, Margaret Mitchell

Abstract: The measurement of bias in machine learning often focuses on model performance across identity subgroups (such as man and woman) with respect to groundtruth labels. However, these methods do not directly measure the associations that a model may have learned, for example between labels and identity subgroups. Further, measuring a model's bias requires a fully annotated evaluation dataset which may… ▽ More The measurement of bias in machine learning often focuses on model performance across identity subgroups (such as man and woman) with respect to groundtruth labels. However, these methods do not directly measure the associations that a model may have learned, for example between labels and identity subgroups. Further, measuring a model's bias requires a fully annotated evaluation dataset which may not be easily available in practice. We present an elegant mathematical solution that tackles both issues simultaneously, using image classification as a working example. By treating a classification model's predictions for a given image as a set of labels analogous to a bag of words, we rank the biases that a model has learned with respect to different identity labels. We use (man, woman) as a concrete example of an identity label set (although this set need not be binary), and present rankings for the labels that are most biased towards one identity or the other. We demonstrate how the statistical properties of different association metrics can lead to different rankings of the most "gender biased" labels, and conclude that normalized pointwise mutual information (nPMI) is most useful in practice. Finally, we announce an open-sourced nPMI visualization tool using TensorBoard. △ Less

Submitted 6 June, 2021; v1 submitted 4 March, 2021; originally announced March 2021.

arXiv:2102.10717 [pdf, other]

doi 10.1111/nyas.14619

Abstraction and Analogy-Making in Artificial Intelligence

Authors: Melanie Mitchell

Abstract: Conceptual abstraction and analogy-making are key abilities underlying humans' abilities to learn, reason, and robustly adapt their knowledge to new domains. Despite of a long history of research on constructing AI systems with these abilities, no current AI system is anywhere close to a capability of forming humanlike abstractions or analogies. This paper reviews the advantages and limitations of… ▽ More Conceptual abstraction and analogy-making are key abilities underlying humans' abilities to learn, reason, and robustly adapt their knowledge to new domains. Despite of a long history of research on constructing AI systems with these abilities, no current AI system is anywhere close to a capability of forming humanlike abstractions or analogies. This paper reviews the advantages and limitations of several approaches toward this goal, including symbolic methods, deep learning, and probabilistic program induction. The paper concludes with several proposals for designing challenge tasks and evaluation measures in order to make quantifiable and generalizable progress in this area. △ Less

Submitted 14 May, 2021; v1 submitted 21 February, 2021; originally announced February 2021.

Comments: Revised version. 30 pages, 9 figures. To appear in Annals of the New York Academy of Sciences

Showing 1–50 of 124 results for author: Mitchell, M