-
Evaluating the World Model Implicit in a Generative Model
Authors:
Keyon Vafa,
Justin Y. Chen,
Jon Kleinberg,
Sendhil Mullainathan,
Ashesh Rambachan
Abstract:
Recent work suggests that large language models may implicitly learn world models. How should we assess this possibility? We formalize this question for the case where the underlying reality is governed by a deterministic finite automaton. This includes problems as diverse as simple logical reasoning, geographic navigation, game-playing, and chemistry. We propose new evaluation metrics for world m…
▽ More
Recent work suggests that large language models may implicitly learn world models. How should we assess this possibility? We formalize this question for the case where the underlying reality is governed by a deterministic finite automaton. This includes problems as diverse as simple logical reasoning, geographic navigation, game-playing, and chemistry. We propose new evaluation metrics for world model recovery inspired by the classic Myhill-Nerode theorem from language theory. We illustrate their utility in three domains: game playing, logic puzzles, and navigation. In all domains, the generative models we consider do well on existing diagnostics for assessing world models, but our evaluation metrics reveal their world models to be far less coherent than they appear. Such incoherence creates fragility: using a generative model to solve related but subtly different tasks can lead it to fail badly. Building generative models that meaningfully capture the underlying logic of the domains they model would be immensely valuable; our results suggest new ways to assess how close a given model is to that goal.
△ Less
Submitted 22 June, 2024; v1 submitted 5 June, 2024;
originally announced June 2024.
-
Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function
Authors:
Keyon Vafa,
Ashesh Rambachan,
Sendhil Mullainathan
Abstract:
What makes large language models (LLMs) impressive is also what makes them hard to evaluate: their diversity of uses. To evaluate these models, we must understand the purposes they will be used for. We consider a setting where these deployment decisions are made by people, and in particular, people's beliefs about where an LLM will perform well. We model such beliefs as the consequence of a human…
▽ More
What makes large language models (LLMs) impressive is also what makes them hard to evaluate: their diversity of uses. To evaluate these models, we must understand the purposes they will be used for. We consider a setting where these deployment decisions are made by people, and in particular, people's beliefs about where an LLM will perform well. We model such beliefs as the consequence of a human generalization function: having seen what an LLM gets right or wrong, people generalize to where else it might succeed. We collect a dataset of 19K examples of how humans make generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks. We show that the human generalization function can be predicted using NLP methods: people have consistent structured ways to generalize. We then evaluate LLM alignment with the human generalization function. Our results show that -- especially for cases where the cost of mistakes is high -- more capable models (e.g. GPT-4) can do worse on the instances people choose to use them for, exactly because they are not aligned with the human generalization function.
△ Less
Submitted 3 June, 2024;
originally announced June 2024.
-
Robust Design and Evaluation of Predictive Algorithms under Unobserved Confounding
Authors:
Ashesh Rambachan,
Amanda Coston,
Edward Kennedy
Abstract:
Predictive algorithms inform consequential decisions in settings where the outcome is selectively observed given choices made by human decision makers. We propose a unified framework for the robust design and evaluation of predictive algorithms in selectively observed data. We impose general assumptions on how much the outcome may vary on average between unselected and selected units conditional o…
▽ More
Predictive algorithms inform consequential decisions in settings where the outcome is selectively observed given choices made by human decision makers. We propose a unified framework for the robust design and evaluation of predictive algorithms in selectively observed data. We impose general assumptions on how much the outcome may vary on average between unselected and selected units conditional on observed covariates and identified nuisance parameters, formalizing popular empirical strategies for imputing missing data such as proxy outcomes and instrumental variables. We develop debiased machine learning estimators for the bounds on a large class of predictive performance estimands, such as the conditional likelihood of the outcome, a predictive algorithm's mean square error, true/false positive rate, and many others, under these assumptions. In an administrative dataset from a large Australian financial institution, we illustrate how varying assumptions on unobserved confounding leads to meaningful changes in default risk predictions and evaluations of credit scores across sensitive groups.
△ Less
Submitted 19 May, 2024; v1 submitted 19 December, 2022;
originally announced December 2022.
-
Characterizing Fairness Over the Set of Good Models Under Selective Labels
Authors:
Amanda Coston,
Ashesh Rambachan,
Alexandra Chouldechova
Abstract:
Algorithmic risk assessments are used to inform decisions in a wide variety of high-stakes settings. Often multiple predictive models deliver similar overall performance but differ markedly in their predictions for individual cases, an empirical phenomenon known as the "Rashomon Effect." These models may have different properties over various groups, and therefore have different predictive fairnes…
▽ More
Algorithmic risk assessments are used to inform decisions in a wide variety of high-stakes settings. Often multiple predictive models deliver similar overall performance but differ markedly in their predictions for individual cases, an empirical phenomenon known as the "Rashomon Effect." These models may have different properties over various groups, and therefore have different predictive fairness properties. We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance, or "the set of good models." Our framework addresses the empirically relevant challenge of selectively labelled data in the setting where the selection decision and outcome are unconfounded given the observed data features. Our framework can be used to 1) replace an existing model with one that has better fairness properties; or 2) audit for predictive bias. We illustrate these uses cases on a real-world credit-scoring task and a recidivism prediction task.
△ Less
Submitted 30 April, 2021; v1 submitted 1 January, 2021;
originally announced January 2021.
-
Bias In, Bias Out? Evaluating the Folk Wisdom
Authors:
Ashesh Rambachan,
Jonathan Roth
Abstract:
We evaluate the folk wisdom that algorithmic decision rules trained on data produced by biased human decision-makers necessarily reflect this bias. We consider a setting where training labels are only generated if a biased decision-maker takes a particular action, and so "biased" training data arise due to discriminatory selection into the training data. In our baseline model, the more biased the…
▽ More
We evaluate the folk wisdom that algorithmic decision rules trained on data produced by biased human decision-makers necessarily reflect this bias. We consider a setting where training labels are only generated if a biased decision-maker takes a particular action, and so "biased" training data arise due to discriminatory selection into the training data. In our baseline model, the more biased the decision-maker is against a group, the more the algorithmic decision rule favors that group. We refer to this phenomenon as "bias reversal." We then clarify the conditions that give rise to bias reversal. Whether a prediction algorithm reverses or inherits bias depends critically on how the decision-maker affects the training data as well as the label used in training. We illustrate our main theoretical results in a simulation study applied to the New York City Stop, Question and Frisk dataset.
△ Less
Submitted 19 December, 2020; v1 submitted 18 September, 2019;
originally announced September 2019.