
Understanding Understanding: A Pragmatic Framework Motivated by Large Language Models

Authors: Kevin Leyton-Brown, Yoav Shoham

Abstract: Motivated by the rapid ascent of Large Language Models (LLMs) and debates about the extent to which they possess human-level qualities, we propose a framework for testing whether any agent (be it a machine or a human) understands a subject matter. In Turing-test fashion, the framework is based solely on the agent's performance, and specifically on how well it answers questions. Elements of the framework include circumscribing the set of questions (the "scope of understanding"), requiring general competence ("passing grade"), avoiding "ridiculous answers", but still allowing wrong and "I don't know" answers to some questions. Reaching certainty about these conditions requires exhaustive testing of the questions which is impossible for nontrivial scopes, but we show how high confidence can be achieved via random sampling and the application of probabilistic confidence bounds. We also show that accompanying answers with explanations can improve the sample complexity required to achieve acceptable bounds, because an explanation of an answer implies the ability to answer many similar questions. According to our framework, current LLMs cannot be said to understand nontrivial domains, but as the framework provides a practical recipe for testing understanding, it thus also constitutes a tool for building AI agents that do understand.

Submitted 19 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.
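
The abstract above mentions random sampling combined with probabilistic confidence bounds but does not say which bound is used. As a minimal sketch, assuming Hoeffding's inequality and a simple correct/incorrect score per question (both are assumptions, and the function names are hypothetical), the number of sampled questions and the resulting interval could be computed as follows:

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest n such that 2 * exp(-2 * n * epsilon**2) <= delta.

    Assumption: each sampled question is scored 0 or 1 and questions
    are drawn independently from the scope.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


def hoeffding_interval(correct: int, n: int, delta: float) -> tuple[float, float]:
    """Two-sided interval for the true pass rate, valid with probability >= 1 - delta."""
    p_hat = correct / n
    half_width = math.sqrt(math.log(2 / delta) / (2 * n))
    # Clip to [0, 1] because a pass rate cannot leave that range.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


if __name__ == "__main__":
    print(hoeffding_sample_size(epsilon=0.05, delta=0.05))
    print(hoeffding_interval(correct=90, n=100, delta=0.05))
```

The required sample size depends only on the tolerance and the confidence level, not on the size of the question set, which is why sampling remains feasible when exhaustive testing is not.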

arXiv:2405.18246 [pdf, other]

Utilitarian Algorithm Configuration for Infinite Parameter Spaces

Authors: Devon Graham, Kevin Leyton-Brown

Abstract: Utilitarian algorithm configuration is a general-purpose technique for automatically searching the parameter space of a given algorithm to optimize its performance, as measured by a given utility function, on a given set of inputs. Recently introduced utilitarian configuration procedures offer optimality guarantees about the returned parameterization while provably adapting to the hardness of the underlying problem. However, the applicability of these approaches is severely limited by the fact that they only search a finite, relatively small set of parameters. They cannot effectively search the configuration space of algorithms with continuous or uncountable parameters. In this paper we introduce a new procedure, which we dub COUP (Continuous, Optimistic Utilitarian Procrastination). COUP is designed to search infinite parameter spaces efficiently to find good configurations quickly. Furthermore, COUP maintains the theoretical benefits of previous utilitarian configuration procedures when applied to finite parameter spaces but is significantly faster, both provably and experimentally.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.06563 [pdf, other]

What Can Natural Language Processing Do for Peer Review?

Authors: Ilia Kuznetsov, Osama Mohammed Afzal, Koen Dercksen, Nils Dycke, Alexander Goldberg, Tom Hope, Dirk Hovy, Jonathan K. Kummerfeld, Anne Lauscher, Kevin Leyton-Brown, Sheng Lu, Mausam, Margot Mieskes, Aurélie Névéol, Danish Pruthi, Lizhen Qu, Roy Schwartz, Noah A. Smith, Thamar Solorio, Jingyan Wang, Xiaodan Zhu, Anna Rogers, Nihar B. Shah, Iryna Gurevych

Abstract: The number of scientific articles produced every year is growing rapidly. Providing quality control over them is crucial for scientists and, ultimately, for the public good. In modern science, this process is largely delegated to peer review -- a distributed procedure in which each submission is evaluated by several independent experts in the field. Peer review is widely used, yet it is hard, time-consuming, and prone to error. Since the artifacts involved in peer review -- manuscripts, reviews, discussions -- are largely text-based, Natural Language Processing has great potential to improve reviewing. As the emergence of large language models (LLMs) has enabled NLP assistance for many new tasks, the discussion on machine-assisted peer review is picking up the pace. Yet, where exactly is help needed, where can NLP help, and where should it stand aside? The goal of our paper is to provide a foundation for the future efforts in NLP for peer-reviewing assistance. We discuss peer review as a general process, exemplified by reviewing at AI conferences. We detail each step of the process from manuscript submission to camera-ready revision, and discuss the associated challenges and opportunities for NLP assistance, illustrated by existing work. We then turn to the big challenges in NLP for peer review as a whole, including data acquisition and licensing, operationalization and experimentation, and ethical issues. To help consolidate community efforts, we create a companion repository that aggregates key datasets pertaining to peer review. Finally, we issue a detailed call for action for the scientific community, NLP and AI researchers, policymakers, and funding bodies to help bring the research in NLP for peer review forward. We hope that our work will help set the agenda for research in machine-assisted scientific quality control in the age of AI, within the NLP community and beyond.

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2402.19420 [pdf, other]

Understanding Iterative Combinatorial Auction Designs via Multi-Agent Reinforcement Learning

Authors: Greg d'Eon, Neil Newman, Kevin Leyton-Brown

Abstract: Iterative combinatorial auctions are widely used in high stakes settings such as spectrum auctions. Such auctions can be hard to understand analytically, making it difficult for bidders to determine how to behave and for designers to optimize auction rules to ensure desirable outcomes such as high revenue or welfare. In this paper, we investigate whether multi-agent reinforcement learning (MARL) algorithms can be used to understand iterative combinatorial auctions, given that these algorithms have recently shown empirical success in several other domains. We find that MARL can indeed benefit auction analysis, but that deploying it effectively is nontrivial. We begin by describing modelling decisions that keep the resulting game tractable without sacrificing important features such as imperfect information or asymmetry between bidders. We also discuss how to navigate pitfalls of various MARL algorithms, how to overcome challenges in verifying convergence, and how to generate and interpret multiple equilibria. We illustrate the promise of our resulting approach by using it to evaluate a specific rule change to a clock auction, finding substantially different auction outcomes due to complex changes in bidders' behavior.

Submitted 29 February, 2024; originally announced February 2024.

Comments: 18 pages (body) + 10 pages (acknowledgements, references, appendices)

arXiv:2402.09552 [pdf, other]

STEER: Assessing the Economic Rationality of Large Language Models

Authors: Narun Raman, Taylor Lundy, Samuel Amouyal, Yoav Levine, Kevin Leyton-Brown, Moshe Tennenholtz

Abstract: There is increasing interest in using LLMs as decision-making "agents." Doing so includes many degrees of freedom: which model should be used; how should it be prompted; should it be asked to introspect, conduct chain-of-thought reasoning, etc? Settling these questions -- and more broadly, determining whether an LLM agent is reliable enough to be trusted -- requires a methodology for assessing such an agent's economic rationality. In this paper, we provide one. We begin by surveying the economic literature on rational decision making, taxonomizing a large set of fine-grained "elements" that an agent should exhibit, along with dependencies between them. We then propose a benchmark distribution that quantitatively scores an LLM's performance on these elements and, combined with a user-provided rubric, produces a "STEER report card." Finally, we describe the results of a large-scale empirical experiment with 14 different LLMs, characterizing both the current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior.

Submitted 28 May, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

arXiv:2312.10205 [pdf, other]

Pay to (Not) Play: Monetizing Impatience in Mobile Games

Authors: Taylor Lundy, Narun Raman, Hu Fu, Kevin Leyton-Brown

Abstract: Mobile gaming is a rapidly growing and incredibly profitable sector; having grown seven-fold over the past 10 years, it now grosses over $100 billion annually. This growth was due in large part to a shift in monetization strategies: rather than charging players an upfront cost ("pay-to-play"), games often request optional microtransactions throughout gameplay ("free-to-play"). We focus on a common scenario in which games include wait times -- gating either items or game progression -- that players can pay to skip. Game designers typically say that they optimize for player happiness rather than revenue; however, prices for skips are typically set at levels that few players are willing to pay, leading to low purchase rates. Under a traditional analysis, it would seem that game designers fail at their stated goal if few players buy what they are selling. We argue that an alternate model can better explain this dynamic: players value tasks more highly as they are perceived to be more difficult. While skips can increase players' utilities by providing instant gratification, pricing skips too cheaply can lower players' utilities by decreasing the perceived amount of work needed to complete a task. We show that high revenue, high player utility, and low purchase rates can all coexist under this model, particularly under a realistic distribution of players having few buyers but a few big-spending "whales." We also investigate how a game designer should optimize prices under our model.

Submitted 15 December, 2023; originally announced December 2023.

Comments: 18 pages

arXiv:2310.20401 [pdf, other]

Utilitarian Algorithm Configuration

Authors: Devon R. Graham, Kevin Leyton-Brown, Tim Roughgarden

Abstract: We present the first nontrivial procedure for configuring heuristic algorithms to maximize the utility provided to their end users while also offering theoretical guarantees about performance. Existing procedures seek configurations that minimize expected runtime. However, very recent theoretical work argues that expected runtime minimization fails to capture algorithm designers' preferences. Here we show that the utilitarian objective also confers significant algorithmic benefits. Intuitively, this is because mean runtime is dominated by extremely long runs even when they are incredibly rare; indeed, even when an algorithm never gives rise to such long runs, configuration procedures that provably minimize mean runtime must perform a huge number of experiments to demonstrate this fact. In contrast, utility is bounded and monotonically decreasing in runtime, allowing for meaningful empirical bounds on a configuration's performance. This paper builds on this idea to describe effective and theoretically sound configuration procedures. We prove upper bounds on the runtime of these procedures that are similar to theoretical lower bounds, while also demonstrating their performance empirically.

Submitted 31 October, 2023; originally announced October 2023.

arXiv:2307.06908 [pdf, other]

Generating Benchmarks for Factuality Evaluation of Language Models

Authors: Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, Yoav Shoham

Abstract: Before deploying a language model (LM) within a given domain, it is important to measure its tendency to generate factually incorrect information in that domain. Existing methods for factuality evaluation of LLM generation focus on facts sampled from the LM itself, and thus do not control the set of evaluated facts and might under-represent domain-specific or rare facts. We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality. FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements. We use our framework to create three benchmarks: Wiki-FACTOR, News-FACTOR and Expert-FACTOR. We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation, as measured by human annotators. We make our data and code publicly available at https://github.com/AI21Labs/factor.

Submitted 4 February, 2024; v1 submitted 13 July, 2023; originally announced July 2023.

arXiv:2306.04778 [pdf, other]

How to Evaluate Behavioral Models

Authors: Greg d'Eon, Sophie Greenwood, Kevin Leyton-Brown, James R. Wright

Abstract: Researchers building behavioral models, such as behavioral game theorists, use experimental data to evaluate predictive models of human behavior. However, there is little agreement about which loss function should be used in evaluations, with error rate, negative log-likelihood, cross-entropy, Brier score, and squared L2 error all being common choices. We attempt to offer a principled answer to the question of which loss functions should be used for this task, formalizing axioms that we argue loss functions should satisfy. We construct a family of loss functions, which we dub "diagonal bounded Bregman divergences", that satisfy all of these axioms. These rule out many loss functions used in practice, but notably include squared L2 error; we thus recommend its use for evaluating behavioral models.

Submitted 22 February, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: 15 pages (7 pages body + references and appendix). To appear at AAAI 2024
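
For readers unfamiliar with the loss the abstract above recommends, here is a minimal sketch of squared L2 error between a predicted distribution over actions and an observed empirical distribution. The function name and the list-of-probabilities representation are assumptions made for illustration; the paper's own formalization may differ.

```python
from typing import Sequence


def squared_l2_error(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Sum of squared differences between two distributions over the same actions."""
    if len(predicted) != len(observed):
        raise ValueError("distributions must cover the same set of actions")
    return sum((p - q) ** 2 for p, q in zip(predicted, observed))


if __name__ == "__main__":
    # A uniform prediction scored against data in which every subject chose action 0.
    print(squared_l2_error([0.5, 0.5], [1.0, 0.0]))
    # The worst case for two probability distributions: all mass on different actions.
    print(squared_l2_error([1.0, 0.0], [0.0, 1.0]))
```

Unlike negative log-likelihood, this quantity stays finite even when the model assigns zero probability to an observed action; for two probability distributions it never exceeds 2.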

arXiv:2302.00083 [pdf, other]

In-Context Retrieval-Augmented Language Models

Authors: Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, Yoav Shoham

Abstract: Retrieval-Augmented Language Modeling (RALM) methods, which condition a language model (LM) on relevant documents from a grounding corpus during generation, were shown to significantly improve language modeling performance. In addition, they can mitigate the problem of factually inaccurate text generation and provide a natural source attribution mechanism. Existing RALM approaches focus on modifying the LM architecture in order to facilitate the incorporation of external information, significantly complicating deployment. This paper considers a simple alternative, which we dub In-Context RALM: leaving the LM architecture unchanged and prepending grounding documents to the input, without any further training of the LM. We show that In-Context RALM that builds on off-the-shelf general purpose retrievers provides surprisingly large LM gains across model sizes and diverse corpora. We also demonstrate that the document retrieval and ranking mechanism can be specialized to the RALM setting to further boost performance. We conclude that In-Context RALM has considerable potential to increase the prevalence of LM grounding, particularly in settings where a pretrained LM must be used without modification or even via API access.

Submitted 1 August, 2023; v1 submitted 31 January, 2023; originally announced February 2023.

Comments: Accepted for publication in Transactions of the Association for Computational Linguistics (TACL). pre-MIT Press publication version
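
The core idea in the abstract above, prepending retrieved grounding documents to the LM input while leaving the model untouched, can be illustrated with a toy sketch. Everything here is an assumption for illustration: the word-overlap retriever stands in for a real off-the-shelf retriever, the function names are hypothetical, and no actual language model is called.

```python
def overlap_score(query: str, doc: str) -> int:
    """Toy relevance score: number of lowercase whitespace tokens shared by query and doc."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k corpus documents with the highest toy relevance score."""
    ranked = sorted(corpus, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:k]


def build_in_context_prompt(prefix: str, corpus: list[str], k: int = 1) -> str:
    """Prepend retrieved documents to the prefix; the LM itself would be left unchanged."""
    return "\n\n".join(retrieve(prefix, corpus, k) + [prefix])


if __name__ == "__main__":
    corpus = ["The Eiffel Tower is in Paris.", "Mount Fuji is in Japan."]
    print(build_in_context_prompt("The Eiffel Tower is located in", corpus))
```

The resulting string would then be passed to any frozen LM in place of the bare prefix, which is what makes the approach usable even through API-only access.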

arXiv:2212.10947 [pdf, other]

Parallel Context Windows for Large Language Models

Authors: Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Inbal Magar, Omri Abend, Ehud Karpas, Amnon Shashua, Kevin Leyton-Brown, Yoav Shoham

Abstract: When applied to processing long text, Large Language Models (LLMs) are limited by their context window. Existing efforts to address this limitation involve training specialized architectures, and cannot be easily applied to off-the-shelf LLMs. We present Parallel Context Windows (PCW), a method that alleviates the context window restriction for any off-the-shelf LLM without further training. The key to the approach is to carve a long context into chunks ("windows"), restrict the attention mechanism to apply only within each window, and re-use the positional embeddings across the windows. Our main results test the PCW approach on in-context learning with models that range in size between 750 million and 178 billion parameters, and show substantial improvements for tasks with diverse input and output spaces. We show additional benefits in other settings where long context windows may be beneficial: multi-hop questions and retrieval-augmented question answering with multiple retrieved documents. Our results highlight Parallel Context Windows as a promising method for applying off-the-shelf LLMs in a range of settings that require long text sequences. We make our code publicly available at https://github.com/ai21labs/parallel-context-windows.

Submitted 1 August, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

Comments: The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)
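
The two mechanisms named in the abstract above, attention restricted to each window and positional indices reused across windows, can be sketched with plain lists. This is only an illustration under simplifying assumptions: it ignores causal masking and any tokens that follow the windows, and the function name is hypothetical rather than taken from the authors' code.

```python
def parallel_window_layout(window_lengths: list[int]) -> tuple[list[int], list[list[bool]]]:
    """Build reused position ids and a block-diagonal attention mask.

    position_ids restart at 0 in every window, and mask[i][j] is True only
    when tokens i and j belong to the same window.
    """
    position_ids: list[int] = []
    window_of_token: list[int] = []
    for window_index, length in enumerate(window_lengths):
        position_ids.extend(range(length))
        window_of_token.extend([window_index] * length)
    n = len(window_of_token)
    mask = [[window_of_token[i] == window_of_token[j] for j in range(n)] for i in range(n)]
    return position_ids, mask


if __name__ == "__main__":
    positions, mask = parallel_window_layout([3, 2])
    print(positions)
    for row in mask:
        print(["x" if allowed else "." for allowed in row])
```

Because positions restart in each window, no token is assigned a position index beyond the longest single window, however many windows are packed into the input.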

arXiv:2211.12581 [pdf, other]

UNSAT Solver Synthesis via Monte Carlo Forest Search

Authors: Chris Cameron, Jason Hartford, Taylor Lundy, Tuan Truong, Alan Milligan, Rex Chen, Kevin Leyton-Brown

Abstract: We introduce Monte Carlo Forest Search (MCFS), a class of reinforcement learning (RL) algorithms for learning policies in tree MDPs, for which policy execution involves traversing an exponential-sized tree. Examples of such problems include proving unsatisfiability of a SAT formula; counting the number of solutions of a satisfiable SAT formula; and finding the optimal solution to a mixed-integer program. MCFS algorithms can be seen as extensions of Monte Carlo Tree Search (MCTS) to cases where, rather than finding a good path (solution) within a tree, the problem is to find a small tree within a forest of candidate trees. We instantiate and evaluate our ideas in an algorithm that we dub Knuth Synthesis, an MCFS algorithm that learns DPLL branching policies for solving the Boolean satisfiability (SAT) problem, with the objective of achieving good average-case performance on a given distribution of unsatisfiable problem instances. Knuth Synthesis leverages two key ideas to avoid the prohibitive costs of policy evaluations in an exponentially-sized tree. First, we estimate tree size by randomly sampling paths and measuring their lengths, drawing on an unbiased approximation due to Knuth (1975). Second, we query a strong solver at a user-defined depth rather than learning a policy across the whole tree, to focus our policy search on early decisions that offer the greatest potential for reducing tree size. We matched or improved performance over a strong baseline on three well-known SAT distributions (R3SAT, sgen, satfc).

Submitted 25 May, 2023; v1 submitted 22 November, 2022; originally announced November 2022.
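
The tree-size estimate by random path sampling mentioned in the abstract above can be sketched generically. The following is a minimal version of Knuth's estimator on an explicit toy tree; the adjacency-dictionary representation and the function name are assumptions for illustration, and this is not the paper's Knuth Synthesis algorithm, which applies the idea to DPLL search trees.

```python
import random
from typing import Callable, Hashable, Sequence


def knuth_estimate(
    root: Hashable,
    children: Callable[[Hashable], Sequence[Hashable]],
    rng: random.Random,
) -> float:
    """One sample of Knuth's unbiased estimate of the number of nodes in a tree.

    Walk a single random root-to-leaf path, choosing each child uniformly.
    The estimate is 1 + b1 + b1*b2 + ..., where b_i is the branching factor
    seen at depth i along the path.
    """
    estimate = 1.0  # counts the root
    weight = 1.0    # product of branching factors so far
    node = root
    while True:
        kids = children(node)
        if not kids:
            return estimate
        weight *= len(kids)
        estimate += weight
        node = rng.choice(kids)


if __name__ == "__main__":
    # Toy unbalanced tree with 5 nodes: r -> (a, b), a -> (c, d).
    tree = {"r": ["a", "b"], "a": ["c", "d"]}
    rng = random.Random(0)
    samples = [knuth_estimate("r", lambda n: tree.get(n, []), rng) for _ in range(10000)]
    print(sum(samples) / len(samples))
```

On the toy tree each sample is either 3 (the path through `b`) or 7 (a path through `a`), each with probability one half, so the expectation equals the true size of 5 even though no single sample does.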

arXiv:2211.06318 [pdf]

Artificial Intelligence and Life in 2030: The One Hundred Year Study on Artificial Intelligence

Authors: Peter Stone, Rodney Brooks, Erik Brynjolfsson, Ryan Calo, Oren Etzioni, Greg Hager, Julia Hirschberg, Shivaram Kalyanakrishnan, Ece Kamar, Sarit Kraus, Kevin Leyton-Brown, David Parkes, William Press, AnnaLee Saxenian, Julie Shah, Milind Tambe, Astro Teller

Abstract: In September 2016, Stanford's "One Hundred Year Study on Artificial Intelligence" project (AI100) issued the first report of its planned long-term periodic assessment of artificial intelligence (AI) and its impact on society. It was written by a panel of 17 study authors, each of whom is deeply rooted in AI research, chaired by Peter Stone of the University of Texas at Austin. The report, entitled "Artificial Intelligence and Life in 2030," examines eight domains of typical urban settings on which AI is likely to have impact over the coming years: transportation, home and service robots, healthcare, education, public safety and security, low-resource communities, employment and workplace, and entertainment. It aims to provide the general public with a scientifically and technologically accurate portrayal of the current state of AI and its potential and to help guide decisions in industry and governments, as well as to inform research and development in the field. The charge for this report was given to the panel by the AI100 Standing Committee, chaired by Barbara Grosz of Harvard University.

Submitted 31 October, 2022; originally announced November 2022.

Comments: 52 pages, https://ai100.stanford.edu/2016-report

arXiv:2209.01242 [pdf, other]

Better Peer Grading through Bayesian Inference

Authors: Hedayat Zarkoob, Greg d'Eon, Lena Podina, Kevin Leyton-Brown

Abstract: Peer grading systems aggregate noisy reports from multiple students to approximate a true grade as closely as possible. Most current systems either take the mean or median of reported grades; others aim to estimate students' grading accuracy under a probabilistic model. This paper extends the state of the art in the latter approach in three key ways: (1) recognizing that students can behave strategically (e.g., reporting grades close to the class average without doing the work); (2) appropriately handling censored data that arises from discrete-valued grading rubrics; and (3) using mixed integer programming to improve the interpretability of the grades assigned to students. We show how to make Bayesian inference practical in this model and evaluate our approach on both synthetic and real-world data obtained by using our implemented system in four large classes. These extensive experiments show that grade aggregation using our model accurately estimates true grades, students' likelihood of submitting uninformative grades, and the variation in their inherent grading error; we also characterize our models' robustness.

Submitted 2 December, 2022; v1 submitted 2 September, 2022; originally announced September 2022.

arXiv:2205.13028 [pdf, other]

Formalizing Preferences Over Runtime Distributions

Authors: Devon R. Graham, Kevin Leyton-Brown, Tim Roughgarden

Abstract: When trying to solve a computational problem, we are often faced with a choice between algorithms that are guaranteed to return the right answer but differ in their runtime distributions (e.g., SAT solvers, sorting algorithms). This paper aims to lay theoretical foundations for such choices by formalizing preferences over runtime distributions. It might seem that we should simply prefer the algorithm that minimizes expected runtime. However, such preferences would be driven by exactly how slow our algorithm is on bad inputs, whereas in practice we are typically willing to cut off occasional, sufficiently long runs before they finish. We propose a principled alternative, taking a utility-theoretic approach to characterize the scoring functions that describe preferences over algorithms. These functions depend on the way our value for solving our problem decreases with time and on the distribution from which captimes are drawn. We describe examples of realistic utility functions and show how to leverage a maximum-entropy approach for modeling underspecified captime distributions. Finally, we show how to efficiently estimate an algorithm's expected utility from runtime samples.

Submitted 2 June, 2023; v1 submitted 25 May, 2022; originally announced May 2022.
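
To make the contrast between mean runtime and expected utility concrete, here is a minimal sketch. The linear utility that falls to zero at a cutoff `kappa` is a hypothetical choice made only for illustration, as are the function names; the abstract above does not specify the paper's utility functions.

```python
def linear_utility(runtime: float, kappa: float) -> float:
    """Hypothetical utility: 1 for an instant answer, falling linearly to 0 at kappa."""
    return max(0.0, 1.0 - runtime / kappa)


def mean_utility(runtimes: list[float], kappa: float) -> float:
    """Average utility over runtime samples.

    Runs longer than kappa contribute zero utility, so they could have been
    stopped at kappa without changing the estimate.
    """
    capped = [min(t, kappa) for t in runtimes]
    return sum(linear_utility(t, kappa) for t in capped) / len(capped)


if __name__ == "__main__":
    runtimes = [1.0, 5.0, 1000.0]  # one rare, extremely long run
    print(sum(runtimes) / len(runtimes))      # mean runtime, dominated by the outlier
    print(mean_utility(runtimes, kappa=10.0))  # bounded, insensitive to how long the outlier ran
```

Replacing the 1000-second run with a 10,000-second run changes the mean runtime by a factor of about nine but leaves the mean utility unchanged, which is the kind of behavior the abstract argues matches practice.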

arXiv:2205.00445 [pdf, other]

MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

Authors: Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, Dor Muhlgay, Noam Rozen, Erez Schwartz, Gal Shachaf, Shai Shalev-Shwartz, Amnon Shashua, Moshe Tennenholtz

Abstract: Huge language models (LMs) have ushered in a new era for AI, serving as a gateway to natural-language-based knowledge tasks. Although an essential element of modern AI, LMs are also inherently limited in a number of ways. We discuss these limitations and how they can be avoided by adopting a systems approach. Conceptualizing the challenge as one that involves knowledge and reasoning in addition to linguistic processing, we define a flexible architecture with multiple neural models, complemented by discrete knowledge and reasoning modules. We describe this neuro-symbolic architecture, dubbed the Modular Reasoning, Knowledge and Language (MRKL, pronounced "miracle") system, some of the technical challenges in implementing it, and Jurassic-X, AI21 Labs' MRKL system implementation.

Submitted 1 May, 2022; originally announced May 2022.

arXiv:2204.10019 [pdf, other]

Standing on the Shoulders of Giant Frozen Language Models

Authors: Yoav Levine, Itay Dalmedigos, Ori Ram, Yoel Zeldes, Daniel Jannai, Dor Muhlgay, Yoni Osin, Opher Lieber, Barak Lenz, Shai Shalev-Shwartz, Amnon Shashua, Kevin Leyton-Brown, Yoav Shoham

Abstract: Huge pretrained language models (LMs) have demonstrated surprisingly good zero-shot capabilities on a wide variety of tasks. This gives rise to the appealing vision of a single, versatile model with a wide range of functionalities across disparate applications. However, current leading techniques for leveraging a "frozen" LM -- i.e., leaving its weights untouched -- still often underperform fine-tuning approaches which modify these weights in a task-dependent way. Those, in turn, suffer forgetfulness and compromise versatility, suggesting a tradeoff between performance and versatility. The main message of this paper is that current frozen-model techniques such as prompt tuning are only the tip of the iceberg, and more powerful methods for leveraging frozen LMs can do just as well as fine tuning in challenging domains without sacrificing the underlying model's versatility. To demonstrate this, we introduce three novel methods for leveraging frozen models: input-dependent prompt tuning, frozen readers, and recursive LMs, each of which vastly improves on current frozen-model approaches. Indeed, some of our methods even outperform fine-tuning approaches in domains currently dominated by the latter. The computational cost of each method is higher than that of existing frozen model methods, but still negligible relative to a single pass through a huge frozen LM. Each of these methods constitutes a meaningful contribution in its own right, but by presenting these contributions together we aim to convince the reader of a broader message that goes beyond the details of any given method: that frozen models have untapped potential and that fine-tuning is often unnecessary.

Submitted 21 April, 2022; originally announced April 2022.

arXiv:2202.12273 [pdf, other]

Matching Papers and Reviewers at Large Conferences

Authors: Kevin Leyton-Brown, Mausam, Yatin Nandwani, Hedayat Zarkoob, Chris Cameron, Neil Newman, Dinesh Raghu

Abstract: Peer-reviewed conferences, the main publication venues in CS, rely critically on matching highly qualified reviewers for each paper. Because of the growing scale of these conferences, the tight timelines on which they operate, and a recent surge in explicitly dishonest behavior, there is now no alternative to performing this matching in an automated way. This paper studies a novel reviewer-paper matching approach that was recently deployed in the 35th AAAI Conference on Artificial Intelligence (AAAI 2021), and has since been adopted (wholly or partially) by other conferences including ICML 2022, AAAI 2022, and IJCAI 2022. This approach has three main elements: (1) collecting and processing input data to identify problematic matches and generate reviewer-paper scores; (2) formulating and solving an optimization problem to find good reviewer-paper matchings; and (3) a two-phase reviewing process that shifts reviewing resources away from papers likely to be rejected and towards papers closer to the decision boundary. This paper also describes an evaluation of these innovations based on an extensive post-hoc analysis on real data -- including a comparison with the matching algorithm used in AAAI's previous (2020) iteration -- and supplements this with additional numerical experimentation.

Submitted 5 August, 2022; v1 submitted 24 February, 2022; originally announced February 2022.

arXiv:2107.00758 [pdf, other]

The Spotlight: A General Method for Discovering Systematic Errors in Deep Learning Models

Authors: Greg d'Eon, Jason d'Eon, James R. Wright, Kevin Leyton-Brown

Abstract: Supervised learning models often make systematic errors on rare subsets of the data. When these subsets correspond to explicit labels in the data (e.g., gender, race) such poor performance can be identified straightforwardly. This paper introduces a method for discovering systematic errors that do not correspond to such explicitly labelled subgroups. The key idea is that similar inputs tend to have similar representations in the final hidden layer of a neural network. We leverage this structure by "shining a spotlight" on this representation space to find contiguous regions where the model performs poorly. We show that the spotlight surfaces semantically meaningful areas of weakness in a wide variety of existing models spanning computer vision, NLP, and recommender systems.

Submitted 15 October, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

arXiv:2106.10349 [pdf, other]

The Perils of Learning Before Optimizing

Authors: Chris Cameron, Jason Hartford, Taylor Lundy, Kevin Leyton-Brown

Abstract: Formulating real-world optimization problems often begins with making predictions from historical data (e.g., an optimizer that aims to recommend fast routes relies upon travel-time predictions). Typically, learning the prediction model used to generate the optimization problem and solving that problem are performed in two separate stages. Recent work has shown how such prediction models can be learned end-to-end by differentiating through the optimization task. Such methods often yield empirical improvements, which are typically attributed to end-to-end making better error tradeoffs than the standard loss function used in a two-stage solution. We refine this explanation and more precisely characterize when end-to-end can improve performance. When prediction targets are stochastic, a two-stage solution must make an a priori choice about which statistics of the target distribution to model -- we consider expectations over prediction targets -- while an end-to-end solution can make this choice adaptively. We show that the performance gap between a two-stage and end-to-end approach is closely related to the price of correlation concept in stochastic optimization and show the implications of some existing POC results for the predict-then-optimize problem. We then consider a novel and particularly practical setting, where multiple prediction targets are combined to obtain each of the objective function's coefficients. We give explicit constructions where (1) two-stage performs unboundedly worse than end-to-end; and (2) two-stage is optimal. We use simulations to experimentally quantify performance gaps and identify a wide range of real-world applications from the literature whose objective functions rely on multiple prediction targets, suggesting that end-to-end learning could yield significant improvements.

Submitted 16 December, 2021; v1 submitted 18 June, 2021; originally announced June 2021.

arXiv:2101.10078 [pdf, other]

Mechanical TA 2: A System for Peer Grading with TA Support

Authors: Hedayat Zarkoob, Farzad Abdolhosseini, Kevin Leyton-Brown

Abstract: Mechanical TA 2 (MTA2) is an open source web-based peer grading application that leverages trusted TA graders to incentivize high-quality peer review. A previous, prototype implementation of MTA proved the value of the concept, but was neither suitable for use at scale nor easily extensible; MTA2 is a complete reimplementation of the system that overcomes these hurdles. MTA2 serves two, interconnected purposes: facilitating practical peer grading and serving as a testbed for experimentation with different peer grading mechanisms. The system is characterized by a modular design that makes customization easy; support for dividing students into different pools based on their peer-grading prowess; mechanisms for automated calibration and spot checking; and the ability for students to appeal grades and to give feedback about individual reviews.

Submitted 18 January, 2021; originally announced January 2021.

arXiv:2012.00689 [pdf, ps, other]

Dynamic Weighted Matching with Heterogeneous Arrival and Departure Rates

Authors: Natalie Collina, Nicole Immorlica, Kevin Leyton-Brown, Brendan Lucier, Neil Newman

Abstract: We study a dynamic non-bipartite matching problem. There is a fixed set of agent types, and agents of a given type arrive and depart according to type-specific Poisson processes. Agent departures are not announced in advance. The value of a match is determined by the types of the matched agents. We present an online algorithm that is (1/8)-competitive with respect to the value of the optimal-in-hindsight policy, for arbitrary weighted graphs. Our algorithm treats agents heterogeneously, interpolating between immediate and delayed matching in order to thicken the market while still matching valuable agents opportunistically.

Submitted 10 January, 2021; v1 submitted 1 December, 2020; originally announced December 2020.

arXiv:2011.01285 [pdf, other]

Exemplar Guided Active Learning

Authors: Jason Hartford, Kevin Leyton-Brown, Hadas Raviv, Dan Padnos, Shahar Lev, Barak Lenz

Abstract: We consider the problem of wisely using a limited budget to label a small subset of a large unlabeled dataset. We are motivated by the NLP problem of word sense disambiguation. For any word, we have a set of candidate labels from a knowledge base, but the label set is not necessarily representative of what occurs in the data: there may exist labels in the knowledge base that very rarely occur in the corpus because the sense is rare in modern English; and conversely there may exist true labels that do not exist in our knowledge base. Our aim is to obtain a classifier that performs as well as possible on examples of each "common class" that occurs with frequency above a given threshold in the unlabeled set while annotating as few examples as possible from "rare classes" whose labels occur with less than this frequency. The challenge is that we are not informed which labels are common and which are rare, and the true label distribution may exhibit extreme skew. We describe an active learning approach that (1) explicitly searches for rare classes by leveraging the contextual embedding spaces provided by modern language models, and (2) incorporates a stopping rule that ignores classes once we prove that they occur below our target threshold with high probability. We prove that our algorithm only costs logarithmically more than a hypothetical approach that knows all true label frequencies and show experimentally that incorporating automated search can significantly reduce the number of samples needed to reach target accuracy levels.

Submitted 2 November, 2020; originally announced November 2020.

Comments: Published at NeurIPS 2020

arXiv:2010.01825 [pdf, other]

PMI-Masking: Principled masking of correlated spans

Authors: Yoav Levine, Barak Lenz, Opher Lieber, Omri Abend, Kevin Leyton-Brown, Moshe Tennenholtz, Yoav Shoham

Abstract: Masking tokens uniformly at random constitutes a common flaw in the pretraining of Masked Language Models (MLMs) such as BERT. We show that such uniform masking allows an MLM to minimize its training objective by latching onto shallow local signals, leading to pretraining inefficiency and suboptimal downstream performance. To address this flaw, we propose PMI-Masking, a principled masking strategy based on the concept of Pointwise Mutual Information (PMI), which jointly masks a token n-gram if it exhibits high collocation over the corpus. PMI-Masking motivates, unifies, and improves upon prior more heuristic approaches that attempt to address the drawback of random uniform token masking, such as whole-word masking, entity/phrase masking, and random-span masking. Specifically, we show experimentally that PMI-Masking reaches the performance of prior masking approaches in half the training time, and consistently improves performance at the end of training.

Submitted 5 October, 2020; originally announced October 2020.

arXiv:2006.11386 [pdf, other]

Valid Causal Inference with (Some) Invalid Instruments

Authors: Jason Hartford, Victor Veitch, Dhanya Sridhar, Kevin Leyton-Brown

Abstract: Instrumental variable methods provide a powerful approach to estimating causal effects in the presence of unobserved confounding. But a key challenge when applying them is the reliance on untestable "exclusion" assumptions that rule out any relationship between the instrument variable and the response that is not mediated by the treatment. In this paper, we show how to perform consistent IV estimation despite violations of the exclusion assumption. In particular, we show that when one has multiple candidate instruments, only a majority of these candidates---or, more generally, the modal candidate-response relationship---needs to be valid to estimate the causal effect. Our approach uses an estimate of the modal prediction from an ensemble of instrumental variable estimators. The technique is simple to apply and is "black-box" in the sense that it may be used with any instrumental variable estimator as long as the treatment effect is identified for each valid instrument independently. As such, it is compatible with recent machine-learning based estimators that allow for the estimation of conditional average treatment effects (CATE) on complex, high dimensional data. Experimentally, we achieve accurate estimates of conditional average treatment effects using an ensemble of deep network-based estimators, including on a challenging simulated Mendelian Randomization problem.

Submitted 19 June, 2020; originally announced June 2020.

arXiv:2006.04497 [pdf, other]

Learning under Invariable Bayesian Safety

Authors: Gal Bahar, Omer Ben-Porat, Kevin Leyton-Brown, Moshe Tennenholtz

Abstract: A recent body of work addresses safety constraints in explore-and-exploit systems. Such constraints arise where, for example, exploration is carried out by individuals whose welfare should be balanced with overall welfare. In this paper, we adopt a model inspired by recent work on a bandit-like setting for recommendations. We contribute to this line of literature by introducing a safety constraint that should be respected in every round and determines that the expected value in each round is above a given threshold. Due to our modeling, the safe explore-and-exploit policy deserves careful planning, or otherwise, it will lead to sub-optimal welfare. We devise an asymptotically optimal algorithm for the setting and analyze its instance-dependent convergence rate.

Submitted 8 June, 2020; originally announced June 2020.

arXiv:2003.09761 [pdf, other]

Smarter Parking: Using AI to Identify Parking Inefficiencies in Vancouver

Authors: Devon Graham, Satish Kumar Sarraf, Taylor Lundy, Ali MohammadMehr, Sara Uppal, Tae Yoon Lee, Hedayat Zarkoob, Scott Duke Kominers, Kevin Leyton-Brown

Abstract: On-street parking is convenient, but has many disadvantages: on-street spots come at the expense of other road uses such as traffic lanes, transit lanes, bike lanes, or parklets; drivers looking for parking contribute substantially to traffic congestion and hence to greenhouse gas emissions; safety is reduced both due to the fact that drivers looking for spots are more distracted than other road users and that people exiting parked cars pose a risk to cyclists. These social costs may not be worth paying when off-street parking lots are nearby and have surplus capacity. To see where this might be true in downtown Vancouver, we used artificial intelligence techniques to estimate the amount of time it would take drivers to both park on and off street for destinations throughout the city. For on-street parking, we developed (1) a deep-learning model of block-by-block parking availability based on data from parking meters and audits and (2) a computational simulation of drivers searching for an on-street spot. For off-street parking, we developed a computational simulation of the time it would take drivers to drive from their original destination to the nearest city-owned off-street lot and then to queue for a spot based on traffic and lot occupancy data. Finally, in both cases we also computed the time it would take the driver to walk from their parking spot to their original destination. We compared these time estimates for destinations in each block of Vancouver's downtown core and each hour of the day. We found many areas where off street would actually save drivers time over searching the streets for a spot, and many more where the time cost for parking off street was small. The identification of such areas provides an opportunity for the city to repurpose valuable curbside space for community-friendly uses more in line with its transportation goals.

Submitted 21 March, 2020; originally announced March 2020.

Comments: All the authors contributed equally. This paper is an outcome of https://www.cs.ubc.ca/~kevinlb/teaching/cs532l%20-%202018-19/index.html. To be submitted to a journal in transportation or urban planning

arXiv:1906.05884 [pdf, other]

Report-Sensitive Spot-Checking in Peer-Grading Systems

Authors: Hedayat Zarkoob, Hu Fu, Kevin Leyton-Brown

Abstract: Peer grading systems make large courses more scalable, provide students with faster and more detailed feedback, and help students to learn by thinking critically about the work of others. A key obstacle to the broader adoption of peer grading systems is motivating students to provide accurate grades. The literature has explored many different approaches to incentivizing accurate grading (which we survey in detail), but the strongest incentive guarantees have been offered by mechanisms that compare peer grades to trusted TA grades with a fixed probability. In this work, we show that less TA work is required when these probabilities are allowed to depend on the grades that students report. We prove this result in a model with two possible grades, arbitrary numbers of agents, no requirement that students grade multiple assignments, arbitrary but homogeneous noisy observation of the ground truth which students can pay a fixed cost to observe, and the possibility of misreporting grades before or after observing this signal. We give necessary and sufficient conditions for our new mechanism's feasibility, prove its optimality under these assumptions, and characterize its improvement over the previous state of the art both analytically and empirically. Finally, we relax our homogeneity assumption, allowing each student and TA to observe the ground truth according to a different noise model.

Submitted 8 March, 2021; v1 submitted 13 June, 2019; originally announced June 2019.

Comments: This work is published at AAMAS 2020 and supersedes an AAMAS 2019 extended abstract with the same title

arXiv:1905.07043 [pdf, ps, other]

Fiduciary Bandits

Authors: Gal Bahar, Omer Ben-Porat, Kevin Leyton-Brown, Moshe Tennenholtz

Abstract: Recommendation systems often face exploration-exploitation tradeoffs: the system can only learn about the desirability of new options by recommending them to some user. Such systems can thus be modeled as multi-armed bandit settings; however, users are self-interested and cannot be made to follow recommendations. We ask whether exploration can nevertheless be performed in a way that scrupulously respects agents' interests---i.e., by a system that acts as a fiduciary. More formally, we introduce a model in which a recommendation system faces an exploration-exploitation tradeoff under the constraint that it can never recommend any action that it knows yields lower reward in expectation than an agent would achieve if it acted alone. Our main contribution is a positive result: an asymptotically optimal, incentive compatible, and ex-ante individually rational recommendation algorithm.

Submitted 28 June, 2020; v1 submitted 16 May, 2019; originally announced May 2019.

Comments: Published in The Thirty-seventh International Conference on Machine Learning (ICML 2020)

arXiv:1902.05454 [pdf, other]

Procrastinating with Confidence: Near-Optimal, Anytime, Adaptive Algorithm Configuration

Authors: Robert Kleinberg, Kevin Leyton-Brown, Brendan Lucier, Devon Graham

Abstract: Algorithm configuration methods optimize the performance of a parameterized heuristic algorithm on a given distribution of problem instances. Recent work introduced an algorithm configuration procedure ("Structured Procrastination") that provably achieves near optimal performance with high probability and with nearly minimal runtime in the worst case. It also offers an $\textit{anytime}$ property: it keeps tightening its optimality guarantees the longer it is run. Unfortunately, Structured Procrastination is not $\textit{adaptive}$ to characteristics of the parameterized algorithm: it treats every input like the worst case. Follow-up work ("LeapsAndBounds") achieves adaptivity but trades away the anytime property. This paper introduces a new algorithm, "Structured Procrastination with Confidence", that preserves the near-optimality and anytime properties of Structured Procrastination while adding adaptivity. In particular, the new algorithm will perform dramatically faster in settings where many algorithm configurations perform poorly. We show empirically both that such settings arise frequently in practice and that the anytime property is useful for finding good configurations quickly.

Submitted 8 November, 2019; v1 submitted 14 February, 2019; originally announced February 2019.

arXiv:1812.11571 [pdf, ps, other]

A Formal Separation Between Strategic and Nonstrategic Behavior

Authors: James R. Wright, Kevin Leyton-Brown

Abstract: It is common to make a distinction between "strategic" behavior and other forms of intentional but "nonstrategic" behavior: typically, that strategic agents model other agents while nonstrategic agents do not. However, a crisp boundary between these concepts has proven elusive. This problem is pervasive throughout the game theoretic literature on bounded rationality and particularly critical in parts of the behavioral game theory literature that make an explicit distinction between the behavior of "nonstrategic" level-0 agents and "strategic" higher-level agents (e.g., the level-k and cognitive hierarchy models). Overall, work discussing bounded rationality rarely gives clear guidance on how the rationality of nonstrategic agents must be bounded, instead typically just singling out specific decision rules (e.g., randomizing uniformly, playing toward the best case, optimizing the worst case) and informally asserting that they are nonstrategic. In this work, we propose a new, formal characterization of nonstrategic behavior. Our main contribution is to show that it satisfies two properties: (1) it is general enough to capture all purportedly "nonstrategic" decision rules of which we are aware in the behavioral game theory literature; (2) behavior that obeys our characterization is distinct from strategic behavior in a precise sense.

Submitted 26 May, 2022; v1 submitted 30 December, 2018; originally announced December 2018.

arXiv:1803.02879 [pdf, other]

Deep Models of Interactions Across Sets

Authors: Jason Hartford, Devon R Graham, Kevin Leyton-Brown, Siamak Ravanbakhsh

Abstract: We use deep learning to model interactions across two or more sets of objects, such as user-movie ratings, protein-drug bindings, or ternary user-item-tag interactions. The canonical representation of such interactions is a matrix (or a higher-dimensional tensor) with an exchangeability property: the encoding's meaning is not changed by permuting rows or columns. We argue that models should hence be Permutation Equivariant (PE): constrained to make the same predictions across such permutations. We present a parameter-sharing scheme and prove that it could not be made any more expressive without violating PE. This scheme yields three benefits. First, we demonstrate state-of-the-art performance on multiple matrix completion benchmarks. Second, our models require a number of parameters independent of the numbers of objects, and thus scale well to large datasets. Third, models can be queried about new objects that were not available at training time, but for which interactions have since been observed. In experiments, our models achieved surprisingly good generalization performance on this matrix extrapolation task, both within domains (e.g., new users and new movies drawn from the same distribution used for training) and even across domains (e.g., predicting music ratings after training on movies).

Submitted 8 June, 2018; v1 submitted 7 March, 2018; originally announced March 2018.

arXiv:1706.04324 [pdf, other]

Assessing Economic Outcomes in Simulated Reverse Clock Auctions for Radio Spectrum

Authors: Neil Newman, Kevin Leyton-Brown, Paul Milgrom, Ilya Segal

Abstract: We investigate the economic outcomes that result under simulated bidder behavior in a model of the FCC's reverse auction for radio spectrum. In our simulations, limiting our notion of efficiency to the reverse auction in isolation, the reverse clock auction achieves very efficient solutions, the FCC's scoring rule greatly reduces the total payments to TV broadcasters at the cost of some efficiency, and using a poor feasibility checker can have grave consequences both in terms of the auction's cost and efficiency.

Submitted 14 June, 2017; originally announced June 2017.

arXiv:1706.03304 [pdf, other]

Deep Optimization for Spectrum Repacking

Authors: Neil Newman, Alexandre Fréchette, Kevin Leyton-Brown

Abstract: Over 13 months in 2016-17 the FCC conducted an "incentive auction" to repurpose radio spectrum from broadcast television to wireless internet. In the end, the auction yielded $19.8 billion, $10.05 billion of which was paid to 175 broadcasters for voluntarily relinquishing their licenses across 14 UHF channels. Stations that continued broadcasting were assigned potentially new channels to fit as densely as possible into the channels that remained. The government netted more than $7 billion (used to pay down the national debt) after covering costs. A crucial element of the auction design was the construction of a solver, dubbed SATFC, that determined whether sets of stations could be "repacked" in this way; it needed to run every time a station was given a price quote. This paper describes the process by which we built SATFC. We adopted an approach we dub "deep optimization", taking a data-driven, highly parametric, and computationally intensive approach to solver design. More specifically, to build SATFC we designed software that could pair both complete and local-search SAT-encoded feasibility checking with a wide range of domain-specific techniques. We then used automatic algorithm configuration techniques to construct a portfolio of eight complementary algorithms to be run in parallel, aiming to achieve good performance on instances that arose in proprietary auction simulations. To evaluate the impact of our solver in this paper, we built an open-source reverse auction simulator. We found that within the short time budget required in practice, SATFC solved more than 95% of the problems it encountered. Furthermore, the incentive auction paired with SATFC produced nearly optimal allocations in a restricted setting and substantially outperformed other alternatives at national scale.

Submitted 10 June, 2017; originally announced June 2017.

arXiv:1703.10342 [pdf, other]

Efficient Benchmarking of Algorithm Configuration Procedures via Model-Based Surrogates

Authors: Katharina Eggensperger, Marius Lindauer, Holger H. Hoos, Frank Hutter, Kevin Leyton-Brown

Abstract: The optimization of algorithm (hyper-)parameters is crucial for achieving peak performance across a wide range of domains, ranging from deep neural networks to solvers for hard combinatorial problems. The resulting algorithm configuration (AC) problem has attracted much attention from the machine learning community. However, the proper evaluation of new AC procedures is hindered by two key hurdles. First, AC benchmarks are hard to set up. Second and even more significantly, they are computationally expensive: a single run of an AC procedure involves many costly runs of the target algorithm whose performance is to be optimized in a given AC benchmark scenario. One common workaround is to optimize cheap-to-evaluate artificial benchmark functions (e.g., Branin) instead of actual algorithms; however, these have different properties than realistic AC problems. Here, we propose an alternative benchmarking approach that is similarly cheap to evaluate but much closer to the original AC problem: replacing expensive benchmarks by surrogate benchmarks constructed from AC benchmarks. These surrogate benchmarks approximate the response surface corresponding to true target algorithm performance using a regression model, and the original and surrogate benchmark share the same (hyper-)parameter space. In our experiments, we construct and evaluate surrogate benchmarks for hyperparameter optimization as well as for AC problems that involve performance optimization of solvers for hard combinatorial problems, drawing training data from the runs of existing AC procedures. We show that our surrogate benchmarks capture overall important characteristics of the AC scenarios, such as high- and low-performing regions, from which they were derived, while being much easier to use and orders of magnitude cheaper to evaluate.

Submitted 30 March, 2017; originally announced March 2017.

arXiv:1612.09596 [pdf, other]

Counterfactual Prediction with Deep Instrumental Variables Networks

Authors: Jason Hartford, Greg Lewis, Kevin Leyton-Brown, Matt Taddy

Abstract: We are in the middle of a remarkable rise in the use and capability of artificial intelligence. Much of this growth has been fueled by the success of deep learning architectures: models that map from observables to outputs via multiple layers of latent representations. These deep learning algorithms are effective tools for unstructured prediction, and they can be combined in AI systems to solve complex automated reasoning problems. This paper provides a recipe for combining ML algorithms to solve for causal effects in the presence of instrumental variables: sources of treatment randomization that are conditionally independent from the response. We show that a flexible IV specification resolves into two prediction tasks that can be solved with deep neural nets: a first-stage network for treatment prediction and a second-stage network whose loss function involves integration over the conditional treatment distribution. This Deep IV framework imposes some specific structure on the stochastic gradient descent routine used for training, but it is general enough that we can take advantage of off-the-shelf ML capabilities and avoid extensive algorithm customization. We outline how to obtain out-of-sample causal validation in order to avoid over-fit. We also introduce schemes for both Bayesian and frequentist inference: the former via a novel adaptation of dropout training, and the latter via a data splitting routine.

Submitted 30 December, 2016; originally announced December 2016.

arXiv:1609.08923 [pdf, other]

Models of Level-0 Behavior for Predicting Human Behavior in Games

Authors: James R. Wright, Kevin Leyton-Brown

Abstract: Behavioral game theory seeks to describe the way actual people (as compared to idealized, "rational" agents) act in strategic situations. Our own recent work has identified iterative models (such as quantal cognitive hierarchy) as the state of the art for predicting human play in unrepeated, simultaneous-move games (Wright & Leyton-Brown 2012, 2016). Iterative models predict that agents reason iteratively about their opponents, building up from a specification of nonstrategic behavior called level-0. The modeler is in principle free to choose any description of level-0 behavior that makes sense for the setting. However, almost all existing work specifies this behavior as a uniform distribution over actions. In most games it is not plausible that even nonstrategic agents would choose an action uniformly at random, nor that other agents would expect them to do so. A more accurate model for level-0 behavior has the potential to dramatically improve predictions of human behavior, since a substantial fraction of agents may play level-0 strategies directly, and furthermore since iterative models ground all higher-level strategies in responses to the level-0 strategy. Our work considers models of the way in which level-0 agents construct a probability distribution over actions, given an arbitrary game. Using a Bayesian optimization package called SMAC (Hutter, Hoos, & Leyton-Brown, 2010, 2011, 2012), we systematically evaluated a large space of such models, each of which makes its prediction based only on general features that can be computed from any normal form game. In the end, we recommend a model that achieved excellent performance across the board: a linear weighting of features that requires the estimation of four weights. We evaluated the effects of combining this new level-0 model with several iterative models, and observed large improvements in the models' predictive accuracies.

Submitted 28 September, 2016; originally announced September 2016.

arXiv:1606.07042 [pdf, ps, other]

Incentivizing Evaluation via Limited Access to Ground Truth: Peer-Prediction Makes Things Worse

Authors: Alice Gao, James R. Wright, Kevin Leyton-Brown

Abstract: In many settings, an effective way of evaluating objects of interest is to collect evaluations from dispersed individuals and to aggregate these evaluations together. Some examples are categorizing online content and evaluating student assignments via peer grading. For this data science problem, one challenge is to motivate participants to conduct such evaluations carefully and to report them honestly, particularly when doing so is costly. Existing approaches, notably peer-prediction mechanisms, can incentivize truth telling in equilibrium. However, they also give rise to equilibria in which agents do not pay the costs required to evaluate accurately, and hence fail to elicit useful information. We show that this problem is unavoidable whenever agents are able to coordinate using low-cost signals about the items being evaluated (e.g., text labels or pictures). We then consider ways of circumventing this problem by comparing agents' reports to ground truth, which is available in practice when there exist trusted evaluators (such as teaching assistants in the peer grading scenario) who can perform a limited number of unbiased (but noisy) evaluations. Of course, when such ground truth is available, a simpler approach is also possible: rewarding each agent based on agreement with ground truth with some probability, and unconditionally rewarding the agent otherwise. Surprisingly, we show that the simpler mechanism achieves stronger incentive guarantees given less access to ground truth than a large set of peer-prediction mechanisms.

Submitted 22 June, 2016; originally announced June 2016.
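
The "simpler approach" in the abstract above can be illustrated with a small expected-reward calculation. This is only a sketch under stated assumptions, not the paper's formal model: the function names, the fixed reward size, and the idea of summarizing an agent's accuracy by a single agreement probability are all hypothetical simplifications.

```python
def expected_reward(p_check, p_agree, reward):
    """Expected payment under a spot-checking mechanism.

    With probability p_check the report is compared to ground truth and
    paid only on agreement (probability p_agree); otherwise the agent is
    paid unconditionally.
    """
    return reward * ((1.0 - p_check) + p_check * p_agree)


def effort_is_worthwhile(p_check, agree_effort, agree_lazy, reward, cost):
    """True if evaluating carefully beats guessing, net of the effort cost.

    agree_effort and agree_lazy are the (assumed) probabilities of matching
    ground truth with and without paying the evaluation cost.
    """
    gain = (expected_reward(p_check, agree_effort, reward)
            - expected_reward(p_check, agree_lazy, reward))
    return gain > cost
```

Under these assumptions the incentive to exert effort grows linearly with the checking probability, which gives some intuition for why even limited access to ground truth can matter.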

arXiv:1506.02465 [pdf, other]

ASlib: A Benchmark Library for Algorithm Selection

Authors: Bernd Bischl, Pascal Kerschke, Lars Kotthoff, Marius Lindauer, Yuri Malitsky, Alexandre Frechette, Holger Hoos, Frank Hutter, Kevin Leyton-Brown, Kevin Tierney, Joaquin Vanschoren

Abstract: The task of algorithm selection involves choosing an algorithm from a set of algorithms on a per-instance basis in order to exploit the varying performance of algorithms over a set of instances. The algorithm selection problem is attracting increasing attention from researchers and practitioners in AI. Years of fruitful applications in a number of domains have resulted in a large amount of data, but the community lacks a standard format or repository for this data. This situation makes it difficult to share and compare different approaches effectively, as is done in other, more established fields. It also unnecessarily hinders new researchers who want to work in this area. To address this problem, we introduce a standardized format for representing algorithm selection scenarios and a repository that contains a growing number of data sets from the literature. Our format has been designed to be able to express a wide variety of different scenarios. Demonstrating the breadth and power of our platform, we describe a set of example experiments that build and evaluate algorithm selection models through a common interface. The results display the potential of algorithm selection to achieve significant performance improvements across a broad range of problems and algorithms.

Submitted 6 April, 2016; v1 submitted 8 June, 2015; originally announced June 2015.

Comments: Accepted to be published in Artificial Intelligence Journal

arXiv:1505.01221 [pdf, other]

The Configurable SAT Solver Challenge (CSSC)

Authors: Frank Hutter, Marius Lindauer, Adrian Balint, Sam Bayless, Holger Hoos, Kevin Leyton-Brown

Abstract: It is well known that different solution strategies work well for different types of instances of hard combinatorial problems. As a consequence, most solvers for the propositional satisfiability problem (SAT) expose parameters that allow them to be customized to a particular family of instances. In the international SAT competition series, these parameters are ignored: solvers are run using a single default parameter setting (supplied by the authors) for all benchmark instances in a given track. While this competition format rewards solvers with robust default settings, it does not reflect the situation faced by a practitioner who only cares about performance on one particular application and can invest some time into tuning solver parameters for this application. The new Configurable SAT Solver Competition (CSSC) compares solvers in this latter setting, scoring each solver by the performance it achieved after a fully automated configuration step. This article describes the CSSC in more detail, and reports the results obtained in its two instantiations so far, CSSC 2013 and 2014.

Submitted 2 August, 2016; v1 submitted 5 May, 2015; originally announced May 2015.

arXiv:1408.0703 [pdf, other]

Computational Analysis of Perfect-Information Position Auctions

Authors: David R. M Thompson, Kevin Leyton-Brown

Abstract: After experimentation with other designs, the major search engines converged on the weighted, generalized second-price auction (wGSP) for selling keyword advertisements. Notably, this convergence occurred before position auctions were well understood (or, indeed, widely studied) theoretically. While much progress has been made since, theoretical analysis is still not able to settle the question of why search engines found wGSP preferable to other position auctions. We approach this question in a new way, adopting a new analytical paradigm we dub "computational mechanism analysis." By sampling position auction games from a given distribution, encoding them in a computationally efficient representation language, computing their Nash equilibria, and then calculating economic quantities of interest, we can quantitatively answer questions that theoretical methods have not. We considered seven widely studied valuation models from the literature and three position auction variants (generalized first price, unweighted generalized second price, and wGSP). We found that wGSP consistently showed the best ads of any position auction, measured both by social welfare and by relevance (expected number of clicks). Even in models where wGSP was already known to have bad worst-case efficiency, we found that it almost always performed well on average. In contrast, we found that revenue was extremely variable across auction mechanisms, and was highly sensitive to equilibrium selection, the preference model, and the valuation distribution.

Submitted 4 August, 2014; originally announced August 2014.
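
For readers unfamiliar with wGSP, the following sketch shows the standard textbook allocation and pricing rule: bidders are ranked by weight times bid, and each winner pays per click the smallest bid that would keep its position. The function name and the conventions used here (strictly positive weights, no reserve price, ties broken by bidder index) are assumptions for illustration and are not taken from the abstract. Setting every weight to 1 gives the unweighted generalized second-price variant.

```python
def wgsp_outcome(bids, weights, num_slots):
    """Return (bidder, price_per_click) pairs, best slot first.

    Bidders are ranked by weights[i] * bids[i]. The bidder in each slot
    pays the next-ranked bidder's score divided by its own weight.
    Weights are assumed to be strictly positive.
    """
    order = sorted(range(len(bids)),
                   key=lambda i: weights[i] * bids[i],
                   reverse=True)
    outcome = []
    for pos, bidder in enumerate(order[:num_slots]):
        if pos + 1 < len(order):
            nxt = order[pos + 1]
            price = weights[nxt] * bids[nxt] / weights[bidder]
        else:
            price = 0.0  # nobody ranked below; assumes no reserve price
        outcome.append((bidder, price))
    return outcome
```

For example, with bids `[4.0, 3.0, 1.0]`, weights `[0.5, 1.0, 1.0]`, and two slots, bidder 1 takes the top slot despite the lower bid, because its weighted score is higher.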

arXiv:1401.8074 [pdf, ps, other]

Empirically Evaluating Multiagent Learning Algorithms

Authors: Erik Zawadzki, Asher Lipson, Kevin Leyton-Brown

Abstract: There exist many algorithms for learning how to play repeated bimatrix games. Most of these algorithms are justified in terms of some sort of theoretical guarantee. On the other hand, little is known about the empirical performance of these algorithms. Most such claims in the literature are based on small experiments, which has hampered understanding as well as the development of new multiagent learning (MAL) algorithms. We have developed a new suite of tools for running multiagent experiments: the MultiAgent Learning Testbed (MALT). These tools are designed to facilitate larger and more comprehensive experiments by removing the need to build one-off experimental code. MALT also provides baseline implementations of many MAL algorithms, hopefully eliminating or reducing differences between algorithm implementations and increasing the reproducibility of results. Using this test suite, we ran an experiment unprecedented in size. We analyzed the results according to a variety of performance metrics including reward, maxmin distance, regret, and several notions of equilibrium convergence. We confirmed several pieces of conventional wisdom, but also discovered some surprising results. For example, we found that single-agent $Q$-learning outperformed many more complicated and more modern MAL algorithms.

Submitted 31 January, 2014; originally announced January 2014.

arXiv:1401.3492 [pdf]

doi 10.1613/jair.2861

ParamILS: An Automatic Algorithm Configuration Framework

Authors: Frank Hutter, Thomas Stuetzle, Kevin Leyton-Brown, Holger H. Hoos

Abstract: The identification of performance-optimizing parameter settings is an important part of the development and application of algorithms. We describe an automatic framework for this algorithm configuration problem. More formally, we provide methods for optimizing a target algorithm's performance on a given class of problem instances by varying a set of ordinal and/or categorical parameters. We review a family of local-search-based algorithm configuration procedures and present novel techniques for accelerating them by adaptively limiting the time spent for evaluating individual configurations. We describe the results of a comprehensive experimental evaluation of our methods, based on the configuration of prominent complete and incomplete algorithms for SAT. We also present what is, to our knowledge, the first published work on automatically configuring the CPLEX mixed integer programming solver. All the algorithms we considered had default parameter settings that were manually identified with considerable effort. Nevertheless, using our automated algorithm configuration procedures, we achieved substantial and consistent performance improvements.

Submitted 15 January, 2014; originally announced January 2014.

Journal ref: Journal Of Artificial Intelligence Research, Volume 36, pages 267-306, 2009

arXiv:1310.1947 [pdf, other]

Bayesian Optimization With Censored Response Data

Authors: Frank Hutter, Holger Hoos, Kevin Leyton-Brown

Abstract: Bayesian optimization (BO) aims to minimize a given blackbox function using a model that is updated whenever new evidence about the function becomes available. Here, we address the problem of BO under partially right-censored response data, where in some evaluations we only obtain a lower bound on the function value. The ability to handle such response data allows us to adaptively censor costly function evaluations in minimization problems where the cost of a function evaluation corresponds to the function value. One important application giving rise to such censored data is the runtime-minimizing variant of the algorithm configuration problem: finding settings of a given parametric algorithm that minimize the runtime required for solving problem instances from a given distribution. We demonstrate that terminating slow algorithm runs prematurely and handling the resulting right-censored observations can substantially improve the state of the art in model-based algorithm configuration.

Submitted 7 October, 2013; originally announced October 2013.

Comments: Extended version of NIPS 2011 workshop paper

ACM Class: G.3; G.1.6
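
The sketch below illustrates how right-censored runtime observations of the kind described in the abstract above can arise. It is an assumption-laden toy: the function names are hypothetical, true runtimes are supplied directly rather than measured, and the rule of cutting a challenger off once it can no longer beat an incumbent's total runtime is one plausible capping policy, not necessarily the one used in the paper. Fitting a model to the censored data is not shown.

```python
def capped_observation(true_runtime, cutoff):
    """Simulate running an algorithm with a time limit.

    Returns (observed_value, censored). When censored is True, the
    observed value is only a lower bound on the true runtime.
    """
    if true_runtime > cutoff:
        return cutoff, True
    return true_runtime, False


def race_against_incumbent(challenger_runtimes, incumbent_total):
    """Evaluate a challenger, stopping once it cannot beat the incumbent.

    Each run gets only the time budget that remains before the
    challenger's total would exceed the incumbent's total.
    """
    spent = 0.0
    observations = []
    for runtime in challenger_runtimes:
        remaining = incumbent_total - spent
        if remaining <= 0:
            break
        value, censored = capped_observation(runtime, remaining)
        observations.append((value, censored))
        spent += value
        if censored:
            break  # challenger is already no better than the incumbent
    return observations
```

In this toy setting the challenger never consumes more total time than the incumbent did, which is where the savings from terminating slow runs early come from.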

arXiv:1306.0918 [pdf, other]

doi 10.1016/j.geb.2017.09.009

Predicting Human Behavior in Unrepeated, Simultaneous-Move Games

Authors: James R. Wright, Kevin Leyton-Brown

Abstract: It is common to assume that agents will adopt Nash equilibrium strategies; however, experimental studies have demonstrated that Nash equilibrium is often a poor description of human players' behavior in unrepeated normal-form games. In this paper, we analyze five widely studied models (Quantal Response Equilibrium, Level-k, Cognitive Hierarchy, QLk, and Noisy Introspection) that aim to describe actual, rather than idealized, human behavior in such games. We performed what we believe is the most comprehensive meta-analysis of these models, leveraging ten different data sets from the literature recording human play of two-player games. We began by evaluating the models' generalization or predictive performance, asking how well a model fits unseen test data after having had its parameters calibrated based on separate training data. Surprisingly, we found that what we dub the QLk model of Stahl & Wilson (1994) consistently achieved the best performance. Motivated by this finding, we describe methods for analyzing the posterior distributions over a model's parameters. We found that QLk's parameters were being set to values that were not consistent with their intended economic interpretations. We thus explored variations of QLk, ultimately identifying a new model family that has fewer parameters, gives rise to more parsimonious parameter values, and achieves better predictive performance.

Submitted 29 August, 2017; v1 submitted 4 June, 2013; originally announced June 2013.

Journal ref: Games and Economic Behavior, 106 (2017), pages 16-37

arXiv:1211.0906 [pdf, other]

Algorithm Runtime Prediction: Methods & Evaluation

Authors: Frank Hutter, Lin Xu, Holger H. Hoos, Kevin Leyton-Brown

Abstract: Perhaps surprisingly, it is possible to predict how long an algorithm will take to run on a previously unseen input, using machine learning techniques to build a model of the algorithm's runtime as a function of problem-specific instance features. Such models have important applications to algorithm analysis, portfolio-based algorithm selection, and the automatic configuration of parameterized algorithms. Over the past decade, a wide variety of techniques have been studied for building such models. Here, we describe extensions and improvements of existing models, new families of models, and, perhaps most importantly, a much more thorough treatment of algorithm parameters as model inputs. We also comprehensively describe new and existing features for predicting algorithm runtime for propositional satisfiability (SAT), travelling salesperson (TSP) and mixed integer programming (MIP) problems. We evaluate these innovations through the largest empirical analysis of its kind, comparing to a wide range of runtime modelling techniques from the literature. Our experiments consider 11 algorithms and 35 instance distributions; they also span a very wide range of SAT, MIP, and TSP instances, with the least structured having been generated uniformly at random and the most structured having emerged from real industrial applications. Overall, we demonstrate that our new models yield substantially better runtime predictions than previous approaches in terms of their generalization to new problem instances, to new algorithms from a parameterized space, and to both simultaneously.

Submitted 26 October, 2013; v1 submitted 5 November, 2012; originally announced November 2012.

Comments: 51 pages, 13 figures, 8 tables. Added references, feature cost, and experiments with subsets of features; reworded Sections 1&2

MSC Class: 68T20 ACM Class: I.2.8; I.2.6

arXiv:1208.3719 [pdf, other]

Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms

Authors: Chris Thornton, Frank Hutter, Holger H. Hoos, Kevin Leyton-Brown

Abstract: Many different machine learning algorithms exist; taking into account each algorithm's hyperparameters, there is a staggeringly large number of possible alternatives overall. We consider the problem of simultaneously selecting a learning algorithm and setting its hyperparameters, going beyond previous work that addresses these issues in isolation. We show that this problem can be addressed by a fully automated approach, leveraging recent innovations in Bayesian optimization. Specifically, we consider a wide range of feature selection techniques (combining 3 search and 8 evaluator methods) and all classification approaches implemented in WEKA, spanning 2 ensemble methods, 10 meta-methods, 27 base classifiers, and hyperparameter settings for each classifier. On each of 21 popular datasets from the UCI repository, the KDD Cup 09, variants of the MNIST dataset and CIFAR-10, we show classification performance often much better than using standard selection/hyperparameter optimization methods. We hope that our approach will help non-expert users to more effectively identify machine learning algorithms and hyperparameter settings appropriate to their applications, and hence to achieve improved performance.

Submitted 6 March, 2013; v1 submitted 17 August, 2012; originally announced August 2012.

Comments: 9 pages, 3 figures

Report number: Technical Report TR-2012-05 ACM Class: I.2.6; D.2.10; I.2.2

arXiv:1207.4128 [pdf]

Computing Nash Equilibria of Action-Graph Games

Authors: Navin Bhat, Kevin Leyton-Brown

Abstract: Action-graph games (AGGs) are a fully expressive game representation which can compactly express both strict and context-specific independence between players' utility functions. Actions are represented as nodes in a graph G, and the payoff to an agent who chose the action s depends only on the numbers of other agents who chose actions connected to s. We present algorithms for computing both symmetric and arbitrary equilibria of AGGs using a continuation method. We analyze the worst-case cost of computing the Jacobian of the payoff function, the exponential-time bottleneck step, and in all cases achieve exponential speedup. When the indegree of G is bounded by a constant and the game is symmetric, the Jacobian can be computed in polynomial time.

Submitted 11 July, 2012; originally announced July 2012.

Comments: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004)

Report number: UAI-P-2004-PG-35-42
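
The payoff structure described in the abstract above (an agent's payoff depends only on how many other agents chose actions connected to its own) can be sketched as follows. The data layout is an assumption for illustration: `in_neighbors` maps each action to an ordered tuple of connected actions, `utility` maps each action to a function of the resulting count tuple, and the name `agg_payoff` is hypothetical. Whether an action counts as connected to itself is left to the graph supplied by the caller.

```python
from collections import Counter


def agg_payoff(action_profile, player, in_neighbors, utility):
    """Payoff to `player` in a toy action-graph game.

    action_profile[i] is the action chosen by agent i. Only the numbers
    of *other* agents on actions connected to the player's own action
    matter, not the identities of those agents.
    """
    own_action = action_profile[player]
    neighborhood = in_neighbors[own_action]
    counts = Counter(
        action for i, action in enumerate(action_profile)
        if i != player and action in neighborhood
    )
    # The configuration lists counts in the neighborhood's fixed order.
    configuration = tuple(counts.get(a, 0) for a in neighborhood)
    return utility[own_action](configuration)
```

Because the utility function sees only a tuple of counts over a small neighborhood, its table can be far smaller than a full normal-form payoff table when the graph's indegree is low, which is the source of the compactness the abstract refers to.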

arXiv:1205.2638 [pdf]

Temporal Action-Graph Games: A New Representation for Dynamic Games

Authors: Albert Xin Jiang, Kevin Leyton-Brown, Avi Pfeffer

Abstract: In this paper we introduce temporal action graph games (TAGGs), a novel graphical representation of imperfect-information extensive form games. We show that when a game involves anonymity or context-specific utility independencies, its encoding as a TAGG can be much more compact than its direct encoding as a multiagent influence diagram (MAID). We also show that TAGGs can be understood as indirect MAID encodings in which many deterministic chance nodes are introduced. We provide an algorithm for computing with TAGGs, and show both theoretically and empirically that our approach improves significantly on the previous state of the art.

Submitted 9 May, 2012; originally announced May 2012.

Comments: Appears in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI2009)

Report number: UAI-P-2009-PG-268-276

arXiv:1111.2249 [pdf, ps]

doi 10.1613/jair.2490

SATzilla: Portfolio-based Algorithm Selection for SAT

Authors: Lin Xu, Frank Hutter, Holger H. Hoos, Kevin Leyton-Brown

Abstract: It has been widely observed that there is no single "dominant" SAT solver; instead, different solvers perform best on different instances. Rather than following the traditional approach of choosing the best solver for a given class of instances, we advocate making this decision online on a per-instance basis. Building on previous work, we describe SATzilla, an automated approach for constructing per-instance algorithm portfolios for SAT that use so-called empirical hardness models to choose among their constituent solvers. This approach takes as input a distribution of problem instances and a set of component solvers, and constructs a portfolio optimizing a given objective function (such as mean runtime, percent of instances solved, or score in a competition). The excellent performance of SATzilla was independently verified in the 2007 SAT Competition, where our SATzilla07 solvers won three gold, one silver and one bronze medal. In this article, we go well beyond SATzilla07 by making the portfolio construction scalable and completely automated, and improving it by integrating local search solvers as candidate solvers, by predicting performance score instead of runtime, and by using hierarchical hardness models that take into account different types of SAT instances. We demonstrate the effectiveness of these new techniques in extensive experimental results on data sets including instances from the most recent SAT competition.

Submitted 31 October, 2011; originally announced November 2011.

Journal ref: Journal of Artificial Intelligence Research, Volume 32, pages 565-606, 2008
