Showing 1–24 of 24 results for author: Christiano, P

  1. arXiv:2401.05566  [pdf, other]

    cs.CR cs.AI cs.CL cs.LG cs.SE

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    Authors: Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, et al. (14 additional authors not shown)

    Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept exa…

    Submitted 17 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

    Comments: updated to add missing acknowledgements

  2. arXiv:2312.11671  [pdf, other]

    cs.CL cs.AI cs.LG

    Evaluating Language-Model Agents on Realistic Autonomous Tasks

    Authors: Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan, Luke Harold Miles, Tao R. Lin, Hjalmar Wijk, Joel Burget, Aaron Ho, Elizabeth Barnes, Paul Christiano

    Abstract: In this report, we explore the ability of language model agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild. We refer to this cluster of capabilities as "autonomous replication and adaptation" or ARA. We believe that systems capable of ARA could have wide-reaching and hard-to-anticipate consequences, and that measuring and forecasting…

    Submitted 4 January, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: 14 pages

  3. arXiv:2305.15324  [pdf, other]

    cs.AI

    Model evaluation for extreme risks

    Authors: Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, Allan Dafoe

    Abstract: Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify danger…

    Submitted 22 September, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Fixed typos; added citation

    ACM Class: K.4.1

  4. arXiv:2211.06738  [pdf, ps, other]

    cs.AI cs.DM cs.LO

    Formalizing the presumption of independence

    Authors: Paul Christiano, Eric Neyman, Mark Xu

    Abstract: Mathematical proof aims to deliver confident conclusions, but a very similar process of deduction can be used to make uncertain estimates that are open to revision. A key ingredient in such reasoning is the use of a "default" estimate of $\mathbb{E}[XY] = \mathbb{E}[X] \mathbb{E}[Y]$ in the absence of any specific information about the correlation between $X$ and $Y$, which we call *the presumptio…

    Submitted 12 November, 2022; originally announced November 2022.

    Comments: 61 pages, 8 figures

  5. arXiv:2203.02155  [pdf, other]

    cs.CL cs.AI cs.LG

    Training language models to follow instructions with human feedback

    Authors: Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe

    Abstract: Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning wi…

    Submitted 4 March, 2022; originally announced March 2022.

  6. arXiv:2109.10862  [pdf, other]

    cs.CL cs.AI cs.LG

    Recursively Summarizing Books with Human Feedback

    Authors: Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, Paul Christiano

    Abstract: A major challenge for scaling machine learning is training models to perform tasks that are very difficult or time-consuming for humans to evaluate. We present progress on this problem on the task of abstractive summarization of entire fiction novels. Our method combines learning from human feedback with recursive task decomposition: we use models trained on smaller parts of the task to assist hum…

    Submitted 27 September, 2021; v1 submitted 22 September, 2021; originally announced September 2021.

  7. arXiv:2009.01325  [pdf, other]

    cs.CL cs.AI cs.LG

    Learning to summarize from human feedback

    Authors: Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano

    Abstract: As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about -- summary quality. In this work, we show that it is possible t…

    Submitted 15 February, 2022; v1 submitted 2 September, 2020; originally announced September 2020.

    Comments: NeurIPS 2020

  8. arXiv:1909.08593  [pdf, other]

    cs.CL cs.LG stat.ML

    Fine-Tuning Language Models from Human Preferences

    Authors: Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, Geoffrey Irving

    Abstract: Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and saf…

    Submitted 8 January, 2020; v1 submitted 18 September, 2019; originally announced September 2019.

  9. arXiv:1810.08575  [pdf, other]

    cs.LG cs.AI stat.ML

    Supervising strong learners by amplifying weak experts

    Authors: Paul Christiano, Buck Shlegeris, Dario Amodei

    Abstract: Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a training signal by demonstrating or judging performance, but this approach fails if the task is too complicated for a human to directly evaluate. We propose Iterated Amplification, an alterna…

    Submitted 19 October, 2018; originally announced October 2018.

  10. arXiv:1809.08352  [pdf, other]

    stat.ML cs.CV cs.LG

    Unrestricted Adversarial Examples

    Authors: Tom B. Brown, Nicholas Carlini, Chiyuan Zhang, Catherine Olsson, Paul Christiano, Ian Goodfellow

    Abstract: We introduce a two-player contest for evaluating the safety and robustness of machine learning systems, with a large prize pool. Unlike most prior work in ML robustness, which studies norm-constrained adversaries, we shift our focus to unconstrained adversaries. Defenders submit machine learning models, and try to achieve high accuracy and coverage on non-adversarial data while making no confident…

    Submitted 21 September, 2018; originally announced September 2018.

  11. arXiv:1805.00899  [pdf, other]

    stat.ML cs.LG

    AI safety via debate

    Authors: Geoffrey Irving, Paul Christiano, Dario Amodei

    Abstract: To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via se…

    Submitted 22 October, 2018; v1 submitted 2 May, 2018; originally announced May 2018.

    Comments: 24 pages, 6 figures

  12. arXiv:1804.00640  [pdf, ps, other]

    quant-ph cs.CC

    A Cryptographic Test of Quantumness and Certifiable Randomness from a Single Quantum Device

    Authors: Zvika Brakerski, Paul Christiano, Urmila Mahadev, Umesh Vazirani, Thomas Vidick

    Abstract: We consider a new model for the testing of untrusted quantum devices, consisting of a single polynomial-time bounded quantum device interacting with a classical polynomial-time verifier. In this model we propose solutions to two tasks - a protocol for efficient classical verification that the untrusted device is "truly quantum," and a protocol for producing certifiable randomness from a single unt…

    Submitted 4 May, 2021; v1 submitted 2 April, 2018; originally announced April 2018.

    Comments: 45 pages

  13. arXiv:1706.03741  [pdf, other]

    stat.ML cs.AI cs.HC cs.LG

    Deep reinforcement learning from human preferences

    Authors: Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei

    Abstract: For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari…

    Submitted 17 February, 2023; v1 submitted 12 June, 2017; originally announced June 2017.

  14. arXiv:1611.03852  [pdf, ps, other]

    cs.LG cs.AI

    A Connection between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models

    Authors: Chelsea Finn, Paul Christiano, Pieter Abbeel, Sergey Levine

    Abstract: Generative adversarial networks (GANs) are a recently proposed class of generative models in which a generator is trained to optimize a cost function that is being simultaneously learned by a discriminator. While the idea of learning cost functions is relatively new to the field of generative modeling, learning costs has long been studied in control and reinforcement learning (RL) domains, typical…

    Submitted 25 November, 2016; v1 submitted 11 November, 2016; originally announced November 2016.

    Comments: NIPS 2016 Workshop on Adversarial Training. First two authors contributed equally

  15. arXiv:1610.03518  [pdf, other]

    cs.RO cs.AI cs.LG eess.SY

    Transfer from Simulation to Real World through Learning Deep Inverse Dynamics Model

    Authors: Paul Christiano, Zain Shah, Igor Mordatch, Jonas Schneider, Trevor Blackwell, Joshua Tobin, Pieter Abbeel, Wojciech Zaremba

    Abstract: Developing control policies in simulation is often more practical and safer than directly running experiments in the real world. This applies to policies obtained from planning and optimization, and even more so to policies obtained from reinforcement learning, which is often very data demanding. However, a policy that succeeds in simulation often doesn't work when deployed on a real robot. Nevert…

    Submitted 11 October, 2016; originally announced October 2016.

  16. arXiv:1606.06565  [pdf, other]

    cs.AI cs.LG

    Concrete Problems in AI Safety

    Authors: Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané

    Abstract: Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical…

    Submitted 25 July, 2016; v1 submitted 21 June, 2016; originally announced June 2016.

    Comments: 29 pages

  17. arXiv:1605.02688  [pdf, other]

    cs.SC cs.LG cs.MS

    Theano: A Python framework for fast computation of mathematical expressions

    Authors: The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, Yoshua Bengio, Arnaud Bergeron, James Bergstra, Valentin Bisson, Josh Bleecher Snyder, Nicolas Bouchard, Nicolas Boulanger-Lewandowski, Xavier Bouthillier, Alexandre de Brébisson, Olivier Breuleux, Pierre-Luc Carrier, Kyunghyun Cho, Jan Chorowski, Paul Christiano, et al. (88 additional authors not shown)

    Abstract: Theano is a Python library that allows to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers - especially in the machine learning community - and has shown steady performance improvements. Theano is being actively and continuously developed since 2008, mu…

    Submitted 9 May, 2016; originally announced May 2016.

    Comments: 19 pages, 5 figures

  18. arXiv:1603.06265  [pdf, ps, other]

    cs.LG

    Collaborative prediction with expert advice

    Authors: Paul Christiano

    Abstract: Many practical learning systems aggregate data across many users, while learning theory traditionally considers a single learner who trusts all of their observations. A case in point is the foundational learning problem of prediction with expert advice. To date, there has been no theoretical study of the general collaborative version of prediction with expert advice, in which many users face a sim…

    Submitted 7 April, 2016; v1 submitted 20 March, 2016; originally announced March 2016.

  19. arXiv:1508.04145  [pdf, ps, other]

    cs.AI cs.GT

    Reflective Oracles: A Foundation for Classical Game Theory

    Authors: Benja Fallenstein, Jessica Taylor, Paul F. Christiano

    Abstract: Classical game theory treats players as special---a description of a game contains a full, explicit enumeration of all players---even though in the real world, "players" are no more fundamentally special than rocks or clouds. It isn't trivial to find a decision-theoretic foundation for game theory in which an agent's coplayers are a non-distinguished part of the agent's environment. Attempts to mo…

    Submitted 17 August, 2015; originally announced August 2015.

    Comments: Extended version of "Reflective Oracles: A Foundation for Game Theory in Artificial Intelligence" accepted to LORI-V

    ACM Class: I.2

  20. arXiv:1411.1127  [pdf, ps, other]

    cs.GT

    Provably Manipulation-Resistant Reputation Systems

    Authors: Paul Christiano

    Abstract: We consider a community of users who must make periodic decisions about whether to interact with one another. We propose a protocol which allows honest users to reliably interact with each other, while limiting the damage done by each malicious or incompetent user. The worst-case cost per user is sublinear in the average number of interactions per user and is independent of the number of users. Ou…

    Submitted 4 November, 2014; originally announced November 2014.

  21. arXiv:1403.5287  [pdf, ps, other]

    cs.LG

    Online Local Learning via Semidefinite Programming

    Authors: Paul Christiano

    Abstract: In many online learning problems we are interested in predicting local information about some universe of items. For example, we may want to know whether two items are in the same cluster rather than computing an assignment of items to clusters; we may want to know which of two teams will win a game rather than computing a ranking of teams. Although finding the optimal clustering or ranking is typ…

    Submitted 20 March, 2014; originally announced March 2014.

    Comments: 10 pages

  22. arXiv:1401.5577  [pdf, ps, other]

    cs.GT cs.LO

    Robust Cooperation in the Prisoner's Dilemma: Program Equilibrium via Provability Logic

    Authors: Mihaly Barasz, Paul Christiano, Benja Fallenstein, Marcello Herreshoff, Patrick LaVictoire, Eliezer Yudkowsky

    Abstract: We consider the one-shot Prisoner's Dilemma between algorithms with read-access to one anothers' source codes, and we use the modal logic of provability to build agents that can achieve mutual cooperation in a manner that is robust, in that cooperation does not require exact equality of the agents' source code, and unexploitable, meaning that such an agent never cooperates when its opponent defect…

    Submitted 4 April, 2021; v1 submitted 22 January, 2014; originally announced January 2014.

    Comments: 18 pages. Updated 2021 to indicate that bounded analogues of these results have yet to be proved

    ACM Class: F.4.1

  23. arXiv:1203.4740  [pdf, ps, other]

    quant-ph cs.CC

    Quantum Money from Hidden Subspaces

    Authors: Scott Aaronson, Paul Christiano

    Abstract: Forty years ago, Wiesner pointed out that quantum mechanics raises the striking possibility of money that cannot be counterfeited according to the laws of physics. We propose the first quantum money scheme that is (1) public-key, meaning that anyone can verify a banknote as genuine, not only the bank that printed it, and (2) cryptographically secure, under a "classical" hardness assumption that ha…

    Submitted 17 September, 2012; v1 submitted 21 March, 2012; originally announced March 2012.

    Comments: 48 pages, minor corrections and improvements, journal version to appear in Theory of Computing; Proceedings of ACM STOC 2012

  24. arXiv:1010.2921  [pdf, other]

    cs.DS cs.CC

    Electrical Flows, Laplacian Systems, and Faster Approximation of Maximum Flow in Undirected Graphs

    Authors: Paul Christiano, Jonathan A. Kelner, Aleksander Madry, Daniel A. Spielman, Shang-Hua Teng

    Abstract: We introduce a new approach to computing an approximately maximum s-t flow in a capacitated, undirected graph. This flow is computed by solving a sequence of electrical flow problems. Each electrical flow is given by the solution of a system of linear equations in a Laplacian matrix, and thus may be approximately computed in nearly-linear time. Using this approach, we develop the fastest known a…

    Submitted 19 October, 2010; v1 submitted 14 October, 2010; originally announced October 2010.