Showing 1–3 of 3 results for author: Balwit, A

  1. arXiv:2310.13798  [pdf, other]

    cs.CL cs.AI

    Specific versus General Principles for Constitutional AI

    Authors: Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden McLean, Catherine Olsson, Cassie Evraets, Eli Tran-Johnson, Esin Durmus, Ethan Perez, Jackson Kernion, Jamie Kerr, Kamal Ndousse, Karina Nguyen, Nelson Elhage, Newton Cheng, Nicholas Schiefer, Nova DasSarma, Oliver Rausch, Robin Larson, et al. (11 additional authors not shown)

    Abstract: Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expressi…

    Submitted 20 October, 2023; originally announced October 2023.

  2. arXiv:2205.04279  [pdf]

    cs.CY cs.AI

    Aligned with Whom? Direct and social goals for AI systems

    Authors: Anton Korinek, Avital Balwit

    Abstract: As artificial intelligence (AI) becomes more powerful and widespread, the AI alignment problem - how to ensure that AI systems pursue the goals that we want them to pursue - has garnered growing attention. This article distinguishes two types of alignment problems depending on whose goals we consider, and analyzes the different solutions necessitated by each. The direct alignment problem considers…

    Submitted 9 May, 2022; originally announced May 2022.

    Comments: Prepared for the Oxford Handbook of AI Governance (23 pages, 2 figures)

  3. arXiv:2110.06674  [pdf, other]

    cs.CY cs.AI cs.CL

    Truthful AI: Developing and governing AI that does not lie

    Authors: Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, Luca Righetti, William Saunders

    Abstract: In many contexts, lying -- the use of verbal falsehoods to deceive -- is harmful. While lying has traditionally been a human affair, AI systems that make sophisticated verbal statements are becoming increasingly prevalent. This raises the question of how we should limit the harm caused by AI "lies" (i.e. falsehoods that are actively selected for). Human truthfulness is governed by social norms and…

    Submitted 13 October, 2021; originally announced October 2021.

    ACM Class: I.2.0