AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations

Lindström, Adam Dahlgren; Methnani, Leila; Krause, Lea; Ericson, Petter; de Troya, Íñigo Martínez de Rituerto; Mollo, Dimitri Coelho; Dobbe, Roel

arXiv:2406.18346 (cs)

[Submitted on 26 Jun 2024]

Authors: Adam Dahlgren Lindström, Leila Methnani, Lea Krause, Petter Ericson, Íñigo Martínez de Rituerto de Troya, Dimitri Coelho Mollo, Roel Dobbe

Abstract: This paper critically evaluates the attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback (RLxF) methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLxF techniques, revealing significant limitations in their approach to capturing the complexities of human ethics and contributing to AI safety. We highlight tensions and contradictions inherent in the goals of RLxF. In addition, we discuss ethically relevant issues that tend to be neglected in discussions about alignment and RLxF, including the trade-offs between user-friendliness and deception and between flexibility and interpretability, as well as system safety. We conclude by urging researchers and practitioners alike to critically assess the sociotechnical ramifications of RLxF, advocating for a more nuanced and reflective approach to its application in AI development.

Comments:	12 pages, 1 table, to be submitted
Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2406.18346 [cs.AI]
	(or arXiv:2406.18346v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2406.18346

Submission history

From: Petter Ericson
[v1] Wed, 26 Jun 2024 13:42:13 UTC (29 KB)
