Dishonesty in Helpful and Harmless Alignment

Huang, Youcheng; Tang, **gkun; Feng, Duanyu; Zhang, Zheng; Lei, Wenqiang; Lv, Jiancheng; Cohn, Anthony G.

arXiv:2406.01931 (cs)

[Submitted on 4 Jun 2024 (v1), last revised 5 Jun 2024 (this version, v2)]

Abstract: People tell lies when seeking rewards. Large language models (LLMs) are aligned to human values with reinforcement learning, where they are rewarded for satisfying human preferences. We find that this also induces dishonesty in helpful and harmless alignment, where LLMs tell lies in generating harmless responses. Using the latest interpretability tools, we detect dishonesty, show how LLMs can become harmful if their honesty is increased, and analyze such conflicts at the parameter level. Given these preliminaries and the hypothesis that reward-seeking stimulates dishonesty, we theoretically show that dishonesty can in turn decrease alignment performance, and we augment reward-seeking alignment with representation regularization. Extensive results, including GPT-4 annotated win rates, perplexities, and case studies, demonstrate that we can train more honest, helpful, and harmless LLMs. We will open-source all our code and results upon this paper's acceptance.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2406.01931 [cs.CL]
	(or arXiv:2406.01931v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.01931

Submission history

From: Youcheng Huang [view email]
[v1] Tue, 4 Jun 2024 03:31:09 UTC (1,409 KB)
[v2] Wed, 5 Jun 2024 07:21:19 UTC (1,405 KB)
