Mechanistic Interpretability for AI Safety -- A Review

Bereska, Leonard; Gavves, Efstratios

arXiv:2404.14082 (cs)

[Submitted on 22 Apr 2024]

Abstract: Understanding AI systems' inner workings is critical for ensuring value alignment and safety. This review explores mechanistic interpretability: reverse-engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding. We establish foundational concepts such as features encoding knowledge within neural activations and hypotheses about their representation and computation. We survey methodologies for causally dissecting model behaviors and assess the relevance of mechanistic interpretability to AI safety. We investigate challenges surrounding scalability, automation, and comprehensive interpretation. We advocate for clarifying concepts, setting standards, and scaling techniques to handle complex models and behaviors and expand to domains such as vision and reinforcement learning. Mechanistic interpretability could help prevent catastrophic outcomes as AI systems become more powerful and inscrutable.

Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2404.14082 [cs.AI]
	(or arXiv:2404.14082v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2404.14082

Submission history

From: Leonard Bereska
[v1] Mon, 22 Apr 2024 11:01:51 UTC (949 KB)
