Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context

Cheng, Xiang; Chen, Yuxin; Sra, Suvrit

arXiv:2312.06528v3 (cs)

[Submitted on 11 Dec 2023 (v1), revised 26 Dec 2023 (this version, v3), latest version 4 Jun 2024 (v6)]

Abstract: Many neural network architectures have been shown to be Turing complete, and can thus implement arbitrary algorithms. However, Transformers are unique in that they can implement gradient-based learning algorithms under simple parameter configurations. A line of recent work shows that linear Transformers naturally learn to implement gradient descent (GD) when trained on a linear regression in-context learning task. But the linearity assumption (either in the Transformer architecture or in the learning task) is far from realistic settings, where non-linear activations crucially enable Transformers to learn complicated non-linear functions. In this paper, we provide theoretical and empirical evidence that non-linear Transformers can, and in fact do, learn to implement learning algorithms to learn non-linear functions in context. Our results apply to a broad class of combinations of non-linear architectures and non-linear in-context learning tasks. Interestingly, we show that the optimal choice of non-linear activation depends in a natural way on the non-linearity of the learning task.
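To make the term in the title concrete, the following is a minimal sketch of functional gradient descent for squared loss in a reproducing kernel Hilbert space. It is an illustration under stated assumptions, not the construction from the paper: the RBF kernel, the step-size rule, and the names `rbf_kernel` and `functional_gd_predict` are all hypothetical choices made here. Each step adds a kernel-weighted sum of the current residuals to the predictor, which loosely resembles an attention-style update in which the query point is compared against the in-context inputs and the residuals play the role of values.

```python
import numpy as np


def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def functional_gd_predict(X, y, X_query, steps=200, lr=None, bandwidth=1.0):
    """Run functional gradient descent on L(f) = 0.5 * sum_i (f(x_i) - y_i)^2.

    The RKHS gradient of L at f is sum_i (f(x_i) - y_i) K(x_i, .), so each
    step updates f <- f + lr * sum_i r_i K(x_i, .), with r_i = y_i - f(x_i).
    Only the values of f at the training and query points are tracked.
    """
    K_train = rbf_kernel(X, X, bandwidth)        # kernel among context points
    K_query = rbf_kernel(X_query, X, bandwidth)  # kernel from queries to context
    if lr is None:
        # Assumed step-size rule: 1 / largest eigenvalue of the kernel matrix,
        # which keeps the training loss from increasing.
        lr = 1.0 / np.linalg.eigvalsh(K_train).max()

    f_train = np.zeros(len(y))        # start from the zero function
    f_query = np.zeros(len(X_query))
    losses = []
    for _ in range(steps):
        residual = y - f_train
        losses.append(0.5 * np.sum(residual ** 2))
        # Same functional step, evaluated at the training and query inputs.
        f_train = f_train + lr * (K_train @ residual)
        f_query = f_query + lr * (K_query @ residual)
    return f_query, np.array(losses)
```

In this sketch the kernel is the only non-linearity, so swapping it changes which target functions are fit well after a given number of steps. That is meant only as an intuition for the abstract's remark about matching the activation to the task, not as a statement of the paper's result.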

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2312.06528 [cs.LG]
	(or arXiv:2312.06528v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2312.06528

Submission history

From: Xiang Cheng
[v1] Mon, 11 Dec 2023 17:05:25 UTC (155 KB)
[v2] Thu, 14 Dec 2023 17:19:55 UTC (156 KB)
[v3] Tue, 26 Dec 2023 21:20:12 UTC (158 KB)
[v4] Thu, 15 Feb 2024 22:45:28 UTC (209 KB)
[v5] Fri, 19 Apr 2024 21:05:34 UTC (257 KB)
[v6] Tue, 4 Jun 2024 00:20:05 UTC (244 KB)
