-
Self-Attention in Colors: Another Take on Encoding Graph Structure in Transformers
Authors:
Romain Menegaux,
Emmanuel Jehanno,
Margot Selosse,
Julien Mairal
Abstract:
We introduce a novel self-attention mechanism, which we call CSA (Chromatic Self-Attention), which extends the notion of attention scores to attention _filters_, independently modulating the feature channels. We showcase CSA in a fully-attentional graph Transformer CGT (Chromatic Graph Transformer) which integrates both graph structural information and edge features, completely bypassing the need…
▽ More
We introduce a novel self-attention mechanism, which we call CSA (Chromatic Self-Attention), which extends the notion of attention scores to attention _filters_, independently modulating the feature channels. We showcase CSA in a fully-attentional graph Transformer CGT (Chromatic Graph Transformer) which integrates both graph structural information and edge features, completely bypassing the need for local message-passing components. Our method flexibly encodes graph structure through node-node interactions, by enriching the original edge features with a relative positional encoding scheme. We propose a new scheme based on random walks that encodes both structural and positional information, and show how to incorporate higher-order topological information, such as rings in molecular graphs. Our approach achieves state-of-the-art results on the ZINC benchmark dataset, while providing a flexible framework for encoding graph structure and incorporating higher-order topology.
△ Less
Submitted 21 April, 2023;
originally announced April 2023.
-
Rethinking Gauss-Newton for learning over-parameterized models
Authors:
Michael Arbel,
Romain Menegaux,
Pierre Wolinski
Abstract:
This work studies the global convergence and implicit bias of Gauss Newton's (GN) when optimizing over-parameterized one-hidden layer networks in the mean-field regime. We first establish a global convergence result for GN in the continuous-time limit exhibiting a faster convergence rate compared to GD due to improved conditioning. We then perform an empirical study on a synthetic regression task…
▽ More
This work studies the global convergence and implicit bias of Gauss Newton's (GN) when optimizing over-parameterized one-hidden layer networks in the mean-field regime. We first establish a global convergence result for GN in the continuous-time limit exhibiting a faster convergence rate compared to GD due to improved conditioning. We then perform an empirical study on a synthetic regression task to investigate the implicit bias of GN's method. While GN is consistently faster than GD in finding a global optimum, the learned model generalizes well on test data when starting from random initial weights with a small variance and using a small step size to slow down convergence. Specifically, our study shows that such a setting results in a hidden learning phenomenon, where the dynamics are able to recover features with good generalization properties despite the model having sub-optimal training and test performances due to an under-optimized linear layer. This study exhibits a trade-off between the convergence speed of GN and the generalization ability of the learned solution.
△ Less
Submitted 12 December, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
Bounds for VIX Futures given S&P 500 Smiles
Authors:
Julien Guyon,
Romain Menegaux,
Marcel Nutz
Abstract:
We derive sharp bounds for the prices of VIX futures using the full information of S&P 500 smiles. To that end, we formulate the model-free sub/superreplication of the VIX by trading in the S&P 500 and its vanilla options as well as the forward-starting log-contracts. A dual problem of minimizing/maximizing certain risk-neutral expectations is introduced and shown to yield the same value.
The cl…
▽ More
We derive sharp bounds for the prices of VIX futures using the full information of S&P 500 smiles. To that end, we formulate the model-free sub/superreplication of the VIX by trading in the S&P 500 and its vanilla options as well as the forward-starting log-contracts. A dual problem of minimizing/maximizing certain risk-neutral expectations is introduced and shown to yield the same value.
The classical bounds for VIX futures given the smiles only use a calendar spread of log-contracts on the S&P 500. We analyze for which smiles the classical bounds are sharp and how they can be improved when they are not. In particular, we introduce a family of functionally generated portfolios which often improves the classical bounds while still being tractable; more precisely, determined by a single concave/convex function on the line. Numerical experiments on market data and SABR smiles show that the classical lower bound can be improved dramatically, whereas the upper bound is often close to optimal.
△ Less
Submitted 22 June, 2017; v1 submitted 19 September, 2016;
originally announced September 2016.