-
Emergence of a High-Dimensional Abstraction Phase in Language Transformers
Authors:
Emily Cheng,
Diego Doimo,
Corentin Kervadec,
Iuri Macocco,
Jade Yu,
Alessandro Laio,
Marco Baroni
Abstract:
A language model (LM) is a mapping from a linguistic context to an output token. However, much remains to be known about this mapping, including how its geometric properties relate to its function. We take a high-level geometric approach to its analysis, observing, across five pre-trained transformer-based LMs and three input datasets, a distinct phase characterized by high intrinsic dimensionality. During this phase, representations (1) correspond to the first full linguistic abstraction of the input; (2) are the first to viably transfer to downstream tasks; (3) predict each other across different LMs. Moreover, we find that an earlier onset of the phase strongly predicts better language modelling performance. In short, our results suggest that a central high-dimensionality phase underlies core linguistic processing in many common LM architectures.
Submitted 24 May, 2024;
originally announced May 2024.
-
Beyond the noise: intrinsic dimension estimation with optimal neighbourhood identification
Authors:
Antonio Di Noia,
Iuri Macocco,
Aldo Glielmo,
Alessandro Laio,
Antonietta Mira
Abstract:
The Intrinsic Dimension (ID) is a key concept in unsupervised learning and feature selection, as it is a lower bound on the number of variables necessary to describe a system. However, in almost any real-world dataset the ID depends on the scale at which the data are analysed. Typically, at small scales the ID is very large, as the data are affected by measurement errors. At large scales, the ID can also be erroneously large, due to the curvature and the topology of the manifold containing the data. In this work, we introduce an automatic protocol to select the sweet spot, namely the correct range of scales in which the ID is meaningful and useful. This protocol is based on imposing that, for distances smaller than the correct scale, the density of the data is constant. Since estimating the density requires knowing the ID, this condition is imposed self-consistently. We illustrate the usefulness and robustness of this procedure by benchmarks on artificial and real-world datasets.
Submitted 23 May, 2024;
originally announced May 2024.
-
Intrinsic dimension estimation for discrete metrics
Authors:
Iuri Macocco,
Aldo Glielmo,
Jacopo Grilli,
Alessandro Laio
Abstract:
Real-world datasets characterized by discrete features are ubiquitous: from categorical surveys to clinical questionnaires, from unweighted networks to DNA sequences. Nevertheless, the most common unsupervised dimensional reduction methods are designed for continuous spaces, and their use for discrete spaces can lead to errors and biases. In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces. We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting, finding a surprisingly small ID, of order 2. This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of the sequence space.
Submitted 12 March, 2023; v1 submitted 20 July, 2022;
originally announced July 2022.
-
DADApy: Distance-based Analysis of DAta-manifolds in Python
Authors:
Aldo Glielmo,
Iuri Macocco,
Diego Doimo,
Matteo Carli,
Claudio Zeni,
Romina Wild,
Maria d'Errico,
Alex Rodriguez,
Alessandro Laio
Abstract:
DADApy is a Python software package for analysing and characterising high-dimensional data manifolds. It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering and for comparing different distance metrics. We review the main functionalities of the package and exemplify its usage in toy cases and in a real-world application. DADApy is freely available under the open-source Apache 2.0 license.
Submitted 19 September, 2022; v1 submitted 4 May, 2022;
originally announced May 2022.
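
The intrinsic dimension (ID) that recurs throughout the abstracts above can be made concrete with a small sketch. The code below is an illustration only: it is neither the DADApy API nor the protocol of any listed paper. It uses a maximum-likelihood form of the two-nearest-neighbour (TwoNN) idea, which assumes that the data density is roughly constant on the scale of the first two neighbours. Under that assumption, the ratio mu = r2 / r1 of each point's second to first neighbour distance follows a Pareto law with exponent equal to the ID, which gives the estimate N / sum(log mu). The function name `two_nn_id` and the toy dataset are hypothetical choices made for this example.

```python
import numpy as np


def two_nn_id(points):
    """Estimate the intrinsic dimension from nearest-neighbour distance ratios.

    Illustrative sketch: assumes no duplicate points and a locally
    uniform density around each point.
    """
    x = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances (O(N^2) memory, fine for small toy data).
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    # Exclude each point's zero distance to itself.
    np.fill_diagonal(dist, np.inf)
    # After partitioning, columns 0 and 1 hold the two smallest distances per row.
    nearest = np.partition(dist, 1, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]
    mu = r2 / r1
    # Maximum-likelihood estimate for the Pareto exponent of mu.
    return len(mu) / np.sum(np.log(mu))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: a 2-dimensional sheet linearly embedded in 5 dimensions.
    sheet = rng.uniform(size=(1000, 2))
    embedded = sheet @ rng.normal(size=(2, 5))
    print("embedding dimension:", embedded.shape[1])
    print("estimated intrinsic dimension:", two_nn_id(embedded))
```

On this toy sheet the estimate should land close to 2 even though the points live in five coordinates. Because only the two nearest neighbours are used, the estimate probes the smallest available scale, which is exactly where, as the second abstract notes, noise can inflate the ID.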