Showing 1–4 of 4 results for author: Macocco, I

  1. arXiv:2405.15471  [pdf, other]

    cs.CL

    Emergence of a High-Dimensional Abstraction Phase in Language Transformers

    Authors: Emily Cheng, Diego Doimo, Corentin Kervadec, Iuri Macocco, Jade Yu, Alessandro Laio, Marco Baroni

    Abstract: A language model (LM) is a mapping from a linguistic context to an output token. However, much remains to be known about this mapping, including how its geometric properties relate to its function. We take a high-level geometric approach to its analysis, observing, across five pre-trained transformer-based LMs and three input datasets, a distinct phase characterized by high intrinsic dimensionalit…

    Submitted 24 May, 2024; originally announced May 2024.

  2. arXiv:2405.15132  [pdf, other]

    stat.ML cs.LG math.ST stat.CO stat.ME

    Beyond the noise: intrinsic dimension estimation with optimal neighbourhood identification

    Authors: Antonio Di Noia, Iuri Macocco, Aldo Glielmo, Alessandro Laio, Antonietta Mira

    Abstract: The Intrinsic Dimension (ID) is a key concept in unsupervised learning and feature selection, as it is a lower bound to the number of variables which are necessary to describe a system. However, in almost any real-world dataset the ID depends on the scale at which the data are analysed. Quite typically at a small scale, the ID is very large, as the data are affected by measurement errors. At large…

    Submitted 23 May, 2024; originally announced May 2024.

  3. arXiv:2207.09688  [pdf, other]

    stat.ML cs.LG physics.comp-ph

    Intrinsic dimension estimation for discrete metrics

    Authors: Iuri Macocco, Aldo Glielmo, Jacopo Grilli, Alessandro Laio

    Abstract: Real-world datasets characterized by discrete features are ubiquitous: from categorical surveys to clinical questionnaires, from unweighted networks to DNA sequences. Nevertheless, the most common unsupervised dimensional reduction methods are designed for continuous spaces, and their use for discrete spaces can lead to errors and biases. In this letter we introduce an algorithm to infer the intri…

    Submitted 12 March, 2023; v1 submitted 20 July, 2022; originally announced July 2022.

    Comments: RevTeX4.2, 13 pages, 10 figures

  4. arXiv:2205.03373  [pdf, other]

    cs.LG physics.comp-ph stat.ML

    DADApy: Distance-based Analysis of DAta-manifolds in Python

    Authors: Aldo Glielmo, Iuri Macocco, Diego Doimo, Matteo Carli, Claudio Zeni, Romina Wild, Maria d'Errico, Alex Rodriguez, Alessandro Laio

    Abstract: DADApy is a Python software package for analysing and characterising high-dimensional data manifolds. It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering and for comparing different distance metrics. We review the main functionalities of the package and exemplify its usage in toy cases and in a real-world application. DADA…

    Submitted 19 September, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

    Comments: 9 pages, 6 figures. Patterns (2022)