-
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
Authors:
Huy V. Vo,
Vasil Khalidov,
Timothée Darcet,
Théo Moutakanni,
Nikita Smetanin,
Marc Szafraniec,
Hugo Touvron,
Camille Couprie,
Maxime Oquab,
Armand Joulin,
Hervé Jégou,
Patrick Labatut,
Piotr Bojanowski
Abstract:
Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation require extensive human effort. This manual process has some limitations similar to those encountered in supervised learning, e.g., the crowd-sourced selection of data is costly and time-consuming, preventing scaling the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building datasets that satisfy all these criteria. Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on three different data domains including web-based images, satellite images and text show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par with or better than those trained on manually curated data. Code is available at https://github.com/facebookresearch/ssl-data-curation.
Submitted 28 June, 2024; v1 submitted 24 May, 2024;
originally announced May 2024.
-
DINOv2: Learning Robust Visual Features without Supervision
Authors:
Maxime Oquab,
Timothée Darcet,
Théo Moutakanni,
Huy Vo,
Marc Szafraniec,
Vasil Khalidov,
Pierre Fernandez,
Daniel Haziza,
Francisco Massa,
Alaaeldin El-Nouby,
Mahmoud Assran,
Nicolas Ballas,
Wojciech Galuba,
Russell Howes,
Po-Yao Huang,
Shang-Wen Li,
Ishan Misra,
Michael Rabbat,
Vasu Sharma,
Gabriel Synnaeve,
Hu Xu,
Hervé Jegou,
Julien Mairal,
Patrick Labatut,
Armand Joulin
, et al. (1 additional author not shown)
Abstract:
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.
Submitted 2 February, 2024; v1 submitted 14 April, 2023;
originally announced April 2023.
-
Conjugate Mixture Models for Clustering Multimodal Data
Authors:
Vasil Khalidov,
Florence Forbes,
Radu Horaud
Abstract:
The problem of multimodal clustering arises whenever the data are gathered with several physically different sensors. Observations from different modalities are not necessarily aligned in the sense that there is no obvious way to associate or to compare them in some common space. A solution may consist in considering multiple clustering tasks independently for each modality. The main difficulty with such an approach is to guarantee that the unimodal clusterings are mutually consistent. In this paper we show that multimodal clustering can be addressed within a novel framework, namely conjugate mixture models. These models exploit the explicit transformations that are often available between an unobserved parameter space (objects) and each one of the observation spaces (sensors). We formulate the problem as a likelihood maximization task and we derive the associated conjugate expectation-maximization algorithm. The convergence properties of the proposed algorithm are thoroughly investigated. Several local/global optimization techniques are proposed in order to increase its convergence speed. Two initialization strategies are proposed and compared. A consistent model-selection criterion is proposed. The algorithm and its variants are tested and evaluated within the task of 3D localization of several speakers using both auditory and visual data.
Submitted 9 December, 2020;
originally announced December 2020.
-
Continuous Surface Embeddings
Authors:
Natalia Neverova,
David Novotny,
Vasil Khalidov,
Marc Szafraniec,
Patrick Labatut,
Andrea Vedaldi
Abstract:
In this work, we focus on the task of learning and representing dense correspondences in deformable object categories. While this problem has been considered before, solutions so far have been rather ad-hoc for specific object types (i.e., humans), often with significant manual work involved. However, scaling the geometry understanding to all objects in nature requires more automated approaches that can also express correspondences between related, but geometrically different objects. To this end, we propose a new, learnable image-based representation of dense correspondences. Our model predicts, for each pixel in a 2D image, an embedding vector of the corresponding vertex in the object mesh, therefore establishing dense correspondences between image pixels and 3D object geometry. We demonstrate that the proposed approach performs on par or better than the state-of-the-art methods for dense pose estimation for humans, while being conceptually simpler. We also collect a new in-the-wild dataset of dense correspondences for animal classes and demonstrate that our framework scales naturally to the new deformable object categories.
Submitted 24 November, 2020;
originally announced November 2020.
-
Transferring Dense Pose to Proximal Animal Classes
Authors:
Artsiom Sanakoyeu,
Vasil Khalidov,
Maureen S. McCarthy,
Andrea Vedaldi,
Natalia Neverova
Abstract:
Recent contributions have demonstrated that it is possible to recognize the pose of humans densely and accurately given a large dataset of poses annotated in detail. In principle, the same approach could be extended to any animal class, but the effort required for collecting new annotations for each case makes this strategy impractical, despite important applications in natural conservation, science and business. We show that, at least for proximal animal classes such as chimpanzees, it is possible to transfer the knowledge existing in dense pose recognition for humans, as well as in more general object detectors and segmenters, to the problem of dense pose recognition in other classes. We do this by (1) establishing a DensePose model for the new animal which is also geometrically aligned to humans, (2) introducing a multi-head R-CNN architecture that facilitates transfer of multiple recognition tasks between classes, (3) finding which combination of known classes can be transferred most effectively to the new animal and (4) using self-calibrated uncertainty heads to generate pseudo-labels graded by quality for training a model for this class. We also introduce two benchmark datasets labelled in the manner of DensePose for the class chimpanzee and use them to evaluate our approach, showing excellent transfer learning performance.
Submitted 28 February, 2020;
originally announced March 2020.
-
Polygames: Improved Zero Learning
Authors:
Tristan Cazenave,
Yen-Chi Chen,
Guan-Wei Chen,
Shi-Yu Chen,
Xian-Dong Chiu,
Julien Dehos,
Maria Elsa,
Qucheng Gong,
Hengyuan Hu,
Vasil Khalidov,
Cheng-Ling Li,
Hsin-I Lin,
Yu-Jin Lin,
Xavier Martinet,
Vegard Mella,
Jeremy Rapin,
Baptiste Roziere,
Gabriel Synnaeve,
Fabien Teytaud,
Olivier Teytaud,
Shi-Cheng Ye,
Yi-Jun Ye,
Shi-Jim Yen,
Sergey Zagoruyko
Abstract:
Since DeepMind's AlphaZero, Zero learning quickly became the state-of-the-art method for many board games. It can be improved using a fully convolutional structure (no fully connected layer). Using such an architecture plus global pooling, we can create bots independent of the board size. The training can be made more robust by keeping track of the best checkpoints during the training and by training against them. Using these features, we release Polygames, our framework for Zero learning, with its library of games and its checkpoints. We won against strong humans at the game of Hex in 19x19, which was often said to be intractable for zero learning; and in Havannah. We also won several first places at the TAAI competitions.
Submitted 27 January, 2020;
originally announced January 2020.
-
Forward Modeling for Partial Observation Strategy Games - A StarCraft Defogger
Authors:
Gabriel Synnaeve,
Zeming Lin,
Jonas Gehring,
Dan Gant,
Vegard Mella,
Vasil Khalidov,
Nicolas Carion,
Nicolas Usunier
Abstract:
We formulate the problem of defogging as state estimation and future state prediction from previous, partial observations in the context of real-time strategy games. We propose to employ encoder-decoder neural networks for this task, and introduce proxy tasks and baselines for evaluation to assess their ability of capturing basic game rules and high-level dynamics. By combining convolutional neural networks and recurrent networks, we exploit spatial and sequential correlations and train well-performing models on a large dataset of human games of StarCraft: Brood War. Finally, we demonstrate the relevance of our models to downstream tasks by applying them for enemy unit prediction in a state-of-the-art, rule-based StarCraft bot. We observe improvements in win rates against several strong community bots.
Submitted 30 November, 2018;
originally announced December 2018.
-
STARDATA: A StarCraft AI Research Dataset
Authors:
Zeming Lin,
Jonas Gehring,
Vasil Khalidov,
Gabriel Synnaeve
Abstract:
We release a dataset of 65646 StarCraft replays that contains 1535 million frames and 496 million player actions. We provide full game state data along with the original replays that can be viewed in StarCraft. The game state data was recorded every 3 frames, which ensures suitability for a wide variety of machine learning tasks such as strategy classification, inverse reinforcement learning, imitation learning, forward modeling, partial information extraction, and others. We use TorchCraft to extract and store the data, which standardizes the data format for both reading from replays and reading directly from the game. Furthermore, the data can be used on different operating systems and platforms. The dataset contains valid, non-corrupted replays only and its quality and diversity were ensured by a number of heuristics. We illustrate the diversity of the data with various statistics and provide examples of tasks that benefit from the dataset. We make the dataset available at https://github.com/TorchCraft/StarData . En Taro Adun!
Submitted 7 August, 2017;
originally announced August 2017.