Showing 1–18 of 18 results for author: El-Nouby, A

  1. arXiv:2406.11794

    cs.LG cs.CL

    DataComp-LM: In search of the next generation of training sets for language models

    Authors: Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, et al. (34 additional authors not shown)

    Abstract: We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with dat…

    Submitted 20 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: Project page: https://www.datacomp.ai/dclm/

  2. arXiv:2401.08541

    cs.CV

    Scalable Pre-training of Large Autoregressive Image Models

    Authors: Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, Armand Joulin

    Abstract: This paper introduces AIM, a collection of vision models pre-trained with an autoregressive objective. These models are inspired by their textual counterparts, i.e., Large Language Models (LLMs), and exhibit similar scaling properties. Specifically, we highlight two key findings: (1) the performance of the visual features scale with both the model capacity and the quantity of data, (2) the value o…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: https://github.com/apple/ml-aim

  3. arXiv:2305.05665

    cs.CV cs.AI cs.LG cs.MM

    ImageBind: One Embedding Space To Bind Them All

    Authors: Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

    Abstract: We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their…

    Submitted 31 May, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

    Comments: CVPR 2023 (Highlighted Paper). Website: https://imagebind.metademolab.com/ Code/Models: https://github.com/facebookresearch/ImageBind

  4. arXiv:2304.07193

    cs.CV

    DINOv2: Learning Robust Visual Features without Supervision

    Authors: Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, et al. (1 additional author not shown)

    Abstract: The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pr…

    Submitted 2 February, 2024; v1 submitted 14 April, 2023; originally announced April 2023.

  5. arXiv:2304.04518

    cs.CV cs.AI cs.LG

    Are Visual Recognition Models Robust to Image Compression?

    Authors: João Maria Janeiro, Stanislav Frolov, Alaaeldin El-Nouby, Jakob Verbeek

    Abstract: Reducing the data footprint of visual content via image compression is essential to reduce storage requirements, but also to reduce the bandwidth and latency requirements for transmission. In particular, the use of compressed images allows for faster transfer of data, and faster response times for visual recognition in edge devices that rely on cloud-based services. In this paper, we first analyze…

    Submitted 10 April, 2023; originally announced April 2023.

  6. arXiv:2301.11189

    eess.IV cs.AI cs.CV cs.IT

    Improving Statistical Fidelity for Neural Image Compression with Implicit Local Likelihood Models

    Authors: Matthew J. Muckley, Alaaeldin El-Nouby, Karen Ullrich, Hervé Jégou, Jakob Verbeek

    Abstract: Lossy image compression aims to represent images in as few bits as possible while maintaining fidelity to the original. Theoretical results indicate that optimizing distortion metrics such as PSNR or MS-SSIM necessarily leads to a discrepancy in the statistics of original images from those of reconstructions, in particular at low bitrates, often manifested by the blurring of the compressed images.…

    Submitted 10 August, 2023; v1 submitted 26 January, 2023; originally announced January 2023.

    Comments: Upload camera-ready to arXiv. Official version available at https://proceedings.mlr.press/v202/muckley23a.html

    Journal ref: Proceedings of the 40th International Conference on Machine Learning (2023) 25426-25443

  7. arXiv:2212.07372

    cs.CV eess.IV

    Image Compression with Product Quantized Masked Image Modeling

    Authors: Alaaeldin El-Nouby, Matthew J. Muckley, Karen Ullrich, Ivan Laptev, Jakob Verbeek, Hervé Jégou

    Abstract: Recent neural compression methods have been based on the popular hyperprior framework. It relies on Scalar Quantization and offers a very strong compression performance. This contrasts from recent advances in image generation and representation learning, where Vector Quantization is more commonly employed. In this work, we attempt to bring these lines of research closer by revisiting vector quanti…

    Submitted 6 November, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

  8. arXiv:2206.08356

    cs.CV cs.AI cs.LG stat.ML

    OmniMAE: Single Model Masked Pretraining on Images and Videos

    Authors: Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

    Abstract: Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work studies these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse pe…

    Submitted 31 May, 2023; v1 submitted 16 June, 2022; originally announced June 2022.

    Comments: CVPR 2023. Code/models: https://github.com/facebookresearch/omnivore

  9. arXiv:2203.09795

    cs.CV

    Three things everyone should know about Vision Transformers

    Authors: Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Jakob Verbeek, Hervé Jégou

    Abstract: After their initial success in natural language processing, transformer architectures have rapidly gained traction in computer vision, providing state-of-the-art results for tasks such as image classification, detection, segmentation, and video analysis. We offer three insights based on simple and easy to implement variants of vision transformers. (1) The residual layers of vision transformers, wh…

    Submitted 18 March, 2022; originally announced March 2022.

  10. arXiv:2112.13692

    cs.CV

    Augmenting Convolutional networks with attention-based aggregation

    Authors: Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Piotr Bojanowski, Armand Joulin, Gabriel Synnaeve, Hervé Jégou

    Abstract: We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling by an attention-based aggregation layer akin to a single transformer block, that weights how the patches are involved in the classification decision. We plug this learned aggregation layer with a simplistic patch-based convolutional network parame…

    Submitted 27 December, 2021; originally announced December 2021.

  11. arXiv:2112.10740

    cs.CV

    Are Large-scale Datasets Necessary for Self-Supervised Pre-training?

    Authors: Alaaeldin El-Nouby, Gautier Izacard, Hugo Touvron, Ivan Laptev, Hervé Jegou, Edouard Grave

    Abstract: Pre-training models on large scale datasets, like ImageNet, is a standard practice in computer vision. This paradigm is especially effective for tasks with small training sets, for which high-capacity models tend to overfit. In this work, we consider a self-supervised pre-training scenario that only leverages the target task data. We consider datasets, like Stanford Cars, Sketch or COCO, which are…

    Submitted 20 December, 2021; originally announced December 2021.

  12. arXiv:2106.09681

    cs.CV cs.LG

    XCiT: Cross-Covariance Image Transformers

    Authors: Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou

    Abstract: Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic comple…

    Submitted 18 June, 2021; v1 submitted 17 June, 2021; originally announced June 2021.

  13. arXiv:2105.03404

    cs.CV

    ResMLP: Feedforward networks for image classification with data-efficient training

    Authors: Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou

    Abstract: We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using hea…

    Submitted 10 June, 2021; v1 submitted 7 May, 2021; originally announced May 2021.

  14. arXiv:2104.01136

    cs.CV

    LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

    Authors: Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze

    Abstract: We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular…

    Submitted 6 May, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

  15. arXiv:2102.05644

    cs.CV

    Training Vision Transformers for Image Retrieval

    Authors: Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Hervé Jégou

    Abstract: Transformers have shown outstanding results for natural language understanding and, more recently, for image classification. We here extend this work and propose a transformer-based approach for image retrieval: we adopt vision transformers for generating image descriptors and train the resulting model with a metric learning objective, which combines a contrastive loss with a differential entropy…

    Submitted 10 February, 2021; originally announced February 2021.

  16. arXiv:1910.12770

    cs.CV

    Skip-Clip: Self-Supervised Spatiotemporal Representation Learning by Future Clip Order Ranking

    Authors: Alaaeldin El-Nouby, Shuangfei Zhai, Graham W. Taylor, Joshua M. Susskind

    Abstract: Deep neural networks require collecting and annotating large amounts of data to train successfully. In order to alleviate the annotation bottleneck, we propose a novel self-supervised representation learning approach for spatiotemporal features extracted from videos. We introduce Skip-Clip, a method that utilizes temporal coherence in videos, by training a deep model for future clip order ranking…

    Submitted 28 October, 2019; originally announced October 2019.

    Comments: Holistic Video Understanding Workshop ICCV2019

  17. arXiv:1811.09845

    cs.CV

    Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction

    Authors: Alaaeldin El-Nouby, Shikhar Sharma, Hannes Schulz, Devon Hjelm, Layla El Asri, Samira Ebrahimi Kahou, Yoshua Bengio, Graham W. Taylor

    Abstract: Conditional text-to-image generation is an active area of research, with many possible applications. Existing research has primarily focused on generating a single image from available conditioning information in one step. One practical extension beyond one-step generation is a system that generates an image iteratively, conditioned on ongoing linguistic input or feedback. This is significantly mo…

    Submitted 23 September, 2019; v1 submitted 24 November, 2018; originally announced November 2018.

    Comments: Accepted at ICCV 2019

    Journal ref: Proceedings of the 2019 IEEE International Conference on Computer Vision (ICCV)

  18. arXiv:1802.08362

    cs.CV

    Real-Time End-to-End Action Detection with Two-Stream Networks

    Authors: Alaaeldin El-Nouby, Graham W. Taylor

    Abstract: Two-stream networks have been very successful for solving the problem of action detection. However, prior work using two-stream networks train both streams separately, which prevents the network from exploiting regularities between the two streams. Moreover, unlike the visual stream, the dominant forms of optical flow computation typically do not maximally exploit GPU parallelism. We present a rea…

    Submitted 22 February, 2018; originally announced February 2018.