Showing 1–30 of 30 results for author: Kalantidis, Y

  1. arXiv:2404.04072  [pdf, other]

    cs.CV cs.LG

    Label Propagation for Zero-shot Classification with Vision-Language Models

    Authors: Vladan Stojnić, Yannis Kalantidis, Giorgos Tolias

    Abstract: Vision-Language Models (VLMs) have demonstrated impressive performance on zero-shot classification, i.e. classification when provided merely with a list of class names. In this paper, we tackle the case of zero-shot classification in the presence of unlabeled data. We leverage the graph structure of the unlabeled data and introduce ZLaP, a method based on label propagation (LP) that utilizes geode…

    Submitted 5 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  2. arXiv:2402.09237  [pdf, other]

    cs.CV

    Weatherproofing Retrieval for Localization with Generative AI and Geometric Consistency

    Authors: Yannis Kalantidis, Mert Bülent Sarıyıldız, Rafael S. Rezende, Philippe Weinzaepfel, Diane Larlus, Gabriela Csurka

    Abstract: State-of-the-art visual localization approaches generally rely on a first image retrieval step whose role is crucial. Yet, retrieval often struggles when facing varying conditions, due to e.g. weather or time of day, with dramatic consequences on the visual localization accuracy. In this paper, we improve this retrieval step and tailor it to the final localization task. Among the several changes w…

    Submitted 14 February, 2024; originally announced February 2024.

    Comments: Accepted at ICLR 2024. Project Page: https://europe.naverlabs.com/ret4loc

  3. arXiv:2303.16084  [pdf, other]

    cs.CV

    Rethinking matching-based few-shot action recognition

    Authors: Juliette Bertrand, Yannis Kalantidis, Giorgos Tolias

    Abstract: Few-shot action recognition, i.e. recognizing new action classes given only a few examples, benefits from incorporating temporal information. Prior work either encodes such information in the representation itself and learns classifiers at test time, or obtains frame-level features and performs pairwise temporal matching. We first evaluate a number of matching-based approaches using features from…

    Submitted 28 March, 2023; originally announced March 2023.

    Comments: Accepted at SCIA 2023

  4. arXiv:2212.08420  [pdf, other]

    cs.CV cs.LG

    Fake it till you make it: Learning transferable representations from synthetic ImageNet clones

    Authors: Mert Bulent Sariyildiz, Karteek Alahari, Diane Larlus, Yannis Kalantidis

    Abstract: Recent image generation models such as Stable Diffusion have exhibited an impressive ability to generate fairly realistic images starting from a simple text prompt. Could such models render real images obsolete for training image prediction models? In this paper, we answer part of this provocative question by investigating the need for real images when training models for ImageNet classification.…

    Submitted 28 March, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

    Comments: Accepted to CVPR 2023

  5. arXiv:2210.02254  [pdf, other]

    cs.CV

    Granularity-aware Adaptation for Image Retrieval over Multiple Tasks

    Authors: Jon Almazán, Byungsoo Ko, Geonmo Gu, Diane Larlus, Yannis Kalantidis

    Abstract: Strong image search models can be learned for a specific domain, i.e. set of labels, provided that some labeled images of that domain are available. A practical visual search model, however, should be versatile enough to solve multiple retrieval tasks simultaneously, even if those cover very different specialized domains. Additionally, it should be able to benefit from even unlabeled images from th…

    Submitted 5 October, 2022; originally announced October 2022.

    Comments: ECCV 2022

  6. arXiv:2208.10211  [pdf, other]

    cs.CV

    PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling

    Authors: Fabien Baradel, Romain Brégier, Thibault Groueix, Philippe Weinzaepfel, Yannis Kalantidis, Grégory Rogez

    Abstract: Training state-of-the-art models for human pose estimation in videos requires datasets with annotations that are really hard and expensive to obtain. Although transformers have been recently utilized for body pose sequence modeling, related methods rely on pseudo-ground truth to augment the currently limited training data available for learning such models. In this paper, we introduce PoseBERT, a…

    Submitted 19 October, 2022; v1 submitted 22 August, 2022; originally announced August 2022.

    Comments: Accepted to TPAMI 2022

  7. arXiv:2206.15369  [pdf, other]

    cs.CV cs.LG

    No Reason for No Supervision: Improved Generalization in Supervised Models

    Authors: Mert Bulent Sariyildiz, Yannis Kalantidis, Karteek Alahari, Diane Larlus

    Abstract: We consider the problem of training a deep neural network on a given classification task, e.g., ImageNet-1K (IN1K), so that it excels at both the training task as well as at other (future) transfer tasks. These two seemingly contradictory properties impose a trade-off between improving the model's generalization and maintaining its performance on the original task. Models trained with self-supervi…

    Submitted 10 March, 2023; v1 submitted 30 June, 2022; originally announced June 2022.

    Comments: Accepted to ICLR 2023 (spotlight)

  8. arXiv:2201.13182  [pdf, other]

    cs.CV

    Learning Super-Features for Image Retrieval

    Authors: Philippe Weinzaepfel, Thomas Lucas, Diane Larlus, Yannis Kalantidis

    Abstract: Methods that combine local and global features have recently shown excellent performance on multiple challenging deep image retrieval benchmarks, but their use of local features raises at least two issues. First, these local features simply boil down to the localized map activations of a neural network, and hence can be extremely redundant. Second, they are typically trained with a global loss tha…

    Submitted 31 January, 2022; originally announced January 2022.

    Comments: ICLR 2022

  9. arXiv:2110.09455  [pdf, other]

    cs.CV cs.AI cs.LG

    TLDR: Twin Learning for Dimensionality Reduction

    Authors: Yannis Kalantidis, Carlos Lassance, Jon Almazan, Diane Larlus

    Abstract: Dimensionality reduction methods are unsupervised approaches which learn low-dimensional spaces where some properties of the initial space, typically the notion of "neighborhood", are preserved. Such methods usually require propagation on large k-NN graphs or complicated optimization solvers. On the other hand, self-supervised learning approaches, typically used to learn representations from scrat…

    Submitted 15 June, 2022; v1 submitted 18 October, 2021; originally announced October 2021.

    Comments: Accepted at Transactions on Machine Learning Research (TMLR). Code available at: https://github.com/naver/tldr

  10. arXiv:2110.09243  [pdf, other]

    cs.CV

    Leveraging MoCap Data for Human Mesh Recovery

    Authors: Fabien Baradel, Thibault Groueix, Philippe Weinzaepfel, Romain Brégier, Yannis Kalantidis, Grégory Rogez

    Abstract: Training state-of-the-art models for human body pose and shape recovery from images or videos requires datasets with corresponding annotations that are really hard and expensive to obtain. Our goal in this paper is to study whether poses from 3D Motion Capture (MoCap) data can be used to improve image-based and video-based human mesh recovery methods. We find that fine-tuning image-based models with…

    Submitted 18 October, 2021; originally announced October 2021.

    Comments: 3DV 2021

  11. arXiv:2101.05068  [pdf, other]

    cs.CV

    Probabilistic Embeddings for Cross-Modal Retrieval

    Authors: Sanghyuk Chun, Seong Joon Oh, Rafael Sampaio de Rezende, Yannis Kalantidis, Diane Larlus

    Abstract: Cross-modal retrieval methods build a common representation space for samples from multiple modalities, typically from the vision and the language domains. For images and their captions, the multiplicity of the correspondences makes the task particularly challenging. Given an image (respectively a caption), there are multiple captions (respectively images) that equally make sense. In this paper, w…

    Submitted 14 June, 2021; v1 submitted 13 January, 2021; originally announced January 2021.

    Comments: Accepted to CVPR 2021; Code is available at https://github.com/naver-ai/pcme

  12. arXiv:2012.05649  [pdf, other]

    cs.CV cs.LG

    Concept Generalization in Visual Representation Learning

    Authors: Mert Bulent Sariyildiz, Yannis Kalantidis, Diane Larlus, Karteek Alahari

    Abstract: Measuring concept generalization, i.e., the extent to which models trained on a set of (seen) visual concepts can be leveraged to recognize a new set of (unseen) concepts, is a popular way of evaluating visual representations, especially in a self-supervised learning framework. Nonetheless, the choice of unseen concepts for such an evaluation is usually made arbitrarily, and independently from the…

    Submitted 10 September, 2021; v1 submitted 10 December, 2020; originally announced December 2020.

    Comments: Accepted to ICCV 2021. See our project website: https://europe.naverlabs.com/cog-benchmark for code and ImageNet-CoG level files

  13. arXiv:2010.01028  [pdf, other]

    cs.CV cs.LG

    Hard Negative Mixing for Contrastive Learning

    Authors: Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, Diane Larlus

    Abstract: Contrastive learning has become a key component of self-supervised learning approaches for computer vision. By learning to embed two augmented versions of the same image close to each other and to push the embeddings of different images apart, one can train highly transferable visual representations. As revealed by recent studies, heavy data augmentation and large sets of negatives are both crucia…

    Submitted 4 December, 2020; v1 submitted 2 October, 2020; originally announced October 2020.

    Comments: Accepted at NeurIPS 2020. Project page with pretrained models: https://europe.naverlabs.com/mochi

  14. arXiv:2004.11051   

    cs.CV cs.AI

    Proceedings of the ICLR Workshop on Computer Vision for Agriculture (CV4A) 2020

    Authors: Yannis Kalantidis, Laura Sevilla-Lara, Ernest Mwebaze, Dina Machuve, Hamed Alemohammad, David Guerena

    Abstract: This is the proceedings of the Computer Vision for Agriculture (CV4A) Workshop that was held in conjunction with the International Conference on Learning Representations (ICLR) 2020. The Computer Vision for Agriculture (CV4A) 2020 workshop was scheduled to be held in Addis Ababa, Ethiopia, on April 26th, 2020. It was held virtually that same day due to the COVID-19 pandemic. The workshop was hel…

    Submitted 17 May, 2020; v1 submitted 23 April, 2020; originally announced April 2020.

    Comments: 14 papers accepted, 4 as oral, 10 as spotlights

  15. arXiv:1910.09217  [pdf, other]

    cs.CV

    Decoupling Representation and Classifier for Long-Tailed Recognition

    Authors: Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, Yannis Kalantidis

    Abstract: The long-tail distribution of the visual world poses great challenges for deep learning based classification models on how to handle the class imbalance problem. Existing solutions usually involve class-balancing strategies, e.g., by loss re-weighting, data re-sampling, or transfer learning from head- to tail-classes, but most of them adhere to the scheme of jointly learning representations and cl…

    Submitted 19 February, 2020; v1 submitted 21 October, 2019; originally announced October 2019.

    Journal ref: Published as a conference paper at ICLR 2020

  16. arXiv:1906.00283  [pdf, other]

    cs.CV cs.CL cs.LG

    Learning to Generate Grounded Visual Captions without Localization Supervision

    Authors: Chih-Yao Ma, Yannis Kalantidis, Ghassan AlRegib, Peter Vajda, Marcus Rohrbach, Zsolt Kira

    Abstract: When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is whether the model uses the correct image regions to output particular words, or if the model is hallucinating based on priors in the dataset and/or the language model. The most common way of relating image regions with words in caption models is t…

    Submitted 17 July, 2020; v1 submitted 1 June, 2019; originally announced June 2019.

    Comments: ECCV 2020. Code is available at https://github.com/chihyaoma/cyclical-visual-captioning

  17. arXiv:1904.05049  [pdf, ps, other]

    cs.CV

    Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution

    Authors: Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, Jiashi Feng

    Abstract: In natural images, information is conveyed at different frequencies where higher frequencies are usually encoded with fine details and lower frequencies are usually encoded with global structures. Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies. In this work, we propose to factorize the mixed feature maps by their freq…

    Submitted 18 August, 2019; v1 submitted 10 April, 2019; originally announced April 2019.

    Comments: Accepted to ICCV 2019

  18. arXiv:1903.00859  [pdf, other]

    cs.CV

    Less is More: Learning Highlight Detection from Video Duration

    Authors: Bo Xiong, Yannis Kalantidis, Deepti Ghadiyaram, Kristen Grauman

    Abstract: Highlight detection has the potential to significantly ease video browsing, but existing methods often suffer from expensive supervision requirements, where human viewers must manually identify highlights in training videos. We propose a scalable unsupervised solution that exploits video duration as an implicit supervision signal. Our key insight is that video segments from shorter user-generated…

    Submitted 3 March, 2019; originally announced March 2019.

    Comments: To appear in CVPR 2019

  19. arXiv:1901.03460  [pdf, other]

    cs.CV

    DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition

    Authors: Zheng Shou, Xudong Lin, Yannis Kalantidis, Laura Sevilla-Lara, Marcus Rohrbach, Shih-Fu Chang, Zhicheng Yan

    Abstract: Motion has been shown to be useful for video understanding, where motion is typically represented by optical flow. However, computing flow from video frames is very time-consuming. Recent works directly leverage the motion vectors and residuals readily available in the compressed video to represent motion at no cost. While this avoids flow computation, it also hurts accuracy since the motion vector is…

    Submitted 7 May, 2019; v1 submitted 10 January, 2019; originally announced January 2019.

    Comments: Accepted by CVPR'19

  20. arXiv:1812.06587  [pdf, other]

    cs.CV

    Grounded Video Description

    Authors: Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, Marcus Rohrbach

    Abstract: Video description is one of the most challenging problems in vision and language understanding due to the large variability both on the video and language side. Models, hence, typically shortcut the difficulty in recognition and generate plausible sentences that are based on priors but are not necessarily grounded in the video. In this work, we explicitly link the sentence to the evidence in the v…

    Submitted 5 May, 2019; v1 submitted 16 December, 2018; originally announced December 2018.

    Comments: CVPR 2019 oral, camera-ready version including appendix

  21. arXiv:1811.12814  [pdf, other]

    cs.CV

    Graph-Based Global Reasoning Networks

    Authors: Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Shuicheng Yan, Jiashi Feng, Yannis Kalantidis

    Abstract: Globally modeling and reasoning over relations between regions can be beneficial for many computer vision tasks on both images and videos. Convolutional Neural Networks (CNNs) excel at modeling local relations by convolution operations, but they are typically inefficient at capturing global relations between distant regions and require stacking multiple convolution layers. In this work, we propose…

    Submitted 30 November, 2018; originally announced November 2018.

  22. arXiv:1810.11579  [pdf, other]

    cs.CV

    $A^2$-Nets: Double Attention Networks

    Authors: Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, Jiashi Feng

    Abstract: Learning to capture long-range relations is fundamental to image/video recognition. Existing CNN models generally rely on increasing depth to model such relations which is highly inefficient. In this work, we propose the "double attention block", a novel component that aggregates and propagates informative global features from the entire spatio-temporal space of input images/videos, enabling subse…

    Submitted 26 October, 2018; originally announced October 2018.

    Comments: Accepted at NIPS 2018

  23. arXiv:1807.11195  [pdf, other]

    cs.CV

    Multi-Fiber Networks for Video Recognition

    Authors: Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, Jiashi Feng

    Abstract: In this paper, we aim to reduce the computational cost of spatio-temporal deep neural networks, making them run as fast as their 2D counterparts while preserving state-of-the-art accuracy on video recognition benchmarks. To this end, we present the novel Multi-Fiber architecture that slices a complex neural network into an ensemble of lightweight networks or fibers that run through the network. To…

    Submitted 18 September, 2018; v1 submitted 30 July, 2018; originally announced July 2018.

    Comments: ECCV 2018, Code is on GitHub

  24. arXiv:1804.10660  [pdf, other]

    cs.CV

    Large-Scale Visual Relationship Understanding

    Authors: Ji Zhang, Yannis Kalantidis, Marcus Rohrbach, Manohar Paluri, Ahmed Elgammal, Mohamed Elhoseiny

    Abstract: Large scale visual understanding is challenging, as it requires a model to handle the widely-spread and imbalanced distribution of <subject, relation, object> triples. In real-world scenarios with large numbers of objects and relations, some are seen very commonly while others are barely seen. We develop a new relationship detection model that embeds objects and relations into two vector spaces wh…

    Submitted 16 August, 2019; v1 submitted 27 April, 2018; originally announced April 2018.

  25. arXiv:1708.01336  [pdf, other]

    cs.CV cs.CL

    MemexQA: Visual Memex Question Answering

    Authors: Lu Jiang, Junwei Liang, Liangliang Cao, Yannis Kalantidis, Sachin Farfade, Alexander Hauptmann

    Abstract: This paper proposes a new task, MemexQA: given a collection of photos or videos from a user, the goal is to automatically answer questions that help users recover their memory about events captured in the collection. Towards solving the task, we 1) present the MemexQA dataset, a large, realistic multimodal dataset consisting of real personal photos and crowd-sourced questions/answers, 2) propose M…

    Submitted 3 August, 2017; originally announced August 2017.

    Comments: https://memexqa.cs.cmu.edu/

  26. arXiv:1612.01922  [pdf, other]

    cs.CV

    Tag Prediction at Flickr: a View from the Darkroom

    Authors: Kofi Boakye, Sachin Farfade, Hamid Izadinia, Yannis Kalantidis, Pierre Garrigues

    Abstract: Automated photo tagging has established itself as one of the most compelling applications of deep learning. While deep convolutional neural networks have repeatedly demonstrated top performance on standard datasets for classification, there are a number of often overlooked but important considerations when deploying this technology in a real-world scenario. In this paper, we present our efforts in…

    Submitted 19 December, 2017; v1 submitted 6 December, 2016; originally announced December 2016.

    Comments: Presented at the ACM Multimedia Thematic Workshops, 2017

  27. arXiv:1604.06481  [pdf, other]

    cs.CV cs.HC

    Visual Congruent Ads for Image Search

    Authors: Yannis Kalantidis, Ayman Farahat, Lyndon Kennedy, Ricardo Baeza-Yates, David A. Shamma

    Abstract: The quality of user experience online is affected by the relevance and placement of advertisements. We propose a new system for selecting and displaying visual advertisements in image search result sets. Our method compares the visual similarity of candidate ads to the image search results and selects the most visually similar ad to be displayed. The method further selects an appropriate location…

    Submitted 21 April, 2016; originally announced April 2016.

  28. arXiv:1604.06480  [pdf, other]

    cs.CV cs.IR cs.MM

    LOH and behold: Web-scale visual search, recommendation and clustering using Locally Optimized Hashing

    Authors: Yannis Kalantidis, Lyndon Kennedy, Huy Nguyen, Clayton Mellina, David A. Shamma

    Abstract: We propose a novel hashing-based matching scheme, called Locally Optimized Hashing (LOH), based on a state-of-the-art quantization algorithm that can be used for efficient, large-scale search, recommendation, clustering, and deduplication. We show that matching with LOH only requires set intersections and summations to compute and so is easily implemented in generic distributed computing systems.…

    Submitted 29 July, 2016; v1 submitted 21 April, 2016; originally announced April 2016.

    Comments: Accepted for publication at the 4th Workshop on Web-scale Vision and Social Media (VSM), ECCV 2016

  29. arXiv:1602.07332  [pdf, other]

    cs.CV cs.AI

    Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

    Authors: Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Fei-Fei Li

    Abstract: Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designe…

    Submitted 23 February, 2016; originally announced February 2016.

    Comments: 44 pages, 37 figures

  30. arXiv:1512.04065  [pdf, other]

    cs.CV

    Cross-dimensional Weighting for Aggregated Deep Convolutional Features

    Authors: Yannis Kalantidis, Clayton Mellina, Simon Osindero

    Abstract: We propose a simple and straightforward way of creating powerful image representations via cross-dimensional weighting and aggregation of deep convolutional neural network layer outputs. We first present a generalized framework that encompasses a broad family of approaches and includes cross-dimensional pooling and weighting steps. We then propose specific non-parametric schemes for both spatial-…

    Submitted 29 July, 2016; v1 submitted 13 December, 2015; originally announced December 2015.

    Comments: Accepted for publication at the 4th Workshop on Web-scale Vision and Social Media (VSM), ECCV 2016