-
On the uniqueness of best approximation in Orlicz spaces
Authors:
Ana Benavente,
Juan Costa Ponce,
Sergio Favier
Abstract:
We study uniqueness of best approximation in Orlicz spaces $L^Φ$, for different types of convex functions $Φ$ and for some finite-dimensional approximation classes of functions, where Tchebycheff spaces, and more general approximation ones, are involved.
Submitted 14 June, 2024;
originally announced June 2024.
-
Road to perdition? The effect of illicit drug use on labour market outcomes of prime-age men in Mexico
Authors:
José-Ignacio Antón,
Juan Ponce,
Rafael Muñoz de Bustillo
Abstract:
This study addresses the impact of illicit drug use on the labour market outcomes of men in Mexico. We leverage statistical information from three waves of a comparable national survey and make use of Lewbel's heteroskedasticity-based instrumental variable strategy to deal with the endogeneity of drug consumption. Our results suggest that drug consumption has quite negative effects in the Mexican context: it reduces employment, occupational attainment and formality and raises unemployment of local men. These effects seem larger than those estimated for high-income economies.
Submitted 17 May, 2024;
originally announced May 2024.
-
Revisiting Feature Prediction for Learning Visual Representations from Video
Authors:
Adrien Bardes,
Quentin Garrido,
Jean Ponce,
Xinlei Chen,
Michael Rabbat,
Yann LeCun,
Mahmoud Assran,
Nicolas Ballas
Abstract:
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaption of the model's parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
Submitted 15 February, 2024;
originally announced April 2024.
-
PICNIQ: Pairwise Comparisons for Natural Image Quality Assessment
Authors:
Nicolas Chahine,
Sira Ferradans,
Jean Ponce
Abstract:
Blind image quality assessment (BIQA) approaches, while promising for automating image quality evaluation, often fall short in real-world scenarios due to their reliance on a generic quality standard applied uniformly across diverse images. This one-size-fits-all approach overlooks the crucial perceptual relationship between image content and quality, leading to a 'domain shift' challenge where a single quality metric inadequately represents various content types. Furthermore, BIQA techniques typically overlook the inherent differences in the human visual system among different observers. In response to these challenges, this paper introduces PICNIQ, an innovative pairwise comparison framework designed to bypass the limitations of conventional BIQA by emphasizing relative, rather than absolute, quality assessment. PICNIQ is specifically designed to assess the quality differences between image pairs. The proposed framework implements a carefully crafted deep learning architecture, a specialized loss function, and a training strategy optimized for sparse comparison settings. By employing psychometric scaling algorithms like TrueSkill, PICNIQ transforms pairwise comparisons into just-objectionable-difference (JOD) quality scores, offering a granular and interpretable measure of image quality. We conduct our research using comparison matrices from the PIQ23 dataset, which are published in this paper. Our extensive experimental analysis showcases PICNIQ's broad applicability and superior performance over existing models, highlighting its potential to set new standards in the field of BIQA.
Submitted 13 March, 2024;
originally announced March 2024.
-
Generalized Portrait Quality Assessment
Authors:
Nicolas Chahine,
Sira Ferradans,
Javier Vazquez-Corral,
Jean Ponce
Abstract:
Automated and robust portrait quality assessment (PQA) is of paramount importance in high-impact applications such as smartphone photography. This paper presents FHIQA, a learning-based approach to PQA that introduces a simple but effective quality score rescaling method based on image semantics, to enhance the precision of fine-grained image quality metrics while ensuring robust generalization to various scene settings beyond the training dataset. The proposed approach is validated by extensive experiments on the PIQ23 benchmark and comparisons with the current state of the art. The source code of FHIQA will be made publicly available on the PIQ23 GitHub repository at https://github.com/DXOMARK-Research/PIQ2023.
Submitted 14 February, 2024;
originally announced February 2024.
-
Fine Dense Alignment of Image Bursts through Camera Pose and Depth Estimation
Authors:
Bruno Lecouat,
Yann Dubois de Mont-Marin,
Théo Bodrito,
Julien Mairal,
Jean Ponce
Abstract:
This paper introduces a novel approach to the fine alignment of images in a burst captured by a handheld camera. In contrast to traditional techniques that estimate two-dimensional transformations between frame pairs or rely on discrete correspondences, the proposed algorithm establishes dense correspondences by optimizing both the camera motion and surface depth and orientation at every pixel. This approach improves alignment, particularly in scenarios with parallax challenges. Extensive experiments with synthetic bursts featuring small and even tiny baselines demonstrate that it outperforms the best optical flow methods available today in this setting, without requiring any training. Beyond enhanced alignment, our method opens avenues for tasks beyond simple image restoration, such as depth estimation and 3D reconstruction, as supported by promising preliminary results. This positions our approach as a versatile tool for various burst image processing applications.
Submitted 8 December, 2023;
originally announced December 2023.
-
Dense Optical Tracking: Connecting the Dots
Authors:
Guillaume Le Moing,
Jean Ponce,
Cordelia Schmid
Abstract:
Recent approaches to point tracking are able to recover the trajectory of any scene point through a large portion of a video despite the presence of occlusions. They are, however, too slow in practice to track every point observed in a single frame in a reasonable amount of time. This paper introduces DOT, a novel, simple and efficient method for solving this problem. It first extracts a small set of tracks from key regions at motion boundaries using an off-the-shelf point tracking algorithm. Given source and target frames, DOT then computes rough initial estimates of a dense flow field and visibility mask through nearest-neighbor interpolation, before refining them using a learnable optical flow estimator that explicitly handles occlusions and can be trained on synthetic data with ground-truth correspondences. We show that DOT is significantly more accurate than current optical flow techniques, outperforms sophisticated "universal" trackers like OmniMotion, and is on par with, or better than, the best point tracking algorithms like CoTracker while being at least two orders of magnitude faster. Quantitative and qualitative experiments with synthetic and real videos validate the promise of the proposed approach. Code, data, and videos showcasing the capabilities of our approach are available on the project webpage: https://16lemoing.github.io/dot.
Submitted 4 March, 2024; v1 submitted 1 December, 2023;
originally announced December 2023.
-
Towards Real-World Focus Stacking with Deep Learning
Authors:
Alexandre Araujo,
Jean Ponce,
Julien Mairal
Abstract:
Focus stacking is widely used in micro, macro, and landscape photography to reconstruct all-in-focus images from multiple frames obtained with focus bracketing, that is, with shallow depth of field and different focus planes. Existing deep learning approaches to the underlying multi-focus image fusion problem have limited applicability to real-world imagery since they are designed for very short image sequences (two to four images), and are typically trained on small, low-resolution datasets either acquired by light-field cameras or generated synthetically. We introduce a new dataset consisting of 94 high-resolution bursts of raw images with focus bracketing, with pseudo ground truth computed from the data using state-of-the-art commercial software. This dataset is used to train the first deep learning algorithm for focus stacking capable of handling bursts of sufficient length for real-world applications. Qualitative experiments demonstrate that it is on par with existing commercial solutions in the long-burst, realistic regime while being significantly more tolerant to noise. The code and dataset are available at https://github.com/araujoalexandre/FocusStackingDataset.
Submitted 29 November, 2023;
originally announced November 2023.
-
The long-term impact of (un)conditional cash transfers on labour market outcomes in Ecuador
Authors:
Juan Ponce,
José-Ignacio Antón,
Mercedes Onofa,
Roberto Castillo
Abstract:
Despite the popularity of conditional cash transfers in low- and middle-income countries, evidence on their long-term effects remains scarce. This paper assesses the impact of Ecuador's Human Development Grant on the formal sector labour market outcomes of children in eligible households. This grant -- one of the first of its kind -- is characterised by weak enforcement of its eligibility criteria. By means of a regression discontinuity design, we find that this programme increased formal employment rates and labour income around a decade after exposure, thereby curbing the intergenerational transmission of poverty. We discuss possible mediating mechanisms based on findings from previous literature and, in particular, provide evidence on how the programme contributed to persistence in school in the medium run.
Submitted 29 September, 2023;
originally announced September 2023.
-
Revisiting Deformable Convolution for Depth Completion
Authors:
Xinglong Sun,
Jean Ponce,
Yu-Xiong Wang
Abstract:
Depth completion, which aims to generate high-quality dense depth maps from sparse depth maps, has attracted increasing attention in recent years. Previous work usually employs RGB images as guidance, and introduces iterative spatial propagation to refine estimated coarse depth maps. However, most of the propagation refinement methods require several iterations and suffer from a fixed receptive field, which may contain irrelevant and useless information with very sparse input. In this paper, we address these two challenges simultaneously by revisiting the idea of deformable convolution. We propose an effective architecture that leverages deformable kernel convolution as a single-pass refinement module, and empirically demonstrate its superiority. To better understand the function of deformable convolution and exploit it for depth completion, we further systematically investigate a variety of representative strategies. Our study reveals that, different from prior work, deformable convolution needs to be applied on an estimated depth map with a relatively high density for better performance. We evaluate our model on the large-scale KITTI dataset and achieve state-of-the-art level performance in both accuracy and inference speed. Our code is available at https://github.com/AlexSunNik/ReDC.
Submitted 3 August, 2023;
originally announced August 2023.
-
MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features
Authors:
Adrien Bardes,
Jean Ponce,
Yann LeCun
Abstract:
Self-supervised learning of visual representations has been focusing on learning content features, which do not capture object motion or location, and focus on identifying and differentiating objects in images and videos. On the other hand, optical flow estimation is a task that does not involve understanding the content of the images on which it is estimated. We unify the two approaches and introduce MC-JEPA, a joint-embedding predictive architecture and self-supervised learning approach to jointly learn optical flow and content features within a shared encoder, demonstrating that the two associated objectives, the optical flow estimation objective and the self-supervised learning objective, benefit from each other and thus learn content features that incorporate motion information. The proposed approach achieves performance on par with existing unsupervised optical flow benchmarks, as well as with common self-supervised learning approaches on downstream tasks such as semantic segmentation of images and videos.
Submitted 24 July, 2023;
originally announced July 2023.
-
Combining multi-spectral data with statistical and deep-learning models for improved exoplanet detection in direct imaging at high contrast
Authors:
Olivier Flasseur,
Théo Bodrito,
Julien Mairal,
Jean Ponce,
Maud Langlois,
Anne-Marie Lagrange
Abstract:
Exoplanet detection by direct imaging is a difficult task: the faint signals from the objects of interest are buried under a spatially structured nuisance component induced by the host star. The exoplanet signals can only be identified when combining several observations with dedicated detection algorithms. In contrast to most existing methods, we propose to learn a model of the spatial, temporal and spectral characteristics of the nuisance, directly from the observations. In a pre-processing step, a statistical model of their correlations is built locally, and the data are centered and whitened to improve both their stationarity and signal-to-noise ratio (SNR). A convolutional neural network (CNN) is then trained in a supervised fashion to detect the residual signature of synthetic sources in the pre-processed images. Our method leads to a better trade-off between precision and recall than standard approaches in the field. It also outperforms a state-of-the-art algorithm based solely on a statistical framework. Besides, the exploitation of the spectral diversity improves the performance compared to a similar model built solely from spatio-temporal data.
Submitted 21 June, 2023;
originally announced June 2023.
-
An Image Quality Assessment Dataset for Portraits
Authors:
Nicolas Chahine,
Ana-Stefania Calarasanu,
Davide Garcia-Civiero,
Theo Cayla,
Sira Ferradans,
Jean Ponce
Abstract:
Year after year, the demand for ever-better smartphone photos continues to grow, in particular in the domain of portrait photography. Manufacturers thus use perceptual quality criteria throughout the development of smartphone cameras. This costly procedure can be partially replaced by automated learning-based methods for image quality assessment (IQA). Due to its subjective nature, it is necessary to estimate and guarantee the consistency of the IQA process, a characteristic lacking in the mean opinion scores (MOS) widely used for crowdsourcing IQA. In addition, existing blind IQA (BIQA) datasets pay little attention to the difficulty of cross-content assessment, which may degrade the quality of annotations. This paper introduces PIQ23, a portrait-specific IQA dataset of 5116 images of 50 predefined scenarios acquired by 100 smartphones, covering a high variety of brands, models, and use cases. The dataset includes individuals of various genders and ethnicities who have given explicit and informed consent for their photographs to be used in public research. It is annotated by pairwise comparisons (PWC) collected from over 30 image quality experts for three image attributes: face detail preservation, face target exposure, and overall image quality. An in-depth statistical analysis of these annotations allows us to evaluate their consistency over PIQ23. Finally, we show through an extensive comparison with existing baselines that semantic information (image context) can be used to improve IQA predictions. The dataset along with the proposed statistical analysis and BIQA algorithms are available: https://github.com/DXOMARK-Research/PIQ2023
Submitted 12 April, 2023;
originally announced April 2023.
-
Pixel-wise Agricultural Image Time Series Classification: Comparisons and a Deformable Prototype-based Approach
Authors:
Elliot Vincent,
Jean Ponce,
Mathieu Aubry
Abstract:
Improvements in Earth observation by satellites allow for imagery of ever higher temporal and spatial resolution. Leveraging this data for agricultural monitoring is key for addressing environmental and economic challenges. Current methods for crop segmentation using temporal data either rely on annotated data or are heavily engineered to compensate for the lack of supervision. In this paper, we present and compare datasets and methods for both supervised and unsupervised pixel-wise segmentation of satellite image time series (SITS). We also introduce an approach to add invariance to spectral deformations and temporal shifts to classical prototype-based methods such as K-means and Nearest Centroid Classifier (NCC). We show this simple and highly interpretable method leads to meaningful results in both the supervised and unsupervised settings and significantly improves the state of the art for unsupervised classification of agricultural time series on four recent SITS datasets.
Submitted 22 March, 2023;
originally announced March 2023.
-
deep PACO: Combining statistical models with deep learning for exoplanet detection and characterization in direct imaging at high contrast
Authors:
Olivier Flasseur,
Théo Bodrito,
Julien Mairal,
Jean Ponce,
Maud Langlois,
Anne-Marie Lagrange
Abstract:
Direct imaging is an active research topic in astronomy for the detection and the characterization of young sub-stellar objects. The very high contrast between the host star and its companions makes the observations particularly challenging. In this context, post-processing methods combining several images recorded with the pupil tracking mode of the telescope are needed. In previous works, we have presented a data-driven algorithm, PACO, capturing locally the spatial correlations of the data with a multi-variate Gaussian model. PACO delivers better detection sensitivity and confidence than the standard post-processing methods of the field. However, there is room for improvement due to the approximate fidelity of the PACO statistical model to the time-evolving observations. In this paper, we propose to combine the statistical model of PACO with supervised deep learning. The data are first pre-processed with the PACO framework to improve the stationarity and the contrast. A convolutional neural network (CNN) is then trained in a supervised fashion to detect the residual signature of synthetic sources. Finally, the trained network delivers a detection map. The photometry of detected sources is estimated by a second CNN. We apply the proposed approach to several datasets from the VLT/SPHERE instrument. Our results show that its detection stage performs significantly better than baseline methods (cADI, PCA), and leads to a contrast improvement up to half a magnitude compared to PACO. The characterization stage of the proposed method performs on average on par with or better than the comparative algorithms (PCA, PACO) for angular separation above 0.5".
Submitted 11 October, 2023; v1 submitted 4 March, 2023;
originally announced March 2023.
-
WALDO: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction
Authors:
Guillaume Le Moing,
Jean Ponce,
Cordelia Schmid
Abstract:
This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to the prediction of future video frames from past ones. Individual images are decomposed into multiple layers combining object masks and a small set of control points. The layer structure is shared across all frames in each video to build dense inter-frame connections. Complex scene motions are modeled by combining parametric geometric transformations associated with individual layers, and video synthesis is broken down into discovering the layers associated with past frames, predicting the corresponding transformations for upcoming ones and warping the associated object regions accordingly, and filling in the remaining image parts. Extensive experiments on multiple benchmarks including urban videos (Cityscapes and KITTI) and videos featuring nonrigid motions (UCF-Sports and H3.6M), show that our method consistently outperforms the state of the art by a significant margin in every case. Code, pretrained models, and video samples synthesized by our approach can be found in the project webpage https://16lemoing.github.io/waldo.
Submitted 29 August, 2023; v1 submitted 25 November, 2022;
originally announced November 2022.
-
A minimum swept-volume metric structure for configuration space
Authors:
Yann de Mont-Marin,
Jean Ponce,
Jean-Paul Laumond
Abstract:
Borrowing elementary ideas from solid mechanics and differential geometry, this presentation shows that the volume swept by a regular solid undergoing a wide class of volume-preserving deformations induces a rather natural metric structure with well-defined and computable geodesics on its configuration space. This general result applies to concrete classes of articulated objects such as robot manipulators, and we demonstrate as a proof of concept the computation of geodesic paths for a free flying rod and planar robotic arms as well as their use in path planning with many obstacles.
Submitted 21 November, 2022;
originally announced November 2022.
-
Learning Reward Functions for Robotic Manipulation by Observing Humans
Authors:
Minttu Alakuijala,
Gabriel Dulac-Arnold,
Julien Mairal,
Jean Ponce,
Cordelia Schmid
Abstract:
Observing a human demonstrator manipulate objects provides a rich, scalable and inexpensive source of data for learning robotic policies. However, transferring skills from human videos to a robotic manipulator poses several challenges, not least a difference in action and observation spaces. In this work, we use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies. Thanks to the diversity of this training data, the learned reward function sufficiently generalizes to image observations from a previously unseen robot embodiment and environment to provide a meaningful prior for directed exploration in reinforcement learning. We propose two methods for scoring states relative to a goal image: through direct temporal regression, and through distances in an embedding space obtained with time-contrastive learning. By conditioning the function on a goal image, we are able to reuse one model across a variety of tasks. Unlike prior work on leveraging human videos to teach robots, our method, Human Offline Learned Distances (HOLD), requires neither a priori data from the robot environment, nor a set of task-specific human demonstrations, nor a predefined notion of correspondence across morphologies, yet it is able to accelerate training of several manipulation tasks on a simulated robot arm compared to using only a sparse reward obtained from task completion.
Submitted 7 March, 2023; v1 submitted 16 November, 2022;
originally announced November 2022.
-
VICRegL: Self-Supervised Learning of Local Visual Features
Authors:
Adrien Bardes,
Jean Ponce,
Yann LeCun
Abstract:
Most recent self-supervised methods for learning image representations focus on either producing a global feature with invariance properties, or producing a set of local features. The former works best for classification tasks while the latter is best for detection and segmentation tasks. This paper explores the fundamental trade-off between learning local and global features. A new method called VICRegL is proposed that learns good global and local features simultaneously, yielding excellent performance on detection and segmentation tasks while maintaining good performance on classification tasks. Concretely, two identical branches of a standard convolutional net architecture are fed two differently distorted versions of the same image. The VICReg criterion is applied to pairs of global feature vectors. Simultaneously, the VICReg criterion is applied to pairs of local feature vectors occurring before the last pooling layer. Two local feature vectors are attracted to each other if their l2-distance is below a threshold or if their relative locations are consistent with a known geometric transformation between the two input images. We demonstrate strong performance on linear classification and segmentation transfer tasks. Code and pretrained models are publicly available at: https://github.com/facebookresearch/VICRegL
Submitted 4 October, 2022;
originally announced October 2022.
-
High Dynamic Range and Super-Resolution from Raw Image Bursts
Authors:
Bruno Lecouat,
Thomas Eboli,
Jean Ponce,
Julien Mairal
Abstract:
Photographs captured by smartphones and mid-range cameras have limited spatial resolution and dynamic range, with noisy response in underexposed regions and color artefacts in saturated areas. This paper introduces the first approach (to the best of our knowledge) to the reconstruction of high-resolution, high-dynamic range color images from raw photographic bursts captured by a handheld camera with exposure bracketing. This method uses a physically-accurate model of image formation to combine an iterative optimization algorithm for solving the corresponding inverse problem with a learned image representation for robust alignment and a learned natural image prior. The proposed algorithm is fast, with low memory requirements compared to state-of-the-art learning-based approaches to image restoration, and features that are learned end to end from synthetic yet realistic data. Extensive experiments demonstrate its excellent performance with super-resolution factors of up to $\times 4$ on real photographs taken in the wild with hand-held cameras, and high robustness to low-light conditions, noise, camera shake, and moderate object motion.
Submitted 29 July, 2022;
originally announced July 2022.
-
Active Learning Strategies for Weakly-supervised Object Detection
Authors:
Huy V. Vo,
Oriane Siméoni,
Spyros Gidaris,
Andrei Bursuc,
Patrick Pérez,
Jean Ponce
Abstract:
Object detectors trained with weak annotations are affordable alternatives to fully-supervised counterparts. However, there is still a significant performance gap between them. We propose to narrow this gap by fine-tuning a base pre-trained weakly-supervised detector with a few fully-annotated samples automatically selected from the training set using "box-in-box" (BiB), a novel active learning strategy designed specifically to address the well-documented failure modes of weakly-supervised detectors. Experiments on the VOC07 and COCO benchmarks show that BiB outperforms other active learning techniques and significantly improves the base weakly-supervised detector's performance with only a few fully-annotated images per class. BiB reaches 97% of the performance of fully-supervised Fast RCNN with only 10% of fully-annotated images on VOC07. On COCO, using on average 10 fully-annotated images per class, or equivalently 1% of the training set, BiB also reduces the performance gap (in AP) between the weakly-supervised detector and the fully-supervised Fast RCNN by over 70%, showing a good trade-off between performance and data efficiency. Our code is publicly available at https://github.com/huyvvo/BiB.
Submitted 25 July, 2022;
originally announced July 2022.
-
Assembly Planning from Observations under Physical Constraints
Authors:
Thomas Chabal,
Robin Strudel,
Etienne Arlaud,
Jean Ponce,
Cordelia Schmid
Abstract:
This paper addresses the problem of copying an unknown assembly of primitives with known shape and appearance using information extracted from a single photograph by an off-the-shelf procedure for object detection and pose estimation. The proposed algorithm uses a simple combination of physical stability constraints, convex optimization and Monte Carlo tree search to plan assemblies as sequences of pick-and-place operations represented by STRIPS operators. It is efficient and, most importantly, robust to the errors in object detection and pose estimation unavoidable in any real robotic system. The proposed approach is demonstrated with thorough experiments on a UR5 manipulator.
Submitted 25 October, 2022; v1 submitted 20 April, 2022;
originally announced April 2022.
-
Modeling Immunity to Malaria with an Age-Structured PDE Framework
Authors:
Zhuolin Qu,
Denis Patterson,
Lauren Childs,
Christina Edholm,
Joan Ponce,
Olivia Prosper,
Lihong Zhao
Abstract:
Malaria is one of the deadliest infectious diseases globally, causing hundreds of thousands of deaths each year. It disproportionately affects young children, with two-thirds of fatalities occurring in under-fives. Individuals acquire protection from disease through repeated exposure, and this immunity plays a crucial role in the dynamics of malaria spread. We develop a novel age-structured PDE malaria model, which couples vector-host epidemiological dynamics with immunity dynamics. Our model tracks the acquisition and loss of anti-disease immunity during transmission and its corresponding nonlinear feedback onto the transmission parameters. We derive the basic reproduction number ($\mathcal{R}_0$) as the threshold condition for the stability of disease-free equilibrium; we also interpret $\mathcal{R}_0$ probabilistically as a weighted sum of cases generated by infected individuals at different infectious stages and different ages. We parametrize our model using demographic and immunological data from sub-Saharan regions. Numerical bifurcation analysis demonstrates the existence of an endemic equilibrium, and we observe a forward bifurcation in $\mathcal{R}_0$. Our numerical simulations reproduce the heterogeneity in the age distributions of immunity profiles and infection status created by frequent exposure. Motivated by the recently approved RTS,S vaccine, we also study the impact of vaccination; our results show a reduction in severe disease among young children but a small increase in severe malaria among older children due to lower acquired immunity from delayed exposure.
Submitted 25 January, 2023; v1 submitted 23 December, 2021;
originally announced December 2021.
-
Localizing Objects with Self-Supervised Transformers and no Labels
Authors:
Oriane Siméoni,
Gilles Puy,
Huy V. Vo,
Simon Roburin,
Spyros Gidaris,
Andrei Bursuc,
Patrick Pérez,
Renaud Marlet,
Jean Ponce
Abstract:
Localizing objects in image collections without supervision can help to avoid expensive annotation campaigns. We propose a simple approach to this problem, that leverages the activation features of a vision transformer pre-trained in a self-supervised manner. Our method, LOST, does not require any external object proposal nor any exploration of the image collection; it operates on a single image. Yet, we outperform state-of-the-art object discovery methods by up to 8 CorLoc points on PASCAL VOC 2012. We also show that training a class-agnostic detector on the discovered objects boosts results by another 7 points. Moreover, we show promising results on the unsupervised object discovery task. The code to reproduce our results can be found at https://github.com/valeoai/LOST.
Submitted 29 September, 2021;
originally announced September 2021.
-
CCVS: Context-aware Controllable Video Synthesis
Authors:
Guillaume Le Moing,
Jean Ponce,
Cordelia Schmid
Abstract:
This presentation introduces a self-supervised learning approach to the synthesis of new video clips from old ones, with several new key elements for improved spatial resolution and realism: It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control. The prediction model is doubly autoregressive, in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module. Adversarial training of the autoencoder in the appearance and temporal domains is used to further improve the realism of its output. A quantizer inserted between the encoder and the transformer in charge of forecasting future frames in latent space (and its inverse inserted between the transformer and the decoder) adds even more flexibility by affording simple mechanisms for handling multimodal ancillary information for controlling the synthesis process (e.g., a few sample frames, an audio track, a trajectory in image space) and taking into account the intrinsically uncertain nature of the future by allowing multiple predictions. Experiments with an implementation of the proposed approach give very good qualitative and quantitative results on multiple tasks and standard benchmarks.
Submitted 26 October, 2021; v1 submitted 16 July, 2021;
originally announced July 2021.
-
Residual Reinforcement Learning from Demonstrations
Authors:
Minttu Alakuijala,
Gabriel Dulac-Arnold,
Julien Mairal,
Jean Ponce,
Cordelia Schmid
Abstract:
Residual reinforcement learning (RL) has been proposed as a way to solve challenging robotic tasks by adapting control actions from a conventional feedback controller to maximize a reward signal. We extend the residual formulation to learn from visual inputs and sparse rewards using demonstrations. Learning from images, proprioceptive inputs and a sparse task-completion reward relaxes the requirement of accessing full state features, such as object and target positions. In addition, replacing the base controller with a policy learned from demonstrations removes the dependency on a hand-engineered controller in favour of a dataset of demonstrations, which can be provided by non-experts. Our experimental evaluation on simulated manipulation tasks on a 6-DoF UR5 arm and a 28-DoF dexterous hand demonstrates that residual RL from demonstrations is able to generalize to unseen environment conditions more flexibly than either behavioral cloning or RL fine-tuning, and is capable of solving high-dimensional, sparse-reward tasks out of reach for RL from scratch.
Submitted 15 June, 2021;
originally announced June 2021.
-
Large-Scale Unsupervised Object Discovery
Authors:
Huy V. Vo,
Elena Sizikova,
Cordelia Schmid,
Patrick Pérez,
Jean Ponce
Abstract:
Existing approaches to unsupervised object discovery (UOD) do not scale up to large datasets without approximations that compromise their performance. We propose a novel formulation of UOD as a ranking problem, amenable to the arsenal of distributed methods available for eigenvalue problems and link analysis. Through the use of self-supervised features, we also demonstrate the first effective fully unsupervised pipeline for UOD. Extensive experiments on COCO and OpenImages show that, in the single-object discovery setting where a single prominent object is sought in each image, the proposed LOD (Large-scale Object Discovery) approach is on par with, or better than the state of the art for medium-scale datasets (up to 120K images), and over 37% better than the only other algorithms capable of scaling up to 1.7M images. In the multi-object discovery setting where multiple objects are sought in each image, the proposed LOD is over 14% better in average precision (AP) than all other methods for datasets ranging from 20K to 1.7M images. Using self-supervised features, we also show that the proposed method obtains state-of-the-art UOD performance on OpenImages. Our code is publicly available at https://github.com/huyvvo/LOD.
Submitted 16 November, 2021; v1 submitted 11 June, 2021;
originally announced June 2021.
-
NTIRE 2021 Challenge on Burst Super-Resolution: Methods and Results
Authors:
Goutam Bhat,
Martin Danelljan,
Radu Timofte,
Kazutoshi Akita,
Wooyeong Cho,
Haoqiang Fan,
Lanpeng Jia,
Daeshik Kim,
Bruno Lecouat,
Youwei Li,
Shuaicheng Liu,
Ziluan Liu,
Ziwei Luo,
Takahiro Maeda,
Julien Mairal,
Christian Micheloni,
Xuan Mo,
Takeru Oba,
Pavel Ostyakov,
Jean Ponce,
Sanghyeok Son,
Jian Sun,
Norimichi Ukita,
Rao Muhammad Umer,
Youliang Yan
, et al. (3 additional authors not shown)
Abstract:
This paper reviews the NTIRE2021 challenge on burst super-resolution. Given a RAW noisy burst as input, the task in the challenge was to generate a clean RGB image with 4 times higher resolution. The challenge contained two tracks: Track 1, evaluated on synthetically generated data, and Track 2, using real-world bursts from a mobile camera. In the final testing phase, 6 teams submitted results using a diverse set of solutions. The top-performing methods set a new state of the art for the burst super-resolution task.
Submitted 7 June, 2021;
originally announced June 2021.
-
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
Authors:
Adrien Bardes,
Jean Ponce,
Yann LeCun
Abstract:
Recent self-supervised methods for image representation learning are based on maximizing the agreement between embedding vectors from different views of the same image. A trivial solution is obtained when the encoder outputs constant vectors. This collapse problem is often avoided through implicit biases in the learning architecture, that often lack a clear justification or interpretation. In this paper, we introduce VICReg (Variance-Invariance-Covariance Regularization), a method that explicitly avoids the collapse problem with a simple regularization term on the variance of the embeddings along each dimension individually. VICReg combines the variance term with a decorrelation mechanism based on redundancy reduction and covariance regularization, and achieves results on par with the state of the art on several downstream tasks. In addition, we show that incorporating our new variance term into other methods helps stabilize the training and leads to performance improvements.
Submitted 28 January, 2022; v1 submitted 11 May, 2021;
originally announced May 2021.
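The variance and covariance terms described in the VICReg abstract above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the hinge form, the target standard deviation `gamma`, the stabilizer `eps`, and the function names are assumptions made for illustration. The abstract itself only states that the variance of each embedding dimension is regularized and that a covariance term decorrelates the dimensions.

```python
import numpy as np


def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge penalty that is positive when a dimension's std falls below gamma.

    z has shape (batch, dim). gamma and eps are illustrative assumptions.
    """
    std = np.sqrt(z.var(axis=0) + eps)  # per-dimension standard deviation
    return float(np.mean(np.maximum(0.0, gamma - std)))


def covariance_term(z):
    """Sum of squared off-diagonal covariance entries, scaled by the dimension."""
    n, d = z.shape
    zc = z - z.mean(axis=0)                      # center each dimension
    cov = zc.T @ zc / (n - 1)                    # (dim, dim) covariance matrix
    off_diag = cov - np.diag(np.diag(cov))       # zero out the diagonal
    return float((off_diag ** 2).sum() / d)


# A collapsed batch (every embedding identical) is penalized by the variance term,
# while a batch with well-spread dimensions is not.
collapsed = np.ones((8, 4))
spread = 3.0 * np.random.default_rng(0).normal(size=(256, 4))
print(variance_term(collapsed), variance_term(spread))
```

In this sketch the variance term is what discourages the constant-output collapse mentioned in the abstract; the covariance term only acts on correlations between dimensions and is zero for a collapsed batch.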
-
Unsupervised Layered Image Decomposition into Object Prototypes
Authors:
Tom Monnier,
Elliot Vincent,
Jean Ponce,
Mathieu Aubry
Abstract:
We present an unsupervised learning framework for decomposing images into layers of automatically discovered object models. Contrary to recent approaches that model image layers with autoencoder networks, we represent them as explicit transformations of a small set of prototypical images. Our model has three main components: (i) a set of object prototypes in the form of learnable images with a transparency channel, which we refer to as sprites; (ii) differentiable parametric functions predicting occlusions and transformation parameters necessary to instantiate the sprites in a given image; (iii) a layered image formation model with occlusion for compositing these instances into complete images including background. By jointly learning the sprites and occlusion/transformation predictors to reconstruct images, our approach not only yields accurate layered image decompositions, but also identifies object categories and instance parameters. We first validate our approach by providing results on par with the state of the art on standard multi-object synthetic benchmarks (Tetrominoes, Multi-dSprites, CLEVR6). We then demonstrate the applicability of our model to real images in tasks that include clustering (SVHN, GTSRB), cosegmentation (Weizmann Horse) and object discovery from unfiltered social network images. To the best of our knowledge, our approach is the first layered image decomposition algorithm that learns an explicit and shared concept of object type, and is robust enough to be applied to real images.
Submitted 23 August, 2021; v1 submitted 29 April, 2021;
originally announced April 2021.
-
Learning to Jointly Deblur, Demosaick and Denoise Raw Images
Authors:
Thomas Eboli,
Jian Sun,
Jean Ponce
Abstract:
We address the problem of non-blind deblurring and demosaicking of noisy raw images. We adapt an existing learning-based approach to RGB image deblurring to handle raw images by introducing a new interpretable module that jointly demosaicks and deblurs them. We train this model on RGB images converted into raw ones following a realistic invertible camera pipeline. We demonstrate the effectiveness of this model over two-stage approaches stacking demosaicking and deblurring modules on quantitative benchmarks. We also apply our approach to remove a camera's inherent blur (its color-dependent point-spread function) from real images, in essence deblurring sharp images.
Submitted 13 April, 2021;
originally announced April 2021.
-
Lucas-Kanade Reloaded: End-to-End Super-Resolution from Raw Image Bursts
Authors:
Bruno Lecouat,
Jean Ponce,
Julien Mairal
Abstract:
This presentation addresses the problem of reconstructing a high-resolution image from multiple lower-resolution snapshots captured from slightly different viewpoints in space and time. Key challenges for solving this problem include (i) aligning the input pictures with sub-pixel accuracy, (ii) handling raw (noisy) images for maximal faithfulness to native camera data, and (iii) designing/learning an image prior (regularizer) well suited to the task. We address these three challenges with a hybrid algorithm building on the insight from Wronski et al. that aliasing is an ally in this setting, with parameters that can be learned end to end, while retaining the interpretability of classical approaches to inverse problems. The effectiveness of our approach is demonstrated on synthetic and real image bursts, setting a new state of the art on several benchmarks and delivering excellent qualitative results on real raw bursts captured by smartphones and prosumer cameras.
Submitted 23 August, 2021; v1 submitted 13 April, 2021;
originally announced April 2021.
-
Air-to-Ground Directional Channel Sounder With 64-antenna Dual-polarized Cylindrical Array
Authors:
Jorge Gomez Ponce,
Thomas Choi,
Naveed A. Abbasi,
Aldo Adame,
Alexander Alvarado,
Colton Bullard,
Ruiyi Shen,
Fred Daneshgaran,
Harpreet S. Dhillon,
Andreas F. Molisch
Abstract:
Unmanned Aerial Vehicles (UAVs), popularly called drones, are an important part of future wireless communications, either as user equipment that needs communication with a ground station, or as base station in a 3D network. For both the analysis of the "useful" links, and for investigation of possible interference to other ground-based nodes, an understanding of the air-to-ground channel is required. Since ground-based nodes often are equipped with antenna arrays, the channel investigations need to account for it. This study presents a massive MIMO-based air-to-ground channel sounder we have recently developed in our lab, which can perform measurements for the aforementioned requirements. After outlining the principle and functionality of the sounder, we present sample measurements that demonstrate the capabilities, and give first insights into air-to-ground massive MIMO channels in an urban environment. Our results provide a platform for future investigations and possible enhancements of massive MIMO systems.
Submitted 13 February, 2021;
originally announced March 2021.
-
Learning to Compose Hypercolumns for Visual Correspondence
Authors:
Juhong Min,
Jongmin Lee,
Jean Ponce,
Minsu Cho
Abstract:
Feature representation plays a crucial role in visual correspondence, and recent methods for image matching resort to deeply stacked convolutional layers. These models, however, are both monolithic and static in the sense that they typically use a specific level of features, e.g., the output of the last layer, and adhere to it regardless of the images to match. In this work, we introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match. Inspired by both multi-layer feature composition in object detection and adaptive inference architectures in classification, the proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network. We demonstrate the effectiveness on the task of semantic correspondence, i.e., establishing correspondences between images depicting different instances of the same object or scene category. Experiments on standard benchmarks show that the proposed method greatly improves matching performance over the state of the art in an adaptive and efficient manner.
Submitted 21 July, 2020;
originally announced July 2020.
-
Toward unsupervised, multi-object discovery in large-scale image collections
Authors:
Huy V. Vo,
Patrick Pérez,
Jean Ponce
Abstract:
This paper addresses the problem of discovering the objects present in a collection of images without any supervision. We build on the optimization approach of Vo et al. (CVPR'19) with several key novelties: (1) We propose a novel saliency-based region proposal algorithm that achieves significantly higher overlap with ground-truth objects than other competitive methods. This procedure leverages off-the-shelf CNN features trained on classification tasks without any bounding box information, but is otherwise unsupervised. (2) We exploit the inherent hierarchical structure of proposals as an effective regularizer for the approach to object discovery of Vo et al., boosting its performance to significantly improve over the state of the art on several standard benchmarks. (3) We adopt a two-stage strategy to select promising proposals using small random sets of images before using the whole image collection to discover the objects it depicts, allowing us to tackle, for the first time (to the best of our knowledge), the discovery of multiple objects in each one of the pictures making up datasets with up to 20,000 images, an over five-fold increase compared to existing methods, and a first step toward true large-scale unsupervised image interpretation.
Submitted 25 August, 2020; v1 submitted 6 July, 2020;
originally announced July 2020.
-
End-to-end Interpretable Learning of Non-blind Image Deblurring
Authors:
Thomas Eboli,
Jian Sun,
Jean Ponce
Abstract:
Non-blind image deblurring is typically formulated as a linear least-squares problem regularized by natural priors on the corresponding sharp picture's gradients, which can be solved, for example, using a half-quadratic splitting method with Richardson fixed-point iterations for its least-squares updates and a proximal operator for the auxiliary variable updates. We propose to precondition the Richardson solver using approximate inverse filters of the (known) blur and natural image prior kernels. Using convolutions instead of a generic linear preconditioner allows extremely efficient parameter sharing across the image, and leads to significant gains in accuracy and/or speed compared to classical FFT and conjugate-gradient methods. More importantly, the proposed architecture is easily adapted to learning both the preconditioner and the proximal operator using CNN embeddings. This yields a simple and efficient algorithm for non-blind image deblurring which is fully interpretable, can be learned end to end, and whose accuracy matches or exceeds the state of the art, quite significantly, in the non-uniform case.
Submitted 15 September, 2020; v1 submitted 3 July, 2020;
originally announced July 2020.
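A preconditioned Richardson iteration of the kind mentioned in the abstract above can be sketched on a small dense linear system. This is an illustration only: the paper preconditions with approximate inverse filters applied by convolution, whereas this sketch swaps in a plain diagonal (Jacobi) preconditioner. The matrix, the function name `preconditioned_richardson`, and the iteration count are hypothetical.

```python
import numpy as np


def preconditioned_richardson(A, b, C, num_iters=50):
    """Iterate x <- x + C (b - A x), where C approximates the inverse of A."""
    x = np.zeros_like(b, dtype=float)
    for _ in range(num_iters):
        residual = b - A @ x      # how far the current estimate is from solving A x = b
        x = x + C @ residual      # preconditioned correction step
    return x


# Small symmetric, diagonally dominant system (hypothetical values).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
C = np.diag(1.0 / np.diag(A))     # Jacobi preconditioner: inverse of the diagonal of A

x = preconditioned_richardson(A, b, C)
print(x, np.linalg.solve(A, b))
```

The iteration converges when the spectral radius of I - CA is below one, so the closer C is to the inverse of A, the fewer iterations are needed; that is the role a better preconditioner plays.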
-
A Flexible Framework for Designing Trainable Priors with Adaptive Smoothing and Game Encoding
Authors:
Bruno Lecouat,
Jean Ponce,
Julien Mairal
Abstract:
We introduce a general framework for designing and training neural network layers whose forward passes can be interpreted as solving non-smooth convex optimization problems, and whose architectures are derived from an optimization algorithm. We focus on convex games, solved by local agents represented by the nodes of a graph and interacting through regularization functions. This approach is appealing for solving imaging problems, as it allows the use of classical image priors within deep models that are trainable end to end. The priors used in this presentation include variants of total variation, Laplacian regularization, bilateral filtering, sparse coding on learned dictionaries, and non-local self similarities. Our models are fully interpretable as well as parameter and data efficient. Our experiments demonstrate their effectiveness on a large diversity of tasks ranging from image denoising and compressed sensing for fMRI to dense stereo matching.
Submitted 9 November, 2020; v1 submitted 26 June, 2020;
originally announced June 2020.
-
Structured and Localized Image Restoration
Authors:
Thomas Eboli,
Alex Nowak-Vila,
Jian Sun,
Francis Bach,
Jean Ponce,
Alessandro Rudi
Abstract:
We present a novel approach to image restoration that leverages ideas from localized structured prediction and non-linear multi-task learning. We optimize a penalized energy function regularized by a sum of terms measuring the distance between patches to be restored and clean patches from an external database gathered beforehand. The resulting estimator comes with strong statistical guarantees leveraging local dependency properties of overlapping patches. We derive the corresponding algorithms for energies based on the mean-squared and Euclidean norm errors. Finally, we demonstrate the practical effectiveness of our model on different image restoration problems using standard benchmarks.
Submitted 16 June, 2020;
originally announced June 2020.
-
Fully Trainable and Interpretable Non-Local Sparse Models for Image Restoration
Authors:
Bruno Lecouat,
Jean Ponce,
Julien Mairal
Abstract:
Non-local self-similarity and sparsity principles have proven to be powerful priors for natural image modeling. We propose a novel differentiable relaxation of joint sparsity that exploits both principles and leads to a general framework for image restoration which is (1) trainable end to end, (2) fully interpretable, and (3) much more compact than competing deep learning architectures. We apply this approach to denoising, JPEG deblocking, and demosaicking, and show that, with as few as 100K parameters, its performance on several standard benchmarks is on par with or better than state-of-the-art methods that may have an order of magnitude or more parameters.
Submitted 20 August, 2020; v1 submitted 5 December, 2019;
originally announced December 2019.
-
Learning Semantic Correspondence Exploiting an Object-level Prior
Authors:
Junghyup Lee,
Dohyung Kim,
Wonkyung Lee,
Jean Ponce,
Bumsub Ham
Abstract:
We address the problem of semantic correspondence, that is, establishing a dense flow field between images depicting different instances of the same object or scene category. We propose to use images annotated with binary foreground masks and subjected to synthetic geometric deformations to train a convolutional neural network (CNN) for this task. Using these masks as part of the supervisory signal provides an object-level prior for the semantic correspondence task and offers a good compromise between semantic flow methods, where the amount of training data is limited by the cost of manually selecting point correspondences, and semantic alignment ones, where the regression of a single global geometric transformation between images may be sensitive to image-specific details such as background clutter. We propose a new CNN architecture, dubbed SFNet, which implements this idea. It leverages a new and differentiable version of the argmax function for end-to-end training, with a loss that combines mask and flow consistency with smoothness terms. Experimental results demonstrate the effectiveness of our approach, which significantly outperforms the state of the art on standard benchmarks.
Submitted 21 July, 2020; v1 submitted 28 November, 2019;
originally announced November 2019.
-
Deformable Kernel Networks for Joint Image Filtering
Authors:
Beomjun Kim,
Jean Ponce,
Bumsub Ham
Abstract:
Joint image filters are used to transfer structural details from a guidance picture used as a prior to a target image, in tasks such as enhancing spatial resolution and suppressing noise. Previous methods based on convolutional neural networks (CNNs) combine nonlinear activations of spatially-invariant kernels to estimate structural details and regress the filtering result. In this paper, we instead learn explicitly sparse and spatially-variant kernels. We propose a CNN architecture and its efficient implementation, called the deformable kernel network (DKN), that outputs sets of neighbors and the corresponding weights adaptively for each pixel. The filtering result is then computed as a weighted average. We also propose a fast version of DKN that runs about seventeen times faster for an image of size 640 x 480. We demonstrate the effectiveness and flexibility of our models on the tasks of depth map upsampling, saliency map upsampling, cross-modality image restoration, texture removal, and semantic segmentation. In particular, we show that the weighted averaging process with sparsely sampled 3 x 3 kernels outperforms the state of the art by a significant margin in all cases.
Submitted 20 October, 2020; v1 submitted 17 October, 2019;
originally announced October 2019.
-
SPair-71k: A Large-scale Benchmark for Semantic Correspondence
Authors:
Juhong Min,
Jongmin Lee,
Jean Ponce,
Minsu Cho
Abstract:
Establishing visual correspondences under large intra-class variations, which is often referred to as semantic correspondence or semantic matching, remains a challenging problem in computer vision. Despite its significance, however, most of the datasets for semantic correspondence are limited to a small number of image pairs with similar viewpoints and scales. In this paper, we present a new large-scale benchmark dataset of semantically paired images, SPair-71k, which contains 70,958 image pairs with diverse variations in viewpoint and scale. Compared to previous datasets, it is significantly larger in number and contains more accurate and richer annotations. We believe this dataset will provide a reliable testbed to study the problem of semantic correspondence and will help to advance research in this area. We provide the results of recent methods on our new dataset as baselines for further research. Our benchmark is available online at http://cvlab.postech.ac.kr/research/SPair-71k/.
Submitted 28 August, 2019;
originally announced August 2019.
-
Hyperpixel Flow: Semantic Correspondence with Multi-layer Neural Features
Authors:
Juhong Min,
Jongmin Lee,
Jean Ponce,
Minsu Cho
Abstract:
Establishing visual correspondences under large intra-class variations requires analyzing images at different levels, from features linked to semantics and context to local patterns, while being invariant to instance-specific details. To tackle these challenges, we represent images by "hyperpixels" that leverage a small number of relevant features selected among early to late layers of a convolutional neural network. Taking advantage of the condensed features of hyperpixels, we develop an effective real-time matching algorithm based on Hough geometric voting. The proposed method, hyperpixel flow, sets a new state of the art on three standard benchmarks as well as a new dataset, SPair-71k, which contains a significantly larger number of image pairs than existing datasets, with more accurate and richer annotations for in-depth analysis.
Submitted 18 August, 2019;
originally announced August 2019.
-
Unsupervised Image Matching and Object Discovery as Optimization
Authors:
Huy V. Vo,
Francis Bach,
Minsu Cho,
Kai Han,
Yann LeCun,
Patrick Perez,
Jean Ponce
Abstract:
Learning with complete or partial supervision is powerful but relies on ever-growing human annotation efforts. As a way to mitigate this serious problem, as well as to serve specific applications, unsupervised learning has emerged as an important field of research. In computer vision, unsupervised learning comes in various guises. We focus here on the unsupervised discovery and matching of object categories among images in a collection, following the work of Cho et al. 2015. We show that the original approach can be reformulated and solved as a proper optimization problem. Experiments on several benchmarks establish the merit of our approach.
Submitted 5 April, 2019;
originally announced April 2019.
-
SFNet: Learning Object-aware Semantic Correspondence
Authors:
Junghyup Lee,
Dohyung Kim,
Jean Ponce,
Bumsub Ham
Abstract:
We address the problem of semantic correspondence, that is, establishing a dense flow field between images depicting different instances of the same object or scene category. We propose to use images annotated with binary foreground masks and subjected to synthetic geometric deformations to train a convolutional neural network (CNN) for this task. Using these masks as part of the supervisory signal offers a good compromise between semantic flow methods, where the amount of training data is limited by the cost of manually selecting point correspondences, and semantic alignment ones, where the regression of a single global geometric transformation between images may be sensitive to image-specific details such as background clutter. We propose a new CNN architecture, dubbed SFNet, which implements this idea. It leverages a new and differentiable version of the argmax function for end-to-end training, with a loss that combines mask and flow consistency with smoothness terms. Experimental results demonstrate the effectiveness of our approach, which significantly outperforms the state of the art on standard benchmarks.
Submitted 4 April, 2019; v1 submitted 3 April, 2019;
originally announced April 2019.
-
Deformable kernel networks for guided depth map upsampling
Authors:
Beomjun Kim,
Jean Ponce,
Bumsub Ham
Abstract:
We address the problem of upsampling a low-resolution (LR) depth map using a registered high-resolution (HR) color image of the same scene. Previous methods based on convolutional neural networks (CNNs) combine nonlinear activations of spatially-invariant kernels to estimate structural details from LR depth and HR color images, and regress upsampling results directly from the networks. In this paper, we revisit the weighted averaging process that has been widely used to transfer structural details from hand-crafted visual features to LR depth maps. We instead learn explicitly sparse and spatially-variant kernels for this task. To this end, we propose a CNN architecture and its efficient implementation, called the deformable kernel network (DKN), that outputs sparse sets of neighbors and the corresponding weights adaptively for each pixel. We also propose a fast version of DKN (FDKN) that runs about 17 times faster (0.01 seconds for a HR image of size 640 x 480). Experimental results on standard benchmarks demonstrate the effectiveness of our approach. In particular, we show that the weighted averaging process with 3 x 3 kernels (i.e., aggregating 9 samples sparsely chosen) outperforms the state of the art by a significant margin.
Submitted 27 March, 2019;
originally announced March 2019.
-
Enable Super-resolution Parameter Estimation for Mm-wave Channel Sounding
Authors:
Rui Wang,
C. Umit Bas,
Zihang Cheng,
Thomas Choi,
Hao Feng,
Zheda Li,
XiaoKang Ye,
Pan Tang,
Seun Sangodoyin,
Jorge G. Ponce,
Robert Monroe,
Thomas Henige,
Gary Xu,
Jianzhong Zhang,
Jeongho Park,
Andreas F. Molisch
Abstract:
This paper investigates the capability of millimeter-wave (mmWave) channel sounders with phased arrays to perform super-resolution parameter estimation, i.e., determine the parameters of multipath components (MPC), such as direction of arrival and delay, with resolution better than the Fourier resolution of the setup. We analyze the question both generally and with respect to a particular novel multi-beam mmWave channel sounder that is capable of performing multiple-input-multiple-output (MIMO) measurements in dynamic environments. First, we propose a novel two-step calibration procedure that provides the higher-accuracy calibration data required for Rimax or SAGE. Second, we investigate the impact of center misalignment and residual phase noise on the performance of the parameter estimator. Finally, we experimentally verify the calibration results and demonstrate the capability of our sounder to perform super-resolution parameter estimation.
Submitted 23 January, 2019;
originally announced January 2019.
-
On the Solvability of Viewing Graphs
Authors:
Matthew Trager,
Brian Osserman,
Jean Ponce
Abstract:
A set of fundamental matrices relating pairs of cameras in some configuration can be represented as edges of a "viewing graph". Whether or not these fundamental matrices are generically sufficient to recover the global camera configuration depends on the structure of this graph. We study characterizations of "solvable" viewing graphs and present several new results that can be applied to determine which pairs of views may be used to recover all camera parameters. We also discuss strategies for verifying the solvability of a graph computationally.
Submitted 18 September, 2018; v1 submitted 8 August, 2018;
originally announced August 2018.
-
SCNet: Learning Semantic Correspondence
Authors:
Kai Han,
Rafael S. Rezende,
Bumsub Ham,
Kwan-Yee K. Wong,
Minsu Cho,
Cordelia Schmid,
Jean Ponce
Abstract:
This paper addresses the problem of establishing semantic correspondences between images depicting different instances of the same object or scene category. Previous approaches focus on either combining a spatial regularizer with hand-crafted features, or learning a correspondence model for appearance only. We propose instead a convolutional neural network architecture, called SCNet, for learning a geometrically plausible model for semantic correspondence. SCNet uses region proposals as matching primitives, and explicitly incorporates geometric consistency in its loss function. It is trained on image pairs obtained from the PASCAL VOC 2007 keypoint dataset, and a comparative evaluation on several standard benchmarks demonstrates that the proposed approach substantially outperforms both recent deep learning architectures and previous methods based on hand-crafted features.
Submitted 17 August, 2017; v1 submitted 11 May, 2017;
originally announced May 2017.
-
Proposal Flow: Semantic Correspondences from Object Proposals
Authors:
Bumsub Ham,
Minsu Cho,
Cordelia Schmid,
Jean Ponce
Abstract:
Finding image correspondences remains a challenging problem in the presence of intra-class variations and large changes in scene layout. Semantic flow methods are designed to handle images depicting different instances of the same object or scene category. We introduce a novel approach to semantic flow, dubbed proposal flow, that establishes reliable correspondences using object proposals. Unlike prevailing semantic flow approaches that operate on pixels or regularly sampled local regions, proposal flow benefits from the characteristics of modern object proposals, that exhibit high repeatability at multiple scales, and can take advantage of both local and geometric consistency constraints among proposals. We also show that the corresponding sparse proposal flow can effectively be transformed into a conventional dense flow field. We introduce two new challenging datasets that can be used to evaluate both general semantic flow techniques and region-based approaches such as proposal flow. We use these benchmarks to compare different matching algorithms, object proposals, and region features within proposal flow, to the state of the art in semantic flow. This comparison, along with experiments on standard datasets, demonstrates that proposal flow significantly outperforms existing semantic flow methods in various settings.
Submitted 21 March, 2017;
originally announced March 2017.