Search | arXiv e-print repository

Distilling the Knowledge in Data Pruning

Authors: Emanuel Ben-Baruch, Adam Botach, Igor Kviatkovsky, Manoj Aggarwal, Gérard Medioni

Abstract: With the increasing size of datasets used for training neural networks, data pruning becomes an attractive field of research. However, most current data pruning algorithms are limited in their ability to preserve accuracy compared to models trained on the full data, especially in high pruning regimes. In this paper we explore the application of data pruning while incorporating knowledge distillati… ▽ More With the increasing size of datasets used for training neural networks, data pruning becomes an attractive field of research. However, most current data pruning algorithms are limited in their ability to preserve accuracy compared to models trained on the full data, especially in high pruning regimes. In this paper we explore the application of data pruning while incorporating knowledge distillation (KD) when training on a pruned subset. That is, rather than relying solely on ground-truth labels, we also use the soft predictions from a teacher network pre-trained on the complete data. By integrating KD into training, we demonstrate significant improvement across datasets, pruning methods, and on all pruning fractions. We first establish a theoretical motivation for employing self-distillation to improve training on pruned data. Then, we empirically make a compelling and highly practical observation: using KD, simple random pruning is comparable or superior to sophisticated pruning methods across all pruning regimes. On ImageNet for example, we achieve superior accuracy despite training on a random subset of only 50% of the data. Additionally, we demonstrate a crucial connection between the pruning factor and the optimal knowledge distillation weight. This helps mitigate the impact of samples with noisy labels and low-quality images retained by typical pruning algorithms. Finally, we make an intriguing observation: when using lower pruning fractions, larger teachers lead to accuracy degradation, while surprisingly, employing teachers with a smaller capacity than the student's may improve results. Our code will be made available. △ Less

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2310.19024 [pdf, other]

FPGAN-Control: A Controllable Fingerprint Generator for Training with Synthetic Data

Authors: Alon Shoshan, Nadav Bhonker, Emanuel Ben Baruch, Ori Nizan, Igor Kviatkovsky, Joshua Engelsma, Manoj Aggarwal, Gerard Medioni

Abstract: Training fingerprint recognition models using synthetic data has recently gained increased attention in the biometric community as it alleviates the dependency on sensitive personal data. Existing approaches for fingerprint generation are limited in their ability to generate diverse impressions of the same finger, a key property for providing effective data for training recognition models. To addr… ▽ More Training fingerprint recognition models using synthetic data has recently gained increased attention in the biometric community as it alleviates the dependency on sensitive personal data. Existing approaches for fingerprint generation are limited in their ability to generate diverse impressions of the same finger, a key property for providing effective data for training recognition models. To address this gap, we present FPGAN-Control, an identity preserving image generation framework which enables control over the fingerprint's image appearance (e.g., fingerprint type, acquisition device, pressure level) of generated fingerprints. We introduce a novel appearance loss that encourages disentanglement between the fingerprint's identity and appearance properties. In our experiments, we used the publicly available NIST SD302 (N2N) dataset for training the FPGAN-Control model. We demonstrate the merits of FPGAN-Control, both quantitatively and qualitatively, in terms of identity preservation level, degree of appearance control, and low synthetic-to-real domain gap. Finally, training recognition models using only synthetic datasets generated by FPGAN-Control lead to recognition accuracies that are on par or even surpass models trained using real data. To the best of our knowledge, this is the first work to demonstrate this. △ Less

Submitted 29 October, 2023; originally announced October 2023.

arXiv:2304.07389 [pdf, other]

Shape of You: Precise 3D shape estimations for diverse body types

Authors: Rohan Sarkar, Achal Dave, Gerard Medioni, Benjamin Biggs

Abstract: This paper presents Shape of You (SoY), an approach to improve the accuracy of 3D body shape estimation for vision-based clothing recommendation systems. While existing methods have successfully estimated 3D poses, there remains a lack of work in precise shape estimation, particularly for diverse human bodies. To address this gap, we propose two loss functions that can be readily integrated into p… ▽ More This paper presents Shape of You (SoY), an approach to improve the accuracy of 3D body shape estimation for vision-based clothing recommendation systems. While existing methods have successfully estimated 3D poses, there remains a lack of work in precise shape estimation, particularly for diverse human bodies. To address this gap, we propose two loss functions that can be readily integrated into parametric 3D human reconstruction pipelines. Additionally, we propose a test-time optimization routine that further improves quality. Our method improves over the recent SHAPY method by 17.7% on the challenging SSP-3D dataset. We consider our work to be a step towards a more accurate 3D shape estimation system that works reliably on diverse body types and holds promise for practical applications in the fashion industry. △ Less

Submitted 14 April, 2023; originally announced April 2023.

arXiv:2304.03879 [pdf, other]

GPT4Rec: A Generative Framework for Personalized Recommendation and User Interests Interpretation

Authors: **ming Li, Wentao Zhang, Tian Wang, Guanglei Xiong, Alan Lu, Gerard Medioni

Abstract: Recent advancements in Natural Language Processing (NLP) have led to the development of NLP-based recommender systems that have shown superior performance. However, current models commonly treat items as mere IDs and adopt discriminative modeling, resulting in limitations of (1) fully leveraging the content information of items and the language modeling capabilities of NLP models; (2) interpreting… ▽ More Recent advancements in Natural Language Processing (NLP) have led to the development of NLP-based recommender systems that have shown superior performance. However, current models commonly treat items as mere IDs and adopt discriminative modeling, resulting in limitations of (1) fully leveraging the content information of items and the language modeling capabilities of NLP models; (2) interpreting user interests to improve relevance and diversity; and (3) adapting practical circumstances such as growing item inventories. To address these limitations, we present GPT4Rec, a novel and flexible generative framework inspired by search engines. It first generates hypothetical "search queries" given item titles in a user's history, and then retrieves items for recommendation by searching these queries. The framework overcomes previous limitations by learning both user and item embeddings in the language space. To well-capture user interests with different aspects and granularity for improving relevance and diversity, we propose a multi-query generation technique with beam search. The generated queries naturally serve as interpretable representations of user interests and can be searched to recommend cold-start items. With GPT-2 language model and BM25 search engine, our framework outperforms state-of-the-art methods by $75.7\%$ and $22.2\%$ in Recall@K on two public datasets. Experiments further revealed that multi-query generation with beam search improves both the diversity of retrieved items and the coverage of a user's multi-interests. The adaptiveness and interpretability of generated queries are discussed with qualitative case studies. △ Less

Submitted 7 April, 2023; originally announced April 2023.

arXiv:2303.17531 [pdf, other]

Asymmetric Image Retrieval with Cross Model Compatible Ensembles

Authors: Ori Linial, Alon Shoshan, Nadav Bhonker, Elad Hirsch, Lior Zamir, Igor Kviatkovsky, Gerard Medioni

Abstract: The asymmetrical retrieval setting is a well suited solution for resource constrained applications such as face recognition and image retrieval. In this setting, a large model is used for indexing the gallery while a lightweight model is used for querying. The key principle in such systems is ensuring that both models share the same embedding space. Most methods in this domain are based on knowled… ▽ More The asymmetrical retrieval setting is a well suited solution for resource constrained applications such as face recognition and image retrieval. In this setting, a large model is used for indexing the gallery while a lightweight model is used for querying. The key principle in such systems is ensuring that both models share the same embedding space. Most methods in this domain are based on knowledge distillation. While useful, they suffer from several drawbacks: they are upper-bounded by the performance of the single best model found and cannot be extended to use an ensemble of models in a straightforward manner. In this paper we present an approach that does not rely on knowledge distillation, rather it utilizes embedding transformation models. This allows the use of N independently trained and diverse gallery models (e.g., trained on different datasets or having a different architecture) and a single query model. As a result, we improve the overall accuracy beyond that of any single model while maintaining a low computational budget for querying. Additionally, we propose a gallery image rejection method that utilizes the diversity between multiple transformed embeddings to estimate the uncertainty of gallery images. △ Less

Submitted 29 October, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

arXiv:2210.13994 [pdf, other]

Minutiae-Guided Fingerprint Embeddings via Vision Transformers

Authors: Steven A. Grosz, Joshua J. Engelsma, Rajeev Ranjan, Naveen Ramakrishnan, Manoj Aggarwal, Gerard G. Medioni, Anil K. Jain

Abstract: Minutiae matching has long dominated the field of fingerprint recognition. However, deep networks can be used to extract fixed-length embeddings from fingerprints. To date, the few studies that have explored the use of CNN architectures to extract such embeddings have shown extreme promise. Inspired by these early works, we propose the first use of a Vision Transformer (ViT) to learn a discriminat… ▽ More Minutiae matching has long dominated the field of fingerprint recognition. However, deep networks can be used to extract fixed-length embeddings from fingerprints. To date, the few studies that have explored the use of CNN architectures to extract such embeddings have shown extreme promise. Inspired by these early works, we propose the first use of a Vision Transformer (ViT) to learn a discriminative fixed-length fingerprint embedding. We further demonstrate that by guiding the ViT to focus in on local, minutiae related features, we can boost the recognition performance. Finally, we show that by fusing embeddings learned by CNNs and ViTs we can reach near parity with a commercial state-of-the-art (SOTA) matcher. In particular, we obtain a TAR=94.23% @ FAR=0.1% on the NIST SD 302 public-domain dataset, compared to a SOTA commercial matcher which obtains TAR=96.71% @ FAR=0.1%. Additionally, our fixed-length embeddings can be matched orders of magnitude faster than the commercial system (2.5 million matches/second compared to 50K matches/second). We make our code and models publicly available to encourage further research on this topic: https://github.com/tba. △ Less

Submitted 25 October, 2022; v1 submitted 25 October, 2022; originally announced October 2022.

arXiv:2207.12033 [pdf, other]

Contrastive Learning for Interactive Recommendation in Fashion

Authors: Karin Sevegnani, Arjun Seshadri, Tian Wang, Anurag Beniwal, Julian McAuley, Alan Lu, Gerard Medioni

Abstract: Recommender systems and search are both indispensable in facilitating personalization and ease of browsing in online fashion platforms. However, the two tools often operate independently, failing to combine the strengths of recommender systems to accurately capture user tastes with search systems' ability to process user queries. We propose a novel remedy to this problem by automatically recommend… ▽ More Recommender systems and search are both indispensable in facilitating personalization and ease of browsing in online fashion platforms. However, the two tools often operate independently, failing to combine the strengths of recommender systems to accurately capture user tastes with search systems' ability to process user queries. We propose a novel remedy to this problem by automatically recommending personalized fashion items based on a user-provided text request. Our proposed model, WhisperLite, uses contrastive learning to capture user intent from natural language text and improves the recommendation quality of fashion products. WhisperLite combines the strength of CLIP embeddings with additional neural network layers for personalization, and is trained using a composite loss function based on binary cross entropy and contrastive loss. The model demonstrates a significant improvement in offline recommendation retrieval metrics when tested on a real-world dataset collected from an online retail fashion store, as well as widely used open-source datasets in different e-commerce domains, such as restaurants, movies and TV shows, clothing and shoe reviews. We additionally conduct a user study that captures user judgements on the relevance of the model's recommended items, confirming the relevancy of WhisperLite's recommendations in an online setting. △ Less

Submitted 25 July, 2022; originally announced July 2022.

arXiv:2204.10869 [pdf, other]

Identity Preserving Loss for Learned Image Compression

Authors: Jiuhong Xiao, Lavisha Aggarwal, Prithviraj Banerjee, Manoj Aggarwal, Gerard Medioni

Abstract: Deep learning model inference on embedded devices is challenging due to the limited availability of computation resources. A popular alternative is to perform model inference on the cloud, which requires transmitting images from the embedded device to the cloud. Image compression techniques are commonly employed in such cloud-based architectures to reduce transmission latency over low bandwidth ne… ▽ More Deep learning model inference on embedded devices is challenging due to the limited availability of computation resources. A popular alternative is to perform model inference on the cloud, which requires transmitting images from the embedded device to the cloud. Image compression techniques are commonly employed in such cloud-based architectures to reduce transmission latency over low bandwidth networks. This work proposes an end-to-end image compression framework that learns domain-specific features to achieve higher compression ratios than standard HEVC/JPEG compression techniques while maintaining accuracy on downstream tasks (e.g., recognition). Our framework does not require fine-tuning of the downstream task, which allows us to drop-in any off-the-shelf downstream task model without retraining. We choose faces as an application domain due to the ready availability of datasets and off-the-shelf recognition models as representative downstream tasks. We present a novel Identity Preserving Reconstruction (IPR) loss function which achieves Bits-Per-Pixel (BPP) values that are ~38% and ~42% of CRF-23 HEVC compression for LFW (low-resolution) and CelebA-HQ (high-resolution) datasets, respectively, while maintaining parity in recognition accuracy. The superior compression ratio is achieved as the model learns to retain the domain-specific features (e.g., facial features) while sacrificing details in the background. Furthermore, images reconstructed by our proposed compression model are robust to changes in downstream model architectures. We show at-par recognition performance on the LFW dataset with an unseen recognition model while retaining a lower BPP value of ~38% of CRF-23 HEVC compression. △ Less

Submitted 26 April, 2022; v1 submitted 22 April, 2022; originally announced April 2022.

Comments: Accepted by CVPR 2022 Workshop on New Trends in Image Restoration and Enhancement and Challenges

arXiv:2204.04812 [pdf, other]

OutfitTransformer: Learning Outfit Representations for Fashion Recommendation

Authors: Rohan Sarkar, Navaneeth Bodla, Mariya I. Vasileva, Yen-Liang Lin, Anurag Beniwal, Alan Lu, Gerard Medioni

Abstract: Learning an effective outfit-level representation is critical for predicting the compatibility of items in an outfit, and retrieving complementary items for a partial outfit. We present a framework, OutfitTransformer, that uses the proposed task-specific tokens and leverages the self-attention mechanism to learn effective outfit-level representations encoding the compatibility relationships betwee… ▽ More Learning an effective outfit-level representation is critical for predicting the compatibility of items in an outfit, and retrieving complementary items for a partial outfit. We present a framework, OutfitTransformer, that uses the proposed task-specific tokens and leverages the self-attention mechanism to learn effective outfit-level representations encoding the compatibility relationships between all items in the entire outfit for addressing both compatibility prediction and complementary item retrieval tasks. For compatibility prediction, we design an outfit token to capture a global outfit representation and train the framework using a classification loss. For complementary item retrieval, we design a target item token that additionally takes the target item specification (in the form of a category or text description) into consideration. We train our framework using a proposed set-wise outfit ranking loss to generate a target item embedding given an outfit, and a target item specification as inputs. The generated target item embedding is then used to retrieve compatible items that match the rest of the outfit. Additionally, we adopt a pre-training approach and a curriculum learning strategy to improve retrieval performance. Since our framework learns at an outfit-level, it allows us to learn a single embedding capturing higher-order relations among multiple items in the outfit more effectively than pairwise methods. Experiments demonstrate that our approach outperforms state-of-the-art methods on compatibility prediction, fill-in-the-blank, and complementary item retrieval tasks. We further validate the quality of our retrieval results with a user study. △ Less

Submitted 15 April, 2022; v1 submitted 10 April, 2022; originally announced April 2022.

arXiv:2203.01853 [pdf, other]

Efficient Video Instance Segmentation via Tracklet Query and Proposal

Authors: Jialian Wu, Sudhir Yarram, Hui Liang, Tian Lan, Junsong Yuan, Jayan Eledath, Gerard Medioni

Abstract: Video Instance Segmentation (VIS) aims to simultaneously classify, segment, and track multiple object instances in videos. Recent clip-level VIS takes a short video clip as input each time showing stronger performance than frame-level VIS (tracking-by-segmentation), as more temporal context from multiple frames is utilized. Yet, most clip-level methods are neither end-to-end learnable nor real-tim… ▽ More Video Instance Segmentation (VIS) aims to simultaneously classify, segment, and track multiple object instances in videos. Recent clip-level VIS takes a short video clip as input each time showing stronger performance than frame-level VIS (tracking-by-segmentation), as more temporal context from multiple frames is utilized. Yet, most clip-level methods are neither end-to-end learnable nor real-time. These limitations are addressed by the recent VIS transformer (VisTR) which performs VIS end-to-end within a clip. However, VisTR suffers from long training time due to its frame-wise dense attention. In addition, VisTR is not fully end-to-end learnable in multiple video clips as it requires a hand-crafted data association to link instance tracklets between successive clips. This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference. At the core are tracklet query and tracklet proposal that associate and segment regions-of-interest (RoIs) across space and time by an iterative query-video interaction. We further propose a correspondence learning that makes tracklets linking between clips end-to-end learnable. Compared to VisTR, EfficientVIS requires 15x fewer training epochs while achieving state-of-the-art accuracy on the YouTube-VIS benchmark. Meanwhile, our method enables whole video instance segmentation in a single end-to-end pass without data association at all. △ Less

Submitted 3 March, 2022; originally announced March 2022.

Comments: Accepted to CVPR 2022

arXiv:2105.00717 [pdf, other]

Synthetic Data for Model Selection

Authors: Alon Shoshan, Nadav Bhonker, Igor Kviatkovsky, Matan Fintz, Gerard Medioni

Abstract: Recent breakthroughs in synthetic data generation approaches made it possible to produce highly photorealistic images which are hardly distinguishable from real ones. Furthermore, synthetic generation pipelines have the potential to generate an unlimited number of images. The combination of high photorealism and scale turn synthetic data into a promising candidate for improving various machine lea… ▽ More Recent breakthroughs in synthetic data generation approaches made it possible to produce highly photorealistic images which are hardly distinguishable from real ones. Furthermore, synthetic generation pipelines have the potential to generate an unlimited number of images. The combination of high photorealism and scale turn synthetic data into a promising candidate for improving various machine learning (ML) pipelines. Thus far, a large body of research in this field has focused on using synthetic images for training, by augmenting and enlarging training data. In contrast to using synthetic data for training, in this work we explore whether synthetic data can be beneficial for model selection. Considering the task of image classification, we demonstrate that when data is scarce, synthetic data can be used to replace the held out validation set, thus allowing to train on a larger dataset. We also introduce a novel method to calibrate the synthetic error estimation to fit that of the real domain. We show that such calibration significantly improves the usefulness of synthetic data for model selection. △ Less

Submitted 5 July, 2023; v1 submitted 3 May, 2021; originally announced May 2021.

arXiv:2103.02221 [pdf, other]

Energy-Based Learning for Scene Graph Generation

Authors: Mohammed Suhail, Abhay Mittal, Behjat Siddiquie, Chris Broaddus, Jayan Eledath, Gerard Medioni, Leonid Sigal

Abstract: Traditional scene graph generation methods are trained using cross-entropy losses that treat objects and relationships as independent entities. Such a formulation, however, ignores the structure in the output space, in an inherently structured prediction problem. In this work, we introduce a novel energy-based learning framework for generating scene graphs. The proposed formulation allows for effi… ▽ More Traditional scene graph generation methods are trained using cross-entropy losses that treat objects and relationships as independent entities. Such a formulation, however, ignores the structure in the output space, in an inherently structured prediction problem. In this work, we introduce a novel energy-based learning framework for generating scene graphs. The proposed formulation allows for efficiently incorporating the structure of scene graphs in the output space. This additional constraint in the learning framework acts as an inductive bias and allows models to learn efficiently from a small number of labels. We use the proposed energy-based framework to train existing state-of-the-art models and obtain a significant performance improvement, of up to 21% and 27%, on the Visual Genome and GQA benchmark datasets, respectively. Furthermore, we showcase the learning efficiency of the proposed framework by demonstrating superior performance in the zero- and few-shot settings where data is scarce. △ Less

Submitted 3 March, 2021; originally announced March 2021.

arXiv:2101.02477 [pdf, other]

GAN-Control: Explicitly Controllable GANs

Authors: Alon Shoshan, Nadav Bhonker, Igor Kviatkovsky, Gerard Medioni

Abstract: We present a framework for training GANs with explicit control over generated images. We are able to control the generated image by settings exact attributes such as age, pose, expression, etc. Most approaches for editing GAN-generated images achieve partial control by leveraging the latent space disentanglement properties, obtained implicitly after standard GAN training. Such methods are able to… ▽ More We present a framework for training GANs with explicit control over generated images. We are able to control the generated image by settings exact attributes such as age, pose, expression, etc. Most approaches for editing GAN-generated images achieve partial control by leveraging the latent space disentanglement properties, obtained implicitly after standard GAN training. Such methods are able to change the relative intensity of certain attributes, but not explicitly set their values. Recently proposed methods, designed for explicit control over human faces, harness morphable 3D face models to allow fine-grained control capabilities in GANs. Unlike these methods, our control is not constrained to morphable 3D face model parameters and is extendable beyond the domain of human faces. Using contrastive learning, we obtain GANs with an explicitly disentangled latent space. This disentanglement is utilized to train control-encoders map** human-interpretable inputs to suitable latent vectors, thus allowing explicit control. In the domain of human faces we demonstrate control over identity, age, pose, expression, hair color and illumination. We also demonstrate control capabilities of our framework in the domains of painted portraits and dog image generation. We demonstrate that our approach achieves state-of-the-art performance both qualitatively and quantitatively. △ Less

Submitted 3 October, 2021; v1 submitted 7 January, 2021; originally announced January 2021.

arXiv:2006.02110 [pdf, other]

From Real to Synthetic and Back: Synthesizing Training Data for Multi-Person Scene Understanding

Authors: Igor Kviatkovsky, Nadav Bhonker, Gerard Medioni

Abstract: We present a method for synthesizing naturally looking images of multiple people interacting in a specific scenario. These images benefit from the advantages of synthetic data: being fully controllable and fully annotated with any type of standard or custom-defined ground truth. To reduce the synthetic-to-real domain gap, we introduce a pipeline consisting of the following steps: 1) we render scen… ▽ More We present a method for synthesizing naturally looking images of multiple people interacting in a specific scenario. These images benefit from the advantages of synthetic data: being fully controllable and fully annotated with any type of standard or custom-defined ground truth. To reduce the synthetic-to-real domain gap, we introduce a pipeline consisting of the following steps: 1) we render scenes in a context modeled after the real world, 2) we train a human parsing model on the synthetic images, 3) we use the model to estimate segmentation maps for real images, 4) we train a conditional generative adversarial network (cGAN) to learn the inverse map** -- from a segmentation map to a real image, and 5) given new synthetic segmentation maps, we use the cGAN to generate realistic images. An illustration of our pipeline is presented in Figure 2. We use the generated data to train a multi-task model on the challenging tasks of UV map** and dense depth estimation. We demonstrate the value of the data generation and the trained model, both quantitatively and qualitatively on the CMU Panoptic Dataset. △ Less

Submitted 3 June, 2020; originally announced June 2020.

arXiv:2005.10481 [pdf, other]

AOWS: Adaptive and optimal network width search with latency constraints

Authors: Maxim Berman, Leonid Pishchulin, Ning Xu, Matthew B. Blaschko, Gerard Medioni

Abstract: Neural architecture search (NAS) approaches aim at automatically finding novel CNN architectures that fit computational constraints while maintaining a good performance on the target platform. We introduce a novel efficient one-shot NAS approach to optimally search for channel numbers, given latency constraints on a specific hardware. We first show that we can use a black-box approach to estimate… ▽ More Neural architecture search (NAS) approaches aim at automatically finding novel CNN architectures that fit computational constraints while maintaining a good performance on the target platform. We introduce a novel efficient one-shot NAS approach to optimally search for channel numbers, given latency constraints on a specific hardware. We first show that we can use a black-box approach to estimate a realistic latency model for a specific inference platform, without the need for low-level access to the inference computation. Then, we design a pairwise MRF to score any channel configuration and use dynamic programming to efficiently decode the best performing configuration, yielding an optimal solution for the network width search. Finally, we propose an adaptive channel configuration sampling scheme to gradually specialize the training phase to the target computational constraints. Experiments on ImageNet classification show that our approach can find networks fitting the resource constraints on different target platforms while improving accuracy over the state-of-the-art efficient networks. △ Less

Submitted 21 May, 2020; originally announced May 2020.

Comments: Accepted to CVPR 2020 (oral)

arXiv:2002.07362 [pdf, other]

MILA: Multi-Task Learning from Videos via Efficient Inter-Frame Attention

Authors: Donghyun Kim, Tian Lan, Chuhang Zou, Ning Xu, Bryan A. Plummer, Stan Sclaroff, Jayan Eledath, Gerard Medioni

Abstract: Prior work in multi-task learning has mainly focused on predictions on a single image. In this work, we present a new approach for multi-task learning from videos via efficient inter-frame local attention (MILA). Our approach contains a novel inter-frame attention module which allows learning of task-specific attention across frames. We embed the attention module in a ``slow-fast'' architecture, w… ▽ More Prior work in multi-task learning has mainly focused on predictions on a single image. In this work, we present a new approach for multi-task learning from videos via efficient inter-frame local attention (MILA). Our approach contains a novel inter-frame attention module which allows learning of task-specific attention across frames. We embed the attention module in a ``slow-fast'' architecture, where the slower network runs on sparsely sampled keyframes and the light-weight shallow network runs on non-keyframes at a high frame rate. We also propose an effective adversarial learning strategy to encourage the slow and fast network to learn similar features. Our approach ensures low-latency multi-task learning while maintaining high quality predictions. Experiments show competitive accuracy compared to state-of-the-art on two multi-task learning benchmarks while reducing the number of floating point operations (FLOPs) by up to 70\%. In addition, our attention based feature propagation method (ILA) outperforms prior work in terms of task accuracy while also reducing up to 90\% of FLOPs. △ Less

Submitted 10 October, 2021; v1 submitted 17 February, 2020; originally announced February 2020.

Comments: Accepted in ICCV 2021 MTL Workshop

arXiv:1802.00542 [pdf, other]

ExpNet: Landmark-Free, Deep, 3D Facial Expressions

Authors: Feng-Ju Chang, Anh Tuan Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, Gerard Medioni

Abstract: We describe a deep learning based method for estimating 3D facial expression coefficients. Unlike previous work, our process does not relay on facial landmark detection methods as a proxy step. Recent methods have shown that a CNN can be trained to regress accurate and discriminative 3D morphable model (3DMM) representations, directly from image intensities. By foregoing facial landmark detection,… ▽ More We describe a deep learning based method for estimating 3D facial expression coefficients. Unlike previous work, our process does not relay on facial landmark detection methods as a proxy step. Recent methods have shown that a CNN can be trained to regress accurate and discriminative 3D morphable model (3DMM) representations, directly from image intensities. By foregoing facial landmark detection, these methods were able to estimate shapes for occluded faces appearing in unprecedented in-the-wild viewing conditions. We build on those methods by showing that facial expressions can also be estimated by a robust, deep, landmark-free approach. Our ExpNet CNN is applied directly to the intensities of a face image and regresses a 29D vector of 3D expression coefficients. We propose a unique method for collecting data to train this network, leveraging on the robustness of deep networks to training label noise. We further offer a novel means of evaluating the accuracy of estimated expression coefficients: by measuring how well they capture facial emotions on the CK+ and EmotiW-17 emotion recognition benchmarks. We show that our ExpNet produces expression coefficients which better discriminate between facial emotions than those obtained using state of the art, facial landmark detection techniques. Moreover, this advantage grows as image scales drop, demonstrating that our ExpNet is more robust to scale changes than landmark detection methods. Finally, at the same level of accuracy, our ExpNet is orders of magnitude faster than its alternatives. △ Less

Submitted 1 February, 2018; originally announced February 2018.

Comments: Accepted to the IEEE International Conference on Automatic Face and Gesture Recognition, 2018

arXiv:1712.05083 [pdf, other]

Extreme 3D Face Reconstruction: Seeing Through Occlusions

Authors: Anh Tuan Tran, Tal Hassner, Iacopo Masi, Eran Paz, Yuval Nirkin, Gerard Medioni

Abstract: Existing single view, 3D face reconstruction methods can produce beautifully detailed 3D results, but typically only for near frontal, unobstructed viewpoints. We describe a system designed to provide detailed 3D reconstructions of faces viewed under extreme conditions, out of plane rotations, and occlusions. Motivated by the concept of bump map**, we propose a layered approach which decouples e… ▽ More Existing single view, 3D face reconstruction methods can produce beautifully detailed 3D results, but typically only for near frontal, unobstructed viewpoints. We describe a system designed to provide detailed 3D reconstructions of faces viewed under extreme conditions, out of plane rotations, and occlusions. Motivated by the concept of bump map**, we propose a layered approach which decouples estimation of a global shape from its mid-level details (e.g., wrinkles). We estimate a coarse 3D face shape which acts as a foundation and then separately layer this foundation with details represented by a bump map. We show how a deep convolutional encoder-decoder can be used to estimate such bump maps. We further show how this approach naturally extends to generate plausible details for occluded facial regions. We test our approach and its components extensively, quantitatively demonstrating the invariance of our estimated facial details. We further provide numerous qualitative examples showing that our method produces detailed 3D face shapes in viewing conditions where existing state of the art often break down. △ Less

Submitted 29 March, 2018; v1 submitted 13 December, 2017; originally announced December 2017.

Comments: Accepted to CVPR'18. Previously titled: "Extreme 3D Face Reconstruction: Looking Past Occlusions"

arXiv:1708.07517 [pdf, other]

FacePoseNet: Making a Case for Landmark-Free Face Alignment

Authors: Fengju Chang, Anh Tuan Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, Gerard Medioni

Abstract: We show how a simple convolutional neural network (CNN) can be trained to accurately and robustly regress 6 degrees of freedom (6DoF) 3D head pose, directly from image intensities. We further explain how this FacePoseNet (FPN) can be used to align faces in 2D and 3D as an alternative to explicit facial landmark detection for these tasks. We claim that in many cases the standard means of measuring… ▽ More We show how a simple convolutional neural network (CNN) can be trained to accurately and robustly regress 6 degrees of freedom (6DoF) 3D head pose, directly from image intensities. We further explain how this FacePoseNet (FPN) can be used to align faces in 2D and 3D as an alternative to explicit facial landmark detection for these tasks. We claim that in many cases the standard means of measuring landmark detector accuracy can be misleading when comparing different face alignments. Instead, we compare our FPN with existing methods by evaluating how they affect face recognition accuracy on the IJB-A and IJB-B benchmarks: using the same recognition pipeline, but varying the face alignment method. Our results show that (a) better landmark detection accuracy measured on the 300W benchmark does not necessarily imply better face recognition accuracy. (b) Our FPN provides superior 2D and 3D face alignment on both benchmarks. Finally, (c), FPN aligns faces at a small fraction of the computational cost of comparably accurate landmark detectors. For many purposes, FPN is thus a far faster and far more accurate face alignment method than using facial landmark detectors. △ Less

Submitted 31 August, 2017; v1 submitted 24 August, 2017; originally announced August 2017.

arXiv:1704.06729 [pdf, other]

On Face Segmentation, Face Swap**, and Face Perception

Authors: Yuval Nirkin, Iacopo Masi, Anh Tuan Tran, Tal Hassner, Gerard Medioni

Abstract: We show that even when face images are unconstrained and arbitrarily paired, face swap** between them is actually quite simple. To this end, we make the following contributions. (a) Instead of tailoring systems for face segmentation, as others previously proposed, we show that a standard fully convolutional network (FCN) can achieve remarkably fast and accurate segmentations, provided that it is… ▽ More We show that even when face images are unconstrained and arbitrarily paired, face swap** between them is actually quite simple. To this end, we make the following contributions. (a) Instead of tailoring systems for face segmentation, as others previously proposed, we show that a standard fully convolutional network (FCN) can achieve remarkably fast and accurate segmentations, provided that it is trained on a rich enough example set. For this purpose, we describe novel data collection and generation routines which provide challenging segmented face examples. (b) We use our segmentations to enable robust face swap** under unprecedented conditions. (c) Unlike previous work, our swap** is robust enough to allow for extensive quantitative tests. To this end, we use the Labeled Faces in the Wild (LFW) benchmark and measure the effect of intra- and inter-subject face swap** on recognition. We show that our intra-subject swapped faces remain as recognizable as their sources, testifying to the effectiveness of our method. In line with well known perceptual studies, we show that better face swap** produces less recognizable inter-subject results. This is the first time this effect was quantitatively demonstrated for machine vision systems. △ Less

Submitted 21 April, 2017; originally announced April 2017.

arXiv:1703.10714 [pdf, other]

Deep 3D Face Identification

Authors: Donghyun Kim, Matthias Hernandez, Jongmoo Choi, Gerard Medioni

Abstract: We propose a novel 3D face recognition algorithm using a deep convolutional neural network (DCNN) and a 3D augmentation technique. The performance of 2D face recognition algorithms has significantly increased by leveraging the representational power of deep neural networks and the use of large-scale labeled training data. As opposed to 2D face recognition, training discriminative deep features for… ▽ More We propose a novel 3D face recognition algorithm using a deep convolutional neural network (DCNN) and a 3D augmentation technique. The performance of 2D face recognition algorithms has significantly increased by leveraging the representational power of deep neural networks and the use of large-scale labeled training data. As opposed to 2D face recognition, training discriminative deep features for 3D face recognition is very difficult due to the lack of large-scale 3D face datasets. In this paper, we show that transfer learning from a CNN trained on 2D face images can effectively work for 3D face recognition by fine-tuning the CNN with a relatively small number of 3D facial scans. We also propose a 3D face augmentation technique which synthesizes a number of different facial expressions from a single 3D face scan. Our proposed method shows excellent recognition results on Bosphorus, BU-3DFE, and 3D-TEC datasets, without using hand-crafted features. The 3D identification using our deep features also scales well for large databases. △ Less

Submitted 30 March, 2017; originally announced March 2017.

Comments: 9 pages, 5 figures, 2 tables

arXiv:1612.04904 [pdf, other]

Regressing Robust and Discriminative 3D Morphable Models with a very Deep Neural Network

Authors: Anh Tuan Tran, Tal Hassner, Iacopo Masi, Gerard Medioni

Abstract: The 3D shapes of faces are well known to be discriminative. Yet despite this, they are rarely used for face recognition and always under controlled viewing conditions. We claim that this is a symptom of a serious but often overlooked problem with existing methods for single view 3D face reconstruction: when applied "in the wild", their 3D estimates are either unstable and change for different phot… ▽ More The 3D shapes of faces are well known to be discriminative. Yet despite this, they are rarely used for face recognition and always under controlled viewing conditions. We claim that this is a symptom of a serious but often overlooked problem with existing methods for single view 3D face reconstruction: when applied "in the wild", their 3D estimates are either unstable and change for different photos of the same subject or they are over-regularized and generic. In response, we describe a robust method for regressing discriminative 3D morphable face models (3DMM). We use a convolutional neural network (CNN) to regress 3DMM shape and texture parameters directly from an input photo. We overcome the shortage of training data required for this purpose by offering a method for generating huge numbers of labeled examples. The 3D estimates produced by our CNN surpass state of the art accuracy on the MICC data set. Coupled with a 3D-3D face matching pipeline, we show the first competitive face recognition results on the LFW, YTF and IJB-A benchmarks using 3D face shapes as representations, rather than the opaque deep feature vectors used by other modern systems. △ Less

Submitted 14 December, 2016; originally announced December 2016.

arXiv:1611.09510 [pdf, other]

Graph-Based Manifold Frequency Analysis for Denoising

Authors: Shay Deutsch, Antonio Ortega, Gerard Medioni

Abstract: We propose a new framework for manifold denoising based on processing in the graph Fourier frequency domain, derived from the spectral decomposition of the discrete graph Laplacian. Our approach uses the Spectral Graph Wavelet transform in order to per- form non-iterative denoising directly in the graph frequency domain, an approach inspired by conventional wavelet-based signal denoising methods.… ▽ More We propose a new framework for manifold denoising based on processing in the graph Fourier frequency domain, derived from the spectral decomposition of the discrete graph Laplacian. Our approach uses the Spectral Graph Wavelet transform in order to per- form non-iterative denoising directly in the graph frequency domain, an approach inspired by conventional wavelet-based signal denoising methods. We theoretically justify our approach, based on the fact that for smooth manifolds the coordinate information energy is localized in the low spectral graph wavelet sub-bands, while the noise affects all frequency bands in a similar way. Experimental results show that our proposed manifold frequency denoising (MFD) approach significantly outperforms the state of the art denoising meth- ods, and is robust to a wide range of parameter selections, e.g., the choice of k nearest neighbor connectivity of the graph. △ Less

Submitted 29 November, 2016; originally announced November 2016.

arXiv:1607.01450 [pdf, other]

Pooling Faces: Template based Face Recognition with Pooled Face Images

Authors: Tal Hassner, Iacopo Masi, Jungyeon Kim, Jongmoo Choi, Shai Harel, Prem Natarajan, Gerard Medioni

Abstract: We propose a novel approach to template based face recognition. Our dual goal is to both increase recognition accuracy and reduce the computational and storage costs of template matching. To do this, we leverage on an approach which was proven effective in many other domains, but, to our knowledge, never fully explored for face images: average pooling of face photos. We show how (and why!) the spa… ▽ More We propose a novel approach to template based face recognition. Our dual goal is to both increase recognition accuracy and reduce the computational and storage costs of template matching. To do this, we leverage on an approach which was proven effective in many other domains, but, to our knowledge, never fully explored for face images: average pooling of face photos. We show how (and why!) the space of a template's images can be partitioned and then pooled based on image quality and head pose and the effect this has on accuracy and template size. We perform extensive tests on the IJB-A and Janus CS2 template based face identification and verification benchmarks. These show that not only does our approach outperform published state of the art despite requiring far fewer cross template comparisons, but also, surprisingly, that image pooling performs on par with deep feature pooling. △ Less

Submitted 5 July, 2016; originally announced July 2016.

Comments: Appeared in the IEEE Computer Society Workshop on Biometrics, IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June, 2016

arXiv:1604.02801 [pdf, other]

Capturing Dynamic Textured Surfaces of Moving Targets

Authors: Ruizhe Wang, Lingyu Wei, Etienne Vouga, Qixing Huang, Duygu Ceylan, Gerard Medioni, Hao Li

Abstract: We present an end-to-end system for reconstructing complete watertight and textured models of moving subjects such as clothed humans and animals, using only three or four handheld sensors. The heart of our framework is a new pairwise registration algorithm that minimizes, using a particle swarm strategy, an alignment error metric based on mutual visibility and occlusion. We show that this algorith… ▽ More We present an end-to-end system for reconstructing complete watertight and textured models of moving subjects such as clothed humans and animals, using only three or four handheld sensors. The heart of our framework is a new pairwise registration algorithm that minimizes, using a particle swarm strategy, an alignment error metric based on mutual visibility and occlusion. We show that this algorithm reliably registers partial scans with as little as 15% overlap without requiring any initial correspondences, and outperforms alternative global registration algorithms. This registration algorithm allows us to reconstruct moving subjects from free-viewpoint video produced by consumer-grade sensors, without extensive sensor calibration, constrained capture volume, expensive arrays of cameras, or templates of the subject geometry. △ Less

Submitted 11 April, 2016; originally announced April 2016.

Comments: 22 pages, 12 figures

arXiv:1603.08592 [pdf, other]

Exploring Local Context for Multi-target Tracking in Wide Area Aerial Surveillance

Authors: Bor-Jeng Chen, Gerard Medioni

Abstract: Tracking many vehicles in wide coverage aerial imagery is crucial for understanding events in a large field of view. Most approaches aim to associate detections from frame differencing into tracks. However, slow or stopped vehicles result in long-term missing detections and further cause tracking discontinuities. Relying merely on appearance clue to recover missing detections is difficult as targe… ▽ More Tracking many vehicles in wide coverage aerial imagery is crucial for understanding events in a large field of view. Most approaches aim to associate detections from frame differencing into tracks. However, slow or stopped vehicles result in long-term missing detections and further cause tracking discontinuities. Relying merely on appearance clue to recover missing detections is difficult as targets are extremely small and in grayscale. In this paper, we address the limitations of detection association methods by coupling it with a local context tracker (LCT), which does not rely on motion detections. On one hand, our LCT learns neighboring spatial relation and tracks each target in consecutive frames using graph optimization. It takes the advantage of context constraints to avoid drifting to nearby targets. We generate hypotheses from sparse and dense flow efficiently to keep solutions tractable. On the other hand, we use detection association strategy to extract short tracks in batch processing. We explicitly handle merged detections by generating additional hypotheses from them. Our evaluation on wide area aerial imagery sequences shows significant improvement over state-of-the-art methods. △ Less

Submitted 28 March, 2016; originally announced March 2016.

arXiv:1603.07388 [pdf, other]

Face Recognition Using Deep Multi-Pose Representations

Authors: Wael AbdAlmageed, Yue Wua, Stephen Rawlsa, Shai Harel, Tal Hassner, Iacopo Masi, Jongmoo Choi, Jatuporn Toy Leksut, Jungyeon Kim, Prem Natarajan, Ram Nevatia, Gerard Medioni

Abstract: We introduce our method and system for face recognition using multiple pose-aware deep learning models. In our representation, a face image is processed by several pose-specific deep convolutional neural network (CNN) models to generate multiple pose-specific features. 3D rendering is used to generate multiple face poses from the input image. Sensitivity of the recognition system to pose variation… ▽ More We introduce our method and system for face recognition using multiple pose-aware deep learning models. In our representation, a face image is processed by several pose-specific deep convolutional neural network (CNN) models to generate multiple pose-specific features. 3D rendering is used to generate multiple face poses from the input image. Sensitivity of the recognition system to pose variations is reduced since we use an ensemble of pose-specific CNN features. The paper presents extensive experimental results on the effect of landmark detection, CNN layer selection and pose model selection on the performance of the recognition pipeline. Our novel representation achieves better results than the state-of-the-art on IARPA's CS2 and NIST's IJB-A in both verification and identification (i.e. search) tasks. △ Less

Submitted 23 March, 2016; originally announced March 2016.

Comments: WACV 2016

arXiv:1603.07057 [pdf, other]

Do We Really Need to Collect Millions of Faces for Effective Face Recognition?

Authors: Iacopo Masi, Anh Tuan Tran, Jatuporn Toy Leksut, Tal Hassner, Gerard Medioni

Abstract: Face recognition capabilities have recently made extraordinary leaps. Though this progress is at least partially due to ballooning training set sizes -- huge numbers of face images downloaded and labeled for identity -- it is not clear if the formidable task of collecting so many images is truly necessary. We propose a far more accessible means of increasing training data sizes for face recognitio… ▽ More Face recognition capabilities have recently made extraordinary leaps. Though this progress is at least partially due to ballooning training set sizes -- huge numbers of face images downloaded and labeled for identity -- it is not clear if the formidable task of collecting so many images is truly necessary. We propose a far more accessible means of increasing training data sizes for face recognition systems. Rather than manually harvesting and labeling more faces, we simply synthesize them. We describe novel methods of enriching an existing dataset with important facial appearance variations by manipulating the faces it contains. We further apply this synthesis approach when matching query images represented using a standard convolutional neural network. The effect of training and testing with synthesized images is extensively tested on the LFW and IJB-A (verification and identification) benchmarks and Janus CS2. The performances obtained by our approach match state of the art results reported by systems trained on millions of downloaded images. △ Less

Submitted 10 April, 2016; v1 submitted 22 March, 2016; originally announced March 2016.

arXiv:1601.04888 [pdf, other]

A Closed-Form Solution to Tensor Voting: Theory and Applications

Authors: Tai-Pang Wu, Sai-Kit Yeung, Jiaya Jia, Chi-Keung Tang, Gerard Medioni

Abstract: We prove a closed-form solution to tensor voting (CFTV): given a point set in any dimensions, our closed-form solution provides an exact, continuous and efficient algorithm for computing a structure-aware tensor that simultaneously achieves salient structure detection and outlier attenuation. Using CFTV, we prove the convergence of tensor voting on a Markov random field (MRF), thus termed as MRFTV… ▽ More We prove a closed-form solution to tensor voting (CFTV): given a point set in any dimensions, our closed-form solution provides an exact, continuous and efficient algorithm for computing a structure-aware tensor that simultaneously achieves salient structure detection and outlier attenuation. Using CFTV, we prove the convergence of tensor voting on a Markov random field (MRF), thus termed as MRFTV, where the structure-aware tensor at each input site reaches a stationary state upon convergence in structure propagation. We then embed structure-aware tensor into expectation maximization (EM) for optimizing a single linear structure to achieve efficient and robust parameter estimation. Specifically, our EMTV algorithm optimizes both the tensor and fitting parameters and does not require random sampling consensus typically used in existing robust statistical techniques. We performed quantitative evaluation on its accuracy and robustness, showing that EMTV performs better than the original TV and other state-of-the-art techniques in fundamental matrix estimation for multiview stereo matching. The extensions of CFTV and EMTV for extracting multiple and nonlinear structures are underway. An addendum is included in this arXiv version. △ Less

Submitted 19 January, 2016; v1 submitted 19 January, 2016; originally announced January 2016.

Comments: Addendum appended to the TPAMI paper

Journal ref: TPAMI 34(8): 1482-1495 (2012)

arXiv:1511.04031 [pdf, other]

Facial Landmark Detection with Tweaked Convolutional Neural Networks

Authors: Yue Wu, Tal Hassner, KangGeon Kim, Gerard Medioni, Prem Natarajan

Abstract: We present a novel convolutional neural network (CNN) design for facial landmark coordinate regression. We examine the intermediate features of a standard CNN trained for landmark detection and show that features extracted from later, more specialized layers capture rough landmark locations. This provides a natural means of applying differential treatment midway through the network, tweaking proce… ▽ More We present a novel convolutional neural network (CNN) design for facial landmark coordinate regression. We examine the intermediate features of a standard CNN trained for landmark detection and show that features extracted from later, more specialized layers capture rough landmark locations. This provides a natural means of applying differential treatment midway through the network, tweaking processing based on facial alignment. The resulting Tweaked CNN model (TCNN) harnesses the robustness of CNNs for landmark detection, in an appearance-sensitive manner without training multi-part or multi-scale models. Our results on standard face landmark detection and face verification benchmarks show TCNN to surpasses previously published performances by wide margins. △ Less

Submitted 21 March, 2016; v1 submitted 12 November, 2015; originally announced November 2015.

Comments: First two authors had joint first authorship / equal contribution

arXiv:1506.08485 [pdf, other]

The Multi-Strand Graph for a PTZ Tracker

Authors: Shachaf Melman, Yael Moses, Gérard Medioni, Yinghao Cai

Abstract: High-resolution images can be used to resolve matching ambiguities between trajectory fragments (tracklets), which is one of the main challenges in multiple target tracking. A PTZ camera, which can pan, tilt and zoom, is a powerful and efficient tool that offers both close-up views and wide area coverage on demand. The wide-area view makes it possible to track many targets while the close-up view… ▽ More High-resolution images can be used to resolve matching ambiguities between trajectory fragments (tracklets), which is one of the main challenges in multiple target tracking. A PTZ camera, which can pan, tilt and zoom, is a powerful and efficient tool that offers both close-up views and wide area coverage on demand. The wide-area view makes it possible to track many targets while the close-up view allows individuals to be identified from high-resolution images of their faces. A central component of a PTZ tracking system is a scheduling algorithm that determines which target to zoom in on. In this paper we study this scheduling problem from a theoretical perspective, where the high resolution images are also used for tracklet matching. We propose a novel data structure, the Multi-Strand Tracking Graph (MSG), which represents the set of tracklets computed by a tracker and the possible associations between them. The MSG allows efficient scheduling as well as resolving -- directly or by elimination -- matching ambiguities between tracklets. The main feature of the MSG is the auxiliary data saved in each vertex, which allows efficient computation while avoiding time-consuming graph traversal. Synthetic data simulations are used to evaluate our scheduling algorithm and to demonstrate its superiority over a naïve one. △ Less

Submitted 28 June, 2015; originally announced June 2015.

Comments: 9 pages, 7 figures, AVSS2015

arXiv:1206.4624 [pdf]

Robust Multiple Manifolds Structure Learning

Authors: Dian Gong, Xuemei Zhao, Gerard Medioni

Abstract: We present a robust multiple manifolds structure learning (RMMSL) scheme to robustly estimate data structures under the multiple low intrinsic dimensional manifolds assumption. In the local learning stage, RMMSL efficiently estimates local tangent space by weighted low-rank matrix factorization. In the global learning stage, we propose a robust manifold clustering method based on local structure l… ▽ More We present a robust multiple manifolds structure learning (RMMSL) scheme to robustly estimate data structures under the multiple low intrinsic dimensional manifolds assumption. In the local learning stage, RMMSL efficiently estimates local tangent space by weighted low-rank matrix factorization. In the global learning stage, we propose a robust manifold clustering method based on local structure learning results. The proposed clustering method is designed to get the flattest manifolds clusters by introducing a novel curved-level similarity function. Our approach is evaluated and compared to state-of-the-art methods on synthetic data, handwritten digit images, human motion capture data and motorbike videos. We demonstrate the effectiveness of the proposed approach, which yields higher clustering accuracy, and produces promising results for challenging tasks of human motion segmentation and motion flow learning from videos. △ Less

Submitted 18 June, 2012; originally announced June 2012.

Comments: ICML2012

Showing 1–32 of 32 results for author: Medioni, G