STMPL: Human Soft-Tissue Simulation

Authors: Anton Agafonov, Lihi Zelnik-Manor

Abstract: In various applications, such as virtual reality and gaming, simulating the deformation of soft tissues in the human body during interactions with external objects is essential. Traditionally, Finite Element Methods (FEM) have been employed for this purpose, but they tend to be slow and resource-intensive. In this paper, we propose a unified representation of human body shape and soft tissue with a data-driven simulator of non-rigid deformations. This approach enables rapid simulation of realistic interactions. Our method builds upon the SMPL model, which generates human body shapes considering rigid transformations. We extend SMPL by incorporating a soft tissue layer and an intuitive representation of external forces applied to the body during object interactions. Specifically, we mapped the 3D body shape and soft tissue and applied external forces to 2D UV maps. Leveraging a UNET architecture designed for 2D data, our approach achieves high-accuracy inference in real time. Our experiment shows that our method achieves plausible deformation of the soft tissue layer, even for unseen scenarios.

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2204.09134 [pdf, other]

Diverse Imagenet Models Transfer Better

Authors: Niv Nayman, Avram Golbert, Asaf Noy, Tan **, Lihi Zelnik-Manor

Abstract: A commonly accepted hypothesis is that models with higher accuracy on Imagenet perform better on other downstream tasks, leading to much research dedicated to optimizing Imagenet accuracy. Recently this hypothesis has been challenged by evidence showing that self-supervised models transfer better than their supervised counterparts, despite their inferior Imagenet accuracy. This calls for identifying the additional factors, on top of Imagenet accuracy, that make models transferable. In this work we show that high diversity of the features learnt by the model promotes transferability jointly with Imagenet accuracy. Encouraged by the recent transferability results of self-supervised models, we propose a method that combines self-supervised and supervised pretraining to generate models with both high diversity and high accuracy, and as a result high transferability. We demonstrate our results on several architectures and multiple downstream tasks, including both single-label and multi-label classification.

Submitted 19 April, 2022; originally announced April 2022.

MSC Class: 68T07; 68T10; 68T45 ACM Class: I.2.10; I.2.6; I.4.10

arXiv:2110.12399 [pdf, other]

BINAS: Bilinear Interpretable Neural Architecture Search

Authors: Niv Nayman, Yonathan Aflalo, Asaf Noy, Rong **, Lihi Zelnik-Manor

Abstract: Practical use of neural networks often involves requirements on latency, energy and memory among others. A popular approach to find networks under such requirements is through constrained Neural Architecture Search (NAS). However, previous methods use complicated predictors for the accuracy of the network. Those predictors are hard to interpret and sensitive to many hyperparameters to be tuned, hence, the resulting accuracy of the generated models is often harmed. In this work we resolve this by introducing Bilinear Interpretable Neural Architecture Search (BINAS), that is based on an accurate and simple bilinear formulation of both an accuracy estimator and the expected resource requirement, together with a scalable search method with theoretical guarantees. The simplicity of our proposed estimator together with the intuitive way it is constructed bring interpretability through many insights about the contribution of different design choices. For example, we find that in the examined search space, adding depth and width is more effective at deeper stages of the network and at the beginning of each resolution stage. Our experiments show that BINAS generates comparable to or better architectures than other state-of-the-art NAS methods within a reduced marginal search cost, while strictly satisfying the resource constraints.

Submitted 27 April, 2022; v1 submitted 24 October, 2021; originally announced October 2021.

Comments: The full code is released at https://github.com/Alibaba-MIIL/BINAS

MSC Class: 68T09; 68T45 ACM Class: G.1.6; G.3; I.2.8; I.2.10; I.5.1

arXiv:2110.10955 [pdf, ps, other]

Multi-label Classification with Partial Annotations using Class-aware Selective Loss

Authors: Emanuel Ben-Baruch, Tal Ridnik, Itamar Friedman, Avi Ben-Cohen, Nadav Zamir, Asaf Noy, Lihi Zelnik-Manor

Abstract: Large-scale multi-label classification datasets are commonly, and perhaps inevitably, partially annotated. That is, only a small subset of labels are annotated per sample. Different methods for handling the missing labels induce different properties on the model and impact its accuracy. In this work, we analyze the partial labeling problem, then propose a solution based on two key ideas. First, un-annotated labels should be treated selectively according to two probability quantities: the class distribution in the overall dataset and the specific label likelihood for a given data sample. We propose to estimate the class distribution using a dedicated temporary model, and we show its improved efficiency over a naive estimation computed using the dataset's partial annotations. Second, during the training of the target model, we emphasize the contribution of annotated labels over originally un-annotated labels by using a dedicated asymmetric loss. With our novel approach, we achieve state-of-the-art results on OpenImages dataset (e.g. reaching 87.3 mAP on V6). In addition, experiments conducted on LVIS and simulated-COCO demonstrate the effectiveness of our approach. Code is available at https://github.com/Alibaba-MIIL/PartialLabelingCSL.

Submitted 21 October, 2021; originally announced October 2021.

arXiv:2109.12499 [pdf, other]

PETA: Photo Albums Event Recognition using Transformers Attention

Authors: Tamar Glaser, Emanuel Ben-Baruch, Gilad Sharir, Nadav Zamir, Asaf Noy, Lihi Zelnik-Manor

Abstract: In recent years the amounts of personal photos captured increased significantly, giving rise to new challenges in multi-image understanding and high-level image understanding. Event recognition in personal photo albums presents one challenging scenario where life events are recognized from a disordered collection of images, including both relevant and irrelevant images. Event recognition in images also presents the challenge of high-level image understanding, as opposed to low-level image object classification. In absence of methods to analyze multiple inputs, previous methods adopted temporal mechanisms, including various forms of recurrent neural networks. However, their effective temporal window is local. In addition, they are not a natural choice given the disordered characteristic of photo albums. We address this gap with a tailor-made solution, combining the power of CNNs for image representation and transformers for album representation to perform global reasoning on image collection, offering a practical and efficient solution for photo albums event recognition. Our solution reaches state-of-the-art results on 3 prominent benchmarks, achieving above 90\% mAP on all datasets. We further explore the related image-importance task in event recognition, demonstrating how the learned attentions correlate with the human-annotated importance for this subjective task, thus opening the door for new applications.

Submitted 26 September, 2021; originally announced September 2021.

Comments: 8 pages, 10 including references, 3 figures, was submitted to WACV 2022

arXiv:2105.05926 [pdf, other]

Semantic Diversity Learning for Zero-Shot Multi-label Classification

Authors: Avi Ben-Cohen, Nadav Zamir, Emanuel Ben Baruch, Itamar Friedman, Lihi Zelnik-Manor

Abstract: Training a neural network model for recognizing multiple labels associated with an image, including identifying unseen labels, is challenging, especially for images that portray numerous semantically diverse labels. As challenging as this task is, it is an essential task to tackle since it represents many real-world cases, such as image retrieval of natural images. We argue that using a single embedding vector to represent an image, as commonly practiced, is not sufficient to rank both relevant seen and unseen labels accurately. This study introduces an end-to-end model training for multi-label zero-shot learning that supports semantic diversity of the images and labels. We propose to use an embedding matrix having principal embedding vectors trained using a tailored loss function. In addition, during training, we suggest up-weighting in the loss function image samples presenting higher semantic diversity to encourage the diversity of the embedding matrix. Extensive experiments show that our proposed method improves the zero-shot model's quality in tag-based image retrieval achieving SoTA results on several common datasets (NUS-Wide, COCO, Open Images).

Submitted 12 May, 2021; originally announced May 2021.

arXiv:2104.10972 [pdf, ps, other]

ImageNet-21K Pretraining for the Masses

Authors: Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, Lihi Zelnik-Manor

Abstract: ImageNet-1K serves as the primary dataset for pretraining deep learning models for computer vision tasks. ImageNet-21K dataset, which is bigger and more diverse, is used less frequently for pretraining, mainly due to its complexity, low accessibility, and underestimation of its added value. This paper aims to close this gap, and make high-quality efficient pretraining on ImageNet-21K available for everyone. Via a dedicated preprocessing stage, utilization of WordNet hierarchical structure, and a novel training scheme called semantic softmax, we show that various models significantly benefit from ImageNet-21K pretraining on numerous datasets and tasks, including small mobile-oriented models. We also show that we outperform previous ImageNet-21K pretraining schemes for prominent new models like ViT and Mixer. Our proposed pretraining pipeline is efficient, accessible, and leads to SoTA reproducible results, from a publicly available dataset. The training code and pretrained models are available at: https://github.com/Alibaba-MIIL/ImageNet21K

Submitted 5 August, 2021; v1 submitted 22 April, 2021; originally announced April 2021.

Comments: Accepted to NeurIPS 2021 (Datasets and Benchmarks)

arXiv:2103.13915 [pdf, other]

An Image is Worth 16x16 Words, What is a Video Worth?

Authors: Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor

Abstract: Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy, usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input efficient, and can achieve SotA results (on Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach $80.5$ top-1 accuracy with $\times 30$ less frames per video, and $\times 40$ faster inference than the current leading method. Code is available at: https://github.com/Alibaba-MIIL/STAM

Submitted 27 May, 2021; v1 submitted 25 March, 2021; originally announced March 2021.
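
As a rough illustration of the "global attention over video frames" mentioned in the abstract above, the sketch below applies scaled dot-product self-attention across per-frame embedding vectors, so that every frame attends to every other frame regardless of temporal distance. It is not the STAM implementation: the function names, the toy embedding size, and the omission of learned query/key/value projections are all assumptions made for brevity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def global_frame_attention(frames):
    """Self-attention in which each frame embedding attends to all frames.

    `frames` is a list of equal-length float vectors, one per sampled frame.
    Hypothetical simplification: queries, keys, and values are the raw
    embeddings themselves (no learned projection matrices).
    """
    d = len(frames[0])
    attended = []
    for query in frames:
        # Similarity of this frame to every frame, scaled by sqrt(d).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in frames
        ]
        weights = softmax(scores)
        # Weighted average of all frame embeddings.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)]
        )
    return attended
```

Because the attention weights are computed over the full set of frames at once, a small number of sparsely sampled frames can still exchange information, which is the property the abstract relies on to reduce the frame count.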

arXiv:2102.11646 [pdf, other]

HardCoRe-NAS: Hard Constrained diffeRentiable Neural Architecture Search

Authors: Niv Nayman, Yonathan Aflalo, Asaf Noy, Lihi Zelnik-Manor

Abstract: Realistic use of neural networks often requires adhering to multiple constraints on latency, energy and memory among others. A popular approach to find fitting networks is through constrained Neural Architecture Search (NAS), however, previous methods enforce the constraint only softly. Therefore, the resulting networks do not exactly adhere to the resource constraint and their accuracy is harmed. In this work we resolve this by introducing Hard Constrained diffeRentiable NAS (HardCoRe-NAS), that is based on an accurate formulation of the expected resource requirement and a scalable search method that satisfies the hard constraint throughout the search. Our experiments show that HardCoRe-NAS generates state-of-the-art architectures, surpassing other NAS methods, while strictly satisfying the hard resource constraints without any tuning required.

Submitted 23 February, 2021; originally announced February 2021.

Comments: Niv Nayman and Yonathan Aflalo contributed equally. An implementation of HardCoRe-NAS is available at: https://github.com/Alibaba-MIIL/HardCoReNAS

MSC Class: 68T09; 68T45 ACM Class: G.1.6; G.3; I.2.8; I.2.10; I.5.1

arXiv:2101.04243 [pdf, other]

A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks

Authors: Asaf Noy, Yi Xu, Yonathan Aflalo, Lihi Zelnik-Manor, Rong **

Abstract: Deep neural networks' remarkable ability to correctly fit training data when optimized by gradient-based algorithms is yet to be fully understood. Recent theoretical results explain the convergence for ReLU networks that are wider than those used in practice by orders of magnitude. In this work, we take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time. We show that convergence to a global minimum is guaranteed for networks with widths quadratic in the sample size and linear in their depth at a time logarithmic in both. Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size. This construction can be viewed as a novel technique to accelerate training, while its tight finite-width equivalence to Neural Tangent Kernel (NTK) suggests it can be utilized to study generalization as well.

Submitted 8 February, 2021; v1 submitted 11 January, 2021; originally announced January 2021.

arXiv:2009.14119 [pdf, ps, other]

Asymmetric Loss For Multi-Label Classification

Authors: Emanuel Ben-Baruch, Tal Ridnik, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, Lihi Zelnik-Manor

Abstract: In a typical multi-label setting, a picture contains on average few positive labels, and many negative ones. This positive-negative imbalance dominates the optimization process, and can lead to under-emphasizing gradients from positive labels during training, resulting in poor accuracy. In this paper, we introduce a novel asymmetric loss ("ASL"), which operates differently on positive and negative samples. The loss enables to dynamically down-weights and hard-thresholds easy negative samples, while also discarding possibly mislabeled samples. We demonstrate how ASL can balance the probabilities of different samples, and how this balancing is translated to better mAP scores. With ASL, we reach state-of-the-art results on multiple popular multi-label datasets: MS-COCO, Pascal-VOC, NUS-WIDE and Open Images. We also demonstrate ASL applicability for other tasks, such as single-label classification and object detection. ASL is effective, easy to implement, and does not increase the training time or complexity. Implementation is available at: https://github.com/Alibaba-MIIL/ASL.

Submitted 29 July, 2021; v1 submitted 29 September, 2020; originally announced September 2020.

Comments: Accepted to ICCV 2021

ACM Class: I.2.6; I.2.10; I.0; I.4.0
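
The abstract above describes a loss that down-weights and hard-thresholds easy negatives but does not give its formula. The sketch below shows one way such a per-label loss can look, assuming the commonly cited ASL form: a focal-style term for positives, and for negatives a probability shifted down by a margin and clipped at zero. The function name, the default exponents, and the margin value are illustrative assumptions, and this is not the code from the linked repository.

```python
import math


def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Per-label asymmetric loss for predicted probability p and target y.

    Assumed form (hypothetical defaults):
      positive: -(1 - p)^gamma_pos * log(p)
      negative: -(p_m)^gamma_neg * log(1 - p_m), with p_m = max(p - margin, 0)
    """
    if y == 1:
        # Positive label: focal-style weighting of the log-likelihood.
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    # Negative label: shifting by the margin and clipping at zero means
    # negatives with p <= margin contribute no loss at all (hard threshold),
    # while the exponent down-weights the remaining easy negatives.
    p_m = max(p - margin, 0.0)
    return -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
```

With these assumed settings, a negative label predicted at probability 0.03 is discarded entirely, while a confidently wrong negative at 0.9 dominates the negative-side loss.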

arXiv:1912.11850 [pdf, other]

Graph Embedded Pose Clustering for Anomaly Detection

Authors: Amir Markovitz, Gilad Sharir, Itamar Friedman, Lihi Zelnik-Manor, Shai Avidan

Abstract: We propose a new method for anomaly detection of human actions. Our method works directly on human pose graphs that can be computed from an input video sequence. This makes the analysis independent of nuisance parameters such as viewpoint or illumination. We map these graphs to a latent space and cluster them. Each action is then represented by its soft-assignment to each of the clusters. This gives a kind of "bag of words" representation to the data, where every action is represented by its similarity to a group of base action-words. Then, we use a Dirichlet process based mixture, that is useful for handling proportional data such as our soft-assignment vectors, to determine if an action is normal or not. We evaluate our method on two types of data sets. The first is a fine-grained anomaly detection data set (e.g. ShanghaiTech) where we wish to detect unusual variations of some action. The second is a coarse-grained anomaly detection data set (e.g., a Kinetics-based data set) where few actions are considered normal, and every other action should be considered abnormal. Extensive experiments on the benchmarks show that our method performs considerably better than other state of the art methods.

Submitted 10 April, 2020; v1 submitted 26 December, 2019; originally announced December 2019.

Comments: Code is available at https://github.com/amirmk89/gepc. CVPR 2020
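
To make the "soft-assignment to each of the clusters" in the abstract above concrete, here is a minimal sketch that turns a latent vector into a vector of cluster memberships summing to one. The abstract does not state which similarity kernel is used; the Student's t style kernel below, the function name `soft_assign`, and the toy inputs are assumptions chosen for illustration only.

```python
def soft_assign(z, centroids):
    """Return one membership weight per cluster; the weights sum to 1.

    `z` is a latent vector and `centroids` a list of cluster centres of the
    same length. Assumed kernel: 1 / (1 + squared Euclidean distance).
    """
    similarities = []
    for centre in centroids:
        squared_distance = sum((zi - ci) ** 2 for zi, ci in zip(z, centre))
        similarities.append(1.0 / (1.0 + squared_distance))
    total = sum(similarities)
    # Normalise so the result is a proportion vector over clusters.
    return [s / total for s in similarities]
```

The output is proportional data (non-negative entries that sum to one), which is the kind of vector the abstract says the Dirichlet process based mixture is suited to model.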

arXiv:1910.07038 [pdf, other]

doi 10.1145/3372278.3390686

Compact Network Training for Person ReID

Authors: Hussam Lawen, Avi Ben-Cohen, Matan Protter, Itamar Friedman, Lihi Zelnik-Manor

Abstract: The task of person re-identification (ReID) has attracted growing attention in recent years leading to improved performance, albeit with little focus on real-world applications. Most SotA methods are based on heavy pre-trained models, e.g. ResNet50 (~25M parameters), which makes them less practical and more tedious to explore architecture modifications. In this study, we focus on a small-sized randomly initialized model that enables us to easily introduce architecture and training modifications suitable for person ReID. The outcomes of our study are a compact network and a fitting training regime. We show the robustness of the network by outperforming the SotA on both Market1501 and DukeMTMC. Furthermore, we show the representation power of our ReID network via SotA results on a different task of multi-object tracking.

Submitted 9 April, 2020; v1 submitted 15 October, 2019; originally announced October 2019.

arXiv:1906.08031 [pdf, other]

XNAS: Neural Architecture Search with Expert Advice

Authors: Niv Nayman, Asaf Noy, Tal Ridnik, Itamar Friedman, Rong **, Lihi Zelnik-Manor

Abstract: This paper introduces a novel optimization method for differential neural architecture search, based on the theory of prediction with expert advice. Its optimization criterion is well fitted for an architecture-selection, i.e., it minimizes the regret incurred by a sub-optimal selection of operations. Unlike previous search relaxations, that require hard pruning of architectures, our method is designed to dynamically wipe out inferior architectures and enhance superior ones. It achieves an optimal worst-case regret bound and suggests the use of multiple learning-rates, based on the amount of information carried by the backward gradients. Experiments show that our algorithm achieves a strong performance over several image classification datasets. Specifically, it obtains an error rate of 1.6% for CIFAR-10, 24% for ImageNet under mobile settings, and achieves state-of-the-art results on three additional datasets.

Submitted 19 June, 2019; originally announced June 2019.
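
The abstract above builds on prediction with expert advice. A standard update from that literature is the multiplicative (exponentiated) weight update, sketched below with each candidate operation treated as an "expert". This is a generic textbook-style illustration, not the XNAS algorithm itself: the function name, the scalar losses, and the learning rate are hypothetical, and the paper's wipeout rule and gradient-based losses are not reproduced.

```python
import math


def exponentiated_update(weights, losses, learning_rate):
    """One multiplicative-weights step over a set of experts.

    Each weight is scaled by exp(-learning_rate * loss) and the result is
    renormalised, so experts with higher loss lose weight geometrically.
    """
    scaled = [
        w * math.exp(-learning_rate * loss)
        for w, loss in zip(weights, losses)
    ]
    total = sum(scaled)
    return [s / total for s in scaled]
```

Repeated application drives the weights of consistently poor experts toward zero while concentrating mass on the better ones, which gives some intuition for how inferior choices can fade out without a separate hard pruning step.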

arXiv:1904.04123 [pdf, other]

ASAP: Architecture Search, Anneal and Prune

Authors: Asaf Noy, Niv Nayman, Tal Ridnik, Nadav Zamir, Sivan Doveh, Itamar Friedman, Raja Giryes, Lihi Zelnik-Manor

Abstract: Automatic methods for Neural Architecture Search (NAS) have been shown to produce state-of-the-art network models. Yet, their main drawback is the computational complexity of the search process. As some primal methods optimized over a discrete search space, thousands of days of GPU were required for convergence. A recent approach is based on constructing a differentiable search space that enables gradient-based optimization, which reduces the search time to a few days. While successful, it still includes some noncontinuous steps, e.g., the pruning of many weak connections at once. In this paper, we propose a differentiable search space that allows the annealing of architecture weights, while gradually pruning inferior operations. In this way, the search converges to a single output network in a continuous manner. Experiments on several vision datasets demonstrate the effectiveness of our method with respect to the search cost and accuracy of the achieved model. Specifically, with $0.2$ GPU search days we achieve an error rate of $1.68\%$ on CIFAR-10.

Submitted 10 October, 2019; v1 submitted 8 April, 2019; originally announced April 2019.
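
The idea of annealing architecture weights while gradually pruning inferior operations, as described in the abstract above, can be sketched with a temperature-scaled softmax and a probability threshold. The abstract gives neither the schedule nor the pruning rule, so the function name `anneal_and_prune`, the threshold, and the temperature values are assumptions. The sketch also keeps the architecture parameters fixed between steps, whereas an actual search would keep training them.

```python
import math


def anneal_and_prune(alphas, temperature, threshold):
    """Softmax over surviving operations at a temperature, then prune.

    `alphas` maps an operation name to its architecture parameter. Operations
    whose softmax probability falls below `threshold` are dropped; the
    survivors are returned with their parameters unchanged.
    """
    peak = max(alphas.values())
    exps = {
        op: math.exp((a - peak) / temperature) for op, a in alphas.items()
    }
    total = sum(exps.values())
    probs = {op: e / total for op, e in exps.items()}
    return {op: alphas[op] for op, p in probs.items() if p >= threshold}


def search(alphas, temperatures, threshold):
    """Apply anneal_and_prune along a decreasing temperature schedule."""
    for temperature in temperatures:
        alphas = anneal_and_prune(alphas, temperature, threshold)
    return alphas
```

As the temperature drops, the softmax sharpens, so operations fall below the threshold one at a time rather than all at once; with a suitable schedule only a single operation remains per edge.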

arXiv:1901.11420 [pdf, other]

Is Image Memorability Prediction Solved?

Authors: Shay Perera, Ayellet Tal, Lihi Zelnik-Manor

Abstract: This paper deals with the prediction of the memorability of a given image. We start by proposing an algorithm that reaches human-level performance on the LaMem dataset - the only large scale benchmark for memorability prediction. The suggested algorithm is based on three observations we make regarding convolutional neural networks (CNNs) that affect memorability prediction. Having reached human-level performance we were humbled, and asked ourselves whether indeed we have resolved memorability prediction - and answered this question in the negative. We studied a few factors and made some recommendations that should be taken into account when designing the next benchmark.

Submitted 31 January, 2019; originally announced January 2019.

arXiv:1811.08760 [pdf, other]

Dynamic-Net: Tuning the Objective Without Re-training for Synthesis Tasks

Authors: Alon Shoshan, Roey Mechrez, Lihi Zelnik-Manor

Abstract: One of the key ingredients for successful optimization of modern CNNs is identifying a suitable objective. To date, the objective is fixed a-priori at training time, and any variation to it requires re-training a new network. In this paper we present a first attempt at alleviating the need for re-training. Rather than fixing the network at training time, we train a "Dynamic-Net" that can be modified at inference time. Our approach considers an "objective-space" as the space of all linear combinations of two objectives, and the Dynamic-Net is emulating the traversing of this objective-space at test-time, without any further training. We show that this upgrades pre-trained networks by providing an out-of-learning extension, while maintaining the performance quality. The solution we propose is fast and allows a user to interactively modify the network, in real-time, in order to obtain the result he/she desires. We show the benefits of such an approach via several different applications.

Submitted 25 August, 2019; v1 submitted 21 November, 2018; originally announced November 2018.

Comments: version update

arXiv:1811.08126 [pdf, other]

Adversarial Feedback Loop

Authors: Firas Shama, Roey Mechrez, Alon Shoshan, Lihi Zelnik-Manor

Abstract: Thanks to their remarkable generative capabilities, GANs have gained great popularity, and are used abundantly in state-of-the-art methods and applications. In a GAN based model, a discriminator is trained to learn the real data distribution. To date, it has been used only for training purposes, where it's utilized to train the generator to provide real-looking outputs. In this paper we propose a novel method that makes an explicit use of the discriminator in test-time, in a feedback manner in order to improve the generator results. To the best of our knowledge it is the first time a discriminator is involved in test-time. We claim that the discriminator holds significant information on the real data distribution, that could be useful for test-time as well, a potential that has not been explored before. The approach we propose does not alter the conventional training stage. At test-time, however, it transfers the output from the generator into the discriminator, and uses feedback modules (convolutional blocks) to translate the features of the discriminator layers into corrections to the features of the generator layers, which are used eventually to get a better generator result. Our method can contribute to both conditional and unconditional GANs. As demonstrated by our experiments, it can improve the results of state-of-the-art networks for super-resolution, and image generation.

Submitted 20 November, 2018; originally announced November 2018.

arXiv:1809.07517 [pdf, other]

The 2018 PIRM Challenge on Perceptual Image Super-resolution

Authors: Yochai Blau, Roey Mechrez, Radu Timofte, Tomer Michaeli, Lihi Zelnik-Manor

Abstract: This paper reports on the 2018 PIRM challenge on perceptual super-resolution (SR), held in conjunction with the Perceptual Image Restoration and Manipulation (PIRM) workshop at ECCV 2018. In contrast to previous SR challenges, our evaluation methodology jointly quantifies accuracy and perceptual quality, therefore enabling perceptual-driven methods to compete alongside algorithms that target PSNR maximization. Twenty-one participating teams introduced algorithms which well-improved upon the existing state-of-the-art methods in perceptual SR, as confirmed by a human opinion study. We also analyze popular image quality measures and draw conclusions regarding which of them correlates best with human opinion scores. We conclude with an analysis of the current trends in perceptual SR, as reflected from the leading submissions.

Submitted 31 January, 2019; v1 submitted 20 September, 2018; originally announced September 2018.

Comments: Workshop and Challenge on Perceptual Image Restoration and Manipulation in conjunction with ECCV 2018 webpage: https://www.pirm2018.org/

Journal ref: Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018

arXiv:1803.04626 [pdf, other]

Maintaining Natural Image Statistics with the Contextual Loss

Authors: Roey Mechrez, Itamar Talmi, Firas Shama, Lihi Zelnik-Manor

Abstract: Maintaining natural image statistics is a crucial factor in restoration and generation of realistic looking images. When training CNNs, photorealism is usually attempted by adversarial training (GAN), that pushes the output images to lie on the manifold of natural images. GANs are very powerful, but not perfect. They are hard to train and the results still often suffer from artifacts. In this paper we propose a complementary approach, that could be applied with or without GAN, whose goal is to train a feed-forward CNN to maintain natural internal statistics. We look explicitly at the distribution of features in an image and train the network to generate images with natural feature distributions. Our approach reduces by orders of magnitude the number of images required for training and achieves state-of-the-art results on both single-image super-resolution, and high-resolution surface normal estimation.

Submitted 18 July, 2018; v1 submitted 13 March, 2018; originally announced March 2018.

arXiv:1803.02077 [pdf, other]

The Contextual Loss for Image Transformation with Non-Aligned Data

Authors: Roey Mechrez, Itamar Talmi, Lihi Zelnik-Manor

Abstract: Feed-forward CNNs trained for image transformation problems rely on loss functions that measure the similarity between the generated image and a target image. Most of the common loss functions assume that these images are spatially aligned and compare pixels at corresponding locations. However, for many tasks, aligned training pairs of images will not be available. We present an alternative loss function that does not require alignment, thus providing an effective and simple solution for a new space of problems. Our loss is based on both context and semantics -- it compares regions with similar semantic meaning, while considering the context of the entire image. Hence, for example, when transferring the style of one face to another, it will translate eyes-to-eyes and mouth-to-mouth. Our code can be found at https://www.github.com/roimehrez/contextualLoss

Submitted 18 July, 2018; v1 submitted 6 March, 2018; originally announced March 2018.

Comments: ECCV Oral. Paper web page: http://cgm.technion.ac.il/Computer-Graphics-Multimedia/Software/contextual/

arXiv:1709.09828 [pdf, other]

Photorealistic Style Transfer with Screened Poisson Equation

Authors: Roey Mechrez, Eli Shechtman, Lihi Zelnik-Manor

Abstract: Recent work has shown impressive success in transferring painterly style to images. These approaches, however, fall short of photorealistic style transfer. Even when both the input and reference images are photographs, the output still exhibits distortions reminiscent of a painting. In this paper we propose an approach that takes as input a stylized image and makes it more photorealistic. It relies on the Screened Poisson Equation, maintaining the fidelity of the stylized image while constraining the gradients to those of the original input image. Our method is fast, simple, fully automatic and shows positive progress in making a stylized image photorealistic. Our results exhibit finer details and are less prone to artifacts than the state-of-the-art.

Submitted 28 September, 2017; originally announced September 2017.

Comments: presented in BMVC 2017

arXiv:1612.02190 [pdf, other]

Template Matching with Deformable Diversity Similarity

Authors: Itamar Talmi, Roey Mechrez, Lihi Zelnik-Manor

Abstract: We propose a novel measure for template matching named Deformable Diversity Similarity -- based on the diversity of feature matches between a target image window and the template. We rely on both local appearance and geometric information that jointly lead to a powerful approach for matching. Our key contribution is a similarity measure, that is robust to complex deformations, significant background clutter, and occlusions. Empirical evaluation on the most up-to-date benchmark shows that our method outperforms the current state-of-the-art in its detection accuracy while improving computational complexity.

Submitted 18 April, 2017; v1 submitted 7 December, 2016; originally announced December 2016.

Comments: accepted to CVPR2017 (spotlight)

arXiv:1612.02184 [pdf, other]

Saliency Driven Image Manipulation

Authors: Roey Mechrez, Eli Shechtman, Lihi Zelnik-Manor

Abstract: Have you ever taken a picture only to find out that an unimportant background object ended up being overly salient? Or one of those team sports photos where your favorite player blends with the rest? Wouldn't it be nice if you could tweak these pictures just a little bit so that the distractor would be attenuated and your favorite player will stand-out among her peers? Manipulating images in order to control the saliency of objects is the goal of this paper. We propose an approach that considers the internal color and saliency properties of the image. It changes the saliency map via an optimization framework that relies on patch-based manipulation using only patches from within the same image to achieve realistic looking results. Applications include object enhancement, distractors attenuation and background decluttering. Comparing our method to previous ones shows significant improvement, both in the achieved saliency manipulation and in the realistic appearance of the resulting images.

Submitted 17 January, 2018; v1 submitted 7 December, 2016; originally announced December 2016.

Comments: to appear in WACV'18

arXiv:1508.07953 [pdf, other]

Approximate Nearest Neighbor Fields in Video

Authors: Nir Ben-Zrihem, Lihi Zelnik-Manor

Abstract: We introduce RIANN (Ring Intersection Approximate Nearest Neighbor search), an algorithm for matching patches of a video to a set of reference patches in real-time. For each query, RIANN finds potential matches by intersecting rings around key points in appearance space. Its search complexity is reversely correlated to the amount of temporal change, making it a good fit for videos, where typically most patches change slowly with time. Experiments show that RIANN is up to two orders of magnitude faster than previous ANN methods, and is the only solution that operates in real-time. We further demonstrate how RIANN can be used for real-time video processing and provide examples for a range of real-time video applications, including colorization, denoising, and several artistic effects.

Submitted 31 August, 2015; originally announced August 2015.

Comments: A CVPR 2015 oral paper

arXiv:1204.3367 [pdf, other]

Crowdsourcing Gaze Data Collection

Authors: Dmitry Rudoy, Dan B. Goldman, Eli Shechtman, Lihi Zelnik-Manor

Abstract: Knowing where people look is a useful tool in many various image and video applications. However, traditional gaze tracking hardware is expensive and requires local study participants, so acquiring gaze location data from a large number of participants is very problematic. In this work we propose a crowdsourced method for acquisition of gaze direction data from a virtually unlimited number of participants, using a robust self-reporting mechanism (see Figure 1). Our system collects temporally sparse but spatially dense points-of-attention in any visual information. We apply our approach to an existing video data set and demonstrate that we obtain results similar to traditional gaze tracking. We also explore the parameter ranges of our method, and collect gaze tracking data for a large set of YouTube videos.

Submitted 16 April, 2012; originally announced April 2012.

Comments: Presented at Collective Intelligence conference, 2012 (arXiv:1204.2991)

Report number: CollectiveIntelligence/2012/106

arXiv:1009.1533 [pdf, ps, other]

Sensing Matrix Optimization for Block-Sparse Decoding

Authors: Kevin Rosenblum, Lihi Zelnik-Manor, Yonina C. Eldar

Abstract: Recent work has demonstrated that using a carefully designed sensing matrix rather than a random one, can improve the performance of compressed sensing. In particular, a well-designed sensing matrix can reduce the coherence between the atoms of the equivalent dictionary, and as a consequence, reduce the reconstruction error. In some applications, the signals of interest can be well approximated by a union of a small number of subspaces (e.g., face recognition and motion segmentation). This implies the existence of a dictionary which leads to block-sparse representations. In this work, we propose a framework for sensing matrix design that improves the ability of block-sparse approximation techniques to reconstruct and classify signals. This method is based on minimizing a weighted sum of the inter-block coherence and the sub-block coherence of the equivalent dictionary. Our experiments show that the proposed algorithm significantly improves signal recovery and classification ability of the Block-OMP algorithm compared to sensing matrix optimization methods that do not employ block structure.

Submitted 8 September, 2010; originally announced September 2010.

arXiv:1005.0202 [pdf, ps, other]

Dictionary Optimization for Block-Sparse Representations

Authors: Kevin Rosenblum, Lihi Zelnik-Manor, Yonina C. Eldar

Abstract: Recent work has demonstrated that using a carefully designed dictionary instead of a predefined one, can improve the sparsity in jointly representing a class of signals. This has motivated the derivation of learning methods for designing a dictionary which leads to the sparsest representation for a given set of signals. In some applications, the signals of interest can have further structure, so that they can be well approximated by a union of a small number of subspaces (e.g., face recognition and motion segmentation). This implies the existence of a dictionary which enables block-sparse representations of the input signals once its atoms are properly sorted into blocks. In this paper, we propose an algorithm for learning a block-sparsifying dictionary of a given set of signals. We do not require prior knowledge on the association of signals into groups (subspaces). Instead, we develop a method that automatically detects the underlying block structure. This is achieved by iteratively alternating between updating the block structure of the dictionary and updating the dictionary atoms to better fit the data. Our experiments show that for block-sparse data the proposed algorithm significantly improves the dictionary recovery ability and lowers the representation error compared to dictionary learning methods that do not employ block structure.

Submitted 3 May, 2010; originally announced May 2010.

Comments: submitted to IEEE Transactions on Signal Processing

Showing 1–28 of 28 results for author: Zelnik-Manor, L