-
It's All in the Head: Representation Knowledge Distillation through Classifier Sharing
Authors:
Emanuel Ben-Baruch,
Matan Karklinsky,
Yossi Biton,
Avi Ben-Cohen,
Hussam Lawen,
Nadav Zamir
Abstract:
Representation knowledge distillation aims at transferring rich information from one model to another. Common approaches for representation distillation mainly focus on the direct minimization of distance metrics between the models' embedding vectors. Such direct methods may be limited in transferring high-order dependencies embedded in the representation vectors, or in handling the capacity gap between the teacher and student models. Moreover, in standard knowledge distillation, the teacher is trained without awareness of the student's characteristics and capacity. In this paper, we explore two mechanisms for enhancing representation distillation using classifier sharing between the teacher and student. We first investigate a simple scheme where the teacher's classifier is connected to the student backbone, acting as an additional classification head. Then, we propose a student-aware mechanism that seeks to tailor the teacher model to a student with limited capacity by training the teacher with a temporary student's head. We analyze and compare these two mechanisms and show their effectiveness on various datasets and tasks, including image classification, fine-grained classification, and face verification. In particular, we achieve state-of-the-art results for face verification on the IJB-C dataset for a MobileFaceNet model: TAR@(FAR=1e-5)=93.7%. Code is available at https://github.com/Alibaba-MIIL/HeadSharingKD.
Submitted 5 April, 2022; v1 submitted 18 January, 2022;
originally announced January 2022.
-
Multi-label Classification with Partial Annotations using Class-aware Selective Loss
Authors:
Emanuel Ben-Baruch,
Tal Ridnik,
Itamar Friedman,
Avi Ben-Cohen,
Nadav Zamir,
Asaf Noy,
Lihi Zelnik-Manor
Abstract:
Large-scale multi-label classification datasets are commonly, and perhaps inevitably, partially annotated. That is, only a small subset of labels is annotated per sample. Different methods for handling the missing labels induce different properties on the model and impact its accuracy. In this work, we analyze the partial labeling problem, then propose a solution based on two key ideas. First, un-annotated labels should be treated selectively according to two probability quantities: the class distribution in the overall dataset and the specific label likelihood for a given data sample. We propose to estimate the class distribution using a dedicated temporary model, and we show its improved efficiency over a naive estimation computed using the dataset's partial annotations. Second, during the training of the target model, we emphasize the contribution of annotated labels over originally un-annotated labels by using a dedicated asymmetric loss. With our novel approach, we achieve state-of-the-art results on the OpenImages dataset (e.g. reaching 87.3 mAP on V6). In addition, experiments conducted on LVIS and simulated-COCO demonstrate the effectiveness of our approach. Code is available at https://github.com/Alibaba-MIIL/PartialLabelingCSL.
Submitted 21 October, 2021;
originally announced October 2021.
-
PETA: Photo Albums Event Recognition using Transformers Attention
Authors:
Tamar Glaser,
Emanuel Ben-Baruch,
Gilad Sharir,
Nadav Zamir,
Asaf Noy,
Lihi Zelnik-Manor
Abstract:
In recent years, the number of personal photos captured has increased significantly, giving rise to new challenges in multi-image understanding and high-level image understanding. Event recognition in personal photo albums presents one challenging scenario where life events are recognized from a disordered collection of images, including both relevant and irrelevant images. Event recognition in images also presents the challenge of high-level image understanding, as opposed to low-level image object classification. In the absence of methods to analyze multiple inputs, previous methods adopted temporal mechanisms, including various forms of recurrent neural networks. However, their effective temporal window is local. In addition, they are not a natural choice given the disordered characteristic of photo albums. We address this gap with a tailor-made solution, combining the power of CNNs for image representation and transformers for album representation to perform global reasoning on image collections, offering a practical and efficient solution for photo albums event recognition. Our solution reaches state-of-the-art results on 3 prominent benchmarks, achieving above 90% mAP on all datasets. We further explore the related image-importance task in event recognition, demonstrating how the learned attentions correlate with the human-annotated importance for this subjective task, thus opening the door for new applications.
Submitted 26 September, 2021;
originally announced September 2021.
-
Semantic Diversity Learning for Zero-Shot Multi-label Classification
Authors:
Avi Ben-Cohen,
Nadav Zamir,
Emanuel Ben Baruch,
Itamar Friedman,
Lihi Zelnik-Manor
Abstract:
Training a neural network model for recognizing multiple labels associated with an image, including identifying unseen labels, is challenging, especially for images that portray numerous semantically diverse labels. As challenging as this task is, it is an essential task to tackle since it represents many real-world cases, such as image retrieval of natural images. We argue that using a single embedding vector to represent an image, as commonly practiced, is not sufficient to rank both relevant seen and unseen labels accurately. This study introduces an end-to-end model training for multi-label zero-shot learning that supports semantic diversity of the images and labels. We propose to use an embedding matrix having principal embedding vectors trained using a tailored loss function. In addition, during training, we suggest up-weighting in the loss function image samples presenting higher semantic diversity to encourage the diversity of the embedding matrix. Extensive experiments show that our proposed method improves the zero-shot model's quality in tag-based image retrieval achieving SoTA results on several common datasets (NUS-Wide, COCO, Open Images).
Submitted 12 May, 2021;
originally announced May 2021.
-
Asymmetric Loss For Multi-Label Classification
Authors:
Emanuel Ben-Baruch,
Tal Ridnik,
Nadav Zamir,
Asaf Noy,
Itamar Friedman,
Matan Protter,
Lihi Zelnik-Manor
Abstract:
In a typical multi-label setting, a picture contains on average few positive labels, and many negative ones. This positive-negative imbalance dominates the optimization process, and can lead to under-emphasizing gradients from positive labels during training, resulting in poor accuracy. In this paper, we introduce a novel asymmetric loss ("ASL"), which operates differently on positive and negative samples. The loss enables dynamic down-weighting and hard-thresholding of easy negative samples, while also discarding possibly mislabeled samples. We demonstrate how ASL can balance the probabilities of different samples, and how this balancing is translated to better mAP scores. With ASL, we reach state-of-the-art results on multiple popular multi-label datasets: MS-COCO, Pascal-VOC, NUS-WIDE and Open Images. We also demonstrate ASL's applicability for other tasks, such as single-label classification and object detection. ASL is effective, easy to implement, and does not increase the training time or complexity.
Implementation is available at: https://github.com/Alibaba-MIIL/ASL.
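As a rough illustration of the idea in this abstract, the sketch below computes a per-label asymmetric loss for a single predicted probability. The abstract does not give the formula, so the form used here (separate focusing exponents for positives and negatives, plus a probability margin that zeroes out very easy negatives) and the names `gamma_pos`, `gamma_neg`, and `margin` with their default values are assumptions chosen for illustration; the linked repository is the authoritative implementation.

```python
import math


def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Per-label loss for a predicted probability p and a binary target y.

    All parameter names and default values are illustrative assumptions.
    """
    if y == 1:
        # Positive label: focal-style term with its own (usually small) exponent.
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    # Negative label: shift the probability down by the margin, so negatives
    # with p <= margin contribute exactly zero (hard thresholding).
    p_m = max(p - margin, 0.0)
    # The factor p_m ** gamma_neg down-weights the remaining easy negatives.
    return -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))


if __name__ == "__main__":
    for p in (0.03, 0.2, 0.9):
        print(f"negative label, p={p}: loss={asymmetric_loss(p, 0):.6f}")
    print(f"positive label, p=0.5: loss={asymmetric_loss(0.5, 1):.6f}")
```

With these settings, a negative label predicted at or below the margin contributes nothing, a mildly confident negative contributes very little, and only confidently wrong negatives produce a large loss.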
Submitted 29 July, 2021; v1 submitted 29 September, 2020;
originally announced September 2020.
-
Probably Approximately Knowing
Authors:
Nitzan Zamir,
Yoram Moses
Abstract:
Whereas deterministic protocols are typically guaranteed to obtain particular goals of interest, probabilistic protocols typically provide only probabilistic guarantees. This paper initiates an investigation of the interdependence between actions and subjective beliefs of agents in a probabilistic setting. In particular, we study what probabilistic beliefs an agent should have when performing actions, in a protocol that satisfies a probabilistic constraint of the form: 'Condition C should hold with probability at least p when action a is performed'. Our main result is that the expected degree of an agent's belief in C when it performs a equals the probability that C holds when a is performed. Indeed, if the threshold of the probabilistic constraint should hold with probability p=1-x^2 for some small value of x then, with probability 1-x, when the agent acts it will assign a probabilistic belief no smaller than 1-x to the possibility that C holds. In other words, viewing strong belief as, intuitively, approximate knowledge, the agent must probably approximately know (PAK-know) that C is true when it acts.
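The step from the main result to the 1-x bound can be reconstructed with Markov's inequality. The abstract does not state the proof, so the derivation below is one plausible route rather than the authors' argument; the symbol B (the agent's degree of belief in C at the moment it performs a, a random variable in [0,1]) is notation introduced here, and x > 0 is assumed.

```latex
\begin{align*}
  \mathbb{E}[B] &= \Pr(C \mid a \text{ is performed}) \;\ge\; p = 1 - x^2
    && \text{(main result and the constraint)} \\
  \mathbb{E}[1 - B] &\le x^2 \\
  \Pr(B < 1 - x) &= \Pr(1 - B > x) \;\le\; \frac{\mathbb{E}[1 - B]}{x} \;\le\; \frac{x^2}{x} = x
    && \text{(Markov's inequality, } 1 - B \ge 0\text{)} \\
  \Pr(B \ge 1 - x) &\ge 1 - x
\end{align*}
```

For example, with x = 0.1 the constraint threshold is p = 0.99, and the agent holds a belief of at least 0.9 in C with probability at least 0.9 when it acts.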
Submitted 6 July, 2020;
originally announced July 2020.
-
ASAP: Architecture Search, Anneal and Prune
Authors:
Asaf Noy,
Niv Nayman,
Tal Ridnik,
Nadav Zamir,
Sivan Doveh,
Itamar Friedman,
Raja Giryes,
Lihi Zelnik-Manor
Abstract:
Automatic methods for Neural Architecture Search (NAS) have been shown to produce state-of-the-art network models. Yet, their main drawback is the computational complexity of the search process. As some primal methods optimized over a discrete search space, thousands of GPU days were required for convergence. A recent approach is based on constructing a differentiable search space that enables gradient-based optimization, which reduces the search time to a few days. While successful, it still includes some noncontinuous steps, e.g., the pruning of many weak connections at once. In this paper, we propose a differentiable search space that allows the annealing of architecture weights, while gradually pruning inferior operations. In this way, the search converges to a single output network in a continuous manner. Experiments on several vision datasets demonstrate the effectiveness of our method with respect to the search cost and accuracy of the achieved model. Specifically, with 0.2 GPU search days we achieve an error rate of 1.68% on CIFAR-10.
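The anneal-and-prune idea can be pictured with a toy example on a single edge of the search space. The sketch below is a simplification under stated assumptions, not the paper's procedure: it uses a temperature softmax over fixed architecture parameters (in an actual search these would be trained between steps), and the function names, operation names, decay schedule, and pruning threshold are all hypothetical.

```python
import math


def softmax_with_temperature(alphas, temperature):
    """Softmax over architecture parameters; lower temperature sharpens it."""
    m = max(alphas.values())  # subtract the max for numerical stability
    exps = {op: math.exp((a - m) / temperature) for op, a in alphas.items()}
    total = sum(exps.values())
    return {op: e / total for op, e in exps.items()}


def anneal_and_prune(alphas, t_start=1.0, decay=0.5, threshold=0.05, max_steps=20):
    """Lower the temperature step by step, dropping operations whose weight
    falls below the threshold, until one operation remains (or steps run out).
    """
    alphas = dict(alphas)
    temperature = t_start
    for _ in range(max_steps):
        if len(alphas) == 1:
            break
        weights = softmax_with_temperature(alphas, temperature)
        best = max(weights, key=weights.get)
        # Always keep the strongest operation so the edge never becomes empty.
        alphas = {op: a for op, a in alphas.items()
                  if op == best or weights[op] >= threshold}
        temperature *= decay
    return alphas


if __name__ == "__main__":
    edge = {"conv3x3": 1.2, "skip": 0.4, "maxpool": 0.1}  # hypothetical values
    print(anneal_and_prune(edge))
```

Because the softmax sharpens gradually, weak operations fade toward zero weight before they are removed, which is the sense in which pruning happens gradually rather than in one discrete step.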
Submitted 10 October, 2019; v1 submitted 8 April, 2019;
originally announced April 2019.