-
DCoM: Active Learning for All Learners
Authors:
Inbal Mishal,
Daphna Weinshall
Abstract:
Deep Active Learning (AL) techniques can be effective in reducing annotation costs for training deep models. However, their effectiveness in low- and high-budget scenarios seems to require different strategies, and achieving optimal results across varying budget scenarios remains a challenge. In this study, we introduce Dynamic Coverage & Margin mix (DCoM), a novel active learning approach designe…
▽ More
Deep Active Learning (AL) techniques can be effective in reducing annotation costs for training deep models. However, their effectiveness in low- and high-budget scenarios seems to require different strategies, and achieving optimal results across varying budget scenarios remains a challenge. In this study, we introduce Dynamic Coverage & Margin mix (DCoM), a novel active learning approach designed to bridge this gap. Unlike existing strategies, DCoM dynamically adjusts its strategy, considering the competence of the current model. Through theoretical analysis and empirical evaluations on diverse datasets, including challenging computer vision tasks, we demonstrate DCoM's ability to overcome the cold start problem and consistently improve results across different budgetary constraints. Thus DCoM achieves state-of-the-art performance in both low- and high-budget regimes.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
TEAL: New Selection Strategy for Small Buffers in Experience Replay Class Incremental Learning
Authors:
Shahar Shaul-Ariel,
Daphna Weinshall
Abstract:
Continual Learning is an unresolved challenge, whose relevance increases when considering modern applications. Unlike the human brain, trained deep neural networks suffer from a phenomenon called Catastrophic Forgetting, where they progressively lose previously acquired knowledge upon learning new tasks. To mitigate this problem, numerous methods have been developed, many relying on replaying past…
▽ More
Continual Learning is an unresolved challenge, whose relevance increases when considering modern applications. Unlike the human brain, trained deep neural networks suffer from a phenomenon called Catastrophic Forgetting, where they progressively lose previously acquired knowledge upon learning new tasks. To mitigate this problem, numerous methods have been developed, many relying on replaying past exemplars during new task training. However, as the memory allocated for replay decreases, the effectiveness of these approaches diminishes. On the other hand, maintaining a large memory for the purpose of replay is inefficient and often impractical. Here we introduce TEAL, a novel approach to populate the memory with exemplars, that can be integrated with various experience-replay methods and significantly enhance their performance on small memory buffers. We show that TEAL improves the average accuracy of the SOTA method XDER as well as ER and ER-ACE on several image recognition benchmarks, with a small memory buffer of 1-3 exemplars per class in the final task. This confirms the hypothesis that when memory is scarce, it is best to prioritize the most typical data.
△ Less
Submitted 30 June, 2024;
originally announced July 2024.
-
Relearning Forgotten Knowledge: on Forgetting, Overfit and Training-Free Ensembles of DNNs
Authors:
Uri Stern,
Daphna Weinshall
Abstract:
The infrequent occurrence of overfit in deep neural networks is perplexing. On the one hand, theory predicts that as models get larger they should eventually become too specialized for a specific training set, with ensuing decrease in generalization. In contrast, empirical results in image classification indicate that increasing the training time of deep models or using bigger models almost never…
▽ More
The infrequent occurrence of overfit in deep neural networks is perplexing. On the one hand, theory predicts that as models get larger they should eventually become too specialized for a specific training set, with ensuing decrease in generalization. In contrast, empirical results in image classification indicate that increasing the training time of deep models or using bigger models almost never hurts generalization. Is it because the way we measure overfit is too limited? Here, we introduce a novel score for quantifying overfit, which monitors the forgetting rate of deep models on validation data. Presumably, this score indicates that even while generalization improves overall, there are certain regions of the data space where it deteriorates. When thus measured, we show that overfit can occur with and without a decrease in validation accuracy, and may be more common than previously appreciated. This observation may help to clarify the aforementioned confusing picture. We use our observations to construct a new ensemble method, based solely on the training history of a single network, which provides significant improvement in performance without any additional cost in training time. An extensive empirical evaluation with modern deep models shows our method's utility on multiple datasets, neural networks architectures and training schemes, both when training from scratch and when using pre-trained networks in transfer learning. Notably, our method outperforms comparable methods while being easier to implement and use, and further improves the performance of competitive networks on Imagenet by 1%.
△ Less
Submitted 28 December, 2023; v1 submitted 17 October, 2023;
originally announced October 2023.
-
United We Stand: Using Epoch-wise Agreement of Ensembles to Combat Overfit
Authors:
Uri Stern,
Daniel Shwartz,
Daphna Weinshall
Abstract:
Deep neural networks have become the method of choice for solving many classification tasks, largely because they can fit very complex functions defined over raw data. The downside of such powerful learners is the danger of overfit. In this paper, we introduce a novel ensemble classifier for deep networks that effectively overcomes overfitting by combining models generated at specific intermediate…
▽ More
Deep neural networks have become the method of choice for solving many classification tasks, largely because they can fit very complex functions defined over raw data. The downside of such powerful learners is the danger of overfit. In this paper, we introduce a novel ensemble classifier for deep networks that effectively overcomes overfitting by combining models generated at specific intermediate epochs during training. Our method allows for the incorporation of useful knowledge obtained by the models during the overfitting phase without deterioration of the general performance, which is usually missed when early stop** is used. To motivate this approach, we begin with the theoretical analysis of a regression model, whose prediction -- that the variance among classifiers increases when overfit occurs -- is demonstrated empirically in deep networks in common use. Guided by these results, we construct a new ensemble-based prediction method, where the prediction is determined by the class that attains the most consensual prediction throughout the training epochs. Using multiple image and text classification datasets, we show that when regular ensembles suffer from overfit, our method eliminates the harmful reduction in generalization due to overfit, and often even surpasses the performance obtained by early stop**. Our method is easy to implement and can be integrated with any training scheme and architecture, without additional prior knowledge beyond the training set. It is thus a practical and useful tool to overcome overfit. Code is available at https://github.com/uristern123/United-We-Stand-Using-Epoch-wise-Agreement-of-Ensembles-to-Combat-Overfit.
△ Less
Submitted 28 December, 2023; v1 submitted 17 October, 2023;
originally announced October 2023.
-
Semi-Supervised Learning in the Few-Shot Zero-Shot Scenario
Authors:
Noam Fluss,
Guy Hacohen,
Daphna Weinshall
Abstract:
Semi-Supervised Learning (SSL) is a framework that utilizes both labeled and unlabeled data to enhance model performance. Conventional SSL methods operate under the assumption that labeled and unlabeled data share the same label space. However, in practical real-world scenarios, especially when the labeled training dataset is limited in size, some classes may be totally absent from the labeled set…
▽ More
Semi-Supervised Learning (SSL) is a framework that utilizes both labeled and unlabeled data to enhance model performance. Conventional SSL methods operate under the assumption that labeled and unlabeled data share the same label space. However, in practical real-world scenarios, especially when the labeled training dataset is limited in size, some classes may be totally absent from the labeled set. To address this broader context, we propose a general approach to augment existing SSL methods, enabling them to effectively handle situations where certain classes are missing. This is achieved by introducing an additional term into their objective function, which penalizes the KL-divergence between the probability vectors of the true class frequencies and the inferred class frequencies. Our experimental results reveal significant improvements in accuracy when compared to state-of-the-art SSL, open-set SSL, and open-world SSL methods. We conducted these experiments on two benchmark image classification datasets, CIFAR-100 and STL-10, with the most remarkable improvements observed when the labeled data is severely limited, with only a few labeled examples per class
△ Less
Submitted 15 November, 2023; v1 submitted 27 August, 2023;
originally announced August 2023.
-
Pruning the Unlabeled Data to Improve Semi-Supervised Learning
Authors:
Guy Hacohen,
Daphna Weinshall
Abstract:
In the domain of semi-supervised learning (SSL), the conventional approach involves training a learner with a limited amount of labeled data alongside a substantial volume of unlabeled data, both drawn from the same underlying distribution. However, for deep learning models, this standard practice may not yield optimal results. In this research, we propose an alternative perspective, suggesting th…
▽ More
In the domain of semi-supervised learning (SSL), the conventional approach involves training a learner with a limited amount of labeled data alongside a substantial volume of unlabeled data, both drawn from the same underlying distribution. However, for deep learning models, this standard practice may not yield optimal results. In this research, we propose an alternative perspective, suggesting that distributions that are more readily separable could offer superior benefits to the learner as compared to the original distribution. To achieve this, we present PruneSSL, a practical technique for selectively removing examples from the original unlabeled dataset to enhance its separability. We present an empirical study, showing that although PruneSSL reduces the quantity of available training data for the learner, it significantly improves the performance of various competitive SSL algorithms, thereby achieving state-of-the-art results across several image classification tasks.
△ Less
Submitted 27 August, 2023;
originally announced August 2023.
-
How to Select Which Active Learning Strategy is Best Suited for Your Specific Problem and Budget
Authors:
Guy Hacohen,
Daphna Weinshall
Abstract:
In the domain of Active Learning (AL), a learner actively selects which unlabeled examples to seek labels from an oracle, while operating within predefined budget constraints. Importantly, it has been recently shown that distinct query strategies are better suited for different conditions and budgetary constraints. In practice, the determination of the most appropriate AL strategy for a given situ…
▽ More
In the domain of Active Learning (AL), a learner actively selects which unlabeled examples to seek labels from an oracle, while operating within predefined budget constraints. Importantly, it has been recently shown that distinct query strategies are better suited for different conditions and budgetary constraints. In practice, the determination of the most appropriate AL strategy for a given situation remains an open problem. To tackle this challenge, we propose a practical derivative-based method that dynamically identifies the best strategy for a given budget. Intuitive motivation for our approach is provided by the theoretical analysis of a simplified scenario. We then introduce a method to dynamically select an AL strategy, which takes into account the unique characteristics of the problem and the available budget. Empirical results showcase the effectiveness of our approach across diverse budgets and computer vision tasks.
△ Less
Submitted 23 October, 2023; v1 submitted 6 June, 2023;
originally announced June 2023.
-
The Dynamic of Consensus in Deep Networks and the Identification of Noisy Labels
Authors:
Daniel Shwartz,
Uri Stern,
Daphna Weinshall
Abstract:
Deep neural networks have incredible capacity and expressibility, and can seemingly memorize any training set. This introduces a problem when training in the presence of noisy labels, as the noisy examples cannot be distinguished from clean examples by the end of training. Recent research has dealt with this challenge by utilizing the fact that deep networks seem to memorize clean examples much ea…
▽ More
Deep neural networks have incredible capacity and expressibility, and can seemingly memorize any training set. This introduces a problem when training in the presence of noisy labels, as the noisy examples cannot be distinguished from clean examples by the end of training. Recent research has dealt with this challenge by utilizing the fact that deep networks seem to memorize clean examples much earlier than noisy examples. Here we report a new empirical result: for each example, when looking at the time it has been memorized by each model in an ensemble of networks, the diversity seen in noisy examples is much larger than the clean examples. We use this observation to develop a new method for noisy labels filtration. The method is based on a statistics of the data, which captures the differences in ensemble learning dynamics between clean and noisy data. We test our method on three tasks: (i) noise amount estimation; (ii) noise filtration; (iii) supervised classification. We show that our method improves over existing baselines in all three tasks using a variety of datasets, noise models, and noise levels. Aside from its improved performance, our method has two other advantages. (i) Simplicity, which implies that no additional hyperparameters are introduced. (ii) Our method is modular: it does not work in an end-to-end fashion, and can therefore be used to clean a dataset for any other future usage.
△ Less
Submitted 2 October, 2022;
originally announced October 2022.
-
Active Learning Through a Covering Lens
Authors:
Ofer Yehuda,
Avihu Dekel,
Guy Hacohen,
Daphna Weinshall
Abstract:
Deep active learning aims to reduce the annotation cost for the training of deep models, which is notoriously data-hungry. Until recently, deep active learning methods were ineffectual in the low-budget regime, where only a small number of examples are annotated. The situation has been alleviated by recent advances in representation and self-supervised learning, which impart the geometry of the da…
▽ More
Deep active learning aims to reduce the annotation cost for the training of deep models, which is notoriously data-hungry. Until recently, deep active learning methods were ineffectual in the low-budget regime, where only a small number of examples are annotated. The situation has been alleviated by recent advances in representation and self-supervised learning, which impart the geometry of the data representation with rich information about the points. Taking advantage of this progress, we study the problem of subset selection for annotation through a "covering" lens, proposing ProbCover - a new active learning algorithm for the low budget regime, which seeks to maximize Probability Coverage. We then describe a dual way to view the proposed formulation, from which one can derive strategies suitable for the high budget regime of active learning, related to existing methods like Coreset. We conclude with extensive experiments, evaluating ProbCover in the low-budget regime. We show that our principled active learning strategy improves the state-of-the-art in the low-budget regime in several image recognition benchmarks. This method is especially beneficial in the semi-supervised setting, allowing state-of-the-art semi-supervised methods to match the performance of fully supervised methods, while using much fewer labels nonetheless. Code is available at https://github.com/avihu111/TypiClust.
△ Less
Submitted 29 December, 2022; v1 submitted 23 May, 2022;
originally announced May 2022.
-
Active Learning on a Budget: Opposite Strategies Suit High and Low Budgets
Authors:
Guy Hacohen,
Avihu Dekel,
Daphna Weinshall
Abstract:
Investigating active learning, we focus on the relation between the number of labeled examples (budget size), and suitable querying strategies. Our theoretical analysis shows a behavior reminiscent of phase transition: typical examples are best queried when the budget is low, while unrepresentative examples are best queried when the budget is large. Combined evidence shows that a similar phenomeno…
▽ More
Investigating active learning, we focus on the relation between the number of labeled examples (budget size), and suitable querying strategies. Our theoretical analysis shows a behavior reminiscent of phase transition: typical examples are best queried when the budget is low, while unrepresentative examples are best queried when the budget is large. Combined evidence shows that a similar phenomenon occurs in common classification models. Accordingly, we propose TypiClust -- a deep active learning strategy suited for low budgets. In a comparative empirical investigation of supervised learning, using a variety of architectures and image datasets, TypiClust outperforms all other active learning strategies in the low-budget regime. Using TypiClust in the semi-supervised framework, performance gets an even more significant boost. In particular, state-of-the-art semi-supervised methods trained on CIFAR-10 with 10 labeled examples selected by TypiClust, reach 93.2% accuracy -- an improvement of 39.4% over random selection. Code is available at https://github.com/avihu111/TypiClust.
△ Less
Submitted 16 June, 2022; v1 submitted 6 February, 2022;
originally announced February 2022.
-
The Grammar-Learning Trajectories of Neural Language Models
Authors:
Leshem Choshen,
Guy Hacohen,
Daphna Weinshall,
Omri Abend
Abstract:
The learning trajectories of linguistic phenomena in humans provide insight into linguistic representation, beyond what can be gleaned from inspecting the behavior of an adult speaker. To apply a similar approach to analyze neural language models (NLM), it is first necessary to establish that different models are similar enough in the generalizations they make. In this paper, we show that NLMs wit…
▽ More
The learning trajectories of linguistic phenomena in humans provide insight into linguistic representation, beyond what can be gleaned from inspecting the behavior of an adult speaker. To apply a similar approach to analyze neural language models (NLM), it is first necessary to establish that different models are similar enough in the generalizations they make. In this paper, we show that NLMs with different initialization, architecture, and training data acquire linguistic phenomena in a similar order, despite their different end performance. These findings suggest that there is some mutual inductive bias that underlies these models' learning of linguistic phenomena. Taking inspiration from psycholinguistics, we argue that studying this inductive bias is an opportunity to study the linguistic representation implicit in NLMs.
Leveraging these findings, we compare the relative performance on different phenomena at varying learning stages with simpler reference models. Results suggest that NLMs exhibit consistent "developmental" stages. Moreover, we find the learning trajectory to be approximately one-dimensional: given an NLM with a certain overall performance, it is possible to predict what linguistic generalizations it has already acquired. Initial analysis of these stages presents phenomena clusters (notably morphological ones), whose performance progresses in unison, suggesting a potential link between the generalizations behind them.
△ Less
Submitted 6 April, 2022; v1 submitted 13 September, 2021;
originally announced September 2021.
-
Principal Components Bias in Over-parameterized Linear Models, and its Manifestation in Deep Neural Networks
Authors:
Guy Hacohen,
Daphna Weinshall
Abstract:
Recent work suggests that convolutional neural networks of different architectures learn to classify images in the same order. To understand this phenomenon, we revisit the over-parametrized deep linear network model. Our analysis reveals that, when the hidden layers are wide enough, the convergence rate of this model's parameters is exponentially faster along the directions of the larger principa…
▽ More
Recent work suggests that convolutional neural networks of different architectures learn to classify images in the same order. To understand this phenomenon, we revisit the over-parametrized deep linear network model. Our analysis reveals that, when the hidden layers are wide enough, the convergence rate of this model's parameters is exponentially faster along the directions of the larger principal components of the data, at a rate governed by the corresponding singular values. We term this convergence pattern the Principal Components bias (PC-bias). Empirically, we show how the PC-bias streamlines the order of learning of both linear and non-linear networks, more prominently at earlier stages of learning. We then compare our results to the simplicity bias, showing that both biases can be seen independently, and affect the order of learning in different ways. Finally, we discuss how the PC-bias may explain some benefits of early stop** and its connection to PCA, and why deep networks converge more slowly with random labels.
△ Less
Submitted 8 June, 2022; v1 submitted 12 May, 2021;
originally announced May 2021.
-
More Is More -- Narrowing the Generalization Gap by Adding Classification Heads
Authors:
Roee Cates,
Daphna Weinshall
Abstract:
Overfit is a fundamental problem in machine learning in general, and in deep learning in particular. In order to reduce overfit and improve generalization in the classification of images, some employ invariance to a group of transformations, such as rotations and reflections. However, since not all objects exhibit necessarily the same invariance, it seems desirable to allow the network to learn th…
▽ More
Overfit is a fundamental problem in machine learning in general, and in deep learning in particular. In order to reduce overfit and improve generalization in the classification of images, some employ invariance to a group of transformations, such as rotations and reflections. However, since not all objects exhibit necessarily the same invariance, it seems desirable to allow the network to learn the useful level of invariance from the data. To this end, motivated by self-supervision, we introduce an architecture enhancement for existing neural network models based on input transformations, termed 'TransNet', together with a training algorithm suitable for it. Our model can be employed during training time only and then pruned for prediction, resulting in an equivalent architecture to the base model. Thus pruned, we show that our model improves performance on various data-sets while exhibiting improved generalization, which is achieved in turn by enforcing soft invariance on the convolutional kernels of the last layer in the base model. Theoretical analysis is provided to support the proposed method.
△ Less
Submitted 11 February, 2021; v1 submitted 9 February, 2021;
originally announced February 2021.
-
Boosting the Performance of Semi-Supervised Learning with Unsupervised Clustering
Authors:
Boaz Lerner,
Guy Shiran,
Daphna Weinshall
Abstract:
Recently, Semi-Supervised Learning (SSL) has shown much promise in leveraging unlabeled data while being provided with very few labels. In this paper, we show that ignoring the labels altogether for whole epochs intermittently during training can significantly improve performance in the small sample regime. More specifically, we propose to train a network on two tasks jointly. The primary classifi…
▽ More
Recently, Semi-Supervised Learning (SSL) has shown much promise in leveraging unlabeled data while being provided with very few labels. In this paper, we show that ignoring the labels altogether for whole epochs intermittently during training can significantly improve performance in the small sample regime. More specifically, we propose to train a network on two tasks jointly. The primary classification task is exposed to both the unlabeled and the scarcely annotated data, whereas the secondary task seeks to cluster the data without any labels. As opposed to hand-crafted pretext tasks frequently used in self-supervision, our clustering phase utilizes the same classification network and head in an attempt to relax the primary task and propagate the information from the labels without overfitting them. On top of that, the self-supervised technique of classifying image rotations is incorporated during the unsupervised learning phase to stabilize training. We demonstrate our method's efficacy in boosting several state-of-the-art SSL algorithms, significantly improving their results and reducing running time in various standard semi-supervised benchmarks, including 92.6% accuracy on CIFAR-10 and 96.9% on SVHN, using only 4 labels per class in each task. We also notably improve the results in the extreme cases of 1,2 and 3 labels per class, and show that features learned by our model are more meaningful for separating the data.
△ Less
Submitted 1 December, 2020;
originally announced December 2020.
-
Multiclass non-Adversarial Image Synthesis, with Application to Classification from Very Small Sample
Authors:
Itamar Winter,
Daphna Weinshall
Abstract:
The generation of synthetic images is currently being dominated by Generative Adversarial Networks (GANs). Despite their outstanding success in generating realistic looking images, they still suffer from major drawbacks, including an unstable and highly sensitive training procedure, mode-collapse and mode-mixture, and dependency on large training sets. In this work we present a novel non-adversari…
▽ More
The generation of synthetic images is currently being dominated by Generative Adversarial Networks (GANs). Despite their outstanding success in generating realistic looking images, they still suffer from major drawbacks, including an unstable and highly sensitive training procedure, mode-collapse and mode-mixture, and dependency on large training sets. In this work we present a novel non-adversarial generative method - Clustered Optimization of LAtent space (COLA), which overcomes some of the limitations of GANs, and outperforms GANs when training data is scarce. In the full data regime, our method is capable of generating diverse multi-class images with no supervision, surpassing previous non-adversarial methods in terms of image quality and diversity. In the small-data regime, where only a small sample of labeled images is available for training with no access to additional unlabeled data, our results surpass state-of-the-art GAN models trained on the same amount of data. Finally, when utilizing our model to augment small datasets, we surpass the state-of-the-art performance in small-sample classification tasks on challenging datasets, including CIFAR-10, CIFAR-100, STL-10 and Tiny-ImageNet. A theoretical analysis supporting the essence of the method is presented.
△ Less
Submitted 1 December, 2020; v1 submitted 25 November, 2020;
originally announced November 2020.
-
Generative Latent Implicit Conditional Optimization when Learning from Small Sample
Authors:
Idan Azuri,
Daphna Weinshall
Abstract:
We revisit the long-standing problem of learning from a small sample, to which end we propose a novel method called GLICO (Generative Latent Implicit Conditional Optimization). GLICO learns a map** from the training examples to a latent space and a generator that generates images from vectors in the latent space. Unlike most recent works, which rely on access to large amounts of unlabeled data,…
▽ More
We revisit the long-standing problem of learning from a small sample, to which end we propose a novel method called GLICO (Generative Latent Implicit Conditional Optimization). GLICO learns a map** from the training examples to a latent space and a generator that generates images from vectors in the latent space. Unlike most recent works, which rely on access to large amounts of unlabeled data, GLICO does not require access to any additional data other than the small set of labeled points. In fact, GLICO learns to synthesize completely new samples for every class using as little as 5 or 10 examples per class, with as few as 10 such classes without imposing any prior. GLICO is then used to augment the small training set while training a classifier on the small sample. To this end, our proposed method samples the learned latent space using spherical interpolation, and generates new examples using the trained generator. Empirical results show that the new sampled set is diverse enough, leading to improvement in image classification in comparison with the state of the art, when trained on small samples obtained from CIFAR-10, CIFAR-100, and CUB-200.
△ Less
Submitted 15 December, 2020; v1 submitted 31 March, 2020;
originally announced March 2020.
-
Multi-Modal Deep Clustering: Unsupervised Partitioning of Images
Authors:
Guy Shiran,
Daphna Weinshall
Abstract:
The clustering of unlabeled raw images is a daunting task, which has recently been approached with some success by deep learning methods. Here we propose an unsupervised clustering framework, which learns a deep neural network in an end-to-end fashion, providing direct cluster assignments of images without additional processing. Multi-Modal Deep Clustering (MMDC), trains a deep network to align it…
▽ More
The clustering of unlabeled raw images is a daunting task, which has recently been approached with some success by deep learning methods. Here we propose an unsupervised clustering framework, which learns a deep neural network in an end-to-end fashion, providing direct cluster assignments of images without additional processing. Multi-Modal Deep Clustering (MMDC), trains a deep network to align its image embeddings with target points sampled from a Gaussian Mixture Model distribution. The cluster assignments are then determined by mixture component association of image embeddings. Simultaneously, the same deep network is trained to solve an additional self-supervised task of predicting image rotations. This pushes the network to learn more meaningful image representations that facilitate a better clustering. Experimental results show that MMDC achieves or exceeds state-of-the-art performance on six challenging benchmarks. On natural image datasets we improve on previous results with significant margins of up to 20% absolute accuracy points, yielding an accuracy of 82% on CIFAR-10, 45% on CIFAR-100 and 69% on STL-10.
△ Less
Submitted 15 December, 2020; v1 submitted 5 December, 2019;
originally announced December 2019.
-
Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets
Authors:
Guy Hacohen,
Leshem Choshen,
Daphna Weinshall
Abstract:
We report a series of robust empirical observations, demonstrating that deep Neural Networks learn the examples in both the training and test sets in a similar order. This phenomenon is observed in all the commonly used benchmarks we evaluated, including many image classification benchmarks, and one text classification benchmark. While this phenomenon is strongest for models of the same architectu…
▽ More
We report a series of robust empirical observations, demonstrating that deep Neural Networks learn the examples in both the training and test sets in a similar order. This phenomenon is observed in all the commonly used benchmarks we evaluated, including many image classification benchmarks, and one text classification benchmark. While this phenomenon is strongest for models of the same architecture, it also crosses architectural boundaries -- models of different architectures start by learning the same examples, after which the more powerful model may continue to learn additional examples. We further show that this pattern of results reflects the interplay between the way neural networks learn benchmark datasets. Thus, when fixing the architecture, we show synthetic datasets where this pattern ceases to exist. When fixing the dataset, we show that other learning paradigms may learn the data in a different order. We hypothesize that our results reflect how neural networks discover structure in natural datasets.
△ Less
Submitted 20 July, 2020; v1 submitted 26 May, 2019;
originally announced May 2019.
-
On The Power of Curriculum Learning in Training Deep Networks
Authors:
Guy Hacohen,
Daphna Weinshall
Abstract:
Training neural networks is traditionally done by providing a sequence of random mini-batches sampled uniformly from the entire training data. In this work, we analyze the effect of curriculum learning, which involves the non-uniform sampling of mini-batches, on the training of deep networks, and specifically CNNs trained for image recognition. To employ curriculum learning, the training algorithm…
▽ More
Training neural networks is traditionally done by providing a sequence of random mini-batches sampled uniformly from the entire training data. In this work, we analyze the effect of curriculum learning, which involves the non-uniform sampling of mini-batches, on the training of deep networks, and specifically CNNs trained for image recognition. To employ curriculum learning, the training algorithm must resolve 2 problems: (i) sort the training examples by difficulty; (ii) compute a series of mini-batches that exhibit an increasing level of difficulty. We address challenge (i) using two methods: transfer learning from some competitive ``teacher" network, and bootstrap**. In our empirical evaluation, both methods show similar benefits in terms of increased learning speed and improved final performance on test data. We address challenge (ii) by investigating different pacing functions to guide the sampling. The empirical investigation includes a variety of network architectures, using images from CIFAR-10, CIFAR-100 and subsets of ImageNet. We conclude with a novel theoretical analysis of curriculum learning, where we show how it effectively modifies the optimization landscape. We then define the concept of an ideal curriculum, and show that under mild conditions it does not change the corresponding global minimum of the optimization function.
△ Less
Submitted 29 May, 2019; v1 submitted 7 April, 2019;
originally announced April 2019.
-
Blurred Images Lead to Bad Local Minima
Authors:
Gal Katzhendler,
Daphna Weinshall
Abstract:
Blurred Images Lead to Bad Local Minima
Blurred Images Lead to Bad Local Minima
△ Less
Submitted 30 January, 2019;
originally announced January 2019.
-
Theory of Curriculum Learning, with Convex Loss Functions
Authors:
Daphna Weinshall,
Dan Amir
Abstract:
Curriculum Learning - the idea of teaching by gradually exposing the learner to examples in a meaningful order, from easy to hard, has been investigated in the context of machine learning long ago. Although methods based on this concept have been empirically shown to improve performance of several learning algorithms, no theoretical analysis has been provided even for simple cases. To address this…
▽ More
Curriculum Learning - the idea of teaching by gradually exposing the learner to examples in a meaningful order, from easy to hard, has been investigated in the context of machine learning long ago. Although methods based on this concept have been empirically shown to improve performance of several learning algorithms, no theoretical analysis has been provided even for simple cases. To address this shortfall, we start by formulating an ideal definition of difficulty score - the loss of the optimal hypothesis at a given datapoint. We analyze the possible contribution of curriculum learning based on this score in two convex problems - linear regression, and binary classification by hinge loss minimization. We show that in both cases, the expected convergence rate decreases monotonically with the ideal difficulty score, in accordance with earlier empirical results. We also prove that when the ideal difficulty score is fixed, the convergence rate is monotonically increasing with respect to the loss of the current hypothesis at each point. We discuss how these results bring to term two apparently contradicting heuristics: curriculum learning on the one hand, and hard data mining on the other.
△ Less
Submitted 9 December, 2018;
originally announced December 2018.
-
Gaussian Mixture Generative Adversarial Networks for Diverse Datasets, and the Unsupervised Clustering of Images
Authors:
Matan Ben-Yosef,
Daphna Weinshall
Abstract:
Generative Adversarial Networks (GANs) have been shown to produce realistically looking synthetic images with remarkable success, yet their performance seems less impressive when the training set is highly diverse. In order to provide a better fit to the target data distribution when the dataset includes many different classes, we propose a variant of the basic GAN model, called Gaussian Mixture G…
▽ More
Generative Adversarial Networks (GANs) have been shown to produce realistically looking synthetic images with remarkable success, yet their performance seems less impressive when the training set is highly diverse. In order to provide a better fit to the target data distribution when the dataset includes many different classes, we propose a variant of the basic GAN model, called Gaussian Mixture GAN (GM-GAN), where the probability distribution over the latent space is a mixture of Gaussians. We also propose a supervised variant which is capable of conditional sample synthesis. In order to evaluate the model's performance, we propose a new scoring method which separately takes into account two (typically conflicting) measures - diversity vs. quality of the generated data. Through a series of empirical experiments, using both synthetic and real-world datasets, we quantitatively show that GM-GANs outperform baselines, both when evaluated using the commonly used Inception Score, and when evaluated using our own alternative scoring method. In addition, we qualitatively demonstrate how the \textit{unsupervised} variant of GM-GAN tends to map latent vectors sampled from different Gaussians in the latent space to samples of different classes in the data space. We show how this phenomenon can be exploited for the task of unsupervised clustering, and provide quantitative evaluation showing the superiority of our method for the unsupervised clustering of image datasets. Finally, we demonstrate a feature which further sets our model apart from other GAN models: the option to control the quality-diversity trade-off by altering, post-training, the probability distribution of the latent space. This allows one to sample higher quality and lower diversity samples, or vice versa, according to one's needs.
△ Less
Submitted 30 August, 2018;
originally announced August 2018.
-
Curriculum Learning by Transfer Learning: Theory and Experiments with Deep Networks
Authors:
Daphna Weinshall,
Gad Cohen,
Dan Amir
Abstract:
We provide theoretical investigation of curriculum learning in the context of stochastic gradient descent when optimizing the convex linear regression loss. We prove that the rate of convergence of an ideal curriculum learning method is monotonically increasing with the difficulty of the examples. Moreover, among all equally difficult points, convergence is faster when using points which incur hig…
▽ More
We provide theoretical investigation of curriculum learning in the context of stochastic gradient descent when optimizing the convex linear regression loss. We prove that the rate of convergence of an ideal curriculum learning method is monotonically increasing with the difficulty of the examples. Moreover, among all equally difficult points, convergence is faster when using points which incur higher loss with respect to the current hypothesis. We then analyze curriculum learning in the context of training a CNN. We describe a method which infers the curriculum by way of transfer learning from another network, pre-trained on a different task. While this approach can only approximate the ideal curriculum, we observe empirically similar behavior to the one predicted by the theory, namely, a significant boost in convergence speed at the beginning of training. When the task is made more difficult, improvement in generalization performance is also observed. Finally, curriculum learning exhibits robustness against unfavorable conditions such as excessive regularization.
△ Less
Submitted 8 June, 2018; v1 submitted 11 February, 2018;
originally announced February 2018.
-
Distance-based Confidence Score for Neural Network Classifiers
Authors:
Amit Mandelbaum,
Daphna Weinshall
Abstract:
The reliable measurement of confidence in classifiers' predictions is very important for many applications and is, therefore, an important part of classifier design. Yet, although deep learning has received tremendous attention in recent years, not much progress has been made in quantifying the prediction confidence of neural network classifiers. Bayesian models offer a mathematically grounded fra…
▽ More
The reliable measurement of confidence in classifiers' predictions is very important for many applications and is, therefore, an important part of classifier design. Yet, although deep learning has received tremendous attention in recent years, not much progress has been made in quantifying the prediction confidence of neural network classifiers. Bayesian models offer a mathematically grounded framework to reason about model uncertainty, but usually come with prohibitive computational costs. In this paper we propose a simple, scalable method to achieve a reliable confidence score, based on the data embedding derived from the penultimate layer of the network. We investigate two ways to achieve desirable embeddings, by using either a distance-based loss or Adversarial Training. We then test the benefits of our method when used for classification error prediction, weighting an ensemble of classifiers, and novelty detection. In all tasks we show significant improvement over traditional, commonly used confidence scores.
△ Less
Submitted 28 September, 2017;
originally announced September 2017.
-
Every Untrue Label is Untrue in its Own Way: Controlling Error Type with the Log Bilinear Loss
Authors:
Yehezkel S. Resheff,
Amit Mandelbaum,
Daphna Weinshall
Abstract:
Deep learning has become the method of choice in many application domains of machine learning in recent years, especially for multi-class classification tasks. The most common loss function used in this context is the cross-entropy loss, which reduces to the log loss in the typical case when there is a single correct response label. While this loss is insensitive to the identity of the assigned cl…
▽ More
Deep learning has become the method of choice in many application domains of machine learning in recent years, especially for multi-class classification tasks. The most common loss function used in this context is the cross-entropy loss, which reduces to the log loss in the typical case when there is a single correct response label. While this loss is insensitive to the identity of the assigned class in the case of misclassification, in practice it is often the case that some errors may be more detrimental than others. Here we present the bilinear-loss (and related log-bilinear-loss) which differentially penalizes the different wrong assignments of the model. We thoroughly test this method using standard models and benchmark image datasets. As one application, we show the ability of this method to better contain error within the correct super-class, in the hierarchically labeled CIFAR100 dataset, without affecting the overall performance of the classifier.
△ Less
Submitted 20 April, 2017;
originally announced April 2017.
-
Novelty Detection in MultiClass Scenarios with Incomplete Set of Class Labels
Authors:
Nomi Vinokurov,
Daphna Weinshall
Abstract:
We address the problem of novelty detection in multiclass scenarios where some class labels are missing from the training set. Our method is based on the initial assignment of confidence values, which measure the affinity between a new test point and each known class. We first compare the values of the two top elements in this vector of confidence values. In the heart of our method lies the traini…
▽ More
We address the problem of novelty detection in multiclass scenarios where some class labels are missing from the training set. Our method is based on the initial assignment of confidence values, which measure the affinity between a new test point and each known class. We first compare the values of the two top elements in this vector of confidence values. In the heart of our method lies the training of an ensemble of classifiers, each trained to discriminate known from novel classes based on some partition of the training data into presumed-known and presumednovel classes. Our final novelty score is derived from the output of this ensemble of classifiers. We evaluated our method on two datasets of images containing a relatively large number of classes - the Caltech-256 and Cifar-100 datasets. We compared our method to 3 alternative methods which represent commonly used approaches, including the one-class SVM, novelty based on k-NN, novelty based on maximal confidence, and the recent KNFST method. The results show a very clear and marked advantage for our method over all alternative methods, in an experimental setup where class labels are missing during training.
△ Less
Submitted 15 May, 2016; v1 submitted 21 April, 2016;
originally announced April 2016.
-
Topic Modeling of Behavioral Modes Using Sensor Data
Authors:
Yehezkel S. Resheff,
Shay Rotics,
Ran Nathan,
Daphna Weinshall
Abstract:
The field of Movement Ecology, like so many other fields, is experiencing a period of rapid growth in availability of data. As the volume rises, traditional methods are giving way to machine learning and data science, which are playing an increasingly large part it turning this data into science-driving insights. One rich and interesting source is the bio-logger. These small electronic wearable de…
▽ More
The field of Movement Ecology, like so many other fields, is experiencing a period of rapid growth in availability of data. As the volume rises, traditional methods are giving way to machine learning and data science, which are playing an increasingly large part it turning this data into science-driving insights. One rich and interesting source is the bio-logger. These small electronic wearable devices are attached to animals free to roam in their natural habitats, and report back readings from multiple sensors, including GPS and accelerometer bursts. A common use of accelerometer data is for supervised learning of behavioral modes. However, we need unsupervised analysis tools as well, in order to overcome the inherent difficulties of obtaining a labeled dataset, which in some cases is either infeasible or does not successfully encompass the full repertoire of behavioral modes of interest. Here we present a matrix factorization based topic-model method for accelerometer bursts, derived using a linear mixture property of patch features. Our method is validated via comparison to a labeled dataset, and is further compared to standard clustering algorithms.
△ Less
Submitted 16 November, 2015;
originally announced November 2015.
-
A Cheap System for Vehicle Speed Detection
Authors:
Chaim Ginzburg,
Amit Raphael,
Daphna Weinshall
Abstract:
The reliable detection of speed of moving vehicles is considered key to traffic law enforcement in most countries, and is seen by many as an important tool to reduce the number of traffic accidents and fatalities. Many automatic systems and different methods are employed in different countries, but as a rule they tend to be expensive and/or labor intensive, often employing outdated technology due…
▽ More
The reliable detection of speed of moving vehicles is considered key to traffic law enforcement in most countries, and is seen by many as an important tool to reduce the number of traffic accidents and fatalities. Many automatic systems and different methods are employed in different countries, but as a rule they tend to be expensive and/or labor intensive, often employing outdated technology due to the long development time. Here we describe a speed detection system that relies on simple everyday equipment - a laptop and a consumer web camera. Our method is based on tracking the license plates of cars, which gives the relative movement of the cars in the image. This image displacement is translated to actual motion by using the method of projection to a reference plane, where the reference plane is the road itself. However, since license plates do not touch the road, we must compensate for the entailed distortion in speed measurement. We show how to compute the compensation factor using knowledge of the license plate standard dimensions. Consequently our system computes the true speed of moving vehicles fast and accurately. We show promising results on videos obtained in a number of scenes and with different car models.
△ Less
Submitted 27 January, 2015;
originally announced January 2015.