Distributed Speculative Inference of Large Language Models

Authors: Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel

Abstract: Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs, requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups (compared to non-SI) but require a fast and accurate drafter LLM. In practice, off-the-shelf LLMs often do not have matching drafters that are sufficiently fast and accurate. We show a gap: SI gets slower than non-SI when using slower or less accurate drafters. We close this gap by proving that DSI is faster than both SI and non-SI given any drafters. By orchestrating multiple instances of the target and drafters, DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI. Our simulations show speedups of off-the-shelf LLMs in realistic settings: DSI is 1.29-1.92x faster than SI.

Submitted 28 June, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

arXiv:2306.01610 [pdf, other]

Centered Self-Attention Layers

Authors: Ameen Ali, Tomer Galanti, Lior Wolf

Abstract: The self-attention mechanism in transformers and the message-passing mechanism in graph neural networks are repeatedly applied within deep learning architectures. We show that this application inevitably leads to oversmoothing, i.e., to similar representations at the deeper layers for different tokens in transformers and different nodes in graph neural networks. Based on our analysis, we present a correction term to the aggregating operator of these mechanisms. Empirically, this simple term eliminates much of the oversmoothing problem in visual transformers, obtaining performance in weakly supervised segmentation that surpasses elaborate baseline methods that introduce multiple auxiliary networks and training phases. In graph neural networks, the correction term enables the training of very deep architectures more effectively than many recent solutions to the same problem.

Submitted 2 June, 2023; originally announced June 2023.

arXiv:2305.15614 [pdf, other]

Reverse Engineering Self-Supervised Learning

Authors: Ido Ben-Shaul, Ravid Shwartz-Ziv, Tomer Galanti, Shai Dekel, Yann LeCun

Abstract: Self-supervised learning (SSL) is a powerful tool in machine learning, but understanding the learned representations and their underlying mechanisms remains a challenge. This paper presents an in-depth empirical analysis of SSL-trained representations, encompassing diverse models, architectures, and hyperparameters. Our study reveals an intriguing aspect of the SSL training process: it inherently facilitates the clustering of samples with respect to semantic labels, which is surprisingly driven by the SSL objective's regularization term. This clustering process not only enhances downstream classification but also compresses the data information. Furthermore, we establish that SSL-trained representations align more closely with semantic classes than with random classes. Remarkably, we show that learned representations align with semantic classes across various hierarchical levels, and this alignment increases during training and when moving deeper into the network. Our findings provide valuable insights into SSL's representation learning mechanisms and their impact on performance across different sets of classes.

Submitted 31 May, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

arXiv:2303.13093 [pdf, other]

Type-II Saddles and Probabilistic Stability of Stochastic Gradient Descent

Authors: Liu Ziyin, Botao Li, Tomer Galanti, Masahito Ueda

Abstract: Characterizing and understanding the dynamics of stochastic gradient descent (SGD) around saddle points remains an open problem. We first show that saddle points in neural networks can be divided into two types, among which the Type-II saddles are especially difficult to escape from because the gradient noise vanishes at the saddle. The dynamics of SGD around these saddles are thus to leading order described by a random matrix product process, and it is thus natural to study the dynamics of SGD around these saddles using the notion of probabilistic stability and the related Lyapunov exponent. Theoretically, we link the study of SGD dynamics to well-known concepts in ergodic theory, which we leverage to show that saddle points can be either attractive or repulsive for SGD, and its dynamics can be classified into four different phases, depending on the signal-to-noise ratio in the gradient close to the saddle.

Submitted 2 July, 2024; v1 submitted 23 March, 2023; originally announced March 2023.

Comments: preprint

arXiv:2301.12033 [pdf, other]

Norm-based Generalization Bounds for Compositionally Sparse Neural Networks

Authors: Tomer Galanti, Mengjia Xu, Liane Galanti, Tomaso Poggio

Abstract: In this paper, we investigate the Rademacher complexity of deep sparse neural networks, where each neuron receives a small number of inputs. We prove generalization bounds for multilayered sparse ReLU neural networks, including convolutional neural networks. These bounds differ from previous ones, as they consider the norms of the convolutional filters instead of the norms of the associated Toeplitz matrices, independently of weight sharing between neurons. As we show theoretically, these bounds may be orders of magnitude better than standard norm-based generalization bounds and empirically, they are almost non-vacuous in estimating generalization in various simple classification problems. Taken together, these results suggest that compositional sparsity of the underlying target function is critical to the success of deep neural networks.

Submitted 27 January, 2023; originally announced January 2023.

arXiv:2301.04605 [pdf, ps, other]

Exploring the Approximation Capabilities of Multiplicative Neural Networks for Smooth Functions

Authors: Ido Ben-Shaul, Tomer Galanti, Shai Dekel

Abstract: Multiplication layers are a key component in various influential neural network modules, including self-attention and hypernetwork layers. In this paper, we investigate the approximation capabilities of deep neural networks with intermediate neurons connected by simple multiplication operations. We consider two classes of target functions: generalized bandlimited functions, which are frequently used to model real-world signals with finite bandwidth, and Sobolev-Type balls, which are embedded in the Sobolev Space $\mathcal{W}^{r,2}$. Our results demonstrate that multiplicative neural networks can approximate these functions with significantly fewer layers and neurons compared to standard ReLU neural networks, with respect to both input dimension and approximation error. These findings suggest that multiplicative gates can outperform standard feed-forward layers and have potential for improving neural network design.

Submitted 11 January, 2023; originally announced January 2023.

MSC Class: 41A25; 68Q32; 68T07

arXiv:2212.12532 [pdf, other]

Generalization Bounds for Few-Shot Transfer Learning with Pretrained Classifiers

Authors: Tomer Galanti, András György, Marcus Hutter

Abstract: We study the ability of foundation models to learn representations for classification that are transferable to new, unseen classes. Recent results in the literature show that representations learned by a single classifier over many classes are competitive on few-shot learning problems with representations learned by special-purpose algorithms designed for such problems. We offer a theoretical explanation for this behavior based on the recently discovered phenomenon of class-feature-variability collapse, that is, that during the training of deep classification networks the feature embeddings of samples belonging to the same class tend to concentrate around their class means. More specifically, we show that the few-shot error of the learned feature map on new classes (defined as the classification error of the nearest class-center classifier using centers learned from a small number of random samples from each new class) is small in case of class-feature-variability collapse, under the assumption that the classes are selected independently from a fixed distribution. This suggests that foundation models can provide feature maps that are transferable to new downstream tasks, even with very few samples; to our knowledge, this is the first performance bound for transfer-learning that is non-vacuous in the few-shot setting.

Submitted 16 July, 2023; v1 submitted 23 December, 2022; originally announced December 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2112.15121
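
The few-shot error defined in the abstract above (the error of a nearest class-center classifier whose centers are estimated from a few samples per new class) can be illustrated with a minimal sketch. The function names, the squared Euclidean distance, and the toy 2-D feature vectors are assumptions made for illustration; they are not taken from the paper.

```python
def class_centers(support):
    """Average the few labeled feature vectors available for each new class."""
    centers = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        centers[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return centers


def nearest_center(x, centers):
    """Predict the label whose center is closest (squared Euclidean distance, an assumption)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centers, key=lambda label: sq_dist(x, centers[label]))


def few_shot_error(queries, centers):
    """Fraction of (feature, label) query pairs that the classifier gets wrong."""
    wrong = sum(1 for x, y in queries if nearest_center(x, centers) != y)
    return wrong / len(queries)


# Toy features standing in for the output of a pretrained feature map (hypothetical data).
support = {"a": [[0.0, 0.1], [0.2, -0.1]], "b": [[1.0, 1.1], [0.8, 0.9]]}
centers = class_centers(support)
queries = [([0.1, 0.0], "a"), ([0.9, 1.0], "b"), ([0.2, 0.2], "a")]
error = few_shot_error(queries, centers)
print(error)
```

When the features of each class concentrate tightly around their class mean, as in this toy data, the estimated centers are close to the true means and the error is small.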

arXiv:2206.05794 [pdf, other]

Characterizing the Implicit Bias of Regularized SGD in Rank Minimization

Authors: Tomer Galanti, Zachary S. Siegel, Aparna Gupte, Tomaso Poggio

Abstract: We study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Unlike previous literature, our analysis does not rely on assumptions about the data, convergence, or optimality of the weight matrices and applies to a wide range of neural network architectures of any width or depth. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization.

Submitted 25 October, 2023; v1 submitted 12 June, 2022; originally announced June 2022.

arXiv:2202.09028 [pdf, other]

On the Implicit Bias Towards Minimal Depth of Deep Neural Networks

Authors: Tomer Galanti, Liane Galanti, Ido Ben-Shaul

Abstract: Recent results in the literature suggest that the penultimate (second-to-last) layer representations of neural networks that are trained for classification exhibit a clustering property called neural collapse (NC). We study the implicit bias of stochastic gradient descent (SGD) in favor of low-depth solutions when training deep neural networks. We characterize a notion of effective depth that measures the first layer for which sample embeddings are separable using the nearest-class center classifier. Furthermore, we hypothesize and empirically show that SGD implicitly selects neural networks of small effective depths. Secondly, while neural collapse emerges even when generalization should be impossible, we argue that the degree of separability in the intermediate layers is related to generalization. We derive a generalization bound based on comparing the effective depth of the network with the minimal depth required to fit the same dataset with partially corrupted labels. Remarkably, this bound provides non-trivial estimations of the test performance. Finally, we empirically show that the effective depth of a trained neural network monotonically increases when increasing the number of random labels in data.

Submitted 27 September, 2022; v1 submitted 18 February, 2022; originally announced February 2022.

arXiv:2112.15121 [pdf, other]

On the Role of Neural Collapse in Transfer Learning

Authors: Tomer Galanti, András György, Marcus Hutter

Abstract: We study the ability of foundation models to learn representations for classification that are transferable to new, unseen classes. Recent results in the literature show that representations learned by a single classifier over many classes are competitive on few-shot learning problems with representations learned by special-purpose algorithms designed for such problems. In this paper we provide an explanation for this behavior based on the recently observed phenomenon that the features learned by overparameterized classification networks show an interesting clustering property, called neural collapse. We demonstrate both theoretically and empirically that neural collapse generalizes to new samples from the training classes, and, more importantly, to new classes as well, allowing foundation models to provide feature maps that work well in transfer learning and, specifically, in the few-shot setting.

Submitted 3 January, 2022; v1 submitted 30 December, 2021; originally announced December 2021.

arXiv:2110.02900 [pdf, other]

Meta Internal Learning

Authors: Raphael Bensadoun, Shir Gur, Tomer Galanti, Lior Wolf

Abstract: Internal learning for single-image generation is a framework in which a generator is trained to produce novel images based on a single image. Since these models are trained on a single image, they are limited in their scale and application. To overcome these issues, we propose a meta-learning approach that enables training over a collection of images, in order to model the internal statistics of the sample image more effectively. In the presented meta-learning approach, a single-image GAN model is generated given an input image, via a convolutional feedforward hypernetwork $f$. This network is trained over a dataset of images, allowing for feature sharing among different models, and for interpolation in the space of generative models. The generated single-image model contains a hierarchy of multiple generators and discriminators. It is therefore required to train the meta-learner in an adversarial manner, which requires careful design choices that we justify by a theoretical analysis. Our results show that the models obtained are as suitable as single-image GANs for many common image applications, significantly reduce the training time per image without loss in performance, and introduce novel capabilities, such as interpolation and feedforward modeling of novel images.

Submitted 6 October, 2021; originally announced October 2021.

arXiv:2106.04180 [pdf, other]

Image2Point: 3D Point-Cloud Understanding with 2D Image Pretrained Models

Authors: Chenfeng Xu, Shijia Yang, Tomer Galanti, Bichen Wu, Xiangyu Yue, Bohan Zhai, Wei Zhan, Peter Vajda, Kurt Keutzer, Masayoshi Tomizuka

Abstract: 3D point-clouds and 2D images are different visual representations of the physical world. While human vision can understand both representations, computer vision models designed for 2D image and 3D point-cloud understanding are quite different. Our paper explores the potential of transferring 2D model architectures and weights to understand 3D point-clouds, by empirically investigating the feasibility of the transfer, the benefits of the transfer, and shedding light on why the transfer works. We discover that we can indeed use the same architecture and pretrained weights of a neural net model to understand both images and point-clouds. Specifically, we transfer the image-pretrained model to a point-cloud model by copying or inflating the weights. We find that finetuning the transformed image-pretrained models (FIP) with minimal efforts (only on input, output, and normalization layers) can achieve competitive performance on 3D point-cloud classification, beating a wide range of point-cloud models that adopt task-specific architectures and use a variety of tricks. When finetuning the whole model, the performance improves even further. Meanwhile, FIP improves data efficiency, reaching up to 10.0 top-1 accuracy percent on few-shot classification. It also speeds up the training of point-cloud models by up to 11.1x for a target accuracy (e.g., 90% accuracy). Lastly, we provide an explanation of the image to point-cloud transfer from the aspect of neural collapse. The code is available at: https://github.com/chenfengxu714/image2point.

Submitted 23 April, 2022; v1 submitted 8 June, 2021; originally announced June 2021.

Comments: The code is available at: https://github.com/chenfengxu714/image2point

arXiv:2103.11888 [pdf, other]

Weakly Supervised Recovery of Semantic Attributes

Authors: Ameen Ali, Tomer Galanti, Evgeniy Zheltonozhskiy, Chaim Baskin, Lior Wolf

Abstract: We consider the problem of the extraction of semantic attributes, supervised only with classification labels. For example, when learning to classify images of birds into species, we would like to observe the emergence of features that zoologists use to classify birds. To tackle this problem, we propose training a neural network with discrete features in the last layer, which is followed by two heads: a multi-layered perceptron (MLP) and a decision tree. Since decision trees utilize simple binary decision stumps, we expect those discrete features to obtain semantic meaning. We present a theoretical analysis as well as a practical method for learning in the intersection of two hypothesis classes. Our results on multiple benchmarks show an improved ability to extract a set of features that are highly correlated with the set of unseen attributes.

Submitted 11 June, 2021; v1 submitted 22 March, 2021; originally announced March 2021.

arXiv:2004.12361 [pdf, other]

doi 10.1007/s11263-020-01424-w

Evaluation Metrics for Conditional Image Generation

Authors: Yaniv Benny, Tomer Galanti, Sagie Benaim, Lior Wolf

Abstract: We present two new metrics for evaluating generative models in the class-conditional image generation setting. These metrics are obtained by generalizing the two most popular unconditional metrics: the Inception Score (IS) and the Fréchet Inception Distance (FID). A theoretical analysis shows the motivation behind each proposed metric and links the novel metrics to their unconditional counterparts. The link takes the form of a product in the case of IS or an upper bound in the FID case. We provide an extensive empirical evaluation, comparing the metrics to their unconditional variants and to other metrics, and utilize them to analyze existing generative models, thus providing additional insights about their performance, from unlearned classes to mode collapse.

Submitted 8 February, 2021; v1 submitted 26 April, 2020; originally announced April 2020.

Comments: To be published in the International Journal of Computer Vision
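
For orientation, the unconditional Inception Score that the abstract above generalizes is commonly written as $\exp(\mathbb{E}_x[\mathrm{KL}(p(y|x)\,\|\,p(y))])$, where $p(y)$ is the average of the predicted label distributions. The sketch below computes this standard unconditional score only; it does not implement the paper's conditional metrics. The function name and the toy probability rows are assumptions, and in practice the rows would come from a pretrained classifier such as an Inception network.

```python
import math


def inception_score(probs):
    """Unconditional Inception Score from rows of predicted class probabilities p(y|x)."""
    n = len(probs)
    k = len(probs[0])
    # Marginal label distribution p(y): average of the per-sample predictions.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    total_kl = 0.0
    for p in probs:
        # KL(p(y|x) || p(y)); zero-probability terms contribute nothing.
        total_kl += sum(pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0)
    return math.exp(total_kl / n)


# Confident and diverse predictions give a high score; uniform predictions give the minimum of 1.
confident = inception_score([[1.0, 0.0], [0.0, 1.0]])
uniform = inception_score([[0.5, 0.5], [0.5, 0.5]])
print(confident, uniform)
```

With two classes the score ranges from 1 (every prediction equals the marginal) up to 2 (each sample is classified with certainty and the classes are balanced).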

arXiv:2003.12193 [pdf, other]

On Infinite-Width Hypernetworks

Authors: Etai Littwin, Tomer Galanti, Lior Wolf, Greg Yang

Abstract: Hypernetworks are architectures that produce the weights of a task-specific primary network. A notable application of hypernetworks in the recent literature involves learning to output functional representations. In these scenarios, the hypernetwork learns a representation corresponding to the weights of a shallow MLP, which typically encodes shape or image information. While such representations have seen considerable success in practice, they remain lacking in the theoretical guarantees in the wide regime of the standard architectures. In this work, we study wide over-parameterized hypernetworks. We show that unlike typical architectures, infinitely wide hypernetworks do not guarantee convergence to a global minimum under gradient descent. We further show that convexity can be achieved by increasing the dimensionality of the hypernetwork's output, to represent wide MLPs. In the dually infinite-width regime, we identify the functional priors of these architectures by deriving their corresponding GP and NTK kernels, the latter of which we refer to as the hyperkernel. As part of this study, we make a mathematical contribution by deriving tight bounds on high order Taylor expansion terms of standard fully connected ReLU networks.

Submitted 22 February, 2021; v1 submitted 26 March, 2020; originally announced March 2020.

Comments: The first two authors contributed equally

arXiv:2002.10007 [pdf, other]

A Critical View of the Structural Causal Model

Authors: Tomer Galanti, Ofir Nabati, Lior Wolf

Abstract: In the univariate case, we show that by comparing the individual complexities of univariate cause and effect, one can identify the cause and the effect, without considering their interaction at all. In our framework, complexities are captured by the reconstruction error of an autoencoder that operates on the quantiles of the distribution. Comparing the reconstruction errors of the two autoencoders, one for each variable, is shown to perform surprisingly well on the accepted causality directionality benchmarks. Hence, the decision as to which of the two is the cause and which is the effect may not be based on causality but on complexity. In the multivariate case, where one can ensure that the complexities of the cause and effect are balanced, we propose a new adversarial training method that mimics the disentangled structure of the causal model. We prove that in the multidimensional case, such modeling is likely to fit the data only in the direction of causality. Furthermore, a uniqueness result shows that the learned model is able to identify the underlying causal and residual (noise) components. Our multidimensional method outperforms the literature methods on both synthetic and real world datasets.

Submitted 23 February, 2020; originally announced February 2020.

arXiv:2002.10006 [pdf, other]

On the Modularity of Hypernetworks

Authors: Tomer Galanti, Lior Wolf

Abstract: In the context of learning to map an input $I$ to a function $h_I:\mathcal{X}\to \mathbb{R}$, two alternative methods are compared: (i) an embedding-based method, which learns a fixed function in which $I$ is encoded as a conditioning signal $e(I)$ and the learned function takes the form $h_I(x) = q(x,e(I))$, and (ii) hypernetworks, in which the weights $\theta_I$ of the function $h_I(x) = g(x;\theta_I)$ are given by a hypernetwork $f$ as $\theta_I=f(I)$. In this paper, we define the property of modularity as the ability to effectively learn a different function for each input instance $I$. For this purpose, we adopt an expressivity perspective of this property and extend the theory of Devore et al. 1996 and provide a lower bound on the complexity (number of trainable parameters) of neural networks as function approximators, by eliminating the requirements for the approximation method to be robust. Our results are then used to compare the complexities of $q$ and $g$, showing that under certain conditions and when letting the functions $e$ and $f$ be as large as we wish, $g$ can be smaller than $q$ by orders of magnitude. This sheds light on the modularity of hypernetworks in comparison with the embedding-based method. Besides, we show that for a structured target function, the overall number of trainable parameters in a hypernetwork is smaller by orders of magnitude than the number of trainable parameters of a standard neural network and an embedding method.

Submitted 2 November, 2020; v1 submitted 23 February, 2020; originally announced February 2020.

Comments: Accepted to Advances in Neural Information Processing Systems (NeurIPS) 2020
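
The two function forms compared in the abstract above, $h_I(x) = q(x, e(I))$ and $h_I(x) = g(x; f(I))$, differ in where the dependence on $I$ enters. The toy sketch below only illustrates that structural difference; the specific linear forms chosen for `e`, `q`, `f`, and `g`, and the fixed weights `w`, are hypothetical and are not the networks studied in the paper.

```python
def e(I):
    """Hypothetical embedding: encodes the input I as a conditioning signal."""
    return [I, I * I]


def q(x, code, w=(0.5, -1.0, 2.0)):
    """Embedding-based method: one fixed function; its weights w are shared across all I."""
    return w[0] * x + w[1] * code[0] + w[2] * code[1]


def f(I):
    """Hypothetical hypernetwork: outputs the weights theta_I of the primary function."""
    return [2.0 * I, I - 1.0]


def g(x, theta):
    """Primary function whose weights are supplied per input I by the hypernetwork."""
    return theta[0] * x + theta[1]


def h_embedding(I, x):
    return q(x, e(I))


def h_hyper(I, x):
    return g(x, f(I))


print(h_embedding(3.0, 2.0), h_hyper(3.0, 2.0))
```

In the embedding form, $I$ is just an extra argument to a single function; in the hypernetwork form, $I$ selects an entirely different set of weights for $g$, which is the sense of modularity the abstract refers to.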

arXiv:2001.10460 [pdf, other]

On Random Kernels of Residual Architectures

Authors: Etai Littwin, Tomer Galanti, Lior Wolf

Abstract: We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets. Our analysis reveals that finite size residual architectures are initialized much closer to the "kernel regime" than their vanilla counterparts: in networks that do not use skip connections, convergence to the NTK requires one to fix the depth while increasing the layers' width. Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity, provided a proper initialization is used. In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed, at a rate that is independent of both the depth and scale of the weights. Our experiments validate the theoretical results and demonstrate the advantage of deep ResNets and DenseNets for kernel regression with random gradient features.

Submitted 17 June, 2020; v1 submitted 28 January, 2020; originally announced January 2020.

arXiv:2001.05207 [pdf, ps, other]

A Formal Approach to Explainability

Authors: Lior Wolf, Tomer Galanti, Tamir Hazan

Abstract: We regard explanations as a blending of the input sample and the model's output and offer a few definitions that capture various desired properties of the function that generates these explanations. We study the links between these properties and between explanation-generating functions and intermediate representations of learned models and are able to show, for example, that if the activations of a given layer are consistent with an explanation, then so are those of all subsequent layers. In addition, we study the intersection and union of explanations as a way to construct new explanations.

Submitted 15 January, 2020; originally announced January 2020.

Journal ref: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, January 2019, Pages 255-261

arXiv:2001.05026 [pdf, other]

Unsupervised Learning of the Set of Local Maxima

Authors: Lior Wolf, Sagie Benaim, Tomer Galanti

Abstract: This paper describes a new form of unsupervised learning, whose input is a set of unlabeled points that are assumed to be local maxima of an unknown value function v in an unknown subset of the vector space. Two functions are learned: (i) a set indicator c, which is a binary classifier, and (ii) a comparator function h that, given two nearby samples, predicts which sample has the higher value of the unknown function v. Loss terms are used to ensure that all training samples x are local maxima of v, according to h, and satisfy c(x)=1. Therefore, c and h provide training signals to each other: a point x' in the vicinity of x satisfies c(x)=-1 or is deemed by h to be lower in value than x. We present an algorithm, show an example where it is more efficient to use local maxima as an indicator function than to employ conventional classification, and derive a suitable generalization bound. Our experiments show that the method is able to outperform one-class classification algorithms in the task of anomaly detection and also provides an additional signal that is extracted in a completely unsupervised way.

Submitted 14 January, 2020; originally announced January 2020.

Comments: ICLR 2019

arXiv:2001.05017 [pdf, other]

Emerging Disentanglement in Auto-Encoder Based Unsupervised Image Content Transfer

Authors: Ori Press, Tomer Galanti, Sagie Benaim, Lior Wolf

Abstract: We study the problem of learning to map, in an unsupervised way, between domains A and B, such that the samples b in B contain all the information that exists in samples a in A and some additional information. For example, ignoring occlusions, B can be people with glasses, A people without, and the glasses would be the added information. When mapping a sample a from the first domain to the other domain, the missing information is replicated from an independent reference sample b in B. Thus, in the above example, we can create, for every person without glasses, a version with the glasses observed in any face image. Our solution employs a single two-pathway encoder and a single decoder for both domains. The common part of the two domains and the separate part are encoded as two vectors, and the separate part is fixed at zero for domain A. The loss terms are minimal and involve reconstruction losses for the two domains and a domain confusion term. Our analysis shows that under mild assumptions, this architecture, which is much simpler than the guided-translation methods in the literature, is enough to ensure disentanglement between the two domains. We present convincing results in a few visual domains, such as no-glasses to glasses, adding facial hair based on a reference image, etc.

Submitted 14 January, 2020; originally announced January 2020.

Journal ref: ICLR 2019

arXiv:1908.11628 [pdf, other]

Domain Intersection and Domain Difference

Authors: Sagie Benaim, Michael Khaitov, Tomer Galanti, Lior Wolf

Abstract: We present a method for recovering the shared content between two visual domains as well as the content that is unique to each domain. This allows us to map from one domain to the other, in a way in which the content that is specific for the first domain is removed and the content that is specific for the second is imported from any image in the second domain. In addition, our method enables generation of images from the intersection of the two domains as well as their union, despite having no such samples during training. The method is shown analytically to contain all the sufficient and necessary constraints. It also outperforms the literature methods in an extensive set of experiments. Our code is available at https://github.com/sagiebenaim/DomainIntersectionDifference.

Submitted 30 August, 2019; originally announced August 2019.

Journal ref: ICCV 2019

arXiv:1807.08501 [pdf, other]

Risk Bounds for Unsupervised Cross-Domain Mapping with IPMs

Authors: Tomer Galanti, Sagie Benaim, Lior Wolf

Abstract: The recent empirical success of unsupervised cross-domain mapping algorithms, between two domains that share common characteristics, is not well-supported by theoretical justifications. This lacuna is especially troubling, given the clear ambiguity in such mappings. We work with adversarial training methods based on IPMs and derive a novel risk bound, which upper bounds the risk between the learned mapping $h$ and the target mapping $y$, by a sum of three terms: (i) the risk between $h$ and the most distant alternative mapping that was learned by the same cross-domain mapping algorithm, (ii) the minimal discrepancy between the target domain and the domain obtained by applying a hypothesis $h^*$ on the samples of the source domain, where $h^*$ is a hypothesis selectable by the same algorithm. The bound is directly related to Occam's razor and encourages the selection of the minimal architecture that supports a small mapping discrepancy and (iii) an approximation error term that decreases as the complexity of the class of discriminators increases and is empirically shown to be small. The bound leads to multiple algorithmic consequences, including a method for hyperparameter selection and for early stopping in cross-domain mapping GANs. We also demonstrate a novel capability for unsupervised learning of estimating confidence in the mapping of every specific sample.

Submitted 2 November, 2020; v1 submitted 23 July, 2018; originally announced July 2018.

Comments: arXiv admin note: text overlap with arXiv:1709.00074

arXiv:1712.07886 [pdf, other]

Estimating the Success of Unsupervised Image to Image Translation

Authors: Sagie Benaim, Tomer Galanti, Lior Wolf

Abstract: While in supervised learning, the validation error is an unbiased estimator of the generalization (test) error and complexity-based generalization bounds are abundant, no such bounds exist for learning a mapping in an unsupervised way. As a result, when training GANs and specifically when using GANs for learning to map between domains in a completely unsupervised way, one is forced to select the hyperparameters and the stopping epoch by subjectively examining multiple options. We propose a novel bound for predicting the success of unsupervised cross domain mapping methods, which is motivated by the recently proposed Simplicity Principle. The bound can be applied both in expectation, for comparing hyperparameters and for selecting a stopping criterion, or per sample, in order to predict the success of a specific cross-domain translation. The utility of the bound is demonstrated in an extensive set of experiments employing multiple recent algorithms. Our code is available at https://github.com/sagiebenaim/gan_bound.

Submitted 22 March, 2018; v1 submitted 21 December, 2017; originally announced December 2017.

Comments: The first and second authors contributed equally

arXiv:1709.00074 [pdf, other]

The Role of Minimal Complexity Functions in Unsupervised Learning of Semantic Mappings

Authors: Tomer Galanti, Lior Wolf, Sagie Benaim

Abstract: We discuss the feasibility of the following learning problem: given unmatched samples from two domains and nothing else, learn a mapping between the two, which preserves semantics. Due to the lack of paired samples and without any definition of the semantic information, the problem might seem ill-posed. Specifically, in typical cases, it seems possible to build infinitely many alternative mappings from every target mapping. This apparent ambiguity stands in sharp contrast to the recent empirical success in solving this problem. We identify the abstract notion of aligning two domains in a semantic way with concrete terms of minimal relative complexity. A theoretical framework for measuring the complexity of compositions of functions is developed in order to show that it is reasonable to expect the minimal complexity mapping to be unique. The measured complexity used is directly related to the depth of the neural networks being learned and a semantically aligned mapping could then be captured simply by learning using architectures that are not much bigger than the minimal architecture. Various predictions are made based on the hypothesis that semantic alignment can be captured by the minimal mapping. These are verified extensively. In addition, a new mapping algorithm is proposed and shown to lead to better mapping results.

Submitted 15 January, 2020; v1 submitted 31 August, 2017; originally announced September 2017.

arXiv:1703.01606 [pdf, ps, other]

A Theory of Output-Side Unsupervised Domain Adaptation

Authors: Tomer Galanti, Lior Wolf

Abstract: When learning a mapping from an input space to an output space, the assumption that the sample distribution of the training data is the same as that of the test data is often violated. Unsupervised domain shift methods adapt the learned function in order to correct for this shift. Previous work has focused on utilizing unlabeled samples from the target distribution. We consider the complementary problem in which the unlabeled samples are given post mapping, i.e., we are given the outputs of the mapping of unknown samples from the shifted domain. Two other variants are also studied: the two-sided version, in which unlabeled samples are given from both the input and the output spaces, and the Domain Transfer problem, which was recently formalized. In all cases, we derive generalization bounds that employ discrepancy terms.

Submitted 5 March, 2017; originally announced March 2017.

Showing 1–26 of 26 results for author: Galanti, T