Search | arXiv e-print repository

Implicit Bias of Policy Gradient in Linear Quadratic Control: Extrapolation to Unseen Initial States

Authors: Noam Razin, Yotam Alexander, Edo Cohen-Karlik, Raja Giryes, Amir Globerson, Nadav Cohen

Abstract: In modern machine learning, models can often fit training data in numerous ways, some of which perform well on unseen (test) data, while others do not. Remarkably, in such cases gradient descent frequently exhibits an implicit bias that leads to excellent performance on unseen data. This implicit bias was extensively studied in supervised learning, but is far less understood in optimal control (reinforcement learning). There, learning a controller applied to a system via gradient descent is known as policy gradient, and a question of prime importance is the extent to which a learned controller extrapolates to unseen initial states. This paper theoretically studies the implicit bias of policy gradient in terms of extrapolation to unseen initial states. Focusing on the fundamental Linear Quadratic Regulator (LQR) problem, we establish that the extent of extrapolation depends on the degree of exploration induced by the system when commencing from initial states included in training. Experiments corroborate our theory, and demonstrate its conclusions on problems beyond LQR, where systems are non-linear and controllers are neural networks. We hypothesize that real-world optimal control may be greatly improved by developing methods for informed selection of initial states to train on.

Submitted 1 June, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

Comments: Accepted to ICML 2024

arXiv:2310.20703 [pdf, other]

Vanishing Gradients in Reinforcement Finetuning of Language Models

Authors: Noam Razin, Hattie Zhou, Omid Saremi, Vimal Thilak, Arwen Bradley, Preetum Nakkiran, Joshua Susskind, Etai Littwin

Abstract: Pretrained language models are commonly aligned with human preferences and downstream tasks via reinforcement finetuning (RFT), which refers to maximizing a (possibly learned) reward function using policy gradient algorithms. This work identifies a fundamental optimization obstacle in RFT: we prove that the expected gradient for an input vanishes when its reward standard deviation under the model is small, even if the expected reward is far from optimal. Through experiments on an RFT benchmark and controlled environments, as well as a theoretical analysis, we then demonstrate that vanishing gradients due to small reward standard deviation are prevalent and detrimental, leading to extremely slow reward maximization. Lastly, we explore ways to overcome vanishing gradients in RFT. We find the common practice of an initial supervised finetuning (SFT) phase to be the most promising candidate, which sheds light on its importance in an RFT pipeline. Moreover, we show that a relatively small number of SFT optimization steps on as few as 1% of the input samples can suffice, indicating that the initial SFT phase need not be expensive in terms of compute and data labeling efforts. Overall, our results emphasize that being mindful for inputs whose expected gradient vanishes, as measured by the reward standard deviation, is crucial for successful execution of RFT.

Submitted 14 March, 2024; v1 submitted 31 October, 2023; originally announced October 2023.

Comments: Accepted to ICLR 2024

arXiv:2310.16028 [pdf, other]

What Algorithms can Transformers Learn? A Study in Length Generalization

Authors: Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Josh Susskind, Samy Bengio, Preetum Nakkiran

Abstract: Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. This raises the question of if and when Transformer models can learn the true algorithm for solving a task. We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks. Here, we propose a unifying framework to understand when and how Transformers can exhibit strong length generalization on a given task. Specifically, we leverage RASP (Weiss et al., 2021) -- a programming language designed for the computational model of a Transformer -- and introduce the RASP-Generalization Conjecture: Transformers tend to length generalize on a task if the task can be solved by a short RASP program which works for all input lengths. This simple conjecture remarkably captures most known instances of length generalization on algorithmic tasks. Moreover, we leverage our insights to drastically improve generalization performance on traditionally hard tasks (such as parity and addition). On the theoretical side, we give a simple example where the "min-degree-interpolator" model of learning from Abbe et al. (2023) does not correctly predict Transformers' out-of-distribution behavior, but our conjecture does. Overall, our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.

Submitted 24 October, 2023; originally announced October 2023.

Comments: Preprint

arXiv:2303.11249 [pdf, other]

What Makes Data Suitable for a Locally Connected Neural Network? A Necessary and Sufficient Condition Based on Quantum Entanglement

Authors: Yotam Alexander, Nimrod De La Vega, Noam Razin, Nadav Cohen

Abstract: The question of what makes a data distribution suitable for deep learning is a fundamental open problem. Focusing on locally connected neural networks (a prevalent family of architectures that includes convolutional and recurrent neural networks as well as local self-attention models), we address this problem by adopting theoretical tools from quantum physics. Our main theoretical result states that a certain locally connected neural network is capable of accurate prediction over a data distribution if and only if the data distribution admits low quantum entanglement under certain canonical partitions of features. As a practical application of this result, we derive a preprocessing method for enhancing the suitability of a data distribution to locally connected neural networks. Experiments with widespread models over various datasets demonstrate our findings. We hope that our use of quantum entanglement will encourage further adoption of tools from physics for formally reasoning about the relation between deep learning and real-world data.

Submitted 21 January, 2024; v1 submitted 20 March, 2023; originally announced March 2023.

Comments: Accepted to NeurIPS 2023

arXiv:2211.16494 [pdf, other]

On the Ability of Graph Neural Networks to Model Interactions Between Vertices

Authors: Noam Razin, Tom Verbin, Nadav Cohen

Abstract: Graph neural networks (GNNs) are widely used for modeling complex interactions between entities represented as vertices of a graph. Despite recent efforts to theoretically analyze the expressive power of GNNs, a formal characterization of their ability to model interactions is lacking. The current paper aims to address this gap. Formalizing strength of interactions through an established measure known as separation rank, we quantify the ability of certain GNNs to model interaction between a given subset of vertices and its complement, i.e. between the sides of a given partition of input vertices. Our results reveal that the ability to model interaction is primarily determined by the partition's walk index -- a graph-theoretical characteristic defined by the number of walks originating from the boundary of the partition. Experiments with common GNN architectures corroborate this finding. As a practical application of our theory, we design an edge sparsification algorithm named Walk Index Sparsification (WIS), which preserves the ability of a GNN to model interactions when input edges are removed. WIS is simple, computationally efficient, and in our experiments has markedly outperformed alternative methods in terms of induced prediction accuracy. More broadly, it showcases the potential of improving GNNs by theoretically analyzing the interactions they can model.

Submitted 23 October, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: Accepted to NeurIPS 2023

arXiv:2201.11729 [pdf, other]

Implicit Regularization in Hierarchical Tensor Factorization and Deep Convolutional Neural Networks

Authors: Noam Razin, Asaf Maman, Nadav Cohen

Abstract: In the pursuit of explaining implicit regularization in deep learning, prominent focus was given to matrix and tensor factorizations, which correspond to simplified neural networks. It was shown that these models exhibit an implicit tendency towards low matrix and tensor ranks, respectively. Drawing closer to practical deep learning, the current paper theoretically analyzes the implicit regularization in hierarchical tensor factorization, a model equivalent to certain deep convolutional neural networks. Through a dynamical systems lens, we overcome challenges associated with hierarchy, and establish implicit regularization towards low hierarchical tensor rank. This translates to an implicit regularization towards locality for the associated convolutional networks. Inspired by our theory, we design explicit regularization discouraging locality, and demonstrate its ability to improve the performance of modern convolutional networks on non-local tasks, in defiance of conventional wisdom by which architectural changes are needed. Our work highlights the potential of enhancing neural networks via theoretical analysis of their implicit regularization.

Submitted 18 September, 2022; v1 submitted 27 January, 2022; originally announced January 2022.

Comments: Accepted to ICML 2022

arXiv:2102.09972 [pdf, other]

Implicit Regularization in Tensor Factorization

Authors: Noam Razin, Asaf Maman, Nadav Cohen

Abstract: Recent efforts to unravel the mystery of implicit regularization in deep learning have led to a theoretical focus on matrix factorization -- matrix completion via linear neural network. As a step further towards practical deep learning, we provide the first theoretical analysis of implicit regularization in tensor factorization -- tensor completion via certain type of non-linear neural network. We circumvent the notorious difficulty of tensor problems by adopting a dynamical systems perspective, and characterizing the evolution induced by gradient descent. The characterization suggests a form of greedy low tensor rank search, which we rigorously prove under certain conditions, and empirically demonstrate under others. Motivated by tensor rank capturing the implicit regularization of a non-linear neural network, we empirically explore it as a measure of complexity, and find that it captures the essence of datasets on which neural networks generalize. This leads us to believe that tensor rank may pave way to explaining both implicit regularization in deep learning, and the properties of real-world data translating this implicit regularization to generalization.

Submitted 9 June, 2021; v1 submitted 19 February, 2021; originally announced February 2021.

Comments: Accepted to ICML 2021

arXiv:2009.13292 [pdf, other]

RecoBERT: A Catalog Language Model for Text-Based Recommendations

Authors: Itzik Malkiel, Oren Barkan, Avi Caciularu, Noam Razin, Ori Katz, Noam Koenigstein

Abstract: Language models that utilize extensive self-supervised pre-training from unlabeled text, have recently shown to significantly advance the state-of-the-art performance in a variety of language understanding tasks. However, it is yet unclear if and how these recent models can be harnessed for conducting text-based recommendations. In this work, we introduce RecoBERT, a BERT-based approach for learning catalog-specialized language models for text-based item recommendations. We suggest novel training and inference procedures for scoring similarities between pairs of items, that don't require item similarity labels. Both the training and the inference techniques were designed to utilize the unlabeled structure of textual catalogs, and minimize the discrepancy between them. By incorporating four scores during inference, RecoBERT can infer text-based item-to-item similarities more accurately than other techniques. In addition, we introduce a new language understanding task for wine recommendations using similarities based on professional wine reviews. As an additional contribution, we publish annotated recommendations dataset crafted by human wine experts. Finally, we evaluate RecoBERT and compare it to various state-of-the-art NLP models on wine and fashion recommendations tasks.

Submitted 25 September, 2020; originally announced September 2020.

arXiv:2008.08088 [pdf, other]

doi 10.1103/PhysRevE.102.030103

The entropy production of an active particle in a box

Authors: Nitzan Razin

Abstract: A run-and-tumble particle in a one dimensional box (infinite potential well) is studied. The steady state is analytically solved and analyzed, revealing the emergent length scale of the boundary layer where particles accumulate near the walls. The mesoscopic steady state entropy production rate of the system is derived from coupled Fokker-Planck equations with a linear reaction term, resulting in an exact analytic expression. The entropy production density is shown to peak at the walls. Additionally, the derivative of the entropy production rate peaks at a system size proportional to the length scale of the accumulation boundary layer, suggesting that the behavior of the entropy production rate and its derivatives as a function of the control parameter may signify a qualitative behavior change in the physics of active systems, such as phase transitions.

Submitted 7 September, 2020; v1 submitted 18 August, 2020; originally announced August 2020.

Comments: 8 pages, 5 figures

Journal ref: Phys. Rev. E 102, 030103 (2020)

arXiv:2005.06398 [pdf, other]

Implicit Regularization in Deep Learning May Not Be Explainable by Norms

Authors: Noam Razin, Nadav Cohen

Abstract: Mathematically characterizing the implicit regularization induced by gradient-based optimization is a longstanding pursuit in the theory of deep learning. A widespread hope is that a characterization based on minimization of norms may apply, and a standard test-bed for studying this prospect is matrix factorization (matrix completion via linear neural networks). It is an open question whether norms can explain the implicit regularization in matrix factorization. The current paper resolves this open question in the negative, by proving that there exist natural matrix factorization problems on which the implicit regularization drives all norms (and quasi-norms) towards infinity. Our results suggest that, rather than perceiving the implicit regularization via norms, a potentially more useful interpretation is minimization of rank. We demonstrate empirically that this interpretation extends to a certain class of non-linear neural networks, and hypothesize that it may be key to explaining generalization in deep learning.

Submitted 17 October, 2020; v1 submitted 13 May, 2020; originally announced May 2020.

arXiv:1908.05161 [pdf]

Scalable Attentive Sentence-Pair Modeling via Distilled Sentence Embedding

Authors: Oren Barkan, Noam Razin, Itzik Malkiel, Ori Katz, Avi Caciularu, Noam Koenigstein

Abstract: Recent state-of-the-art natural language understanding models, such as BERT and XLNet, score a pair of sentences (A and B) using multiple cross-attention operations - a process in which each word in sentence A attends to all words in sentence B and vice versa. As a result, computing the similarity between a query sentence and a set of candidate sentences, requires the propagation of all query-candidate sentence-pairs throughout a stack of cross-attention layers. This exhaustive process becomes computationally prohibitive when the number of candidate sentences is large. In contrast, sentence embedding techniques learn a sentence-to-vector mapping and compute the similarity between the sentence vectors via simple elementary operations. In this paper, we introduce Distilled Sentence Embedding (DSE) - a model that is based on knowledge distillation from cross-attentive models, focusing on sentence-pair tasks. The outline of DSE is as follows: Given a cross-attentive teacher model (e.g. a fine-tuned BERT), we train a sentence embedding based student model to reconstruct the sentence-pair scores obtained by the teacher model. We empirically demonstrate the effectiveness of DSE on five GLUE sentence-pair tasks. DSE significantly outperforms several ELMO variants and other sentence embedding methods, while accelerating computation of the query-candidate sentence-pairs similarities by several orders of magnitude, with an average relative degradation of 4.6% compared to BERT. Furthermore, we show that DSE produces sentence embeddings that reach state-of-the-art performance on universal sentence representation benchmarks. Our code is made publicly available at https://github.com/microsoft/Distilled-Sentence-Embedding.

Submitted 21 November, 2019; v1 submitted 14 August, 2019; originally announced August 2019.

Comments: In Proceedings of AAAI 2020

arXiv:1806.08921 [pdf, other]

doi 10.1103/PhysRevE.99.022419

Signatures of motor susceptibility in the dynamics of a tracer particle in an active gel

Authors: Nitzan Razin, Raphael Voituriez, Nir S. Gov

Abstract: We study a model for the motion of a tracer particle inside an active gel, exposing the properties of the van Hove distribution of the particle displacements. Active events of a typical force magnitude give rise to non-Gaussian distributions, having exponential tails or side-peaks. The side-peaks appear when the local bulk elasticity of the gel is large enough and few active sources are dominant. We explain the regimes of the different distributions, and study the structure of the peaks for active sources that are susceptible to the elastic stress that they cause inside the gel. We show how the van Hove distribution is altered by both the duty cycle of the active sources and their susceptibility, and suggest it as a sensitive probe to analyze microrheology data in active systems with restoring elastic forces.

Submitted 23 June, 2018; originally announced June 2018.

Comments: 4 pages, 4 figures and supplemental information (5 pages, 4 figures)

Journal ref: Phys. Rev. E 99, 022419 (2019)

arXiv:1708.05370 [pdf, other]

doi 10.1103/PhysRevE.96.052409

Forces in inhomogeneous open active-particle systems

Authors: Nitzan Razin, Raphael Voituriez, Jens Elgeti, Nir S. Gov

Abstract: We study the force that non-interacting point-like active particles apply to a symmetric inert object in the presence of a gradient of activity and particle sources and sinks. We consider two simple patterns of sources and sinks that are common in biological systems. We analytically solve a one dimensional model designed to emulate higher dimensional systems, and study a two dimensional model by numerical simulations. We specify when the particle flux due to the creation and annihilation of particles can act to smooth the density profile that is induced by a gradient in the velocity of the active particles, and find the net resultant force due to both the gradient in activity and the particle flux. These results are compared qualitatively to observations of nuclear motion inside the oocyte, that is driven by a gradient in activity of actin-coated vesicles.

Submitted 4 March, 2018; v1 submitted 17 August, 2017; originally announced August 2017.

Comments: 14 pages, 8 figures

Journal ref: Phys. Rev. E 96, 052409 (2017)

arXiv:1703.07359 [pdf, other]

doi 10.1103/PhysRevE.96.032606

Generalized Archimedes' principle in active fluids

Authors: Nitzan Razin, Raphael Voituriez, Jens Elgeti, Nir S. Gov

Abstract: We show how a gradient in the motility properties of non-interacting point-like active particles can cause a pressure gradient that pushes a large inert object. We calculate the force on an object inside a system of active particles with position dependent motion parameters, in one and two dimensions, and show that a modified Archimedes' principle is satisfied. We characterize the system, both in terms of the model parameters and in terms of experimentally measurable quantities: the spatial profiles of the density, velocity and pressure. This theoretical analysis is motivated by recent experiments, which showed that the nucleus of a mouse oocyte (immature egg cell) moves from the cortex to the center due to a gradient of activity of vesicles propelled by molecular motors; it more generally applies to artificial systems of controlled localized activity.

Submitted 29 September, 2017; v1 submitted 21 March, 2017; originally announced March 2017.

Comments: 16 pages, 9 figures

Journal ref: Phys. Rev. E 96, 032606 (2017)

Showing 1–14 of 14 results for author: Razin, N