Search | arXiv e-print repository

Gradient Descent Fails to Learn High-frequency Functions and Modular Arithmetic

Authors: Rustem Takhanov, Maxat Tezekbayev, Artur Pak, Arman Bolatov, Zhenisbek Assylbekov

Abstract: Classes of target functions containing a large number of approximately orthogonal elements are known to be hard to learn by the Statistical Query algorithms. Recently this classical fact re-emerged in a theory of gradient-based optimization of neural networks. In the novel framework, the hardness of a class is usually quantified by the variance of the gradient with respect to a random choice of a… ▽ More Classes of target functions containing a large number of approximately orthogonal elements are known to be hard to learn by the Statistical Query algorithms. Recently this classical fact re-emerged in a theory of gradient-based optimization of neural networks. In the novel framework, the hardness of a class is usually quantified by the variance of the gradient with respect to a random choice of a target function. A set of functions of the form $x\to ax \bmod p$, where $a$ is taken from ${\mathbb Z}_p$, has attracted some attention from deep learning theorists and cryptographers recently. This class can be understood as a subset of $p$-periodic functions on ${\mathbb Z}$ and is tightly connected with a class of high-frequency periodic functions on the real line. We present a mathematical analysis of limitations and challenges associated with using gradient-based learning techniques to train a high-frequency periodic function or modular multiplication from examples. We highlight that the variance of the gradient is negligibly small in both cases when either a frequency or the prime base $p$ is large. This in turn prevents such a learning algorithm from being successful. △ Less

Submitted 19 October, 2023; originally announced October 2023.

arXiv:2310.01611 [pdf, other]

Intractability of Learning the Discrete Logarithm with Gradient-Based Methods

Authors: Rustem Takhanov, Maxat Tezekbayev, Artur Pak, Arman Bolatov, Zhibek Kadyrsizova, Zhenisbek Assylbekov

Abstract: The discrete logarithm problem is a fundamental challenge in number theory with significant implications for cryptographic protocols. In this paper, we investigate the limitations of gradient-based methods for learning the parity bit of the discrete logarithm in finite cyclic groups of prime order. Our main result, supported by theoretical analysis and empirical verification, reveals the concentra… ▽ More The discrete logarithm problem is a fundamental challenge in number theory with significant implications for cryptographic protocols. In this paper, we investigate the limitations of gradient-based methods for learning the parity bit of the discrete logarithm in finite cyclic groups of prime order. Our main result, supported by theoretical analysis and empirical verification, reveals the concentration of the gradient of the loss function around a fixed point, independent of the logarithm's base used. This concentration property leads to a restricted ability to learn the parity bit efficiently using gradient-based methods, irrespective of the complexity of the network architecture being trained. Our proof relies on Boas-Bellman inequality in inner product spaces and it involves establishing approximate orthogonality of discrete logarithm's parity bit functions through the spectral norm of certain matrices. Empirical experiments using a neural network-based approach further verify the limitations of gradient-based learning, demonstrating the decreasing success rate in predicting the parity bit as the group order increases. △ Less

Submitted 2 October, 2023; originally announced October 2023.

Comments: ACML 2023

arXiv:2307.10736 [pdf, other]

Long-Tail Theory under Gaussian Mixtures

Authors: Arman Bolatov, Maxat Tezekbayev, Igor Melnykov, Artur Pak, Vassilina Nikoulina, Zhenisbek Assylbekov

Abstract: We suggest a simple Gaussian mixture model for data generation that complies with Feldman's long tail theory (2020). We demonstrate that a linear classifier cannot decrease the generalization error below a certain level in the proposed model, whereas a nonlinear classifier with a memorization capacity can. This confirms that for long-tailed distributions, rare training examples must be considered… ▽ More We suggest a simple Gaussian mixture model for data generation that complies with Feldman's long tail theory (2020). We demonstrate that a linear classifier cannot decrease the generalization error below a certain level in the proposed model, whereas a nonlinear classifier with a memorization capacity can. This confirms that for long-tailed distributions, rare training examples must be considered for optimal generalization to new data. Finally, we show that the performance gap between linear and nonlinear models can be lessened as the tail becomes shorter in the subpopulation frequency distribution, as confirmed by experiments on synthetic and real data. △ Less

Submitted 24 July, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: accepted to ECAI 2023

arXiv:2306.14194 [pdf, other]

Autoencoders for a manifold learning problem with a Jacobian rank constraint

Authors: Rustem Takhanov, Y. Sultan Abylkairov, Maxat Tezekbayev

Abstract: We formulate the manifold learning problem as the problem of finding an operator that maps any point to a close neighbor that lies on a ``hidden'' $k$-dimensional manifold. We call this operator the correcting function. Under this formulation, autoencoders can be viewed as a tool to approximate the correcting function. Given an autoencoder whose Jacobian has rank $k$, we deduce from the classical… ▽ More We formulate the manifold learning problem as the problem of finding an operator that maps any point to a close neighbor that lies on a ``hidden'' $k$-dimensional manifold. We call this operator the correcting function. Under this formulation, autoencoders can be viewed as a tool to approximate the correcting function. Given an autoencoder whose Jacobian has rank $k$, we deduce from the classical Constant Rank Theorem that its range has a structure of a $k$-dimensional manifold. A $k$-dimensionality of the range can be forced by the architecture of an autoencoder (by fixing the dimension of the code space), or alternatively, by an additional constraint that the rank of the autoencoder map** is not greater than $k$. This constraint is included in the objective function as a new term, namely a squared Ky-Fan $k$-antinorm of the Jacobian function. We claim that this constraint is a factor that effectively reduces the dimension of the range of an autoencoder, additionally to the reduction defined by the architecture. We also add a new curvature term into the objective. To conclude, we experimentally compare our approach with the CAE+H method on synthetic and real-world datasets. △ Less

Submitted 25 June, 2023; originally announced June 2023.

arXiv:2111.06832 [pdf, other]

Speeding Up Entmax

Authors: Maxat Tezekbayev, Vassilina Nikoulina, Matthias Gallé, Zhenisbek Assylbekov

Abstract: Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, by producing a dense probability distribution each token in the vocabulary has a nonzero chance of being selected at each generation step, leading to a variety of reported problems in text generation. $α$-entmax of Peters et al. (2019, arXiv:1905.05702) solves this probl… ▽ More Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, by producing a dense probability distribution each token in the vocabulary has a nonzero chance of being selected at each generation step, leading to a variety of reported problems in text generation. $α$-entmax of Peters et al. (2019, arXiv:1905.05702) solves this problem, but is considerably slower than softmax. In this paper, we propose an alternative to $α$-entmax, which keeps its virtuous characteristics, but is as fast as optimized softmax and achieves on par or better performance in machine translation task. △ Less

Submitted 19 May, 2022; v1 submitted 12 November, 2021; originally announced November 2021.

Comments: Findings of NAACL 2022

arXiv:2103.01819 [pdf, other]

doi 10.1613/jair.1.12788

The Rediscovery Hypothesis: Language Models Need to Meet Linguistics

Authors: Vassilina Nikoulina, Maxat Tezekbayev, Nuradil Kozhakhmet, Madina Babazhanova, Matthias Gallé, Zhenisbek Assylbekov

Abstract: There is an ongoing debate in the NLP community whether modern language models contain linguistic knowledge, recovered through so-called probes. In this paper, we study whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the \textit{rediscovery hypothesis}. In the first place, we show that language models that are significantly co… ▽ More There is an ongoing debate in the NLP community whether modern language models contain linguistic knowledge, recovered through so-called probes. In this paper, we study whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the \textit{rediscovery hypothesis}. In the first place, we show that language models that are significantly compressed but perform well on their pretraining objectives retain good scores when probed for linguistic structures. This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates language modeling objectives with linguistic information. This framework also provides a metric to measure the impact of linguistic information on the word prediction task. We reinforce our analytical results with various experiments, both on synthetic and on real NLP tasks in English. △ Less

Submitted 3 January, 2022; v1 submitted 2 March, 2021; originally announced March 2021.

Journal ref: Journal of Artificial Intelligence Vol. 72 (2021) 1343-1384

arXiv:1912.13413 [pdf, other]

Semantics- and Syntax-related Subvectors in the Skip-gram Embeddings

Authors: Maxat Tezekbayev, Zhenisbek Assylbekov, Rustem Takhanov

Abstract: We show that the skip-gram embedding of any word can be decomposed into two subvectors which roughly correspond to semantic and syntactic roles of the word. We show that the skip-gram embedding of any word can be decomposed into two subvectors which roughly correspond to semantic and syntactic roles of the word. △ Less

Submitted 23 December, 2019; originally announced December 2019.

Comments: 2 pages, 1 figure, Student Abstract

Showing 1–7 of 7 results for author: Tezekbayev, M