Search | arXiv e-print repository

Gradient Descent Fails to Learn High-frequency Functions and Modular Arithmetic

Authors: Rustem Takhanov, Maxat Tezekbayev, Artur Pak, Arman Bolatov, Zhenisbek Assylbekov

Abstract: Classes of target functions containing a large number of approximately orthogonal elements are known to be hard to learn by the Statistical Query algorithms. Recently this classical fact re-emerged in a theory of gradient-based optimization of neural networks. In the novel framework, the hardness of a class is usually quantified by the variance of the gradient with respect to a random choice of a… ▽ More Classes of target functions containing a large number of approximately orthogonal elements are known to be hard to learn by the Statistical Query algorithms. Recently this classical fact re-emerged in a theory of gradient-based optimization of neural networks. In the novel framework, the hardness of a class is usually quantified by the variance of the gradient with respect to a random choice of a target function. A set of functions of the form $x\to ax \bmod p$, where $a$ is taken from ${\mathbb Z}_p$, has attracted some attention from deep learning theorists and cryptographers recently. This class can be understood as a subset of $p$-periodic functions on ${\mathbb Z}$ and is tightly connected with a class of high-frequency periodic functions on the real line. We present a mathematical analysis of limitations and challenges associated with using gradient-based learning techniques to train a high-frequency periodic function or modular multiplication from examples. We highlight that the variance of the gradient is negligibly small in both cases when either a frequency or the prime base $p$ is large. This in turn prevents such a learning algorithm from being successful. △ Less

Submitted 19 October, 2023; originally announced October 2023.

arXiv:2310.01611 [pdf, other]

Intractability of Learning the Discrete Logarithm with Gradient-Based Methods

Authors: Rustem Takhanov, Maxat Tezekbayev, Artur Pak, Arman Bolatov, Zhibek Kadyrsizova, Zhenisbek Assylbekov

Abstract: The discrete logarithm problem is a fundamental challenge in number theory with significant implications for cryptographic protocols. In this paper, we investigate the limitations of gradient-based methods for learning the parity bit of the discrete logarithm in finite cyclic groups of prime order. Our main result, supported by theoretical analysis and empirical verification, reveals the concentra… ▽ More The discrete logarithm problem is a fundamental challenge in number theory with significant implications for cryptographic protocols. In this paper, we investigate the limitations of gradient-based methods for learning the parity bit of the discrete logarithm in finite cyclic groups of prime order. Our main result, supported by theoretical analysis and empirical verification, reveals the concentration of the gradient of the loss function around a fixed point, independent of the logarithm's base used. This concentration property leads to a restricted ability to learn the parity bit efficiently using gradient-based methods, irrespective of the complexity of the network architecture being trained. Our proof relies on Boas-Bellman inequality in inner product spaces and it involves establishing approximate orthogonality of discrete logarithm's parity bit functions through the spectral norm of certain matrices. Empirical experiments using a neural network-based approach further verify the limitations of gradient-based learning, demonstrating the decreasing success rate in predicting the parity bit as the group order increases. △ Less

Submitted 2 October, 2023; originally announced October 2023.

Comments: ACML 2023

arXiv:2307.10736 [pdf, other]

Long-Tail Theory under Gaussian Mixtures

Authors: Arman Bolatov, Maxat Tezekbayev, Igor Melnykov, Artur Pak, Vassilina Nikoulina, Zhenisbek Assylbekov

Abstract: We suggest a simple Gaussian mixture model for data generation that complies with Feldman's long tail theory (2020). We demonstrate that a linear classifier cannot decrease the generalization error below a certain level in the proposed model, whereas a nonlinear classifier with a memorization capacity can. This confirms that for long-tailed distributions, rare training examples must be considered… ▽ More We suggest a simple Gaussian mixture model for data generation that complies with Feldman's long tail theory (2020). We demonstrate that a linear classifier cannot decrease the generalization error below a certain level in the proposed model, whereas a nonlinear classifier with a memorization capacity can. This confirms that for long-tailed distributions, rare training examples must be considered for optimal generalization to new data. Finally, we show that the performance gap between linear and nonlinear models can be lessened as the tail becomes shorter in the subpopulation frequency distribution, as confirmed by experiments on synthetic and real data. △ Less

Submitted 24 July, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: accepted to ECAI 2023

arXiv:2204.12481 [pdf, other]

From Hyperbolic Geometry Back to Word Embeddings

Authors: Sultan Nurmukhamedov, Thomas Mach, Arsen Sheverdin, Zhenisbek Assylbekov

Abstract: We choose random points in the hyperbolic disc and claim that these points are already word representations. However, it is yet to be uncovered which point corresponds to which word of the human language of interest. This correspondence can be approximately established using a pointwise mutual information between words and recent alignment techniques. We choose random points in the hyperbolic disc and claim that these points are already word representations. However, it is yet to be uncovered which point corresponds to which word of the human language of interest. This correspondence can be approximately established using a pointwise mutual information between words and recent alignment techniques. △ Less

Submitted 26 April, 2022; originally announced April 2022.

arXiv:2111.06832 [pdf, other]

Speeding Up Entmax

Authors: Maxat Tezekbayev, Vassilina Nikoulina, Matthias Gallé, Zhenisbek Assylbekov

Abstract: Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, by producing a dense probability distribution each token in the vocabulary has a nonzero chance of being selected at each generation step, leading to a variety of reported problems in text generation. $α$-entmax of Peters et al. (2019, arXiv:1905.05702) solves this probl… ▽ More Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, by producing a dense probability distribution each token in the vocabulary has a nonzero chance of being selected at each generation step, leading to a variety of reported problems in text generation. $α$-entmax of Peters et al. (2019, arXiv:1905.05702) solves this problem, but is considerably slower than softmax. In this paper, we propose an alternative to $α$-entmax, which keeps its virtuous characteristics, but is as fast as optimized softmax and achieves on par or better performance in machine translation task. △ Less

Submitted 19 May, 2022; v1 submitted 12 November, 2021; originally announced November 2021.

Comments: Findings of NAACL 2022

arXiv:2103.01819 [pdf, other]

doi 10.1613/jair.1.12788

The Rediscovery Hypothesis: Language Models Need to Meet Linguistics

Authors: Vassilina Nikoulina, Maxat Tezekbayev, Nuradil Kozhakhmet, Madina Babazhanova, Matthias Gallé, Zhenisbek Assylbekov

Abstract: There is an ongoing debate in the NLP community whether modern language models contain linguistic knowledge, recovered through so-called probes. In this paper, we study whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the \textit{rediscovery hypothesis}. In the first place, we show that language models that are significantly co… ▽ More There is an ongoing debate in the NLP community whether modern language models contain linguistic knowledge, recovered through so-called probes. In this paper, we study whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the \textit{rediscovery hypothesis}. In the first place, we show that language models that are significantly compressed but perform well on their pretraining objectives retain good scores when probed for linguistic structures. This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates language modeling objectives with linguistic information. This framework also provides a metric to measure the impact of linguistic information on the word prediction task. We reinforce our analytical results with various experiments, both on synthetic and on real NLP tasks in English. △ Less

Submitted 3 January, 2022; v1 submitted 2 March, 2021; originally announced March 2021.

Journal ref: Journal of Artificial Intelligence Vol. 72 (2021) 1343-1384

arXiv:2002.12005 [pdf, other]

Squashed Shifted PMI Matrix: Bridging Word Embeddings and Hyperbolic Spaces

Authors: Zhenisbek Assylbekov, Alibi Jangeldin

Abstract: We show that removing sigmoid transformation in the skip-gram with negative sampling (SGNS) objective does not harm the quality of word vectors significantly and at the same time is related to factorizing a squashed shifted PMI matrix which, in turn, can be treated as a connection probabilities matrix of a random graph. Empirically, such graph is a complex network, i.e. it has strong clustering an… ▽ More We show that removing sigmoid transformation in the skip-gram with negative sampling (SGNS) objective does not harm the quality of word vectors significantly and at the same time is related to factorizing a squashed shifted PMI matrix which, in turn, can be treated as a connection probabilities matrix of a random graph. Empirically, such graph is a complex network, i.e. it has strong clustering and scale-free degree distribution, and is tightly connected with hyperbolic spaces. In short, we show the connection between static word embeddings and hyperbolic spaces through the squashed shifted PMI matrix using analytical and empirical methods. △ Less

Submitted 26 September, 2020; v1 submitted 27 February, 2020; originally announced February 2020.

Comments: AJCAI 2020

arXiv:1912.13413 [pdf, other]

Semantics- and Syntax-related Subvectors in the Skip-gram Embeddings

Authors: Maxat Tezekbayev, Zhenisbek Assylbekov, Rustem Takhanov

Abstract: We show that the skip-gram embedding of any word can be decomposed into two subvectors which roughly correspond to semantic and syntactic roles of the word. We show that the skip-gram embedding of any word can be decomposed into two subvectors which roughly correspond to semantic and syntactic roles of the word. △ Less

Submitted 23 December, 2019; originally announced December 2019.

Comments: 2 pages, 1 figure, Student Abstract

arXiv:1909.13494 [pdf, other]

A Critique of the Smooth Inverse Frequency Sentence Embeddings

Authors: Aidana Karipbayeva, Alena Sorokina, Zhenisbek Assylbekov

Abstract: We critically review the smooth inverse frequency sentence embedding method of Arora, Liang, and Ma (2017), and show inconsistencies in its setup, derivation, and evaluation. We critically review the smooth inverse frequency sentence embedding method of Arora, Liang, and Ma (2017), and show inconsistencies in its setup, derivation, and evaluation. △ Less

Submitted 30 September, 2019; originally announced September 2019.

Comments: 2 pages, 2 figures, Abstract

arXiv:1909.09855 [pdf, other]

Low-Rank Approximation of Matrices for PMI-based Word Embeddings

Authors: Alena Sorokina, Aidana Karipbayeva, Zhenisbek Assylbekov

Abstract: We perform an empirical evaluation of several methods of low-rank approximation in the problem of obtaining PMI-based word embeddings. All word vectors were trained on parts of a large corpus extracted from English Wikipedia (enwik9) which was divided into two equal-sized datasets, from which PMI matrices were obtained. A repeated measures design was used in assigning a method of low-rank approxim… ▽ More We perform an empirical evaluation of several methods of low-rank approximation in the problem of obtaining PMI-based word embeddings. All word vectors were trained on parts of a large corpus extracted from English Wikipedia (enwik9) which was divided into two equal-sized datasets, from which PMI matrices were obtained. A repeated measures design was used in assigning a method of low-rank approximation (SVD, NMF, QR) and dimensionality of the vectors (250, 500) to each of the PMI matrix replicates. Our experiments show that word vectors obtained from the truncated SVD achieve the best performance on two downstream tasks, similarity and analogy, compare to the other two low-rank approximation methods. △ Less

Submitted 21 September, 2019; originally announced September 2019.

Comments: 10 pages, 4 figures, CICLing 2019, Springer "Lecture Notes in Computer Science"

arXiv:1902.09859 [pdf, other]

Context Vectors are Reflections of Word Vectors in Half the Dimensions

Authors: Zhenisbek Assylbekov, Rustem Takhanov

Abstract: This paper takes a step towards theoretical analysis of the relationship between word embeddings and context embeddings in models such as word2vec. We start from basic probabilistic assumptions on the nature of word vectors, context vectors, and text generation. These assumptions are well supported either empirically or theoretically by the existing literature. Next, we show that under these assum… ▽ More This paper takes a step towards theoretical analysis of the relationship between word embeddings and context embeddings in models such as word2vec. We start from basic probabilistic assumptions on the nature of word vectors, context vectors, and text generation. These assumptions are well supported either empirically or theoretically by the existing literature. Next, we show that under these assumptions the widely-used word-word PMI matrix is approximately a random symmetric Gaussian ensemble. This, in turn, implies that context vectors are reflections of word vectors in approximately half the dimensions. As a direct application of our result, we suggest a theoretically grounded way of tying weights in the SGNS model. △ Less

Submitted 26 February, 2019; originally announced February 2019.

arXiv:1902.03011 [pdf, other]

doi 10.3233/IDA-195050

Fourier Neural Networks: A Comparative Study

Authors: Abylay Zhumekenov, Malika Uteuliyeva, Olzhas Kabdolov, Rustem Takhanov, Zhenisbek Assylbekov, Alejandro J. Castro

Abstract: We review neural network architectures which were motivated by Fourier series and integrals and which are referred to as Fourier neural networks. These networks are empirically evaluated in synthetic and real-world tasks. Neither of them outperforms the standard neural network with sigmoid activation function in the real-world tasks. All neural networks, both Fourier and the standard one, empirica… ▽ More We review neural network architectures which were motivated by Fourier series and integrals and which are referred to as Fourier neural networks. These networks are empirically evaluated in synthetic and real-world tasks. Neither of them outperforms the standard neural network with sigmoid activation function in the real-world tasks. All neural networks, both Fourier and the standard one, empirically demonstrate lower approximation error than the truncated Fourier series when it comes to an approximation of a known function of multiple variables. △ Less

Submitted 8 February, 2019; originally announced February 2019.

Journal ref: Intell. Data Anal. 24 (2020), 1107-1120

arXiv:1802.08375 [pdf, other]

Reusing Weights in Subword-aware Neural Language Models

Authors: Zhenisbek Assylbekov, Rustem Takhanov

Abstract: We propose several ways of reusing subword embeddings and other weights in subword-aware neural language models. The proposed techniques do not benefit a competitive character-aware model, but some of them improve the performance of syllable- and morpheme-aware models while showing significant reductions in model sizes. We discover a simple hands-on principle: in a multi-layer input embedding mode… ▽ More We propose several ways of reusing subword embeddings and other weights in subword-aware neural language models. The proposed techniques do not benefit a competitive character-aware model, but some of them improve the performance of syllable- and morpheme-aware models while showing significant reductions in model sizes. We discover a simple hands-on principle: in a multi-layer input embedding model, layers should be tied consecutively bottom-up if reused at output. Our best morpheme-aware model with properly reused weights beats the competitive word-level model by a large margin across multiple languages and has 20%-87% fewer parameters. △ Less

Submitted 25 April, 2018; v1 submitted 22 February, 2018; originally announced February 2018.

Comments: accepted to NAACL 2018

MSC Class: 68T50 ACM Class: I.2.7

arXiv:1709.00541 [pdf, other]

Patterns versus Characters in Subword-aware Neural Language Modeling

Authors: Rustem Takhanov, Zhenisbek Assylbekov

Abstract: Words in some natural languages can have a composite structure. Elements of this structure include the root (that could also be composite), prefixes and suffixes with which various nuances and relations to other words can be expressed. Thus, in order to build a proper word representation one must take into account its internal structure. From a corpus of texts we extract a set of frequent subwords… ▽ More Words in some natural languages can have a composite structure. Elements of this structure include the root (that could also be composite), prefixes and suffixes with which various nuances and relations to other words can be expressed. Thus, in order to build a proper word representation one must take into account its internal structure. From a corpus of texts we extract a set of frequent subwords and from the latter set we select patterns, i.e. subwords which encapsulate information on character $n$-gram regularities. The selection is made using the pattern-based Conditional Random Field model with $l_1$ regularization. Further, for every word we construct a new sequence over an alphabet of patterns. The new alphabet's symbols confine a local statistical context stronger than the characters, therefore they allow better representations in ${\mathbb{R}}^n$ and are better building blocks for word representation. In the task of subword-aware language modeling, pattern-based models outperform character-based analogues by 2-20 perplexity points. Also, a recurrent neural network in which a word is represented as a sum of embeddings of its patterns is on par with a competitive and significantly more sophisticated character-based convolutional architecture. △ Less

Submitted 2 September, 2017; originally announced September 2017.

Comments: 10 pages

arXiv:1707.06480 [pdf, other]

Syllable-aware Neural Language Models: A Failure to Beat Character-aware Ones

Authors: Zhenisbek Assylbekov, Rustem Takhanov, Bagdat Myrzakhmetov, Jonathan N. Washington

Abstract: Syllabification does not seem to improve word-level RNN language modeling quality when compared to character-based segmentation. However, our best syllable-aware language model, achieving performance comparable to the competitive character-aware model, has 18%-33% fewer parameters and is trained 1.2-2.2 times faster. Syllabification does not seem to improve word-level RNN language modeling quality when compared to character-based segmentation. However, our best syllable-aware language model, achieving performance comparable to the competitive character-aware model, has 18%-33% fewer parameters and is trained 1.2-2.2 times faster. △ Less

Submitted 20 July, 2017; originally announced July 2017.

Comments: EMNLP 2017

MSC Class: 68T50 ACM Class: I.2.7

Showing 1–15 of 15 results for author: Assylbekov, Z