-
Gradient Descent Fails to Learn High-frequency Functions and Modular Arithmetic
Authors:
Rustem Takhanov,
Maxat Tezekbayev,
Artur Pak,
Arman Bolatov,
Zhenisbek Assylbekov
Abstract:
Classes of target functions containing a large number of approximately orthogonal elements are known to be hard to learn by the Statistical Query algorithms. Recently this classical fact re-emerged in a theory of gradient-based optimization of neural networks. In the novel framework, the hardness of a class is usually quantified by the variance of the gradient with respect to a random choice of a…
▽ More
Classes of target functions containing a large number of approximately orthogonal elements are known to be hard to learn by the Statistical Query algorithms. Recently this classical fact re-emerged in a theory of gradient-based optimization of neural networks. In the novel framework, the hardness of a class is usually quantified by the variance of the gradient with respect to a random choice of a target function.
A set of functions of the form $x\to ax \bmod p$, where $a$ is taken from ${\mathbb Z}_p$, has attracted some attention from deep learning theorists and cryptographers recently. This class can be understood as a subset of $p$-periodic functions on ${\mathbb Z}$ and is tightly connected with a class of high-frequency periodic functions on the real line.
We present a mathematical analysis of limitations and challenges associated with using gradient-based learning techniques to train a high-frequency periodic function or modular multiplication from examples. We highlight that the variance of the gradient is negligibly small in both cases when either a frequency or the prime base $p$ is large. This in turn prevents such a learning algorithm from being successful.
△ Less
Submitted 19 October, 2023;
originally announced October 2023.
-
Intractability of Learning the Discrete Logarithm with Gradient-Based Methods
Authors:
Rustem Takhanov,
Maxat Tezekbayev,
Artur Pak,
Arman Bolatov,
Zhibek Kadyrsizova,
Zhenisbek Assylbekov
Abstract:
The discrete logarithm problem is a fundamental challenge in number theory with significant implications for cryptographic protocols. In this paper, we investigate the limitations of gradient-based methods for learning the parity bit of the discrete logarithm in finite cyclic groups of prime order. Our main result, supported by theoretical analysis and empirical verification, reveals the concentra…
▽ More
The discrete logarithm problem is a fundamental challenge in number theory with significant implications for cryptographic protocols. In this paper, we investigate the limitations of gradient-based methods for learning the parity bit of the discrete logarithm in finite cyclic groups of prime order. Our main result, supported by theoretical analysis and empirical verification, reveals the concentration of the gradient of the loss function around a fixed point, independent of the logarithm's base used. This concentration property leads to a restricted ability to learn the parity bit efficiently using gradient-based methods, irrespective of the complexity of the network architecture being trained.
Our proof relies on Boas-Bellman inequality in inner product spaces and it involves establishing approximate orthogonality of discrete logarithm's parity bit functions through the spectral norm of certain matrices. Empirical experiments using a neural network-based approach further verify the limitations of gradient-based learning, demonstrating the decreasing success rate in predicting the parity bit as the group order increases.
△ Less
Submitted 2 October, 2023;
originally announced October 2023.
-
Long-Tail Theory under Gaussian Mixtures
Authors:
Arman Bolatov,
Maxat Tezekbayev,
Igor Melnykov,
Artur Pak,
Vassilina Nikoulina,
Zhenisbek Assylbekov
Abstract:
We suggest a simple Gaussian mixture model for data generation that complies with Feldman's long tail theory (2020). We demonstrate that a linear classifier cannot decrease the generalization error below a certain level in the proposed model, whereas a nonlinear classifier with a memorization capacity can. This confirms that for long-tailed distributions, rare training examples must be considered…
▽ More
We suggest a simple Gaussian mixture model for data generation that complies with Feldman's long tail theory (2020). We demonstrate that a linear classifier cannot decrease the generalization error below a certain level in the proposed model, whereas a nonlinear classifier with a memorization capacity can. This confirms that for long-tailed distributions, rare training examples must be considered for optimal generalization to new data. Finally, we show that the performance gap between linear and nonlinear models can be lessened as the tail becomes shorter in the subpopulation frequency distribution, as confirmed by experiments on synthetic and real data.
△ Less
Submitted 24 July, 2023; v1 submitted 20 July, 2023;
originally announced July 2023.
-
From Hyperbolic Geometry Back to Word Embeddings
Authors:
Sultan Nurmukhamedov,
Thomas Mach,
Arsen Sheverdin,
Zhenisbek Assylbekov
Abstract:
We choose random points in the hyperbolic disc and claim that these points are already word representations. However, it is yet to be uncovered which point corresponds to which word of the human language of interest. This correspondence can be approximately established using a pointwise mutual information between words and recent alignment techniques.
We choose random points in the hyperbolic disc and claim that these points are already word representations. However, it is yet to be uncovered which point corresponds to which word of the human language of interest. This correspondence can be approximately established using a pointwise mutual information between words and recent alignment techniques.
△ Less
Submitted 26 April, 2022;
originally announced April 2022.
-
Speeding Up Entmax
Authors:
Maxat Tezekbayev,
Vassilina Nikoulina,
Matthias Gallé,
Zhenisbek Assylbekov
Abstract:
Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, by producing a dense probability distribution each token in the vocabulary has a nonzero chance of being selected at each generation step, leading to a variety of reported problems in text generation. $α$-entmax of Peters et al. (2019, arXiv:1905.05702) solves this probl…
▽ More
Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, by producing a dense probability distribution each token in the vocabulary has a nonzero chance of being selected at each generation step, leading to a variety of reported problems in text generation. $α$-entmax of Peters et al. (2019, arXiv:1905.05702) solves this problem, but is considerably slower than softmax.
In this paper, we propose an alternative to $α$-entmax, which keeps its virtuous characteristics, but is as fast as optimized softmax and achieves on par or better performance in machine translation task.
△ Less
Submitted 19 May, 2022; v1 submitted 12 November, 2021;
originally announced November 2021.
-
The Rediscovery Hypothesis: Language Models Need to Meet Linguistics
Authors:
Vassilina Nikoulina,
Maxat Tezekbayev,
Nuradil Kozhakhmet,
Madina Babazhanova,
Matthias Gallé,
Zhenisbek Assylbekov
Abstract:
There is an ongoing debate in the NLP community whether modern language models contain linguistic knowledge, recovered through so-called probes. In this paper, we study whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the \textit{rediscovery hypothesis}. In the first place, we show that language models that are significantly co…
▽ More
There is an ongoing debate in the NLP community whether modern language models contain linguistic knowledge, recovered through so-called probes. In this paper, we study whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the \textit{rediscovery hypothesis}. In the first place, we show that language models that are significantly compressed but perform well on their pretraining objectives retain good scores when probed for linguistic structures. This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates language modeling objectives with linguistic information. This framework also provides a metric to measure the impact of linguistic information on the word prediction task. We reinforce our analytical results with various experiments, both on synthetic and on real NLP tasks in English.
△ Less
Submitted 3 January, 2022; v1 submitted 2 March, 2021;
originally announced March 2021.
-
Squashed Shifted PMI Matrix: Bridging Word Embeddings and Hyperbolic Spaces
Authors:
Zhenisbek Assylbekov,
Alibi Jangeldin
Abstract:
We show that removing sigmoid transformation in the skip-gram with negative sampling (SGNS) objective does not harm the quality of word vectors significantly and at the same time is related to factorizing a squashed shifted PMI matrix which, in turn, can be treated as a connection probabilities matrix of a random graph. Empirically, such graph is a complex network, i.e. it has strong clustering an…
▽ More
We show that removing sigmoid transformation in the skip-gram with negative sampling (SGNS) objective does not harm the quality of word vectors significantly and at the same time is related to factorizing a squashed shifted PMI matrix which, in turn, can be treated as a connection probabilities matrix of a random graph. Empirically, such graph is a complex network, i.e. it has strong clustering and scale-free degree distribution, and is tightly connected with hyperbolic spaces. In short, we show the connection between static word embeddings and hyperbolic spaces through the squashed shifted PMI matrix using analytical and empirical methods.
△ Less
Submitted 26 September, 2020; v1 submitted 27 February, 2020;
originally announced February 2020.
-
Semantics- and Syntax-related Subvectors in the Skip-gram Embeddings
Authors:
Maxat Tezekbayev,
Zhenisbek Assylbekov,
Rustem Takhanov
Abstract:
We show that the skip-gram embedding of any word can be decomposed into two subvectors which roughly correspond to semantic and syntactic roles of the word.
We show that the skip-gram embedding of any word can be decomposed into two subvectors which roughly correspond to semantic and syntactic roles of the word.
△ Less
Submitted 23 December, 2019;
originally announced December 2019.
-
A Critique of the Smooth Inverse Frequency Sentence Embeddings
Authors:
Aidana Karipbayeva,
Alena Sorokina,
Zhenisbek Assylbekov
Abstract:
We critically review the smooth inverse frequency sentence embedding method of Arora, Liang, and Ma (2017), and show inconsistencies in its setup, derivation, and evaluation.
We critically review the smooth inverse frequency sentence embedding method of Arora, Liang, and Ma (2017), and show inconsistencies in its setup, derivation, and evaluation.
△ Less
Submitted 30 September, 2019;
originally announced September 2019.
-
Low-Rank Approximation of Matrices for PMI-based Word Embeddings
Authors:
Alena Sorokina,
Aidana Karipbayeva,
Zhenisbek Assylbekov
Abstract:
We perform an empirical evaluation of several methods of low-rank approximation in the problem of obtaining PMI-based word embeddings. All word vectors were trained on parts of a large corpus extracted from English Wikipedia (enwik9) which was divided into two equal-sized datasets, from which PMI matrices were obtained. A repeated measures design was used in assigning a method of low-rank approxim…
▽ More
We perform an empirical evaluation of several methods of low-rank approximation in the problem of obtaining PMI-based word embeddings. All word vectors were trained on parts of a large corpus extracted from English Wikipedia (enwik9) which was divided into two equal-sized datasets, from which PMI matrices were obtained. A repeated measures design was used in assigning a method of low-rank approximation (SVD, NMF, QR) and dimensionality of the vectors (250, 500) to each of the PMI matrix replicates. Our experiments show that word vectors obtained from the truncated SVD achieve the best performance on two downstream tasks, similarity and analogy, compare to the other two low-rank approximation methods.
△ Less
Submitted 21 September, 2019;
originally announced September 2019.
-
Context Vectors are Reflections of Word Vectors in Half the Dimensions
Authors:
Zhenisbek Assylbekov,
Rustem Takhanov
Abstract:
This paper takes a step towards theoretical analysis of the relationship between word embeddings and context embeddings in models such as word2vec. We start from basic probabilistic assumptions on the nature of word vectors, context vectors, and text generation. These assumptions are well supported either empirically or theoretically by the existing literature. Next, we show that under these assum…
▽ More
This paper takes a step towards theoretical analysis of the relationship between word embeddings and context embeddings in models such as word2vec. We start from basic probabilistic assumptions on the nature of word vectors, context vectors, and text generation. These assumptions are well supported either empirically or theoretically by the existing literature. Next, we show that under these assumptions the widely-used word-word PMI matrix is approximately a random symmetric Gaussian ensemble. This, in turn, implies that context vectors are reflections of word vectors in approximately half the dimensions. As a direct application of our result, we suggest a theoretically grounded way of tying weights in the SGNS model.
△ Less
Submitted 26 February, 2019;
originally announced February 2019.
-
Fourier Neural Networks: A Comparative Study
Authors:
Abylay Zhumekenov,
Malika Uteuliyeva,
Olzhas Kabdolov,
Rustem Takhanov,
Zhenisbek Assylbekov,
Alejandro J. Castro
Abstract:
We review neural network architectures which were motivated by Fourier series and integrals and which are referred to as Fourier neural networks. These networks are empirically evaluated in synthetic and real-world tasks. Neither of them outperforms the standard neural network with sigmoid activation function in the real-world tasks. All neural networks, both Fourier and the standard one, empirica…
▽ More
We review neural network architectures which were motivated by Fourier series and integrals and which are referred to as Fourier neural networks. These networks are empirically evaluated in synthetic and real-world tasks. Neither of them outperforms the standard neural network with sigmoid activation function in the real-world tasks. All neural networks, both Fourier and the standard one, empirically demonstrate lower approximation error than the truncated Fourier series when it comes to an approximation of a known function of multiple variables.
△ Less
Submitted 8 February, 2019;
originally announced February 2019.
-
Reusing Weights in Subword-aware Neural Language Models
Authors:
Zhenisbek Assylbekov,
Rustem Takhanov
Abstract:
We propose several ways of reusing subword embeddings and other weights in subword-aware neural language models. The proposed techniques do not benefit a competitive character-aware model, but some of them improve the performance of syllable- and morpheme-aware models while showing significant reductions in model sizes. We discover a simple hands-on principle: in a multi-layer input embedding mode…
▽ More
We propose several ways of reusing subword embeddings and other weights in subword-aware neural language models. The proposed techniques do not benefit a competitive character-aware model, but some of them improve the performance of syllable- and morpheme-aware models while showing significant reductions in model sizes. We discover a simple hands-on principle: in a multi-layer input embedding model, layers should be tied consecutively bottom-up if reused at output. Our best morpheme-aware model with properly reused weights beats the competitive word-level model by a large margin across multiple languages and has 20%-87% fewer parameters.
△ Less
Submitted 25 April, 2018; v1 submitted 22 February, 2018;
originally announced February 2018.
-
Patterns versus Characters in Subword-aware Neural Language Modeling
Authors:
Rustem Takhanov,
Zhenisbek Assylbekov
Abstract:
Words in some natural languages can have a composite structure. Elements of this structure include the root (that could also be composite), prefixes and suffixes with which various nuances and relations to other words can be expressed. Thus, in order to build a proper word representation one must take into account its internal structure. From a corpus of texts we extract a set of frequent subwords…
▽ More
Words in some natural languages can have a composite structure. Elements of this structure include the root (that could also be composite), prefixes and suffixes with which various nuances and relations to other words can be expressed. Thus, in order to build a proper word representation one must take into account its internal structure. From a corpus of texts we extract a set of frequent subwords and from the latter set we select patterns, i.e. subwords which encapsulate information on character $n$-gram regularities. The selection is made using the pattern-based Conditional Random Field model with $l_1$ regularization. Further, for every word we construct a new sequence over an alphabet of patterns. The new alphabet's symbols confine a local statistical context stronger than the characters, therefore they allow better representations in ${\mathbb{R}}^n$ and are better building blocks for word representation. In the task of subword-aware language modeling, pattern-based models outperform character-based analogues by 2-20 perplexity points. Also, a recurrent neural network in which a word is represented as a sum of embeddings of its patterns is on par with a competitive and significantly more sophisticated character-based convolutional architecture.
△ Less
Submitted 2 September, 2017;
originally announced September 2017.
-
Syllable-aware Neural Language Models: A Failure to Beat Character-aware Ones
Authors:
Zhenisbek Assylbekov,
Rustem Takhanov,
Bagdat Myrzakhmetov,
Jonathan N. Washington
Abstract:
Syllabification does not seem to improve word-level RNN language modeling quality when compared to character-based segmentation. However, our best syllable-aware language model, achieving performance comparable to the competitive character-aware model, has 18%-33% fewer parameters and is trained 1.2-2.2 times faster.
Syllabification does not seem to improve word-level RNN language modeling quality when compared to character-based segmentation. However, our best syllable-aware language model, achieving performance comparable to the competitive character-aware model, has 18%-33% fewer parameters and is trained 1.2-2.2 times faster.
△ Less
Submitted 20 July, 2017;
originally announced July 2017.