Estimation of embedding vectors in high dimensions

Azar, Golara Ahmadi; Emami, Melika; Fletcher, Alyson; Rangan, Sundeep

arXiv:2312.07802 (cs)

[Submitted on 12 Dec 2023]

Abstract: Embeddings are a basic initial feature extraction step in many machine learning models, particularly in natural language processing. An embedding attempts to map data tokens to a low-dimensional space in which similar tokens are mapped to vectors that are close to one another under some metric. A basic question is how well such an embedding can be learned. To study this problem, we consider a simple probability model for discrete data in which there is some "true" but unknown embedding, and the correlation of the random variables is related to the similarity of their embeddings. Under this model, it is shown that the embeddings can be learned by a variant of the low-rank approximate message passing (AMP) method. The AMP approach enables precise predictions of the estimation accuracy in certain high-dimensional limits. In particular, the methodology provides insight into the relations among key parameters such as the number of samples per value, the frequency of the terms, and the strength of the embedding correlation on the probability distribution. Our theoretical findings are validated by simulations on both synthetic data and real text data.
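
To make the setting in the abstract concrete, the following is a minimal sketch and not the authors' method. It assumes a hypothetical joint distribution P(i, j) proportional to exp(u_i . v_j / sqrt(d)) over pairs of discrete tokens, and it swaps the AMP estimator for a plain truncated SVD of the double-centered log co-occurrence matrix. The function names, the smoothing constant `eps`, and all sizes are illustrative assumptions.

```python
import numpy as np

def double_center(M):
    """Remove row and column means, leaving only the interaction term."""
    return (M - M.mean(axis=1, keepdims=True)
              - M.mean(axis=0, keepdims=True) + M.mean())

def sample_pairs(U, V, n_samples, rng):
    """Draw (x1, x2) from the ASSUMED model P(i, j) ~ exp(u_i . v_j / sqrt(d))."""
    d = U.shape[1]
    logits = U @ V.T / np.sqrt(d)
    P = np.exp(logits - logits.max())   # subtract max for numerical stability
    P /= P.sum()
    flat = rng.choice(P.size, size=n_samples, p=P.ravel())
    x1, x2 = np.unravel_index(flat, P.shape)
    return x1, x2, logits

def spectral_embeddings(counts, d, eps=0.5):
    """Rank-d factors of the double-centered log of smoothed counts.

    This is a simple spectral baseline, NOT the low-rank AMP variant
    described in the abstract.
    """
    M = double_center(np.log(counts + eps))
    W, s, Zt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(s[:d])
    return W[:, :d] * root, Zt[:d].T * root

rng = np.random.default_rng(0)
m, n, d = 30, 30, 2                     # illustrative sizes
U = rng.normal(size=(m, d))             # hypothetical "true" embeddings
V = rng.normal(size=(n, d))

x1, x2, logits = sample_pairs(U, V, 200_000, rng)
counts = np.zeros((m, n))
np.add.at(counts, (x1, x2), 1)          # co-occurrence counts

U_hat, V_hat = spectral_embeddings(counts, d)

# Embeddings are identifiable only up to an invertible transform, so compare
# the estimated similarities with the (double-centered) true ones.
target = double_center(logits)
corr = np.corrcoef((U_hat @ V_hat.T).ravel(), target.ravel())[0, 1]
print(f"correlation between estimated and true similarities: {corr:.3f}")
```

Reducing `n_samples` or scaling down `U` and `V` in this sketch gives a rough feel for two of the parameters the abstract highlights, the number of samples per value and the strength of the embedding correlation, although it says nothing about the precise high-dimensional predictions obtained from AMP.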

Comments:	12 pages, 7 figures
Subjects:	Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
Cite as:	arXiv:2312.07802 [cs.LG]
	(or arXiv:2312.07802v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2312.07802

Submission history

From: Golara Ahmadi Azar
[v1] Tue, 12 Dec 2023 23:41:59 UTC (160 KB)
