Search | arXiv e-print repository

An Analysis of Euclidean vs. Graph-Based Framing for Bilingual Lexicon Induction from Word Embedding Spaces

Authors: Kelly Marchisio, Youngser Park, Ali Saad-Eldin, Anton Alyakin, Kevin Duh, Carey Priebe, Philipp Koehn

Abstract: Much recent work in bilingual lexicon induction (BLI) views word embeddings as vectors in Euclidean space. As such, BLI is typically solved by finding a linear transformation that maps embeddings to a common space. Alternatively, word embeddings may be understood as nodes in a weighted graph. This framing allows us to examine a node's graph neighborhood without assuming a linear transform, and exploits new techniques from the graph matching optimization literature. These contrasting approaches have not been compared in BLI so far. In this work, we study the behavior of Euclidean versus graph-based approaches to BLI under differing data conditions and show that they complement each other when combined. We release our code at https://github.com/kellymarchisio/euc-v-graph-bli.

Submitted 26 September, 2021; originally announced September 2021.

Comments: EMNLP Findings 2021 Camera-Ready
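
Note on the Euclidean framing in the abstract above: a common way to find "a linear transformation that maps embeddings to a common space" is orthogonal Procrustes over a seed dictionary. The sketch below is illustrative only and is not taken from the authors' repository; the function name, the toy data, and the cosine nearest-neighbor retrieval step are all assumptions made for the example.

```python
import numpy as np

def orthogonal_procrustes_map(X, Y):
    """Return the orthogonal W minimizing ||X W - Y||_F.

    X, Y: (n_seed, dim) arrays; row i of X is assumed to translate to row i of Y.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def unit_rows(M):
    """Scale each row to unit length so dot products become cosine similarities."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

# Toy data (hypothetical): target embeddings are an exact rotation of the source.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # hidden orthogonal map
Y = X @ Q

W = orthogonal_procrustes_map(X, Y)

# Induce translations by cosine nearest neighbor in the shared space.
sims = unit_rows(X @ W) @ unit_rows(Y).T
pred = sims.argmax(axis=1)
```

In this noiseless toy setting the recovered W coincides with the hidden rotation; real embedding spaces are only approximately related by such a map, which is the assumption the graph-based framing avoids.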

arXiv:2008.09434 [pdf, other]

doi 10.1007/s41109-023-00607-x

Correcting a Nonparametric Two-sample Graph Hypothesis Test for Graphs with Different Numbers of Vertices with Applications to Connectomics

Authors: Anton A. Alyakin, Joshua Agterberg, Hayden S. Helm, Carey E. Priebe

Abstract: Random graphs are statistical models that have many applications, ranging from neuroscience to social network analysis. Of particular interest in some applications is the problem of testing two random graphs for equality of generating distributions. Tang et al. (2017) propose a test for this setting. This test consists of embedding the graph into a low-dimensional space via the adjacency spectral embedding (ASE) and subsequently using a kernel two-sample test based on the maximum mean discrepancy. However, if the two graphs being compared have an unequal number of vertices, the test of Tang et al. (2017) may not be valid. We demonstrate the intuition behind this invalidity and propose a correction that makes any subsequent kernel- or distance-based test valid. Our method relies on sampling based on the asymptotic distribution for the ASE. We call these altered embeddings the corrected adjacency spectral embeddings (CASE). We also show that CASE remedies the exchangeability problem of the original test and demonstrate the validity and consistency of the test that uses CASE via a simulation study. Lastly, we apply our proposed test to the problem of determining equivalence of generating distributions in human connectomes extracted from diffusion magnetic resonance imaging (dMRI) at different scales.

Submitted 20 November, 2023; v1 submitted 21 August, 2020; originally announced August 2020.
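
Note on the pipeline named in the abstract above (ASE followed by a kernel two-sample statistic): the sketch below shows the two building blocks in their simplest form. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, eigenvalues are selected by magnitude (one common convention), the kernel is Gaussian with a fixed bandwidth, and the alignment step, the CASE correction, and the p-value computation are all omitted.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Embed a symmetric matrix A into d dimensions.

    Uses the d eigenpairs of largest magnitude: rows of the result are
    eigenvector entries scaled by sqrt(|eigenvalue|).
    """
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

def mmd2_biased(X, Y, bandwidth=1.0):
    """Biased estimate of squared maximum mean discrepancy (Gaussian kernel)."""
    def gram(P, Q):
        sq = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq / (2.0 * bandwidth ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

# Toy check (hypothetical data): embed a rank-2 edge-probability matrix.
rng = np.random.default_rng(0)
Z = rng.uniform(0.2, 0.6, size=(30, 2))  # latent positions
P = Z @ Z.T                              # edge probabilities, rank 2
X_hat = adjacency_spectral_embedding(P, d=2)

same = mmd2_biased(X_hat, X_hat)            # identical samples
shifted = mmd2_biased(X_hat, X_hat + 5.0)   # clearly different samples
```

The embedding is identifiable only up to an orthogonal transformation, so two separately embedded graphs generally need to be aligned before a statistic like this is compared across them.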

arXiv:1911.11922 [pdf, other]

LqRT: Robust Hypothesis Testing of Location Parameters using Lq-Likelihood-Ratio-Type Test in Python

Authors: Anton Alyakin, Yichen Qin, Carey E. Priebe

Abstract: A t-test is considered a standard procedure for inference on population means and is widely used in scientific discovery. However, as a special case of a likelihood-ratio test, the t-test often shows drastic performance degradation due to deviations from its hard-to-verify distributional assumptions. Alternatively, in this article, we propose a new two-sample Lq-likelihood-ratio-type test (LqRT) along with an easy-to-use Python package for implementation. LqRT preserves high power when the distributional assumption is violated, and maintains satisfactory performance when the assumption is valid. As numerical studies suggest, LqRT dominates many other robust tests in power, such as the Wilcoxon test and the sign test, while maintaining a valid size. To the extent that the robustness of the Wilcoxon test (minimum asymptotic relative efficiency (ARE) of the Wilcoxon test vs the t-test is 0.864) suggests that the Wilcoxon test should be the default test of choice (rather than "use Wilcoxon if there is evidence of non-normality", the default position should be "use Wilcoxon unless there is good reason to believe the normality assumption"), the results in this article suggest that the LqRT is potentially the new default go-to test for practitioners.

Submitted 26 November, 2019; originally announced November 2019.

arXiv:1911.02741 [pdf, other]

doi 10.1002/sta4.429

Valid Two-Sample Graph Testing via Optimal Transport Procrustes and Multiscale Graph Correlation with Applications in Connectomics

Authors: Jaewon Chung, Bijan Varjavand, Jesus Arroyo, Anton Alyakin, Joshua Agterberg, Minh Tang, Joshua T. Vogelstein, Carey E. Priebe

Abstract: Testing whether two graphs come from the same distribution is of interest in many real-world scenarios, including brain network analysis. Under the random dot product graph model, the nonparametric hypothesis testing framework consists of embedding the graphs using the adjacency spectral embedding (ASE), followed by aligning the embeddings using the median flip heuristic, and finally applying the nonparametric maximum mean discrepancy (MMD) test to obtain a p-value. Using synthetic data generated from Drosophila brain networks, we show that the median flip heuristic results in an invalid test, and demonstrate that optimal transport Procrustes (OTP) for alignment resolves the invalidity. We further demonstrate that substituting the MMD test with the multiscale graph correlation (MGC) test leads to a more powerful test both in synthetic and in simulated data. Lastly, we apply this powerful test to the right and left hemispheres of the larval Drosophila mushroom body brain networks, and conclude that there is not sufficient evidence to reject the null hypothesis that the two hemispheres are equally distributed.

Submitted 13 September, 2021; v1 submitted 6 November, 2019; originally announced November 2019.

Comments: 12 pages, 3 figures

Showing 1–4 of 4 results for author: Alyakin, A