-
An Analysis of Euclidean vs. Graph-Based Framing for Bilingual Lexicon Induction from Word Embedding Spaces
Authors:
Kelly Marchisio,
Youngser Park,
Ali Saad-Eldin,
Anton Alyakin,
Kevin Duh,
Carey Priebe,
Philipp Koehn
Abstract:
Much recent work in bilingual lexicon induction (BLI) views word embeddings as vectors in Euclidean space. As such, BLI is typically solved by finding a linear transformation that maps embeddings to a common space. Alternatively, word embeddings may be understood as nodes in a weighted graph. This framing allows us to examine a node's graph neighborhood without assuming a linear transform, and exploits new techniques from the graph matching optimization literature. These contrasting approaches have not been compared in BLI so far. In this work, we study the behavior of Euclidean versus graph-based approaches to BLI under differing data conditions and show that they complement each other when combined. We release our code at https://github.com/kellymarchisio/euc-v-graph-bli.
Submitted 26 September, 2021;
originally announced September 2021.
-
Correcting a Nonparametric Two-sample Graph Hypothesis Test for Graphs with Different Numbers of Vertices with Applications to Connectomics
Authors:
Anton A. Alyakin,
Joshua Agterberg,
Hayden S. Helm,
Carey E. Priebe
Abstract:
Random graphs are statistical models that have many applications, ranging from neuroscience to social network analysis. Of particular interest in some applications is the problem of testing two random graphs for equality of generating distributions. Tang et al. (2017) propose a test for this setting. This test consists of embedding the graph into a low-dimensional space via the adjacency spectral embedding (ASE) and subsequently using a kernel two-sample test based on the maximum mean discrepancy. However, if the two graphs being compared have an unequal number of vertices, the test of Tang et al. (2017) may not be valid. We demonstrate the intuition behind this invalidity and propose a correction that makes any subsequent kernel- or distance-based test valid. Our method relies on sampling based on the asymptotic distribution for the ASE. We call these altered embeddings the corrected adjacency spectral embeddings (CASE). We also show that CASE remedies the exchangeability problem of the original test and demonstrate the validity and consistency of the test that uses CASE via a simulation study. Lastly, we apply our proposed test to the problem of determining equivalence of generating distributions in human connectomes extracted from diffusion magnetic resonance imaging (dMRI) at different scales.
Submitted 20 November, 2023; v1 submitted 21 August, 2020;
originally announced August 2020.
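The kernel two-sample step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a Gaussian kernel with a fixed bandwidth, a biased (V-statistic) estimate of the squared maximum mean discrepancy, and a permutation null. The function names and the synthetic points are hypothetical stand-ins for estimated latent positions; this is not the authors' implementation, and it does not include the CASE correction, so it only shows the generic test applied to two point clouds.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def mmd_permutation_test(xs, ys, n_permutations=200, bandwidth=1.0, seed=0):
    """Return (observed statistic, permutation p-value) for H0: same distribution."""
    rng = random.Random(seed)
    observed = mmd_squared(xs, ys, bandwidth)
    pooled = list(xs) + list(ys)
    n = len(xs)
    exceed = 0
    for _ in range(n_permutations):
        # Reassign points to the two groups at random, keeping group sizes fixed.
        rng.shuffle(pooled)
        if mmd_squared(pooled[:n], pooled[n:], bandwidth) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_permutations + 1)


# Synthetic two-dimensional points standing in for embedded vertices (assumption).
data_rng = random.Random(1)
sample_a = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(30)]
sample_b = [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(30)]

statistic, p_value = mmd_permutation_test(sample_a, sample_b)
print(f"MMD^2 = {statistic:.4f}, p-value = {p_value:.4f}")
```

The permutation null relies on the two samples being exchangeable under the null hypothesis, which is the property the abstract says fails for embeddings of graphs with different numbers of vertices.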
-
LqRT: Robust Hypothesis Testing of Location Parameters using Lq-Likelihood-Ratio-Type Test in Python
Authors:
Anton Alyakin,
Yichen Qin,
Carey E. Priebe
Abstract:
A t-test is considered a standard procedure for inference on population means and is widely used in scientific discovery. However, as a special case of a likelihood-ratio test, the t-test often shows drastic performance degradation due to deviations from its hard-to-verify distributional assumptions. Alternatively, in this article, we propose a new two-sample Lq-likelihood-ratio-type test (LqRT) along with an easy-to-use Python package for implementation. LqRT preserves high power when the distributional assumption is violated, and maintains satisfactory performance when the assumption is valid. As numerical studies suggest, LqRT dominates many other robust tests in power, such as the Wilcoxon test and the sign test, while maintaining a valid size. To the extent that the robustness of the Wilcoxon test (the minimum asymptotic relative efficiency (ARE) of the Wilcoxon test vs. the t-test is 0.864) suggests that the Wilcoxon test should be the default test of choice (rather than "use Wilcoxon if there is evidence of non-normality", the default position should be "use Wilcoxon unless there is good reason to believe the normality assumption"), the results in this article suggest that the LqRT is potentially the new default go-to test for practitioners.
Submitted 26 November, 2019;
originally announced November 2019.
-
Valid Two-Sample Graph Testing via Optimal Transport Procrustes and Multiscale Graph Correlation with Applications in Connectomics
Authors:
Jaewon Chung,
Bijan Varjavand,
Jesus Arroyo,
Anton Alyakin,
Joshua Agterberg,
Minh Tang,
Joshua T. Vogelstein,
Carey E. Priebe
Abstract:
Testing whether two graphs come from the same distribution is of interest in many real-world scenarios, including brain network analysis. Under the random dot product graph model, the nonparametric hypothesis testing framework consists of embedding the graphs using the adjacency spectral embedding (ASE), followed by aligning the embeddings using the median flip heuristic, and finally applying the nonparametric maximum mean discrepancy (MMD) test to obtain a p-value. Using synthetic data generated from Drosophila brain networks, we show that the median flip heuristic results in an invalid test, and demonstrate that optimal transport Procrustes (OTP) for alignment resolves the invalidity. We further demonstrate that substituting the MMD test with the multiscale graph correlation (MGC) test leads to a more powerful test both in synthetic and in simulated data. Lastly, we apply this powerful test to the right and left hemispheres of the larval Drosophila mushroom body brain networks, and conclude that there is not sufficient evidence to reject the null hypothesis that the two hemispheres are equally distributed.
Submitted 13 September, 2021; v1 submitted 6 November, 2019;
originally announced November 2019.