Showing 1–50 of 80 results for author: Ho, N

Searching in archive stat.
  1. arXiv:2406.13781  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV stat.ML

    A Primal-Dual Framework for Transformers and Neural Networks

    Authors: Tan M. Nguyen, Tam Nguyen, Nhat Ho, Andrea L. Bertozzi, Richard G. Baraniuk, Stanley J. Osher

    Abstract: Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresp…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: Accepted to ICLR 2023, 26 pages, 4 figures, 14 tables

  2. arXiv:2405.14131  [pdf, other]

    stat.ML cs.LG

    Statistical Advantages of Perturbing Cosine Router in Sparse Mixture of Experts

    Authors: Huy Nguyen, Pedram Akbarian, Trang Pham, Trang Nguyen, Shujian Zhang, Nhat Ho

    Abstract: The cosine router in sparse Mixture of Experts (MoE) has recently emerged as an attractive alternative to the conventional linear router. Indeed, the cosine router demonstrates favorable performance in image and language tasks and exhibits better ability to mitigate the representation collapse issue, which often leads to parameter redundancy and limited representation potentials. Despite its empir…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 44 pages, 2 figures

  3. arXiv:2405.13997  [pdf, other]

    stat.ML cs.LG

    Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts

    Authors: Huy Nguyen, Nhat Ho, Alessandro Rinaldo

    Abstract: The softmax gating function is arguably the most popular choice in mixture of experts modeling. Despite its widespread use in practice, softmax gating may lead to unnecessary competition among experts, potentially causing the undesirable phenomenon of representation collapse due to its inherent structure. In response, the sigmoid gating function has been recently proposed as an alternative and has…

    Submitted 1 June, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

    Comments: 31 pages, 2 figures

  4. arXiv:2405.13160  [pdf, other]

    stat.ML cs.LG

    Borrowing Strength in Distributionally Robust Optimization via Hierarchical Dirichlet Processes

    Authors: Nicola Bariletto, Khai Nguyen, Nhat Ho

    Abstract: This paper presents a novel optimization framework to address key challenges presented by modern machine learning applications: High dimensionality, distributional uncertainty, and data heterogeneity. Our approach unifies regularized estimation, distributionally robust optimization (DRO), and hierarchical Bayesian modeling in a single data-driven criterion. By employing a hierarchical Dirichlet pr…

    Submitted 21 May, 2024; originally announced May 2024.

  5. arXiv:2405.07482  [pdf, other]

    stat.ML cs.GR cs.LG

    Marginal Fairness Sliced Wasserstein Barycenter

    Authors: Khai Nguyen, Hai Nguyen, Nhat Ho

    Abstract: The sliced Wasserstein barycenter (SWB) is a widely acknowledged method for efficiently generalizing the averaging operation within probability measure spaces. However, achieving marginal fairness SWB, ensuring approximately equal distances from the barycenter to marginals, remains unexplored. The uniform weighted SWB is not necessarily the optimal choice to obtain the desired marginal fairness ba…

    Submitted 13 May, 2024; originally announced May 2024.

    Comments: 33 pages, 14 figures, 6 tables

  6. arXiv:2404.15378  [pdf, other]

    cs.CV cs.AI cs.GR cs.LG stat.ML

    Hierarchical Hybrid Sliced Wasserstein: A Scalable Metric for Heterogeneous Joint Distributions

    Authors: Khai Nguyen, Nhat Ho

    Abstract: Sliced Wasserstein (SW) and Generalized Sliced Wasserstein (GSW) have been widely used in applications due to their computational and statistical scalability. However, the SW and the GSW are only defined between distributions supported on a homogeneous domain. This limitation prevents their usage in applications with heterogeneous joint distributions with marginal distributions supported on multip…

    Submitted 30 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: 28 pages, 11 figures, 4 tables

  7. arXiv:2402.05220  [pdf, other]

    stat.ML cs.LG

    On Parameter Estimation in Deviated Gaussian Mixture of Experts

    Authors: Huy Nguyen, Khai Nguyen, Nhat Ho

    Abstract: We consider the parameter estimation problem in the deviated Gaussian mixture of experts in which the data are generated from $(1 - λ^{\ast}) g_0(Y| X)+ λ^{\ast} \sum_{i = 1}^{k_{\ast}} p_{i}^{\ast} f(Y|(a_{i}^{\ast})^{\top}X+b_i^{\ast},σ_{i}^{\ast})$, where $X, Y$ are respectively a covariate vector and a response variable, $g_{0}(Y|X)$ is a known function, $λ^{\ast} \in [0, 1]$ is true but unkno…

    Submitted 24 June, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

    Comments: Accepted to AISTATS 2024, 32 pages, 2 figures, 1 table

  8. arXiv:2402.02952  [pdf, other]

    stat.ML cs.LG

    On Least Square Estimation in Softmax Gating Mixture of Experts

    Authors: Huy Nguyen, Nhat Ho, Alessandro Rinaldo

    Abstract: Mixture of experts (MoE) model is a statistical machine learning design that aggregates multiple expert networks using a softmax gating function in order to form a more intricate and expressive model. Despite being commonly used in several applications owing to their scalability, the mathematical and statistical properties of MoE models are complex and difficult to analyze. As a result, previous t…

    Submitted 24 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: Accepted to ICML 2024, 29 pages, 2 figures, 2 tables

  9. arXiv:2401.15889  [pdf, other]

    stat.ML cs.AI cs.CV cs.LG

    Sliced Wasserstein with Random-Path Projecting Directions

    Authors: Khai Nguyen, Shujian Zhang, Tam Le, Nhat Ho

    Abstract: Slicing distribution selection has been used as an effective technique to improve the performance of parameter estimators based on minimizing sliced Wasserstein distance in applications. Previous works either utilize expensive optimization to select the slicing distribution or use slicing distributions that require expensive sampling methods. In this work, we propose an optimization-free slicing d…

    Submitted 8 May, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

    Comments: Accepted to ICML 2024, 21 pages, 5 figures, 2 tables

  10. arXiv:2401.15771  [pdf, other]

    stat.ML cs.LG

    Bayesian Nonparametrics Meets Data-Driven Distributionally Robust Optimization

    Authors: Nicola Bariletto, Nhat Ho

    Abstract: Training machine learning and statistical models often involves optimizing a data-driven risk criterion. The risk is usually computed with respect to the empirical data distribution, but this may result in poor and unstable out-of-sample performance due to distributional uncertainty. In the spirit of distributionally robust optimization, we propose a novel robust criterion by combining insights fr…

    Submitted 17 May, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

  11. arXiv:2401.13875  [pdf, other]

    stat.ML cs.LG

    Is Temperature Sample Efficient for Softmax Gaussian Mixture of Experts?

    Authors: Huy Nguyen, Pedram Akbarian, Nhat Ho

    Abstract: Dense-to-sparse gating mixture of experts (MoE) has recently become an effective alternative to a well-known sparse MoE. Rather than fixing the number of activated experts as in the latter model, which could limit the investigation of potential experts, the former model utilizes the temperature to control the softmax weight distribution and the sparsity of the MoE during training in order to stabi…

    Submitted 24 June, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted to ICML 2024, 47 pages, 2 figures, 2 tables

  12. arXiv:2401.02058  [pdf, other]

    cs.LG stat.ML

    Neural Collapse for Cross-entropy Class-Imbalanced Learning with Unconstrained ReLU Feature Model

    Authors: Hien Dang, Tho Tran, Tan Nguyen, Nhat Ho

    Abstract: The current paradigm of training deep neural networks for classification tasks includes minimizing the empirical risk that pushes the training loss value towards zero, even after the training error has been vanished. In this terminal phase of training, it has been observed that the last-layer features collapse to their class-means and these class-means converge to the vertices of a simplex Equiang…

    Submitted 6 June, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

    Comments: 2024 International Conference on Machine Learning

  13. arXiv:2310.14188  [pdf, other]

    stat.ML cs.LG

    A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts

    Authors: Huy Nguyen, Pedram Akbarian, TrungTin Nguyen, Nhat Ho

    Abstract: Mixture-of-experts (MoE) model incorporates the power of multiple submodels via gating functions to achieve greater performance in numerous regression and classification applications. From a theoretical perspective, while there have been previous attempts to comprehend the behavior of that model under the regression settings through the convergence analysis of maximum likelihood estimation in the…

    Submitted 24 June, 2024; v1 submitted 22 October, 2023; originally announced October 2023.

    Comments: Accepted to ICML 2024, 32 pages, 3 figures, 3 tables

  14. arXiv:2309.13850  [pdf, other]

    stat.ML cs.LG

    Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts

    Authors: Huy Nguyen, Pedram Akbarian, Fanqi Yan, Nhat Ho

    Abstract: Top-K sparse softmax gating mixture of experts has been widely used for scaling up massive deep-learning architectures without increasing the computational cost. Despite its popularity in real-world applications, the theoretical understanding of that gating function has remained an open problem. The main challenge comes from the structure of the top-K sparse softmax gating function, which partitio…

    Submitted 23 February, 2024; v1 submitted 24 September, 2023; originally announced September 2023.

    Comments: Accepted to ICLR 2024, 38 pages, 3 figures, 1 table

  15. arXiv:2309.11713  [pdf, other]

    stat.ML cs.GR cs.LG

    Quasi-Monte Carlo for 3D Sliced Wasserstein

    Authors: Khai Nguyen, Nicola Bariletto, Nhat Ho

    Abstract: Monte Carlo (MC) integration has been employed as the standard approximation method for the Sliced Wasserstein (SW) distance, whose analytical expression involves an intractable expectation. However, MC integration is not optimal in terms of absolute approximation error. To provide a better class of empirical SW, we propose quasi-sliced Wasserstein (QSW) approximations that rely on Quasi-Monte Car…

    Submitted 16 February, 2024; v1 submitted 20 September, 2023; originally announced September 2023.

    Comments: Accepted to ICLR 2024 (Spotlight), 25 pages, 13 figures, 6 tables

  16. arXiv:2306.05023  [pdf, other]

    stat.ML cs.LG

    Beyond Vanilla Variational Autoencoders: Detecting Posterior Collapse in Conditional and Hierarchical Variational Autoencoders

    Authors: Hien Dang, Tho Tran, Tan Nguyen, Nhat Ho

    Abstract: The posterior collapse phenomenon in variational autoencoder (VAE), where the variational posterior distribution closely matches the prior distribution, can hinder the quality of the learned latent variables. As a consequence of posterior collapse, the latent variables extracted by the encoder in VAE preserve less information from the input data and thus fail to produce meaningful representations…

    Submitted 13 May, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: Accepted (Poster) at the Twelfth International Conference on Learning Representations

  17. arXiv:2305.07572  [pdf, other]

    stat.ML cs.LG

    Towards Convergence Rates for Parameter Estimation in Gaussian-gated Mixture of Experts

    Authors: Huy Nguyen, TrungTin Nguyen, Khai Nguyen, Nhat Ho

    Abstract: Originally introduced as a neural network for ensemble learning, mixture of experts (MoE) has recently become a fundamental building block of highly successful modern deep neural networks for heterogeneous data analysis in several applications of machine learning and statistics. Despite its popularity in practice, a satisfactory level of theoretical understanding of the MoE model is far from compl…

    Submitted 9 February, 2024; v1 submitted 12 May, 2023; originally announced May 2023.

    Comments: 32 pages, 9 figures; Huy Nguyen and TrungTin Nguyen contributed equally to this work

  18. arXiv:2305.03288  [pdf, other]

    stat.ML cs.LG math.ST

    Demystifying Softmax Gating Function in Gaussian Mixture of Experts

    Authors: Huy Nguyen, TrungTin Nguyen, Nhat Ho

    Abstract: Understanding the parameter estimation of softmax gating Gaussian mixture of experts has remained a long-standing open problem in the literature. It is mainly due to three fundamental theoretical challenges associated with the softmax gating function: (i) the identifiability only up to the translation of parameters; (ii) the intrinsic interaction via partial differential equations between the soft…

    Submitted 29 October, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: 29 pages, 3 figures

  19. arXiv:2305.00402  [pdf, other]

    stat.ML cs.CV cs.GR cs.LG

    Sliced Wasserstein Estimation with Control Variates

    Authors: Khai Nguyen, Nhat Ho

    Abstract: The sliced Wasserstein (SW) distances between two probability measures are defined as the expectation of the Wasserstein distance between two one-dimensional projections of the two measures. The randomness comes from a projecting direction that is used to project the two input measures to one dimension. Due to the intractability of the expectation, Monte Carlo integration is performed to estimate…

    Submitted 18 February, 2024; v1 submitted 30 April, 2023; originally announced May 2023.

    Comments: Accepted to ICLR 2024, 20 pages, 7 figures, 4 tables

  20. arXiv:2304.13586  [pdf, other]

    stat.ML cs.CV cs.GR cs.LG

    Energy-Based Sliced Wasserstein Distance

    Authors: Khai Nguyen, Nhat Ho

    Abstract: The sliced Wasserstein (SW) distance has been widely recognized as a statistically effective and computationally efficient metric between two probability measures. A key component of the SW distance is the slicing distribution. There are two existing approaches for choosing this distribution. The first approach is using a fixed prior distribution. The second approach is optimizing for the best dis…

    Submitted 29 December, 2023; v1 submitted 26 April, 2023; originally announced April 2023.

    Comments: Accepted to NeurIPS 2023, 30 pages, 8 figures, 6 tables

  21. arXiv:2301.04791  [pdf, other]

    stat.ML cs.CV cs.GR cs.LG

    Self-Attention Amortized Distributional Projection Optimization for Sliced Wasserstein Point-Cloud Reconstruction

    Authors: Khai Nguyen, Dang Nguyen, Nhat Ho

    Abstract: Max sliced Wasserstein (Max-SW) distance has been widely known as a solution for less discriminative projections of sliced Wasserstein (SW) distance. In applications that have various independent pairs of probability measures, amortized projection optimization is utilized to predict the ``max" projecting directions given two input measures instead of using projected gradient ascent multiple times.…

    Submitted 8 May, 2023; v1 submitted 11 January, 2023; originally announced January 2023.

    Comments: Accepted to ICML 2023, 23 pages, 6 figures, 9 tables

  22. arXiv:2301.03749  [pdf, other]

    stat.ML cs.LG

    Markovian Sliced Wasserstein Distances: Beyond Independent Projections

    Authors: Khai Nguyen, Tongzheng Ren, Nhat Ho

    Abstract: Sliced Wasserstein (SW) distance suffers from redundant projections due to independent uniform random projecting directions. To partially overcome the issue, max K sliced Wasserstein (Max-K-SW) distance ($K\geq 1$), seeks the best discriminative orthogonal projecting directions. Despite being able to reduce the number of projections, the metricity of Max-K-SW cannot be guaranteed in practice due t…

    Submitted 31 December, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

    Comments: Accepted to NeurIPS 2023, 29 pages, 8 figures, 5 tables

  23. arXiv:2301.00437  [pdf, other]

    cs.LG stat.ML

    Neural Collapse in Deep Linear Networks: From Balanced to Imbalanced Data

    Authors: Hien Dang, Tho Tran, Stanley Osher, Hung Tran-The, Nhat Ho, Tan Nguyen

    Abstract: Modern deep neural networks have achieved impressive performance on tasks from image classification to natural language processing. Surprisingly, these complex systems with massive amounts of parameters exhibit the same structural properties in their last-layer features and classifiers across canonical datasets when training until convergence. In particular, it has been observed that the last-laye…

    Submitted 18 June, 2023; v1 submitted 1 January, 2023; originally announced January 2023.

    Comments: 75 pages, 20 figures, 4 tables. Hien Dang and Tho Tran contributed equally to this work

  24. arXiv:2211.15779  [pdf, other]

    cs.LG stat.ML

    Revisiting Over-smoothing and Over-squashing Using Ollivier-Ricci Curvature

    Authors: Khang Nguyen, Hieu Nong, Vinh Nguyen, Nhat Ho, Stanley Osher, Tan Nguyen

    Abstract: Graph Neural Networks (GNNs) had been demonstrated to be inherently susceptible to the problems of over-smoothing and over-squashing. These issues prohibit the ability of GNNs to model complex graph interactions by limiting their effectiveness in taking into account distant information. Our study reveals the key connection between the local graph geometry and the occurrence of both of these issues…

    Submitted 31 May, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

    Comments: Accepted at ICML 2023; 24 pages, 4 figures

  25. arXiv:2210.10268  [pdf, other]

    stat.ML cs.LG

    Fast Approximation of the Generalized Sliced-Wasserstein Distance

    Authors: Dung Le, Huy Nguyen, Khai Nguyen, Trang Nguyen, Nhat Ho

    Abstract: Generalized sliced Wasserstein distance is a variant of sliced Wasserstein distance that exploits the power of non-linear projection through a given defining function to better capture the complex structures of the probability distributions. Similar to sliced Wasserstein distance, generalized sliced Wasserstein is defined as an expectation over random projections which can be approximated by the M…

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: 22 pages, 2 figures. Dung Le, Huy Nguyen and Khai Nguyen contributed equally to this work

  26. arXiv:2209.15092  [pdf, other]

    cs.LG stat.ML

    Improving Generative Flow Networks with Path Regularization

    Authors: Anh Do, Duy Dinh, Tan Nguyen, Khuong Nguyen, Stanley Osher, Nhat Ho

    Abstract: Generative Flow Networks (GFlowNets) are recently proposed models for learning stochastic policies that generate compositional objects by sequences of actions with the probability proportional to a given reward function. The central problem of GFlowNets is to improve their exploration and generalization. In this work, we propose a novel path regularization method based on optimal transport theory…

    Submitted 29 September, 2022; originally announced September 2022.

    Comments: 28 pages, 2 figures, 5 tables. Anh Do, Duy Dinh, and Tan Nguyen contributed equally to this work

  27. arXiv:2209.13570  [pdf, other]

    stat.ML cs.LG

    Hierarchical Sliced Wasserstein Distance

    Authors: Khai Nguyen, Tongzheng Ren, Huy Nguyen, Litu Rout, Tan Nguyen, Nhat Ho

    Abstract: Sliced Wasserstein (SW) distance has been widely used in different application scenarios since it can be scaled to a large number of supports without suffering from the curse of dimensionality. The value of sliced Wasserstein distance is the average of transportation cost between one-dimensional representations (projections) of original measures that are obtained by Radon Transform (RT). Despite i…

    Submitted 6 February, 2023; v1 submitted 27 September, 2022; originally announced September 2022.

    Comments: Accepted to ICLR 2023, 29 pages, 8 figures, 3 tables

  28. arXiv:2206.01934  [pdf, other]

    cs.LG cs.AI stat.ML

    Stochastic Multiple Target Sampling Gradient Descent

    Authors: Hoang Phan, Ngoc Tran, Trung Le, Toan Tran, Nhat Ho, Dinh Phung

    Abstract: Sampling from an unnormalized target distribution is an essential problem with many applications in probabilistic inference. Stein Variational Gradient Descent (SVGD) has been shown to be a powerful method that iteratively updates a set of particles to approximate the distribution of interest. Furthermore, when analysing its asymptotic properties, SVGD reduces exactly to a single-objective optimiz…

    Submitted 10 February, 2023; v1 submitted 4 June, 2022; originally announced June 2022.

    Comments: Accepted to Advances in Neural Information Processing Systems (NeurIPS) 2022. 27 pages, 10 figures, 5 tables

  29. arXiv:2206.00206  [pdf, ps, other]

    cs.LG stat.ML

    Transformer with Fourier Integral Attentions

    Authors: Tan Nguyen, Minh Pham, Tam Nguyen, Khai Nguyen, Stanley J. Osher, Nhat Ho

    Abstract: Multi-head attention empowers the recent success of transformers, the state-of-the-art models that have achieved remarkable success in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the queries and keys, which results from the use of unnormalized Gaussian kernels with the assumption that the queries follow a mixture of Gaussian distribution. Ther…

    Submitted 31 May, 2022; originally announced June 2022.

    Comments: 35 pages, 5 tables. Tan Nguyen and Minh Pham contributed equally to this work

  30. arXiv:2205.11078  [pdf, other]

    stat.ML cs.LG math.ST

    Beyond EM Algorithm on Over-specified Two-Component Location-Scale Gaussian Mixtures

    Authors: Tongzheng Ren, Fuheng Cui, Sujay Sanghavi, Nhat Ho

    Abstract: The Expectation-Maximization (EM) algorithm has been predominantly used to approximate the maximum likelihood estimation of the location-scale Gaussian mixtures. However, when the models are over-specified, namely, the chosen number of components to fit the data is larger than the unknown true number of components, EM needs a polynomial number of iterations in terms of the sample size to reach the…

    Submitted 23 May, 2022; originally announced May 2022.

    Comments: 38 pages, 4 figures. Tongzheng Ren and Fuheng Cui contributed equally to this work

  31. arXiv:2205.07999  [pdf, other]

    stat.ML cs.LG math.OC math.ST

    An Exponentially Increasing Step-size for Parameter Estimation in Statistical Models

    Authors: Nhat Ho, Tongzheng Ren, Sujay Sanghavi, Purnamrita Sarkar, Rachel Ward

    Abstract: Using gradient descent (GD) with fixed or decaying step-size is a standard practice in unconstrained optimization problems. However, when the loss function is only locally convex, such a step-size schedule artificially slows GD down as it cannot explore the flat curvature of the loss function. To overcome that issue, we propose to exponentially increase the step-size of the GD algorithm. Under hom…

    Submitted 1 February, 2023; v1 submitted 16 May, 2022; originally announced May 2022.

    Comments: 37 pages. The authors are listed in alphabetical order

  32. arXiv:2204.01188  [pdf, other]

    cs.CV cs.LG stat.ML

    Revisiting Sliced Wasserstein on Images: From Vectorization to Convolution

    Authors: Khai Nguyen, Nhat Ho

    Abstract: The conventional sliced Wasserstein is defined between two probability measures that have realizations as vectors. When comparing two probability measures over images, practitioners first need to vectorize images and then project them to one-dimensional space by using matrix multiplication between the sample matrix and the projection matrix. After that, the sliced Wasserstein is evaluated by avera…

    Submitted 23 September, 2022; v1 submitted 3 April, 2022; originally announced April 2022.

    Comments: Accepted to NeurIPS 2022, 29 pages, 9 figures, 11 tables

  33. arXiv:2203.13417  [pdf, other]

    stat.ML cs.LG

    Amortized Projection Optimization for Sliced Wasserstein Generative Models

    Authors: Khai Nguyen, Nhat Ho

    Abstract: Seeking informative projecting directions has been an important task in utilizing sliced Wasserstein distance in applications. However, finding these directions usually requires an iterative optimization procedure over the space of projecting directions, which is computationally expensive. Moreover, the computational issue is even more severe in deep learning applications, where computing the dist…

    Submitted 23 September, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted to NeurIPS 2022, 22 pages, 6 figures, 8 tables

  34. arXiv:2202.08786  [pdf, other]

    math.ST stat.ML

    Refined Convergence Rates for Maximum Likelihood Estimation under Finite Mixture Models

    Authors: Tudor Manole, Nhat Ho

    Abstract: We revisit the classical problem of deriving convergence rates for the maximum likelihood estimator (MLE) in finite mixture models. The Wasserstein distance has become a standard loss function for the analysis of parameter estimation in these models, due in part to its ability to circumvent label switching and to accurately characterize the behaviour of fitted mixture components with vanishing wei…

    Submitted 20 June, 2022; v1 submitted 17 February, 2022; originally announced February 2022.

    Comments: To appear in the Proceedings of the 39th International Conference on Machine Learning (ICML), 2022

  35. arXiv:2202.04219  [pdf, other]

    stat.ML cs.LG math.ST

    Improving Computational Complexity in Statistical Models with Second-Order Information

    Authors: Tongzheng Ren, Jiacheng Zhuo, Sujay Sanghavi, Nhat Ho

    Abstract: It is known that when the statistical models are singular, i.e., the Fisher information matrix at the true parameter is degenerate, the fixed step-size gradient descent algorithm takes polynomial number of steps in terms of the sample size $n$ to converge to a final statistical radius around the true parameter, which can be unsatisfactory for the application. To further improve that computational…

    Submitted 13 April, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

    Comments: 27 pages, 2 figures. Fixing a bug in the proof of Lemma 7

  36. arXiv:2202.02651  [pdf, other]

    stat.ML cs.LG math.ST

    Beyond Black Box Densities: Parameter Learning for the Deviated Components

    Authors: Dat Do, Nhat Ho, XuanLong Nguyen

    Abstract: As we collect additional samples from a data population for which a known density function estimate may have been previously obtained by a black box method, the increased complexity of the data set may result in the true density being deviated from the known estimate by a mixture distribution. To model this phenomenon, we consider the \emph{deviating mixture model}…

    Submitted 26 October, 2022; v1 submitted 5 February, 2022; originally announced February 2022.

    Comments: Accepted at NeurIPS 2022. Dat Do and Nhat Ho contributed equally to this work

  37. arXiv:2201.03447  [pdf, ps, other]

    math.ST stat.ML

    Bayesian Consistency with the Supremum Metric

    Authors: Nhat Ho, Stephen G. Walker

    Abstract: We present simple conditions for Bayesian consistency in the supremum metric. The key to the technique is a triangle inequality which allows us to explicitly use weak convergence, a consequence of the standard Kullback--Leibler support condition for the prior. A further condition is to ensure that smoothed versions of densities are not too far from the original density, thus dealing with densities…

    Submitted 10 January, 2022; originally announced January 2022.

    Comments: 11 pages

  38. arXiv:2110.15520  [pdf, other]

    cs.LG stat.ME stat.ML

    On Label Shift in Domain Adaptation via Wasserstein Distance

    Authors: Trung Le, Dat Do, Tuan Nguyen, Huy Nguyen, Hung Bui, Nhat Ho, Dinh Phung

    Abstract: We study the label shift problem between the source and target domains in general domain adaptation (DA) settings. We consider transformations transporting the target to source domains, which enable us to align the source and target examples. Through those transformations, we define the label shift between two domains via optimal transport and develop theory to investigate the properties of DA und…

    Submitted 1 March, 2022; v1 submitted 28 October, 2021; originally announced October 2021.

    Comments: 35 pages, 7 figures, 6 tables

  39. arXiv:2110.08678  [pdf, other]

    cs.LG cs.CL stat.ML

    Improving Transformers with Probabilistic Attention Keys

    Authors: Tam Nguyen, Tan M. Nguyen, Dung D. Le, Duy Khuong Nguyen, Viet-Anh Tran, Richard G. Baraniuk, Nhat Ho, Stanley J. Osher

    Abstract: Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the performance of the model. Inspired by this observati…

    Submitted 12 June, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

    Comments: 27 pages, 16 figures, 10 tables

    Journal ref: Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022

  40. arXiv:2110.07810  [pdf, other]

    cs.LG math.ST stat.ML

    Towards Statistical and Computational Complexities of Polyak Step Size Gradient Descent

    Authors: Tongzheng Ren, Fuheng Cui, Alexia Atsidakou, Sujay Sanghavi, Nhat Ho

    Abstract: We study the statistical and computational complexities of the Polyak step size gradient descent algorithm under generalized smoothness and Lojasiewicz conditions of the population loss function, namely, the limit of the empirical loss function when the sample size goes to infinity, and the stability between the gradients of the empirical and population loss functions, namely, the polynomial growt…

    Submitted 14 October, 2021; originally announced October 2021.

    Comments: First three authors contributed equally. 40 pages, 4 figures

  41. arXiv:2108.10961  [pdf, other]

    math.ST cs.IT stat.ML

    Entropic Gromov-Wasserstein between Gaussian Distributions

    Authors: Khang Le, Dung Le, Huy Nguyen, Dat Do, Tung Pham, Nhat Ho

    Abstract: We study the entropic Gromov-Wasserstein and its unbalanced version between (unbalanced) Gaussian distributions with different dimensions. When the metric is the inner product, which we refer to as inner product Gromov-Wasserstein (IGW), we demonstrate that the optimal transportation plans of entropic IGW and its unbalanced variant are (unbalanced) Gaussian distributions. Via an application of von…

    Submitted 24 February, 2022; v1 submitted 24 August, 2021; originally announced August 2021.

    Comments: 52 pages, 3 figures. Khang Le, Dung Le, Huy Nguyen contributed equally to this work

  42. arXiv:2108.09645  [pdf, other

    stat.ML cs.LG

    Improving Mini-batch Optimal Transport via Partial Transportation

    Authors: Khai Nguyen, Dang Nguyen, The-Anh Vu-Le, Tung Pham, Nhat Ho

    Abstract: Mini-batch optimal transport (m-OT) has been widely used recently to deal with the memory issue of OT in large-scale applications. Despite its practicality, m-OT suffers from misspecified mappings, namely, mappings that are optimal on the mini-batch level but are partially wrong in comparison with the optimal transportation plan between the original measures. Motivated by the misspecified ma… ▽ More

    Submitted 6 June, 2022; v1 submitted 22 August, 2021; originally announced August 2021.

    Comments: Accepted to ICML 2022, 36 pages, 18 figures, 18 tables

  43. arXiv:2108.07992  [pdf, other

    stat.ML cs.DS cs.LG math.OC stat.CO

    On Multimarginal Partial Optimal Transport: Equivalent Forms and Computational Complexity

    Authors: Khang Le, Huy Nguyen, Tung Pham, Nhat Ho

    Abstract: We study the multi-marginal partial optimal transport (POT) problem between $m$ discrete (unbalanced) measures with at most $n$ supports. We first prove that we can obtain two equivalent forms of the multimarginal POT problem in terms of the multimarginal optimal transport problem via novel extensions of the cost tensor. The first equivalent form is derived under the assumptions that the total masse… ▽ More

    Submitted 24 February, 2022; v1 submitted 18 August, 2021; originally announced August 2021.

    Comments: Accepted at AISTATS, 2022. Khang Le and Huy Nguyen contributed equally to this work

  44. arXiv:2107.10947  [pdf, other

    stat.CO math.CA stat.ME stat.ML

    On Integral Theorems and their Statistical Properties

    Authors: Nhat Ho, Stephen G. Walker

    Abstract: We introduce a class of integral theorems based on cyclic functions and Riemann sums approximating integrals. The Fourier integral theorem, derived as a combination of a transform and inverse transform, arises as a special case. The integral theorems provide natural estimators of density functions via Monte Carlo methods. Assessments of the quality of the density estimators can be used to obtain o… ▽ More

    Submitted 20 March, 2022; v1 submitted 22 July, 2021; originally announced July 2021.

    Comments: 21 pages, 5 figures. arXiv admin note: text overlap with arXiv:2106.06608

  45. arXiv:2106.15743  [pdf, other

    stat.ME math.ST

    BONuS: Multiple multivariate testing with a data-adaptive test statistic

    Authors: Chiao-Yu Yang, Lihua Lei, Nhat Ho, Will Fithian

    Abstract: We propose a new adaptive empirical Bayes framework, the Bag-Of-Null-Statistics (BONuS) procedure, for multiple testing where each hypothesis testing problem is itself multivariate or nonparametric. BONuS is an adaptive and interactive knockoff-type method that helps improve the testing power while controlling the false discovery rate (FDR), and is closely connected to the "counting knockoffs" pro… ▽ More

    Submitted 1 July, 2021; v1 submitted 29 June, 2021; originally announced June 2021.

  46. arXiv:2106.06608  [pdf, other

    stat.ME cs.LG stat.CO stat.ML

    Statistical Analysis from the Fourier Integral Theorem

    Authors: Nhat Ho, Stephen G. Walker

    Abstract: Taking the Fourier integral theorem as our starting point, in this paper we focus on natural Monte Carlo and fully nonparametric estimators of multivariate distributions and conditional distribution functions. We do this without the need for any estimated covariance matrix or dependence structure between variables. These aspects arise immediately from the integral theorem. Being able to model mult… ▽ More

    Submitted 11 June, 2021; originally announced June 2021.

    Comments: 20 pages, 10 figures

  47. arXiv:2102.07927  [pdf, other

    cs.LG stat.ML

    Structured Dropout Variational Inference for Bayesian Neural Networks

    Authors: Son Nguyen, Duong Nguyen, Khai Nguyen, Khoat Than, Hung Bui, Nhat Ho

    Abstract: Approximate inference in Bayesian deep networks exhibits a dilemma of how to yield high fidelity posterior approximations while maintaining computational efficiency and scalability. We tackle this challenge by introducing a novel variational structured approximation inspired by the Bayesian interpretation of Dropout regularization. Concretely, we focus on the inflexibility of the factorized struct… ▽ More

    Submitted 28 October, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

    Comments: 45 pages, 9 figures

  48. arXiv:2102.06857  [pdf, other

    cs.LG cs.DS math.OC stat.ML

    On Robust Optimal Transport: Computational Complexity and Barycenter Computation

    Authors: Khang Le, Huy Nguyen, Quang Nguyen, Tung Pham, Hung Bui, Nhat Ho

    Abstract: We consider robust variants of the standard optimal transport, named robust optimal transport, where marginal constraints are relaxed via Kullback-Leibler divergence. We show that Sinkhorn-based algorithms can approximate the optimal cost of robust optimal transport in $\widetilde{\mathcal{O}}(\frac{n^2}{\varepsilon})$ time, in which $n$ is the number of supports of the probability distributions a… ▽ More

    Submitted 27 October, 2021; v1 submitted 12 February, 2021; originally announced February 2021.

    Comments: Advances in NeurIPS, 2021; 52 pages, 10 figures; Khang Le and Huy Nguyen contributed equally to this work

  49. arXiv:2102.05912  [pdf, other

    stat.ML cs.LG

    On Transportation of Mini-batches: A Hierarchical Approach

    Authors: Khai Nguyen, Dang Nguyen, Quoc Nguyen, Tung Pham, Hung Bui, Dinh Phung, Trung Le, Nhat Ho

    Abstract: Mini-batch optimal transport (m-OT) has been successfully used in practical applications that involve probability measures with a very high number of supports. The m-OT solves several smaller optimal transport problems and then returns the average of their costs and transportation plans. Despite its scalability advantage, the m-OT does not consider the relationship between mini-batches which leads… ▽ More

    Submitted 6 June, 2022; v1 submitted 11 February, 2021; originally announced February 2021.

    Comments: Accepted to ICML 2022, 34 pages, 16 figures, 9 tables

  50. arXiv:2102.02756  [pdf, other

    cs.LG stat.ML

    On the computational and statistical complexity of over-parameterized matrix sensing

    Authors: Jiacheng Zhuo, Jeongyeol Kwon, Nhat Ho, Constantine Caramanis

    Abstract: We consider solving the low-rank matrix sensing problem with the Factorized Gradient Descent (FGD) method when the true rank is unknown and over-specified, which we refer to as over-parameterized matrix sensing. If the ground truth signal $\mathbf{X}^* \in \mathbb{R}^{d \times d}$ is of rank $r$, but we try to recover it using $\mathbf{F} \mathbf{F}^\top$ where $\mathbf{F} \in \mathbb{R}^{d \times k}$ and $k>r$, th… ▽ More

    Submitted 26 January, 2021; originally announced February 2021.