Search | arXiv e-print repository

arXiv:2301.12005 [pdf, other]

EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval

Authors: Seungyeon Kim, Ankit Singh Rawat, Manzil Zaheer, Sadeep Jayasumana, Veeranjaneyulu Sadhanala, Wittawat Jitkrittum, Aditya Krishna Menon, Rob Fergus, Sanjiv Kumar

Abstract: Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). In this paper, we aim to improve distillation methods that pave the way for the resource-efficient deployment of such models in practice. Inspired by our theoretical analysis of the teacher-student generalization gap for IR models, we propose a novel distillation approach that leverages… ▽ More Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). In this paper, we aim to improve distillation methods that pave the way for the resource-efficient deployment of such models in practice. Inspired by our theoretical analysis of the teacher-student generalization gap for IR models, we propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model. Unlike existing teacher score-based distillation methods, our proposed approach employs embedding matching tasks to provide a stronger signal to align the representations of the teacher and student models. In addition, it utilizes query generation to explore the data manifold to reduce the discrepancies between the student and the teacher where training data is sparse. Furthermore, our analysis also motivates novel asymmetric architectures for student models which realizes better embedding alignment without increasing online inference cost. On standard benchmarks like MSMARCO, we show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance. △ Less

Submitted 3 July, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

arXiv:2209.09175 [pdf, other]

Exponential Family Trend Filtering on Lattices

Authors: Veeranjaneyulu Sadhanala, Robert Bassett, James Sharpnack, Daniel J. McDonald

Abstract: Trend filtering is a modern approach to nonparametric regression that is more adaptive to local smoothness than splines or similar basis procedures. Existing analyses of trend filtering focus on estimating a function corrupted by homoskedastic Gaussian noise, but our work extends this technique to general exponential family distributions. This extension is motivated by the need to study massive, g… ▽ More Trend filtering is a modern approach to nonparametric regression that is more adaptive to local smoothness than splines or similar basis procedures. Existing analyses of trend filtering focus on estimating a function corrupted by homoskedastic Gaussian noise, but our work extends this technique to general exponential family distributions. This extension is motivated by the need to study massive, gridded climate data derived from polar-orbiting satellites. We present algorithms tailored to large problems, theoretical results for general exponential family likelihoods, and principled methods for tuning parameter selection without excess computation. △ Less

Submitted 19 September, 2022; originally announced September 2022.

Comments: 53 pages; 6 figures; 3 tables

arXiv:2112.14758 [pdf, other]

Multivariate Trend Filtering for Lattice Data

Authors: Veeranjaneyulu Sadhanala, Yu-Xiang Wang, Addison J. Hu, Ryan J. Tibshirani

Abstract: We study a multivariate version of trend filtering, called Kronecker trend filtering or KTF, for the case in which the design points form a lattice in $d$ dimensions. KTF is a natural extension of univariate trend filtering (Steidl et al., 2006; Kim et al., 2009; Tibshirani, 2014), and is defined by minimizing a penalized least squares problem whose penalty term sums the absolute (higher-order) di… ▽ More We study a multivariate version of trend filtering, called Kronecker trend filtering or KTF, for the case in which the design points form a lattice in $d$ dimensions. KTF is a natural extension of univariate trend filtering (Steidl et al., 2006; Kim et al., 2009; Tibshirani, 2014), and is defined by minimizing a penalized least squares problem whose penalty term sums the absolute (higher-order) differences of the parameter to be estimated along each of the coordinate directions. The corresponding penalty operator can be written in terms of Kronecker products of univariate trend filtering penalty operators, hence the name Kronecker trend filtering. Equivalently, one can view KTF in terms of an $\ell_1$-penalized basis regression problem where the basis functions are tensor products of falling factorial functions, a piecewise polynomial (discrete spline) basis that underlies univariate trend filtering. This paper is a unification and extension of the results in Sadhanala et al. (2016, 2017). We develop a complete set of theoretical results that describe the behavior of $k^{\mathrm{th}}$ order Kronecker trend filtering in $d$ dimensions, for every $k \geq 0$ and $d \geq 1$. This reveals a number of interesting phenomena, including the dominance of KTF over linear smoothers in estimating heterogeneously smooth functions, and a phase transition at $d=2(k+1)$, a boundary past which (on the high dimension-to-smoothness side) linear smoothers fail to be consistent entirely. We also leverage recent results on discrete splines from Tibshirani (2020), in particular, discrete spline interpolation results that enable us to extend the KTF estimate to any off-lattice location in constant-time (independent of the size of the lattice $n$). △ Less

Submitted 5 April, 2024; v1 submitted 29 December, 2021; originally announced December 2021.

arXiv:1903.10083 [pdf, other]

A Higher-Order Kolmogorov-Smirnov Test

Authors: Veeranjaneyulu Sadhanala, Yu-Xiang Wang, Aaditya Ramdas, Ryan J. Tibshirani

Abstract: We present an extension of the Kolmogorov-Smirnov (KS) two-sample test, which can be more sensitive to differences in the tails. Our test statistic is an integral probability metric (IPM) defined over a higher-order total variation ball, recovering the original KS test as its simplest case. We give an exact representer result for our IPM, which generalizes the fact that the original KS test statis… ▽ More We present an extension of the Kolmogorov-Smirnov (KS) two-sample test, which can be more sensitive to differences in the tails. Our test statistic is an integral probability metric (IPM) defined over a higher-order total variation ball, recovering the original KS test as its simplest case. We give an exact representer result for our IPM, which generalizes the fact that the original KS test statistic can be expressed in equivalent variational and CDF forms. For small enough orders ($k \leq 5$), we develop a linear-time algorithm for computing our higher-order KS test statistic; for all others ($k \geq 6$), we give a nearly linear-time approximation. We derive the asymptotic null distribution for our test, and show that our nearly linear-time approximation shares the same asymptotic null. Lastly, we complement our theory with numerical studies. △ Less

Submitted 24 March, 2019; originally announced March 2019.

Comments: 18 pages, AISTATS 2019

arXiv:1702.05037 [pdf, other]

Additive Models with Trend Filtering

Authors: Veeranjaneyulu Sadhanala, Ryan J. Tibshirani

Abstract: We study additive models built with trend filtering, i.e., additive models whose components are each regularized by the (discrete) total variation of their $k$th (discrete) derivative, for a chosen integer $k \geq 0$. This results in $k$th degree piecewise polynomial components, (e.g., $k=0$ gives piecewise constant components, $k=1$ gives piecewise linear, $k=2$ gives piecewise quadratic, etc.).… ▽ More We study additive models built with trend filtering, i.e., additive models whose components are each regularized by the (discrete) total variation of their $k$th (discrete) derivative, for a chosen integer $k \geq 0$. This results in $k$th degree piecewise polynomial components, (e.g., $k=0$ gives piecewise constant components, $k=1$ gives piecewise linear, $k=2$ gives piecewise quadratic, etc.). Analogous to its advantages in the univariate case, additive trend filtering has favorable theoretical and computational properties, thanks in large part to the localized nature of the (discrete) total variation regularizer that it uses. On the theory side, we derive fast error rates for additive trend filtering estimates, and show these rates are minimax optimal when the underlying function is additive and has component functions whose derivatives are of bounded variation. We also show that these rates are unattainable by additive smoothing splines (and by additive models built from linear smoothers, in general). On the computational side, as per the standard in additive models, backfitting is an appealing method for optimization, but it is particularly appealing for additive trend filtering because we can leverage a few highly efficient univariate trend filtering solvers. Going one step further, we describe a new backfitting algorithm whose iterations can be run in parallel, which (as far as we know) is the first of its kind. Lastly, we present experiments to examine the empirical performance of additive trend filtering. △ Less

Submitted 21 November, 2018; v1 submitted 16 February, 2017; originally announced February 2017.

Comments: 63 pages

arXiv:1605.08400 [pdf, other]

Total Variation Classes Beyond 1d: Minimax Rates, and the Limitations of Linear Smoothers

Authors: Veeranjaneyulu Sadhanala, Yu-Xiang Wang, Ryan Tibshirani

Abstract: We consider the problem of estimating a function defined over $n$ locations on a $d$-dimensional grid (having all side lengths equal to $n^{1/d}$). When the function is constrained to have discrete total variation bounded by $C_n$, we derive the minimax optimal (squared) $\ell_2$ estimation error rate, parametrized by $n$ and $C_n$. Total variation denoising, also known as the fused lasso, is seen… ▽ More We consider the problem of estimating a function defined over $n$ locations on a $d$-dimensional grid (having all side lengths equal to $n^{1/d}$). When the function is constrained to have discrete total variation bounded by $C_n$, we derive the minimax optimal (squared) $\ell_2$ estimation error rate, parametrized by $n$ and $C_n$. Total variation denoising, also known as the fused lasso, is seen to be rate optimal. Several simpler estimators exist, such as Laplacian smoothing and Laplacian eigenmaps. A natural question is: can these simpler estimators perform just as well? We prove that these estimators, and more broadly all estimators given by linear transformations of the input data, are suboptimal over the class of functions with bounded variation. This extends fundamental findings of Donoho and Johnstone [1998] on 1-dimensional total variation spaces to higher dimensions. The implication is that the computationally simpler methods cannot be used for such sophisticated denoising tasks, without sacrificing statistical accuracy. We also derive minimax rates for discrete Sobolev spaces over $d$-dimensional grids, which are, in some sense, smaller than the total variation function spaces. Indeed, these are small enough spaces that linear estimators can be optimal---and a few well-known ones are, such as Laplacian smoothing and Laplacian eigenmaps, as we show. Lastly, we investigate the problem of adaptivity of the total variation denoiser to these smaller Sobolev function spaces. △ Less

Submitted 26 May, 2016; originally announced May 2016.

arXiv:1409.6086 [pdf, other]

Parallel and Distributed Block-Coordinate Frank-Wolfe Algorithms

Authors: Yu-Xiang Wang, Veeranjaneyulu Sadhanala, Wei Dai, Willie Neiswanger, Suvrit Sra, Eric P. Xing

Abstract: We develop parallel and distributed Frank-Wolfe algorithms; the former on shared memory machines with mini-batching, and the latter in a delayed update framework. Whenever possible, we perform computations asynchronously, which helps attain speedups on multicore machines as well as in distributed environments. Moreover, instead of worst-case bounded delays, our methods only depend (mildly) on \emp… ▽ More We develop parallel and distributed Frank-Wolfe algorithms; the former on shared memory machines with mini-batching, and the latter in a delayed update framework. Whenever possible, we perform computations asynchronously, which helps attain speedups on multicore machines as well as in distributed environments. Moreover, instead of worst-case bounded delays, our methods only depend (mildly) on \emph{expected} delays, allowing them to be robust to stragglers and faulty worker threads. Our algorithms assume block-separable constraints, and subsume the recent Block-Coordinate Frank-Wolfe (BCFW) method~\citep{lacoste2013block}. Our analysis reveals problem-dependent quantities that govern the speedups of our methods over BCFW. We present experiments on structural SVM and Group Fused Lasso, obtaining significant speedups over competing state-of-the-art (and synchronous) methods. △ Less

Submitted 12 February, 2016; v1 submitted 22 September, 2014; originally announced September 2014.

Showing 1–7 of 7 results for author: Sadhanala, V