-
Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads
Authors:
Ali Khaleghi Rahimian,
Manish Kumar Govind,
Subhajit Maity,
Dominick Reilly,
Christian Kümmerle,
Srijan Das,
Aritra Dutta
Abstract:
Visual perception tasks are predominantly solved by Vision Transformer (ViT) architectures, which, despite their effectiveness, encounter a computational bottleneck due to the quadratic complexity of computing self-attention. This inefficiency is largely due to the self-attention heads capturing redundant token interactions, reflecting inherent redundancy within visual data. Many works have aimed to reduce the computational complexity of self-attention in ViTs, leading to the development of efficient and sparse transformer architectures. In this paper, viewing through the efficiency lens, we realized that introducing any sparse self-attention strategy in ViTs can keep the computational overhead low. However, these strategies are sub-optimal as they often fail to capture fine-grained visual details. This observation leads us to propose a general, efficient, sparse architecture, named Fibottention, for approximating self-attention with superlinear complexity that is built upon Fibonacci sequences. The key strategies in Fibottention include: it excludes proximate tokens to reduce redundancy, employs structured sparsity by design to decrease computational demands, and incorporates inception-like diversity across attention heads. This diversity ensures the capture of complementary information through non-overlapping token interactions, optimizing both performance and resource utilization in ViTs for visual representation learning. We embed our Fibottention mechanism into multiple state-of-the-art transformer architectures dedicated to visual tasks. Leveraging only 2-6% of the elements in the self-attention heads, Fibottention, in conjunction with ViT and its variants, consistently achieves significant performance boosts compared to standard ViTs on nine datasets across three domains: image classification, video understanding, and robot learning tasks.
Submitted 27 June, 2024;
originally announced June 2024.
-
UnitNorm: Rethinking Normalization for Transformers in Time Series
Authors:
Nan Huang,
Christian Kümmerle,
Xiang Zhang
Abstract:
Normalization techniques are crucial for enhancing Transformer models' performance and stability in time series analysis tasks, yet traditional methods like batch and layer normalization often lead to issues such as token shift, attention shift, and sparse attention. We propose UnitNorm, a novel approach that scales input vectors by their norms and modulates attention patterns, effectively circumventing these challenges. Grounded in existing normalization frameworks, UnitNorm's effectiveness is demonstrated across diverse time series analysis tasks, including forecasting, classification, and anomaly detection, via a rigorous evaluation on 6 state-of-the-art models and 10 datasets. Notably, UnitNorm shows superior performance, especially in scenarios requiring robust attention mechanisms and contextual comprehension, evidenced by improvements of up to a 1.46 decrease in MSE for forecasting and a 4.89% increase in accuracy for classification. This work not only calls for a reevaluation of normalization strategies in time series Transformers but also sets a new direction for enhancing model performance and stability. The source code is available at https://anonymous.4open.science/r/UnitNorm-5B84.
Submitted 24 May, 2024;
originally announced May 2024.
-
Recovering Simultaneously Structured Data via Non-Convex Iteratively Reweighted Least Squares
Authors:
Christian Kümmerle,
Johannes Maly
Abstract:
We propose a new algorithm for the problem of recovering data that adheres to multiple, heterogeneous low-dimensional structures from linear observations. Focusing on data matrices that are simultaneously row-sparse and low-rank, we propose and analyze an iteratively reweighted least squares (IRLS) algorithm that is able to leverage both structures. In particular, it optimizes a combination of non-convex surrogates for row-sparsity and rank, a balancing of which is built into the algorithm. We prove locally quadratic convergence of the iterates to a simultaneously structured data matrix in a regime of minimal sample complexity (up to constants and a logarithmic factor), which is known to be impossible for a combination of convex surrogates. In experiments, we show that the IRLS method exhibits favorable empirical convergence, identifying simultaneously row-sparse and low-rank matrices from fewer measurements than state-of-the-art methods. Code is available at https://github.com/ckuemmerle/simirls.
Submitted 18 January, 2024; v1 submitted 8 June, 2023;
originally announced June 2023.
-
Learning Transition Operators From Sparse Space-Time Samples
Authors:
Christian Kümmerle,
Mauro Maggioni,
Sui Tang
Abstract:
We consider the nonlinear inverse problem of learning a transition operator $\mathbf{A}$ from partial observations at different times, in particular from sparse observations of entries of its powers $\mathbf{A},\mathbf{A}^2,\cdots,\mathbf{A}^{T}$. This Spatio-Temporal Transition Operator Recovery problem is motivated by the recent interest in learning time-varying graph signals that are driven by graph operators depending on the underlying graph topology. We address the nonlinearity of the problem by embedding it into a higher-dimensional space of suitable block-Hankel matrices, where it becomes a low-rank matrix completion problem, even if $\mathbf{A}$ is of full rank. For both a uniform and an adaptive random space-time sampling model, we quantify the recoverability of the transition operator via suitable measures of incoherence of these block-Hankel embedding matrices. For graph transition operators these measures of incoherence depend on the interplay between the dynamics and the graph topology. We develop a suitable non-convex iterative reweighted least squares (IRLS) algorithm, establish its quadratic local convergence, and show that, in optimal scenarios, no more than $\mathcal{O}(rn \log(nT))$ space-time samples are sufficient to ensure accurate recovery of a rank-$r$ operator $\mathbf{A}$ of size $n \times n$. This establishes that spatial samples can be substituted by a comparable number of space-time samples. We provide an efficient implementation of the proposed IRLS algorithm with space complexity of order $O(r n T)$ and per-iteration time complexity linear in $n$. Numerical experiments for transition operators based on several graph models confirm that the theoretical findings accurately track empirical phase transitions, and illustrate the applicability and scalability of the proposed algorithm.
Submitted 1 December, 2022;
originally announced December 2022.
-
Global Linear and Local Superlinear Convergence of IRLS for Non-Smooth Robust Regression
Authors:
Liangzu Peng,
Christian Kümmerle,
René Vidal
Abstract:
We advance both the theory and practice of robust $\ell_p$-quasinorm regression for $p \in (0,1]$ by using novel variants of iteratively reweighted least-squares (IRLS) to solve the underlying non-smooth problem. In the convex case, $p=1$, we prove that this IRLS variant converges globally at a linear rate under a mild, deterministic condition on the feature matrix called the \textit{stable range space property}. In the non-convex case, $p\in(0,1)$, we prove that under a similar condition, IRLS converges locally to the global minimizer at a superlinear rate of order $2-p$; the rate becomes quadratic as $p\to 0$. We showcase the proposed methods in three applications: real phase retrieval, regression without correspondences, and robust face restoration. The results show that (1) IRLS can handle a larger number of outliers than other methods, (2) it is faster than competing methods at the same level of accuracy, and (3) it restores a sparsely corrupted face image with satisfactory visual quality. https://github.com/liangzu/IRLS-NeurIPS2022
Submitted 11 October, 2022; v1 submitted 24 August, 2022;
originally announced August 2022.
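As a rough illustration of the reweighting idea behind the entry above, the following is a minimal sketch of a textbook IRLS loop for robust $\ell_p$ regression. It is not the variant analyzed in the paper: the function name `irls_lp_regression`, the residual floor `eps`, and its halving schedule are assumptions made here for illustration only.

```python
import numpy as np


def irls_lp_regression(A, y, p=1.0, n_iter=100, eps_floor=1e-8):
    """Approximately minimize sum_i |a_i^T x - y_i|^p for p in (0, 1].

    Hypothetical helper: each step solves a weighted least-squares problem
    whose weights max(|r_i|, eps)^(p-2) down-weight large residuals.
    """
    # Start from the ordinary least-squares fit.
    x = np.linalg.lstsq(A, y, rcond=None)[0]
    eps = 1.0  # smoothing floor on the residuals (assumed schedule)
    for _ in range(n_iter):
        r = A @ x - y
        w = np.maximum(np.abs(r), eps) ** (p - 2.0)
        sw = np.sqrt(w)
        # Weighted least squares via row scaling by sqrt(w).
        x = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)[0]
        # Shrink the floor so the surrogate approaches the l_p objective.
        eps = max(0.5 * eps, eps_floor)
    return x


# Usage: a linear model where 20% of the responses are grossly corrupted.
rng = np.random.default_rng(0)
A_demo = rng.standard_normal((200, 5))
x_demo = rng.standard_normal(5)
y_demo = A_demo @ x_demo
bad = rng.choice(200, size=40, replace=False)
y_demo[bad] += 10.0 * rng.standard_normal(40)
x_fit = irls_lp_regression(A_demo, y_demo, p=1.0)
print("max abs error:", np.max(np.abs(x_fit - x_demo)))
```

The row-scaling form avoids building the normal equations explicitly, which tends to be gentler numerically when the weights span many orders of magnitude.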
-
A Scalable Second Order Method for Ill-Conditioned Matrix Completion from Few Samples
Authors:
Christian Kümmerle,
Claudio Mayrink Verdun
Abstract:
We propose an iterative algorithm for low-rank matrix completion that can be interpreted as an iteratively reweighted least squares (IRLS) algorithm, a saddle-escaping smoothing Newton method or a variable metric proximal gradient method applied to a non-convex rank surrogate. It combines the favorable data-efficiency of previous IRLS approaches with an improved scalability by several orders of magnitude. We establish the first local convergence guarantee from a minimal number of samples for that class of algorithms, showing that the method attains a local quadratic convergence rate. Furthermore, we show that the linear systems to be solved are well-conditioned even for very ill-conditioned ground truth matrices. We provide extensive experiments, indicating that unlike many state-of-the-art approaches, our method is able to complete very ill-conditioned matrices with a condition number of up to $10^{10}$ from few samples, while being competitive in its scalability.
Submitted 3 June, 2021;
originally announced June 2021.
-
Dictionary-Sparse Recovery From Heavy-Tailed Measurements
Authors:
Pedro Abdalla,
Christian Kümmerle
Abstract:
The recovery of signals that are sparse not in a basis, but rather sparse with respect to an over-complete dictionary is one of the most flexible settings in the field of compressed sensing with numerous applications. As in the standard compressed sensing setting, it is possible that the signal can be reconstructed efficiently from few, linear measurements, for example by the so-called $\ell_1$-synthesis method.
However, it has been less well-understood which measurement matrices provably work for this setting. Whereas in the standard setting, it has been shown that even certain heavy-tailed measurement matrices can be used in the same sample complexity regime as Gaussian matrices, comparable results are only available for the restrictive class of sub-Gaussian measurement vectors as far as the recovery of dictionary-sparse signals via $\ell_1$-synthesis is concerned.
In this work, we fill this gap and establish optimal guarantees for the recovery of vectors that are (approximately) sparse with respect to a dictionary via the $\ell_1$-synthesis method from linear, potentially noisy measurements for a large class of random measurement matrices. In particular, we show that random measurements that fulfill only a small-ball assumption and a weak moment assumption, such as random vectors with i.i.d. Student-$t$ entries with a logarithmic number of degrees of freedom, lead to comparable guarantees as (sub-)Gaussian measurements.
As a technical tool, we show a bound on the expectation of the sum of squared order statistics under very general assumptions, which might be of independent interest.
As a corollary of our results, we also obtain a slight improvement on the weakest assumption on a measurement matrix with i.i.d. rows sufficient for uniform recovery in standard compressed sensing, improving on results by Lecué and Mendelson and Dirksen, Lecué and Rauhut.
Submitted 29 September, 2021; v1 submitted 20 January, 2021;
originally announced January 2021.
-
Iteratively Reweighted Least Squares for Basis Pursuit with Global Linear Convergence Rate
Authors:
Christian Kümmerle,
Claudio Mayrink Verdun,
Dominik Stöger
Abstract:
The recovery of sparse data is at the core of many applications in machine learning and signal processing. While such problems can be tackled using $\ell_1$-regularization as in the LASSO estimator and in the Basis Pursuit approach, specialized algorithms are typically required to solve the corresponding high-dimensional non-smooth optimization for large instances. Iteratively Reweighted Least Squares (IRLS) is a widely used algorithm for this purpose due to its excellent numerical performance. However, while existing theory is able to guarantee convergence of this algorithm to the minimizer, it does not provide a global convergence rate. In this paper, we prove that a variant of IRLS converges with a global linear rate to a sparse solution, i.e., with a linear error decrease occurring immediately from any initialization, if the measurements fulfill the usual null space property assumption. We support our theory by numerical experiments showing that our linear rate captures the correct dimension dependence. We anticipate that our theoretical findings will lead to new insights for many other use cases of the IRLS algorithm, such as in low-rank matrix recovery.
Submitted 11 November, 2021; v1 submitted 22 December, 2020;
originally announced December 2020.
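To make the IRLS approach to Basis Pursuit mentioned in the entry above more concrete, here is a minimal sketch of a classical IRLS scheme for $\min \|x\|_1$ subject to $Ax = y$. It is not necessarily the variant analyzed in the paper: the function name `irls_basis_pursuit`, the sparsity guess `K`, the smoothing update based on the $(K+1)$-th largest entry, and the floor `eps_floor` are all assumptions chosen here for illustration.

```python
import numpy as np


def irls_basis_pursuit(A, y, K, n_iter=200, eps_floor=1e-6):
    """Approximate argmin ||x||_1 subject to A x = y (hypothetical helper).

    Each step minimizes sum_i x_i^2 / d_i subject to A x = y, where
    d_i = sqrt(x_i^2 + eps^2) comes from the previous iterate. The
    closed-form solution is x = D A^T (A D A^T)^{-1} y with D = diag(d).
    """
    m, n = A.shape
    # Unweighted start: the minimum l2-norm solution of A x = y.
    x = A.T @ np.linalg.solve(A @ A.T, y)
    eps = 1.0
    for _ in range(n_iter):
        # (K+1)-th largest magnitude drives the smoothing parameter down.
        r = np.sort(np.abs(x))[::-1][K]
        eps = max(min(eps, r / n), eps_floor)
        d = np.sqrt(x**2 + eps**2)
        AD = A * d  # same as A @ diag(d), via column-wise broadcasting
        x = d * (A.T @ np.linalg.solve(AD @ A.T, y))
    return x


# Usage: recover a 5-sparse vector from 40 Gaussian measurements.
rng = np.random.default_rng(0)
m, n, s = 40, 100, 5
A_demo = rng.standard_normal((m, n)) / np.sqrt(m)
x_demo = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
x_demo[support] = rng.choice([-1.0, 1.0], size=s) * (1.0 + rng.random(s))
x_rec = irls_basis_pursuit(A_demo, A_demo @ x_demo, K=s)
print("max abs error:", np.max(np.abs(x_rec - x_demo)))
```

The floor on `eps` keeps the $m \times m$ system $A D A^\top$ reasonably conditioned once the iterate is nearly sparse; without it, the off-support entries of `d` would shrink toward zero and the solve could become unreliable.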
-
On the robustness of noise-blind low-rank recovery from rank-one measurements
Authors:
Felix Krahmer,
Christian Kümmerle,
Oleh Melnyk
Abstract:
We prove new results about the robustness of well-known convex noise-blind optimization formulations for the reconstruction of low-rank matrices from underdetermined linear measurements. Our results are applicable for symmetric rank-one measurements as used in a formulation of the phase retrieval problem.
We obtain these results by establishing that with high probability rank-one measurement operators defined by i.i.d. Gaussian vectors exhibit the so-called Schatten-1 quotient property, which corresponds to a lower bound for the inradius of their image of the nuclear norm (Schatten-1) unit ball.
We complement our analysis by numerical experiments comparing the solutions of noise-blind and noise-aware formulations. These experiments confirm that noise-blind optimization methods exhibit comparable robustness to noise-aware formulations.
Keywords: low-rank matrix recovery, phase retrieval, quotient property, noise-blind, robustness, nuclear norm minimization
Submitted 23 October, 2020;
originally announced October 2020.
-
Escaping Saddle Points in Ill-Conditioned Matrix Completion with a Scalable Second Order Method
Authors:
Christian Kümmerle,
Claudio M. Verdun
Abstract:
We propose an iterative algorithm for low-rank matrix completion that can be interpreted as both an iteratively reweighted least squares (IRLS) algorithm and a saddle-escaping smoothing Newton method applied to a non-convex rank surrogate objective. It combines the favorable data efficiency of previous IRLS approaches with an improved scalability by several orders of magnitude. Our method attains a local quadratic convergence rate already for a number of samples that is close to the information theoretical limit. We show in numerical experiments that unlike many state-of-the-art approaches, our approach is able to complete very ill-conditioned matrices with a condition number of up to $10^{10}$ from few samples.
Submitted 7 September, 2020;
originally announced September 2020.
-
On the geometry of polytopes generated by heavy-tailed random vectors
Authors:
Olivier Guédon,
Felix Krahmer,
Christian Kümmerle,
Shahar Mendelson,
Holger Rauhut
Abstract:
We study the geometry of centrally-symmetric random polytopes, generated by $N$ independent copies of a random vector $X$ taking values in $\mathbb{R}^n$. We show that under minimal assumptions on $X$, for $N \gtrsim n$ and with high probability, the polytope contains a deterministic set that is naturally associated with the random vector---namely, the polar of a certain floating body. This solves the long-standing question on whether such a random polytope contains a canonical body. Moreover, by identifying the floating bodies associated with various random vectors we recover the estimates that have been obtained previously, and thanks to the minimal assumptions on $X$ we derive estimates in cases that had been out of reach, involving random polytopes generated by heavy-tailed random vectors (e.g., when $X$ is $q$-stable or when $X$ has an unconditional structure). Finally, the structural results are used for the study of a fundamental question in compressive sensing---noise blind sparse recovery.
Submitted 16 July, 2019;
originally announced July 2019.
-
The Oracle of DLphi
Authors:
Dominik Alfke,
Weston Baines,
Jan Blechschmidt,
Mauricio J. del Razo Sarmina,
Amnon Drory,
Dennis Elbrächter,
Nando Farchmin,
Matteo Gambara,
Silke Glas,
Philipp Grohs,
Peter Hinz,
Danijel Kivaranovic,
Christian Kümmerle,
Gitta Kutyniok,
Sebastian Lunz,
Jan Macdonald,
Ryan Malthaner,
Gregory Naisat,
Ariel Neufeld,
Philipp Christian Petersen,
Rafael Reisenhofer,
Jun-Da Sheng,
Laura Thesing,
Philipp Trunschke,
Johannes von Lindheim
, et al. (2 additional authors not shown)
Abstract:
We present a novel technique based on deep learning and set theory which yields exceptional classification and prediction results. Having access to a sufficiently large amount of labelled training data, our methodology is capable of predicting the labels of the test data almost always even if the training data is entirely unrelated to the test data. In other words, we prove in a specific setting that as long as one has access to enough data points, the quality of the data is irrelevant.
Submitted 27 January, 2019; v1 submitted 17 January, 2019;
originally announced January 2019.
-
Denoising and Completion of Structured Low-Rank Matrices via Iteratively Reweighted Least Squares
Authors:
Christian Kümmerle,
Claudio Mayrink Verdun
Abstract:
We propose a new Iteratively Reweighted Least Squares (IRLS) algorithm for the problem of completing or denoising low-rank matrices that are structured, e.g., that possess a Hankel, Toeplitz or block-Hankel/Toeplitz structure. The algorithm optimizes an objective based on a non-convex surrogate of the rank by solving a sequence of quadratic problems. Our strategy combines computational efficiency, as it operates on a lower dimensional generator space of the structured matrices, with high statistical accuracy which can be observed in experiments on hard estimation and completion tasks. Our experiments show that the proposed algorithm StrucHMIRLS exhibits an empirical recovery probability close to 1 from fewer samples than the state-of-the-art in a Hankel matrix completion task arising from the problem of spectral super-resolution of badly separated frequencies. Furthermore, we explain how the proposed algorithm for structured low-rank recovery can be used as preprocessing step for improved robustness in frequency or line spectrum estimation problems.
Submitted 18 November, 2018;
originally announced November 2018.
-
A Quotient Property for Matrices with Heavy-Tailed Entries and its Application to Noise-Blind Compressed Sensing
Authors:
Felix Krahmer,
Christian Kümmerle,
Holger Rauhut
Abstract:
For a large class of random matrices $A$ with i.i.d. entries we show that the $\ell_1$-quotient property holds with probability exponentially close to 1. In contrast to previous results, our analysis does not require concentration of the entrywise distributions. We provide a unified proof that recovers corresponding previous results for (sub-)Gaussian and Weibull distributions. Our findings generalize known results on the geometry of random polytopes, providing lower bounds on the size of the largest Euclidean ball contained in the centrally symmetric polytope spanned by the columns of $A$. At the same time, our results establish robustness of noise-blind $\ell_1$-decoders for recovering sparse vectors $x$ from underdetermined, noisy linear measurements $y = Ax + w$ under the weakest possible assumptions on the entrywise distributions that allow for recovery with optimal sample complexity even in the noiseless case. Our analysis predicts superior robustness behavior for measurement matrices with super-Gaussian entries, which we confirm by numerical experiments.
Submitted 11 June, 2018;
originally announced June 2018.
-
Harmonic Mean Iteratively Reweighted Least Squares for Low-Rank Matrix Recovery
Authors:
Christian Kümmerle,
Juliane Sigl
Abstract:
We propose a new iteratively reweighted least squares (IRLS) algorithm for the recovery of a matrix $X \in \mathbb{C}^{d_1\times d_2}$ of rank $r \ll\min(d_1,d_2)$ from incomplete linear observations, solving a sequence of low complexity linear problems. The easily implementable algorithm, which we call harmonic mean iteratively reweighted least squares (HM-IRLS), optimizes a non-convex Schatten-$p$ quasi-norm penalization to promote low-rankness and carries three major strengths, in particular for the matrix completion setting. First, we observe a remarkable global convergence behavior of the algorithm's iterates to the low-rank matrix for relevant, interesting cases, for which any other state-of-the-art optimization approach fails the recovery. Secondly, HM-IRLS exhibits an empirical recovery probability close to $1$ even for a number of measurements very close to the theoretical lower bound $r (d_1 +d_2 -r)$, i.e., already for significantly fewer linear observations than any other tractable approach in the literature. Thirdly, HM-IRLS exhibits a locally superlinear rate of convergence (of order $2-p$) if the linear observations fulfill a suitable null space property. While for the first two properties we have so far only strong empirical evidence, we prove the third property as our main theoretical result.
Submitted 27 February, 2018; v1 submitted 15 March, 2017;
originally announced March 2017.