Search | arXiv e-print repository

Features Fusion for Dual-View Mammography Mass Detection

Authors: Arina Varlamova, Valery Belotsky, Grigory Novikov, Anton Konushin, Evgeny Sidorov

Abstract: Detection of malignant lesions on mammography images is extremely important for early breast cancer diagnosis. In clinical practice, images are acquired from two different angles, and radiologists can fully utilize information from both views, simultaneously locating the same lesion. However, for automatic detection approaches such information fusion remains a challenge. In this paper, we propose… ▽ More Detection of malignant lesions on mammography images is extremely important for early breast cancer diagnosis. In clinical practice, images are acquired from two different angles, and radiologists can fully utilize information from both views, simultaneously locating the same lesion. However, for automatic detection approaches such information fusion remains a challenge. In this paper, we propose a new model called MAMM-Net, which allows the processing of both mammography views simultaneously by sharing information not only on an object level, as seen in existing works, but also on a feature level. MAMM-Net's key component is the Fusion Layer, based on deformable attention and designed to increase detection precision while kee** high recall. Our experiments show superior performance on the public DDSM dataset compared to the previous state-of-the-art model, while introducing new helpful features such as lesion annotation on pixel-level and classification of lesions malignancy. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: Accepted at ISBI 2024 (21st IEEE International Symposium on Biomedical Imaging)

arXiv:2306.02697 [pdf, other]

Efficient GPT Model Pre-training using Tensor Train Matrix Representation

Authors: Viktoriia Chekalina, Georgii Novikov, Julia Gusak, Ivan Oseledets, Alexander Panchenko

Abstract: Large-scale transformer models have shown remarkable performance in language modelling tasks. However, such models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch. To reduce the number of the parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix~(TTM)… ▽ More Large-scale transformer models have shown remarkable performance in language modelling tasks. However, such models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch. To reduce the number of the parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix~(TTM) structure. Finally, we customize forward and backward operations through the TTM-based layer for simplicity and the stableness of further training. % The resulting GPT-2-based model stores up to 40% fewer parameters, showing the perplexity comparable to the original model. On the downstream tasks, including language understanding and text summarization, the model performs similarly to the original GPT-2 model. The proposed tensorized layers could be used to efficiently pre-training other Transformer models. △ Less

Submitted 5 June, 2023; originally announced June 2023.

arXiv:2302.10844 [pdf, ps, other]

Robust Mean Estimation Without Moments for Symmetric Distributions

Authors: Gleb Novikov, David Steurer, Stefan Tiegel

Abstract: We study the problem of robustly estimating the mean or location parameter without moment assumptions. We show that for a large class of symmetric distributions, the same error as in the Gaussian setting can be achieved efficiently. The distributions we study include products of arbitrary symmetric one-dimensional distributions, such as product Cauchy distributions, as well as elliptical distribut… ▽ More We study the problem of robustly estimating the mean or location parameter without moment assumptions. We show that for a large class of symmetric distributions, the same error as in the Gaussian setting can be achieved efficiently. The distributions we study include products of arbitrary symmetric one-dimensional distributions, such as product Cauchy distributions, as well as elliptical distributions. For product distributions and elliptical distributions with known scatter (covariance) matrix, we show that given an $\varepsilon$-corrupted sample, we can with probability at least $1-δ$ estimate its location up to error $O(\varepsilon \sqrt{\log(1/\varepsilon)})$ using $\tfrac{d\log(d) + \log(1/δ)}{\varepsilon^2 \log(1/\varepsilon)}$ samples. This result matches the best-known guarantees for the Gaussian distribution and known SQ lower bounds (up to the $\log(d)$ factor). For elliptical distributions with unknown scatter (covariance) matrix, we propose a sequence of efficient algorithms that approaches this optimal error. Specifically, for every $k \in \mathbb{N}$, we design an estimator using time and samples $\tilde{O}({d^k})$ achieving error $O(\varepsilon^{1-\frac{1}{2k}})$. This matches the error and running time guarantees when assuming certifiably bounded moments of order up to $k$. For unknown covariance, such error bounds of $o(\sqrt{\varepsilon})$ are not even known for (general) sub-Gaussian distributions. Our algorithms are based on a generalization of the well-known filtering technique. We show how this machinery can be combined with Huber-loss-based techniques to work with projections of the noise that behave more nicely than the initial noise. Moreover, we show how SoS proofs can be used to obtain algorithmic guarantees even for distributions without a first moment. We believe that this approach may find other applications in future works. △ Less

Submitted 8 November, 2023; v1 submitted 21 February, 2023; originally announced February 2023.

Comments: Accepted at NeurIPS 2023

arXiv:2302.10158 [pdf, ps, other]

Sparse PCA Beyond Covariance Thresholding

Authors: Gleb Novikov

Abstract: In the Wishart model for sparse PCA we are given $n$ samples $Y_1,\ldots, Y_n$ drawn independently from a $d$-dimensional Gaussian distribution $N({0, Id + βvv^\top})$, where $β> 0$ and $v\in \mathbb{R}^d$ is a $k$-sparse unit vector, and we wish to recover $v$ (up to sign). We show that if $n \ge Ω(d)$, then for every $t \ll k$ there exists an algorithm running in time $n\cdot d^{O(t)}$ that so… ▽ More In the Wishart model for sparse PCA we are given $n$ samples $Y_1,\ldots, Y_n$ drawn independently from a $d$-dimensional Gaussian distribution $N({0, Id + βvv^\top})$, where $β> 0$ and $v\in \mathbb{R}^d$ is a $k$-sparse unit vector, and we wish to recover $v$ (up to sign). We show that if $n \ge Ω(d)$, then for every $t \ll k$ there exists an algorithm running in time $n\cdot d^{O(t)}$ that solves this problem as long as \[ β\gtrsim \frac{k}{\sqrt{nt}}\sqrt{\ln({2 + td/k^2})}\,. \] Prior to this work, the best polynomial time algorithm in the regime $k\approx \sqrt{d}$, called \emph{Covariance Thresholding} (proposed in [KNV15a] and analyzed in [DM14]), required $β\gtrsim \frac{k}{\sqrt{n}}\sqrt{\ln({2 + d/k^2})}$. For large enough constant $t$ our algorithm runs in polynomial time and has better guarantees than Covariance Thresholding. Previously known algorithms with such guarantees required quasi-polynomial time $d^{O(\log d)}$. In addition, we show that our techniques work with sparse PCA with adversarial perturbations studied in [dKNS20]. This model generalizes not only sparse PCA, but also other problems studied in prior works, including the sparse planted vector problem. As a consequence, we provide polynomial time algorithms for the sparse planted vector problem that have better guarantees than the state of the art in some regimes. Our approach also works with the Wigner model for sparse PCA. Moreover, we show that it is possible to combine our techniques with recent results on sparse PCA with symmetric heavy-tailed noise [dNNS22]. In particular, in the regime $k \approx \sqrt{d}$ we get the first polynomial time algorithm that works with symmetric heavy-tailed noise, while the algorithm from [dNNS22]. requires quasi-polynomial time in these settings. △ Less

Submitted 8 November, 2023; v1 submitted 20 February, 2023; originally announced February 2023.

Comments: COLT 2023

arXiv:2211.07327 [pdf, ps, other]

Higher degree sum-of-squares relaxations robust against oblivious outliers

Authors: Tommaso d'Orsi, Rajai Nasser, Gleb Novikov, David Steurer

Abstract: We consider estimation models of the form $Y=X^*+N$, where $X^*$ is some $m$-dimensional signal we wish to recover, and $N$ is symmetrically distributed noise that may be unbounded in all but a small $α$ fraction of the entries. We introduce a family of algorithms that under mild assumptions recover the signal $X^*$ in all estimation problems for which there exists a sum-of-squares algorithm that… ▽ More We consider estimation models of the form $Y=X^*+N$, where $X^*$ is some $m$-dimensional signal we wish to recover, and $N$ is symmetrically distributed noise that may be unbounded in all but a small $α$ fraction of the entries. We introduce a family of algorithms that under mild assumptions recover the signal $X^*$ in all estimation problems for which there exists a sum-of-squares algorithm that succeeds in recovering the signal $X^*$ when the noise $N$ is Gaussian. This essentially shows that it is enough to design a sum-of-squares algorithm for an estimation problem with Gaussian noise in order to get the algorithm that works with the symmetric noise model. Our framework extends far beyond previous results on symmetric noise models and is even robust to adversarial perturbations. As concrete examples, we investigate two problems for which no efficient algorithms were known to work for heavy-tailed noise: tensor PCA and sparse PCA. For the former, our algorithm recovers the principal component in polynomial time when the signal-to-noise ratio is at least $\tilde{O}(n^{p/4}/α)$, that matches (up to logarithmic factors) current best known algorithmic guarantees for Gaussian noise. For the latter, our algorithm runs in quasipolynomial time and matches the state-of-the-art guarantees for quasipolynomial time algorithms in the case of Gaussian noise. Using a reduction from the planted clique problem, we provide evidence that the quasipolynomial time is likely to be necessary for sparse PCA with symmetric noise. In our proofs we use bounds on the covering numbers of sets of pseudo-expectations, which we obtain by certifying in sum-of-squares upper bounds on the Gaussian complexities of sets of solutions. This approach for bounding the covering numbers of sets of pseudo-expectations may be interesting in its own right and may find other application in future works. △ Less

Submitted 14 November, 2022; originally announced November 2022.

Comments: To appear in SODA 2023

arXiv:2202.00441 [pdf, other]

Few-Bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction

Authors: Georgii Novikov, Daniel Bershatsky, Julia Gusak, Alex Shonenkov, Denis Dimitrov, Ivan Oseledets

Abstract: Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and such operation induces additional memory costs which -- as we show -- can be significantly reduced by quantization of… ▽ More Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and such operation induces additional memory costs which -- as we show -- can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per each element. We show that such approximation can be achieved by computing optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. The drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and the same convergence on several open benchmarks. △ Less

Submitted 2 February, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

Comments: Submitted

arXiv:2111.02966 [pdf, ps, other]

Consistent Estimation for PCA and Sparse Regression with Oblivious Outliers

Authors: Tommaso d'Orsi, Chih-Hung Liu, Rajai Nasser, Gleb Novikov, David Steurer, Stefan Tiegel

Abstract: We develop machinery to design efficiently computable and consistent estimators, achieving estimation error approaching zero as the number of observations grows, when facing an oblivious adversary that may corrupt responses in all but an $α$ fraction of the samples. As concrete examples, we investigate two problems: sparse regression and principal component analysis (PCA). For sparse regression, w… ▽ More We develop machinery to design efficiently computable and consistent estimators, achieving estimation error approaching zero as the number of observations grows, when facing an oblivious adversary that may corrupt responses in all but an $α$ fraction of the samples. As concrete examples, we investigate two problems: sparse regression and principal component analysis (PCA). For sparse regression, we achieve consistency for optimal sample size $n\gtrsim (k\log d)/α^2$ and optimal error rate $O(\sqrt{(k\log d)/(n\cdot α^2)})$ where $n$ is the number of observations, $d$ is the number of dimensions and $k$ is the sparsity of the parameter vector, allowing the fraction of inliers to be inverse-polynomial in the number of samples. Prior to this work, no estimator was known to be consistent when the fraction of inliers $α$ is $o(1/\log \log n)$, even for (non-spherical) Gaussian design matrices. Results holding under weak design assumptions and in the presence of such general noise have only been shown in dense setting (i.e., general linear regression) very recently by d'Orsi et al. [dNS21]. In the context of PCA, we attain optimal error guarantees under broad spikiness assumptions on the parameter matrix (usually used in matrix completion). Previous works could obtain non-trivial guarantees only under the assumptions that the measurement noise corresponding to the inliers is polynomially small in $n$ (e.g., Gaussian with variance $1/n^2$). To devise our estimators, we equip the Huber loss with non-smooth regularizers such as the $\ell_1$ norm or the nuclear norm, and extend d'Orsi et al.'s approach [dNS21] in a novel way to analyze the loss function. Our machinery appears to be easily applicable to a wide range of estimation problems. △ Less

Submitted 4 November, 2021; originally announced November 2021.

Comments: To appear in NeurIPS 2021

arXiv:2108.00089 [pdf, other]

Tensor-Train Density Estimation

Authors: Georgii S. Novikov, Maxim E. Panov, Ivan V. Oseledets

Abstract: Estimation of probability density function from samples is one of the central problems in statistics and machine learning. Modern neural network-based models can learn high dimensional distributions but have problems with hyperparameter selection and are often prone to instabilities during training and inference. We propose a new efficient tensor train-based model for density estimation (TTDE). Su… ▽ More Estimation of probability density function from samples is one of the central problems in statistics and machine learning. Modern neural network-based models can learn high dimensional distributions but have problems with hyperparameter selection and are often prone to instabilities during training and inference. We propose a new efficient tensor train-based model for density estimation (TTDE). Such density parametrization allows exact sampling, calculation of cumulative and marginal density functions, and partition function. It also has very intuitive hyperparameters. We develop an efficient non-adversarial training procedure for TTDE based on the Riemannian optimization. Experimental results demonstrate the competitive performance of the proposed method in density estimation and sampling tasks, while TTDE significantly outperforms competitors in training speed. △ Less

Submitted 25 February, 2022; v1 submitted 30 July, 2021; originally announced August 2021.

Comments: Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021)

ACM Class: G.3

arXiv:2011.06585 [pdf, ps, other]

Sparse PCA: Algorithms, Adversarial Perturbations and Certificates

Authors: Tommaso d'Orsi, Pravesh K. Kothari, Gleb Novikov, David Steurer

Abstract: We study efficient algorithms for Sparse PCA in standard statistical models (spiked covariance in its Wishart form). Our goal is to achieve optimal recovery guarantees while being resilient to small perturbations. Despite a long history of prior works, including explicit studies of perturbation resilience, the best known algorithmic guarantees for Sparse PCA are fragile and break down under small… ▽ More We study efficient algorithms for Sparse PCA in standard statistical models (spiked covariance in its Wishart form). Our goal is to achieve optimal recovery guarantees while being resilient to small perturbations. Despite a long history of prior works, including explicit studies of perturbation resilience, the best known algorithmic guarantees for Sparse PCA are fragile and break down under small adversarial perturbations. We observe a basic connection between perturbation resilience and \emph{certifying algorithms} that are based on certificates of upper bounds on sparse eigenvalues of random matrices. In contrast to other techniques, such certifying algorithms, including the brute-force maximum likelihood estimator, are automatically robust against small adversarial perturbation. We use this connection to obtain the first polynomial-time algorithms for this problem that are resilient against additive adversarial perturbations by obtaining new efficient certificates for upper bounds on sparse eigenvalues of random matrices. Our algorithms are based either on basic semidefinite programming or on its low-degree sum-of-squares strengthening depending on the parameter regimes. Their guarantees either match or approach the best known guarantees of \emph{fragile} algorithms in terms of sparsity of the unknown vector, number of samples and the ambient dimension. To complement our algorithmic results, we prove rigorous lower bounds matching the gap between fragile and robust polynomial-time algorithms in a natural computational model based on low-degree polynomials (closely related to the pseudo-calibration technique for sum-of-squares lower bounds) that is known to capture the best known guarantees for related statistical estimation problems. The combination of these results provides formal evidence of an inherent price to pay to achieve robustness. △ Less

Submitted 12 November, 2020; originally announced November 2020.

arXiv:2009.14774 [pdf, ps, other]

Consistent regression when oblivious outliers overwhelm

Authors: Tommaso d'Orsi, Gleb Novikov, David Steurer

Abstract: We consider a robust linear regression model $y=Xβ^* + η$, where an adversary oblivious to the design $X\in \mathbb{R}^{n\times d}$ may choose $η$ to corrupt all but an $α$ fraction of the observations $y$ in an arbitrary way. Prior to our work, even for Gaussian $X$, no estimator for $β^*$ was known to be consistent in this model except for quadratic sample size $n \gtrsim (d/α)^2$ or for logarit… ▽ More We consider a robust linear regression model $y=Xβ^* + η$, where an adversary oblivious to the design $X\in \mathbb{R}^{n\times d}$ may choose $η$ to corrupt all but an $α$ fraction of the observations $y$ in an arbitrary way. Prior to our work, even for Gaussian $X$, no estimator for $β^*$ was known to be consistent in this model except for quadratic sample size $n \gtrsim (d/α)^2$ or for logarithmic inlier fraction $α\ge 1/\log n$. We show that consistent estimation is possible with nearly linear sample size and inverse-polynomial inlier fraction. Concretely, we show that the Huber loss estimator is consistent for every sample size $n= ω(d/α^2)$ and achieves an error rate of $O(d/α^2n)^{1/2}$. Both bounds are optimal (up to constant factors). Our results extend to designs far beyond the Gaussian case and only require the column span of $X$ to not contain approximately sparse vectors). (similar to the kind of assumption commonly made about the kernel space for compressed sensing). We provide two technically similar proofs. One proof is phrased in terms of strong convexity, extending work of [Tsakonas et al.'14], and particularly short. The other proof highlights a connection between the Huber loss estimator and high-dimensional median computations. In the special case of Gaussian designs, this connection leads us to a strikingly simple algorithm based on computing coordinate-wise medians that achieves optimal guarantees in nearly-linear time, and that can exploit sparsity of $β^*$. The model studied here also captures heavy-tailed noise distributions that may not even have a first moment. △ Less

Submitted 25 May, 2021; v1 submitted 30 September, 2020; originally announced September 2020.

Comments: To appear in ICML 2021

arXiv:1803.00397 [pdf, other]

Satellite imagery analysis for operational damage assessment in Emergency situations

Authors: Alexey Trekin, German Novikov, Georgy Potapov, Vladimir Ignatiev, Evgeny Burnaev

Abstract: When major disaster occurs the questions are raised how to estimate the damage in time to support the decision making process and relief efforts by local authorities or humanitarian teams. In this paper we consider the use of Machine Learning and Computer Vision on remote sensing imagery to improve time efficiency of assessment of damaged buildings in disaster affected area. We propose a general w… ▽ More When major disaster occurs the questions are raised how to estimate the damage in time to support the decision making process and relief efforts by local authorities or humanitarian teams. In this paper we consider the use of Machine Learning and Computer Vision on remote sensing imagery to improve time efficiency of assessment of damaged buildings in disaster affected area. We propose a general workflow that can be useful in various disaster management applications, and demonstrate the use of the proposed workflow for the assessment of the damage caused by the wildfires in California in 2017. △ Less

Submitted 19 February, 2018; originally announced March 2018.

Comments: 12 pages, 10 figures

arXiv:1609.08209 [pdf, other]

Automatic Construction of a Recurrent Neural Network based Classifier for Vehicle Passage Detection

Authors: Evgeny Burnaev, Ivan Koptelov, German Novikov, Timur Khanipov

Abstract: Recurrent Neural Networks (RNNs) are extensively used for time-series modeling and prediction. We propose an approach for automatic construction of a binary classifier based on Long Short-Term Memory RNNs (LSTM-RNNs) for detection of a vehicle passage through a checkpoint. As an input to the classifier we use multidimensional signals of various sensors that are installed on the checkpoint. Obtaine… ▽ More Recurrent Neural Networks (RNNs) are extensively used for time-series modeling and prediction. We propose an approach for automatic construction of a binary classifier based on Long Short-Term Memory RNNs (LSTM-RNNs) for detection of a vehicle passage through a checkpoint. As an input to the classifier we use multidimensional signals of various sensors that are installed on the checkpoint. Obtained results demonstrate that the previous approach to handcrafting a classifier, consisting of a set of deterministic rules, can be successfully replaced by an automatic RNN training on an appropriately labelled data. △ Less

Submitted 26 September, 2016; originally announced September 2016.

Comments: 6 pages, 2 figures, 5 tables

Showing 1–12 of 12 results for author: Novikov, G