Search | arXiv e-print repository

Surrogate Gap Minimization Improves Sharpness-Aware Training

Authors: Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha Dvornek, Sekhar Tatikonda, James Duncan, Ting Liu

Abstract: The recently proposed Sharpness-Aware Minimization (SAM) improves generalization by minimizing a perturbed loss defined as the maximum loss within a neighborhood in the parameter space. However, we show that both sharp and flat minima can have a low perturbed loss, implying that SAM does not always prefer flat minima. Instead, we define a surrogate gap, a measure equivalent to the dominant eigenvalue of the Hessian at a local minimum when the radius of the neighborhood (to derive the perturbed loss) is small. The surrogate gap is easy to compute and feasible for direct minimization during training. Based on the above observations, we propose Surrogate Gap Guided Sharpness-Aware Minimization (GSAM), a novel improvement over SAM with negligible computation overhead. Conceptually, GSAM consists of two steps: 1) a gradient descent like SAM to minimize the perturbed loss, and 2) an ascent step in the orthogonal direction (after gradient decomposition) to minimize the surrogate gap and yet not affect the perturbed loss. GSAM seeks a region with both small loss (by step 1) and low sharpness (by step 2), giving rise to a model with high generalization capabilities. Theoretically, we show the convergence of GSAM and provably better generalization than SAM. Empirically, GSAM consistently improves generalization (e.g., +3.2% over SAM and +5.4% over AdamW on ImageNet top-1 accuracy for ViT-B/32). Code is released at https://sites.google.com/view/gsam-iclr22/home.

Submitted 19 March, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

Comments: Paper accepted by ICLR22, https://openreview.net/forum?id=edONMAnhLu-
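The two-step update described in the GSAM abstract can be illustrated with a small sketch. It assumes that the gradient being decomposed is the gradient of the original loss, split into components parallel and orthogonal to the gradient of the perturbed loss; the abstract does not spell this out, so treat that choice, the function name `gsam_direction`, and the weight `alpha` as assumptions made for illustration only.

```python
def gsam_direction(grad_loss, grad_perturbed, alpha=0.1):
    """Combine two gradients into a GSAM-style descent direction.

    grad_loss:      gradient of the original loss (list of floats)
    grad_perturbed: gradient of the perturbed loss (list of floats)
    alpha:          hypothetical weight of the orthogonal ascent step
    """
    dot = sum(g * p for g, p in zip(grad_loss, grad_perturbed))
    norm_sq = sum(p * p for p in grad_perturbed)
    # Projection coefficient of grad_loss onto grad_perturbed.
    coef = dot / norm_sq if norm_sq > 0.0 else 0.0
    # Component of grad_loss orthogonal to grad_perturbed.
    orthogonal = [g - coef * p for g, p in zip(grad_loss, grad_perturbed)]
    # Descend along grad_perturbed (step 1) and ascend along the orthogonal
    # component (step 2); the latter is orthogonal to grad_perturbed, so to
    # first order it leaves the perturbed loss unchanged.
    return [p - alpha * o for p, o in zip(grad_perturbed, orthogonal)]


# A parameter update would then be: w = w - learning_rate * direction.
direction = gsam_direction([1.0, 1.0], [2.0, 0.0], alpha=0.5)
```

In this toy call the orthogonal component of the loss gradient is `[0.0, 1.0]`, so the returned direction keeps the full perturbed-loss gradient and subtracts half of that orthogonal component.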

arXiv:2110.05454 [pdf, other]

Momentum Centering and Asynchronous Update for Adaptive Gradient Methods

Authors: Juntang Zhuang, Yifan Ding, Tommy Tang, Nicha Dvornek, Sekhar Tatikonda, James S. Duncan

Abstract: We propose ACProp (Asynchronous-centering-Prop), an adaptive optimizer which combines centering of second momentum and asynchronous update (e.g. for $t$-th update, denominator uses information up to step $t-1$, while numerator uses gradient at $t$-th step). ACProp has both strong theoretical properties and empirical performance. With the example by Reddi et al. (2018), we show that asynchronous optimizers (e.g. AdaShift, ACProp) have weaker convergence condition than synchronous optimizers (e.g. Adam, RMSProp, AdaBelief); within asynchronous optimizers, we show that centering of second momentum further weakens the convergence condition. We demonstrate that ACProp has a convergence rate of $O(\frac{1}{\sqrt{T}})$ for the stochastic non-convex case, which matches the oracle rate and outperforms the $O(\frac{\log T}{\sqrt{T}})$ rate of RMSProp and Adam. We validate ACProp in extensive empirical studies: ACProp outperforms both SGD and other adaptive optimizers in image classification with CNN, and outperforms well-tuned adaptive optimizers in the training of various GAN models, reinforcement learning and transformers. To sum up, ACProp has good theoretical properties including weak convergence condition and optimal convergence rate, and strong empirical performance including good generalization like SGD and training stability like Adam. We provide the implementation at https://github.com/juntang-zhuang/ACProp-Optimizer.

Submitted 1 December, 2021; v1 submitted 11 October, 2021; originally announced October 2021.

arXiv:2106.07696 [pdf, other]

Face Age Progression With Attribute Manipulation

Authors: Sinzith Tatikonda, Athira Nambiar, Anurag Mittal

Abstract: The face is one of the predominant means of person recognition. In the process of ageing, the human face is affected by many factors such as time, attributes, weather and other subject-specific variations. The impact of these factors has not been well studied in the literature on face aging. In this paper, we propose a novel holistic model in this regard, viz. "Face Age progression With Attribute Manipulation (FAWAM)", i.e. generating face images at different ages while simultaneously varying attributes and other subject-specific characteristics. We address the task in a bottom-up manner, as two submodules, i.e. face age progression and face attribute manipulation. For face aging, we use an attribute-conscious face aging model with a pyramidal generative adversarial network that can model age-specific facial changes while maintaining intrinsic subject-specific characteristics. For facial attribute manipulation, the age-processed facial image is manipulated with desired attributes while preserving other details unchanged, leveraging an attribute generative adversarial network architecture. We conduct extensive analysis on standard large-scale datasets, and our model achieves significant performance both quantitatively and qualitatively.

Submitted 14 June, 2021; originally announced June 2021.

arXiv:2102.11013 [pdf, other]

Multiple-shooting adjoint method for whole-brain dynamic causal modeling

Authors: Juntang Zhuang, Nicha Dvornek, Sekhar Tatikonda, Xenophon Papademetris, Pamela Ventola, James Duncan

Abstract: Dynamic causal modeling (DCM) is a Bayesian framework to infer directed connections between compartments, and has been used to describe the interactions between underlying neural populations based on functional neuroimaging data. DCM is typically analyzed with the expectation-maximization (EM) algorithm. However, because the inversion of a large-scale continuous system is difficult when noisy observations are present, DCM by EM is typically limited to a small number of compartments ($<10$). Another drawback with the current method is its complexity; when the forward model changes, the posterior mean changes, and we need to re-derive the algorithm for optimization. In this project, we propose the Multiple-Shooting Adjoint (MSA) method to address these limitations. MSA uses the multiple-shooting method for parameter estimation in ordinary differential equations (ODEs) under noisy observations, and is suitable for large-scale systems such as whole-brain analysis in functional MRI (fMRI). Furthermore, MSA uses the adjoint method for accurate gradient estimation in the ODE; since the adjoint method is generic, MSA is a generic method for both linear and non-linear systems, and does not require re-derivation of the algorithm as in EM.
We validate MSA in extensive experiments: 1) in toy examples with both linear and non-linear models, we show that MSA achieves better accuracy in parameter value estimation than EM; furthermore, MSA can be successfully applied to large systems with up to 100 compartments; and 2) using real fMRI data, we apply MSA to the estimation of the whole-brain effective connectome and show improved classification of autism spectrum disorder (ASD) vs. control compared to using the functional connectome. The package is provided at https://jzkay12.github.io/TorchDiffEqPack

Submitted 14 February, 2021; originally announced February 2021.

Comments: 27th International Conference on Information Processing in Medical Imaging

arXiv:2102.04668 [pdf, other]

MALI: A memory efficient and reverse accurate integrator for Neural ODEs

Authors: Juntang Zhuang, Nicha C. Dvornek, Sekhar Tatikonda, James S. Duncan

Abstract: Neural ordinary differential equations (Neural ODEs) are a new family of deep-learning models with continuous depth. However, the numerical estimation of the gradient in the continuous case is not well solved: existing implementations of the adjoint method suffer from inaccuracy in reverse-time trajectory, while the naive method and the adaptive checkpoint adjoint method (ACA) have a memory cost that grows with integration time. In this project, based on the asynchronous leapfrog (ALF) solver, we propose the Memory-efficient ALF Integrator (MALI), which has a constant memory cost w.r.t. the number of solver steps in integration similar to the adjoint method, and guarantees accuracy in reverse-time trajectory (hence accuracy in gradient estimation). We validate MALI in various tasks: on image recognition tasks, to our knowledge, MALI is the first to enable feasible training of a Neural ODE on ImageNet and outperform a well-tuned ResNet, while existing methods fail due to either heavy memory burden or inaccuracy; for time series modeling, MALI significantly outperforms the adjoint method; and for continuous generative models, MALI achieves new state-of-the-art performance. We provide a pypi package at https://jzkay12.github.io/TorchDiffEqPack/

Submitted 3 March, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

Comments: https://openreview.net/forum?id=blfSjHeFM_e

Journal ref: International Conference on Learning Representations, ICLR 2021

arXiv:2010.07468 [pdf, other]

AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

Authors: Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar Tatikonda, Nicha Dvornek, Xenophon Papademetris, James S. Duncan

Abstract: Most popular optimizers for deep learning can be broadly categorized as adaptive methods (e.g. Adam) and accelerated schemes (e.g. stochastic gradient descent (SGD) with momentum). For many models such as convolutional neural networks (CNNs), adaptive methods typically converge faster but generalize worse compared to SGD; for complex settings such as generative adversarial networks (GANs), adaptive methods are typically the default because of their stability. We propose AdaBelief to simultaneously achieve three goals: fast convergence as in adaptive methods, good generalization as in SGD, and training stability. The intuition for AdaBelief is to adapt the stepsize according to the "belief" in the current gradient direction. Viewing the exponential moving average (EMA) of the noisy gradient as the prediction of the gradient at the next time step, if the observed gradient greatly deviates from the prediction, we distrust the current observation and take a small step; if the observed gradient is close to the prediction, we trust it and take a large step. We validate AdaBelief in extensive experiments, showing that it outperforms other methods with fast convergence and high accuracy on image classification and language modeling. Specifically, on ImageNet, AdaBelief achieves comparable accuracy to SGD. Furthermore, in the training of a GAN on Cifar10, AdaBelief demonstrates high stability and improves the quality of generated samples compared to a well-tuned Adam optimizer. Code is available at https://github.com/juntang-zhuang/Adabelief-Optimizer

Submitted 20 December, 2020; v1 submitted 14 October, 2020; originally announced October 2020.

Journal ref: NeurIPS 2020
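The "belief" intuition in the AdaBelief abstract (EMA of the gradient as a prediction, step size shrinking when the observed gradient deviates from it) can be sketched for a single scalar parameter as follows. This is a minimal illustration rather than the released implementation: the function name `adabelief_step`, the state dictionary, the default hyperparameter values, and the exact placement of `eps` are assumptions.

```python
import math


def adabelief_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief-style update of a scalar parameter (illustrative sketch)."""
    state["t"] += 1
    t = state["t"]
    # EMA of gradients: serves as the prediction of the next gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    # EMA of the squared deviation between observed gradient and prediction.
    # A large deviation means low "belief" and therefore a smaller step.
    state["s"] = beta2 * state["s"] + (1 - beta2) * (grad - state["m"]) ** 2 + eps
    # Bias correction for the zero-initialised averages.
    m_hat = state["m"] / (1 - beta1 ** t)
    s_hat = state["s"] / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(s_hat) + eps)


# Usage: one step on f(theta) = theta^2, whose gradient is 2 * theta.
state = {"t": 0, "m": 0.0, "s": 0.0}
theta = adabelief_step(5.0, 2.0 * 5.0, state)
```

The only structural difference from an Adam-style step in this sketch is the second accumulator, which tracks `(grad - m) ** 2` instead of `grad ** 2`.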

arXiv:2006.02493 [pdf]

Adaptive Checkpoint Adjoint Method for Gradient Estimation in Neural ODE

Authors: Juntang Zhuang, Nicha Dvornek, Xiaoxiao Li, Sekhar Tatikonda, Xenophon Papademetris, James Duncan

Abstract: Neural ordinary differential equations (NODEs) have recently attracted increasing attention; however, their empirical performance on benchmark tasks (e.g. image classification) is significantly inferior to that of discrete-layer models. We demonstrate that an explanation for their poorer performance is the inaccuracy of existing gradient estimation methods: the adjoint method has numerical errors in reverse-mode integration; the naive method directly back-propagates through ODE solvers, but suffers from a redundantly deep computation graph when searching for the optimal stepsize. We propose the Adaptive Checkpoint Adjoint (ACA) method: in automatic differentiation, ACA applies a trajectory checkpoint strategy which records the forward-mode trajectory as the reverse-mode trajectory to guarantee accuracy; ACA deletes redundant components for shallow computation graphs; and ACA supports adaptive solvers. On image classification tasks, compared with the adjoint and naive method, ACA achieves half the error rate in half the training time; NODE trained with ACA outperforms ResNet in both accuracy and test-retest reliability. On time-series modeling, ACA outperforms competing methods. Finally, in an example of the three-body problem, we show NODE with ACA can incorporate physical knowledge to achieve better accuracy. We provide the PyTorch implementation of ACA: https://github.com/juntang-zhuang/torch-ACA.

Submitted 3 June, 2020; originally announced June 2020.

Journal ref: https://proceedings.icml.cc/static/paper_files/icml/2020/917-Paper.pdf

arXiv:1911.00771 [pdf, other]

doi 10.1561/0100000092

Sparse Regression Codes

Authors: Ramji Venkataramanan, Sekhar Tatikonda, Andrew Barron

Abstract: Developing computationally-efficient codes that approach the Shannon-theoretic limits for communication and compression has long been one of the major goals of information and coding theory. There have been significant advances towards this goal in the last couple of decades, with the emergence of turbo codes, sparse-graph codes, and polar codes. These codes are designed primarily for discrete-alphabet channels and sources. For Gaussian channels and sources, where the alphabet is inherently continuous, Sparse Superposition Codes or Sparse Regression Codes (SPARCs) are a promising class of codes for achieving the Shannon limits. This survey provides a unified and comprehensive overview of sparse regression codes, covering theory, algorithms, and practical implementation aspects. The first part of the monograph focuses on SPARCs for AWGN channel coding, and the second part on SPARCs for lossy compression (with squared error distortion criterion). In the third part, SPARCs are used to construct codes for Gaussian multi-terminal channel and source coding models such as broadcast channels, multiple-access channels, and source and channel coding with side information. The survey concludes with a discussion of open problems and directions for future work.

Submitted 2 November, 2019; originally announced November 2019.

Comments: Published in Foundations and Trends in Communications and Information Theory, 2019

Journal ref: Foundations and Trends in Communications and Information Theory, vol. 15, no. 1-2, pp. 1-195, 2019

arXiv:1808.09889 [pdf, other]

Zero-shot Transfer Learning for Semantic Parsing

Authors: Javid Dadashkarimi, Alexander Fabbri, Sekhar Tatikonda, Dragomir R. Radev

Abstract: While neural networks have shown impressive performance on large datasets, applying these models to tasks where little data is available remains a challenging problem. In this paper we propose to use feature transfer in a zero-shot experimental setting on the task of semantic parsing. We first introduce a new method for learning the shared space between multiple domains based on the prediction of the domain label for each example. Our experiments support the superiority of this method in a zero-shot experimental setting in terms of accuracy metrics compared to state-of-the-art techniques. In the second part of this paper we study the impact of individual domains and examples on semantic parsing performance. We use influence functions to this aim and investigate the sensitivity of domain-label classification loss on each example. Our findings reveal that cross-domain adversarial attacks identify useful examples for training even from the domains the least similar to the target domain. Augmenting our training data with these influential examples further boosts our accuracy at both the token and the sequence level.

Submitted 27 August, 2018; originally announced August 2018.

arXiv:1807.07333 [pdf, other]

Sequence to Logic with Copy and Cache

Authors: Javid Dadashkarimi, Sekhar Tatikonda

Abstract: Generating logical form equivalents of human language is a fresh way to employ neural architectures where long short-term memory effectively captures dependencies in both encoder and decoder units. The logical form of the sequence usually preserves information from the natural language side in the form of similar tokens, and recently a copying mechanism has been proposed which increases the probability of outputting tokens from the source input through decoding. In this paper we propose a caching mechanism as a more general form of the copying mechanism which also weighs all the words from the source vocabulary according to their relation to the current decoding context. Our results confirm that the proposed method achieves improvements in sequence/token-level accuracy on sequence to logical form tasks. Further experiments on cross-domain adversarial attacks show substantial improvements when using the most influential examples of other domains for training.

Submitted 19 July, 2018; originally announced July 2018.

arXiv:1711.09853 [pdf, ps, other]

The Time-Invariant Multidimensional Gaussian Sequential Rate-Distortion Problem Revisited

Authors: Photios A. Stavrou, Takashi Tanaka, Sekhar Tatikonda

Abstract: We revisit the sequential rate-distortion (SRD) trade-off problem for vector-valued Gauss-Markov sources with mean-squared error distortion constraints. We show via a counterexample that the dynamic reverse water-filling algorithm suggested by [1, eq. (15)] is not applicable to this problem, and consequently the closed form expression of the asymptotic SRD function derived in [1, eq. (17)] is not correct in general. Nevertheless, we show that the multidimensional Gaussian SRD function is semidefinite representable and thus it is readily computable.

Submitted 27 November, 2017; originally announced November 2017.

Comments: 7 pages, 2 figures

MSC Class: 90C22; 94A15

arXiv:1611.07138 [pdf, other]

A New Approach to Laplacian Solvers and Flow Problems

Authors: Patrick Rebeschini, Sekhar Tatikonda

Abstract: This paper investigates the behavior of the Min-Sum message passing scheme to solve systems of linear equations in the Laplacian matrices of graphs and to compute electric flows. Voltage and flow problems involve the minimization of quadratic functions and are fundamental primitives that arise in several domains. Algorithms that have been proposed are typically centralized and involve multiple graph-theoretic constructions or sampling mechanisms that make them difficult to implement and analyze. On the other hand, message passing routines are distributed, simple, and easy to implement. In this paper we establish a framework to analyze Min-Sum to solve voltage and flow problems. We characterize the error committed by the algorithm on general weighted graphs in terms of hitting times of random walks defined on the computation trees that support the operations of the algorithms with time. For $d$-regular graphs with equal weights, we show that the convergence of the algorithms is controlled by the total variation distance between the distributions of non-backtracking random walks defined on the original graph that start from neighboring nodes. The framework that we introduce extends the analysis of Min-Sum to settings where the contraction arguments previously considered in the literature (based on the assumption of walk summability or scaled diagonal dominance) cannot be used, possibly in the presence of constraints.

Submitted 7 March, 2019; v1 submitted 21 November, 2016; originally announced November 2016.

arXiv:1401.5272 [pdf, other]

doi 10.1109/TIT.2017.2716360

The Rate-Distortion Function and Excess-Distortion Exponent of Sparse Regression Codes with Optimal Encoding

Authors: Ramji Venkataramanan, Sekhar Tatikonda

Abstract: This paper studies the performance of sparse regression codes for lossy compression with the squared-error distortion criterion. In a sparse regression code, codewords are linear combinations of subsets of columns of a design matrix. It is shown that with minimum-distance encoding, sparse regression codes achieve the Shannon rate-distortion function for i.i.d. Gaussian sources $R^*(D)$ as well as the optimal excess-distortion exponent. This completes a previous result which showed that $R^*(D)$ and the optimal exponent were achievable for distortions below a certain threshold. The proof of the rate-distortion result is based on the second moment method, a popular technique to show that a non-negative random variable $X$ is strictly positive with high probability. In our context, $X$ is the number of codewords within target distortion $D$ of the source sequence. We first identify the reason behind the failure of the standard second moment method for certain distortions, and illustrate the different failure modes via a stylized example. We then use a refinement of the second moment method to show that $R^*(D)$ is achievable for all distortion values. Finally, the refinement technique is applied to Suen's correlation inequality to prove the achievability of the optimal Gaussian excess-distortion exponent.

Submitted 19 June, 2017; v1 submitted 21 January, 2014; originally announced January 2014.

Comments: 16 pages. IEEE Transactions on Information Theory

Journal ref: IEEE Transactions on Information Theory, Vol. 63, no. 8, pp. 5228-5243 (August 2017)
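For readers unfamiliar with the second moment method mentioned in the abstract above, its standard textbook form is the following bound for a non-negative random variable $X$ with finite second moment. This is the generic inequality, stated here for orientation; the paper's refinement is not reproduced.

```latex
% Cauchy-Schwarz applied to X = X \cdot \mathbf{1}\{X > 0\}:
\mathbb{E}[X]
  = \mathbb{E}\bigl[X \,\mathbf{1}\{X > 0\}\bigr]
  \le \sqrt{\mathbb{E}[X^2]\,\Pr(X > 0)}
\quad\Longrightarrow\quad
\Pr(X > 0) \;\ge\; \frac{(\mathbb{E}[X])^2}{\mathbb{E}[X^2]} .
```

With $X$ the number of codewords within distortion $D$ of the source sequence, the bound is useful when $\mathbb{E}[X^2]$ is close to $(\mathbb{E}[X])^2$; when the second moment is much larger, the right-hand side is far from 1 and the bound says little, which is the kind of failure the abstract refers to.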

arXiv:1301.0605 [pdf]

Loopy Belief Propagation and Gibbs Measures

Authors: Sekhar Tatikonda, Michael I. Jordan

Abstract: We address the question of convergence in the loopy belief propagation (LBP) algorithm. Specifically, we relate convergence of LBP to the existence of a weak limit for a sequence of Gibbs measures defined on the LBP's associated computation tree. Using tools from the theory of Gibbs measures we develop easily testable sufficient conditions for convergence. The failure of convergence of LBP implies the existence of multiple phases for the associated Gibbs specification. These results give new insight into the mechanics of the algorithm.

Submitted 12 December, 2012; originally announced January 2013.

Comments: Appears in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI2002)

Report number: UAI-P-2002-PG-493-500

arXiv:1212.2125 [pdf, other]

Sparse Regression Codes for Multi-terminal Source and Channel Coding

Authors: Ramji Venkataramanan, Sekhar Tatikonda

Abstract: We study a new class of codes for Gaussian multi-terminal source and channel coding. These codes are designed using the statistical framework of high-dimensional linear regression and are called Sparse Superposition or Sparse Regression codes. Codewords are linear combinations of subsets of columns of a design matrix. These codes were recently introduced by Barron and Joseph and shown to achieve the channel capacity of AWGN channels with computationally feasible decoding. They have also recently been shown to achieve the optimal rate-distortion function for Gaussian sources. In this paper, we demonstrate how to implement random binning and superposition coding using sparse regression codes. In particular, with minimum-distance encoding/decoding it is shown that sparse regression codes attain the optimal information-theoretic limits for a variety of multi-terminal source and channel coding problems.

Submitted 10 December, 2012; originally announced December 2012.

Comments: 9 pages, appeared in the Proceedings of the 50th Annual Allerton Conference on Communication, Control, and Computing - 2012

arXiv:1212.1707 [pdf, other]

doi 10.1109/TIT.2014.2314676

Lossy Compression via Sparse Linear Regression: Computationally Efficient Encoding and Decoding

Authors: Ramji Venkataramanan, Tuhin Sarkar, Sekhar Tatikonda

Abstract: We propose computationally efficient encoders and decoders for lossy compression using a Sparse Regression Code. The codebook is defined by a design matrix and codewords are structured linear combinations of columns of this matrix. The proposed encoding algorithm sequentially chooses columns of the design matrix to successively approximate the source sequence. It is shown to achieve the optimal distortion-rate function for i.i.d. Gaussian sources under the squared-error distortion criterion. For a given rate, the parameters of the design matrix can be varied to trade off distortion performance with encoding complexity. An example of such a trade-off as a function of the block length $n$ is the following. With computational resource (space or time) per source sample of $O((n/\log n)^2)$, for a fixed distortion-level above the Gaussian distortion-rate function, the probability of excess distortion decays exponentially in $n$. The Sparse Regression Code is robust in the following sense: for any ergodic source, the proposed encoder achieves the optimal distortion-rate function of an i.i.d. Gaussian source with the same variance. Simulations show that the encoder has good empirical performance, especially at low and moderate rates.

Submitted 28 March, 2014; v1 submitted 7 December, 2012; originally announced December 2012.

Comments: 14 pages, to appear in IEEE Transactions on Information Theory

Journal ref: IEEE Transactions on Information Theory, vol. 60, no. 6, pp. 3265-3278, June 2014

arXiv:1212.0171 [pdf, ps, other]

Message-Passing Algorithms for Quadratic Minimization

Authors: Nicholas Ruozzi, Sekhar Tatikonda

Abstract: Gaussian belief propagation (GaBP) is an iterative algorithm for computing the mean of a multivariate Gaussian distribution, or equivalently, the minimum of a multivariate positive definite quadratic function. Sufficient conditions, such as walk-summability, that guarantee the convergence and correctness of GaBP are known, but GaBP may fail to converge to the correct solution given an arbitrary positive definite quadratic function. As was observed in previous work, the GaBP algorithm fails to converge if the computation trees produced by the algorithm are not positive definite. In this work, we will show that the failure modes of the GaBP algorithm can be understood via graph covers, and we prove that a parameterized generalization of the min-sum algorithm can be used to ensure that the computation trees remain positive definite whenever the input matrix is positive definite. We demonstrate that the resulting algorithm is closely related to other iterative schemes for quadratic minimization such as the Gauss-Seidel and Jacobi algorithms. Finally, we observe, empirically, that there always exists a choice of parameters such that the above generalization of the GaBP algorithm converges.

Submitted 1 December, 2012; originally announced December 2012.

Journal ref: Journal of Machine Learning Research, 14(Aug):2287-2314, 2013
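
The abstract above relates its algorithm to the Jacobi iteration for quadratic minimization. As a point of reference only, here is a minimal sketch of plain Jacobi iteration (not GaBP and not the paper's parameterized min-sum algorithm) for minimizing f(x) = (1/2) x^T A x - b^T x, whose minimizer solves A x = b. The function name `jacobi_minimize`, the iteration count, and the 2x2 example matrix are illustrative assumptions; the example matrix is strictly diagonally dominant, a standard sufficient condition for Jacobi to converge.

```python
def jacobi_minimize(A, b, iterations=200):
    """Jacobi iteration for min_x 0.5 * x^T A x - b^T x, i.e. for solving A x = b.

    Hypothetical helper for illustration; A is a list of rows with a nonzero
    diagonal. Convergence is not guaranteed for every positive definite A.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        # Every coordinate is updated from the previous iterate (Jacobi),
        # not from already-updated coordinates (that would be Gauss-Seidel).
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
    return x


# Illustrative positive definite, diagonally dominant system.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_star = jacobi_minimize(A, b)
```

For this example the exact minimizer is (1/11, 7/11), which can be checked by substituting into A x = b.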

arXiv:1211.4521 [pdf, ps, other]

Hash in a Flash: Hash Tables for Solid State Devices

Authors: Tyler Clemons, S. M. Faisal, Shirish Tatikonda, Charu Aggarawl, Srinivasan Parthasarathy

Abstract: In recent years, information retrieval algorithms have taken center stage for extracting important data in ever larger datasets. Advances in hardware technology have led to the increasingly widespread use of flash storage devices. Such devices have clear benefits over traditional hard drives in terms of latency of access, bandwidth and random access capabilities, particularly when reading data. There are, however, some interesting trade-offs to consider when leveraging the advanced features of such devices. On a relative scale, writing to such devices can be expensive. This is because typical flash devices (NAND technology) are updated in blocks. A minor update to a given block requires the entire block to be erased, followed by a re-writing of the block. On the other hand, sequential writes can be two orders of magnitude faster than random writes. In addition, random writes degrade the life of the flash drive, since each block can support only a limited number of erasures. TF-IDF can be implemented using a counting hash table. In general, hash tables are a particularly challenging case for the flash drive because this data structure is inherently dependent upon the randomness of the hash function, as opposed to the spatial locality of the data. This makes it difficult to avoid the random writes incurred during the construction of the counting hash table for TF-IDF. In this paper, we will study the design landscape for the development of a hash table for flash storage devices. We demonstrate how to effectively design a hash table with two related hash functions, one of which exhibits a data placement property with respect to the other. Specifically, we focus on three designs based on this general philosophy and evaluate the trade-offs among them along the axes of query performance, insert and update times and I/O time through an implementation of the TF-IDF algorithm.

Submitted 19 November, 2012; originally announced November 2012.

Comments: 16 pages 10 figures

ACM Class: H.2.7; H.2.8; H.3.1; E.2

arXiv:1206.2491 [pdf, other]

doi 10.1109/JSAC.2014.140502

Rewritable storage channels with hidden state

Authors: Ramji Venkataramanan, Sekhar Tatikonda, Luis Lastras, Michele Franceschini

Abstract: Many storage channels admit reading and rewriting of the content at a given cost. We consider rewritable channels with a hidden state which models the unknown characteristics of the memory cell. In addition to mitigating the effect of the write noise, rewrites can help the write controller obtain a better estimate of the hidden state. The paper has two contributions. The first is a lower bound on the capacity of a general rewritable channel with hidden state. The lower bound is obtained using a coding scheme that combines Gelfand-Pinsker coding with superposition coding. The rewritable AWGN channel is discussed as an example. The second contribution is a simple coding scheme for a rewritable channel where the write noise and hidden state are both uniformly distributed. It is shown that this scheme is asymptotically optimal as the number of rewrites gets large.

Submitted 3 June, 2013; v1 submitted 12 June, 2012; originally announced June 2012.

Comments: 10 pages. Part of the paper appeared in the proceedings of the 2012 IEEE International Symposium on Information Theory

Journal ref: IEEE Journal on Selected Areas in Communications, vol. 32, no. 5, pp. 815-824, May 2014

arXiv:1202.0840 [pdf, other]

doi 10.1109/TIT.2014.2313085

Lossy Compression via Sparse Linear Regression: Performance under Minimum-distance Encoding

Authors: Ramji Venkataramanan, Antony Joseph, Sekhar Tatikonda

Abstract: We study a new class of codes for lossy compression with the squared-error distortion criterion, designed using the statistical framework of high-dimensional linear regression. Codewords are linear combinations of subsets of columns of a design matrix. Called a Sparse Superposition or Sparse Regression codebook, this structure is motivated by an analogous construction proposed recently by Barron and Joseph for communication over an AWGN channel. For i.i.d Gaussian sources and minimum-distance encoding, we show that such a code can attain the Shannon rate-distortion function with the optimal error exponent, for all distortions below a specified value. It is also shown that sparse regression codes are robust in the following sense: a codebook designed to compress an i.i.d Gaussian source of variance $σ^2$ with (squared-error) distortion $D$ can compress any ergodic source of variance less than $σ^2$ to within distortion $D$. Thus the sparse regression ensemble retains many of the good covering properties of the i.i.d random Gaussian ensemble, while having a compact representation in terms of a matrix whose size is a low-order polynomial in the block-length.

Submitted 18 December, 2015; v1 submitted 3 February, 2012; originally announced February 2012.

Comments: This version corrects a typo in the statement of Theorem 2 of the published paper

Journal ref: IEEE Transactions on Information Theory, vol. 60, no. 6, pp. 3254-3264, June 2014

arXiv:1107.3818 [pdf, other]

Conditioned Poisson distributions and the concentration of chromatic numbers

Authors: John Hartigan, David Pollard, Sekhar Tatikonda

Abstract: The paper provides a simpler method for proving a delicate inequality that was used by Achlioptas and Naor to establish asymptotic concentration for chromatic numbers of Erdos-Renyi random graphs. The simplifications come from two new ideas. The first involves a sharpened form of a piece of statistical folklore regarding goodness-of-fit tests for two-way tables of Poisson counts under linear conditioning constraints. The second idea takes the form of a new inequality that controls the extreme tails of the distribution of a quadratic form in independent Poisson random variables.

Submitted 19 July, 2011; originally announced July 2011.

Comments: Unpublished paper from June 2008

arXiv:1102.5112 [pdf, other]

doi 10.1109/TIT.2013.2278181

Achievable Rates for Channels with Deletions and Insertions

Authors: Ramji Venkataramanan, Sekhar Tatikonda, Kannan Ramchandran

Abstract: This paper considers a binary channel with deletions and insertions, where each input bit is transformed in one of the following ways: it is deleted with probability d, or an extra bit is added after it with probability i, or it is transmitted unmodified with probability 1-d-i. A computable lower bound on the capacity of this channel is derived. The transformation of the input sequence by the channel may be viewed in terms of runs as follows: some runs of the input sequence get shorter/longer, some runs get deleted, and some new runs are added. It is difficult for the decoder to synchronize the channel output sequence to the transmitted codeword mainly due to deleted runs and new inserted runs. The main idea is a mutual information decomposition in terms of the rate achieved by a sub-optimal decoder that determines the positions of the deleted and inserted runs in addition to decoding the transmitted codeword. The mutual information between the channel input and output sequences is expressed as the sum of the rate achieved by this decoder and the rate loss due to its sub-optimality. Obtaining computable lower bounds on each of these quantities yields a lower bound on the capacity. The bounds proposed in this paper provide the first characterization of achievable rates for channels with general insertions, and for channels with both deletions and insertions. For the special case of the deletion channel, the proposed bound improves on the previous best lower bound for deletion probabilities up to 0.3.

Submitted 19 July, 2013; v1 submitted 24 February, 2011; originally announced February 2011.

Comments: To appear in IEEE Transactions on Information Theory. For the deletion channel, the new capacity lower bound improves on the previous best bound for deletion probabilities up to 0.3

Journal ref: IEEE Transactions on Information Theory, vol. 59, no. 11, pp. 6990-7013, Nov. 2013
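
The channel model stated in the abstract above can be simulated in a few lines. The following is a minimal sketch, not the paper's code: the function name `deletion_insertion_channel` is hypothetical, and the abstract does not say how the value of an inserted bit is chosen, so drawing it uniformly at random is an assumption made here for illustration.

```python
import random


def deletion_insertion_channel(bits, d, i, rng):
    """Pass `bits` through a binary deletion/insertion channel.

    Each input bit is deleted with probability d, followed by one extra bit
    with probability i, or transmitted unmodified with probability 1 - d - i.
    ASSUMPTION: the inserted bit is uniform on {0, 1}; the abstract leaves
    the insertion rule unspecified.
    """
    if d < 0 or i < 0 or d + i > 1:
        raise ValueError("need d >= 0, i >= 0 and d + i <= 1")
    out = []
    for bit in bits:
        u = rng.random()  # uniform on [0, 1)
        if u < d:
            continue  # deletion: the bit is dropped
        out.append(bit)
        if u < d + i:
            out.append(rng.randint(0, 1))  # insertion after the kept bit
    return out


# Illustrative run with a seeded generator for reproducibility.
received = deletion_insertion_channel([1, 0, 1, 1, 0], d=0.1, i=0.1,
                                      rng=random.Random(0))
```

Because the output length is random, the receiver cannot align output positions with input positions, which is the synchronization difficulty the abstract describes.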

arXiv:1002.3239 [pdf, ps, other]

doi 10.1109/TIT.2013.2259576

Message-Passing Algorithms: Reparameterizations and Splittings

Authors: Nicholas Ruozzi, Sekhar Tatikonda

Abstract: The max-product algorithm, a local message-passing scheme that attempts to compute the most probable assignment (MAP) of a given probability distribution, has been successfully employed as a method of approximate inference for applications arising in coding theory, computer vision, and machine learning. However, the max-product algorithm is not guaranteed to converge, and if it does, is not guaranteed to recover the MAP assignment. Alternative convergent message-passing schemes have been proposed to overcome these difficulties. This work provides a systematic study of such message-passing algorithms that extends the known results by exhibiting new sufficient conditions for convergence to local and/or global optima, providing a combinatorial characterization of these optima based on graph covers, and describing a new convergent and correct message-passing algorithm whose derivation unifies many of the known convergent message-passing algorithms. While convergent and correct message-passing algorithms represent a step forward in the analysis of max-product style message-passing algorithms, the conditions needed to guarantee convergence to a global optimum can be too restrictive in both theory and practice. This limitation of convergent and correct message-passing schemes is characterized by graph covers and illustrated by example.

Submitted 1 December, 2012; v1 submitted 17 February, 2010; originally announced February 2010.

Comments: A complete rework and expansion of the previous versions

Journal ref: IEEE Transactions on Information Theory, vol. 59, no. 9, pp. 5860-5881, Sept. 2013

arXiv:0911.2023 [pdf, other]

Opportunistic capacity and error exponent regions for compound channel with feedback

Authors: Aditya Mahajan, Sekhar Tatikonda

Abstract: Variable length communication over a compound channel with feedback is considered. Traditionally, capacity of a compound channel without feedback is defined as the maximum rate that is determined before the start of communication such that communication is reliable. This traditional definition is pessimistic. In the presence of feedback, an opportunistic definition is given. Capacity is defined as the maximum rate that is determined at the end of communication such that communication is reliable. Thus, the transmission rate can adapt to the channel chosen by nature. Under this definition, feedback communication over a compound channel is conceptually similar to multi-terminal communication. Transmission rate is a vector rather than a scalar; channel capacity is a region rather than a scalar; error exponent is a region rather than a scalar. In this paper, variable length communication over a compound channel with feedback is formulated, its opportunistic capacity region is characterized, and lower bounds for its error exponent region are provided.

Submitted 29 June, 2011; v1 submitted 10 November, 2009; originally announced November 2009.

arXiv:0809.0158 [pdf, ps, other]

doi 10.1109/TIT.2011.2168901

Network Tomography Based on Additive Metrics

Authors: Jian Ni, Sekhar Tatikonda

Abstract: Inference of the network structure (e.g., routing topology) and dynamics (e.g., link performance) is an essential component in many network design and management tasks. In this paper we propose a new, general framework for analyzing and designing routing topology and link performance inference algorithms using ideas and tools from phylogenetic inference in evolutionary biology. The framework is applicable to a variety of measurement techniques. Based on the framework we introduce and develop several polynomial-time distance-based inference algorithms with provable performance. We provide sufficient conditions for the correctness of the algorithms. We show that the algorithms are consistent (return correct topology and link performance with an increasing sample size) and robust (can tolerate a certain level of measurement errors). In addition, we establish certain optimality properties of the algorithms (i.e., they achieve the optimal $l_\infty$-radius) and demonstrate their effectiveness via model simulation.

Submitted 31 August, 2008; originally announced September 2008.

Comments: 35 pages

Journal ref: IEEE Transactions on Information Theory, 57(12), December 2011

arXiv:0808.2089 [pdf, ps, other]

doi 10.1109/TIT.2015.2437380

Capacity-achieving Feedback Scheme for Gaussian Finite-State Markov Channels with Channel State Information

Authors: Jialing Liu, Nicola Elia, Sekhar Tatikonda

Abstract: In this paper, we propose capacity-achieving communication schemes for Gaussian finite-state Markov channels (FSMCs) subject to an average channel input power constraint, under the assumption that the transmitters can have access to delayed noiseless output feedback as well as instantaneous or delayed channel state information (CSI). We show that the proposed schemes reveal connections between feedback communication and feedback control.

Submitted 7 October, 2010; v1 submitted 14 August, 2008; originally announced August 2008.

Comments: Submitted to the IEEE Transactions on Information Theory. 31 pages

arXiv:0707.2014 [pdf, ps, other]

On the error exponent of variable-length block-coding schemes over finite-state Markov channels with feedback

Authors: Giacomo Como, Serdar Yuksel, Sekhar Tatikonda

Abstract: The error exponent of Markov channels with feedback is studied in the variable-length block-coding setting. Burnashev's classic result is extended and a single-letter characterization for the reliability function of finite-state Markov channels is presented, under the assumption that the channel state is causally observed both at the transmitter and at the receiver side. Tools from stochastic control theory are used in order to treat channels with intersymbol interference. In particular, the convex analytical approach to Markov decision processes is adopted to handle problems with stopping time horizons arising from variable-length coding schemes.

Submitted 13 July, 2007; originally announced July 2007.

arXiv:cs/0701099 [pdf, ps, other]

On the Feedback Capacity of Power Constrained Gaussian Noise Channels with Memory

Authors: Shaohua Yang, Aleksandar Kavcic, Sekhar Tatikonda

Abstract: For a stationary additive Gaussian-noise channel with a rational noise power spectrum of a finite-order $L$, we derive two new results for the feedback capacity under an average channel input power constraint. First, we show that a very simple feedback-dependent Gauss-Markov source achieves the feedback capacity, and that Kalman-Bucy filtering is optimal for processing the feedback. Based on these results, we develop a new method for optimizing the channel inputs for achieving the Cover-Pombra block-length-$n$ feedback capacity by using a dynamic programming approach that decomposes the computation into $n$ sequentially identical optimization problems where each stage involves optimizing $O(L^2)$ variables. Second, we derive the explicit maximal information rate for stationary feedback-dependent sources. In general, evaluating the maximal information rate for stationary sources requires solving only a few equations by simple non-linear programming. For first-order autoregressive and/or moving average (ARMA) noise channels, this optimization admits a closed form maximal information rate formula. The maximal information rate for stationary sources is a lower bound on the feedback capacity, and it equals the feedback capacity if the long-standing conjecture, that stationary sources achieve the feedback capacity, holds.

Submitted 16 January, 2007; originally announced January 2007.

Comments: IEEE Transactions on Information Theory, accepted version; first version submitted on Oct 22, 2003

arXiv:cs/0609139 [pdf, ps, other]

The Capacity of Channels with Feedback

Authors: Sekhar Tatikonda, Sanjoy Mitter

Abstract: We introduce a general framework for treating channels with memory and feedback. First, we generalize Massey's concept of directed information and use it to characterize the feedback capacity of general channels. Second, we present coding results for Markov channels. This requires determining appropriate sufficient statistics at the encoder and decoder. Third, a dynamic programming framework for computing the capacity of Markov channels is presented. Fourth, it is shown that the average cost optimality equation (ACOE) can be viewed as an implicit single-letter characterization of the capacity. Fifth, scenarios with simple sufficient statistics are described.

Submitted 25 September, 2006; originally announced September 2006.

Showing 1–29 of 29 results for author: Tatikonda, S