-
Mutual Information, Relative Entropy and Estimation Error in Semi-martingale Channels
Authors:
Jiantao Jiao,
Kartik Venkat,
Tsachy Weissman
Abstract:
Fundamental relations between information and estimation have been established in the literature for the continuous-time Gaussian and Poisson channels, in a long line of work starting from the classical representation theorems by Duncan and Kabanov respectively. In this work, we demonstrate that such relations hold for a much larger family of continuous-time channels. We introduce the family of semi-martingale channels where the channel output is a semi-martingale stochastic process, and the channel input modulates the characteristics of the semi-martingale. For these channels, which include as special cases the continuous-time Gaussian and Poisson models, we establish new representations relating the mutual information between the channel input and output to an optimal causal filtering loss, thereby unifying and considerably extending results from the Gaussian and Poisson settings. Extensions to the setting of mismatched estimation are also presented, where the relative entropy between the laws governing the output of the channel under two different input distributions is equal to the cumulative difference between the estimation loss incurred by using the mismatched and optimal causal filters respectively. The main tool underlying these results is the Doob--Meyer decomposition of a class of likelihood ratio sub-martingales. The results in this work can be viewed as the continuous-time analogues of recent generalizations for relations between information and estimation for discrete-time Lévy channels.
Submitted 18 April, 2017;
originally announced April 2017.
-
Beyond Maximum Likelihood: from Theory to Practice
Authors:
Jiantao Jiao,
Kartik Venkat,
Yanjun Han,
Tsachy Weissman
Abstract:
Maximum likelihood is the most widely used statistical estimation technique. Recent work by the authors introduced a general methodology for the construction of estimators for functionals in parametric models, and demonstrated improvements - both in theory and in practice - over the maximum likelihood estimator (MLE), particularly in high dimensional scenarios involving parameter dimension comparable to or larger than the number of samples. This approach to estimation, building on results from approximation theory, is shown to yield minimax rate-optimal estimators for a wide class of functionals, implementable with modest computational requirements. In a nutshell, a message of this recent work is that, for a wide class of functionals, the performance of these essentially optimal estimators with $n$ samples is comparable to that of the MLE with $n \ln n$ samples.
In the present paper, we highlight the applicability of the aforementioned methodology to statistical problems beyond functional estimation, and show that it can yield substantial gains. For example, we demonstrate that for learning tree-structured graphical models, our approach achieves a significant reduction of the required data size compared with the classical Chow--Liu algorithm, which is an implementation of the MLE, to achieve the same accuracy. The key step in improving the Chow--Liu algorithm is to replace the empirical mutual information with the estimator for mutual information proposed by the authors. Further, applying the same replacement approach to classical Bayesian network classification, the resulting classifiers uniformly outperform the previous classifiers on 26 widely used datasets.
Submitted 25 September, 2014;
originally announced September 2014.
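
As a rough illustration of the replacement idea described in the abstract above, the following sketch builds a Chow--Liu tree as a maximum-weight spanning tree over pairwise mutual information, with the mutual information estimator passed in as a parameter. The function names (`empirical_mi`, `chow_liu_edges`) and the `mi` argument are assumptions made for this example; the default is the plain empirical (plug-in) estimator, not the improved estimator proposed by the authors.

```python
import math
from collections import Counter
from itertools import combinations


def empirical_mi(xs, ys):
    """Plug-in (empirical) mutual information between two columns, in nats."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (cx[a] * cy[b]))
        for (a, b), c in cxy.items()
    )


def chow_liu_edges(data, mi=empirical_mi):
    """Return the edges of a maximum-weight spanning tree over the variables.

    data: list of rows, each row a sequence of discrete values.
    mi:   any pairwise mutual information estimator (hypothetical hook);
          swapping it is the 'replacement approach' sketched here.
    """
    d = len(data[0])
    cols = [[row[j] for row in data] for j in range(d)]
    # Score every pair of variables, heaviest edges first.
    weighted = sorted(
        ((mi(cols[i], cols[j]), i, j) for i, j in combinations(range(d), 2)),
        reverse=True,
    )
    parent = list(range(d))  # union-find forest for Kruskal's algorithm

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    edges = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding the edge does not create a cycle
            parent[ri] = rj
            edges.append((i, j))
    return edges
```

Calling `chow_liu_edges(rows, mi=some_other_estimator)` would keep the tree construction unchanged and only alter the edge weights, which is the sense in which the estimator is a drop-in replacement.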
-
Maximum Likelihood Estimation of Functionals of Discrete Distributions
Authors:
Jiantao Jiao,
Kartik Venkat,
Yanjun Han,
Tsachy Weissman
Abstract:
We consider the problem of estimating functionals of discrete distributions, and focus on tight nonasymptotic analysis of the worst case squared error risk of widely used estimators. We apply concentration inequalities to analyze the random fluctuation of these estimators around their expectations, and the theory of approximation using positive linear operators to analyze the deviation of their expectations from the true functional, namely their \emph{bias}.
We characterize the worst case squared error risk incurred by the Maximum Likelihood Estimator (MLE) in estimating the Shannon entropy $H(P) = \sum_{i = 1}^S -p_i \ln p_i$, and $F_α(P) = \sum_{i = 1}^S p_i^α,α>0$, up to multiplicative constants, for any alphabet size $S\leq \infty$ and sample size $n$ for which the risk may vanish. As a corollary, for Shannon entropy estimation, we show that it is necessary and sufficient to have $n \gg S$ observations for the MLE to be consistent. In addition, we establish that it is necessary and sufficient to consider $n \gg S^{1/α}$ samples for the MLE to consistently estimate $F_α(P), 0<α<1$. The minimax rate-optimal estimators for both problems require $S/\ln S$ and $S^{1/α}/\ln S$ samples, which implies that the MLE has a strictly sub-optimal sample complexity. When $1<α<3/2$, we show that the worst-case squared error rate of convergence for the MLE is $n^{-2(α-1)}$ for infinite alphabet size, while the minimax squared error rate is $(n\ln n)^{-2(α-1)}$. When $α\geq 3/2$, the MLE achieves the minimax optimal rate $n^{-1}$ regardless of the alphabet size.
As an application of the general theory, we analyze the Dirichlet prior smoothing techniques for Shannon entropy estimation. We show that no matter how we tune the parameters in the Dirichlet prior, this technique cannot achieve the minimax rates in entropy estimation.
Submitted 9 August, 2017; v1 submitted 26 June, 2014;
originally announced June 2014.
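
To make the object of study in the abstract above concrete, here is a minimal sketch of the MLE (plug-in) entropy estimator, together with a small Monte Carlo helper for observing its downward bias when the sample size $n$ is not much larger than the alphabet size $S$. The helper name `mean_plugin_entropy`, the uniform source, and the trial count are assumptions chosen for illustration only.

```python
import math
import random
from collections import Counter


def plugin_entropy(samples):
    """MLE (plug-in) estimate of Shannon entropy in nats.

    Evaluates H at the empirical distribution: -sum p_hat * ln(p_hat).
    """
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def mean_plugin_entropy(S, n, trials=200, seed=0):
    """Average plug-in estimate over repeated draws from a uniform source.

    The true entropy of the uniform distribution on S symbols is ln(S),
    so comparing the returned value with math.log(S) shows the bias.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        samples = [rng.randrange(S) for _ in range(n)]
        total += plugin_entropy(samples)
    return total / trials
```

Because the empirical distribution is supported on at most $n$ symbols, the plug-in estimate can never exceed $\ln n$, so for $n < S$ it necessarily underestimates the entropy $\ln S$ of a uniform source.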
-
Minimax Estimation of Functionals of Discrete Distributions
Authors:
Jiantao Jiao,
Kartik Venkat,
Yanjun Han,
Tsachy Weissman
Abstract:
We propose a general methodology for the construction and analysis of minimax estimators for a wide class of functionals of finite dimensional parameters, and elaborate on the case of discrete distributions, where the alphabet size $S$ is unknown and may be comparable with the number of observations $n$. We treat the respective regions where the functional is "nonsmooth" and "smooth" separately. In the "nonsmooth" regime, we apply an unbiased estimator for the best polynomial approximation of the functional whereas, in the "smooth" regime, we apply a bias-corrected Maximum Likelihood Estimator (MLE). We illustrate the merit of this approach by thoroughly analyzing two important cases: the entropy $H(P) = \sum_{i = 1}^S -p_i \ln p_i$ and $F_α(P) = \sum_{i = 1}^S p_i^α,α>0$. We obtain the minimax $L_2$ rates for estimating these functionals. In particular, we demonstrate that our estimator achieves the optimal sample complexity $n \asymp S/\ln S$ for entropy estimation. We also show that the sample complexity for estimating $F_α(P),0<α<1$ is $n\asymp S^{1/α}/ \ln S$, which can be achieved by our estimator but not the MLE. For $1<α<3/2$, we show the minimax $L_2$ rate for estimating $F_α(P)$ is $(n\ln n)^{-2(α-1)}$ regardless of the alphabet size, while the $L_2$ rate for the MLE is $n^{-2(α-1)}$. For all the above cases, the behavior of the minimax rate-optimal estimators with $n$ samples is essentially that of the MLE with $n\ln n$ samples. We highlight the practical advantages of our schemes for entropy and mutual information estimation. We demonstrate that our approach reduces running time and boosts accuracy compared to various existing approaches. Moreover, we show that the mutual information estimator induced by our methodology leads to significant performance boosts over the Chow--Liu algorithm in learning graphical models.
Submitted 10 March, 2015; v1 submitted 26 June, 2014;
originally announced June 2014.
-
Relations between Information and Estimation in Discrete-Time Lévy Channels
Authors:
Jiantao Jiao,
Kartik Venkat,
Tsachy Weissman
Abstract:
Fundamental relations between information and estimation have been established in the literature for the discrete-time Gaussian and Poisson channels. In this work, we demonstrate that such relations hold for a much larger class of observation models. We introduce the natural family of discrete-time Lévy channels where the distribution of the output conditioned on the input is infinitely divisible. For Lévy channels, we establish new representations relating the mutual information between the channel input and output to an optimal expected estimation loss, thereby unifying and considerably extending results from the Gaussian and Poisson settings. We demonstrate the richness of our results by working out two examples of Lévy channels, namely the gamma channel and the negative binomial channel, with corresponding relations between information and estimation. Extensions to the setting of mismatched estimation are also presented.
Submitted 1 February, 2017; v1 submitted 27 April, 2014;
originally announced April 2014.
-
Information Measures: the Curious Case of the Binary Alphabet
Authors:
Jiantao Jiao,
Thomas Courtade,
Albert No,
Kartik Venkat,
Tsachy Weissman
Abstract:
Four problems related to information divergence measures defined on finite alphabets are considered. In three of the cases we consider, we illustrate a contrast which arises between the binary-alphabet and larger-alphabet settings. This is surprising in some instances, since characterizations for the larger-alphabet settings do not generalize their binary-alphabet counterparts. Specifically, we show that $f$-divergences are not the unique decomposable divergences on binary alphabets that satisfy the data processing inequality, thereby clarifying claims that have previously appeared in the literature. We also show that KL divergence is the unique Bregman divergence which is also an $f$-divergence for any alphabet size. We show that KL divergence is the unique Bregman divergence which is invariant to statistically sufficient transformations of the data, even when non-decomposable divergences are considered. Like some of the problems we consider, this result holds only when the alphabet size is at least three.
Submitted 28 November, 2014; v1 submitted 27 April, 2014;
originally announced April 2014.
-
Justification of Logarithmic Loss via the Benefit of Side Information
Authors:
Jiantao Jiao,
Thomas Courtade,
Kartik Venkat,
Tsachy Weissman
Abstract:
We consider a natural measure of relevance: the reduction in optimal prediction risk in the presence of side information. For any given loss function, this relevance measure captures the benefit of side information for performing inference on a random variable under this loss function. When such a measure satisfies a natural data processing property, and the random variable of interest has alphabet size greater than two, we show that it is uniquely characterized by the mutual information, and the corresponding loss function coincides with logarithmic loss. In doing so, our work provides a new characterization of mutual information, and justifies its use as a measure of relevance. When the alphabet is binary, we characterize the only admissible forms the measure of relevance can assume while obeying the specified data processing property. Our results naturally extend to measuring causal influence between stochastic processes, where we unify different causal-inference measures in the literature as instantiations of directed information.
Submitted 22 December, 2015; v1 submitted 18 March, 2014;
originally announced March 2014.
-
Information, Estimation, and Lookahead in the Gaussian channel
Authors:
Kartik Venkat,
Tsachy Weissman,
Yair Carmon,
Shlomo Shamai
Abstract:
We consider mean squared estimation with lookahead of a continuous-time signal corrupted by additive white Gaussian noise. We show that the mutual information rate function, i.e., the mutual information rate as function of the signal-to-noise ratio (SNR), does not, in general, determine the minimum mean squared error (MMSE) with fixed finite lookahead, in contrast to the special cases with 0 and infinite lookahead (filtering and smoothing errors), respectively, which were previously established in the literature. We also establish a new expectation identity under a generalized observation model where the Gaussian channel has an SNR jump at $t=0$, capturing the tradeoff between lookahead and SNR.
Further, we study the class of continuous-time stationary Gauss-Markov processes (Ornstein-Uhlenbeck processes) as channel inputs, and explicitly characterize the behavior of the minimum mean squared error (MMSE) with finite lookahead and signal-to-noise ratio (SNR). The MMSE with lookahead is shown to converge exponentially rapidly to the non-causal error, with the exponent being the reciprocal of the non-causal error. We extend our results to mixtures of Ornstein-Uhlenbeck processes, and use the insight gained to present lower and upper bounds on the MMSE with lookahead for a class of stationary Gaussian input processes, whose spectrum can be expressed as a mixture of Ornstein-Uhlenbeck spectra.
Submitted 8 February, 2013;
originally announced February 2013.
-
Reference Based Genome Compression
Authors:
Bobbie Chern,
Idoia Ochoa,
Alexandros Manolakos,
Albert No,
Kartik Venkat,
Tsachy Weissman
Abstract:
DNA sequencing technology has advanced to a point where storage is becoming the central bottleneck in the acquisition and mining of more data. Large amounts of data are vital for genomics research, and generic compression tools, while viable, cannot offer the same savings as approaches tuned to inherent biological properties. We propose an algorithm to compress a target genome given a known reference genome. The proposed algorithm first generates a mapping from the reference to the target genome, and then compresses this mapping with an entropy coder. As an illustration of the performance: applying our algorithm to James Watson's genome with hg18 as a reference, we are able to reduce the 2991 megabyte (MB) genome down to 6.99 MB, while Gzip compresses it to 834.8 MB.
Submitted 9 April, 2012;
originally announced April 2012.
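
The general idea of a reference-to-target mapping, as described in the abstract above, can be sketched with a toy greedy matcher. This is not the authors' algorithm: the operation format (`"copy"` / `"lit"` tuples), the k-mer index, and the `min_match` threshold are all assumptions made for this example, and the entropy-coding stage is omitted entirely.

```python
def encode(reference, target, min_match=4):
    """Greedily describe `target` as copies from `reference` plus literals.

    Returns a list of operations:
      ("copy", ref_pos, length)  copy `length` symbols from reference[ref_pos:]
      ("lit", symbol)            emit one symbol not covered by a copy
    """
    k = min_match
    # Index every k-mer of the reference by its starting positions.
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)

    ops, t = [], 0
    while t < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position sharing the next k-mer; keep the longest match.
        for pos in index.get(target[t:t + k], []):
            length = 0
            while (pos + length < len(reference)
                   and t + length < len(target)
                   and reference[pos + length] == target[t + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            ops.append(("copy", best_pos, best_len))
            t += best_len
        else:
            ops.append(("lit", target[t]))
            t += 1
    return ops


def decode(reference, ops):
    """Rebuild the target from the reference and the operation list."""
    out = []
    for op in ops:
        if op[0] == "copy":
            _, pos, length = op
            out.append(reference[pos:pos + length])
        else:
            out.append(op[1])
    return "".join(out)
```

When the target differs from the reference only by sparse substitutions, the operation list is far shorter than the target itself, and it is this list that a subsequent entropy coder would compress.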
-
Pointwise Relations between Information and Estimation in Gaussian Noise
Authors:
Kartik Venkat,
Tsachy Weissman
Abstract:
Many of the classical and recent relations between information and estimation in the presence of Gaussian noise can be viewed as identities between expectations of random quantities. These include the I-MMSE relationship of Guo et al.; the relative entropy and mismatched estimation relationship of Verdú; the relationship between causal estimation and mutual information of Duncan, and its extension to the presence of feedback by Kadota et al.; the relationship between causal and non-causal estimation of Guo et al., and its mismatched version of Weissman. We dispense with the expectations and explore the nature of the pointwise relations between the respective random quantities. The pointwise relations that we find are as succinctly stated as - and give considerable insight into - the original expectation identities.
As an illustration of our results, consider Duncan's 1970 discovery that the mutual information is equal to the causal MMSE in the AWGN channel, which can equivalently be expressed by saying that the difference between the input-output information density and half the causal estimation error is a zero mean random variable (regardless of the distribution of the channel input). We characterize this random variable explicitly, rather than merely its expectation. Classical estimation and information theoretic quantities emerge with new and surprising roles. For example, the variance of this random variable turns out to be given by the causal MMSE (which, in turn, is equal to the mutual information by Duncan's result).
Submitted 30 April, 2012; v1 submitted 30 October, 2011;
originally announced October 2011.
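
For reference, two of the expectation identities named in the abstract above can be written out explicitly. The notation is assumed here rather than taken from the listing: the continuous-time channel is $dY_t = X_t\,dt + dW_t$ on $[0,T]$ with $W$ a standard Brownian motion independent of a finite-power input $X$, the scalar channel is $Y = \sqrt{\mathrm{snr}}\,X + N$ with $N$ standard Gaussian, and information is measured in nats.

$$
I\left(X_0^T; Y_0^T\right) = \frac{1}{2}\, \mathbb{E}\left[ \int_0^T \left( X_t - \mathbb{E}\left[X_t \mid Y_0^t\right] \right)^2 dt \right]
$$

$$
\frac{d}{d\,\mathrm{snr}}\, I\left(X; \sqrt{\mathrm{snr}}\,X + N\right) = \frac{1}{2}\, \mathbb{E}\left[ \left( X - \mathbb{E}\left[X \mid \sqrt{\mathrm{snr}}\,X + N\right] \right)^2 \right]
$$

The first is Duncan's relation between mutual information and the causal (filtering) squared error; the second is the I-MMSE relation of Guo et al. In both, the factor $1/2$ is the one referred to in the phrase "half the causal estimation error".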