-
Finite-sample expansions for the optimal error probability in asymmetric binary hypothesis testing
Authors:
Valentinian Lungu,
Ioannis Kontoyiannis
Abstract:
The problem of binary hypothesis testing between two probability measures is considered. New sharp bounds are derived for the best achievable error probability of such tests based on independent and identically distributed observations. Specifically, the asymmetric version of the problem is examined, where different requirements are placed on the two error probabilities. Accurate nonasymptotic expansions with explicit constants are obtained for the error probability, using tools from large deviations and Gaussian approximation. Examples are shown indicating that, in the asymmetric regime, the approximations suggested by the new bounds are significantly more accurate than the approximations provided by either of the two main earlier approaches -- normal approximation and error exponents.
Submitted 29 May, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
Relative entropy bounds for sampling with and without replacement
Authors:
Oliver Johnson,
Lampros Gavalakis,
Ioannis Kontoyiannis
Abstract:
Sharp, nonasymptotic bounds are obtained for the relative entropy between the distributions of sampling with and without replacement from an urn with balls of $c\geq 2$ colours. Our bounds are asymptotically tight in certain regimes and, unlike previous results, they depend on the number of balls of each colour in the urn. The connection of these results with finite de Finetti-style theorems is explored, and it is observed that a sampling bound due to Stam (1978), combined with the convexity of relative entropy, yields a new finite de Finetti bound in relative entropy, which achieves the optimal asymptotic convergence rate.
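As a small numerical illustration of the quantity discussed above (not of the paper's bounds, which are not reproduced here), the following sketch computes the relative entropy for the simplest case of two colours. It assumes the comparison is made between the counts of one colour in the sample: hypergeometric when sampling without replacement, binomial when sampling with replacement. The function name and its parameters are hypothetical.

```python
from math import comb, log

def kl_without_vs_with(n_balls, n_red, k):
    """Relative entropy D(hypergeometric || binomial), in nats, for the
    number of red balls in a sample of size k drawn from an urn holding
    n_balls balls, n_red of which are red (two-colour case only)."""
    p = n_red / n_balls
    lo = max(0, k - (n_balls - n_red))   # smallest feasible red count
    hi = min(k, n_red)                   # largest feasible red count
    total = 0.0
    for j in range(lo, hi + 1):
        # Probability of j red balls when sampling without replacement.
        h = comb(n_red, j) * comb(n_balls - n_red, k - j) / comb(n_balls, k)
        # Probability of j red balls when sampling with replacement.
        b = comb(k, j) * p**j * (1 - p)**(k - j)
        total += h * log(h / b)
    return total

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, kl_without_vs_with(n, n // 2, 3))
```

With the sample size held fixed, the computed value shrinks as the urn grows, which matches the intuition that the two sampling schemes become harder to tell apart.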
Submitted 9 April, 2024;
originally announced April 2024.
-
The entropic doubling constant and robustness of Gaussian codebooks for additive-noise channels
Authors:
Lampros Gavalakis,
Ioannis Kontoyiannis,
Mokshay Madiman
Abstract:
Entropy comparison inequalities are obtained for the differential entropy $h(X+Y)$ of the sum of two independent random vectors $X,Y$, when one is replaced by a Gaussian. For identically distributed random vectors $X,Y$, these are closely related to bounds on the entropic doubling constant, which quantifies the entropy increase when adding an independent copy of a random vector to itself. Consequences of both large and small doubling are explored. For the former, lower bounds are deduced on the entropy increase when adding an independent Gaussian, while for the latter, a qualitative stability result for the entropy power inequality is obtained. In the more general case of non-identically distributed random vectors $X,Y$, a Gaussian comparison inequality with interesting implications for channel coding is established: For additive-noise channels with a power constraint, Gaussian codebooks come within a $\frac{\sf snr}{3{\sf snr}+2}$ factor of capacity. In the low-SNR regime this improves the half-a-bit additive bound of Zamir and Erez (2004). Analogous results are obtained for additive-noise multiple access channels, and for linear, additive-noise MIMO channels.
Submitted 11 March, 2024;
originally announced March 2024.
-
Temporally Causal Discovery Tests for Discrete Time Series and Neural Spike Trains
Authors:
A. Theocharous,
G. G. Gregoriou,
P. Sapountzis,
I. Kontoyiannis
Abstract:
We consider the problem of detecting causal relationships between discrete time series, in the presence of potential confounders. A hypothesis test is introduced for identifying the temporally causal influence of $(x_n)$ on $(y_n)$, causally conditioned on a possibly confounding third time series $(z_n)$. Under natural Markovian modeling assumptions, it is shown that the null hypothesis, corresponding to the absence of temporally causal influence, is equivalent to the underlying `causal conditional directed information rate' being equal to zero. The plug-in estimator for this functional is identified with the log-likelihood ratio test statistic for the desired test. This statistic is shown to be asymptotically normal under the alternative hypothesis and asymptotically $χ^2$ distributed under the null, facilitating the computation of $p$-values when used on empirical data. The effectiveness of the resulting hypothesis test is illustrated on simulated data, validating the underlying theory. The test is also employed in the analysis of spike train data recorded from neurons in the V4 and FEF brain regions of behaving animals during a visual attention task. There, the test results are seen to identify interesting and biologically relevant information.
Submitted 17 November, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Generalised shot noise representations of stochastic systems driven by non-Gaussian Lévy processes
Authors:
Marcos Tapia Costa,
Ioannis Kontoyiannis,
Simon Godsill
Abstract:
We consider the problem of obtaining effective representations for the solutions of linear, vector-valued stochastic differential equations (SDEs) driven by non-Gaussian pure-jump Lévy processes, and we show how such representations lead to efficient simulation methods. The processes considered constitute a broad class of models that find application across the physical and biological sciences, mathematics, finance and engineering. Motivated by important relevant problems in statistical inference, we derive new, generalised shot-noise simulation methods whenever a normal variance-mean (NVM) mixture representation exists for the driving Lévy process, including the generalised hyperbolic, normal-Gamma, and normal tempered stable cases. Simple, explicit conditions are identified for the convergence of the residual of a truncated shot-noise representation to a Brownian motion in the case of the pure Lévy process, and to a Brownian-driven SDE in the case of the Lévy-driven SDE. These results provide Gaussian approximations to the small jumps of the process under the NVM representation. The resulting representations are of particular importance in state inference and parameter estimation for Lévy-driven SDE models, since the resulting conditionally Gaussian structures can be readily incorporated into latent variable inference methods such as Markov chain Monte Carlo (MCMC), Expectation-Maximisation (EM), and sequential Monte Carlo.
Submitted 7 November, 2023; v1 submitted 10 May, 2023;
originally announced May 2023.
-
A Third Information-Theoretic Approach to Finite de Finetti Theorems
Authors:
Mario Berta,
Lampros Gavalakis,
Ioannis Kontoyiannis
Abstract:
A new finite form of de Finetti's representation theorem is established using elementary information-theoretic tools. The distribution of the first $k$ random variables in an exchangeable vector of $n\geq k$ random variables is close to a mixture of product distributions. Closeness is measured in terms of the relative entropy and an explicit bound is provided. This bound is tighter than those obtained via earlier information-theoretic proofs, and its utility extends to random variables taking values in general spaces. The core argument employed has its origins in the quantum information-theoretic literature.
Submitted 25 April, 2024; v1 submitted 11 April, 2023;
originally announced April 2023.
-
Context-tree weighting and Bayesian Context Trees: Asymptotic and non-asymptotic justifications
Authors:
Ioannis Kontoyiannis
Abstract:
The Bayesian Context Trees (BCT) framework is a recently introduced, general collection of statistical and algorithmic tools for modelling, analysis and inference with discrete-valued time series. The foundation of this development is built in part on some well-known information-theoretic ideas and techniques, including Rissanen's tree sources and Willems et al.'s context-tree weighting algorithm. This paper presents a collection of theoretical results that provide mathematical justifications and further insight into the BCT modelling framework and the associated practical tools. It is shown that the BCT prior predictive likelihood (the probability of a time series of observations averaged over all models and parameters) is both pointwise and minimax optimal, in agreement with the MDL principle and the BIC criterion. The posterior distribution is shown to be asymptotically consistent with probability one (over both models and parameters), and asymptotically Gaussian (over the parameters). And the posterior predictive distribution is also shown to be asymptotically consistent with probability one.
Submitted 5 September, 2023; v1 submitted 4 November, 2022;
originally announced November 2022.
-
Information in probability: Another information-theoretic proof of a finite de Finetti theorem
Authors:
Lampros Gavalakis,
Ioannis Kontoyiannis
Abstract:
We recall some of the history of the information-theoretic approach to deriving core results in probability theory and indicate parts of the recent resurgence of interest in this area with current progress along several interesting directions. Then we give a new information-theoretic proof of a finite version of de Finetti's classical representation theorem for finite-valued random variables. We derive an upper bound on the relative entropy between the distribution of the first $k$ in a sequence of $n$ exchangeable random variables, and an appropriate mixture over product distributions. The mixing measure is characterised as the law of the empirical measure of the original sequence, and de Finetti's result is recovered as a corollary. The proof is nicely motivated by the Gibbs conditioning principle in connection with statistical mechanics, and it follows along an appealing sequence of steps. The technical estimates required for these steps are obtained via the use of a collection of combinatorial tools known within information theory as `the method of types.'
Submitted 26 April, 2022; v1 submitted 11 April, 2022;
originally announced April 2022.
-
Posterior Representations for Bayesian Context Trees: Sampling, Estimation and Convergence
Authors:
Ioannis Papageorgiou,
Ioannis Kontoyiannis
Abstract:
We revisit the Bayesian Context Trees (BCT) modelling framework for discrete time series, which was recently found to be very effective in numerous tasks including model selection, estimation and prediction. A novel representation of the induced posterior distribution on model space is derived in terms of a simple branching process, and several consequences of this are explored in theory and in practice. First, it is shown that the branching process representation leads to a simple variable-dimensional Monte Carlo sampler for the joint posterior distribution on models and parameters, which can efficiently produce independent samples. This sampler is found to be more efficient than earlier MCMC samplers for the same tasks. Then, the branching process representation is used to establish the asymptotic consistency of the BCT posterior, including the derivation of an almost-sure convergence rate. Finally, an extensive study is carried out on the performance of the induced Bayesian entropy estimator. Its utility is illustrated through both simulation experiments and real-world applications, where it is found to outperform several state-of-the-art methods.
Submitted 20 March, 2023; v1 submitted 4 February, 2022;
originally announced February 2022.
-
The ODE Method for Asymptotic Statistics in Stochastic Approximation and Reinforcement Learning
Authors:
Vivek Borkar,
Shuhang Chen,
Adithya Devraj,
Ioannis Kontoyiannis,
Sean Meyn
Abstract:
The paper concerns the stochastic approximation recursion, \[ θ_{n+1}= θ_n + α_{n + 1} f(θ_n, Φ_{n+1})\,,\quad n\ge 0, \] where the {\em estimates} $θ_n$ take values in $\Re^d$ and $\{ Φ_n \}$ is a Markov chain on a general state space. In addition to standard Lipschitz assumptions and conditions on the vanishing step-size sequence, it is assumed that the associated \textit{mean flow}, $\tfrac{d}{dt} \vartheta_t = \bar{f}(\vartheta_t)$, is globally asymptotically stable with stationary point denoted $θ^*$, where $\bar{f}(θ)=\text{E}[f(θ,Φ)]$ with $Φ$ having the stationary distribution of the chain. The main results are established under additional conditions on the mean flow and a version of the Donsker-Varadhan Lyapunov drift condition known as (DV3) for the chain:
(i) An appropriate Lyapunov function is constructed that implies convergence of the estimates in $L_4$.
(ii) A functional CLT is established, as well as the usual one-dimensional CLT for the normalized error. Moment bounds combined with the CLT imply convergence of the normalized covariance $\text{E} [ z_n z_n^T ]$ to the asymptotic covariance $Σ^Θ$ in the CLT, where $z_n= (θ_n-θ^*)/\sqrt{α_n}$.
(iii) The CLT holds for the normalized version $z^{\text{PR}}_n$ of the averaged parameters $θ^{\text{PR}}_n$, subject to standard assumptions on the step-size. Moreover, the normalized covariance of both $θ^{\text{PR}}_n$ and $z^{\text{PR}}_n$ converges to $Σ^{\text{PR}}$, the minimal covariance of Polyak and Ruppert.
(iv) An example is given where $f$ and $\bar{f}$ are linear in $θ$, and the Markov chain is geometrically ergodic but does not satisfy (DV3). While the algorithm is convergent, the second moment of $θ_n$ is unbounded and in fact diverges.
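The recursion and the Polyak-Ruppert averaging mentioned above can be sketched on a toy scalar problem. Everything in this example is an illustrative assumption rather than anything taken from the paper: the two-state Markov chain on {0, 1}, the choice $f(θ,Φ)=Φ-θ$ (so the mean flow is $\tfrac{d}{dt}\vartheta_t = 1/2-\vartheta_t$ and $θ^*=1/2$), the step-size $α_n=n^{-0.7}$, and the function name.

```python
import random

def run_sa(n_steps=20000, switch_prob=0.3, rho=0.7, seed=0):
    """Toy stochastic approximation with Markovian noise.

    Phi is a symmetric two-state Markov chain on {0, 1}, so its
    stationary mean is 1/2. With f(theta, phi) = phi - theta the
    recursion theta_{n+1} = theta_n + alpha_{n+1} f(theta_n, Phi_{n+1})
    has mean flow d/dt v = 1/2 - v and stationary point theta* = 1/2.
    Returns the last iterate and the running average of all iterates.
    """
    rng = random.Random(seed)
    phi = 0
    theta = 0.0
    avg = 0.0
    for n in range(1, n_steps + 1):
        # Advance the Markov chain: flip state with probability switch_prob.
        if rng.random() < switch_prob:
            phi = 1 - phi
        alpha = n ** (-rho)               # vanishing step-size (assumed form)
        theta += alpha * (phi - theta)    # stochastic approximation update
        avg += (theta - avg) / n          # Polyak-Ruppert running average
    return theta, avg

if __name__ == "__main__":
    last, averaged = run_sa()
    print("last iterate:", last, "averaged estimate:", averaged)
```

In runs of this toy problem both outputs settle near 1/2, with the averaged estimate typically fluctuating less than the last iterate.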
Submitted 21 February, 2024; v1 submitted 27 October, 2021;
originally announced October 2021.
-
Entropy and the Discrete Central Limit Theorem
Authors:
Lampros Gavalakis,
Ioannis Kontoyiannis
Abstract:
A strengthened version of the central limit theorem for discrete random variables is established, relying only on information-theoretic tools and elementary arguments. It is shown that the relative entropy between the standardised sum of $n$ independent and identically distributed lattice random variables and an appropriately discretised Gaussian, vanishes as $n\to\infty$.
Submitted 1 June, 2021;
originally announced June 2021.
-
Population-scale testing can suppress the spread of infectious disease
Authors:
Jussi Taipale,
Ioannis Kontoyiannis,
Sten Linnarsson
Abstract:
Major advances in public health have resulted from disease prevention. However, prevention of a new infectious disease by vaccination or pharmaceuticals is made difficult by the slow process of vaccine and drug development. We propose an additional intervention that allows rapid control of emerging infectious diseases, and can also be used to eradicate diseases that rely almost exclusively on human-to-human transmission. The intervention is based on (1) testing every individual for the disease, (2) repeatedly, and (3) isolation of infected individuals. We show here that at a sufficient rate of testing, the reproduction number is reduced below 1.0 and the epidemic will rapidly collapse. The approach does not rely on strong or unrealistic assumptions about test accuracy, isolation compliance, population structure or epidemiological parameters, and its success can be monitored in real time by following the test positivity rate. In addition to the compliance rate and false negatives, the required rate of testing depends on the design of the testing regime, with concurrent testing outperforming random sampling. Provided that results are obtained rapidly, the test frequency required to suppress an epidemic is monotonic and near-linear with respect to R0, the infectious period, and the fraction of susceptible individuals. The testing regime is effective against both early phase and established epidemics, and additive to other interventions (e.g. contact tracing and social distancing). It is also robust to failure: any rate of testing reduces the number of infections, improving both public health and economic conditions. These conclusions are based on rigorous analysis and simulations of appropriate epidemiological models. A mass-produced, disposable test that could be used at home would be ideal, due to the optimal performance of concurrent tests that return immediate results.
Submitted 14 April, 2021;
originally announced April 2021.
-
An Information-Theoretic Proof of a Finite de Finetti Theorem
Authors:
Lampros Gavalakis,
Ioannis Kontoyiannis
Abstract:
A finite form of de Finetti's representation theorem is established using elementary information-theoretic tools: The distribution of the first $k$ random variables in an exchangeable binary vector of length $n\geq k$ is close to a mixture of product distributions. Closeness is measured in terms of the relative entropy and an explicit bound is provided.
Submitted 25 June, 2021; v1 submitted 8 April, 2021;
originally announced April 2021.
-
Compression and Symmetry of Small-World Graphs and Structures
Authors:
Ioannis Kontoyiannis,
Yi Heng Lim,
Katia Papakonstantinopoulou,
Wojtek Szpankowski
Abstract:
For various purposes and, in particular, in the context of data compression, a graph can be examined at three levels. Its structure can be described as the unlabeled version of the graph; then the labeling of its structure can be added; and finally, given the structure and labeling, the contents of the labels can be described. Determining the amount of information present at each level, and quantifying the degree of dependence between them, requires the study of symmetry, graph automorphism, entropy, and graph compressibility. In this paper, we focus on a class of small-world graphs. These are geometric random graphs where vertices are first connected to their nearest neighbors on a circle and then pairs of non-neighbors are connected according to a distance-dependent probability distribution. We establish the degree distribution of this model, and use it to prove the model's asymmetry in an appropriate range of parameters. Then we derive the relevant entropy and structural entropy of these random graphs, in connection with graph compression.
Submitted 22 November, 2021; v1 submitted 31 July, 2020;
originally announced July 2020.
-
Optimal rates for independence testing via $U$-statistic permutation tests
Authors:
Thomas B. Berrett,
Ioannis Kontoyiannis,
Richard J. Samworth
Abstract:
We study the problem of independence testing given independent and identically distributed pairs taking values in a $σ$-finite, separable measure space. Defining a natural measure of dependence $D(f)$ as the squared $L^2$-distance between a joint density $f$ and the product of its marginals, we first show that there is no valid test of independence that is uniformly consistent against alternatives of the form $\{f: D(f) \geq ρ^2 \}$. We therefore restrict attention to alternatives that impose additional Sobolev-type smoothness constraints, and define a permutation test based on a basis expansion and a $U$-statistic estimator of $D(f)$ that we prove is minimax optimal in terms of its separation rates in many instances. Finally, for the case of a Fourier basis on $[0,1]^2$, we provide an approximation to the power function that offers several additional insights. Our methodology is implemented in the R package USP.
Submitted 6 November, 2020; v1 submitted 15 January, 2020;
originally announced January 2020.
-
The Lévy State Space Model
Authors:
Simon Godsill,
Marina Riabiz,
Ioannis Kontoyiannis
Abstract:
In this paper we introduce a new class of state space models based on shot-noise simulation representations of non-Gaussian Lévy-driven linear systems, represented as stochastic differential equations. In particular a conditionally Gaussian version of the models is proposed that is able to capture heavy-tailed non-Gaussianity while retaining tractability for inference procedures. We focus on a canonical class of such processes, the $α$-stable Lévy processes, which retain important properties such as self-similarity and heavy-tails, while emphasizing that broader classes of non-Gaussian Lévy processes may be handled by similar methodology. An important feature is that we are able to marginalise both the skewness and the scale parameters of these challenging models from posterior probability distributions. The models are posed in continuous time and so are able to deal with irregular data arrival times. Example modelling and inference procedures are provided using Rao-Blackwellised sequential Monte Carlo applied to a two-dimensional Langevin model, and this is tested on real exchange rate data.
Submitted 8 January, 2020; v1 submitted 28 December, 2019;
originally announced December 2019.
-
Differential Temporal Difference Learning
Authors:
Adithya M. Devraj,
Ioannis Kontoyiannis,
Sean P. Meyn
Abstract:
Value functions derived from Markov decision processes arise as a central component of algorithms as well as performance metrics in many statistics and engineering applications of machine learning techniques. Computation of the solution to the associated Bellman equations is challenging in most practical cases of interest. Temporal Difference (TD) learning algorithms, a popular class of approximation techniques, form an important sub-class of general reinforcement learning methods. The algorithms introduced in this paper are intended to resolve two well-known difficulties of TD-learning approaches: their slow convergence due to very high variance, and the fact that, for the problem of computing the relative value function, consistent algorithms exist only in special cases. First we show that the gradients of these value functions admit a representation that lends itself to algorithm design. Based on this result, a new class of differential TD-learning algorithms is introduced. For Markovian models on Euclidean space with smooth dynamics, the algorithms are shown to be consistent under general conditions. Numerical results show dramatic variance reduction when compared to standard methods.
Submitted 27 February, 2020; v1 submitted 28 December, 2018;
originally announced December 2018.
-
A Simple Network of Nodes Moving on the Circle
Authors:
Dimitris Cheliotis,
Ioannis Kontoyiannis,
Michail Loulakis,
Stavros Toumpis
Abstract:
Two simple Markov processes are examined, one in discrete and one in continuous time, arising from idealized versions of a transmission protocol for mobile, delay-tolerant networks. We consider two independent walkers moving with constant speed on either the discrete or continuous circle, and changing directions at independent geometric (respectively, exponential) times. One of the walkers carries a message that wishes to travel as far and as fast as possible in the clockwise direction. The message stays with its current carrier unless the two walkers meet, the carrier is moving counter-clockwise, and the other walker is moving clockwise. In that case, the message jumps to the other walker. The long-term average clockwise speed of the message is computed. An explicit expression is derived via the solution of an associated boundary value problem in terms of the generator of the underlying Markov process. The average transmission cost is also similarly computed, measured as the long-term number of jumps the message makes per unit time. The tradeoff between speed and cost is examined, as a function of the underlying problem parameters.
Submitted 4 March, 2020; v1 submitted 11 August, 2018;
originally announced August 2018.
-
Nonasymptotic Gaussian Approximation for Inference with Stable Noise
Authors:
Marina Riabiz,
Tohid Ardeshiri,
Ioannis Kontoyiannis,
Simon Godsill
Abstract:
The results of a series of theoretical studies are reported, examining the convergence rate for different approximate representations of $α$-stable distributions. Although they play a key role in modelling random processes with jumps and discontinuities, the use of $α$-stable distributions in inference often leads to analytically intractable problems. The LePage series, which is a probabilistic representation employed in this work, is used to transform an intractable, infinite-dimensional inference problem into a conditionally Gaussian parametric problem. A major component of our approach is the approximation of the tail of this series by a Gaussian random variable. Standard statistical techniques, such as Expectation-Maximization, Markov chain Monte Carlo, and Particle Filtering, can then be applied. In addition to the asymptotic normality of the tail of this series, we establish explicit, nonasymptotic bounds on the approximation error. Their proofs follow classical Fourier-analytic arguments, using Esséen's smoothing lemma. Specifically, we consider the distance between the distributions of: $(i)$~the tail of the series and an appropriate Gaussian; $(ii)$~the full series and the truncated series; and $(iii)$~the full series and the truncated series with an added Gaussian term. In all three cases, sharp bounds are established, and the theoretical results are compared with the actual distances (computed numerically) in specific examples of symmetric $α$-stable distributions. This analysis facilitates the selection of appropriate truncations in practice and offers theoretical guarantees for the accuracy of resulting estimates. One of the main conclusions obtained is that, for the purposes of inference, the use of a truncated series together with an approximately Gaussian error term has superior statistical properties and is likely a preferable choice in practice.
Submitted 1 January, 2020; v1 submitted 27 February, 2018;
originally announced February 2018.
-
Geometric Ergodicity in a Weighted Sobolev Space
Authors:
Adithya Devraj,
Ioannis Kontoyiannis,
Sean Meyn
Abstract:
For a discrete-time Markov chain $\{X(t)\}$ evolving on $\Re^\ell$ with transition kernel $P$, natural, general conditions are developed under which the following are established:
1. The transition kernel $P$ has a purely discrete spectrum, when viewed as a linear operator on a weighted Sobolev space $L_\infty^{v,1}$ of functions with norm, $$ \|f\|_{v,1} = \sup_{x \in \Re^\ell} \frac{1}{v(x)} \max \{|f(x)|, |\partial_1 f(x)|,\ldots,|\partial_\ell f(x)|\}, $$ where $v\colon \Re^\ell \to [1,\infty)$ is a Lyapunov function and $\partial_i:=\partial/\partial x_i$.
2. The Markov chain is geometrically ergodic in $L_\infty^{v,1}$: There is a unique invariant probability measure $π$ and constants $B<\infty$ and $δ>0$ such that, for each $f\in L_\infty^{v,1}$, any initial condition $X(0)=x$, and all $t\geq 0$: $$\Big| \text{E}_x[f(X(t))] - π(f)\Big| \le Be^{-δt}v(x),\quad \|\nabla \text{E}_x[f(X(t))] \|_2 \le Be^{-δt} v(x), $$ where $π(f)=\int fdπ$.
3. For any function $f\in L_\infty^{v,1}$ there is a function $h\in L_\infty^{v,1}$ solving Poisson's equation: \[ h-Ph = f-π(f). \] Part of the analysis is based on an operator-theoretic treatment of the sensitivity process that appears in the theory of Lyapunov exponents.
Submitted 18 July, 2019; v1 submitted 9 November, 2017;
originally announced November 2017.
-
Thinning and Information Projections
Authors:
Peter Harremoës,
Oliver Johnson,
Ioannis Kontoyiannis
Abstract:
In this paper we establish lower bounds on the information divergence of a distribution on the integers from a Poisson distribution. These lower bounds are tight and, in the cases where a rate of convergence in the Law of Thin Numbers can be computed, the rate is determined by the lower bounds proved in this paper. General techniques for getting lower bounds in terms of moments are developed. The results about lower bounds in the Law of Thin Numbers are used to derive similar results for the Central Limit Theorem.
Submitted 17 January, 2016;
originally announced January 2016.
-
On the $f$-Norm Ergodicity of Markov Processes in Continuous Time
Authors:
I. Kontoyiannis,
S. P. Meyn
Abstract:
Consider a Markov process $\{Φ(t) : t\geq 0\}$ evolving on a Polish space ${\sf X}$. A version of the $f$-Norm Ergodic Theorem is obtained: Suppose that the process is $ψ$-irreducible and aperiodic. For a given function $f\colon{\sf X}\to[1,\infty)$, under suitable conditions on the process the following are equivalent: \begin{enumerate} \item[(i)] There is a unique invariant probability measure $π$ satisfying $\int f\,dπ<\infty$. \item[(ii)] There is a closed set $C$ satisfying $ψ(C)>0$ that is ``self $f$-regular.'' \item[(iii)] There is a function $V\colon{\sf X} \to (0,\infty]$ that is finite on at least one point in ${\sf X}$, for which the following Lyapunov drift condition is satisfied, \[ {\cal D} V\leq - f+b\mathbb{I}_C\, , \eqno{\hbox{(V3)}} \] where $C$ is a closed small set and ${\cal D}$ is the extended generator of the process. \end{enumerate} For discrete-time chains the result is well known. Moreover, in that case, the ergodicity of $\boldsymbol{Φ}$ under a suitable norm is also obtained: For each initial condition $x\in{\sf X}$ satisfying $V(x)<\infty$, and any function $g\colon{\sf X}\to\Re$ for which $|g|$ is bounded by $f$, \[ \lim_{t\to\infty} {\sf E}_x[g(Φ(t))] = \int g\,dπ. \] Possible approaches are explored for establishing appropriate versions of corresponding results in continuous time, under appropriate assumptions on the process $\{Φ(t)\}$ or on the function $g$.
Submitted 1 December, 2015;
originally announced December 2015.
-
Entropy bounds on abelian groups and the Ruzsa divergence
Authors:
Mokshay Madiman,
Ioannis Kontoyiannis
Abstract:
Over the past few years, a family of interesting new inequalities for the entropies of sums and differences of random variables has been developed by Ruzsa, Tao and others, motivated by analogous results in additive combinatorics. The present work extends these earlier results to the case of random variables taking values in $\mathbb{R}^n$ or, more generally, in arbitrary locally compact and Polish abelian groups. We isolate and study a key quantity, the Ruzsa divergence between two probability distributions, and we show that its properties can be used to extend the earlier inequalities to the present general setting. The new results established include several variations on the theme that the entropies of the sum and the difference of two independent random variables severely constrain each other. Although the setting is quite general, the results are already of interest (and new) for random vectors in $\mathbb{R}^n$. In that special case, quantitative bounds are provided for the stability of the equality conditions in the entropy power inequality; a reverse entropy power inequality for log-concave random vectors is proved; an information-theoretic analog of the Rogers-Shephard inequality for convex bodies is established; and it is observed that some of these results lead to new inequalities for the determinants of positive-definite matrices. Moreover, by considering the multiplicative subgroups of the complex plane, one obtains new inequalities for the differential entropies of products and ratios of nonzero, complex-valued random variables.
Submitted 26 October, 2015; v1 submitted 17 August, 2015;
originally announced August 2015.
-
Estimating the Directed Information and Testing for Causality
Authors:
Ioannis Kontoyiannis,
Maria Skoularidou
Abstract:
The problem of estimating the directed information rate between two discrete processes $\{X_n\}$ and $\{Y_n\}$ via the plug-in (or maximum-likelihood) estimator is considered. When the joint process $\{(X_n,Y_n)\}$ is a Markov chain of a given memory length, the plug-in estimator is shown to be asymptotically Gaussian and to converge at the optimal rate $O(1/\sqrt{n})$ under appropriate conditions; this is the first estimator that has been shown to achieve this rate. An important connection is drawn between the problem of estimating the directed information rate and that of performing a hypothesis test for the presence of causal influence between the two processes. Under fairly general conditions, the null hypothesis, which corresponds to the absence of causal influence, is equivalent to the requirement that the directed information rate be equal to zero. In that case a finer result is established, showing that the plug-in estimator converges at the faster rate $O(1/n)$ and that it is asymptotically $χ^2$-distributed. This is proved by showing that this estimator is equal to (a scalar multiple of) the classical likelihood ratio statistic for the above hypothesis test. Finally, it is noted that these results facilitate the design of an actual likelihood ratio test for the presence or absence of causal influence.
Submitted 31 March, 2016; v1 submitted 5 July, 2015;
originally announced July 2015.
-
Lossless Data Compression at Finite Blocklengths
Authors:
Ioannis Kontoyiannis,
Sergio Verdu
Abstract:
This paper provides an extensive study of the behavior of the best achievable rate (and other related fundamental limits) in variable-length lossless compression. In the non-asymptotic regime, the fundamental limits of fixed-to-variable lossless compression with and without prefix constraints are shown to be tightly coupled. Several precise, quantitative bounds are derived, connecting the distribution of the optimal codelengths to the source information spectrum, and an exact analysis of the best achievable rate for arbitrary sources is given.
Fine asymptotic results are proved for arbitrary (not necessarily prefix) compressors on general mixing sources. Non-asymptotic, explicit Gaussian approximation bounds are established for the best achievable rate on Markov sources. The source dispersion and the source varentropy rate are defined and characterized. Together with the entropy rate, the varentropy rate serves to tightly approximate the fundamental non-asymptotic limits of fixed-to-variable compression for all but very small blocklengths.
Submitted 11 December, 2012;
originally announced December 2012.
-
Sumset and Inverse Sumset Inequalities for Differential Entropy and Mutual Information
Authors:
Ioannis Kontoyiannis,
Mokshay Madiman
Abstract:
The sumset and inverse sumset theories of Freiman, Plünnecke and Ruzsa give bounds connecting the cardinality of the sumset $A+B=\{a+b\;;\;a\in A,\,b\in B\}$ of two discrete sets $A,B$ to the cardinalities (or the finer structure) of the original sets $A,B$. For example, the sum-difference bound of Ruzsa states that $|A+B|\,|A|\,|B|\leq|A-B|^3$, where the difference set is $A-B= \{a-b\;;\;a\in A,\,b\in B\}$. Interpreting the differential entropy $h(X)$ of a continuous random variable $X$ as (the logarithm of) the size of the effective support of $X$, the main contribution of this paper is a series of natural information-theoretic analogs for these results. For example, the Ruzsa sum-difference bound becomes the new inequality $h(X+Y)+h(X)+h(Y)\leq 3h(X-Y)$, for any pair of independent continuous random variables $X$ and $Y$. Our results include differential-entropy versions of Ruzsa's triangle inequality, the Plünnecke-Ruzsa inequality, and the Balog-Szemerédi-Gowers lemma. We also give a differential entropy version of the Freiman-Green-Ruzsa inverse-sumset theorem, which can be seen as a quantitative converse to the entropy power inequality. Versions of most of these results for the discrete entropy $H(X)$ were recently proved by Tao, relying heavily on a strong, functional form of the submodularity property of $H(X)$. Since differential entropy is {\em not} functionally submodular, in the continuous case many of the corresponding discrete proofs fail, in many cases requiring substantially new proof strategies. We find that the basic property that naturally replaces the discrete functional submodularity is the data processing property of mutual information.
Submitted 3 June, 2012;
originally announced June 2012.
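The discrete sum-difference bound of Ruzsa quoted in the abstract above, $|A+B|\,|A|\,|B|\leq|A-B|^3$, can be spot-checked numerically on small integer sets. The following is only an illustrative sketch and is not taken from the paper: the helper names (`sumset`, `diffset`, `ruzsa_gap`), the random seed, and the choice of sets drawn from `range(50)` are all assumptions made for this example.

```python
import random


def sumset(a, b):
    """Return A+B = {x + y : x in A, y in B}."""
    return {x + y for x in a for y in b}


def diffset(a, b):
    """Return A-B = {x - y : x in A, y in B}."""
    return {x - y for x in a for y in b}


def ruzsa_gap(a, b):
    """Return |A-B|^3 - |A+B| |A| |B|; the bound asserts this is >= 0."""
    return len(diffset(a, b)) ** 3 - len(sumset(a, b)) * len(a) * len(b)


# Hypothetical experiment: random nonempty subsets of {0, ..., 49}.
rng = random.Random(0)
gaps = []
for _ in range(200):
    a = set(rng.sample(range(50), rng.randint(1, 10)))
    b = set(rng.sample(range(50), rng.randint(1, 10)))
    gaps.append(ruzsa_gap(a, b))

all_hold = all(g >= 0 for g in gaps)
print("bound held in every trial:", all_hold)
```

A random search like this can only fail to find a counterexample; it says nothing about the differential-entropy analog, which is the actual subject of the paper.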
-
Control Variates for Reversible MCMC Samplers
Authors:
Petros Dellaportas,
Ioannis Kontoyiannis
Abstract:
A general methodology is introduced for the construction and effective application of control variates to estimation problems involving data from reversible MCMC samplers. We propose the use of a specific class of functions as control variates, and we introduce a new, consistent estimator for the values of the coefficients of the optimal linear combination of these functions. The form and proposed construction of the control variates are derived from our solution of the Poisson equation associated with a specific MCMC scenario. The new estimator, which can be applied to the same MCMC sample, is derived from a novel, finite-dimensional, explicit representation for the optimal coefficients. The resulting variance-reduction methodology is primarily applicable when the simulated data are generated by a conjugate random-scan Gibbs sampler. MCMC examples of Bayesian inference problems demonstrate that the corresponding reduction in the estimation variance is significant, and that in some cases it can be quite dramatic. Extensions of this methodology in several directions are given, including certain families of Metropolis-Hastings samplers and hybrid Metropolis-within-Gibbs algorithms. Corresponding simulation examples are presented illustrating the utility of the proposed methods. All methodological and asymptotic arguments are rigorously justified under easily verifiable and essentially minimal conditions.
Submitted 7 August, 2010;
originally announced August 2010.
-
Compound Poisson Approximation via Information Functionals
Authors:
A. D. Barbour,
Oliver Johnson,
Ioannis Kontoyiannis,
Mokshay Madiman
Abstract:
An information-theoretic development is given for the problem of compound Poisson approximation, which parallels earlier treatments for Gaussian and Poisson approximation. Let $P_{S_n}$ be the distribution of a sum $S_n=\sum_{i=1}^n Y_i$ of independent integer-valued random variables $Y_i$. Nonasymptotic bounds are derived for the distance between $P_{S_n}$ and an appropriately chosen compound Poisson law. In the case where all $Y_i$ have the same conditional distribution given $\{Y_i\neq 0\}$, a bound on the relative entropy distance between $P_{S_n}$ and the compound Poisson distribution is derived, based on the data-processing property of relative entropy and earlier Poisson approximation results. When the $Y_i$ have arbitrary distributions, corresponding bounds are derived in terms of the total variation distance. The main technical ingredient is the introduction of two "information functionals," and the analysis of their properties. These information functionals play a role analogous to that of the classical Fisher information in normal approximation. Detailed comparisons are made between the resulting inequalities and related bounds.
Submitted 21 April, 2010;
originally announced April 2010.
-
Log-concavity, ultra-log-concavity, and a maximum entropy property of discrete compound Poisson measures
Authors:
Oliver Johnson,
Ioannis Kontoyiannis,
Mokshay Madiman
Abstract:
Sufficient conditions are developed, under which the compound Poisson distribution has maximal entropy within a natural class of probability measures on the nonnegative integers. Recently, one of the authors [O. Johnson, {\em Stoch. Proc. Appl.}, 2007] used a semigroup approach to show that the Poisson has maximal entropy among all ultra-log-concave distributions with fixed mean. We show via a non-trivial extension of this semigroup approach that the natural analog of the Poisson maximum entropy property remains valid if the compound Poisson distributions under consideration are log-concave, but that it fails in general. A parallel maximum entropy result is established for the family of compound binomial measures. Sufficient conditions for compound distributions to be log-concave are discussed and applications to combinatorics are examined; new bounds are derived on the entropy of the cardinality of a random independent set in a claw-free graph, and a connection is drawn to Mason's conjecture for matroids. The present results are primarily motivated by the desire to provide an information-theoretic foundation for compound Poisson approximation and associated limit theorems, analogous to the corresponding developments for the central limit theorem and for Poisson approximation. Our results also demonstrate new links between some probabilistic methods and the combinatorial notions of log-concavity and ultra-log-concavity, and they add to the growing body of work exploring the applications of maximum entropy characterizations to problems in discrete mathematics.
Submitted 27 September, 2011; v1 submitted 3 December, 2009;
originally announced December 2009.
-
Notes on Using Control Variates for Estimation with Reversible MCMC Samplers
Authors:
Ioannis Kontoyiannis,
Petros Dellaportas
Abstract:
A general methodology is presented for the construction and effective use of control variates for reversible MCMC samplers. The values of the coefficients of the optimal linear combination of the control variates are computed, and adaptive, consistent MCMC estimators are derived for these optimal coefficients. All methodological and asymptotic arguments are rigorously justified. Numerous MCMC simulation examples from Bayesian inference applications demonstrate that the resulting variance reduction can be quite dramatic.
Submitted 4 May, 2010; v1 submitted 24 July, 2009;
originally announced July 2009.
-
Geometric Ergodicity and the Spectral Gap of Non-Reversible Markov Chains
Authors:
Ioannis Kontoyiannis,
Sean P. Meyn
Abstract:
We argue that the spectral theory of non-reversible Markov chains may often be more effectively cast within the framework of the naturally associated weighted-$L_\infty$ space $L_\infty^V$, instead of the usual Hilbert space $L_2=L_2(π)$, where $π$ is the invariant measure of the chain. This observation is, in part, based on the following results. A discrete-time Markov chain with values in a general state space is geometrically ergodic if and only if its transition kernel admits a spectral gap in $L_\infty^V$. If the chain is reversible, the same equivalence holds with $L_2$ in place of $L_\infty^V$, but in the absence of reversibility it fails: There are (necessarily non-reversible, geometrically ergodic) chains that admit a spectral gap in $L_\infty^V$ but not in $L_2$. Moreover, if a chain admits a spectral gap in $L_2$, then for any $h\in L_2$ there exists a Lyapunov function $V_h\in L_1$ such that $V_h$ dominates $h$ and the chain admits a spectral gap in $L_\infty^{V_h}$. The relationship between the size of the spectral gap in $L_\infty^V$ or $L_2$, and the rate at which the chain converges to equilibrium is also briefly discussed.
Submitted 29 June, 2009;
originally announced June 2009.
-
Thinning, Entropy and the Law of Thin Numbers
Authors:
Peter Harremoes,
Oliver Johnson,
Ioannis Kontoyiannis
Abstract:
Rényi's "thinning" operation on a discrete random variable is a natural discrete analog of the scaling operation for continuous random variables. The properties of thinning are investigated in an information-theoretic context, especially in connection with information-theoretic inequalities related to Poisson approximation results. The classical Binomial-to-Poisson convergence (sometimes referred to as the "law of small numbers") is seen to be a special case of a thinning limit theorem for convolutions of discrete distributions. A rate of convergence is provided for this limit, and nonasymptotic bounds are also established. This development parallels, in part, the development of Gaussian inequalities leading to the information-theoretic version of the central limit theorem. In particular, a "thinning Markov chain" is introduced, and it is shown to play a role analogous to that of the Ornstein-Uhlenbeck process in connection to the entropy power inequality.
Submitted 3 June, 2009;
originally announced June 2009.
-
Approximating a Diffusion by a Hidden Markov Model
Authors:
Ioannis Kontoyiannis,
Sean P. Meyn
Abstract:
For a wide class of continuous-time Markov processes, including all irreducible hypoelliptic diffusions evolving on an open, connected subset of $\mathbb{R}^d$, the following are shown to be equivalent: (i) The process satisfies (a slightly weaker version of) the classical Donsker-Varadhan conditions; (ii) The transition semigroup of the process can be approximated by a finite-state hidden Markov model, in a strong sense in terms of an associated operator norm; (iii) The resolvent kernel of the process is "$v$-separable", that is, it can be approximated arbitrarily well in operator norm by finite-rank kernels. Under any (hence all) of the above conditions, the Markov process is shown to have a purely discrete spectrum on a naturally associated weighted $L_\infty$ space.
Submitted 25 April, 2016; v1 submitted 1 June, 2009;
originally announced June 2009.
-
On the entropy and log-concavity of compound Poisson measures
Authors:
Oliver Johnson,
Ioannis Kontoyiannis,
Mokshay Madiman
Abstract:
Motivated, in part, by the desire to develop an information-theoretic foundation for compound Poisson approximation limit theorems (analogous to the corresponding developments for the central limit theorem and for simple Poisson approximation), this work examines sufficient conditions under which the compound Poisson distribution has maximal entropy within a natural class of probability measures on the nonnegative integers. We show that the natural analog of the Poisson maximum entropy property remains valid if the measures under consideration are log-concave, but that it fails in general. A parallel maximum entropy result is established for the family of compound binomial measures. The proofs are largely based on ideas related to the semigroup approach introduced in recent work by Johnson for the Poisson family. Sufficient conditions are given for compound distributions to be log-concave, and specific examples are presented illustrating all the above results.
Submitted 27 May, 2008;
originally announced May 2008.
-
Estimating the entropy of binary time series: Methodology, some theory and a simulation study
Authors:
Y. Gao,
I. Kontoyiannis,
E. Bienenstock
Abstract:
Partly motivated by entropy-estimation problems in neuroscience, we present a detailed and extensive comparison between some of the most popular and effective entropy estimation methods used in practice: The plug-in method, four different estimators based on the Lempel-Ziv (LZ) family of data compression algorithms, an estimator based on the Context-Tree Weighting (CTW) method, and the renewal entropy estimator.
Methodology: Three new entropy estimators are introduced. For two of the four LZ-based estimators, a bootstrap procedure is described for evaluating their standard error, and a practical rule of thumb is heuristically derived for selecting the values of their parameters.
Theory: We prove that, unlike their earlier versions, the two new LZ-based estimators are consistent for every finite-valued, stationary and ergodic process. An effective method is derived for the accurate approximation of the entropy rate of a finite-state HMM with known distribution. Heuristic calculations are presented and approximate formulas are derived for evaluating the bias and the standard error of each estimator.
Simulation: All estimators are applied to a wide range of data generated by numerous different processes with varying degrees of dependence and memory. Some conclusions drawn from these experiments include: (i) For all estimators considered, the main source of error is the bias. (ii) The CTW method is repeatedly and consistently seen to provide the most accurate results. (iii) The performance of the LZ-based estimators is often comparable to that of the plug-in method. (iv) The main drawback of the plug-in method is its computational inefficiency.
Submitted 29 February, 2008;
originally announced February 2008.
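As a rough illustration of the plug-in method named in the abstract above, the sketch below estimates an entropy rate from the empirical distribution of length-$w$ blocks of a binary sequence. The abstract does not spell out the estimator, so everything here is an assumption made for illustration: the function name `plugin_entropy_rate`, the use of overlapping blocks, and the normalization by the block length `w` are one common textbook variant and need not match the paper's exact definition.

```python
from collections import Counter
from math import log2


def plugin_entropy_rate(bits, w):
    """Plug-in entropy-rate estimate in bits per symbol.

    Computes the Shannon entropy of the empirical distribution of
    overlapping length-w blocks and divides by w (assumed variant).
    """
    if w < 1 or len(bits) < w:
        raise ValueError("need 1 <= w <= len(bits)")
    blocks = [tuple(bits[i:i + w]) for i in range(len(bits) - w + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    block_entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return block_entropy / w


# A constant sequence carries no information; a balanced sequence
# looks like one bit per symbol when viewed through blocks of length 1.
constant_rate = plugin_entropy_rate([0] * 20, 3)
alternating_rate = plugin_entropy_rate([0, 1] * 10, 1)
print(constant_rate, alternating_rate)
```

The alternating sequence shows why the block length matters: with `w = 1` the estimate is one bit per symbol even though the sequence is perfectly predictable, and larger blocks bring the estimate down.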
-
From the entropy to the statistical structure of spike trains
Authors:
Yun Gao,
Ioannis Kontoyiannis,
Elie Bienenstock
Abstract:
We use statistical estimates of the entropy rate of spike train data in order to make inferences about the underlying structure of the spike train itself. We first examine a number of different parametric and nonparametric estimators (some known and some new), including the ``plug-in'' method, several versions of Lempel-Ziv-based compression algorithms, a maximum likelihood estimator tailored to renewal processes, and the natural estimator derived from the Context-Tree Weighting method (CTW). The theoretical properties of these estimators are examined, several new theoretical results are developed, and all estimators are systematically applied to various types of synthetic data and under different conditions.
Our main focus is on the performance of these entropy estimators on the (binary) spike trains of 28 neurons recorded simultaneously for a one-hour period from the primary motor and dorsal premotor cortices of a monkey. We show how the entropy estimates can be used to test for the existence of long-term structure in the data, and we construct a hypothesis test for whether the renewal process model is appropriate for these spike trains. Further, by applying the CTW algorithm we derive the maximum a posteriori (MAP) tree model of our empirical data, and comment on the underlying structure it reveals.
Submitted 27 March, 2008; v1 submitted 22 October, 2007;
originally announced October 2007.
-
Some information-theoretic computations related to the distribution of prime numbers
Authors:
Ioannis Kontoyiannis
Abstract:
We illustrate how elementary information-theoretic ideas may be employed to provide proofs for well-known, nontrivial results in number theory. Specifically, we give an elementary and fairly short proof of the following asymptotic result: The sum of (log p)/p, taken over all primes p not exceeding n, is asymptotic to log n as n tends to infinity. We also give finite-n bounds refining the above limit. This result, originally proved by Chebyshev in 1852, is closely related to the celebrated prime number theorem.
Submitted 5 November, 2007; v1 submitted 22 October, 2007;
originally announced October 2007.
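The asymptotic statement in the abstract above, that the sum of (log p)/p over primes p not exceeding n is asymptotic to log n, can be explored numerically. The sketch below is not from the paper: the helper names `primes_up_to` and `log_prime_sum` and the sample values of n are assumptions chosen for illustration. It uses a basic sieve of Eratosthenes and compares the sum with log n.

```python
from math import log


def primes_up_to(n):
    """Return all primes <= n using a sieve of Eratosthenes (n >= 2)."""
    if n < 2:
        return []
    sieve = bytearray([1]) * (n + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            # Mark every multiple of i from i*i onward as composite.
            sieve[i * i::i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i in range(2, n + 1) if sieve[i]]


def log_prime_sum(n):
    """Return the sum of (log p)/p over all primes p <= n."""
    return sum(log(p) / p for p in primes_up_to(n))


# Compare the sum with log n for a few hypothetical sample sizes.
for n in (10**3, 10**4, 10**5):
    s = log_prime_sum(n)
    print(n, round(s, 4), round(log(n), 4), round(s / log(n), 4))
```

In runs like this the ratio approaches 1 only slowly, because the difference between the sum and log n stays bounded rather than vanishing; the finite-n bounds mentioned in the abstract refine this limit.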
-
Estimation of the Rate-Distortion Function
Authors:
M. T. Harrison,
I. Kontoyiannis
Abstract:
Motivated by questions in lossy data compression and by theoretical considerations, we examine the problem of estimating the rate-distortion function of an unknown (not necessarily discrete-valued) source from empirical data. Our focus is the behavior of the so-called "plug-in" estimator, which is simply the rate-distortion function of the empirical distribution of the observed data. Sufficient conditions are given for its consistency, and examples are provided to demonstrate that in certain cases it fails to converge to the true rate-distortion function. The analysis of its performance is complicated by the fact that the rate-distortion function is not continuous in the source distribution; the underlying mathematical problem is closely related to the classical problem of establishing the consistency of maximum likelihood estimators. General consistency results are given for the plug-in estimator applied to a broad class of sources, including all stationary and ergodic ones. A more general class of estimation problems is also considered, arising in the context of lossy data compression when the allowed class of coding distributions is restricted; analogous results are developed for the plug-in estimator in that case. Finally, consistency theorems are formulated for modified (e.g., penalized) versions of the plug-in, and for estimating the optimal reproduction distribution.
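For a finite alphabet, the plug-in estimator described above can be sketched by forming the empirical distribution of the data and computing its rate-distortion function numerically. The sketch below uses the Blahut-Arimoto iteration for that second step; this choice of algorithm, the function name `plugin_rate_distortion`, the slope parameter `beta`, and the toy data are all assumptions of this example and are not taken from the paper.

```python
from collections import Counter
from math import exp, log


def plugin_rate_distortion(data, alphabet, distortion, beta, iters=500):
    """Return one point (R, D), in nats, on the rate-distortion curve of the
    empirical distribution of `data`, at slope -beta (Blahut-Arimoto)."""
    n = len(data)
    counts = Counter(data)
    p = {x: counts[x] / n for x in counts}            # empirical source distribution
    q = {y: 1.0 / len(alphabet) for y in alphabet}    # reproduction marginal, uniform start
    for _ in range(iters):
        cond = {}
        for x in p:
            w = {y: q[y] * exp(-beta * distortion(x, y)) for y in alphabet}
            z = sum(w.values())
            cond[x] = {y: w[y] / z for y in alphabet}  # test channel Q(y|x)
        q = {y: sum(p[x] * cond[x][y] for x in p) for y in alphabet}
    D = sum(p[x] * cond[x][y] * distortion(x, y) for x in p for y in alphabet)
    R = sum(p[x] * cond[x][y] * log(cond[x][y] / q[y])
            for x in p for y in alphabet if cond[x][y] > 0)
    return R, D


def hamming(x, y):
    return 0.0 if x == y else 1.0


# Toy data whose empirical distribution is Bernoulli(0.3) (assumed for illustration).
data = [0] * 70 + [1] * 30
R, D = plugin_rate_distortion(data, [0, 1], hamming, beta=2.0)
print(R, D)
```

For a binary source with Hamming distortion the result can be compared against the closed form h(p) - h(D), which gives a quick sanity check on the iteration.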
Submitted 11 April, 2008; v1 submitted 2 February, 2007;
originally announced February 2007.
-
Computable exponential bounds for screened estimation and simulation
Authors:
Ioannis Kontoyiannis,
Sean P. Meyn
Abstract:
Suppose the expectation $E(F(X))$ is to be estimated by the empirical averages of the values of $F$ on independent and identically distributed samples $\{X_i\}$. A sampling rule called the "screened" estimator is introduced, and its performance is studied. When the mean $E(U(X))$ of a different function $U$ is known, the estimates are "screened," in that we only consider those which correspond to times when the empirical average of the $\{U(X_i)\}$ is sufficiently close to its known mean. As long as $U$ dominates $F$ appropriately, the screened estimates admit exponential error bounds, even when $F(X)$ is heavy-tailed. The main results are several nonasymptotic, explicit exponential bounds for the screened estimates. A geometric interpretation, in the spirit of Sanov's theorem, is given for the fact that the screened estimates always admit exponential error bounds, even if the standard estimates do not. And when they do, the screened estimates' error probability has a significantly better exponent. This implies that screening can be interpreted as a variance reduction technique. Our main mathematical tools come from large deviations techniques. The results are illustrated by a detailed simulation example.
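The screening rule described above can be sketched as follows: running averages of $F$ are kept only at those sample sizes where the running average of $U$ lies within a tolerance of its known mean. The function name `screened_estimates`, the tolerance parameter `eps`, and the toy inputs are hypothetical; the paper's precise definition of "sufficiently close" may differ.

```python
def screened_estimates(samples, F, U, u_mean, eps):
    """Return (n, average of F over the first n samples) for every n at which
    the empirical average of U is within eps of its known mean u_mean."""
    kept = []
    f_sum = 0.0
    u_sum = 0.0
    for n, x in enumerate(samples, start=1):
        f_sum += F(x)
        u_sum += U(x)
        if abs(u_sum / n - u_mean) <= eps:   # screening condition (assumed form)
            kept.append((n, f_sum / n))
    return kept


# Toy usage: F(x) = x^2, U(x) = x with known mean 2 (all values assumed).
result = screened_estimates([1, 3, 2, 2], lambda x: x * x, lambda x: x,
                            u_mean=2.0, eps=0.1)
print(result)
```

The estimate at n = 1 is discarded because the running average of U is still far from its mean; the later estimates pass the screen.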
Submitted 22 August, 2008; v1 submitted 1 December, 2006;
originally announced December 2006.
-
Mismatched codebooks and the role of entropy-coding in lossy data compression
Authors:
Ioannis Kontoyiannis,
Rami Zamir
Abstract:
We introduce a universal quantization scheme based on random coding, and we analyze its performance. This scheme consists of a source-independent random codebook (typically _mismatched_ to the source distribution), followed by optimal entropy-coding that is _matched_ to the quantized codeword distribution. A single-letter formula is derived for the rate achieved by this scheme at a given distortion, in the limit of large codebook dimension. The rate reduction due to entropy-coding is quantified, and it is shown that it can be arbitrarily large. In the special case of "almost uniform" codebooks (e.g., an i.i.d. Gaussian codebook with large variance) and difference distortion measures, a novel connection is drawn between the compression achieved by the present scheme and the performance of "universal" entropy-coded dithered lattice quantizers. This connection generalizes the "half-a-bit" bound on the redundancy of dithered lattice quantizers. Moreover, it demonstrates a strong notion of universality where a single "almost uniform" codebook is near-optimal for _any_ source and _any_ difference distortion measure.
Submitted 2 November, 2005;
originally announced November 2005.
-
Large deviations asymptotics and the spectral theory of multiplicatively regular Markov processes
Authors:
Ioannis Kontoyiannis,
S. P. Meyn
Abstract:
We continue the investigation of the spectral theory and exponential asymptotics of Markov processes, following Kontoyiannis and Meyn (2003). We introduce a new family of nonlinear Lyapunov drift criteria, characterizing distinct subclasses of geometrically ergodic Markov processes in terms of inequalities for the nonlinear generator. We concentrate on the class of "multiplicatively regular" Markov processes, characterized via conditions similar to (but weaker than) those of Donsker-Varadhan. For any such process {Phi(t)} with transition kernel P on a general state space, the following are obtained. 1. SPECTRAL THEORY: For a large class of functionals F, the kernel Phat(x,dy) = e^{F(x)}P(x,dy) has a discrete spectrum in an appropriately defined Banach space. There exists a "maximal" solution to the "multiplicative Poisson equation," defined as the eigenvalue problem for Phat. Regularity properties are established for Λ(F) = \log(λ), where λ is the maximal eigenvalue, and for its convex dual. 2. MULTIPLICATIVE MEAN ERGODIC THEOREM: The normalized mean E_x[\exp(S_t)] of the exponential of the partial sums {S_t} of the process with respect to any one of the above functionals F, converges to the maximal eigenfunction. 3. MULTIPLICATIVE REGULARITY: The drift criterion under which our results are derived is equivalent to the existence of regeneration times with finite exponential moments for {S_t}. 4. LARGE DEVIATIONS: The sequence of empirical measures of {Phi(t)} satisfies an LDP in a topology finer than the τ-topology. The rate function is Λ^* and it coincides with the Donsker-Varadhan rate function. 5. EXACT LARGE DEVIATIONS: The partial sums {S_t} satisfy an exact LD expansion, analogous to that obtained for independent random variables.
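On a finite state space the objects in items 1 and 2 reduce to elementary linear algebra, which the following sketch illustrates for a two-state chain. The transition matrix `P`, the functional `F`, and the convention S_t = F(Phi(0)) + ... + F(Phi(t-1)) are assumptions chosen for this example; the paper's setting is a general state space, where none of this is automatic.

```python
from math import exp, sqrt

P = [[0.9, 0.1], [0.4, 0.6]]   # hypothetical two-state transition matrix
F = [1.0, -0.5]                # hypothetical functional on the two states

# Twisted kernel Phat(x, y) = e^{F(x)} P(x, y).
Phat = [[exp(F[x]) * P[x][y] for y in range(2)] for x in range(2)]

# Maximal eigenvalue of the 2x2 matrix Phat and a matching right eigenvector.
(a, b), (c, d) = Phat
lam = ((a + d) + sqrt((a + d) ** 2 - 4 * (a * d - b * c))) / 2
eigvec = (b, lam - a)


def normalized_mean(t):
    """Return lam^{-t} * E_x[exp(S_t)] for the two starting states x = 0, 1."""
    v = [1.0, 1.0]
    for _ in range(t):
        # One application of Phat, rescaled by lam to avoid overflow.
        v = [(Phat[x][0] * v[0] + Phat[x][1] * v[1]) / lam for x in range(2)]
    return v


h = normalized_mean(60)
print(h[0] / h[1], eigvec[0] / eigvec[1])
```

The two printed ratios agree, reflecting that the normalized mean converges to a multiple of the maximal eigenfunction of Phat in this toy case.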
Submitted 14 September, 2005;
originally announced September 2005.
-
Measure Concentration for Compound Poisson Distributions
Authors:
I. Kontoyiannis,
M. Madiman
Abstract:
We give a simple development of the concentration properties of compound Poisson measures on the nonnegative integers. A new modification of the Herbst argument is applied to an appropriate modified logarithmic-Sobolev inequality to derive new concentration bounds. When the measure of interest does not have finite exponential moments, these bounds exhibit optimal polynomial decay. Simple new proofs are also given for earlier results of Houdré (2002) and Wu (2000).
Submitted 21 June, 2005;
originally announced June 2005.
-
Entropy and the Law of Small Numbers
Authors:
Ioannis Kontoyiannis,
Peter Harremoes,
Oliver Johnson
Abstract:
Two new information-theoretic methods are introduced for establishing Poisson approximation inequalities. First, using only elementary information-theoretic techniques it is shown that, when $S_n=\sum_{i=1}^n X_i$ is the sum of the (possibly dependent) binary random variables $X_1,X_2,\ldots,X_n$, with $E(X_i)=p_i$ and $E(S_n)=\lambda$, then $$ D(P_{S_n}\|\mathrm{Po}(\lambda)) \leq \sum_{i=1}^n p_i^2 + \Big[\sum_{i=1}^n H(X_i) - H(X_1,X_2,\ldots,X_n)\Big], $$ where $D(P_{S_n}\|\mathrm{Po}(\lambda))$ is the relative entropy between the distribution of $S_n$ and the Poisson($\lambda$) distribution. The first term in this bound measures the individual smallness of the $X_i$ and the second term measures their dependence. A general method is outlined for obtaining corresponding bounds when approximating the distribution of a sum of general discrete random variables by an infinitely divisible distribution.
Second, in the particular case when the $X_i$ are independent, the following sharper bound is established, $$ D(P_{S_n}\|\mathrm{Po}(\lambda)) \leq \frac{1}{\lambda} \sum_{i=1}^n \frac{p_i^3}{1-p_i}, $$ and it is also generalized to the case when the $X_i$ are general integer-valued random variables. Its proof is based on the derivation of a subadditivity property for a new discrete version of the Fisher information, and uses a recent logarithmic Sobolev inequality for the Poisson distribution.
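Both bounds can be checked numerically in the independent case, where the bracketed dependence term of the first bound vanishes. The sketch below computes the exact distribution of a sum of independent Bernoulli variables, its relative entropy (in nats) to the Poisson distribution with the same mean, and the two bounds. All function names and the example probabilities are assumptions made for this illustration.

```python
from math import exp, factorial, log


def sum_pmf(ps):
    """Exact distribution of a sum of independent Bernoulli(p_i) variables."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)
            nxt[k + 1] += mass * p
        pmf = nxt
    return pmf


def relative_entropy_to_poisson(ps):
    """D(P_{S_n} || Po(lambda)) in nats, with lambda equal to the sum of the p_i."""
    lam = sum(ps)
    total = 0.0
    for k, mass in enumerate(sum_pmf(ps)):
        if mass > 0:
            poisson = exp(-lam) * lam ** k / factorial(k)
            total += mass * log(mass / poisson)
    return total


def simple_bound(ps):
    """First bound for independent X_i: the sum of p_i^2."""
    return sum(p * p for p in ps)


def sharp_bound(ps):
    """Second bound: (1/lambda) times the sum of p_i^3 / (1 - p_i)."""
    lam = sum(ps)
    return sum(p ** 3 / (1 - p) for p in ps) / lam


ps = [0.1] * 10   # example success probabilities (assumed)
print(relative_entropy_to_poisson(ps), sharp_bound(ps), simple_bound(ps))
```

In this example the relative entropy falls below the sharper bound, which in turn falls below the simpler one.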
Submitted 17 November, 2004; v1 submitted 1 November, 2002;
originally announced November 2002.
-
A remark on unified error exponents: Hypothesis testing, data compression and measure concentration
Authors:
Ioannis Kontoyiannis,
Ali Devin Sezer
Abstract:
Let A be a finite set equipped with a probability distribution P, and let M be a "mass" function on A. A characterization is given for the most efficient way in which A^n can be covered using spheres of a fixed radius. A covering is a subset C_n of A^n with the property that most of the elements of A^n are within some fixed distance from at least one element of C_n, and "most of the elements" means a set whose probability is exponentially close to one (with respect to the product distribution P^n). An efficient covering is one with small mass M^n(C_n). With different choices for the geometry on A, this characterization gives various corollaries as special cases, including Marton's error-exponents theorem in lossy data compression, Hoeffding's optimal hypothesis testing exponents, and a new sharp converse to some measure concentration inequalities on discrete spaces.
Submitted 3 October, 2002;
originally announced October 2002.
-
Steady state analysis of balanced-allocation routing
Authors:
Aris Anagnostopoulos,
Ioannis Kontoyiannis,
Eli Upfal
Abstract:
We compare the long-term, steady-state performance of a variant of the standard Dynamic Alternative Routing (DAR) technique commonly used in telephone and ATM networks, to the performance of a path-selection algorithm based on the "balanced-allocation" principle; we refer to this new algorithm as the Balanced Dynamic Alternative Routing (BDAR) algorithm. While DAR checks alternative routes sequentially until available bandwidth is found, the BDAR algorithm compares and chooses the best among a small number of alternatives.
We show that, at the expense of a minor increase in routing overhead, the BDAR algorithm gives a substantial improvement in network performance, in terms both of network congestion and of bandwidth requirement.
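The contrast between the two selection rules can be sketched in a few lines. In this toy version, "best" is taken to mean the alternative with the most free bandwidth, which is an assumption of this example; the abstract does not state the exact criterion, and the function names and data below are hypothetical.

```python
def dar_choice(free_bandwidth, candidates, demand):
    """DAR-style rule: scan the alternatives in order and take the first
    one with enough free bandwidth; None if no alternative fits."""
    for route in candidates:
        if free_bandwidth[route] >= demand:
            return route
    return None


def bdar_choice(free_bandwidth, candidates, demand):
    """BDAR-style rule: compare a small set of alternatives and take the one
    with the most free bandwidth (assumed meaning of 'best')."""
    best = max(candidates, key=lambda route: free_bandwidth[route])
    return best if free_bandwidth[best] >= demand else None


# Toy network state: free bandwidth per alternative route (assumed values).
free = {"a": 2, "b": 5, "c": 3}
print(dar_choice(free, ["a", "b", "c"], demand=1))
print(bdar_choice(free, ["a", "b", "c"], demand=1))
```

The sequential rule settles for the first route that fits, while the comparison rule steers the call to the least loaded of the candidates, which is the balancing effect the abstract refers to.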
Submitted 25 September, 2002;
originally announced September 2002.
-
The ODE Method and Spectral Theory of Markov Operators
Authors:
J. Huang,
I. Kontoyiannis,
S. P. Meyn
Abstract:
We give a development of the ODE method for the analysis of recursive algorithms described by a stochastic recursion. With variability modelled via an underlying Markov process, and under general assumptions, the following results are obtained: 1. Stability of an associated ODE implies that the stochastic recursion is stable in a strong sense when a gain parameter is small. 2. The range of gain-values is quantified through a spectral analysis of an associated linear operator, providing a non-local theory. 3. A second-order analysis shows precisely how variability leads to sensitivity of the algorithm with respect to the gain parameter.
All results are obtained within the natural operator-theoretic framework of geometrically ergodic Markov processes.
Submitted 20 September, 2002;
originally announced September 2002.
-
Spectral Theory and Limit Theorems for Geometrically Ergodic Markov Processes
Authors:
Ioannis Kontoyiannis,
Sean Meyn
Abstract:
Consider the partial sums {S_t} of a real-valued functional F(Phi(t)) of a Markov chain {Phi(t)} with values in a general state space. Assuming only that the Markov chain is geometrically ergodic and that the functional F is bounded, the following conclusions are obtained:
1. Spectral theory: Well-behaved solutions can be constructed for the ``multiplicative Poisson equation''.
2. A ``multiplicative'' mean ergodic theorem: For all complex α in a neighborhood of the origin, the normalized mean of \exp(α S_t) converges exponentially fast to a solution of the multiplicative Poisson equation.
3. Edgeworth Expansions: Rates are obtained for the convergence of the distribution function of the normalized partial sums S_t to the standard Gaussian distribution.
4. Large Deviations: The partial sums are shown to satisfy a large deviations principle in a neighborhood of the mean. This result, proved under geometric ergodicity alone, cannot in general be extended to the whole real line.
5. Exact Large Deviations Asymptotics: Rates of convergence are obtained for the large deviations estimates above.
Extensions of these results to continuous-time Markov processes are also given.
Submitted 16 September, 2002;
originally announced September 2002.
-
Source Coding, Large Deviations, and Approximate Pattern Matching
Authors:
A. Dembo,
I. Kontoyiannis
Abstract:
We present a development of parts of rate-distortion theory and pattern-matching algorithms for lossy data compression, centered around a lossy version of the Asymptotic Equipartition Property (AEP). This treatment closely parallels the corresponding development in lossless compression, a point of view that was advanced in an important paper of Wyner and Ziv in 1989. In the lossless case we review how the AEP underlies the analysis of the Lempel-Ziv algorithm by viewing it as a random code and reducing it to the idealized Shannon code. This also provides information about the redundancy of the Lempel-Ziv algorithm and about the asymptotic behavior of several relevant quantities. In the lossy case we give various versions of the statement of the generalized AEP and we outline the general methodology of its proof via large deviations. Its relationship with Barron's generalized AEP is also discussed. The lossy AEP is applied to: (i) prove strengthened versions of Shannon's source coding theorem and universal coding theorems; (ii) characterize the performance of mismatched codebooks; (iii) analyze the performance of pattern-matching algorithms for lossy compression; (iv) determine the first order asymptotics of waiting times (with distortion) between stationary processes; (v) characterize the best achievable rate of weighted codebooks as an optimal sphere-covering exponent. We then present a refinement to the lossy AEP and use it to: (i) prove second order coding theorems; (ii) characterize which sources are easier to compress; (iii) determine the second order asymptotics of waiting times; (iv) determine the precise asymptotic behavior of longest match-lengths. Extensions to random fields are also given.
Submitted 1 March, 2001;
originally announced March 2001.
-
Critical Behavior in Lossy Source Coding
Authors:
Amir Dembo,
Ioannis Kontoyiannis
Abstract:
The following critical phenomenon was recently discovered. When a memoryless source is compressed using a variable-length fixed-distortion code, the fastest convergence rate of the (pointwise) compression ratio to the optimal $R(D)$ bits/symbol is either $O(\sqrt{n})$ or $O(\log n)$. We show it is always $O(\sqrt{n})$, except for discrete, uniformly distributed sources.
Submitted 1 September, 2000;
originally announced September 2000.
-
Efficient sphere-covering and converse measure concentration via generalized coding theorems
Authors:
Ioannis Kontoyiannis
Abstract:
Suppose A is a finite set equipped with a probability measure P and let M be a ``mass'' function on A. We give a probabilistic characterization of the most efficient way in which A^n can be almost-covered using spheres of a fixed radius. An almost-covering is a subset C_n of A^n, such that the union of the spheres centered at the points of C_n has probability close to one with respect to the product measure P^n. An efficient covering is one with small mass M^n(C_n); n is typically large. With different choices for M and the geometry on A our results give various corollaries as special cases, including Shannon's data compression theorem, a version of Stein's lemma (in hypothesis testing), and a new converse to some measure concentration inequalities on discrete spaces. Under mild conditions, we generalize our results to abstract spaces and non-product measures.
Submitted 27 September, 2000; v1 submitted 12 October, 1999;
originally announced October 1999.