-
Finite-sample expansions for the optimal error probability in asymmetric binary hypothesis testing
Authors:
Valentinian Lungu,
Ioannis Kontoyiannis
Abstract:
The problem of binary hypothesis testing between two probability measures is considered. New sharp bounds are derived for the best achievable error probability of such tests based on independent and identically distributed observations. Specifically, the asymmetric version of the problem is examined, where different requirements are placed on the two error probabilities. Accurate nonasymptotic expansions with explicit constants are obtained for the error probability, using tools from large deviations and Gaussian approximation. Examples are shown indicating that, in the asymmetric regime, the approximations suggested by the new bounds are significantly more accurate than the approximations provided by either of the two main earlier approaches -- normal approximation and error exponents.
Submitted 29 May, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
Relative entropy bounds for sampling with and without replacement
Authors:
Oliver Johnson,
Lampros Gavalakis,
Ioannis Kontoyiannis
Abstract:
Sharp, nonasymptotic bounds are obtained for the relative entropy between the distributions of sampling with and without replacement from an urn with balls of $c\geq 2$ colors. Our bounds are asymptotically tight in certain regimes and, unlike previous results, they depend on the number of balls of each colour in the urn. The connection of these results with finite de Finetti-style theorems is explored, and it is observed that a sampling bound due to Stam (1978) combined with the convexity of relative entropy yields a new finite de Finetti bound in relative entropy, which achieves the optimal asymptotic convergence rate.
Submitted 9 April, 2024;
originally announced April 2024.
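The quantity studied in this abstract can be evaluated numerically in the simplest case. The sketch below is an illustration only, not the paper's bounds: it assumes two colours ($c=2$), in which case the relative entropy between the two sampling schemes reduces to that between the hypergeometric and binomial laws of the number of balls of the first colour in the sample. The function names and the parameters `n` (urn size), `ell` (balls of the first colour) and `k` (sample size) are hypothetical.

```python
from math import comb, log


def hypergeom_pmf(j, n, ell, k):
    """P(j balls of the first colour) in k draws WITHOUT replacement."""
    return comb(ell, j) * comb(n - ell, k - j) / comb(n, k)


def binom_pmf(j, k, p):
    """P(j balls of the first colour) in k draws WITH replacement."""
    return comb(k, j) * p**j * (1 - p) ** (k - j)


def relative_entropy_without_vs_with(n, ell, k):
    """D(without replacement || with replacement), in nats, for c = 2."""
    p = ell / n
    total = 0.0
    # Only counts j that are possible without replacement contribute.
    for j in range(max(0, k - (n - ell)), min(k, ell) + 1):
        h = hypergeom_pmf(j, n, ell, k)
        total += h * log(h / binom_pmf(j, k, p))
    return total


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, relative_entropy_without_vs_with(10, 5, k))
```

For a single draw the two schemes coincide, so the value is zero; it then grows with the sample size `k`, since a smaller sample is a marginal of a larger one.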
-
The entropic doubling constant and robustness of Gaussian codebooks for additive-noise channels
Authors:
Lampros Gavalakis,
Ioannis Kontoyiannis,
Mokshay Madiman
Abstract:
Entropy comparison inequalities are obtained for the differential entropy $h(X+Y)$ of the sum of two independent random vectors $X,Y$, when one is replaced by a Gaussian. For identically distributed random vectors $X,Y$, these are closely related to bounds on the entropic doubling constant, which quantifies the entropy increase when adding an independent copy of a random vector to itself. Consequences of both large and small doubling are explored. For the former, lower bounds are deduced on the entropy increase when adding an independent Gaussian, while for the latter, a qualitative stability result for the entropy power inequality is obtained. In the more general case of non-identically distributed random vectors $X,Y$, a Gaussian comparison inequality with interesting implications for channel coding is established: For additive-noise channels with a power constraint, Gaussian codebooks come within a $\frac{\sf snr}{3{\sf snr}+2}$ factor of capacity. In the low-SNR regime this improves the half-a-bit additive bound of Zamir and Erez (2004). Analogous results are obtained for additive-noise multiple access channels, and for linear, additive-noise MIMO channels.
Submitted 11 March, 2024;
originally announced March 2024.
-
The Bayesian Context Trees State Space Model for time series modelling and forecasting
Authors:
Ioannis Papageorgiou,
Ioannis Kontoyiannis
Abstract:
A hierarchical Bayesian framework is introduced for developing rich mixture models for real-valued time series, partly motivated by important applications in financial time series analysis. At the top level, meaningful discrete states are identified as appropriately quantised values of some of the most recent samples. These observable states are described as a discrete context-tree model. At the bottom level, a different, arbitrary model for real-valued time series -- a base model -- is associated with each state. This defines a very general framework that can be used in conjunction with any existing model class to build flexible and interpretable mixture models. We call this the Bayesian Context Trees State Space Model, or the BCT-X framework. Efficient algorithms are introduced that allow for effective, exact Bayesian inference and learning in this setting; in particular, the maximum a posteriori probability (MAP) context-tree model can be identified. These algorithms can be updated sequentially, facilitating efficient online forecasting. The utility of the general framework is illustrated in two particular instances: When autoregressive (AR) models are used as base models, resulting in a nonlinear AR mixture model, and when conditional heteroscedastic (ARCH) models are used, resulting in a mixture model that offers a powerful and systematic way of modelling the well-known volatility asymmetries in financial data. In forecasting, the BCT-X methods are found to outperform state-of-the-art techniques on simulated and real-world data, both in terms of accuracy and computational requirements. In modelling, the BCT-X structure finds natural structure present in the data. In particular, the BCT-ARCH model reveals a novel, important feature of stock market index data, in the form of an enhanced leverage effect.
Submitted 10 October, 2023; v1 submitted 1 August, 2023;
originally announced August 2023.
-
Temporally Causal Discovery Tests for Discrete Time Series and Neural Spike Trains
Authors:
A. Theocharous,
G. G. Gregoriou,
P. Sapountzis,
I. Kontoyiannis
Abstract:
We consider the problem of detecting causal relationships between discrete time series, in the presence of potential confounders. A hypothesis test is introduced for identifying the temporally causal influence of $(x_n)$ on $(y_n)$, causally conditioned on a possibly confounding third time series $(z_n)$. Under natural Markovian modeling assumptions, it is shown that the null hypothesis, corresponding to the absence of temporally causal influence, is equivalent to the underlying `causal conditional directed information rate' being equal to zero. The plug-in estimator for this functional is identified with the log-likelihood ratio test statistic for the desired test. This statistic is shown to be asymptotically normal under the alternative hypothesis and asymptotically $χ^2$ distributed under the null, facilitating the computation of $p$-values when used on empirical data. The effectiveness of the resulting hypothesis test is illustrated on simulated data, validating the underlying theory. The test is also employed in the analysis of spike train data recorded from neurons in the V4 and FEF brain regions of behaving animals during a visual attention task. There, the test results are seen to identify interesting and biologically relevant information.
Submitted 17 November, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Generalised shot noise representations of stochastic systems driven by non-Gaussian Lévy processes
Authors:
Marcos Tapia Costa,
Ioannis Kontoyiannis,
Simon Godsill
Abstract:
We consider the problem of obtaining effective representations for the solutions of linear, vector-valued stochastic differential equations (SDEs) driven by non-Gaussian pure-jump Lévy processes, and we show how such representations lead to efficient simulation methods. The processes considered constitute a broad class of models that find application across the physical and biological sciences, mathematics, finance and engineering. Motivated by important relevant problems in statistical inference, we derive new, generalised shot-noise simulation methods whenever a normal variance-mean (NVM) mixture representation exists for the driving Lévy process, including the generalised hyperbolic, normal-Gamma, and normal tempered stable cases. Simple, explicit conditions are identified for the convergence of the residual of a truncated shot-noise representation to a Brownian motion in the case of the pure Lévy process, and to a Brownian-driven SDE in the case of the Lévy-driven SDE. These results provide Gaussian approximations to the small jumps of the process under the NVM representation. The resulting representations are of particular importance in state inference and parameter estimation for Lévy-driven SDE models, since the resulting conditionally Gaussian structures can be readily incorporated into latent variable inference methods such as Markov chain Monte Carlo (MCMC), Expectation-Maximisation (EM), and sequential Monte Carlo.
Submitted 7 November, 2023; v1 submitted 10 May, 2023;
originally announced May 2023.
-
A Third Information-Theoretic Approach to Finite de Finetti Theorems
Authors:
Mario Berta,
Lampros Gavalakis,
Ioannis Kontoyiannis
Abstract:
A new finite form of de Finetti's representation theorem is established using elementary information-theoretic tools. The distribution of the first $k$ random variables in an exchangeable vector of $n\geq k$ random variables is close to a mixture of product distributions. Closeness is measured in terms of the relative entropy and an explicit bound is provided. This bound is tighter than those obtained via earlier information-theoretic proofs, and its utility extends to random variables taking values in general spaces. The core argument employed has its origins in the quantum information-theoretic literature.
Submitted 25 April, 2024; v1 submitted 11 April, 2023;
originally announced April 2023.
-
Truly Bayesian Entropy Estimation
Authors:
Ioannis Papageorgiou,
Ioannis Kontoyiannis
Abstract:
Estimating the entropy rate of discrete time series is a challenging problem with important applications in numerous areas including neuroscience, genomics, image processing and natural language processing. A number of approaches have been developed for this task, typically based either on universal data compression algorithms, or on statistical estimators of the underlying process distribution. In this work, we propose a fully-Bayesian approach for entropy estimation. Building on the recently introduced Bayesian Context Trees (BCT) framework for modelling discrete time series as variable-memory Markov chains, we show that it is possible to sample directly from the induced posterior on the entropy rate. This can be used to estimate the entire posterior distribution, providing much richer information than point estimates. We develop theoretical results for the posterior distribution of the entropy rate, including proofs of consistency and asymptotic normality. The practical utility of the method is illustrated on both simulated and real-world data, where it is found to outperform state-of-the-art alternatives.
Submitted 21 March, 2023; v1 submitted 13 December, 2022;
originally announced December 2022.
-
Context-tree weighting and Bayesian Context Trees: Asymptotic and non-asymptotic justifications
Authors:
Ioannis Kontoyiannis
Abstract:
The Bayesian Context Trees (BCT) framework is a recently introduced, general collection of statistical and algorithmic tools for modelling, analysis and inference with discrete-valued time series. The foundation of this development is built in part on some well-known information-theoretic ideas and techniques, including Rissanen's tree sources and Willems et al.'s context-tree weighting algorithm. This paper presents a collection of theoretical results that provide mathematical justifications and further insight into the BCT modelling framework and the associated practical tools. It is shown that the BCT prior predictive likelihood (the probability of a time series of observations averaged over all models and parameters) is both pointwise and minimax optimal, in agreement with the MDL principle and the BIC criterion. The posterior distribution is shown to be asymptotically consistent with probability one (over both models and parameters), and asymptotically Gaussian (over the parameters). And the posterior predictive distribution is also shown to be asymptotically consistent with probability one.
Submitted 5 September, 2023; v1 submitted 4 November, 2022;
originally announced November 2022.
-
Information in probability: Another information-theoretic proof of a finite de Finetti theorem
Authors:
Lampros Gavalakis,
Ioannis Kontoyiannis
Abstract:
We recall some of the history of the information-theoretic approach to deriving core results in probability theory and indicate parts of the recent resurgence of interest in this area with current progress along several interesting directions. Then we give a new information-theoretic proof of a finite version of de Finetti's classical representation theorem for finite-valued random variables. We derive an upper bound on the relative entropy between the distribution of the first $k$ in a sequence of $n$ exchangeable random variables, and an appropriate mixture over product distributions. The mixing measure is characterised as the law of the empirical measure of the original sequence, and de Finetti's result is recovered as a corollary. The proof is nicely motivated by the Gibbs conditioning principle in connection with statistical mechanics, and it follows along an appealing sequence of steps. The technical estimates required for these steps are obtained via the use of a collection of combinatorial tools known within information theory as `the method of types.'
Submitted 26 April, 2022; v1 submitted 11 April, 2022;
originally announced April 2022.
-
Change-point Detection and Segmentation of Discrete Data using Bayesian Context Trees
Authors:
Valentinian Lungu,
Ioannis Papageorgiou,
Ioannis Kontoyiannis
Abstract:
A new Bayesian modelling framework is introduced for piece-wise homogeneous variable-memory Markov chains, along with a collection of effective algorithmic tools for change-point detection and segmentation of discrete time series. Building on the recently introduced Bayesian Context Trees (BCT) framework, the distributions of different segments in a discrete time series are described as variable-memory Markov chains. Inference for the presence and location of change-points is then performed via Markov chain Monte Carlo sampling. The key observation that facilitates effective sampling is that, using one of the BCT algorithms, the prior predictive likelihood of the data can be computed exactly, integrating out all the models and parameters in each segment. This makes it possible to sample directly from the posterior distribution of the number and location of the change-points, leading to accurate estimates and providing a natural quantitative measure of uncertainty in the results. Estimates of the actual model in each segment can also be obtained, at essentially no additional computational cost. Results on both simulated and real-world data indicate that the proposed methodology performs better than or as well as state-of-the-art techniques.
Submitted 13 May, 2022; v1 submitted 8 March, 2022;
originally announced March 2022.
-
Posterior Representations for Bayesian Context Trees: Sampling, Estimation and Convergence
Authors:
Ioannis Papageorgiou,
Ioannis Kontoyiannis
Abstract:
We revisit the Bayesian Context Trees (BCT) modelling framework for discrete time series, which was recently found to be very effective in numerous tasks including model selection, estimation and prediction. A novel representation of the induced posterior distribution on model space is derived in terms of a simple branching process, and several consequences of this are explored in theory and in practice. First, it is shown that the branching process representation leads to a simple variable-dimensional Monte Carlo sampler for the joint posterior distribution on models and parameters, which can efficiently produce independent samples. This sampler is found to be more efficient than earlier MCMC samplers for the same tasks. Then, the branching process representation is used to establish the asymptotic consistency of the BCT posterior, including the derivation of an almost-sure convergence rate. Finally, an extensive study is carried out on the performance of the induced Bayesian entropy estimator. Its utility is illustrated through both simulation experiments and real-world applications, where it is found to outperform several state-of-the-art methods.
Submitted 20 March, 2023; v1 submitted 4 February, 2022;
originally announced February 2022.
-
The ODE Method for Asymptotic Statistics in Stochastic Approximation and Reinforcement Learning
Authors:
Vivek Borkar,
Shuhang Chen,
Adithya Devraj,
Ioannis Kontoyiannis,
Sean Meyn
Abstract:
The paper concerns the stochastic approximation recursion, \[ θ_{n+1}= θ_n + α_{n + 1} f(θ_n, Φ_{n+1})
\,,\quad n\ge 0, \] where the {\em estimates} $θ_n\in\Re^d$ and $ \{ Φ_n \}$ is a Markov chain on a general state space. In addition to standard Lipschitz assumptions and conditions on the vanishing step-size sequence, it is assumed that the associated \textit{mean flow} $ \tfrac{d}{dt} \vartheta_t = \bar{f}(\vartheta_t)$, is globally asymptotically stable with stationary point denoted $θ^*$, where $\bar{f}(θ)=\text{ E}[f(θ,Φ)]$ with $Φ$ having the stationary distribution of the chain. The main results are established under additional conditions on the mean flow and a version of the Donsker-Varadhan Lyapunov drift condition known as (DV3) for the chain:
(i) An appropriate Lyapunov function is constructed that implies convergence of the estimates in $L_4$.
(ii) A functional CLT is established, as well as the usual one-dimensional CLT for the normalized error. Moment bounds combined with the CLT imply convergence of the normalized covariance $\text{ E} [ z_n z_n^T ]$ to the asymptotic covariance $Σ^Θ$ in the CLT, where $z_n= (θ_n-θ^*)/\sqrt{α_n}$.
(iii) The CLT holds for the normalized version $z^{\text{ PR}}_n$ of the averaged parameters $θ^{\text{ PR}}_n$, subject to standard assumptions on the step-size. Moreover, the normalized covariance of both $θ^{\text{ PR}}_n$ and $z^{\text{ PR}}_n$ converge to $Σ^{\text{ PR}}$, the minimal covariance of Polyak and Ruppert.
(iv) An example is given where $f$ and $\bar{f}$ are linear in $θ$, and the Markov chain is geometrically ergodic but does not satisfy (DV3). While the algorithm is convergent, the second moment of $θ_n$ is unbounded and in fact diverges.
Submitted 21 February, 2024; v1 submitted 27 October, 2021;
originally announced October 2021.
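The recursion in this abstract, together with the Polyak-Ruppert averaging mentioned in (iii), can be sketched on a toy problem. Everything specific below is an assumption made for illustration and is not taken from the paper: the two-state Markov chain $Φ_n$ on $\{0,1\}$, the choice $f(θ,Φ)=Φ-θ$ (so that the mean flow is $\tfrac{d}{dt}\vartheta_t = π(1)-\vartheta_t$ with $θ^*=π(1)$), the step-size $α_n=n^{-0.7}$, and the function name `run_sa`.

```python
import random


def run_sa(num_steps=20000, seed=0):
    """Toy stochastic approximation with Markovian noise.

    Returns (last iterate theta_n, Polyak-Ruppert average of the iterates).
    """
    rng = random.Random(seed)
    p01, p10 = 0.3, 0.1        # assumed transition probabilities 0->1 and 1->0
    phi = 0                    # state of the Markov chain
    theta = 0.0                # initial estimate theta_0
    theta_sum = 0.0
    for n in range(1, num_steps + 1):
        # Advance the chain to obtain the next noise sample.
        u = rng.random()
        if phi == 0:
            phi = 1 if u < p01 else 0
        else:
            phi = 0 if u < p10 else 1
        alpha = n ** -0.7                  # vanishing step-size (assumed form)
        theta += alpha * (phi - theta)     # theta <- theta + alpha * f(theta, phi)
        theta_sum += theta
    return theta, theta_sum / num_steps


if __name__ == "__main__":
    theta_last, theta_pr = run_sa()
    # Stationary probability of state 1 is p01 / (p01 + p10) = 0.75.
    print(theta_last, theta_pr)
```

In this toy setting both the last iterate and the averaged estimate should settle near $θ^*=0.75$, with the averaged estimate typically fluctuating less.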
-
Context-tree weighting for real-valued time series: Bayesian inference with hierarchical mixture models
Authors:
Ioannis Papageorgiou,
Ioannis Kontoyiannis
Abstract:
Real-valued time series are ubiquitous in the sciences and engineering. In this work, a general, hierarchical Bayesian modelling framework is developed for building mixture models for time series. This development is based, in part, on the use of context trees, and it includes a collection of effective algorithmic tools for learning and inference. A discrete context (or 'state') is extracted for each sample, consisting of a discretised version of some of the most recent observations preceding it. The set of all relevant contexts is represented as a discrete context-tree. At the bottom level, a different real-valued time series model is associated with each context-state, i.e., with each leaf of the tree. This defines a very general framework that can be used in conjunction with any existing model class to build flexible and interpretable mixture models. Extending the idea of context-tree weighting leads to algorithms that allow for efficient, exact Bayesian inference in this setting. The utility of the general framework is illustrated in detail when autoregressive (AR) models are used at the bottom level, resulting in a nonlinear AR mixture model. The associated methods are found to outperform several state-of-the-art techniques on simulated and real-world experiments.
Submitted 14 April, 2023; v1 submitted 5 June, 2021;
originally announced June 2021.
-
Entropy and the Discrete Central Limit Theorem
Authors:
Lampros Gavalakis,
Ioannis Kontoyiannis
Abstract:
A strengthened version of the central limit theorem for discrete random variables is established, relying only on information-theoretic tools and elementary arguments. It is shown that the relative entropy between the standardised sum of $n$ independent and identically distributed lattice random variables and an appropriately discretised Gaussian, vanishes as $n\to\infty$.
Submitted 1 June, 2021;
originally announced June 2021.
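The vanishing relative entropy described in this abstract can be observed numerically in a simple special case. The sketch below is an illustration under stated assumptions, not the paper's construction: the summands are taken to be Bernoulli($p$), so the sum is Binomial($n,p$), and the "discretised Gaussian" is assumed to assign to each integer $k$ the mass that a Gaussian with the same mean and variance gives to $(k-\tfrac12, k+\tfrac12]$. Relative entropy is unchanged by standardising both distributions, so the computation is done directly on the integers. All function names are hypothetical.

```python
from math import comb, erfc, log, sqrt


def gaussian_interval_prob(a, b):
    """P(a < Z <= b) for a standard normal Z.

    The upper tail is handled via the survival function to avoid
    catastrophic cancellation far from the mean.
    """
    if a > 0:
        return 0.5 * (erfc(a / sqrt(2)) - erfc(b / sqrt(2)))
    return 0.5 * (erfc(-b / sqrt(2)) - erfc(-a / sqrt(2)))


def binomial_vs_discretised_gaussian(n, p):
    """D(Binomial(n, p) || discretised Gaussian), in nats."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    total = 0.0
    for k in range(n + 1):
        pk = comb(n, k) * p**k * (1 - p) ** (n - k)
        qk = gaussian_interval_prob((k - 0.5 - mu) / sigma,
                                    (k + 0.5 - mu) / sigma)
        total += pk * log(pk / qk)
    return total


if __name__ == "__main__":
    for n in (10, 40, 160):
        print(n, binomial_vs_discretised_gaussian(n, 0.3))
```

The printed values should shrink as $n$ grows, in line with the convergence stated in the abstract.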
-
The Feature-First Block Model
Authors:
Lawrence Tray,
Ioannis Kontoyiannis
Abstract:
Labelled networks are an important class of data, naturally appearing in numerous applications in science and engineering. A typical inference goal is to determine how the vertex labels (or features) affect the network's structure. In this work, we introduce a new generative model, the feature-first block model (FFBM), that facilitates the use of rich queries on labelled networks. We develop a Bayesian framework and devise a two-level Markov chain Monte Carlo approach to efficiently sample from the relevant posterior distribution of the FFBM parameters. This allows us to infer if and how the observed vertex-features affect macro-structure. We apply the proposed methods to a variety of network data to extract the most important features along which the vertices are partitioned. The main advantages of the proposed approach are that the whole feature-space is used automatically and that features can be rank-ordered implicitly according to impact.
Submitted 16 November, 2021; v1 submitted 28 May, 2021;
originally announced May 2021.
-
Population-scale testing can suppress the spread of infectious disease
Authors:
Jussi Taipale,
Ioannis Kontoyiannis,
Sten Linnarsson
Abstract:
Major advances in public health have resulted from disease prevention. However, prevention of a new infectious disease by vaccination or pharmaceuticals is made difficult by the slow process of vaccine and drug development. We propose an additional intervention that allows rapid control of emerging infectious diseases, and can also be used to eradicate diseases that rely almost exclusively on human-to-human transmission. The intervention is based on (1) testing every individual for the disease, (2) repeatedly, and (3) isolation of infected individuals. We show here that at a sufficient rate of testing, the reproduction number is reduced below 1.0 and the epidemic will rapidly collapse. The approach does not rely on strong or unrealistic assumptions about test accuracy, isolation compliance, population structure or epidemiological parameters, and its success can be monitored in real time by following the test positivity rate. In addition to the compliance rate and false negatives, the required rate of testing depends on the design of the testing regime, with concurrent testing outperforming random sampling. Provided that results are obtained rapidly, the test frequency required to suppress an epidemic is monotonic and near-linear with respect to R0, the infectious period, and the fraction of susceptible individuals. The testing regime is effective against both early phase and established epidemics, and additive to other interventions (e.g. contact tracing and social distancing). It is also robust to failure: any rate of testing reduces the number of infections, improving both public health and economic conditions. These conclusions are based on rigorous analysis and simulations of appropriate epidemiological models. A mass-produced, disposable test that could be used at home would be ideal, due to the optimal performance of concurrent tests that return immediate results.
Submitted 14 April, 2021;
originally announced April 2021.
-
An Information-Theoretic Proof of a Finite de Finetti Theorem
Authors:
Lampros Gavalakis,
Ioannis Kontoyiannis
Abstract:
A finite form of de Finetti's representation theorem is established using elementary information-theoretic tools: The distribution of the first $k$ random variables in an exchangeable binary vector of length $n\geq k$ is close to a mixture of product distributions. Closeness is measured in terms of the relative entropy and an explicit bound is provided.
Submitted 25 June, 2021; v1 submitted 8 April, 2021;
originally announced April 2021.
-
Compression and Symmetry of Small-World Graphs and Structures
Authors:
Ioannis Kontoyiannis,
Yi Heng Lim,
Katia Papakonstantinopoulou,
Wojtek Szpankowski
Abstract:
For various purposes and, in particular, in the context of data compression, a graph can be examined at three levels. Its structure can be described as the unlabeled version of the graph; then the labeling of its structure can be added; and finally, given then structure and labeling, the contents of the labels can be described. Determining the amount of information present at each level and quantifying the degree of dependence between them, requires the study of symmetry, graph automorphism, entropy, and graph compressibility. In this paper, we focus on a class of small-world graphs. These are geometric random graphs where vertices are first connected to their nearest neighbors on a circle and then pairs of non-neighbors are connected according to a distance-dependent probability distribution. We establish the degree distribution of this model, and use it to prove the model's asymmetry in an appropriate range of parameters. Then we derive the relevant entropy and structural entropy of these random graphs, in connection with graph compression.
△ Less
Submitted 22 November, 2021; v1 submitted 31 July, 2020;
originally announced July 2020.
-
Bayesian Context Trees: Modelling and exact inference for discrete time series
Authors:
Ioannis Kontoyiannis,
Lambros Mertzanis,
Athina Panotopoulou,
Ioannis Papageorgiou,
Maria Skoularidou
Abstract:
We develop a new Bayesian modelling framework for the class of higher-order, variable-memory Markov chains, and introduce an associated collection of methodological tools for exact inference with discrete time series. We show that a version of the context tree weighting algorithm can compute the prior predictive likelihood exactly (averaged over both models and parameters), and two related algorithms are introduced, which identify the a posteriori most likely models and compute their exact posterior probabilities. All three algorithms are deterministic and have linear-time complexity. A family of variable-dimension Markov chain Monte Carlo samplers is also provided, facilitating further exploration of the posterior. The performance of the proposed methods in model selection, Markov order estimation and prediction is illustrated through simulation experiments and real-world applications with data from finance, genetics, neuroscience, and animal communication. The associated algorithms are implemented in the R package BCT.
Submitted 6 February, 2022; v1 submitted 29 July, 2020;
originally announced July 2020.
-
Sharp Second-Order Pointwise Asymptotics for Lossless Compression with Side Information
Authors:
Lampros Gavalakis,
Ioannis Kontoyiannis
Abstract:
The problem of determining the best achievable performance of arbitrary lossless compression algorithms is examined, when correlated side information is available at both the encoder and decoder. For arbitrary source-side information pairs, the conditional information density is shown to provide a sharp asymptotic lower bound for the description lengths achieved by an arbitrary sequence of compressors. This implies that, for ergodic source-side information pairs, the conditional entropy rate is the best achievable asymptotic lower bound to the rate, not just in expectation but with probability one. Under appropriate mixing conditions, a central limit theorem and a law of the iterated logarithm are proved, describing the inevitable fluctuations of the second-order asymptotically best possible rate. An idealised version of Lempel-Ziv coding with side information is shown to be universally first- and second-order asymptotically optimal, under the same conditions. These results are in part based on a new almost-sure invariance principle for the conditional information density, which may be of independent interest.
Submitted 21 May, 2020;
originally announced May 2020.
-
Optimal rates for independence testing via $U$-statistic permutation tests
Authors:
Thomas B. Berrett,
Ioannis Kontoyiannis,
Richard J. Samworth
Abstract:
We study the problem of independence testing given independent and identically distributed pairs taking values in a $σ$-finite, separable measure space. Defining a natural measure of dependence $D(f)$ as the squared $L^2$-distance between a joint density $f$ and the product of its marginals, we first show that there is no valid test of independence that is uniformly consistent against alternatives of the form $\{f: D(f) \geq ρ^2 \}$. We therefore restrict attention to alternatives that impose additional Sobolev-type smoothness constraints, and define a permutation test based on a basis expansion and a $U$-statistic estimator of $D(f)$ that we prove is minimax optimal in terms of its separation rates in many instances. Finally, for the case of a Fourier basis on $[0,1]^2$, we provide an approximation to the power function that offers several additional insights. Our methodology is implemented in the R package USP.
Submitted 6 November, 2020; v1 submitted 15 January, 2020;
originally announced January 2020.
-
The Lévy State Space Model
Authors:
Simon Godsill,
Marina Riabiz,
Ioannis Kontoyiannis
Abstract:
In this paper we introduce a new class of state space models based on shot-noise simulation representations of non-Gaussian Lévy-driven linear systems, represented as stochastic differential equations. In particular, a conditionally Gaussian version of the models is proposed that is able to capture heavy-tailed non-Gaussianity while retaining tractability for inference procedures. We focus on a canonical class of such processes, the $α$-stable Lévy processes, which retain important properties such as self-similarity and heavy tails, while emphasizing that broader classes of non-Gaussian Lévy processes may be handled by similar methodology. An important feature is that we are able to marginalise both the skewness and the scale parameters of these challenging models from posterior probability distributions. The models are posed in continuous time and so are able to deal with irregular data arrival times. Example modelling and inference procedures are provided using Rao-Blackwellised sequential Monte Carlo applied to a two-dimensional Langevin model, and this is tested on real exchange rate data.
Submitted 8 January, 2020; v1 submitted 28 December, 2019;
originally announced December 2019.
-
Fundamental Limits of Lossless Data Compression with Side Information
Authors:
Lampros Gavalakis,
Ioannis Kontoyiannis
Abstract:
The problem of lossless data compression with side information available to both the encoder and the decoder is considered. The finite-blocklength fundamental limits of the best achievable performance are defined, in two different versions of the problem: Reference-based compression, when a single side information string is used repeatedly in compressing different source messages, and pair-based compression, where a different side information string is used for each source message. General achievability and converse theorems are established for arbitrary source-side information pairs. Nonasymptotic normal approximation expansions are proved for the optimal rate in both the reference-based and pair-based settings, for memoryless sources. These are stated in terms of explicit, finite-blocklength bounds, that are tight up to third-order terms. Extensions that go significantly beyond the class of memoryless sources are obtained. The relevant source dispersion is identified and its relationship with the conditional varentropy rate is established. Interestingly, the dispersion is different in reference-based and pair-based compression, and it is proved that the reference-based dispersion is in general smaller.
Submitted 21 February, 2021; v1 submitted 11 December, 2019;
originally announced December 2019.
-
Differential Temporal Difference Learning
Authors:
Adithya M. Devraj,
Ioannis Kontoyiannis,
Sean P. Meyn
Abstract:
Value functions derived from Markov decision processes arise as a central component of algorithms as well as performance metrics in many statistics and engineering applications of machine learning techniques. Computation of the solution to the associated Bellman equations is challenging in most practical cases of interest. A popular class of approximation techniques, known as Temporal Difference (TD) learning algorithms, is an important sub-class of general reinforcement learning methods. The algorithms introduced in this paper are intended to resolve two well-known difficulties of TD-learning approaches: their slow convergence due to very high variance, and the fact that, for the problem of computing the relative value function, consistent algorithms exist only in special cases. First we show that the gradients of these value functions admit a representation that lends itself to algorithm design. Based on this result, a new class of differential TD-learning algorithms is introduced. For Markovian models on Euclidean space with smooth dynamics, the algorithms are shown to be consistent under general conditions. Numerical results show dramatic variance reduction when compared to standard methods.
Submitted 27 February, 2020; v1 submitted 28 December, 2018;
originally announced December 2018.
-
A Simple Network of Nodes Moving on the Circle
Authors:
Dimitris Cheliotis,
Ioannis Kontoyiannis,
Michail Loulakis,
Stavros Toumpis
Abstract:
Two simple Markov processes are examined, one in discrete and one in continuous time, arising from idealized versions of a transmission protocol for mobile, delay-tolerant networks. We consider two independent walkers moving with constant speed on either the discrete or continuous circle, and changing directions at independent geometric (respectively, exponential) times. One of the walkers carries a message that wishes to travel as far and as fast as possible in the clockwise direction. The message stays with its current carrier unless the two walkers meet, the carrier is moving counter-clockwise, and the other walker is moving clockwise. In that case, the message jumps to the other walker. The long-term average clockwise speed of the message is computed. An explicit expression is derived via the solution of an associated boundary value problem in terms of the generator of the underlying Markov process. The average transmission cost is also similarly computed, measured as the long-term number of jumps the message makes per unit time. The tradeoff between speed and cost is examined, as a function of the underlying problem parameters.
Submitted 4 March, 2020; v1 submitted 11 August, 2018;
originally announced August 2018.
-
Nonasymptotic Gaussian Approximation for Inference with Stable Noise
Authors:
Marina Riabiz,
Tohid Ardeshiri,
Ioannis Kontoyiannis,
Simon Godsill
Abstract:
The results of a series of theoretical studies are reported, examining the convergence rate for different approximate representations of $α$-stable distributions. Although they play a key role in modelling random processes with jumps and discontinuities, the use of $α$-stable distributions in inference often leads to analytically intractable problems. The LePage series, which is a probabilistic representation employed in this work, is used to transform an intractable, infinite-dimensional inference problem into a conditionally Gaussian parametric problem. A major component of our approach is the approximation of the tail of this series by a Gaussian random variable. Standard statistical techniques, such as Expectation-Maximization, Markov chain Monte Carlo, and Particle Filtering, can then be applied. In addition to the asymptotic normality of the tail of this series, we establish explicit, nonasymptotic bounds on the approximation error. Their proofs follow classical Fourier-analytic arguments, using Esseen's smoothing lemma. Specifically, we consider the distance between the distributions of: $(i)$ the tail of the series and an appropriate Gaussian; $(ii)$ the full series and the truncated series; and $(iii)$ the full series and the truncated series with an added Gaussian term. In all three cases, sharp bounds are established, and the theoretical results are compared with the actual distances (computed numerically) in specific examples of symmetric $α$-stable distributions. This analysis facilitates the selection of appropriate truncations in practice and offers theoretical guarantees for the accuracy of resulting estimates. One of the main conclusions obtained is that, for the purposes of inference, the use of a truncated series together with an approximately Gaussian error term has superior statistical properties and is likely a preferable choice in practice.
Submitted 1 January, 2020; v1 submitted 27 February, 2018;
originally announced February 2018.
-
Packet Speed and Cost in Mobile Wireless Delay-Tolerant Networks
Authors:
Riccardo Cavallari,
Stavros Toumpis,
Roberto Verdone,
Ioannis Kontoyiannis
Abstract:
A mobile wireless delay-tolerant network (DTN) model is proposed and analyzed, in which infinitely many nodes are initially placed on R^2 according to a uniform Poisson point process (PPP) and subsequently travel, independently of each other, along trajectories comprised of line segments, changing travel direction at time instances that form a Poisson process, each time selecting a new travel direction from an arbitrary distribution; all nodes maintain constant speed. A single information packet is traveling towards a given direction using both wireless transmissions and sojourns on node buffers, according to a member of a broad class of possible routing rules. For this model, we compute the long-term averages of the speed with which the packet travels towards its destination and the rate with which the wireless transmission cost accumulates. Because of the complexity of the problem, we employ two intuitive, simplifying approximations; simulations verify that the approximation error is typically small. Our results quantify the fundamental trade-off that exists in mobile wireless DTNs between the packet speed and the packet delivery cost. The framework developed here is both general and versatile, and can be used as a starting point for further investigation.
Submitted 28 February, 2018; v1 submitted 7 January, 2018;
originally announced January 2018.
-
Geometric Ergodicity in a Weighted Sobolev Space
Authors:
Adithya Devraj,
Ioannis Kontoyiannis,
Sean Meyn
Abstract:
For a discrete-time Markov chain $\{X(t)\}$ evolving on $\Re^\ell$ with transition kernel $P$, natural, general conditions are developed under which the following are established:
1. The transition kernel $P$ has a purely discrete spectrum, when viewed as a linear operator on a weighted Sobolev space $L_\infty^{v,1}$ of functions with norm, $$ \|f\|_{v,1} = \sup_{x \in \Re^\ell} \frac{1}{v(x)} \max \{|f(x)|, |\partial_1 f(x)|,\ldots,|\partial_\ell f(x)|\}, $$ where $v\colon \Re^\ell \to [1,\infty)$ is a Lyapunov function and $\partial_i:=\partial/\partial x_i$.
2. The Markov chain is geometrically ergodic in $L_\infty^{v,1}$: There is a unique invariant probability measure $π$ and constants $B<\infty$ and $δ>0$ such that, for each $f\in L_\infty^{v,1}$, any initial condition $X(0)=x$, and all $t\geq 0$: $$\Big| \text{E}_x[f(X(t))] - π(f)\Big| \le Be^{-δt}v(x),\quad \|\nabla \text{E}_x[f(X(t))] \|_2 \le Be^{-δt} v(x), $$ where $π(f)=\int fdπ$.
3. For any function $f\in L_\infty^{v,1}$ there is a function $h\in L_\infty^{v,1}$ solving Poisson's equation: \[ h-Ph = f-π(f). \] Part of the analysis is based on an operator-theoretic treatment of the sensitivity process that appears in the theory of Lyapunov exponents.
Submitted 18 July, 2019; v1 submitted 9 November, 2017;
originally announced November 2017.
-
Thinning and Information Projections
Authors:
Peter Harremoës,
Oliver Johnson,
Ioannis Kontoyiannis
Abstract:
In this paper we establish lower bounds on information divergence of a distribution on the integers from a Poisson distribution. These lower bounds are tight and, in the cases where a rate of convergence in the Law of Thin Numbers can be computed, the rate is determined by the lower bounds proved in this paper. General techniques for getting lower bounds in terms of moments are developed. The results about lower bounds in the Law of Thin Numbers are used to derive similar results for the Central Limit Theorem.
Submitted 17 January, 2016;
originally announced January 2016.
-
On the $f$-Norm Ergodicity of Markov Processes in Continuous Time
Authors:
I. Kontoyiannis,
S. P. Meyn
Abstract:
Consider a Markov process $\{Φ(t) : t\geq 0\}$ evolving on a Polish space ${\sf X}$. A version of the $f$-Norm Ergodic Theorem is obtained: Suppose that the process is $ψ$-irreducible and aperiodic. For a given function $f\colon{\sf X}\to[1,\infty)$, under suitable conditions on the process the following are equivalent:
(i) There is a unique invariant probability measure $π$ satisfying $\int f\,dπ<\infty$.
(ii) There is a closed set $C$ satisfying $ψ(C)>0$ that is "self $f$-regular."
(iii) There is a function $V\colon{\sf X} \to (0,\infty]$ that is finite on at least one point in ${\sf X}$, for which the following Lyapunov drift condition is satisfied, \[ {\cal D} V\leq - f+b\,\mathbb{I}_C, \qquad \text{(V3)} \] where $C$ is a closed small set and ${\cal D}$ is the extended generator of the process.
For discrete-time chains the result is well-known. Moreover, in that case, the ergodicity of $\{Φ(t)\}$ under a suitable norm is also obtained: For each initial condition $x\in{\sf X}$ satisfying $V(x)<\infty$, and any function $g\colon{\sf X}\to\Re$ for which $|g|$ is bounded by $f$, \[ \lim_{t\to\infty} {\sf E}_x[g(Φ(t))] = \int g\,dπ. \] Possible approaches are explored for establishing appropriate versions of corresponding results in continuous time, under appropriate assumptions on the process $\{Φ(t)\}$ or on the function $g$.
Submitted 1 December, 2015;
originally announced December 2015.
-
Entropy bounds on abelian groups and the Ruzsa divergence
Authors:
Mokshay Madiman,
Ioannis Kontoyiannis
Abstract:
Over the past few years, a family of interesting new inequalities for the entropies of sums and differences of random variables has been developed by Ruzsa, Tao and others, motivated by analogous results in additive combinatorics. The present work extends these earlier results to the case of random variables taking values in $\mathbb{R}^n$ or, more generally, in arbitrary locally compact and Polish abelian groups. We isolate and study a key quantity, the Ruzsa divergence between two probability distributions, and we show that its properties can be used to extend the earlier inequalities to the present general setting. The new results established include several variations on the theme that the entropies of the sum and the difference of two independent random variables severely constrain each other. Although the setting is quite general, the results are already of interest (and new) for random vectors in $\mathbb{R}^n$. In that special case, quantitative bounds are provided for the stability of the equality conditions in the entropy power inequality; a reverse entropy power inequality for log-concave random vectors is proved; an information-theoretic analog of the Rogers-Shephard inequality for convex bodies is established; and it is observed that some of these results lead to new inequalities for the determinants of positive-definite matrices. Moreover, by considering the multiplicative subgroups of the complex plane, one obtains new inequalities for the differential entropies of products and ratios of nonzero, complex-valued random variables.
Submitted 26 October, 2015; v1 submitted 17 August, 2015;
originally announced August 2015.
-
Estimating the Directed Information and Testing for Causality
Authors:
Ioannis Kontoyiannis,
Maria Skoularidou
Abstract:
The problem of estimating the directed information rate between two discrete processes $\{X_n\}$ and $\{Y_n\}$ via the plug-in (or maximum-likelihood) estimator is considered. When the joint process $\{(X_n,Y_n)\}$ is a Markov chain of a given memory length, the plug-in estimator is shown to be asymptotically Gaussian and to converge at the optimal rate $O(1/\sqrt{n})$ under appropriate conditions; this is the first estimator that has been shown to achieve this rate. An important connection is drawn between the problem of estimating the directed information rate and that of performing a hypothesis test for the presence of causal influence between the two processes. Under fairly general conditions, the null hypothesis, which corresponds to the absence of causal influence, is equivalent to the requirement that the directed information rate be equal to zero. In that case a finer result is established, showing that the plug-in converges at the faster rate $O(1/n)$ and that it is asymptotically $χ^2$-distributed. This is proved by showing that this estimator is equal to (a scalar multiple of) the classical likelihood ratio statistic for the above hypothesis test. Finally it is noted that these results facilitate the design of an actual likelihood ratio test for the presence or absence of causal influence.
Submitted 31 March, 2016; v1 submitted 5 July, 2015;
originally announced July 2015.
-
Lossless Data Compression at Finite Blocklengths
Authors:
Ioannis Kontoyiannis,
Sergio Verdu
Abstract:
This paper provides an extensive study of the behavior of the best achievable rate (and other related fundamental limits) in variable-length lossless compression. In the non-asymptotic regime, the fundamental limits of fixed-to-variable lossless compression with and without prefix constraints are shown to be tightly coupled. Several precise, quantitative bounds are derived, connecting the distribution of the optimal codelengths to the source information spectrum, and an exact analysis of the best achievable rate for arbitrary sources is given.
Fine asymptotic results are proved for arbitrary (not necessarily prefix) compressors on general mixing sources. Non-asymptotic, explicit Gaussian approximation bounds are established for the best achievable rate on Markov sources. The source dispersion and the source varentropy rate are defined and characterized. Together with the entropy rate, the varentropy rate serves to tightly approximate the fundamental non-asymptotic limits of fixed-to-variable compression for all but very small blocklengths.
Submitted 11 December, 2012;
originally announced December 2012.
-
Sumset and Inverse Sumset Inequalities for Differential Entropy and Mutual Information
Authors:
Ioannis Kontoyiannis,
Mokshay Madiman
Abstract:
The sumset and inverse sumset theories of Freiman, Plünnecke and Ruzsa give bounds connecting the cardinality of the sumset $A+B=\{a+b\;;\;a\in A,\,b\in B\}$ of two discrete sets $A,B$, to the cardinalities (or the finer structure) of the original sets $A,B$. For example, the sum-difference bound of Ruzsa states that, $|A+B|\,|A|\,|B|\leq|A-B|^3$, where the difference set $A-B= \{a-b\;;\;a\in A,\,b\in B\}$. Interpreting the differential entropy $h(X)$ of a continuous random variable $X$ as (the logarithm of) the size of the effective support of $X$, the main contribution of this paper is a series of natural information-theoretic analogs for these results. For example, the Ruzsa sum-difference bound becomes the new inequality, $h(X+Y)+h(X)+h(Y)\leq 3h(X-Y)$, for any pair of independent continuous random variables $X$ and $Y$. Our results include differential-entropy versions of Ruzsa's triangle inequality, the Plünnecke-Ruzsa inequality, and the Balog-Szemerédi-Gowers lemma. Also we give a differential entropy version of the Freiman-Green-Ruzsa inverse-sumset theorem, which can be seen as a quantitative converse to the entropy power inequality. Versions of most of these results for the discrete entropy $H(X)$ were recently proved by Tao, relying heavily on a strong, functional form of the submodularity property of $H(X)$. Since differential entropy is not functionally submodular, in the continuous case many of the corresponding discrete proofs fail, in many cases requiring substantially new proof strategies. We find that the basic property that naturally replaces the discrete functional submodularity is the data processing property of mutual information.
Submitted 3 June, 2012;
originally announced June 2012.
-
Control Variates for Reversible MCMC Samplers
Authors:
Petros Dellaportas,
Ioannis Kontoyiannis
Abstract:
A general methodology is introduced for the construction and effective application of control variates to estimation problems involving data from reversible MCMC samplers. We propose the use of a specific class of functions as control variates, and we introduce a new, consistent estimator for the values of the coefficients of the optimal linear combination of these functions. The form and proposed construction of the control variates is derived from our solution of the Poisson equation associated with a specific MCMC scenario. The new estimator, which can be applied to the same MCMC sample, is derived from a novel, finite-dimensional, explicit representation for the optimal coefficients. The resulting variance-reduction methodology is primarily applicable when the simulated data are generated by a conjugate random-scan Gibbs sampler. MCMC examples of Bayesian inference problems demonstrate that the corresponding reduction in the estimation variance is significant, and that in some cases it can be quite dramatic. Extensions of this methodology in several directions are given, including certain families of Metropolis-Hastings samplers and hybrid Metropolis-within-Gibbs algorithms. Corresponding simulation examples are presented illustrating the utility of the proposed methods. All methodological and asymptotic arguments are rigorously justified under easily verifiable and essentially minimal conditions.
Submitted 7 August, 2010;
originally announced August 2010.
-
Compound Poisson Approximation via Information Functionals
Authors:
A. D. Barbour,
Oliver Johnson,
Ioannis Kontoyiannis,
Mokshay Madiman
Abstract:
An information-theoretic development is given for the problem of compound Poisson approximation, which parallels earlier treatments for Gaussian and Poisson approximation. Let $P_{S_n}$ be the distribution of a sum $S_n=\sum_{i=1}^n Y_i$ of independent integer-valued random variables $Y_i$. Nonasymptotic bounds are derived for the distance between $P_{S_n}$ and an appropriately chosen compound Poisson law. In the case where all $Y_i$ have the same conditional distribution given $\{Y_i\neq 0\}$, a bound on the relative entropy distance between $P_{S_n}$ and the compound Poisson distribution is derived, based on the data-processing property of relative entropy and earlier Poisson approximation results. When the $Y_i$ have arbitrary distributions, corresponding bounds are derived in terms of the total variation distance. The main technical ingredient is the introduction of two "information functionals," and the analysis of their properties. These information functionals play a role analogous to that of the classical Fisher information in normal approximation. Detailed comparisons are made between the resulting inequalities and related bounds.
Submitted 21 April, 2010;
originally announced April 2010.
-
Log-concavity, ultra-log-concavity, and a maximum entropy property of discrete compound Poisson measures
Authors:
Oliver Johnson,
Ioannis Kontoyiannis,
Mokshay Madiman
Abstract:
Sufficient conditions are developed, under which the compound Poisson distribution has maximal entropy within a natural class of probability measures on the nonnegative integers. Recently, one of the authors [O. Johnson, Stoch. Proc. Appl., 2007] used a semigroup approach to show that the Poisson has maximal entropy among all ultra-log-concave distributions with fixed mean. We show via a non-trivial extension of this semigroup approach that the natural analog of the Poisson maximum entropy property remains valid if the compound Poisson distributions under consideration are log-concave, but that it fails in general. A parallel maximum entropy result is established for the family of compound binomial measures. Sufficient conditions for compound distributions to be log-concave are discussed and applications to combinatorics are examined; new bounds are derived on the entropy of the cardinality of a random independent set in a claw-free graph, and a connection is drawn to Mason's conjecture for matroids. The present results are primarily motivated by the desire to provide an information-theoretic foundation for compound Poisson approximation and associated limit theorems, analogous to the corresponding developments for the central limit theorem and for Poisson approximation. Our results also demonstrate new links between some probabilistic methods and the combinatorial notions of log-concavity and ultra-log-concavity, and they add to the growing body of work exploring the applications of maximum entropy characterizations to problems in discrete mathematics.
Submitted 27 September, 2011; v1 submitted 3 December, 2009;
originally announced December 2009.
-
Notes on Using Control Variates for Estimation with Reversible MCMC Samplers
Authors:
Ioannis Kontoyiannis,
Petros Dellaportas
Abstract:
A general methodology is presented for the construction and effective use of control variates for reversible MCMC samplers. The values of the coefficients of the optimal linear combination of the control variates are computed, and adaptive, consistent MCMC estimators are derived for these optimal coefficients. All methodological and asymptotic arguments are rigorously justified. Numerous MCMC simulation examples from Bayesian inference applications demonstrate that the resulting variance reduction can be quite dramatic.
Submitted 4 May, 2010; v1 submitted 24 July, 2009;
originally announced July 2009.
-
Geometric Ergodicity and the Spectral Gap of Non-Reversible Markov Chains
Authors:
Ioannis Kontoyiannis,
Sean P. Meyn
Abstract:
We argue that the spectral theory of non-reversible Markov chains may often be more effectively cast within the framework of the naturally associated weighted-$L_\infty$ space $L_\infty^V$, instead of the usual Hilbert space $L_2=L_2(π)$, where $π$ is the invariant measure of the chain. This observation is, in part, based on the following results. A discrete-time Markov chain with values in a general state space is geometrically ergodic if and only if its transition kernel admits a spectral gap in $L_\infty^V$. If the chain is reversible, the same equivalence holds with $L_2$ in place of $L_\infty^V$, but in the absence of reversibility it fails: There are (necessarily non-reversible, geometrically ergodic) chains that admit a spectral gap in $L_\infty^V$ but not in $L_2$. Moreover, if a chain admits a spectral gap in $L_2$, then for any $h\in L_2$ there exists a Lyapunov function $V_h\in L_1$ such that $V_h$ dominates $h$ and the chain admits a spectral gap in $L_\infty^{V_h}$. The relationship between the size of the spectral gap in $L_\infty^V$ or $L_2$, and the rate at which the chain converges to equilibrium is also briefly discussed.
Submitted 29 June, 2009;
originally announced June 2009.
-
Thinning, Entropy and the Law of Thin Numbers
Authors:
Peter Harremoes,
Oliver Johnson,
Ioannis Kontoyiannis
Abstract:
Renyi's "thinning" operation on a discrete random variable is a natural discrete analog of the scaling operation for continuous random variables. The properties of thinning are investigated in an information-theoretic context, especially in connection with information-theoretic inequalities related to Poisson approximation results. The classical Binomial-to-Poisson convergence (sometimes referred to as the "law of small numbers") is seen to be a special case of a thinning limit theorem for convolutions of discrete distributions. A rate of convergence is provided for this limit, and nonasymptotic bounds are also established. This development parallels, in part, the development of Gaussian inequalities leading to the information-theoretic version of the central limit theorem. In particular, a "thinning Markov chain" is introduced, and it is shown to play a role analogous to that of the Ornstein-Uhlenbeck process in connection to the entropy power inequality.
Submitted 3 June, 2009;
originally announced June 2009.
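As a rough illustration of the thinning operation named in the abstract above, the sketch below assumes the standard definition: the $α$-thinning of a nonnegative integer-valued $X$ is the sum of $X$ independent Bernoulli($α$) variables. The function names `binomial_pmf` and `thin` are hypothetical and not taken from the paper. The check uses the elementary fact that thinning a Binomial($n$, $p$) law by $α$ gives a Binomial($n$, $αp$) law.

```python
from math import comb


def binomial_pmf(n, p):
    """Probability mass function of Binomial(n, p) as a list indexed by k."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def thin(pmf, alpha):
    """pmf of the alpha-thinning of X ~ pmf (assumed definition:
    the sum of X i.i.d. Bernoulli(alpha) variables)."""
    out = [0.0] * len(pmf)
    for x, px in enumerate(pmf):
        # Given X = x, the thinned variable is Binomial(x, alpha).
        for k, q in enumerate(binomial_pmf(x, alpha)):
            out[k] += px * q
    return out


# Thinning Binomial(10, 0.6) by alpha = 0.5 should match Binomial(10, 0.3).
thinned = thin(binomial_pmf(10, 0.6), 0.5)
expected = binomial_pmf(10, 0.3)
max_gap = max(abs(a - b) for a, b in zip(thinned, expected))
print(max_gap)
```

The double loop costs on the order of the square of the support size, which is fine for small examples like this one.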
-
Approximating a Diffusion by a Hidden Markov Model
Authors:
Ioannis Kontoyiannis,
Sean P. Meyn
Abstract:
For a wide class of continuous-time Markov processes, including all irreducible hypoelliptic diffusions evolving on an open, connected subset of $\mathbb{R}^d$, the following are shown to be equivalent: (i) The process satisfies (a slightly weaker version of) the classical Donsker-Varadhan conditions; (ii) The transition semigroup of the process can be approximated by a finite-state hidden Markov model, in a strong sense in terms of an associated operator norm; (iii) The resolvent kernel of the process is "$v$-separable", that is, it can be approximated arbitrarily well in operator norm by finite-rank kernels. Under any (hence all) of the above conditions, the Markov process is shown to have a purely discrete spectrum on a naturally associated weighted $L_\infty$ space.
Submitted 25 April, 2016; v1 submitted 1 June, 2009;
originally announced June 2009.
-
Lossy Compression in Near-Linear Time via Efficient Random Codebooks and Databases
Authors:
Chris Gioran,
Ioannis Kontoyiannis
Abstract:
The compression-complexity trade-off of lossy compression algorithms that are based on a random codebook or a random database is examined. Motivated, in part, by recent results of Gupta-Verdú-Weissman (GVW) and their underlying connections with the pattern-matching scheme of Kontoyiannis' lossy Lempel-Ziv algorithm, we introduce a non-universal version of the lossy Lempel-Ziv method (termed LLZ). The optimality of LLZ for memoryless sources is established, and its performance is compared to that of the GVW divide-and-conquer approach. Experimental results indicate that the GVW approach often yields better compression than LLZ, but at the price of much higher memory requirements. To combine the advantages of both, we introduce a hybrid algorithm (HYB) that utilizes both the divide-and-conquer idea of GVW and the single-database structure of LLZ. It is proved that HYB shares with GVW the exact same rate-distortion performance and implementation complexity, while, like LLZ, requiring less memory, by a factor which may become unbounded, depending on the choice of the relevant design parameters. Experimental results are also presented, illustrating the performance of all three methods on data generated by simple discrete memoryless sources. In particular, the HYB algorithm is shown to outperform existing schemes for the compression of some simple discrete sources with respect to the Hamming distortion criterion.
Submitted 21 April, 2009;
originally announced April 2009.
-
On the entropy and log-concavity of compound Poisson measures
Authors:
Oliver Johnson,
Ioannis Kontoyiannis,
Mokshay Madiman
Abstract:
Motivated, in part, by the desire to develop an information-theoretic foundation for compound Poisson approximation limit theorems (analogous to the corresponding developments for the central limit theorem and for simple Poisson approximation), this work examines sufficient conditions under which the compound Poisson distribution has maximal entropy within a natural class of probability measures on the nonnegative integers. We show that the natural analog of the Poisson maximum entropy property remains valid if the measures under consideration are log-concave, but that it fails in general. A parallel maximum entropy result is established for the family of compound binomial measures. The proofs are largely based on ideas related to the semigroup approach introduced in recent work by Johnson for the Poisson family. Sufficient conditions are given for compound distributions to be log-concave, and specific examples are presented illustrating all the above results.
Submitted 27 May, 2008;
originally announced May 2008.
-
Estimating the entropy of binary time series: Methodology, some theory and a simulation study
Authors:
Y. Gao,
I. Kontoyiannis,
E. Bienenstock
Abstract:
Partly motivated by entropy-estimation problems in neuroscience, we present a detailed and extensive comparison between some of the most popular and effective entropy estimation methods used in practice: The plug-in method, four different estimators based on the Lempel-Ziv (LZ) family of data compression algorithms, an estimator based on the Context-Tree Weighting (CTW) method, and the renewal entropy estimator.
Methodology: Three new entropy estimators are introduced. For two of the four LZ-based estimators, a bootstrap procedure is described for evaluating their standard error, and a practical rule of thumb is heuristically derived for selecting the values of their parameters.
Theory: We prove that, unlike their earlier versions, the two new LZ-based estimators are consistent for every finite-valued, stationary and ergodic process. An effective method is derived for the accurate approximation of the entropy rate of a finite-state HMM with known distribution. Heuristic calculations are presented and approximate formulas are derived for evaluating the bias and the standard error of each estimator.
Simulation: All estimators are applied to a wide range of data generated by numerous different processes with varying degrees of dependence and memory. Some conclusions drawn from these experiments include: (i) For all estimators considered, the main source of error is the bias. (ii) The CTW method is repeatedly and consistently seen to provide the most accurate results. (iii) The performance of the LZ-based estimators is often comparable to that of the plug-in method. (iv) The main drawback of the plug-in method is its computational inefficiency.
Submitted 29 February, 2008;
originally announced February 2008.
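To make the plug-in method mentioned in the abstract above concrete, here is a minimal sketch under stated assumptions: overlapping blocks of a fixed length `w` are counted, the entropy of their empirical distribution is computed in bits, and the result is divided by `w`. The function name `plugin_entropy_rate` and the use of overlapping blocks are illustrative choices, not details confirmed by the paper.

```python
from collections import Counter
from math import log2


def plugin_entropy_rate(bits, w):
    """Plug-in estimate of the entropy rate in bits per symbol.

    Assumption: overlapping length-w blocks are used to form the
    empirical block distribution.
    """
    blocks = [tuple(bits[i:i + w]) for i in range(len(bits) - w + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    # Entropy of the empirical distribution of the observed blocks.
    h_block = -sum((c / total) * log2(c / total) for c in counts.values())
    return h_block / w


constant = [0] * 100          # no randomness at all
alternating = [0, 1] * 50     # deterministic, but balanced symbol counts
print(plugin_entropy_rate(constant, 3))
print(plugin_entropy_rate(alternating, 1))
print(plugin_entropy_rate(alternating, 2))
```

With `w = 1` the alternating sequence looks like a fair coin, and larger blocks expose its structure. The number of possible blocks grows as $2^w$, so the block length has to stay small relative to the data length.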
-
Identifying statistical dependence in genomic sequences via mutual information estimates
Authors:
H. M. Aktulga,
I. Kontoyiannis,
L. A. Lyznik,
L. Szpankowski,
A. Y. Grama,
W. Szpankowski
Abstract:
Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, for the identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the 5' untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling.
Submitted 26 October, 2007;
originally announced October 2007.
-
From the entropy to the statistical structure of spike trains
Authors:
Yun Gao,
Ioannis Kontoyiannis,
Elie Bienenstock
Abstract:
We use statistical estimates of the entropy rate of spike train data in order to make inferences about the underlying structure of the spike train itself. We first examine a number of different parametric and nonparametric estimators (some known and some new), including the "plug-in" method, several versions of Lempel-Ziv-based compression algorithms, a maximum likelihood estimator tailored to renewal processes, and the natural estimator derived from the Context-Tree Weighting method (CTW). The theoretical properties of these estimators are examined, several new theoretical results are developed, and all estimators are systematically applied to various types of synthetic data and under different conditions.
Our main focus is on the performance of these entropy estimators on the (binary) spike trains of 28 neurons recorded simultaneously for a one-hour period from the primary motor and dorsal premotor cortices of a monkey. We show how the entropy estimates can be used to test for the existence of long-term structure in the data, and we construct a hypothesis test for whether the renewal process model is appropriate for these spike trains. Further, by applying the CTW algorithm we derive the maximum a posteriori (MAP) tree model of our empirical data, and comment on the underlying structure it reveals.
Submitted 27 March, 2008; v1 submitted 22 October, 2007;
originally announced October 2007.
-
Some information-theoretic computations related to the distribution of prime numbers
Authors:
Ioannis Kontoyiannis
Abstract:
We illustrate how elementary information-theoretic ideas may be employed to provide proofs for well-known, nontrivial results in number theory. Specifically, we give an elementary and fairly short proof of the following asymptotic result: The sum of (log p)/p, taken over all primes p not exceeding n, is asymptotic to log n as n tends to infinity. We also give finite-n bounds refining the above limit. This result, originally proved by Chebyshev in 1852, is closely related to the celebrated prime number theorem.
Submitted 5 November, 2007; v1 submitted 22 October, 2007;
originally announced October 2007.
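The asymptotic statement in the abstract above can be checked numerically. The sketch below is only a sanity check, not the paper's argument: it sums $(\log p)/p$ over primes $p \le n$ with a basic sieve of Eratosthenes and compares the result with $\log n$. The helper names `primes_up_to` and `log_p_over_p_sum` are hypothetical.

```python
from math import log


def primes_up_to(n):
    """Return all primes p <= n using a sieve of Eratosthenes."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"  # 0 and 1 are not prime
    for i in range(2, int(n**0.5) + 1):
        if sieve[i]:
            # Cross out multiples of i, starting from i*i.
            sieve[i * i::i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i in range(2, n + 1) if sieve[i]]


def log_p_over_p_sum(n):
    """Sum of (log p)/p over all primes p not exceeding n."""
    return sum(log(p) / p for p in primes_up_to(n))


for n in (10**3, 10**4, 10**5):
    s = log_p_over_p_sum(n)
    # Columns: n, the prime sum, log n, and their ratio.
    print(n, round(s, 4), round(log(n), 4), round(s / log(n), 4))
```

The ratio approaches 1 slowly because the difference between the sum and $\log n$ stays bounded rather than vanishing, so the convergence is only logarithmic in $n$.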
-
Estimation of the Rate-Distortion Function
Authors:
M. T. Harrison,
I. Kontoyiannis
Abstract:
Motivated by questions in lossy data compression and by theoretical considerations, we examine the problem of estimating the rate-distortion function of an unknown (not necessarily discrete-valued) source from empirical data. Our focus is the behavior of the so-called "plug-in" estimator, which is simply the rate-distortion function of the empirical distribution of the observed data. Sufficient conditions are given for its consistency, and examples are provided to demonstrate that in certain cases it fails to converge to the true rate-distortion function. The analysis of its performance is complicated by the fact that the rate-distortion function is not continuous in the source distribution; the underlying mathematical problem is closely related to the classical problem of establishing the consistency of maximum likelihood estimators. General consistency results are given for the plug-in estimator applied to a broad class of sources, including all stationary and ergodic ones. A more general class of estimation problems is also considered, arising in the context of lossy data compression when the allowed class of coding distributions is restricted; analogous results are developed for the plug-in estimator in that case. Finally, consistency theorems are formulated for modified (e.g., penalized) versions of the plug-in, and for estimating the optimal reproduction distribution.
Submitted 11 April, 2008; v1 submitted 2 February, 2007;
originally announced February 2007.
-
Computable exponential bounds for screened estimation and simulation
Authors:
Ioannis Kontoyiannis,
Sean P. Meyn
Abstract:
Suppose the expectation $E(F(X))$ is to be estimated by the empirical averages of the values of $F$ on independent and identically distributed samples $\{X_i\}$. A sampling rule called the "screened" estimator is introduced, and its performance is studied. When the mean $E(U(X))$ of a different function $U$ is known, the estimates are "screened," in that we only consider those which correspond to times when the empirical average of the $\{U(X_i)\}$ is sufficiently close to its known mean. As long as $U$ dominates $F$ appropriately, the screened estimates admit exponential error bounds, even when $F(X)$ is heavy-tailed. The main results are several nonasymptotic, explicit exponential bounds for the screened estimates. A geometric interpretation, in the spirit of Sanov's theorem, is given for the fact that the screened estimates always admit exponential error bounds, even if the standard estimates do not. And when they do, the screened estimates' error probability has a significantly better exponent. This implies that screening can be interpreted as a variance reduction technique. Our main mathematical tools come from large deviations techniques. The results are illustrated by a detailed simulation example.
Submitted 22 August, 2008; v1 submitted 1 December, 2006;
originally announced December 2006.