-
Activation Bottleneck: Sigmoidal Neural Networks Cannot Forecast a Straight Line
Authors:
Maximilian Toller,
Hussain Hussain,
Bernhard C Geiger
Abstract:
A neural network has an activation bottleneck if one of its hidden layers has a bounded image. We show that networks with an activation bottleneck cannot forecast unbounded sequences such as straight lines, random walks, or any sequence with a trend: The difference between prediction and ground truth becomes arbitrarily large, regardless of the training procedure. Widely used neural network architectures such as LSTM and GRU suffer from this limitation. In our analysis, we characterize activation bottlenecks and explain why they prevent sigmoidal networks from learning unbounded sequences. We experimentally validate our findings and discuss modifications to network architectures which mitigate the effects of activation bottlenecks.
Submitted 4 June, 2024;
originally announced June 2024.
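The bounded-image argument in the abstract above can be illustrated with a minimal sketch. It is not the authors' experiment: the single tanh unit, the weight values, and the function names `bounded_forecast` and `output_bound` are all hypothetical choices made only for illustration. Because the hidden activation stays in [-1, 1], a linear read-out can never exceed |w_out| + |b_out|, so the error against the straight line y = t keeps growing.

```python
import math


def bounded_forecast(t, w_in=0.5, b_in=0.0, w_out=3.0, b_out=1.0):
    """Toy network: one tanh hidden unit followed by a linear read-out.

    All weights are arbitrary illustrative values (assumptions).
    """
    hidden = math.tanh(w_in * t + b_in)  # bounded: always within [-1, 1]
    return w_out * hidden + b_out


def output_bound(w_out=3.0, b_out=1.0):
    """Upper bound on |bounded_forecast(t)| that holds for every t."""
    return abs(w_out) + abs(b_out)


bound = output_bound()

# Absolute error against the straight line y = t at increasingly distant steps.
errors = [abs(t - bounded_forecast(t)) for t in (10, 100, 1000)]

for t, err in zip((10, 100, 1000), errors):
    print(f"t={t:5d}  forecast={bounded_forecast(t):.4f}  error={err:.4f}")
```

Changing the weights only changes the bound, not the conclusion: for any finite weights the forecast saturates while the target keeps increasing.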
-
Software in the natural world: A computational approach to hierarchical emergence
Authors:
Fernando E. Rosas,
Bernhard C. Geiger,
Andrea I Luppi,
Anil K. Seth,
Daniel Polani,
Michael Gastpar,
Pedro A. M. Mediano
Abstract:
Understanding the functional architecture of complex systems is crucial to illuminate their inner workings and enable effective methods for their prediction and control. Recent advances have introduced tools to characterise emergent macroscopic levels; however, while these approaches are successful in identifying when emergence takes place, they are limited in the extent they can determine how it does. Here we address this limitation by developing a computational approach to emergence, which characterises macroscopic processes in terms of their computational capabilities. Concretely, we articulate a view on emergence based on how software works, which is rooted in a mathematical formalism that articulates how macroscopic processes can express self-contained informational, interventional, and computational properties. This framework establishes a hierarchy of nested self-contained processes that determines what computations take place at what level, which in turn delineates the functional architecture of a complex system. This approach is illustrated on paradigmatic models from the statistical physics and computational neuroscience literature, which are shown to exhibit macroscopic processes that are akin to software in human-engineered systems. Overall, this framework enables a deeper understanding of the multi-level structure of complex systems, revealing specific ways in which they can be efficiently simulated, predicted, and controlled.
Submitted 5 June, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
-
Approximating Families of Sharp Solutions to Fisher's Equation with Physics-Informed Neural Networks
Authors:
Franz M. Rohrhofer,
Stefan Posch,
Clemens Gößnitzer,
Bernhard C. Geiger
Abstract:
This paper employs physics-informed neural networks (PINNs) to solve Fisher's equation, a fundamental representation of a reaction-diffusion system with both simplicity and significance. The focus lies specifically in investigating Fisher's equation under conditions of large reaction rate coefficients, wherein solutions manifest as traveling waves, posing a challenge for numerical methods due to the occurring steepness of the wave front. To address optimization challenges associated with the standard PINN approach, a residual weighting scheme is introduced. This scheme is designed to enhance the tracking of propagating wave fronts by considering the reaction term in the reaction-diffusion equation. Furthermore, a specific network architecture is studied which is tailored for solutions in the form of traveling waves. Lastly, the capacity of PINNs to approximate an entire family of solutions is assessed by incorporating the reaction rate coefficient as an additional input to the network architecture. This modification enables the approximation of the solution across a broad and continuous range of reaction rate coefficients, thus solving a class of reaction-diffusion systems using a single PINN instance.
Submitted 13 February, 2024;
originally announced February 2024.
-
Bringing Chemistry to Scale: Loss Weight Adjustment for Multivariate Regression in Deep Learning of Thermochemical Processes
Authors:
Franz M. Rohrhofer,
Stefan Posch,
Clemens Gößnitzer,
José M. García-Oliver,
Bernhard C. Geiger
Abstract:
Flamelet models are widely used in computational fluid dynamics to simulate thermochemical processes in turbulent combustion. These models typically employ memory-expensive lookup tables that are predetermined and represent the combustion process to be simulated. Artificial neural networks (ANNs) offer a deep learning approach that can store this tabular data using a small number of network weights, potentially reducing the memory demands of complex simulations by orders of magnitude. However, ANNs with standard training losses often struggle with underrepresented targets in multivariate regression tasks, e.g., when learning minor species mass fractions as part of lookup tables. This paper seeks to improve the accuracy of an ANN when learning multiple species mass fractions of a hydrogen (H$_2$) combustion lookup table. We assess a simple, yet effective loss weight adjustment that outperforms the standard mean-squared error optimization and enables accurate learning of all species mass fractions, even of minor species where the standard optimization completely fails. Furthermore, we find that the loss weight adjustment leads to more balanced gradients in the network training, which explains its effectiveness.
Submitted 3 August, 2023;
originally announced August 2023.
-
Finding the Optimum Design of Large Gas Engines Prechambers Using CFD and Bayesian Optimization
Authors:
Stefan Posch,
Clemens Gößnitzer,
Franz Rohrhofer,
Bernhard C. Geiger,
Andreas Wimmer
Abstract:
The turbulent jet ignition concept using prechambers is a promising solution to achieve stable combustion at lean conditions in large gas engines, leading to high efficiency at low emission levels. Due to the wide range of design and operating parameters for large gas engine prechambers, the preferred method for evaluating different designs is computational fluid dynamics (CFD), as testing in test bed measurement campaigns is time-consuming and expensive. However, the significant computational time required for detailed CFD simulations due to the complexity of solving the underlying physics also limits its applicability. In optimization settings similar to the present case, i.e., where the evaluation of the objective function(s) is computationally costly, Bayesian optimization has largely replaced classical design-of-experiment. Thus, the present study deals with the computationally efficient Bayesian optimization of large gas engine prechambers design using CFD simulation. Reynolds-averaged-Navier-Stokes simulations are used to determine the target values as a function of the selected prechamber design parameters. The results indicate that the chosen strategy is effective to find a prechamber design that achieves the desired target values.
Submitted 3 August, 2023;
originally announced August 2023.
-
Information Plane Analysis for Dropout Neural Networks
Authors:
Linara Adilova,
Bernhard C. Geiger,
Asja Fischer
Abstract:
The information-theoretic framework promises to explain the predictive power of neural networks. In particular, the information plane analysis, which measures mutual information (MI) between input and representation as well as representation and output, should give rich insights into the training process. This approach, however, was shown to strongly depend on the choice of estimator of the MI. The problem is amplified for deterministic networks if the MI between input and representation is infinite. Thus, the estimated values are defined by the different approaches for estimation, but do not adequately represent the training process from an information-theoretic perspective. In this work, we show that dropout with continuously distributed noise ensures that MI is finite. We demonstrate in a range of experiments that this enables a meaningful information plane analysis for a class of dropout neural networks that is widely used in practice.
Submitted 1 March, 2023;
originally announced March 2023.
-
Cluster Purging: Efficient Outlier Detection based on Rate-Distortion Theory
Authors:
Maximilian B. Toller,
Bernhard C. Geiger,
Roman Kern
Abstract:
Rate-distortion theory-based outlier detection builds upon the rationale that a good data compression will encode outliers with unique symbols. Based on this rationale, we propose Cluster Purging, which is an extension of clustering-based outlier detection. This extension allows one to assess the representivity of clusterings, and to find data that are best represented by individual unique clusters. We propose two efficient algorithms for performing Cluster Purging, one being parameter-free, while the other algorithm has a parameter that controls representivity estimations, allowing it to be tuned in supervised setups. In an experimental evaluation, we show that Cluster Purging improves upon outliers detected from raw clusterings, and that Cluster Purging competes strongly against state-of-the-art alternatives.
Submitted 22 February, 2023;
originally announced February 2023.
-
Robust Bayesian Target Value Optimization
Authors:
Johannes G. Hoffer,
Sascha Ranftl,
Bernhard C. Geiger
Abstract:
We consider the problem of finding an input to a stochastic black box function such that the scalar output of the black box function is as close as possible to a target value in the sense of the expected squared error. While the optimization of stochastic black boxes is classic in (robust) Bayesian optimization, the current approaches based on Gaussian processes predominantly focus either on i) maximization/minimization rather than target value optimization or ii) on the expectation, but not the variance of the output, ignoring output variations due to stochasticity in uncontrollable environmental variables. In this work, we fill this gap and derive acquisition functions for common criteria such as the expected improvement, the probability of improvement, and the lower confidence bound, assuming that aleatoric effects are Gaussian with known variance. Our experiments illustrate that this setting is compatible with certain extensions of Gaussian processes, and show that the thus derived acquisition functions can outperform classical Bayesian optimization even if the latter assumptions are violated. An industrial use case in billet forging is presented.
Submitted 11 January, 2023;
originally announced January 2023.
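As a brief aside on the target-value objective described in the abstract above, the expected squared error to a target splits into a squared-bias term and a variance term. This is the standard identity, not a formula quoted from the paper; the symbols $Y(x)$ for the stochastic black-box output at input $x$ and $\tau$ for the target value are notation assumed here.

```latex
\mathbb{E}\!\left[(Y(x) - \tau)^2\right]
  = \left(\mathbb{E}[Y(x)] - \tau\right)^2 + \operatorname{Var}[Y(x)]
```

The second term is why an approach that only tracks the expectation of the output can miss inputs whose output varies strongly.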
-
FUNCK: Information Funnels and Bottlenecks for Invariant Representation Learning
Authors:
João Machado de Freitas,
Bernhard C. Geiger
Abstract:
Learning invariant representations that remain useful for a downstream task is still a key challenge in machine learning. We investigate a set of related information funnels and bottleneck problems that claim to learn invariant representations from the data. We also propose a new element to this family of information-theoretic objectives: The Conditional Privacy Funnel with Side Information, which we investigate in fully and semi-supervised settings. Given the generally intractable objectives, we derive tractable approximations using amortized variational inference parameterized by neural networks and study the intrinsic trade-offs of these objectives. We describe empirically the proposed approach and show that with a few labels it is possible to learn fair classifiers and generate useful representations approximately invariant to unwanted sources of variation. Furthermore, we provide insights about the applicability of these methods in real-world scenarios with ordinary tabular datasets when the data is scarce.
Submitted 2 November, 2022;
originally announced November 2022.
-
Compressed Hierarchical Representations for Multi-Task Learning and Task Clustering
Authors:
João Machado de Freitas,
Sebastian Berg,
Bernhard C. Geiger,
Manfred Mücke
Abstract:
In this paper, we frame homogeneous-feature multi-task learning (MTL) as a hierarchical representation learning problem, with one task-agnostic and multiple task-specific latent representations. Drawing inspiration from the information bottleneck principle and assuming an additive independent noise model between the task-agnostic and task-specific latent representations, we limit the information contained in each task-specific representation. It is shown that our resulting representations yield competitive performance for several MTL benchmarks. Furthermore, for certain setups, we show that the trained parameters of the additive noise model are closely related to the similarity of different tasks. This indicates that our approach yields a task-agnostic representation that is disentangled in the sense that its individual dimensions may be interpretable from a task-specific perspective.
Submitted 31 May, 2022;
originally announced May 2022.
-
Generating Simple Directed Social Network Graphs for Information Spreading
Authors:
Christoph Schweimer,
Christine Gfrerer,
Florian Lugstein,
David Pape,
Jan A. Velimsky,
Robert Elsässer,
Bernhard C. Geiger
Abstract:
Online social networks are a dominant medium in everyday life to stay in contact with friends and to share information. In Twitter, users can connect with other users by following them, who in turn can follow back. In recent years, researchers studied several properties of social networks and designed random graph models to describe them. Many of these approaches either focus on the generation of undirected graphs or on the creation of directed graphs without modeling the dependencies between reciprocal (i.e., two directed edges of opposite direction between two nodes) and directed edges. We propose an approach to generate directed social network graphs that creates reciprocal and directed edges and considers the correlation between the respective degree sequences.
Our model relies on crawled directed graphs in Twitter, on which information w.r.t. a topic is exchanged or disseminated. While these graphs exhibit a high clustering coefficient and small average distances between random node pairs (which is typical in real-world networks), their degree sequences seem to follow a $χ^2$-distribution rather than power law. To achieve high clustering coefficients, we apply an edge rewiring procedure that preserves the node degrees.
We compare the crawled and the created graphs, and simulate certain algorithms for information dissemination and epidemic spreading on them. The results show that the created graphs exhibit very similar topological and algorithmic properties as the real-world graphs, providing evidence that they can be used as surrogates in social network analysis. Furthermore, our model is highly scalable, which enables us to create graphs of arbitrary size with almost the same properties as the corresponding real-world networks.
Submitted 5 May, 2022;
originally announced May 2022.
-
Information-Theoretic Reduction of Markov Chains
Authors:
Bernhard C. Geiger
Abstract:
We survey information-theoretic approaches to the reduction of Markov chains. Our survey is structured in two parts: The first part considers Markov chain coarse graining, which focuses on projecting the Markov chain to a process on a smaller state space that is informative about certain quantities of interest. The second part considers Markov chain model reduction, which focuses on replacing the original Markov model by a simplified one that yields similar behavior as the original Markov model. We discuss the practical relevance of both approaches in the field of knowledge discovery and data mining by formulating problems of unsupervised machine learning as reduction problems of Markov chains. Finally, we briefly discuss the concept of lumpability, the phenomenon when a coarse graining yields a reduced Markov model.
Submitted 29 April, 2022;
originally announced April 2022.
-
On the Role of Fixed Points of Dynamical Systems in Training Physics-Informed Neural Networks
Authors:
Franz M. Rohrhofer,
Stefan Posch,
Clemens Gößnitzer,
Bernhard C. Geiger
Abstract:
This paper empirically studies commonly observed training difficulties of Physics-Informed Neural Networks (PINNs) on dynamical systems. Our results indicate that fixed points which are inherent to these systems play a key role in the optimization of the physics loss function embedded in PINNs. We observe that the loss landscape exhibits local optima that are shaped by the presence of fixed points. We find that these local optima contribute to the complexity of the physics loss optimization, which can explain common training difficulties and resulting nonphysical predictions. Under certain settings, e.g., initial conditions close to fixed points or long simulation times, we show that these local optima can even become better than the optimum corresponding to the desired solution.
Submitted 13 February, 2023; v1 submitted 25 March, 2022;
originally announced March 2022.
-
Knock Detection in Combustion Engine Time Series Using a Theory-Guided 1D Convolutional Neural Network Approach
Authors:
Andreas B. Ofner,
Achilles Kefalas,
Stefan Posch,
Bernhard C. Geiger
Abstract:
This paper introduces a method for the detection of knock occurrences in an internal combustion engine (ICE) using a 1D convolutional neural network trained on in-cylinder pressure data. The model architecture was based on considerations regarding the expected frequency characteristics of knocking combustion. To aid the feature extraction, all cycles were reduced to 60° CA long windows, with no further processing applied to the pressure traces. The neural networks were trained exclusively on in-cylinder pressure traces from multiple conditions and labels provided by human experts. The best-performing model architecture achieves an accuracy of above 92% on all test sets in a tenfold cross-validation when distinguishing between knocking and non-knocking cycles. In a multi-class problem where each cycle was labeled by the number of experts who rated it as knocking, 78% of cycles were labeled perfectly, while 90% of cycles were classified at most one class from ground truth. They thus considerably outperform the broadly applied MAPO (Maximum Amplitude of Pressure Oscillation) detection method, as well as other references reconstructed from previous works. Our analysis indicates that the neural network learned physically meaningful features connected to engine-characteristic resonance frequencies, thus verifying the intended theory-guided data science approach. Deeper performance investigation further shows remarkable generalization ability to unseen operating points. In addition, the model proved to classify knocking cycles in unseen engines with increased accuracy of 89% after adapting to their features via training on a small number of exclusively non-knocking cycles. The algorithm takes below 1 ms (on CPU) to classify individual cycles, effectively making it suitable for real-time engine control.
Submitted 18 January, 2022;
originally announced January 2022.
-
Semi-Supervised Clustering via Information-Theoretic Markov Chain Aggregation
Authors:
Sophie Steger,
Bernhard C. Geiger,
Marek Smieja
Abstract:
We connect the problem of semi-supervised clustering to constrained Markov aggregation, i.e., the task of partitioning the state space of a Markov chain. We achieve this connection by considering every data point in the dataset as an element of the Markov chain's state space, by defining the transition probabilities between states via similarities between corresponding data points, and by incorporating semi-supervision information as hard constraints in a Hartigan-style algorithm. The introduced Constrained Markov Clustering (CoMaC) is an extension of a recent information-theoretic framework for (unsupervised) Markov aggregation to the semi-supervised case. Instantiating CoMaC for certain parameter settings further generalizes two previous information-theoretic objectives for unsupervised clustering. Our results indicate that CoMaC is competitive with the state-of-the-art.
Submitted 7 February, 2022; v1 submitted 17 December, 2021;
originally announced December 2021.
-
Data vs. Physics: The Apparent Pareto Front of Physics-Informed Neural Networks
Authors:
Franz M. Rohrhofer,
Stefan Posch,
Clemens Gößnitzer,
Bernhard C. Geiger
Abstract:
Physics-informed neural networks (PINNs) have emerged as a promising deep learning method, capable of solving forward and inverse problems governed by differential equations. Despite their recent advance, it is widely acknowledged that PINNs are difficult to train and often require a careful tuning of loss weights when data and physics loss functions are combined by scalarization of a multi-objective (MO) problem. In this paper, we aim to understand how parameters of the physical system, such as characteristic length and time scales, the computational domain, and coefficients of differential equations affect MO optimization and the optimal choice of loss weights. Through a theoretical examination of where these system parameters appear in PINN training, we find that they effectively and individually scale the loss residuals, causing imbalances in MO optimization with certain choices of system parameters. The immediate effects of this are reflected in the apparent Pareto front, which we define as the set of loss values achievable with gradient-based training and visualize accordingly. We empirically verify that loss weights can be used successfully to compensate for the scaling of system parameters, and enable the selection of an optimal solution on the apparent Pareto front that aligns well with the physically valid solution. We further demonstrate that by altering the system parameterization, the apparent Pareto front can shift and exhibit locally convex parts, resulting in a wider range of loss weights for which gradient-based training becomes successful. This work explains the effects of system parameters on MO optimization in PINNs, and highlights the utility of proposed loss weighting schemes.
Submitted 10 June, 2024; v1 submitted 3 May, 2021;
originally announced May 2021.
-
Importance of feature engineering and database selection in a machine learning model: A case study on carbon crystal structures
Authors:
Franz M. Rohrhofer,
Santanu Saha,
Simone Di Cataldo,
Bernhard C. Geiger,
Wolfgang von der Linden,
Lilia Boeri
Abstract:
Drive towards improved performance of machine learning models has led to the creation of complex features representing a database of condensed matter systems. The complex features, however, do not offer an intuitive explanation on which physical attributes do improve the performance. The effect of the database on the performance of the trained model is often neglected. In this work we seek to understand in depth the effect that the choice of features and the properties of the database have on a machine learning application. In our experiments, we consider the complex phase space of carbon as a test case, for which we use a set of simple, human understandable and cheaply computable features for the aim of predicting the total energy of the crystal structure. Our study shows that (i) the performance of the machine learning model varies depending on the set of features and the database, (ii) is not transferable to every structure in the phase space and (iii) depends on how well structures are represented in the database.
Submitted 30 January, 2021;
originally announced February 2021.
-
Synwalk -- Community Detection via Random Walk Modelling
Authors:
Christian Toth,
Denis Helic,
Bernhard C. Geiger
Abstract:
Complex systems, abstractly represented as networks, are ubiquitous in everyday life. Analyzing and understanding these systems requires, among others, tools for community detection. As no single best community detection algorithm can exist, robustness across a wide variety of problem settings is desirable. In this work, we present Synwalk, a random walk-based community detection method. Synwalk builds upon a solid theoretical basis and detects communities by synthesizing the random walk induced by the given network from a class of candidate random walks. We thoroughly validate the effectiveness of our approach on synthetic and empirical networks, respectively, and compare Synwalk's performance with the performance of Infomap and Walktrap. Our results indicate that Synwalk performs robustly on networks with varying mixing parameters and degree distributions. We outperform Infomap on networks with high mixing parameter, and Infomap and Walktrap on networks with many small communities and low average degree. Our work has a potential to inspire further development of community detection via synthesis of random walks and we provide concrete ideas for future research.
Submitted 21 January, 2021;
originally announced January 2021.
-
A Formally Robust Time Series Distance Metric
Authors:
Maximilian Toller,
Bernhard C. Geiger,
Roman Kern
Abstract:
Distance-based classification is among the most competitive classification methods for time series data. The most critical component of distance-based classification is the selected distance function. Past research has proposed various different distance metrics or measures dedicated to particular aspects of real-world time series data, yet there is an important aspect that has not been considered so far: Robustness against arbitrary data contamination. In this work, we propose a novel distance metric that is robust against arbitrarily "bad" contamination and has a worst-case computational complexity of $\mathcal{O}(n\log n)$. We formally argue why our proposed metric is robust, and demonstrate in an empirical evaluation that the metric yields competitive classification accuracy when applied in k-Nearest Neighbor time series classification.
Submitted 18 August, 2020;
originally announced August 2020.
-
On Functions of Markov Random Fields
Authors:
Bernhard C. Geiger,
Ali Al-Bashabsheh
Abstract:
We derive two sufficient conditions for a function of a Markov random field (MRF) on a given graph to be a MRF on the same graph. The first condition is information-theoretic and parallels a recent information-theoretic characterization of lumpability of Markov chains. The second condition, which is easier to check, is based on the potential functions of the corresponding Gibbs field. We illustrate our sufficient conditions at the hand of several examples and discuss implications for practical applications of MRFs. As a side result, we give a partial characterization of functions of MRFs that are information-preserving.
Submitted 15 October, 2020; v1 submitted 28 May, 2020;
originally announced May 2020.
-
On Information Plane Analyses of Neural Network Classifiers -- A Review
Authors:
Bernhard C. Geiger
Abstract:
We review the current literature concerned with information plane analyses of neural network classifiers. While the underlying information bottleneck theory and the claim that information-theoretic compression is causally linked to generalization are plausible, empirical evidence was found to be both supporting and conflicting. We review this evidence together with a detailed analysis of how the respective information quantities were estimated. Our survey suggests that compression visualized in information planes is not necessarily information-theoretic, but is rather often compatible with geometric compression of the latent representations. This insight gives the information plane a renewed justification.
Aside from this, we shed light on the problem of estimating mutual information in deterministic neural networks and its consequences. Specifically, we argue that even in feed-forward neural networks the data processing inequality need not hold for estimates of mutual information. Similarly, while a fitting phase, in which the mutual information between the latent representation and the target increases, is necessary (but not sufficient) for good classification performance, depending on the specifics of mutual information estimation such a fitting phase need not be visible in the information plane.
Submitted 10 June, 2021; v1 submitted 21 March, 2020;
originally announced March 2020.
-
SeGMA: Semi-Supervised Gaussian Mixture Auto-Encoder
Authors:
Marek Śmieja,
Maciej Wołczyk,
Jacek Tabor,
Bernhard C. Geiger
Abstract:
We propose a semi-supervised generative model, SeGMA, which learns a joint probability distribution of data and their classes and which is implemented in a typical Wasserstein auto-encoder framework. We choose a mixture of Gaussians as a target distribution in latent space, which provides a natural splitting of data into clusters. To connect Gaussian components with correct classes, we use a small amount of labeled data and a Gaussian classifier induced by the target distribution. SeGMA is optimized efficiently due to the use of Cramer-Wold distance as a maximum mean discrepancy penalty, which yields a closed-form expression for a mixture of spherical Gaussian components and thus obviates the need of sampling. While SeGMA preserves all properties of its semi-supervised predecessors and achieves at least as good generative performance on standard benchmark data sets, it presents additional features: (a) interpolation between any pair of points in the latent space produces realistically-looking samples; (b) combining the interpolation property with disentangled class and style variables, SeGMA is able to perform a continuous style transfer from one class to another; (c) it is possible to change the intensity of class characteristics in a data point by moving the latent representation of the data point away from specific Gaussian components.
Submitted 27 August, 2020; v1 submitted 21 June, 2019;
originally announced June 2019.
-
Class-Conditional Compression and Disentanglement: Bridging the Gap between Neural Networks and Naive Bayes Classifiers
Authors:
Rana Ali Amjad,
Bernhard C. Geiger
Abstract:
In this draft, which reports on work in progress, we 1) adapt the information bottleneck functional by replacing the compression term by class-conditional compression, 2) relax this functional using a variational bound related to class-conditional disentanglement, 3) consider this functional as a training objective for stochastic neural networks, and 4) show that the latent representations are learned such that they can be used in a naive Bayes classifier. We continue by suggesting a series of experiments along the lines of Nonlinear Information Bottleneck [Kolchinsky et al., 2018], Deep Variational Information Bottleneck [Alemi et al., 2017], and Information Dropout [Achille and Soatto, 2018]. We furthermore suggest a neural network where the decoder architecture is a parameterized naive Bayes decoder.
Submitted 6 June, 2019;
originally announced June 2019.
-
A Short Note on the Jensen-Shannon Divergence between Simple Mixture Distributions
Authors:
Bernhard C. Geiger
Abstract:
This short note presents results about the symmetric Jensen-Shannon divergence between two discrete mixture distributions $p_1$ and $p_2$. Specifically, for $i=1,2$, $p_i$ is the mixture of a common distribution $q$ and a distribution $\tilde{p}_i$ with mixture proportion $λ_i$. In general, $\tilde{p}_1\neq \tilde{p}_2$ and $λ_1\neqλ_2$. We provide experimental and theoretical insight to the behavior of the symmetric Jensen-Shannon divergence between $p_1$ and $p_2$ as the mixture proportions or the divergence between $\tilde{p}_1$ and $\tilde{p}_2$ change. We also provide insight into scenarios where the supports of the distributions $\tilde{p}_1$, $\tilde{p}_2$, and $q$ do not coincide.
Submitted 6 December, 2018; v1 submitted 5 December, 2018;
originally announced December 2018.
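The mixture setup in the abstract above can be illustrated numerically. The following is a minimal sketch, not code from the note itself: it assumes the convention $p_i = (1-λ_i)\,q + λ_i\,\tilde{p}_i$ (the abstract does not say which component carries $λ_i$), uses base-2 logarithms, and the helper names `mix`, `kl`, and `jsd` are hypothetical.

```python
import math


def mix(q, p_tilde, lam):
    """Mixture (1 - lam) * q + lam * p_tilde (assumed convention)."""
    return [(1 - lam) * a + lam * b for a, b in zip(q, p_tilde)]


def kl(p, r):
    """Kullback-Leibler divergence D(p || r) in bits; zero-mass terms contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, r) if a > 0)


def jsd(p1, p2):
    """Symmetric Jensen-Shannon divergence with equal weights, in bits."""
    m = [(a + b) / 2 for a, b in zip(p1, p2)]
    return 0.5 * kl(p1, m) + 0.5 * kl(p2, m)


# Toy example: q, p_tilde_1 and p_tilde_2 have pairwise disjoint supports.
q = [0.5, 0.5, 0.0, 0.0]
p_tilde_1 = [0.0, 0.0, 1.0, 0.0]
p_tilde_2 = [0.0, 0.0, 0.0, 1.0]

lam = 0.3
p1 = mix(q, p_tilde_1, lam)
p2 = mix(q, p_tilde_2, lam)
divergence = jsd(p1, p2)
```

In this toy case the common part $q$ cancels in the divergence, so only the disjoint components contribute and the value scales with the mixture proportion. Other choices of supports and proportions can be explored by changing the three input vectors.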
-
Understanding Neural Networks and Individual Neuron Importance via Information-Ordered Cumulative Ablation
Authors:
Rana Ali Amjad,
Kairen Liu,
Bernhard C. Geiger
Abstract:
In this work, we investigate the use of three information-theoretic quantities -- entropy, mutual information with the class variable, and a class selectivity measure based on Kullback-Leibler divergence -- to understand and study the behavior of already trained fully-connected feed-forward neural networks. We analyze the connection between these information-theoretic quantities and classification performance on the test set by cumulatively ablating neurons in networks trained on MNIST, FashionMNIST, and CIFAR-10. Our results parallel those recently published by Morcos et al., indicating that class selectivity is not a good indicator for classification performance. However, looking at individual layers separately, both mutual information and class selectivity are positively correlated with classification performance, at least for networks with ReLU activation functions. We provide explanations for this phenomenon and conclude that it is ill-advised to compare the proposed information-theoretic quantities across layers. Furthermore, we show that cumulative ablation of neurons with ascending or descending information-theoretic quantities can be used to formulate hypotheses regarding the joint behavior of multiple neurons, such as redundancy and synergy, with comparably low computational cost. We also draw connections to the information bottleneck theory for neural networks.
Submitted 9 June, 2021; v1 submitted 18 April, 2018;
originally announced April 2018.
-
Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle
Authors:
Rana Ali Amjad,
Bernhard C. Geiger
Abstract:
In this theory paper, we investigate training deep neural networks (DNNs) for classification via minimizing the information bottleneck (IB) functional. We show that the resulting optimization problem suffers from two severe issues: First, for deterministic DNNs, either the IB functional is infinite for almost all values of network parameters, making the optimization problem ill-posed, or it is piecewise constant, hence not admitting gradient-based optimization methods. Second, the invariance of the IB functional under bijections prevents it from capturing properties of the learned representation that are desirable for classification, such as robustness and simplicity. We argue that these issues are partly resolved for stochastic DNNs, DNNs that include a (hard or soft) decision rule, or by replacing the IB functional with related, but more well-behaved cost functions. We conclude that recent successes reported about training DNNs using the IB framework must be attributed to such solutions. As a side effect, our results indicate limitations of the IB framework for the analysis of DNNs. We also note that rather than trying to repair the inherent problems in the IB functional, a better approach may be to design regularizers on latent representation enforcing the desired properties directly.
Submitted 11 April, 2019; v1 submitted 27 February, 2018;
originally announced February 2018.
-
Co-Clustering via Information-Theoretic Markov Aggregation
Authors:
Clemens Bloechl,
Rana Ali Amjad,
Bernhard C. Geiger
Abstract:
We present an information-theoretic cost function for co-clustering, i.e., for simultaneous clustering of two sets based on similarities between their elements. By constructing a simple random walk on the corresponding bipartite graph, our cost function is derived from a recently proposed generalized framework for information-theoretic Markov chain aggregation. The goal of our cost function is to minimize relevant information loss, hence it connects to the information bottleneck formalism. Moreover, via the connection to Markov aggregation, our cost function is not ad hoc, but inherits its justification from the operational qualities associated with the corresponding Markov aggregation problem. We furthermore show that, for appropriate parameter settings, our cost function is identical to well-known approaches from the literature, such as Information-Theoretic Co-Clustering of Dhillon et al. Hence, understanding the influence of this parameter admits a deeper understanding of the relationship between previously proposed information-theoretic cost functions. We highlight some strengths and weaknesses of the cost function for different parameters. We also illustrate the performance of our cost function, optimized with a simple sequential heuristic, on several synthetic and real-world data sets, including the Newsgroup20 and the MovieLens100k data sets.
Submitted 15 June, 2018; v1 submitted 2 January, 2018;
originally announced January 2018.
-
On the Information Dimension of Multivariate Gaussian Processes
Authors:
Bernhard C. Geiger,
Tobias Koch
Abstract:
The authors have recently defined the Rényi information dimension rate $d(\{X_t\})$ of a stationary stochastic process $\{X_t,\,t\in\mathbb{Z}\}$ as the entropy rate of the uniformly-quantized process divided by minus the logarithm of the quantizer step size $1/m$ in the limit as $m\to\infty$ (B. Geiger and T. Koch, "On the information dimension rate of stochastic processes," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Aachen, Germany, June 2017). For Gaussian processes with a given spectral distribution function $F_X$, they showed that the information dimension rate equals the Lebesgue measure of the set of harmonics where the derivative of $F_X$ is positive. This paper extends this result to multivariate Gaussian processes with a given matrix-valued spectral distribution function $F_{\mathbf{X}}$. It is demonstrated that the information dimension rate equals the average rank of the derivative of $F_{\mathbf{X}}$. As a side result, it is shown that the scale and translation invariance of information dimension carries over from random variables to stochastic processes.
Submitted 21 December, 2017;
originally announced December 2017.
-
A Generalized Framework for Kullback-Leibler Markov Aggregation
Authors:
Rana Ali Amjad,
Clemens Blöchl,
Bernhard C. Geiger
Abstract:
This paper proposes an information-theoretic cost function for aggregating a Markov chain via a (possibly stochastic) mapping. The cost function is motivated by two objectives: 1) The process obtained by observing the Markov chain through the mapping should be close to a Markov chain, and 2) the aggregated Markov chain should retain as much of the temporal dependence structure of the original Markov chain as possible. We discuss properties of this parameterized cost function and show that it contains the cost functions previously proposed by Deng et al., Xu et al., and Geiger et al. as special cases. We moreover discuss these special cases providing a better understanding and highlighting potential shortcomings: For example, the cost function proposed by Geiger et al. is tightly connected to approximate probabilistic bisimulation, but leads to trivial solutions if optimized without regularization. We furthermore propose a simple heuristic to optimize our cost function for deterministic aggregations and illustrate its performance on a set of synthetic examples.
Submitted 18 September, 2017;
originally announced September 2017.
-
Semi-supervised cross-entropy clustering with information bottleneck constraint
Authors:
Marek Śmieja,
Bernhard C. Geiger
Abstract:
In this paper, we propose a semi-supervised clustering method, CEC-IB, that models data with a set of Gaussian distributions and that retrieves clusters based on a partial labeling provided by the user (partition-level side information). By combining the ideas from cross-entropy clustering (CEC) with those from the information bottleneck method (IB), our method trades between three conflicting goals: the accuracy with which the data set is modeled, the simplicity of the model, and the consistency of the clustering with side information. Experiments demonstrate that CEC-IB has a performance comparable to Gaussian mixture models (GMM) in a classical semi-supervised scenario, but is faster, more robust to noisy labels, automatically determines the optimal number of clusters, and performs well when not all classes are present in the side information. Moreover, in contrast to other semi-supervised models, it can be successfully applied in discovering natural subgroups if the partition-level side information is derived from the top levels of a hierarchical clustering.
Submitted 3 May, 2017;
originally announced May 2017.
-
On the Information Dimension of Stochastic Processes
Authors:
Bernhard C. Geiger,
Tobias Koch
Abstract:
In 1959, Rényi proposed the information dimension and the $d$-dimensional entropy to measure the information content of general random variables. This paper proposes a generalization of information dimension to stochastic processes by defining the information dimension rate as the entropy rate of the uniformly-quantized stochastic process divided by minus the logarithm of the quantizer step size $1/m$ in the limit as $m\to\infty$. It is demonstrated that the information dimension rate coincides with the rate-distortion dimension, defined as twice the rate-distortion function $R(D)$ of the stochastic process divided by $-\log(D)$ in the limit as $D\downarrow 0$. It is further shown that, among all multivariate stationary processes with a given (matrix-valued) spectral distribution function (SDF), the Gaussian process has the largest information dimension rate, and that the information dimension rate of multivariate stationary Gaussian processes is given by the average rank of the derivative of the SDF. The presented results reveal that the fundamental limits of almost zero-distortion recovery via compressible signal pursuit and almost lossless analog compression are different in general.
Submitted 11 June, 2019; v1 submitted 2 February, 2017;
originally announced February 2017.
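The two quantities defined verbally in the abstract above can be written out as follows. This is a sketch of the definitions as stated there; the quantizer notation $[X]_m$, the symbol $H'$ for the entropy rate, and the name $\dim_R$ are assumptions introduced here, and all limits are assumed to exist.

```latex
\begin{align}
  % Uniform quantization with step size 1/m (assumed notation)
  [X]_m &:= \frac{\lfloor mX \rfloor}{m}, \\
  % Entropy rate of the quantized process
  H'\bigl(\{[X_t]_m\}\bigr) &:= \lim_{n\to\infty} \frac{1}{n}\,
      H\bigl([X_1]_m,\dots,[X_n]_m\bigr), \\
  % Information dimension rate: entropy rate divided by -\log(1/m) = \log m
  d(\{X_t\}) &:= \lim_{m\to\infty}
      \frac{H'\bigl(\{[X_t]_m\}\bigr)}{\log m}, \\
  % Rate-distortion dimension: twice R(D) divided by -\log D
  \dim_R(\{X_t\}) &:= \lim_{D\downarrow 0} \frac{2\,R(D)}{-\log D}.
\end{align}
```

With this notation, the coincidence result stated in the abstract reads $d(\{X_t\}) = \dim_R(\{X_t\})$.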
-
Divergence Scaling of Fixed-Length, Binary-Output, One-to-One Distribution Matching
Authors:
Patrick Schulte,
Bernhard C. Geiger
Abstract:
Distribution matching is the process of invertibly mapping a uniformly distributed input sequence onto sequences that approximate the output of a desired discrete memoryless source. The special case of a binary output alphabet and one-to-one mapping is studied. A fixed-length distribution matcher is proposed that is optimal in the sense of minimizing the unnormalized informational divergence between its output distribution and a binary memoryless target distribution. Upper and lower bounds on the unnormalized divergence are computed that increase logarithmically in the output block length $n$. It follows that a recently proposed constant composition distribution matcher performs within a constant gap of the minimal achievable informational divergence.
Submitted 16 May, 2017; v1 submitted 25 January, 2017;
originally announced January 2017.
-
A Sufficient Condition for a Unique Invariant Distribution of a Higher-Order Markov Chain
Authors:
Bernhard C. Geiger
Abstract:
We derive a sufficient condition for a $k$-th order homogeneous Markov chain $\mathbf{Z}$ with finite alphabet $\mathcal{Z}$ to have a unique invariant distribution on $\mathcal{Z}^k$. Specifically, let $\mathbf{X}$ be a first-order, stationary Markov chain with finite alphabet $\mathcal{X}$ and a single recurrent class, let $g{:}\ \mathcal{X}\to\mathcal{Z}$ be non-injective, and define the (possibly non-Markovian) process $\mathbf{Y}:=g(\mathbf{X})$ (where $g$ is applied coordinate-wise). If $\mathbf{Z}$ is the $k$-th order Markov approximation of $\mathbf{Y}$, its invariant distribution is unique. We generalize this to non-Markovian processes $\mathbf{X}$.
Submitted 7 April, 2017; v1 submitted 16 November, 2016;
originally announced November 2016.
-
A Rate-Distortion Approach to Caching
Authors:
Roy Timo,
Shirin Saeedi Bidokhti,
Michèle Wigger,
Bernhard C. Geiger
Abstract:
This paper takes a rate-distortion approach to understanding the information-theoretic laws governing cache-aided communications systems. Specifically, we characterise the optimal tradeoffs between the delivery rate, cache capacity and reconstruction distortions for a single-user problem and some special cases of a two-user problem. Our analysis considers discrete memoryless sources, expected- and excess-distortion constraints, and separable and f-separable distortion functions. We also establish a strong converse for separable-distortion functions, and we show that lossy versions of common information (Gács-Körner and Wyner) play an important role in caching. Finally, we illustrate and explicitly evaluate these laws for multivariate Gaussian sources and binary symmetric sources.
Submitted 24 October, 2016;
originally announced October 2016.
-
Hard Clusters Maximize Mutual Information
Authors:
Bernhard C. Geiger,
Rana Ali Amjad
Abstract:
In this paper, we investigate mutual information as a cost function for clustering, and show in which cases hard, i.e., deterministic, clusters are optimal. Using convexity properties of mutual information, we show that certain formulations of the information bottleneck problem are solved by hard clusters. Similarly, hard clusters are optimal for the information-theoretic co-clustering problem that deals with simultaneous clustering of two dependent data sets. If both data sets have to be clustered using the same cluster assignment, hard clusters are not optimal in general. We point at interesting and practically relevant special cases of this so-called pairwise clustering problem, for which we can either prove or have evidence that hard clusters are optimal. Our results thus show that one can relax the otherwise combinatorial hard clustering problem to a real-valued optimization problem with the same global optimum.
Submitted 17 August, 2016;
originally announced August 2016.
-
Higher-Order Kullback-Leibler Aggregation of Markov Chains
Authors:
Bernhard C. Geiger,
Yuchen Wu
Abstract:
We consider the problem of reducing a first-order Markov chain on a large alphabet to a higher-order Markov chain on a small alphabet. We present information-theoretic cost functions that are related to predictability and lumpability, show relations between these cost functions, and discuss heuristics to minimize them. Our experiments suggest that the generalization to higher orders is useful for model reduction in reliability analysis and natural language processing.
Submitted 16 August, 2016;
originally announced August 2016.
-
Greedy Algorithms for Optimal Distribution Approximation
Authors:
Bernhard C. Geiger,
Georg Böcherer
Abstract:
The approximation of a discrete probability distribution $\mathbf{t}$ by an $M$-type distribution $\mathbf{p}$ is considered. The approximation error is measured by the informational divergence $\mathbb{D}(\mathbf{t}\Vert\mathbf{p})$, which is an appropriate measure, e.g., in the context of data compression. Properties of the optimal approximation are derived and bounds on the approximation error are presented, which are asymptotically tight. It is shown that $M$-type approximations that minimize either $\mathbb{D}(\mathbf{t}\Vert\mathbf{p})$, or $\mathbb{D}(\mathbf{p}\Vert\mathbf{t})$, or the variational distance $\Vert\mathbf{p}-\mathbf{t}\Vert_1$ can all be found by using specific instances of the same general greedy algorithm.
Submitted 22 January, 2016;
originally announced January 2016.
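To make the notion of an $M$-type distribution in the abstract above concrete (every probability is an integer multiple of $1/M$), the following sketch rounds a target distribution with the largest-remainder rule. This is a standard apportionment heuristic swapped in for illustration; it is not claimed to be the greedy algorithm of the paper, and the function name `m_type_round` is hypothetical. Exact rational arithmetic is used to avoid floating-point ties.

```python
import math
from fractions import Fraction


def m_type_round(t, M):
    """Return integer counts c with sum(c) == M, so that p_i = c_i / M is M-type.

    Largest-remainder rounding: take the floor of M * t_i, then hand the
    leftover units to the entries with the largest fractional parts.
    """
    scaled = [M * ti for ti in t]
    counts = [math.floor(s) for s in scaled]
    leftover = M - sum(counts)
    # Indices sorted by fractional part, largest first (stable for ties).
    order = sorted(range(len(t)), key=lambda i: scaled[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Target distribution t and resolution M (illustrative values).
t = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
counts = m_type_round(t, 4)
p = [Fraction(c, 4) for c in counts]
```

The resulting `p` is a valid $4$-type distribution. Comparing it against `t` under different error measures, such as the informational divergence or the variational distance named in the abstract, is left to the reader.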
-
Graph-Based Lossless Markov Lumpings
Authors:
Bernhard C. Geiger,
Christoph Hofer-Temmel
Abstract:
We use results from zero-error information theory to determine the set of non-injective functions through which a Markov chain can be projected without losing information. These lumping functions can be found by clique partitioning of a graph related to the Markov chain. Lossless lumping is made possible by exploiting the (sufficiently sparse) temporal structure of the Markov chain. Eliminating edges in the transition graph of the Markov chain trades the required output alphabet size versus information loss, for which we present bounds.
Submitted 22 January, 2016; v1 submitted 22 September, 2015;
originally announced September 2015.
-
The Fractality of Polar and Reed-Muller Codes
Authors:
Bernhard C. Geiger
Abstract:
The generator matrices of polar codes and Reed-Muller codes are obtained by selecting rows from the Kronecker product of a lower-triangular binary square matrix. For polar codes, the selection is based on the Bhattacharyya parameter of the row, which is closely related to the error probability of the corresponding input bit under sequential decoding. For Reed-Muller codes, the selection is based on the Hamming weight of the row. This work investigates the properties of the index sets pointing to those rows in the infinite blocklength limit. In particular, the Lebesgue measure, the Hausdorff dimension, and the self-similarity of these sets will be discussed. It is shown that these index sets have several properties that are common to fractals.
Submitted 25 February, 2016; v1 submitted 17 June, 2015;
originally announced June 2015.
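The row selection described in the entry above can be made concrete for a small blocklength. The sketch below assumes the common kernel $F = \begin{pmatrix}1 & 0\\ 1 & 1\end{pmatrix}$ (the abstract only says "a lower-triangular binary square matrix") and uses the hypothetical helper names `kron_power_rows` and `reed_muller_rows`. It covers only the Reed-Muller rule (selection by Hamming weight); the Bhattacharyya-based polar rule is not implemented.

```python
def kron_power_rows(n):
    """Rows of the n-fold Kronecker power of F = [[1, 0], [1, 1]]."""
    F = [[1, 0], [1, 1]]
    rows = [[1]]
    for _ in range(n):
        # F (x) A: each row of F scales a full copy of each row of A.
        rows = [[a * b for a in f_row for b in row] for f_row in F for row in rows]
    return rows


def reed_muller_rows(n, r):
    """Indices of rows with Hamming weight >= 2**(n - r), i.e. the RM(r, n) selection."""
    min_weight = 2 ** (n - r)
    return [i for i, row in enumerate(kron_power_rows(n)) if sum(row) >= min_weight]


rows = kron_power_rows(3)
selected = reed_muller_rows(3, 1)

# The weight of row i equals 2 raised to the number of ones in the binary expansion of i.
for i, row in enumerate(rows):
    assert sum(row) == 2 ** bin(i).count("1")

print(selected)  # [3, 5, 6, 7]
```

Because the row weight depends only on the binary digits of the row index, the selected index set inherits a digit-based structure, which is the kind of self-similarity the abstract studies in the infinite blocklength limit.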
-
Cepstral Analysis of Random Variables: Muculants
Authors:
Christian Knoll,
Bernhard C. Geiger,
Gernot Kubin
Abstract:
An alternative parametric description for discrete random variables, called muculants, is proposed. In contrast to cumulants, muculants are based on the Fourier series expansion, rather than on the Taylor series expansion, of the logarithm of the characteristic function. We utilize results from cepstral theory to derive elementary properties of muculants, some of which demonstrate behavior superior to those of cumulants. For example, muculants and cumulants are both additive. While the existence of cumulants is linked to how often the characteristic function is differentiable, all muculants exist if the characteristic function satisfies a Paley-Wiener condition. Moreover, the muculant sequence and, if the random variable has finite expectation, the reconstruction of the characteristic function from its muculants converge. We furthermore develop a connection between muculants and cumulants and present the muculants of selected discrete random variables. Specifically, it is shown that the Poisson distribution is the only distribution where only the first two muculants are nonzero.
Submitted 13 November, 2017; v1 submitted 15 June, 2015;
originally announced June 2015.
-
Non-constrictive bead immobilization leading to decreased and uniform shear stress in microfluidic bead-based ELISA
Authors:
Kinshuk Mitra,
Brett C. Geiger,
Preethi Chidambaram,
Aaron P. Maharry,
Ronald X. Xu,
Michael F. Tweedle
Abstract:
Microfluidic biosensors have been used to sense a wide range of antigens in numerous configurations. Bead-based microfluidic sensors are a popular modality owing to the plug-and-play nature of analyte choice and the favorable geometry of spherical sensor scaffolds. While constricting beads against fluid flow remains a popular method of immobilizing the sensor, it results in poor fluidic regimes and shear conditions around the sensor beads that can affect sensor performance. We present an alternative means of sensor bead immobilization using a polycarbonate membrane. This system results in several orders of magnitude lower variance of flow radially around the sensor bead. The shear stress experienced by our non-constrictively immobilized bead was three orders of magnitude lower. We demonstrate the ability to quantitatively sense EpCAM protein, a marker for cancer stem cells, and operation under both far-red and green wavelengths with no auto-fluorescence.
Submitted 3 December, 2014;
originally announced December 2014.
-
Information Loss and Anti-Aliasing Filters in Multirate Systems
Authors:
Bernhard C. Geiger,
Gernot Kubin
Abstract:
This work investigates the information loss in a decimation system, i.e., in a downsampler preceded by an anti-aliasing filter. It is shown that, without a specific signal model in mind, the anti-aliasing filter cannot reduce information loss, while, e.g., for a simple signal-plus-noise model it can. For the Gaussian case, the optimal anti-aliasing filter is shown to coincide with the one obtained from energetic considerations. For a non-Gaussian signal corrupted by Gaussian noise, the Gaussian assumption yields an upper bound on the information loss, justifying filter design principles based on second-order statistics from an information-theoretic point-of-view.
Submitted 7 July, 2014; v1 submitted 31 October, 2013;
originally announced October 2013.
-
Optimal Quantization for Distribution Synthesis
Authors:
Georg Böcherer,
Bernhard C. Geiger
Abstract:
Finite precision approximations of discrete probability distributions are considered, applicable for distribution synthesis, e.g., probabilistic shaping. Two algorithms are presented that find the optimal $M$-type approximation $Q$ of a distribution $P$ in terms of the variational distance $\Vert Q-P\Vert_1$ and the informational divergence $\mathbb{D}(Q\Vert P)$. Bounds on the approximation errors are derived and shown to be asymptotically tight. Several examples illustrate that the variational distance optimal approximation can be quite different from the informational divergence optimal approximation.
Submitted 19 January, 2016; v1 submitted 25 July, 2013;
originally announced July 2013.
-
Optimal Kullback-Leibler Aggregation via Information Bottleneck
Authors:
Bernhard C. Geiger,
Tatjana Petrov,
Gernot Kubin,
Heinz Koeppl
Abstract:
In this paper, we present a method for reducing a regular, discrete-time Markov chain (DTMC) to another DTMC with a given, typically much smaller number of states. The cost of reduction is defined as the Kullback-Leibler divergence rate between a projection of the original process through a partition function and a DTMC on the correspondingly partitioned state space. Finding the reduced model with minimal cost is computationally expensive, as it requires an exhaustive search among all state space partitions and an exact evaluation of the reduction cost for each candidate partition. Our approach deals with the latter problem by minimizing an upper bound on the reduction cost instead of minimizing the exact cost. The proposed upper bound is easy to compute, and it is tight if the original chain is lumpable with respect to the partition. Then, we express the problem in the form of information bottleneck optimization and propose using the agglomerative information bottleneck algorithm to search for a sub-optimal partition greedily rather than exhaustively. The theory is illustrated with examples and one application scenario in the context of modeling bio-molecular interactions.
Submitted 10 February, 2015; v1 submitted 24 April, 2013;
originally announced April 2013.
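The entry above states that its upper bound is tight when the chain is lumpable with respect to the partition. As a rough illustration of that condition, the sketch below checks the classical row-sum criterion for strong lumpability: for every pair of blocks, all states in the source block must have the same total transition probability into the target block. The function name `is_strongly_lumpable`, the tolerance, and the example matrix are assumptions made for this sketch and do not come from the paper.

```python
def is_strongly_lumpable(P, partition, tol=1e-12):
    """Check the row-sum criterion for lumpability of transition matrix P.

    P is a list of rows (each row sums to 1); partition is a list of blocks,
    each block a list of state indices.
    """
    for block_to in partition:
        for block_from in partition:
            # Probability of jumping into block_to from each state in block_from.
            sums = [sum(P[i][j] for j in block_to) for i in block_from]
            if max(sums) - min(sums) > tol:
                return False
    return True


# Hypothetical 3-state chain used only for illustration.
P = [
    [0.5, 0.25, 0.25],
    [0.2, 0.4, 0.4],
    [0.2, 0.3, 0.5],
]

print(is_strongly_lumpable(P, [[0], [1, 2]]))  # True: states 1 and 2 behave alike blockwise
print(is_strongly_lumpable(P, [[0, 1], [2]]))  # False: states 0 and 1 differ in mass sent to {2}
```

Checking one partition costs only a pass over the matrix. The expensive part the abstract refers to is the search over all partitions, which is why a greedy agglomerative search is proposed there.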
-
On the Rate of Information Loss in Memoryless Systems
Authors:
Bernhard C. Geiger,
Gernot Kubin
Abstract:
In this work we present results about the rate of (relative) information loss induced by passing a real-valued, stationary stochastic process through a memoryless system. We show that for a special class of systems the information loss rate is closely related to the difference of differential entropy rates of the input and output processes. It is further shown that the rate of (relative) information loss is bounded from above by the (relative) information loss the system induces on a random variable distributed according to the process's marginal distribution.
As a side result, in this work we present sufficient conditions such that for a continuous-valued Markovian input process also the output process possesses the Markov property.
Submitted 18 April, 2013;
originally announced April 2013.
-
Information-Preserving Markov Aggregation
Authors:
Bernhard C. Geiger,
Christoph Temmel
Abstract:
We present a sufficient condition for a non-injective function of a Markov chain to be a second-order Markov chain with the same entropy rate as the original chain. This permits an information-preserving state space reduction by merging states or, equivalently, lossless compression of a Markov source on a sample-by-sample basis. The cardinality of the reduced state space is bounded from below by the node degrees of the transition graph associated with the original Markov chain.
We also present an algorithm listing all possible information-preserving state space reductions, for a given transition graph. We illustrate our results by applying the algorithm to a bi-gram letter model of an English text.
Submitted 24 July, 2013; v1 submitted 3 April, 2013;
originally announced April 2013.
-
Information Measures for Deterministic Input-Output Systems
Authors:
Bernhard C. Geiger,
Gernot Kubin
Abstract:
In this work the information loss in deterministic, memoryless systems is investigated by evaluating the conditional entropy of the input random variable given the output random variable. It is shown that for a large class of systems the information loss is finite, even if the input is continuously distributed. Based on this finiteness, the problem of perfectly reconstructing the input is addressed and Fano-type bounds between the information loss and the reconstruction error probability are derived.
For systems with infinite information loss a relative measure is defined and shown to be tightly related to Rényi information dimension. Employing another Fano-type argument, the reconstruction error probability is bounded by the relative information loss from below.
In view of developing a system theory from an information-theoretic point-of-view, the theoretical results are illustrated by a few example systems, among them a multi-channel autocorrelation receiver.
Submitted 17 April, 2013; v1 submitted 26 March, 2013;
originally announced March 2013.
-
Lumpings of Markov chains, entropy rate preservation, and higher-order lumpability
Authors:
Bernhard C. Geiger,
Christoph Temmel
Abstract:
A lumping of a Markov chain is a coordinate-wise projection of the chain. We characterise the entropy rate preservation of a lumping of an aperiodic and irreducible Markov chain on a finite state space by the random growth rate of the cardinality of the realisable preimage of a finite-length trajectory of the lumped chain and by the information needed to reconstruct original trajectories from their lumped images. Both are purely combinatorial criteria, depending only on the transition graph of the Markov chain and the lumping function. A lumping is strongly k-lumpable, iff the lumped process is a k-th order Markov chain for each starting distribution of the original Markov chain. We characterise strong k-lumpability via tightness of stationary entropic bounds. In the sparse setting, we give sufficient conditions on the lumping to both preserve the entropy rate and be strongly k-lumpable.
Submitted 20 April, 2015; v1 submitted 18 December, 2012;
originally announced December 2012.
-
Signal Enhancement as Minimization of Relevant Information Loss
Authors:
Bernhard C. Geiger,
Gernot Kubin
Abstract:
We introduce the notion of relevant information loss for the purpose of casting the signal enhancement problem in information-theoretic terms. We show that many algorithms from machine learning can be reformulated using relevant information loss, which allows their application to the aforementioned problem. As a particular example, we analyze principal component analysis for dimensionality reduction, discuss its optimality, and show that the relevant information loss can indeed vanish if the relevant information is concentrated on a lower-dimensional subspace of the input space.
Submitted 16 January, 2013; v1 submitted 31 May, 2012;
originally announced May 2012.
-
Relative Information Loss in the PCA
Authors:
Bernhard C. Geiger,
Gernot Kubin
Abstract:
In this work we analyze principal component analysis (PCA) as a deterministic input-output system. We show that the relative information loss induced by reducing the dimensionality of the data after performing the PCA is the same as in dimensionality reduction without PCA. Finally, we analyze the case where the PCA uses the sample covariance matrix to compute the rotation. If the rotation matrix is not available at the output, we show that an infinite amount of information is lost. The relative information loss is shown to decrease with increasing sample size.
Submitted 31 July, 2012; v1 submitted 2 April, 2012;
originally announced April 2012.