
Activation Bottleneck: Sigmoidal Neural Networks Cannot Forecast a Straight Line

Authors: Maximilian Toller, Hussain Hussain, Bernhard C Geiger

Abstract: A neural network has an activation bottleneck if one of its hidden layers has a bounded image. We show that networks with an activation bottleneck cannot forecast unbounded sequences such as straight lines, random walks, or any sequence with a trend: The difference between prediction and ground truth becomes arbitrarily large, regardless of the training procedure. Widely used neural network architectures such as LSTM and GRU suffer from this limitation. In our analysis, we characterize activation bottlenecks and explain why they prevent sigmoidal networks from learning unbounded sequences. We experimentally validate our findings and discuss modifications to network architectures which mitigate the effects of activation bottlenecks.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2402.09090

Software in the natural world: A computational approach to hierarchical emergence

Authors: Fernando E. Rosas, Bernhard C. Geiger, Andrea I Luppi, Anil K. Seth, Daniel Polani, Michael Gastpar, Pedro A. M. Mediano

Abstract: Understanding the functional architecture of complex systems is crucial to illuminate their inner workings and enable effective methods for their prediction and control. Recent advances have introduced tools to characterise emergent macroscopic levels; however, while these approaches are successful in identifying when emergence takes place, they are limited in the extent they can determine how it does. Here we address this limitation by developing a computational approach to emergence, which characterises macroscopic processes in terms of their computational capabilities. Concretely, we articulate a view on emergence based on how software works, which is rooted in a mathematical formalism that articulates how macroscopic processes can express self-contained informational, interventional, and computational properties. This framework establishes a hierarchy of nested self-contained processes that determines what computations take place at what level, which in turn delineates the functional architecture of a complex system. This approach is illustrated on paradigmatic models from the statistical physics and computational neuroscience literature, which are shown to exhibit macroscopic processes that are akin to software in human-engineered systems. Overall, this framework enables a deeper understanding of the multi-level structure of complex systems, revealing specific ways in which they can be efficiently simulated, predicted, and controlled.

Submitted 5 June, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

Comments: 33 pages, 13 figures

arXiv:2402.08313

Approximating Families of Sharp Solutions to Fisher's Equation with Physics-Informed Neural Networks

Authors: Franz M. Rohrhofer, Stefan Posch, Clemens Gößnitzer, Bernhard C. Geiger

Abstract: This paper employs physics-informed neural networks (PINNs) to solve Fisher's equation, a fundamental representation of a reaction-diffusion system with both simplicity and significance. The focus lies specifically in investigating Fisher's equation under conditions of large reaction rate coefficients, wherein solutions manifest as traveling waves, posing a challenge for numerical methods due to the occurring steepness of the wave front. To address optimization challenges associated with the standard PINN approach, a residual weighting scheme is introduced. This scheme is designed to enhance the tracking of propagating wave fronts by considering the reaction term in the reaction-diffusion equation. Furthermore, a specific network architecture is studied which is tailored for solutions in the form of traveling waves. Lastly, the capacity of PINNs to approximate an entire family of solutions is assessed by incorporating the reaction rate coefficient as an additional input to the network architecture. This modification enables the approximation of the solution across a broad and continuous range of reaction rate coefficients, thus solving a class of reaction-diffusion systems using a single PINN instance.

Submitted 13 February, 2024; originally announced February 2024.

Comments: 14 pages, 7 figures

arXiv:2308.01954

Bringing Chemistry to Scale: Loss Weight Adjustment for Multivariate Regression in Deep Learning of Thermochemical Processes

Authors: Franz M. Rohrhofer, Stefan Posch, Clemens Gößnitzer, José M. García-Oliver, Bernhard C. Geiger

Abstract: Flamelet models are widely used in computational fluid dynamics to simulate thermochemical processes in turbulent combustion. These models typically employ memory-expensive lookup tables that are predetermined and represent the combustion process to be simulated. Artificial neural networks (ANNs) offer a deep learning approach that can store this tabular data using a small number of network weights, potentially reducing the memory demands of complex simulations by orders of magnitude. However, ANNs with standard training losses often struggle with underrepresented targets in multivariate regression tasks, e.g., when learning minor species mass fractions as part of lookup tables. This paper seeks to improve the accuracy of an ANN when learning multiple species mass fractions of a hydrogen (H$_2$) combustion lookup table. We assess a simple, yet effective loss weight adjustment that outperforms the standard mean-squared error optimization and enables accurate learning of all species mass fractions, even of minor species where the standard optimization completely fails. Furthermore, we find that the loss weight adjustment leads to more balanced gradients in the network training, which explains its effectiveness.

Submitted 3 August, 2023; originally announced August 2023.

Comments: 8 pages. Part of Scientific Computing 2023 Conference Proceedings (ISBN e-Book: 978-3-903318-20-5)

arXiv:2308.01743

Finding the Optimum Design of Large Gas Engines Prechambers Using CFD and Bayesian Optimization

Authors: Stefan Posch, Clemens Gößnitzer, Franz Rohrhofer, Bernhard C. Geiger, Andreas Wimmer

Abstract: The turbulent jet ignition concept using prechambers is a promising solution to achieve stable combustion at lean conditions in large gas engines, leading to high efficiency at low emission levels. Due to the wide range of design and operating parameters for large gas engine prechambers, the preferred method for evaluating different designs is computational fluid dynamics (CFD), as testing in test bed measurement campaigns is time-consuming and expensive. However, the significant computational time required for detailed CFD simulations due to the complexity of solving the underlying physics also limits its applicability. In optimization settings similar to the present case, i.e., where the evaluation of the objective function(s) is computationally costly, Bayesian optimization has largely replaced classical design-of-experiment. Thus, the present study deals with the computationally efficient Bayesian optimization of large gas engine prechambers design using CFD simulation. Reynolds-averaged-Navier-Stokes simulations are used to determine the target values as a function of the selected prechamber design parameters. The results indicate that the chosen strategy is effective to find a prechamber design that achieves the desired target values.

Submitted 3 August, 2023; originally announced August 2023.

Comments: 9 pages. Part of Scientific Computing 2023 Conference Proceedings (ISBN e-Book: 978-3-903318-20-5)

arXiv:2303.00596

Information Plane Analysis for Dropout Neural Networks

Authors: Linara Adilova, Bernhard C. Geiger, Asja Fischer

Abstract: The information-theoretic framework promises to explain the predictive power of neural networks. In particular, the information plane analysis, which measures mutual information (MI) between input and representation as well as representation and output, should give rich insights into the training process. This approach, however, was shown to strongly depend on the choice of estimator of the MI. The problem is amplified for deterministic networks if the MI between input and representation is infinite. Thus, the estimated values are defined by the different approaches for estimation, but do not adequately represent the training process from an information-theoretic perspective. In this work, we show that dropout with continuously distributed noise ensures that MI is finite. We demonstrate in a range of experiments that this enables a meaningful information plane analysis for a class of dropout neural networks that is widely used in practice.

Submitted 1 March, 2023; originally announced March 2023.

Comments: Published as a conference paper at ICLR2023

arXiv:2302.11234

doi 10.1109/TKDE.2021.3103571

Cluster Purging: Efficient Outlier Detection based on Rate-Distortion Theory

Authors: Maximilian B. Toller, Bernhard C. Geiger, Roman Kern

Abstract: Rate-distortion theory-based outlier detection builds upon the rationale that a good data compression will encode outliers with unique symbols. Based on this rationale, we propose Cluster Purging, which is an extension of clustering-based outlier detection. This extension allows one to assess the representivity of clusterings, and to find data that are best represented by individual unique clusters. We propose two efficient algorithms for performing Cluster Purging, one being parameter-free, while the other algorithm has a parameter that controls representivity estimations, allowing it to be tuned in supervised setups. In an experimental evaluation, we show that Cluster Purging improves upon outliers detected from raw clusterings, and that Cluster Purging competes strongly against state-of-the-art alternatives.

Submitted 22 February, 2023; originally announced February 2023.

Journal ref: IEEE Transactions on Knowledge and Data Engineering 35 (2023) 1270-1282

arXiv:2301.04344

doi 10.1016/j.cie.2023.109279

Robust Bayesian Target Value Optimization

Authors: Johannes G. Hoffer, Sascha Ranftl, Bernhard C. Geiger

Abstract: We consider the problem of finding an input to a stochastic black box function such that the scalar output of the black box function is as close as possible to a target value in the sense of the expected squared error. While the optimization of stochastic black boxes is classic in (robust) Bayesian optimization, the current approaches based on Gaussian processes predominantly focus either on i) maximization/minimization rather than target value optimization or ii) on the expectation, but not the variance of the output, ignoring output variations due to stochasticity in uncontrollable environmental variables. In this work, we fill this gap and derive acquisition functions for common criteria such as the expected improvement, the probability of improvement, and the lower confidence bound, assuming that aleatoric effects are Gaussian with known variance. Our experiments illustrate that this setting is compatible with certain extensions of Gaussian processes, and show that the thus derived acquisition functions can outperform classical Bayesian optimization even if the latter assumptions are violated. An industrial use case in billet forging is presented.

Submitted 11 January, 2023; originally announced January 2023.

Comments: 24 pages; submitted to Computers and Industrial Engineering

MSC Class: 90C26; 60G15

ACM Class: G.1.6

Journal ref: Computers & Industrial Engineering, vol. 180, 2023, 109279

arXiv:2211.01446

FUNCK: Information Funnels and Bottlenecks for Invariant Representation Learning

Authors: João Machado de Freitas, Bernhard C. Geiger

Abstract: Learning invariant representations that remain useful for a downstream task is still a key challenge in machine learning. We investigate a set of related information funnels and bottleneck problems that claim to learn invariant representations from the data. We also propose a new element to this family of information-theoretic objectives: The Conditional Privacy Funnel with Side Information, which we investigate in fully and semi-supervised settings. Given the generally intractable objectives, we derive tractable approximations using amortized variational inference parameterized by neural networks and study the intrinsic trade-offs of these objectives. We describe empirically the proposed approach and show that with a few labels it is possible to learn fair classifiers and generate useful representations approximately invariant to unwanted sources of variation. Furthermore, we provide insights about the applicability of these methods in real-world scenarios with ordinary tabular datasets when the data is scarce.

Submitted 2 November, 2022; originally announced November 2022.

Comments: 28 pages

arXiv:2205.15882

doi 10.1109/IJCNN55064.2022.9892342

Compressed Hierarchical Representations for Multi-Task Learning and Task Clustering

Authors: João Machado de Freitas, Sebastian Berg, Bernhard C. Geiger, Manfred Mücke

Abstract: In this paper, we frame homogeneous-feature multi-task learning (MTL) as a hierarchical representation learning problem, with one task-agnostic and multiple task-specific latent representations. Drawing inspiration from the information bottleneck principle and assuming an additive independent noise model between the task-agnostic and task-specific latent representations, we limit the information contained in each task-specific representation. It is shown that our resulting representations yield competitive performance for several MTL benchmarks. Furthermore, for certain setups, we show that the trained parameters of the additive noise model are closely related to the similarity of different tasks. This indicates that our approach yields a task-agnostic representation that is disentangled in the sense that its individual dimensions may be interpretable from a task-specific perspective.

Submitted 31 May, 2022; originally announced May 2022.

Comments: Accepted by the 2022 International Joint Conference on Neural Networks (IJCNN 2022)

Journal ref: 2022 International Joint Conference on Neural Networks (IJCNN), 2022

arXiv:2205.02485

doi 10.1145/3485447.3512194

Generating Simple Directed Social Network Graphs for Information Spreading

Authors: Christoph Schweimer, Christine Gfrerer, Florian Lugstein, David Pape, Jan A. Velimsky, Robert Elsässer, Bernhard C. Geiger

Abstract: Online social networks are a dominant medium in everyday life to stay in contact with friends and to share information. In Twitter, users can connect with other users by following them, who in turn can follow back. In recent years, researchers studied several properties of social networks and designed random graph models to describe them. Many of these approaches either focus on the generation of undirected graphs or on the creation of directed graphs without modeling the dependencies between reciprocal (i.e., two directed edges of opposite direction between two nodes) and directed edges. We propose an approach to generate directed social network graphs that creates reciprocal and directed edges and considers the correlation between the respective degree sequences. Our model relies on crawled directed graphs in Twitter, on which information w.r.t. a topic is exchanged or disseminated. While these graphs exhibit a high clustering coefficient and small average distances between random node pairs (which is typical in real-world networks), their degree sequences seem to follow a $\chi^2$-distribution rather than a power law. To achieve high clustering coefficients, we apply an edge rewiring procedure that preserves the node degrees. We compare the crawled and the created graphs, and simulate certain algorithms for information dissemination and epidemic spreading on them. The results show that the created graphs exhibit very similar topological and algorithmic properties as the real-world graphs, providing evidence that they can be used as surrogates in social network analysis. Furthermore, our model is highly scalable, which enables us to create graphs of arbitrary size with almost the same properties as the corresponding real-world networks.

Submitted 5 May, 2022; originally announced May 2022.

Comments: 11 pages, 7 figures; published at ACM Web Conference 2022

Journal ref: Proc. ACM Web Conf., p. 1475-1485, Apr. 2022

arXiv:2204.13896

Information-Theoretic Reduction of Markov Chains

Authors: Bernhard C. Geiger

Abstract: We survey information-theoretic approaches to the reduction of Markov chains. Our survey is structured in two parts: The first part considers Markov chain coarse graining, which focuses on projecting the Markov chain to a process on a smaller state space that is informative about certain quantities of interest. The second part considers Markov chain model reduction, which focuses on replacing the original Markov model by a simplified one that yields similar behavior as the original Markov model. We discuss the practical relevance of both approaches in the field of knowledge discovery and data mining by formulating problems of unsupervised machine learning as reduction problems of Markov chains. Finally, we briefly discuss the concept of lumpability, the phenomenon when a coarse graining yields a reduced Markov model.

Submitted 29 April, 2022; originally announced April 2022.

Comments: 16 pages, 3 figures; survey paper

MSC Class: 60J10; 94A16

arXiv:2203.13648

On the Role of Fixed Points of Dynamical Systems in Training Physics-Informed Neural Networks

Authors: Franz M. Rohrhofer, Stefan Posch, Clemens Gößnitzer, Bernhard C. Geiger

Abstract: This paper empirically studies commonly observed training difficulties of Physics-Informed Neural Networks (PINNs) on dynamical systems. Our results indicate that fixed points which are inherent to these systems play a key role in the optimization of the physics loss function embedded in PINNs. We observe that the loss landscape exhibits local optima that are shaped by the presence of fixed points. We find that these local optima contribute to the complexity of the physics loss optimization which can explain common training difficulties and resulting nonphysical predictions. Under certain settings, e.g., initial conditions close to fixed points or long simulation times, we show that those optima can even become better than that of the desired solution.

Submitted 13 February, 2023; v1 submitted 25 March, 2022; originally announced March 2022.

Comments: 22 pages

Journal ref: Transactions on Machine Learning Research, 2023(1)

arXiv:2201.06990

doi 10.1109/TMECH.2022.3144832

Knock Detection in Combustion Engine Time Series Using a Theory-Guided 1D Convolutional Neural Network Approach

Authors: Andreas B. Ofner, Achilles Kefalas, Stefan Posch, Bernhard C. Geiger

Abstract: This paper introduces a method for the detection of knock occurrences in an internal combustion engine (ICE) using a 1D convolutional neural network trained on in-cylinder pressure data. The model architecture was based on considerations regarding the expected frequency characteristics of knocking combustion. To aid the feature extraction, all cycles were reduced to 60° CA long windows, with no further processing applied to the pressure traces. The neural networks were trained exclusively on in-cylinder pressure traces from multiple conditions and labels provided by human experts. The best-performing model architecture achieves an accuracy of above 92% on all test sets in a tenfold cross-validation when distinguishing between knocking and non-knocking cycles. In a multi-class problem where each cycle was labeled by the number of experts who rated it as knocking, 78% of cycles were labeled perfectly, while 90% of cycles were classified at most one class from ground truth. They thus considerably outperform the broadly applied MAPO (Maximum Amplitude of Pressure Oscillation) detection method, as well as other references reconstructed from previous works. Our analysis indicates that the neural network learned physically meaningful features connected to engine-characteristic resonance frequencies, thus verifying the intended theory-guided data science approach. Deeper performance investigation further shows remarkable generalization ability to unseen operating points. In addition, the model proved to classify knocking cycles in unseen engines with increased accuracy of 89% after adapting to their features via training on a small number of exclusively non-knocking cycles. The algorithm takes below 1 ms (on CPU) to classify individual cycles, effectively making it suitable for real-time engine control.

Submitted 18 January, 2022; originally announced January 2022.

Comments: accepted for publication in IEEE/ASME Transactions on Mechatronics. (c) IEEE 2022

Journal ref: IEEE/ASME Trans. on Mechatronics, 27(5):4101-4111, Oct. 2022

arXiv:2112.09397

doi 10.1145/3477314.3507181

Semi-Supervised Clustering via Information-Theoretic Markov Chain Aggregation

Authors: Sophie Steger, Bernhard C. Geiger, Marek Smieja

Abstract: We connect the problem of semi-supervised clustering to constrained Markov aggregation, i.e., the task of partitioning the state space of a Markov chain. We achieve this connection by considering every data point in the dataset as an element of the Markov chain's state space, by defining the transition probabilities between states via similarities between corresponding data points, and by incorporating semi-supervision information as hard constraints in a Hartigan-style algorithm. The introduced Constrained Markov Clustering (CoMaC) is an extension of a recent information-theoretic framework for (unsupervised) Markov aggregation to the semi-supervised case. Instantiating CoMaC for certain parameter settings further generalizes two previous information-theoretic objectives for unsupervised clustering. Our results indicate that CoMaC is competitive with the state-of-the-art.

Submitted 7 February, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

Comments: 13 pages, 6 figures; this is an extended version of a short paper accepted at ACM SAC 2022 (minor changes to the text; error in source code corrected)

ACM Class: H.1.1; I.5.3; I.2.0

Journal ref: Proc. of ACM/SIGAPP Symposium on Applied Computing, pp. 1136-1139, 2022

arXiv:2105.00862

doi 10.1109/ACCESS.2023.3302892

Data vs. Physics: The Apparent Pareto Front of Physics-Informed Neural Networks

Authors: Franz M. Rohrhofer, Stefan Posch, Clemens Gößnitzer, Bernhard C. Geiger

Abstract: Physics-informed neural networks (PINNs) have emerged as a promising deep learning method, capable of solving forward and inverse problems governed by differential equations. Despite their recent advance, it is widely acknowledged that PINNs are difficult to train and often require a careful tuning of loss weights when data and physics loss functions are combined by scalarization of a multi-objective (MO) problem. In this paper, we aim to understand how parameters of the physical system, such as characteristic length and time scales, the computational domain, and coefficients of differential equations affect MO optimization and the optimal choice of loss weights. Through a theoretical examination of where these system parameters appear in PINN training, we find that they effectively and individually scale the loss residuals, causing imbalances in MO optimization with certain choices of system parameters. The immediate effects of this are reflected in the apparent Pareto front, which we define as the set of loss values achievable with gradient-based training and visualize accordingly. We empirically verify that loss weights can be used successfully to compensate for the scaling of system parameters, and enable the selection of an optimal solution on the apparent Pareto front that aligns well with the physically valid solution. We further demonstrate that by altering the system parameterization, the apparent Pareto front can shift and exhibit locally convex parts, resulting in a wider range of loss weights for which gradient-based training becomes successful. This work explains the effects of system parameters on MO optimization in PINNs, and highlights the utility of proposed loss weighting schemes.

Submitted 10 June, 2024; v1 submitted 3 May, 2021; originally announced May 2021.

Comments: 11 pages

Journal ref: IEEE Access, vol. 11, pp. 86252-86261, 2023

arXiv:2102.00191

Importance of feature engineering and database selection in a machine learning model: A case study on carbon crystal structures

Authors: Franz M. Rohrhofer, Santanu Saha, Simone Di Cataldo, Bernhard C. Geiger, Wolfgang von der Linden, Lilia Boeri

Abstract: Drive towards improved performance of machine learning models has led to the creation of complex features representing a database of condensed matter systems. The complex features, however, do not offer an intuitive explanation on which physical attributes do improve the performance. The effect of the database on the performance of the trained model is often neglected. In this work we seek to understand in depth the effect that the choice of features and the properties of the database have on a machine learning application. In our experiments, we consider the complex phase space of carbon as a test case, for which we use a set of simple, human understandable and cheaply computable features for the aim of predicting the total energy of the crystal structure. Our study shows that (i) the performance of the machine learning model varies depending on the set of features and the database, (ii) is not transferable to every structure in the phase space and (iii) depends on how well structures are represented in the database.

Submitted 30 January, 2021; originally announced February 2021.

Comments: 18 pages, 11 figures

arXiv:2101.08623 [pdf, other]

doi 10.1007/s10618-021-00809-w

Synwalk -- Community Detection via Random Walk Modelling

Authors: Christian Toth, Denis Helic, Bernhard C. Geiger

Abstract: Complex systems, abstractly represented as networks, are ubiquitous in everyday life. Analyzing and understanding these systems requires, among others, tools for community detection. As no single best community detection algorithm can exist, robustness across a wide variety of problem settings is desirable. In this work, we present Synwalk, a random walk-based community detection method. Synwalk builds upon a solid theoretical basis and detects communities by synthesizing the random walk induced by the given network from a class of candidate random walks. We thoroughly validate the effectiveness of our approach on synthetic and empirical networks, respectively, and compare Synwalk's performance with the performance of Infomap and Walktrap. Our results indicate that Synwalk performs robustly on networks with varying mixing parameters and degree distributions. We outperform Infomap on networks with high mixing parameter, and Infomap and Walktrap on networks with many small communities and low average degree. Our work has a potential to inspire further development of community detection via synthesis of random walks and we provide concrete ideas for future research.

Submitted 21 January, 2021; originally announced January 2021.

Comments: 31 pages, 13 figures

Journal ref: Data Mining and Knowledge Discovery, 2022, Special Issue of the Journal Track of ECML PKDD 2022

arXiv:2008.07865 [pdf, other]

A Formally Robust Time Series Distance Metric

Authors: Maximilian Toller, Bernhard C. Geiger, Roman Kern

Abstract: Distance-based classification is among the most competitive classification methods for time series data. The most critical component of distance-based classification is the selected distance function. Past research has proposed various different distance metrics or measures dedicated to particular aspects of real-world time series data, yet there is an important aspect that has not been considered so far: Robustness against arbitrary data contamination. In this work, we propose a novel distance metric that is robust against arbitrarily "bad" contamination and has a worst-case computational complexity of $\mathcal{O}(n\log n)$. We formally argue why our proposed metric is robust, and demonstrate in an empirical evaluation that the metric yields competitive classification accuracy when applied in k-Nearest Neighbor time series classification.

Submitted 18 August, 2020; originally announced August 2020.

Comments: MileTS Workshop at KDD'19

arXiv:2005.13908 [pdf, ps, other]

doi 10.1109/ITW46852.2021.9457677

On Functions of Markov Random Fields

Authors: Bernhard C. Geiger, Ali Al-Bashabsheh

Abstract: We derive two sufficient conditions for a function of a Markov random field (MRF) on a given graph to be a MRF on the same graph. The first condition is information-theoretic and parallels a recent information-theoretic characterization of lumpability of Markov chains. The second condition, which is easier to check, is based on the potential functions of the corresponding Gibbs field. We illustrate our sufficient conditions at the hand of several examples and discuss implications for practical applications of MRFs. As a side result, we give a partial characterization of functions of MRFs that are information-preserving.

Submitted 15 October, 2020; v1 submitted 28 May, 2020; originally announced May 2020.

Comments: 7 pages, submitted to IEEE Information Theory Workshop

Journal ref: Proc. IEEE Information Theory Workshop, pp. 316-320, 2021. (c) IEEE

arXiv:2003.09671 [pdf, other]

doi 10.1109/TNNLS.2021.3089037

On Information Plane Analyses of Neural Network Classifiers -- A Review

Authors: Bernhard C. Geiger

Abstract: We review the current literature concerned with information plane analyses of neural network classifiers. While the underlying information bottleneck theory and the claim that information-theoretic compression is causally linked to generalization are plausible, empirical evidence was found to be both supporting and conflicting. We review this evidence together with a detailed analysis of how the respective information quantities were estimated. Our survey suggests that compression visualized in information planes is not necessarily information-theoretic, but is rather often compatible with geometric compression of the latent representations. This insight gives the information plane a renewed justification. Aside from this, we shed light on the problem of estimating mutual information in deterministic neural networks and its consequences. Specifically, we argue that even in feed-forward neural networks the data processing inequality need not hold for estimates of mutual information. Similarly, while a fitting phase, in which the mutual information between the latent representation and the target increases, is necessary (but not sufficient) for good classification performance, depending on the specifics of mutual information estimation such a fitting phase need not be visible in the information plane.

Submitted 10 June, 2021; v1 submitted 21 March, 2020; originally announced March 2020.

Journal ref: IEEE Trans. Neural Networks and Learning Systems 33(12):7039-7051

arXiv:1906.09333 [pdf, other]

SeGMA: Semi-Supervised Gaussian Mixture Auto-Encoder

Authors: Marek Śmieja, Maciej Wołczyk, Jacek Tabor, Bernhard C. Geiger

Abstract: We propose a semi-supervised generative model, SeGMA, which learns a joint probability distribution of data and their classes and which is implemented in a typical Wasserstein auto-encoder framework. We choose a mixture of Gaussians as a target distribution in latent space, which provides a natural splitting of data into clusters. To connect Gaussian components with correct classes, we use a small amount of labeled data and a Gaussian classifier induced by the target distribution. SeGMA is optimized efficiently due to the use of Cramer-Wold distance as a maximum mean discrepancy penalty, which yields a closed-form expression for a mixture of spherical Gaussian components and thus obviates the need of sampling. While SeGMA preserves all properties of its semi-supervised predecessors and achieves at least as good generative performance on standard benchmark data sets, it presents additional features: (a) interpolation between any pair of points in the latent space produces realistically-looking samples; (b) combining the interpolation property with disentangled class and style variables, SeGMA is able to perform a continuous style transfer from one class to another; (c) it is possible to change the intensity of class characteristics in a data point by moving the latent representation of the data point away from specific Gaussian components.

Submitted 27 August, 2020; v1 submitted 21 June, 2019; originally announced June 2019.

arXiv:1906.02576 [pdf, ps, other]

Class-Conditional Compression and Disentanglement: Bridging the Gap between Neural Networks and Naive Bayes Classifiers

Authors: Rana Ali Amjad, Bernhard C. Geiger

Abstract: In this draft, which reports on work in progress, we 1) adapt the information bottleneck functional by replacing the compression term by class-conditional compression, 2) relax this functional using a variational bound related to class-conditional disentanglement, 3) consider this functional as a training objective for stochastic neural networks, and 4) show that the latent representations are learned such that they can be used in a naive Bayes classifier. We continue by suggesting a series of experiments along the lines of Nonlinear Information Bottleneck [Kolchinsky et al., 2018], Deep Variational Information Bottleneck [Alemi et al., 2017], and Information Dropout [Achille and Soatto, 2018]. We furthermore suggest a neural network where the decoder architecture is a parameterized naive Bayes decoder.

Submitted 6 June, 2019; originally announced June 2019.

Comments: draft; work in progress

arXiv:1812.02059 [pdf, other]

A Short Note on the Jensen-Shannon Divergence between Simple Mixture Distributions

Authors: Bernhard C. Geiger

Abstract: This short note presents results about the symmetric Jensen-Shannon divergence between two discrete mixture distributions $p_1$ and $p_2$. Specifically, for $i=1,2$, $p_i$ is the mixture of a common distribution $q$ and a distribution $\tilde{p}_i$ with mixture proportion $\lambda_i$. In general, $\tilde{p}_1\neq \tilde{p}_2$ and $\lambda_1\neq\lambda_2$. We provide experimental and theoretical insight to the behavior of the symmetric Jensen-Shannon divergence between $p_1$ and $p_2$ as the mixture proportions or the divergence between $\tilde{p}_1$ and $\tilde{p}_2$ change. We also provide insight into scenarios where the supports of the distributions $\tilde{p}_1$, $\tilde{p}_2$, and $q$ do not coincide.

Submitted 6 December, 2018; v1 submitted 5 December, 2018; originally announced December 2018.

Comments: four-page tech note
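
The mixture setup in this abstract can be illustrated with a minimal sketch. It is not taken from the note itself: the function names (`kl_bits`, `jsd_bits`, `mix`), the example distributions, and the convention that $\lambda_i$ weights $\tilde{p}_i$ while $1-\lambda_i$ weights $q$ are all assumptions made for illustration. The sketch uses the standard definition of the Jensen-Shannon divergence as the average Kullback-Leibler divergence of each distribution to their midpoint.

```python
import math


def kl_bits(p, q):
    """Kullback-Leibler divergence D(p||q) in bits; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd_bits(p1, p2):
    """Symmetric Jensen-Shannon divergence between p1 and p2, in bits."""
    mid = [(a + b) / 2 for a, b in zip(p1, p2)]
    return 0.5 * kl_bits(p1, mid) + 0.5 * kl_bits(p2, mid)


def mix(lam, p_tilde, q):
    """Mixture lam * p_tilde + (1 - lam) * q (assumed weighting convention)."""
    return [lam * a + (1 - lam) * b for a, b in zip(p_tilde, q)]


# Hypothetical example: q, p_tilde_1 and p_tilde_2 have pairwise disjoint supports.
q = [0.5, 0.5, 0.0, 0.0]
p_tilde_1 = [0.0, 0.0, 1.0, 0.0]
p_tilde_2 = [0.0, 0.0, 0.0, 1.0]

for lam in (0.0, 0.5, 1.0):
    p1 = mix(lam, p_tilde_1, q)
    p2 = mix(lam, p_tilde_2, q)
    print(lam, jsd_bits(p1, p2))
```

Varying `lam` separately for the two mixtures, or choosing overlapping supports, gives a quick way to explore the kind of behavior the note describes.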

arXiv:1804.06679 [pdf, other]

doi 10.1109/TNNLS.2021.3088685

Understanding Neural Networks and Individual Neuron Importance via Information-Ordered Cumulative Ablation

Authors: Rana Ali Amjad, Kairen Liu, Bernhard C. Geiger

Abstract: In this work, we investigate the use of three information-theoretic quantities -- entropy, mutual information with the class variable, and a class selectivity measure based on Kullback-Leibler divergence -- to understand and study the behavior of already trained fully-connected feed-forward neural networks. We analyze the connection between these information-theoretic quantities and classification performance on the test set by cumulatively ablating neurons in networks trained on MNIST, FashionMNIST, and CIFAR-10. Our results parallel those recently published by Morcos et al., indicating that class selectivity is not a good indicator for classification performance. However, looking at individual layers separately, both mutual information and class selectivity are positively correlated with classification performance, at least for networks with ReLU activation functions. We provide explanations for this phenomenon and conclude that it is ill-advised to compare the proposed information-theoretic quantities across layers. Furthermore, we show that cumulative ablation of neurons with ascending or descending information-theoretic quantities can be used to formulate hypotheses regarding the joint behavior of multiple neurons, such as redundancy and synergy, with comparably low computational cost. We also draw connections to the information bottleneck theory for neural networks.

Submitted 9 June, 2021; v1 submitted 18 April, 2018; originally announced April 2018.

Comments: 12 pages; accepted for publication in IEEE Transactions on Neural Networks and Learning Systems

Journal ref: IEEE Trans. Neural Networks and Learning Systems 33(12):7842-7852

arXiv:1802.09766 [pdf, other]

doi 10.1109/TPAMI.2019.2909031

Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle

Authors: Rana Ali Amjad, Bernhard C. Geiger

Abstract: In this theory paper, we investigate training deep neural networks (DNNs) for classification via minimizing the information bottleneck (IB) functional. We show that the resulting optimization problem suffers from two severe issues: First, for deterministic DNNs, either the IB functional is infinite for almost all values of network parameters, making the optimization problem ill-posed, or it is piecewise constant, hence not admitting gradient-based optimization methods. Second, the invariance of the IB functional under bijections prevents it from capturing properties of the learned representation that are desirable for classification, such as robustness and simplicity. We argue that these issues are partly resolved for stochastic DNNs, DNNs that include a (hard or soft) decision rule, or by replacing the IB functional with related, but more well-behaved cost functions. We conclude that recent successes reported about training DNNs using the IB framework must be attributed to such solutions. As a side effect, our results indicate limitations of the IB framework for the analysis of DNNs. We also note that rather than trying to repair the inherent problems in the IB functional, a better approach may be to design regularizers on latent representation enforcing the desired properties directly.

Submitted 11 April, 2019; v1 submitted 27 February, 2018; originally announced February 2018.

Comments: 16 pages, to appear in IEEE Trans. Pattern Analysis and Machine Intelligence

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence 42(9):2225-2239, 2020. (c) IEEE

arXiv:1801.00584 [pdf, other]

doi 10.1109/TKDE.2018.2846252

Co-Clustering via Information-Theoretic Markov Aggregation

Authors: Clemens Bloechl, Rana Ali Amjad, Bernhard C. Geiger

Abstract: We present an information-theoretic cost function for co-clustering, i.e., for simultaneous clustering of two sets based on similarities between their elements. By constructing a simple random walk on the corresponding bipartite graph, our cost function is derived from a recently proposed generalized framework for information-theoretic Markov chain aggregation. The goal of our cost function is to minimize relevant information loss, hence it connects to the information bottleneck formalism. Moreover, via the connection to Markov aggregation, our cost function is not ad hoc, but inherits its justification from the operational qualities associated with the corresponding Markov aggregation problem. We furthermore show that, for appropriate parameter settings, our cost function is identical to well-known approaches from the literature, such as Information-Theoretic Co-Clustering of Dhillon et al. Hence, understanding the influence of this parameter admits a deeper understanding of the relationship between previously proposed information-theoretic cost functions. We highlight some strengths and weaknesses of the cost function for different parameters. We also illustrate the performance of our cost function, optimized with a simple sequential heuristic, on several synthetic and real-world data sets, including the Newsgroup20 and the MovieLens100k data sets.

Submitted 15 June, 2018; v1 submitted 2 January, 2018; originally announced January 2018.

arXiv:1712.07863 [pdf, ps, other]

doi 10.1109/TIT.2019.2922186

On the Information Dimension of Multivariate Gaussian Processes

Authors: Bernhard C. Geiger, Tobias Koch

Abstract: The authors have recently defined the Rényi information dimension rate $d(\{X_t\})$ of a stationary stochastic process $\{X_t,\,t\in\mathbb{Z}\}$ as the entropy rate of the uniformly-quantized process divided by minus the logarithm of the quantizer step size $1/m$ in the limit as $m\to\infty$ (B. Geiger and T. Koch, "On the information dimension rate of stochastic processes," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Aachen, Germany, June 2017). For Gaussian processes with a given spectral distribution function $F_X$, they showed that the information dimension rate equals the Lebesgue measure of the set of harmonics where the derivative of $F_X$ is positive. This paper extends this result to multivariate Gaussian processes with a given matrix-valued spectral distribution function $F_{\mathbf{X}}$. It is demonstrated that the information dimension rate equals the average rank of the derivative of $F_{\mathbf{X}}$. As a side result, it is shown that the scale and translation invariance of information dimension carries over from random variables to stochastic processes.

Submitted 21 December, 2017; originally announced December 2017.

Comments: This work will be presented in part at the 2018 International Zurich Seminar on Information and Communication

Journal ref: IEEE Trans. on Information Theory 65(10):6496-6518. (C) IEEE 2019

arXiv:1709.05907 [pdf, ps, other]

doi 10.1109/TAC.2019.2945891

A Generalized Framework for Kullback-Leibler Markov Aggregation

Authors: Rana Ali Amjad, Clemens Blöchl, Bernhard C. Geiger

Abstract: This paper proposes an information-theoretic cost function for aggregating a Markov chain via a (possibly stochastic) mapping. The cost function is motivated by two objectives: 1) The process obtained by observing the Markov chain through the mapping should be close to a Markov chain, and 2) the aggregated Markov chain should retain as much of the temporal dependence structure of the original Markov chain as possible. We discuss properties of this parameterized cost function and show that it contains the cost functions previously proposed by Deng et al., Xu et al., and Geiger et al. as special cases. We moreover discuss these special cases providing a better understanding and highlighting potential shortcomings: For example, the cost function proposed by Geiger et al. is tightly connected to approximate probabilistic bisimulation, but leads to trivial solutions if optimized without regularization. We furthermore propose a simple heuristic to optimize our cost function for deterministic aggregations and illustrate its performance on a set of synthetic examples.

Submitted 18 September, 2017; originally announced September 2017.

Comments: 12 pages, 3 figures; submitted to a journal

arXiv:1705.01601 [pdf, other]

doi 10.1016/j.ins.2017.07.016

Semi-supervised cross-entropy clustering with information bottleneck constraint

Authors: Marek Śmieja, Bernhard C. Geiger

Abstract: In this paper, we propose a semi-supervised clustering method, CEC-IB, that models data with a set of Gaussian distributions and that retrieves clusters based on a partial labeling provided by the user (partition-level side information). By combining the ideas from cross-entropy clustering (CEC) with those from the information bottleneck method (IB), our method trades between three conflicting goals: the accuracy with which the data set is modeled, the simplicity of the model, and the consistency of the clustering with side information. Experiments demonstrate that CEC-IB has a performance comparable to Gaussian mixture models (GMM) in a classical semi-supervised scenario, but is faster, more robust to noisy labels, automatically determines the optimal number of clusters, and performs well when not all classes are present in the side information. Moreover, in contrast to other semi-supervised models, it can be successfully applied in discovering natural subgroups if the partition-level side information is derived from the top levels of a hierarchical clustering.

Submitted 3 May, 2017; originally announced May 2017.

Journal ref: Information Sciences, vol. 421, Dec. 2017, pp. 254-271

arXiv:1702.00645 [pdf, ps, other]

doi 10.1109/TIT.2019.2922186

On the Information Dimension of Stochastic Processes

Authors: Bernhard C. Geiger, Tobias Koch

Abstract: In 1959, Rényi proposed the information dimension and the $d$-dimensional entropy to measure the information content of general random variables. This paper proposes a generalization of information dimension to stochastic processes by defining the information dimension rate as the entropy rate of the uniformly-quantized stochastic process divided by minus the logarithm of the quantizer step size $1/m$ in the limit as $m\to\infty$. It is demonstrated that the information dimension rate coincides with the rate-distortion dimension, defined as twice the rate-distortion function $R(D)$ of the stochastic process divided by $-\log(D)$ in the limit as $D\downarrow 0$. It is further shown that, among all multivariate stationary processes with a given (matrix-valued) spectral distribution function (SDF), the Gaussian process has the largest information dimension rate, and that the information dimension rate of multivariate stationary Gaussian processes is given by the average rank of the derivative of the SDF. The presented results reveal that the fundamental limits of almost zero-distortion recovery via compressible signal pursuit and almost lossless analog compression are different in general.

Submitted 11 June, 2019; v1 submitted 2 February, 2017; originally announced February 2017.

Comments: 23 pages, double column. Accepted for publication in the IEEE Transactions on Information Theory, copyright (c) 2019 IEEE. This version supersedes our previous submissions arXiv:1702.00645v2 and arXiv:1712.07863

arXiv:1701.07371 [pdf, ps, other]

doi 10.1109/ISIT.2017.8007095

Divergence Scaling of Fixed-Length, Binary-Output, One-to-One Distribution Matching

Authors: Patrick Schulte, Bernhard C. Geiger

Abstract: Distribution matching is the process of invertibly mapping a uniformly distributed input sequence onto sequences that approximate the output of a desired discrete memoryless source. The special case of a binary output alphabet and one-to-one mapping is studied. A fixed-length distribution matcher is proposed that is optimal in the sense of minimizing the unnormalized informational divergence between its output distribution and a binary memoryless target distribution. Upper and lower bounds on the unnormalized divergence are computed that increase logarithmically in the output block length $n$. It follows that a recently proposed constant composition distribution matcher performs within a constant gap of the minimal achievable informational divergence.

Submitted 16 May, 2017; v1 submitted 25 January, 2017; originally announced January 2017.

Comments: 5 pages, 1 figure; Lemma 6 updated; This work will be presented at ISIT 2017

Journal ref: Proc. IEEE Int. Symp. on Information Theory 2017, pp. 3075-3079

arXiv:1611.05219 [pdf, ps, other]

doi 10.1016/j.spl.2017.07.006

A Sufficient Condition for a Unique Invariant Distribution of a Higher-Order Markov Chain

Authors: Bernhard C. Geiger

Abstract: We derive a sufficient condition for a $k$-th order homogeneous Markov chain $\mathbf{Z}$ with finite alphabet $\mathcal{Z}$ to have a unique invariant distribution on $\mathcal{Z}^k$. Specifically, let $\mathbf{X}$ be a first-order, stationary Markov chain with finite alphabet $\mathcal{X}$ and a single recurrent class, let $g{:}\ \mathcal{X}\to\mathcal{Z}$ be non-injective, and define the (possibly non-Markovian) process $\mathbf{Y}:=g(\mathbf{X})$ (where $g$ is applied coordinate-wise). If $\mathbf{Z}$ is the $k$-th order Markov approximation of $\mathbf{Y}$, its invariant distribution is unique. We generalize this to non-Markovian processes $\mathbf{X}$.

Submitted 7 April, 2017; v1 submitted 16 November, 2016; originally announced November 2016.

Comments: 11 pages, 1 figure

MSC Class: 60J10

Journal ref: Statistics & Probability Letters, vol. 130, Nov. 2017

arXiv:1610.07304 [pdf, other]

A Rate-Distortion Approach to Caching

Authors: Roy Timo, Shirin Saeedi Bidokhti, Michèle Wigger, Bernhard C. Geiger

Abstract: This paper takes a rate-distortion approach to understanding the information-theoretic laws governing cache-aided communications systems. Specifically, we characterise the optimal tradeoffs between the delivery rate, cache capacity and reconstruction distortions for a single-user problem and some special cases of a two-user problem. Our analysis considers discrete memoryless sources, expected- and excess-distortion constraints, and separable and f-separable distortion functions. We also establish a strong converse for separable-distortion functions, and we show that lossy versions of common information (Gács-Körner and Wyner) play an important role in caching. Finally, we illustrate and explicitly evaluate these laws for multivariate Gaussian sources and binary symmetric sources.

Submitted 24 October, 2016; originally announced October 2016.

arXiv:1608.04872 [pdf, ps, other]

Hard Clusters Maximize Mutual Information

Authors: Bernhard C. Geiger, Rana Ali Amjad

Abstract: In this paper, we investigate mutual information as a cost function for clustering, and show in which cases hard, i.e., deterministic, clusters are optimal. Using convexity properties of mutual information, we show that certain formulations of the information bottleneck problem are solved by hard clusters. Similarly, hard clusters are optimal for the information-theoretic co-clustering problem that deals with simultaneous clustering of two dependent data sets. If both data sets have to be clustered using the same cluster assignment, hard clusters are not optimal in general. We point at interesting and practically relevant special cases of this so-called pairwise clustering problem, for which we can either prove or have evidence that hard clusters are optimal. Our results thus show that one can relax the otherwise combinatorial hard clustering problem to a real-valued optimization problem with the same global optimum.

Submitted 17 August, 2016; originally announced August 2016.

arXiv:1608.04637 [pdf, ps, other]

Higher-Order Kullback-Leibler Aggregation of Markov Chains

Authors: Bernhard C. Geiger, Yuchen Wu

Abstract: We consider the problem of reducing a first-order Markov chain on a large alphabet to a higher-order Markov chain on a small alphabet. We present information-theoretic cost functions that are related to predictability and lumpability, show relations between these cost functions, and discuss heuristics to minimize them. Our experiments suggest that the generalization to higher orders is useful for model reduction in reliability analysis and natural language processing.

Submitted 16 August, 2016; originally announced August 2016.

arXiv:1601.06039 [pdf, ps, other]

doi 10.3390/e18070262

Greedy Algorithms for Optimal Distribution Approximation

Authors: Bernhard C. Geiger, Georg Böcherer

Abstract: The approximation of a discrete probability distribution $\mathbf{t}$ by an $M$-type distribution $\mathbf{p}$ is considered. The approximation error is measured by the informational divergence $\mathbb{D}(\mathbf{t}\Vert\mathbf{p})$, which is an appropriate measure, e.g., in the context of data compression. Properties of the optimal approximation are derived and bounds on the approximation error are presented, which are asymptotically tight. It is shown that $M$-type approximations that minimize either $\mathbb{D}(\mathbf{t}\Vert\mathbf{p})$, or $\mathbb{D}(\mathbf{p}\Vert\mathbf{t})$, or the variational distance $\Vert\mathbf{p}-\mathbf{t}\Vert_1$ can all be found by using specific instances of the same general greedy algorithm.

Submitted 22 January, 2016; originally announced January 2016.

Comments: 5 pages

Journal ref: Entropy 2016, 18(7), 262

arXiv:1509.06580 [pdf, ps, other]

doi 10.1109/ISIT.2016.7541856

Graph-Based Lossless Markov Lumpings

Authors: Bernhard C. Geiger, Christoph Hofer-Temmel

Abstract: We use results from zero-error information theory to determine the set of non-injective functions through which a Markov chain can be projected without losing information. These lumping functions can be found by clique partitioning of a graph related to the Markov chain. Lossless lumping is made possible by exploiting the (sufficiently sparse) temporal structure of the Markov chain. Eliminating edges in the transition graph of the Markov chain trades the required output alphabet size versus information loss, for which we present bounds.

Submitted 22 January, 2016; v1 submitted 22 September, 2015; originally announced September 2015.

Comments: 6 pages

MSC Class: 60J10; 68R10

Journal ref: Proc. IEEE Int. Sym. on Information Theory (ISIT) 2015

arXiv:1506.05231 [pdf, ps, other]

doi 10.3390/e20010070

The Fractality of Polar and Reed-Muller Codes

Authors: Bernhard C. Geiger

Abstract: The generator matrices of polar codes and Reed-Muller codes are obtained by selecting rows from the Kronecker product of a lower-triangular binary square matrix. For polar codes, the selection is based on the Bhattacharyya parameter of the row, which is closely related to the error probability of the corresponding input bit under sequential decoding. For Reed-Muller codes, the selection is based on the Hamming weight of the row. This work investigates the properties of the index sets pointing to those rows in the infinite blocklength limit. In particular, the Lebesgue measure, the Hausdorff dimension, and the self-similarity of these sets will be discussed. It is shown that these index sets have several properties that are common to fractals.

Submitted 25 February, 2016; v1 submitted 17 June, 2015; originally announced June 2015.

Comments: 9 pages, one figure

Journal ref: a slightly extended version of this manuscript is published in Entropy 2018, 20(1), 70

arXiv:1506.04518 [pdf, ps, other]

Cepstral Analysis of Random Variables: Muculants

Authors: Christian Knoll, Bernhard C. Geiger, Gernot Kubin

Abstract: An alternative parametric description for discrete random variables, called muculants, is proposed. In contrast to cumulants, muculants are based on the Fourier series expansion, rather than on the Taylor series expansion, of the logarithm of the characteristic function. We utilize results from cepstral theory to derive elementary properties of muculants, some of which demonstrate behavior superior to those of cumulants. For example, muculants and cumulants are both additive. While the existence of cumulants is linked to how often the characteristic function is differentiable, all muculants exist if the characteristic function satisfies a Paley-Wiener condition. Moreover, the muculant sequence and, if the random variable has finite expectation, the reconstruction of the characteristic function from its muculants converge. We furthermore develop a connection between muculants and cumulants and present the muculants of selected discrete random variables. Specifically, it is shown that the Poisson distribution is the only distribution where only the first two muculants are nonzero.

Submitted 13 November, 2017; v1 submitted 15 June, 2015; originally announced June 2015.

Comments: 5 pages

MSC Class: 60E10

arXiv:1412.1770 [pdf]

Non-constrictive bead immobilization leading to decreased and uniform shear stress in microfluidic bead-based ELISA

Authors: Kinshuk Mitra, Brett C. Geiger, Preethi Chidambaram, Aaron P. Maharry, Ronald X. Xu, Michael F. Tweedle

Abstract: Microfluidic biosensors have been utilized for sensing a wide range of antigens using numerous configurations. Bead-based microfluidic sensors have been a popular modality due to the plug-and-play nature of analyte choice and the favorable geometry of spherical sensor scaffolds. While constriction of beads against fluid flow remains a popular method to immobilize the sensor, it results in poor fluidic regimes and shear conditions around sensor beads that can affect sensor performance. We present an alternative means of sensor bead immobilization using a polycarbonate membrane. This system results in several orders of magnitude lower variance of flow radially around the sensor bead. Shear stress experienced by our non-constrictive immobilized bead was three orders of magnitude lower. We demonstrate the ability to quantitatively sense EpCAM protein, a marker for cancer stem cells, and operation under both far-red and green wavelengths with no auto-fluorescence.

Submitted 3 December, 2014; originally announced December 2014.

Comments: 15 pages, 11 figures

arXiv:1310.8487 [pdf, ps, other]

Information Loss and Anti-Aliasing Filters in Multirate Systems

Authors: Bernhard C. Geiger, Gernot Kubin

Abstract: This work investigates the information loss in a decimation system, i.e., in a downsampler preceded by an anti-aliasing filter. It is shown that, without a specific signal model in mind, the anti-aliasing filter cannot reduce information loss, while, e.g., for a simple signal-plus-noise model it can. For the Gaussian case, the optimal anti-aliasing filter is shown to coincide with the one obtained from energetic considerations. For a non-Gaussian signal corrupted by Gaussian noise, the Gaussian assumption yields an upper bound on the information loss, justifying filter design principles based on second-order statistics from an information-theoretic point-of-view.

Submitted 7 July, 2014; v1 submitted 31 October, 2013; originally announced October 2013.

Comments: 12 pages; a shorter version of this paper was published at the 2014 International Zurich Seminar on Communications

Journal ref: Proc. Int. Zurich Seminar on Communications, 2014, pp. 148 - 151

arXiv:1307.6843 [pdf, ps, other]

doi 10.1109/TIT.2016.2610433

Optimal Quantization for Distribution Synthesis

Authors: Georg Böcherer, Bernhard C. Geiger

Abstract: Finite precision approximations of discrete probability distributions are considered, applicable for distribution synthesis, e.g., probabilistic shaping. Two algorithms are presented that find the optimal $M$-type approximation $Q$ of a distribution $P$ in terms of the variational distance $\Vert Q-P\Vert_1$ and the informational divergence $\mathbb{D}(Q\Vert P)$. Bounds on the approximation errors are derived and shown to be asymptotically tight. Several examples illustrate that the variational distance optimal approximation can be quite different from the informational divergence optimal approximation.

Submitted 19 January, 2016; v1 submitted 25 July, 2013; originally announced July 2013.

Comments: Submitted to the IEEE Transactions on Information Theory

arXiv:1304.6603 [pdf, ps, other]

doi 10.1109/TAC.2014.2364971

Optimal Kullback-Leibler Aggregation via Information Bottleneck

Authors: Bernhard C. Geiger, Tatjana Petrov, Gernot Kubin, Heinz Koeppl

Abstract: In this paper, we present a method for reducing a regular, discrete-time Markov chain (DTMC) to another DTMC with a given, typically much smaller number of states. The cost of reduction is defined as the Kullback-Leibler divergence rate between a projection of the original process through a partition function and a DTMC on the correspondingly partitioned state space. Finding the reduced model with minimal cost is computationally expensive, as it requires an exhaustive search among all state space partitions, and an exact evaluation of the reduction cost for each candidate partition. Our approach deals with the latter problem by minimizing an upper bound on the reduction cost instead of minimizing the exact cost; the proposed upper bound is easy to compute and it is tight if the original chain is lumpable with respect to the partition. Then, we express the problem in the form of information bottleneck optimization, and propose using the agglomerative information bottleneck algorithm for searching a sub-optimal partition greedily, rather than exhaustively. The theory is illustrated with examples and one application scenario in the context of modeling bio-molecular interactions.

Submitted 10 February, 2015; v1 submitted 24 April, 2013; originally announced April 2013.

Comments: 13 pages, 4 figures

Journal ref: IEEE Trans. Autom. Control, vol. 60, no. 4, p. 1010 - 1022, 2015

arXiv:1304.5075 [pdf, ps, other]

On the Rate of Information Loss in Memoryless Systems

Authors: Bernhard C. Geiger, Gernot Kubin

Abstract: In this work we present results about the rate of (relative) information loss induced by passing a real-valued, stationary stochastic process through a memoryless system. We show that for a special class of systems the information loss rate is closely related to the difference of differential entropy rates of the input and output processes. It is further shown that the rate of (relative) information loss is bounded from above by the (relative) information loss the system induces on a random variable distributed according to the process's marginal distribution. As a side result, in this work we present sufficient conditions such that for a continuous-valued Markovian input process also the output process possesses the Markov property.

Submitted 18 April, 2013; originally announced April 2013.

Comments: 9 pages, 4 figures; submitted to a conference

arXiv:1304.0920 [pdf, ps, other]

Information-Preserving Markov Aggregation

Authors: Bernhard C. Geiger, Christoph Temmel

Abstract: We present a sufficient condition for a non-injective function of a Markov chain to be a second-order Markov chain with the same entropy rate as the original chain. This permits an information-preserving state space reduction by merging states or, equivalently, lossless compression of a Markov source on a sample-by-sample basis. The cardinality of the reduced state space is bounded from below by the node degrees of the transition graph associated with the original Markov chain. We also present an algorithm listing all possible information-preserving state space reductions, for a given transition graph. We illustrate our results by applying the algorithm to a bi-gram letter model of an English text.

Submitted 24 July, 2013; v1 submitted 3 April, 2013; originally announced April 2013.

Comments: 7 pages, 3 figures, 2 tables

Journal ref: Proc. IEEE Information Theory Workshop, 2013, pp. 258-262

arXiv:1303.6409 [pdf, ps, other]

Information Measures for Deterministic Input-Output Systems

Authors: Bernhard C. Geiger, Gernot Kubin

Abstract: In this work the information loss in deterministic, memoryless systems is investigated by evaluating the conditional entropy of the input random variable given the output random variable. It is shown that for a large class of systems the information loss is finite, even if the input is continuously distributed. Based on this finiteness, the problem of perfectly reconstructing the input is addressed and Fano-type bounds between the information loss and the reconstruction error probability are derived. For systems with infinite information loss a relative measure is defined and shown to be tightly related to Rényi information dimension. Employing another Fano-type argument, the reconstruction error probability is bounded by the relative information loss from below. In view of developing a system theory from an information-theoretic point-of-view, the theoretical results are illustrated by a few example systems, among them a multi-channel autocorrelation receiver.

Submitted 17 April, 2013; v1 submitted 26 March, 2013; originally announced March 2013.

Comments: 23 pages, 12 figures; submitted

arXiv:1212.4375 [pdf, other]

doi 10.1239/jap/1421763331

Lumpings of Markov chains, entropy rate preservation, and higher-order lumpability

Authors: Bernhard C. Geiger, Christoph Temmel

Abstract: A lumping of a Markov chain is a coordinate-wise projection of the chain. We characterise the entropy rate preservation of a lumping of an aperiodic and irreducible Markov chain on a finite state space by the random growth rate of the cardinality of the realisable preimage of a finite-length trajectory of the lumped chain and by the information needed to reconstruct original trajectories from their lumped images. Both are purely combinatorial criteria, depending only on the transition graph of the Markov chain and the lumping function. A lumping is strongly k-lumpable, iff the lumped process is a k-th order Markov chain for each starting distribution of the original Markov chain. We characterise strong k-lumpability via tightness of stationary entropic bounds. In the sparse setting, we give sufficient conditions on the lumping to both preserve the entropy rate and be strongly k-lumpable.

Submitted 20 April, 2015; v1 submitted 18 December, 2012; originally announced December 2012.

MSC Class: 60J10 (60G17 94A17 60G10 65C40)

arXiv:1205.6935 [pdf, ps, other]

Signal Enhancement as Minimization of Relevant Information Loss

Authors: Bernhard C. Geiger, Gernot Kubin

Abstract: We introduce the notion of relevant information loss for the purpose of casting the signal enhancement problem in information-theoretic terms. We show that many algorithms from machine learning can be reformulated using relevant information loss, which allows their application to the aforementioned problem. As a particular example we analyze principal component analysis for dimensionality reduction, discuss its optimality, and show that the relevant information loss can indeed vanish if the relevant information is concentrated on a lower-dimensional subspace of the input space.

Submitted 16 January, 2013; v1 submitted 31 May, 2012; originally announced May 2012.

Comments: 9 pages; 4 figures; accepted for presentation at a conference

Journal ref: Proc. ITG Conf. on Systems, Communication and Coding, 2013, pp. 1-6

arXiv:1204.0429 [pdf, ps, other]

doi 10.1109/ITW.2012.6404738

Relative Information Loss in the PCA

Authors: Bernhard C. Geiger, Gernot Kubin

Abstract: In this work we analyze principal component analysis (PCA) as a deterministic input-output system. We show that the relative information loss induced by reducing the dimensionality of the data after performing the PCA is the same as in dimensionality reduction without PCA. Finally, we analyze the case where the PCA uses the sample covariance matrix to compute the rotation. If the rotation matrix is not available at the output, we show that an infinite amount of information is lost. The relative information loss is shown to decrease with increasing sample size.

Submitted 31 July, 2012; v1 submitted 2 April, 2012; originally announced April 2012.

Comments: 9 pages, 4 figures; extended version of a paper accepted for publication

Journal ref: Proc. IEEE Information Theory Workshop, 2012, pp. 562 - 566

Showing 1–50 of 55 results for author: Geiger, B C