Bayesian Entropy Neural Networks for Physics-Aware Prediction

Rahul Rathnakumar, Yongming Liu
School for Engineering of Matter, Transport and Energy
Arizona State University
Tempe
{rrathnak, yongming.liu}@asu.edu
Jiayu Huang, Hao Yan
School of Computing and Augmented Intelligence
Arizona State University
Tempe
{jhuan178, haoyan}@asu.edu
Abstract

This paper addresses the need for deep learning models to integrate well-defined constraints into their outputs, driven by their application in surrogate models, learning with limited data and partial information, and scenarios requiring flexible model behavior to incorporate non-data sample information. We introduce Bayesian Entropy Neural Networks (BENN), a framework grounded in Maximum Entropy (MaxEnt) principles, designed to impose constraints on Bayesian Neural Network (BNN) predictions. BENN is capable of constraining not only the predicted values but also their derivatives and variances, ensuring a more robust and reliable model output. To achieve simultaneous uncertainty quantification and constraint satisfaction, we employ the method of multipliers approach. This allows for the concurrent estimation of neural network parameters and the Lagrangian multipliers associated with the constraints. Our experiments, spanning diverse applications such as beam deflection modeling and microstructure generation, demonstrate the effectiveness of BENN. The results highlight significant improvements over traditional BNNs and showcase competitive performance relative to contemporary constrained deep learning methods.

Keywords: Bayesian Neural Networks · Deep Learning · Constrained Learning · Uncertainty Quantification

1 Introduction

Neural networks have achieved remarkable success in various domains, ranging from image recognition to natural language processing (Goldberg, 2016; Khan et al., 2020). Despite these advancements, traditional deep learning approaches often fall short in scenarios that demand adherence to specific constraints (Zhu et al., 2019; Hosseini-Asl et al., 2015) or a deeper understanding of model uncertainty (Wilson and Izmailov, 2020). This limitation becomes particularly pronounced in fields such as surrogate modeling, learning with limited or partial data, and applications where flexible adaptation to non-sample-based information is crucial. For surrogate models, the neural network model is expected to conform to behavior governed by the system model, and without enough data across the domain of interest, neural network performance falls short. Integrating domain knowledge into these neural networks remains challenging. Existing works on integrating domain knowledge use either Bayesian Networks, Physics-Informed Neural Networks (PINNs) or more general regularization schemes.

Bayesian Networks are probabilistic graphical models that represent a set of variables and their conditional dependencies via a directed acyclic graph (DAG). They are particularly useful for modeling complex systems where various factors influence each other in a probabilistic manner. In the context of physics and data-driven models, Bayesian Networks can be employed to incorporate prior knowledge and physical laws as constraints, ensuring that the models adhere to known scientific principles. Wang et al. (2020) present a method for information fusion that combines the maximum entropy (ME) method with the classical Bayesian network, termed the Bayesian-Entropy Network (BEN). This method is particularly adept at handling various types of information for classification and updating, such as point data, statistical information, and range data. The BEN method extends the Bayesian approach by integrating additional information in the form of constraints into the entropy part, while the Bayesian part handles classical point observation data. This integration allows for a more comprehensive approach to modeling and decision-making. The BEN method adapts flexibly to non-sample-based information, making it particularly relevant in fields like surrogate modeling and learning with limited or partial data. For example, Wang et al. (2021a) apply this method to the Gaussian Process model by adding information as constraints. Another work that uses such surrogate modeling techniques in conjunction with Bayesian Networks is by Lu and Zhang (2022), for the risk assessment of a power distribution system. In that work, the authors use the physical properties of the individual system components and the network topology to derive a physics-based fragility analysis for the whole system. Using Bayesian Networks, they calculate the time-dependent failure rate function of the system.
In this paper, such modeling efforts are extended to neural network models, showcasing their potential both in imposing information constraints on high-dimensional problems such as microstructure generation and in controlling the predicted uncertainty, which is an important avenue of research. The latter is motivated by the need to output high uncertainty for predictions outside the training domain, or when the information constraints conflict, as shown later in this paper.

It is not widely known in the community that artificial neural networks have been effectively applied to physical systems for problem-solving. For instance, Lagaris et al. (1998) utilized a multi-layer perceptron (MLP) to solve partial differential equations (PDEs). This approach uses the MLP to generate trial solutions that meet initial and boundary conditions by minimizing an energy function derived from the differential equation. However, one limitation of this approach to incorporating physics information is its sensitivity to mesh density, which in turn limits the dimensionality of the problems that can be solved. More recently, Sirignano and Spiliopoulos (2017) proposed a mesh-free, deep learning-based technique inspired by the Galerkin method for solving PDEs; it is accurate in high dimensions and alleviates this issue. The idea is to use a neural network, rather than a combination of basis functions, to approximate the solution. While these methods focus on solving differential equations, they remain relevant to scientific machine learning and to building surrogate models for physical systems. A landmark contribution in this line of work was Raissi et al. (2019), which presented physics-informed neural networks (PINNs) that incorporate physical laws, in the form of differential equations, directly into the loss function of the neural network. This is achieved by constructing a composite loss function that not only measures the difference between predictions and data but also quantifies the deviation from the specified physical laws. PINNs are specialized models that have been shown to model differential equation constraints accurately using regularization. However, these techniques focus on enforcing PDE and ODE constraints, which limits their use for general knowledge constraints, the major focus of this paper.

In the third area, general regularization schemes, posterior-regularized BNNs (Huang et al., 2022) and prior-regularized BNNs (Yang et al., 2020a; Sam et al., 2024; Tran et al., 2022) are contemporary works related to this paper. Huang et al. (2022) focus on embedding soft and hard knowledge constraints into the posterior, offering enhanced model robustness and adaptability. Yang et al. (2020b) modify the prior, which is optimized to comply with the constraints. OC-BNNs (Yang et al., 2020a) primarily focus on defining equality and inequality constraints by multiplying likelihoods and priors in the probability space, whereas PR-BNNs directly regularize the posterior predictive distribution to comply with the constraint. Modifying the prior does have its advantages. For instance, by using a domain knowledge loss on unlabeled samples, Sam et al. (2024) turn a low-rank Gaussian prior into an informative prior. The idea is that the informative prior learned during the pre-training stage should transfer its properties to the posterior, without having to add a loss on the posterior, which requires sampling. However, it is not clear that the prior can be specified correctly and easily for complex problems; the predictive distribution is more intuitive to work with when the modeler wants to incorporate constraints. Moreover, since posterior samples are already required to evaluate the KL divergence, a constraint term on the posterior is just another term to be evaluated alongside the data loss. On the other hand, Sam et al. (2024) argue that since the informative prior is a low-rank Gaussian, the computational overhead of using it is no higher than that of an isotropic Gaussian. Whether they use posterior regularization or prior regularization, the common theme across these papers is the use of loss functions to enforce compliance constraints on model behavior.
However, these papers have not extensively studied the variety of constraints explored in this paper. For example, while PR-BNN can set the amount of constraint penalty added to the model, it lacks the ability to balance multiple knowledge constraints, especially if they conflict. Furthermore, this paper also studies variance constraints and constraints applied to high-dimensional image generation problems, where the constraints are not obvious, and highlights some nuances in constraint modeling for real-world neural network-based tasks.

In this paper, we propose a framework that generalizes existing approaches by integrating a broad range of constraints and automatically determining their strength using the Maximum Entropy (MaxEnt) principle. The core of Bayesian Entropy Neural Networks (BENN) is their ability to combine the MaxEnt principle with the Bayesian Neural Network framework. This results in a model that offers constrained and uncertain predictions while remaining computationally feasible. Bayesian Neural Networks (BNNs) are founded on the principles of Bayesian inference and provide an elegant solution to the over-reliance on point estimates inherent in traditional neural networks. By leveraging posterior distributions over parameters, BNNs offer a more nuanced view of model uncertainty, essential for many real-world applications. However, the practical implementation of Bayesian updating in neural networks is challenged by computational limitations, especially with large parameter spaces. To address this issue, we use Variational Inference (VI), an approach that significantly reduces computational overhead by approximating posterior distributions. This approximation is crucial for applying BNNs in more complex and realistic scenarios. Works such as Wang et al. (2021b) use a sequential optimization approach to iteratively find model parameters for the Gaussian Process and the Lagrange multipliers. Neural networks, however, provide backpropagation machinery to handle optimization under constraints. Therefore, this work utilizes the Modified Differential Method of Multipliers (MDMM) to efficiently learn constraint weights.

The rest of the paper is organized as follows:

  1. Section 2 provides background information on Bayesian Neural Networks, the Maximum Entropy method, and concludes with a description of the Bayesian Entropy Neural Network (BENN) framework.

  2. Section 3 covers three experiments:

    (a) 1D Regression: Implementing value, derivative, and predictive variance constraints.

    (b) Beam Deflection: Demonstrating the performance of the BENN approach in a beam deflection problem, where part of the domain is observed.

    (c) Microstructure Generation: Showcasing the benefit of constrained learning to enhance the performance of microstructure generation using suitable constraints on the microstructure properties.

  3. Section 4 summarizes the work done in this paper and highlights the importance of the main results obtained.

2 Method

Here, we present the main methodology of the proposed BENN method. We start with a brief overview of Bayesian Neural Networks (BNNs) in Section 2.1. We then discuss the integration of maximum entropy principles into Bayesian inference in Section 2.2 and introduce the Bayesian Entropy Neural Network (BENN) framework in Section 2.3.

2.1 Overview of Bayesian Neural Networks

Bayesian Neural Networks arose out of a need to have a principled alternative to point estimates of parameters in standard neural networks. Given a neural network $f(\theta)$ and a dataset $D=\{(x_{i},y_{i})\}_{i=1}^{n}$, one can formulate the process as updating a prior belief $p(\theta)$ using a likelihood $p(D|\theta)$ to obtain a posterior $p(\theta|D)$:

$$p(\theta|D)=\frac{p(D|\theta)\cdot p(\theta)}{p(D)} \tag{1}$$

In the above equation, the denominator $p(D)$ is the marginal likelihood of the data, obtained by marginalizing over the parameters $\theta$. While the Bayes rule is valid for any family of models, implementing updating schemes for neural networks presents a challenge. The dimension of the parameter set $\theta$ can be large, leading to computational difficulties in calculating the marginal likelihood. Moreover, the goal of inferring parameters is to predict on test samples $x^{*}$. However, one again confronts a need to marginalize the parameters using $p(y^{*}|x^{*},D)=\int p(y^{*}|x^{*},\theta)\,p(\theta|D)\,d\theta$ to get an averaged prediction. Sampling approaches such as Markov Chain Monte Carlo (MCMC) are often used in smaller models to get accurate estimates of the true posterior distributions. However, for neural networks, this sampling approach is disfavored due to excessive computational requirements. Instead, variational inference techniques are used.
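As an illustration of the marginalization over parameters, the following minimal sketch estimates the posterior predictive mean and variance by Monte Carlo. It assumes a toy one-parameter model $y=\theta x+\text{noise}$ with a Gaussian prior, chosen only because its posterior is available in closed form; the model, the data, and all function names are hypothetical and are not taken from the paper's experiments.

```python
import math
import random


def posterior_1d(xs, ys, prior_var=1.0, noise_var=0.25):
    """Closed-form Gaussian posterior for y = theta * x + noise (zero-mean prior)."""
    precision = 1.0 / prior_var + sum(x * x for x in xs) / noise_var
    mean = (sum(x * y for x, y in zip(xs, ys)) / noise_var) / precision
    return mean, 1.0 / precision


def predictive_mc(x_star, post_mean, post_var, noise_var=0.25,
                  n_samples=20000, seed=0):
    """Monte Carlo estimate of the posterior predictive mean and variance."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_samples):
        # theta ~ p(theta | D)
        theta = rng.gauss(post_mean, math.sqrt(post_var))
        # y* ~ p(y* | x*, theta)
        draws.append(rng.gauss(theta * x_star, math.sqrt(noise_var)))
    mean = sum(draws) / n_samples
    var = sum((d - mean) ** 2 for d in draws) / (n_samples - 1)
    return mean, var


# Toy data (assumed for illustration only).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0.1, 0.9, 2.1, 2.9, 4.2]
post_mean, post_var = posterior_1d(xs, ys)

x_star = 1.2
mc_mean, mc_var = predictive_mc(x_star, post_mean, post_var)

# Closed-form predictive moments for comparison.
exact_mean = post_mean * x_star
exact_var = post_var * x_star ** 2 + 0.25
```

For a neural network no such closed form exists, which is why the averaging over sampled parameters is the part that carries over to the BNN setting.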

VI approximates the posterior by finding a proposal distribution $q_{\phi}(\theta)$ that minimizes the KL divergence between the two. The KL divergence is given by:

$$\text{KL}(q_{\phi}(\theta)\,||\,p(\theta|D))=E_{q}[\log(q_{\phi}(\theta))]-E_{q}[\log(p(\theta|D))] \tag{2}$$

From the Bayes rule, it is seen that the second term still depends on the marginal likelihood $p(D)$, which is intractable. However, $p(D)$ is independent of the proposal distribution $q_{\phi}(\theta)$. Taking advantage of this fact, variational methods maximize the Evidence Lower Bound (ELBO), which equals the negative of this KL divergence up to the additive constant $\log p(D)$:

$$\begin{aligned}\text{ELBO}(q)&=E_{q}[\log(p(D|\theta))]+E_{q}[\log(p(\theta))]-E_{q}[\log(q_{\phi}(\theta))]\\ &=E_{q}[\log(p(D|\theta))]-\text{KL}(q_{\phi}(\theta)\,||\,p(\theta))\end{aligned} \tag{3}$$

The two components present in the final form consist of an expected log-likelihood term that is data-dependent, and a term that encourages the proposal distribution to not deviate from the prior.
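The two forms of the ELBO can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes a single parameter $\theta$ with a standard normal prior and a Gaussian likelihood around $\theta$ (a hypothetical toy model), estimates both forms from the same Monte Carlo samples, and compares the ELBO at the exact posterior with the ELBO at a shifted proposal.

```python
import math
import random


def log_normal(x, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def log_lik(theta, data, noise_var):
    """log p(D | theta) for i.i.d. observations y_i ~ N(theta, noise_var)."""
    return sum(log_normal(y, theta, noise_var) for y in data)


def kl_gauss(m_q, v_q, m_p, v_p):
    """Closed-form KL(N(m_q, v_q) || N(m_p, v_p))."""
    return 0.5 * (math.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / v_p - 1.0)


def elbo_estimates(m_q, v_q, data, noise_var, n_samples=20000, seed=0):
    """Return the ELBO computed in both forms of the ELBO equation."""
    rng = random.Random(seed)
    form1 = 0.0   # E_q[log p(D|theta) + log p(theta) - log q(theta)]
    exp_ll = 0.0  # E_q[log p(D|theta)]
    for _ in range(n_samples):
        theta = rng.gauss(m_q, math.sqrt(v_q))
        ll = log_lik(theta, data, noise_var)
        form1 += ll + log_normal(theta, 0.0, 1.0) - log_normal(theta, m_q, v_q)
        exp_ll += ll
    form1 /= n_samples
    form2 = exp_ll / n_samples - kl_gauss(m_q, v_q, 0.0, 1.0)
    return form1, form2


# Toy data and the exact posterior under a N(0, 1) prior (assumed setup).
data = [1.2, 0.8, 1.5, 1.1, 0.9]
noise_var = 0.5
post_var = 1.0 / (1.0 + len(data) / noise_var)
post_mean = post_var * sum(data) / noise_var

elbo_post, elbo_post_alt = elbo_estimates(post_mean, post_var, data, noise_var)
elbo_off, elbo_off_alt = elbo_estimates(post_mean + 0.5, post_var, data, noise_var)
```

Because the ELBO differs from $\log p(D)$ by the KL divergence to the true posterior, `elbo_post` should exceed `elbo_off`; the gap is the KL divergence between the shifted proposal and the posterior.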

The gradients of the ELBO can be estimated using Monte Carlo samples $\theta^{i}$ of the parameters drawn from the approximate posterior at each iteration of the training loop. The resulting loss, the negative of the sampled ELBO, is:

$$L_{VI}=\sum_{i=1}^{n}\left[\log(q_{\phi}(\theta^{i}))-\log p(\theta^{i})-\log p(D|\theta^{i})\right] \tag{4}$$

Stochastic gradients are obtained using the reparameterization:

$$z=\mu+\epsilon\cdot\exp\left(\frac{1}{2}\log(\sigma^{2})\right) \tag{5}$$
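A minimal sketch of why the reparameterization yields usable gradients: writing $z$ as a deterministic function of $(\mu,\log\sigma^{2})$ and noise $\epsilon\sim N(0,1)$ lets the derivative pass through the sample. The example assumes a toy objective $E[z^{2}]=\mu^{2}+\sigma^{2}$, chosen only because its gradients are known exactly; the function name is hypothetical.

```python
import math
import random


def reparam_grads(mu, log_var, n_samples=50000, seed=0):
    """Pathwise gradient estimates of E[z^2] w.r.t. mu and log_var."""
    rng = random.Random(seed)
    sigma = math.exp(0.5 * log_var)
    g_mu = 0.0
    g_lv = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + eps * sigma                  # reparameterized sample
        dz_dmu = 1.0
        dz_dlv = 0.5 * eps * sigma            # d/d(log_var) of eps * exp(log_var / 2)
        g_mu += 2.0 * z * dz_dmu              # chain rule through f(z) = z^2
        g_lv += 2.0 * z * dz_dlv
    return g_mu / n_samples, g_lv / n_samples


mu, log_var = 0.7, math.log(0.25)
grad_mu, grad_log_var = reparam_grads(mu, log_var)

# Exact gradients of mu^2 + exp(log_var) for comparison.
exact_grad_mu = 2.0 * mu
exact_grad_log_var = math.exp(log_var)
```

In a BNN the same construction is applied to every weight, with the toy objective replaced by the variational loss and the chain rule handled by automatic differentiation.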

The variational approximation can also be used to generate samples from a latent encoding $q(z)$ of an input $x$, using the commonly used Variational Auto-Encoder (VAE) model. Further details on this experimental setup are given in Section 3.3.

2.2 Incorporating Constraint Information in Bayesian Entropy Method

The method of maximum entropy (Giffin et al., 2007) was proposed by Jaynes to perform inference of model parameters subject to constraints. In other words, MaxEnt picks the distribution over the parameters that maximizes entropy insofar as the constraints allow. As a corollary, given no information about the parameters, a flat distribution is the one that maximizes entropy. This line of reasoning suggests that the method fits well within the Bayesian updating paradigm, as constraints are a form of information about the parameters.

If we have a joint prior $p(\theta,x)$ on a model and some information $F$ about the model parameters, the posterior update $p(\theta,x_{c},x)$ is obtained by maximizing the relative entropy between the posterior and the prior subject to the constraints, posed as an optimization problem:

$$\begin{aligned}\arg\max_{p}\;&-\int p(\theta,x_{c},x)\log\frac{p(\theta,x_{c},x)}{p(\theta,x)}\,dx\,d\theta\\ \text{s.t.}\;&\int p(\theta,x_{c},x)\,dx\,d\theta=1\\ &\int p(\theta,x_{c},x)\,f(\theta)\,d\theta\,dx=F\end{aligned} \tag{6}$$

The above equation formalizes the entropy maximization problem subject to a constraint on the expected value of $f$ under $p$ and a normalization constraint on $p$. The solution is derived using the method of Lagrange multipliers. If one extends the above constrained system to $m$ constraints, the objective is to solve for $m+1$ multipliers:

$$L(p,\lambda,\nu)=\int p_{i}\log(p_{i})\,dp-\sum_{k=1}^{m}\lambda_{k}\int(p_{i}f_{k}(x_{i})-F_{k})\,dp-\nu\int(p_{i}-1)\,dp \tag{7}$$

The solution takes the form of an exponential function normalized by a partition function Z:

$$\begin{aligned}p(\theta|x,x_{c})&=\frac{1}{Z(\lambda_{1},\ldots,\lambda_{m})}\cdot\exp\left(-\sum_{k=1}^{m}\lambda_{k}f_{k}(x)\right)\\ Z(\lambda_{1},\ldots,\lambda_{m})&=\int\exp\left(-\sum_{k=1}^{m}\lambda_{k}f_{k}(x)\right)\end{aligned} \tag{8}$$
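To make the exponential form concrete, the following sketch solves a small discrete MaxEnt problem: the least-informative distribution over the faces of a die subject to a single constraint on its mean. This setting is an assumption chosen for illustration (it is the classic textbook example, not one of the paper's experiments). The distribution takes the form $p_{i}\propto\exp(-\lambda x_{i})$, and the single multiplier $\lambda$ is found by bisection so that the constraint is met.

```python
import math


def maxent_pmf(lam, support):
    """MaxEnt distribution p_i = exp(-lam * x_i) / Z(lam) on a finite support."""
    weights = [math.exp(-lam * x) for x in support]
    z = sum(weights)  # partition function Z(lam)
    return [w / z for w in weights]


def mean_under(lam, support):
    """Expected value of x under the MaxEnt distribution for a given lam."""
    return sum(p * x for p, x in zip(maxent_pmf(lam, support), support))


def solve_multiplier(target, support, lo=-20.0, hi=20.0, tol=1e-12):
    """Bisection on lam; the mean decreases monotonically as lam increases."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_under(mid, support) > target:
            lo = mid  # mean too large, so lam must increase
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


support = [1, 2, 3, 4, 5, 6]
target_mean = 4.5  # the constraint value F (assumed for illustration)
lam = solve_multiplier(target_mean, support)
pmf = maxent_pmf(lam, support)
```

With no constraint active (a target mean of 3.5) the multiplier is zero and the distribution is flat, which matches the corollary stated earlier in this section.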

From the above description, it is seen that the inference procedure requires a prior that captures no new information apart from the constraints and the data. Therefore, one choice to model this is to use a uniform prior. For a distribution $q_{\phi}(\theta)$ and a uniform prior distribution $p(\theta)$, the Kullback-Leibler (KL) divergence from $q_{\phi}(\theta)$ to $p(\theta)$ can be calculated as:

$$D_{KL}(q_{\phi}(\theta)\,||\,p(\theta))=\int q_{\phi}(\theta)\log\frac{q_{\phi}(\theta)}{p(\theta)}\,d\theta \tag{9}$$

With $p(\theta)=\frac{1}{|\Theta|}$ for all $\theta\in\Theta$, the KL divergence simplifies to:

$$D_{KL}(q_{\phi}(\theta)\,||\,p(\theta))=\int q_{\phi}(\theta)\log(q_{\phi}(\theta)\cdot|\Theta|)\,d\theta \tag{10}$$

This can be decomposed into:

$$D_{KL}(q_{\phi}(\theta)\,||\,p(\theta))=\int q_{\phi}(\theta)\log q_{\phi}(\theta)\,d\theta+\log|\Theta|\int q_{\phi}(\theta)\,d\theta \tag{11}$$

Given that $q_{\phi}(\theta)$ is a probability distribution, the second integral evaluates to 1, yielding:

$$D_{KL}(q_{\phi}(\theta)\,||\,p(\theta))=\int q_{\phi}(\theta)\log q_{\phi}(\theta)\,d\theta+\log|\Theta| \tag{12}$$

The integral represents the negative of the entropy of $q_{\phi}$, denoted as $-H(q_{\phi})$. Therefore, the KL divergence can be rewritten as:

$$D_{KL}(q_{\phi}(\theta)\,||\,p(\theta))=-H(q_{\phi})+\log|\Theta| \tag{13}$$

Since $\log|\Theta|$ is constant, the negative KL divergence equals the entropy of the variational distribution up to an additive constant. This has been concurrently investigated by de Mathelin et al. (2023), where the model uses just the entropy of the proposal distribution in place of the KL divergence term. This is motivated by the above reasoning using the MaxEnt principle. Effectively, MaxEnt ensures that the prior assumptions are minimal, as the uniform prior contributes only a constant term to the KL divergence, making no further assumptions about the distribution of the parameters.
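The identity between the KL divergence to a uniform prior and the negative entropy can be verified on a small discrete example. The sketch below assumes a finite parameter set of size $|\Theta|=4$ and an arbitrary distribution `q`; both are hypothetical choices made only to keep the check exact.

```python
import math


def entropy(q):
    """Shannon entropy H(q) in nats."""
    return -sum(p * math.log(p) for p in q if p > 0.0)


def kl(q, p):
    """KL(q || p) for discrete distributions on the same support."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0.0)


q = [0.1, 0.2, 0.3, 0.4]          # arbitrary proposal distribution
uniform = [1.0 / len(q)] * len(q)  # uniform prior, p = 1 / |Theta|

lhs = kl(q, uniform)
rhs = -entropy(q) + math.log(len(q))  # -H(q) + log|Theta|
```

Minimizing `lhs` over `q` is therefore the same as maximizing the entropy of `q`, since the `log(len(q))` term does not depend on `q`.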

2.3 Bayesian Entropy Neural Networks

Building on the MaxEnt principle and Bayesian Neural Networks (BNNs), this section introduces Bayesian Entropy Neural Networks (BENN). The previous section modeled constraints on the posterior distribution. Now, constraint enforcement on both the posterior predictive and the parameters is demonstrated using this framework.

Given a Bayesian Neural Network $f_{BNN}(\theta)$, a prediction $\hat{y}$, an inequality constraint function $g$ and an equality constraint function $h$, the learning objective can be formulated as:

$$\begin{aligned}\theta^{*}=\;&\text{argmin}_{\theta}\left(KL(q_{\phi}(\theta)\,\|\,p(\theta))-E_{q_{\phi}(\theta)}[\log p(D\mid\theta)]\right)\\ &\text{s.t.}\quad g(x)\leq 0\\ &\phantom{\text{s.t.}\quad}h(x)=0\end{aligned} \tag{14}$$

For brevity, we write $L_{VI}=KL(q_{\phi}(\theta)\,||\,p(\theta))-E_{q}[\log p(D|\theta)]$. Using Lagrange multipliers, the loss function can be formulated as:

$$L=\text{argmin}_{\theta}\,L_{VI}+\lambda\cdot h(x)+\mu\cdot g(x) \tag{15}$$

The above hard-constrained problem must strictly satisfy the inequality and equality constraints while also minimizing the loss function. In the case of a soft constraint, violations of the constraints can be allowed using a slack variable $\xi$ on the constraint infeasibility term:

$$L=\text{argmin}_{\theta}\,L_{VI}+\lambda\cdot h(x)+\mu\cdot(g(x)-\xi^{2}) \tag{16}$$

It is easy to see that this regularization form can theoretically satisfy both the constraint and the data, provided the neural network has enough parameters. However, for a loss that includes the constraint as a regularizer, as written above, the constrained solution is not guaranteed to be a local minimum; it may instead be a stationary point that is not a minimum, which gradient descent does not reliably reach. Oftentimes, practitioners perform a grid search over the values of the constraint weights to select one that maximizes a metric of interest, or pick a value that assigns importance to the constraint relative to the data. To improve on this, a more principled, robust updating scheme is used for the parameters and the Lagrange multipliers. The Modified Differential Method of Multipliers (MDMM), proposed by Platt et al. (1987), is found to be well-suited for fitting the constraint weights, both from a theoretical perspective, as detailed in the original paper that proposed this optimization technique, and from a practical standpoint. MDMM adds an extra damping term, similar to penalty methods, to the loss function and has the following form:

$$\text{argmin}_{\theta}\, L(p, \lambda, c) = \text{argmin}_{\theta}\, L_{VI} + \lambda h(x) + \frac{c_{1}}{2} h(x)^{2} + \mu (g(x) - \xi^{2}) + \frac{c_{2}}{2} g(x)^{2} \tag{17}$$
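The update scheme behind Equation 17 can be illustrated on a toy problem. The sketch below is a minimal illustration under stated assumptions, not the training code used in this work: a quadratic objective stands in for $L_{VI}$, there is a single linear equality constraint $h$, and the names `mdmm_solve`, `lr` and `c` are hypothetical. The parameters follow gradient descent on the damped Lagrangian, while the multiplier follows gradient ascent driven by the constraint violation.

```python
import numpy as np

def mdmm_solve(steps=2000, lr=0.05, c=1.0):
    """Toy MDMM loop: minimize f(x) subject to h(x) = 0."""
    target = np.array([2.0, 1.0])   # assumed stand-in objective: f(x) = ||x - target||^2
    x = np.zeros(2)                 # plays the role of the network parameters
    lam = 0.0                       # Lagrange multiplier attached to h
    grad_h = np.array([1.0, 1.0])   # h(x) = x0 + x1 - 1 is linear, so its gradient is constant
    for _ in range(steps):
        h = x[0] + x[1] - 1.0
        grad_f = 2.0 * (x - target)
        # Descent on the parameters: gradient of f + lam * h + (c / 2) * h^2
        x_new = x - lr * (grad_f + (lam + c * h) * grad_h)
        # Ascent on the multiplier: its rate of change is the violation h
        lam = lam + lr * h
        x = x_new
    return x, lam

x_opt, lam_opt = mdmm_solve()
```

For this toy problem the constrained optimum can be derived by hand as $x = (1, 0)$ with $\lambda = 2$, which makes it a convenient check that the multiplier is learned rather than tuned by grid search.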

From Section 2.2, it is seen that the constrained optimization problem leads to an updated posterior of the form given by Equation 8. Note that BENN gives us an easy way to sample from the posterior distribution, so one can extract $N$ samples from the posterior to evaluate the constraint.

Compared to the work of Wang et al. (2021b) on the Bayesian Entropy Gaussian Process (BEGP), BENN leverages the high-dimensional representation capacity of neural networks while using the same underlying principle to enforce constraints. Moreover, the optimization framework obtained as a result of the backpropagation algorithm makes it possible to simultaneously optimize both the model parameters and the Lagrangian multipliers. Gaussian Processes, however, have inherent advantages in uncertainty quantification that Bayesian Neural Networks do not, and this paper attempts to incorporate the principles from MaxEnt to demonstrate constraining predictive variance as well.

3 Experiments

3.1 1D-Regression

In this section, the data, architecture, and loss used for the regression problem are first described. Following this, a demonstration of value and derivative constraints is shown. Finally, demonstrations to constrain the behavior of the predictive variance for 1-D regression are shown.

The dataset for the 1D-regression problem models a polynomial function with added noise and is defined over two distinct regions. In both regions, the function is given by:

$$y = p_{0} x^{2} + p_{1} x + p_{2} + 0.15 \sin(2 \pi x) + \mathcal{N}(0, \sigma) \tag{18}$$

where the coefficients are $p_{0} = 0.2$, $p_{1} = 0.05$, and $p_{2} = 0.01$. The first region, from $x = -3$ to $x = -2$, has a noise level with a standard deviation of $\sigma = 0.01$. The second region, from $x = 2.0$ to $x = 3.0$, has a higher noise level with standard deviation $\sigma = 0.1$. The training data $X_{train}$ and corresponding labels $Y_{train}$ consist of samples from both regions.
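A minimal sketch of how such a dataset could be generated is given below. The coefficients, regions and noise levels follow the description above; the number of points per region, the uniform sampling of $x$, the seed and the function name `make_dataset` are assumptions made for illustration.

```python
import numpy as np

def make_dataset(n_per_region=50, seed=0):
    """Two-region noisy polynomial data following Equation 18."""
    rng = np.random.default_rng(seed)
    p0, p1, p2 = 0.2, 0.05, 0.01

    def f(x):
        return p0 * x**2 + p1 * x + p2 + 0.15 * np.sin(2 * np.pi * x)

    # Assumption: inputs are drawn uniformly within each region.
    x_left = rng.uniform(-3.0, -2.0, n_per_region)
    x_right = rng.uniform(2.0, 3.0, n_per_region)
    y_left = f(x_left) + rng.normal(0.0, 0.01, n_per_region)    # low-noise region
    y_right = f(x_right) + rng.normal(0.0, 0.1, n_per_region)   # high-noise region
    return np.concatenate([x_left, x_right]), np.concatenate([y_left, y_right])

X_train, Y_train = make_dataset()
```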

The architecture of the neural network used in these experiments is a Bayesian Neural Network (BNN). The network comprises one hidden layer with 100 units and employs the Rectified Linear Unit (ReLU) as the activation function. The output layer of the network is designed to produce two outputs: the mean prediction of the target variable and a parameter termed $\log(\text{noise})$ or $\sigma$, which represents the log-transformed noise level of the predictions. For the expected likelihood loss, it is assumed that the data is generated from a Gaussian, and a negative log-likelihood loss is minimized:

$$\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}, \sigma) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{(y_{i} - \hat{y}_{i})^{2}}{2 \exp(\sigma)} + \frac{\sigma}{2} \right) \tag{19}$$

Here, $\mathbf{y}$ represents the actual values, $\hat{\mathbf{y}}$ denotes the predicted mean values, $N$ is the number of data points, and $\sigma$ is the log-noise parameter. This loss function not only penalizes the deviation of predictions from the actual values but also incorporates the uncertainty associated with the predictions as indicated by $\sigma$.
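As a small illustration, the loss in Equation 19 can be written in a few lines. This is a sketch in NumPy rather than a differentiable training implementation, and the function name `gaussian_nll` is hypothetical; `log_noise` plays the role of $\sigma$ and may be a scalar or one value per data point.

```python
import numpy as np

def gaussian_nll(y, y_hat, log_noise):
    """Mean Gaussian negative log-likelihood with a log-noise parameter (Equation 19)."""
    y, y_hat, log_noise = np.broadcast_arrays(
        np.asarray(y, dtype=float),
        np.asarray(y_hat, dtype=float),
        np.asarray(log_noise, dtype=float),
    )
    # Squared error scaled by the predicted noise, plus a penalty on large noise.
    return np.mean((y - y_hat) ** 2 / (2.0 * np.exp(log_noise)) + log_noise / 2.0)
```

The second term keeps the network from explaining every residual by inflating the noise: raising `log_noise` shrinks the first term but is charged linearly by the second.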

From the point of view of Maximum Entropy (MaxEnt), one seeks to make minimal assumptions about the nature of the model before training. In the Bayesian framework, this can be achieved by using a prior that is uniform over the space of parameters.

3.1.1 Incorporating Value and Derivative Constraints to the Prediction

In this section, the effects of incorporating additional constraints, specifically value and derivative constraints, into the 1D regression model are demonstrated. These constraints are particularly focused on areas outside the training data range. Several experiments to understand how these constraints influence the model’s predictions are conducted, especially in regions where training data is sparse or absent.

Value Constraints Outside Training Range:

In the first demonstration, two value constraints are placed at $x = 5.0$ and $x = 7.5$, both outside the training data range. The loss function for this is obtained as the sum of $L_{VI}$ and the infeasibility constraints for both points. The infeasibility is calculated using:

$$\mathcal{L}_{\text{val}} = E[y(x)] - y_{\text{gt}}(x) \tag{20}$$

This setup allows us to observe the model's response to multiple external constraints and their impact on the overall prediction accuracy. The results in Figure 1(a) show that the proposed approach is able to accurately capture the value constraints. The network complies with the constraints more tightly than it fits the data, owing to the representational capacity of the model.
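The expectation in Equation 20 can be approximated by averaging stochastic forward passes. The sketch below assumes a hypothetical callable `sample_prediction(x, rng)` that returns one draw of the prediction at `x`; the toy predictor and the default of 250 draws (mirroring the number of evaluations used for the scores in Table 1) are illustrative choices, not the BENN model itself.

```python
import numpy as np

def value_infeasibility(sample_prediction, x, y_gt, n_samples=250, seed=0):
    """Monte Carlo estimate of E[y(x)] - y_gt(x)."""
    rng = np.random.default_rng(seed)
    draws = np.array([sample_prediction(x, rng) for _ in range(n_samples)])
    return draws.mean() - y_gt

def toy_model(x, rng):
    # Hypothetical stochastic predictor standing in for one BNN forward pass.
    return 2.0 * x + rng.normal(0.0, 0.1)

infeas = value_infeasibility(toy_model, x=5.0, y_gt=10.0)
```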

Derivative constraints

The derivative constraint is implemented by using the central difference $\delta$ around a small neighborhood $\epsilon$ of the point (or region) in consideration:

$$\mathcal{L}_{\text{derivative}} = E[\delta_{\epsilon}(y(x))] - y_{\text{gt}}(x) \tag{21}$$
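The central-difference infeasibility can be sketched as follows. The code assumes a hypothetical `sample_function(rng)` that returns one function draw from the posterior, so that both evaluations in the difference share the same weights; that pairing, the step `eps` and the toy sampler are assumptions of this sketch.

```python
import numpy as np

def derivative_infeasibility(sample_function, x, dydx_target, eps=1e-3,
                             n_samples=250, seed=0):
    """Monte Carlo estimate of E[central difference of y at x] - target slope."""
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_samples):
        f = sample_function(rng)  # one posterior draw, reused at x + eps and x - eps
        diffs.append((f(x + eps) - f(x - eps)) / (2.0 * eps))
    return np.mean(diffs) - dydx_target

def toy_sampler(rng):
    # Hypothetical posterior draw: a parabola with random curvature and vertex at x = 9.5.
    a = rng.normal(1.0, 0.05)
    return lambda x: a * (x - 9.5) ** 2

infeas = derivative_infeasibility(toy_sampler, x=9.5, dydx_target=0.0)
```

Every draw of the toy sampler has zero slope at $x = 9.5$, so the estimated infeasibility is zero up to floating-point error.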

It is seen in Figure 1(d) that this approach is sufficient to capture derivative information. There is a caveat here that the

Variance constraints

The BE technique can encode any type of information as a constraint. This includes constraints on the predictive variance. While this is generally modulated by the data using the Gaussian likelihood term, it is possible to enforce constraints on its behavior given expert knowledge about it. Examples of such cases might be to specify the level of decay in prediction confidence outside the training data range. The variance constraint loss term is:

$$\mathcal{L}_{\text{var}} = \lambda (\sigma^{2}(x) - \sigma^{2}_{\text{gt}}(x)) \tag{22}$$

Figure 3 showcases three such examples of variance constraints on the output variance neuron, applied in regions outside the training data, with increasing uncertainty.

Conflicting Constraints

Model behavior in the presence of conflicting constraints is shown in Figure 1(b). Two different value constraints are applied at the same location $x = 7.5$. This experiment is designed to evaluate how the model handles conflicting information and to understand the role of variance constraints in such situations. The variance at the constraint location is meant to express the uncertainty in the data provided in the form of a constraint. It is also reasonable to want to express arbitrary values of uncertainty at such points. However, the model needs to account for the variance that is admissible in these situations, so the predictive variance is constrained to be proportionate to the uncertain value constraints.

Bound Constraints

Next, an evaluation of the model with bound constraints to ensure that the predictions fall within a certain range is shown. For a given input $x$, the bound constraints are defined by two functions, representing the upper bound $\text{ub}(x)$ and lower bound $\text{lb}(x)$. These are given by:

$$\text{ub}(x) = 1.0, \quad \text{lb}(x) = 0.5, \quad \forall x \in [-0.5, 0.5] \tag{23}$$

where $x$ is the input to the model. A total of $n$ bound constraints are enforced, each represented as a hard constraint. For each test input $x_{i} \in X_{\text{test}}$, the constraints are formulated as:

$$\text{lb}(x_{i}) \leq \text{pred}_{i} \leq \text{ub}(x_{i}), \quad \forall i \in \{1, \ldots, n\}. \tag{24}$$

Like the equality constraint, the infeasibility, or violation, of the bounds is quantified by how far a prediction is from satisfying the constraints. The bound constraints are assessed by computing the deviation of the bounds from the original prediction:

$$L_{bound} = \begin{cases} \text{min} - f(x) & \text{if } f(x) < \text{min} \\ 0 & \text{if } \text{min} \leq f(x) \leq \text{max} \\ \text{max} - f(x) & \text{if } f(x) > \text{max} \end{cases} \tag{25}$$
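A minimal sketch of this piecewise infeasibility is shown below, with `lb` and `ub` standing for the min and max in Equation 25; the function name `bound_infeasibility` and the example predictions are hypothetical.

```python
import numpy as np

def bound_infeasibility(pred, lb, ub):
    """Signed bound violation: lb - pred below the band, ub - pred above it, 0 inside."""
    pred = np.asarray(pred, dtype=float)
    return np.where(pred < lb, lb - pred, np.where(pred > ub, ub - pred, 0.0))

# Bounds from Equation 23: lb = 0.5 and ub = 1.0 on x in [-0.5, 0.5].
viol = bound_infeasibility([0.3, 0.7, 1.2], lb=0.5, ub=1.0)
```

The violation is positive below the lower bound and negative above the upper bound, so an aggregate score such as the one in Table 1 would average its absolute value over the constraint domain points.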

The variance constraints modify the predicted aleatoric variance to minimize the absolute difference between the predicted and expected variance values.

Table 1 compares the evaluation scores produced by PR-BNN (Huang et al., 2022) and BENN on constrained regression experiments. The evaluation scores are computed as the absolute value of the constraint violations, $|y_{pred} - y_{gt}|$, where $y_{pred}$ is averaged over 250 evaluations of the model. In the table, Inf-1 and Inf-2 indicate the violations of each constraint. For instance, the value constraints for both the conflicting and non-conflicting cases incorporate two separate constraints that need to be satisfied. In the case of the derivative constraint, there is only one constraint, whose violation is listed under Inf-1. Finally, for the bound constraint, the violation is averaged over the constraint domain points and listed under Inf-1. Figure 1 and Figure 2 show a visual comparison of these two approaches. BENN differs significantly from PR-BNN in the way conflicting constraints are handled: PR-BNN biases its prediction towards one of the two value constraints, whereas the BENN prediction falls in between the two specified value constraints. The addition of the variance constraints in this work also allows experts to express uncertainties in the region of the specified constraints. The results of the derivative and bound constraints, however, do not show significant differences. It is to be noted that PR-BNN does attain lower constraint violations than BENN in the non-conflicting value and derivative cases.

Table 1: Comparison of BENN and PR-BNN methods for value, bound and derivative constraints
| Experiment | BENN Inf-1 | BENN Inf-2 | PR-BNN Inf-1 | PR-BNN Inf-2 |
|---|---|---|---|---|
| Value Constraints - No Conflict | 0.023 | 0.041 | 0.001 | 0.038 |
| Conflicting Value Constraint | 1.017 | 0.982 | 0.001 | 2.001 |
| Derivative Constraint | 0.0016 | - | 4.216e-06 | - |
| Bound Constraint | 0.0 | - | 0.0 | - |

Figure 1: 1-D constrained regression demonstrations with value, derivative and bound constraints. (a) Value constraints set with $\sigma_{ale} = 0.0$. (b) Conflicting value constraints with $\sigma_{ale} = 2.0$. (c) Bound constraint in x = [-0.5, 0.5]. (d) Derivative constraint $\frac{dy}{dx} = 0$ at $x = 9.5$.

Figure 2: 1-D constrained regression demonstrations using PR-BNN with value, derivative and bound constraints. (a) Value constraints at x = 5, 7.5. (b) Conflicting value constraints at x = 7.5. (c) Bound constraint in x = [-0.5, 0.5]. (d) Derivative constraint $\frac{dy}{dx} = 0$ at $x = 9.5$.

Figure 3: 1-D constrained regression demonstrations with variance constraints outside the training data. (a) Low variance outside the training data. (b) Moderate variance outside the training data. (c) High variance outside the training data.

3.2 Beam Deflection Problem

In this section, a classical problem in structural engineering is addressed: predicting the deflection of a beam under a given load. The data for this experiment is obtained from the theoretical deflection of a beam subjected to a load. The deflection of a beam, $y$, as a function of its position, $x$, is given by the function $y(x)$, where the deflection depends on the material's Young's modulus ($E$), the length of the beam ($L$), the moment of inertia of the beam's cross-section ($I$), and the applied load ($P$). Specifically, the deflection function is defined piecewise for a beam of length $2L$ as:

$$y(x) = \begin{cases} \frac{-P}{8EI} x^{3} + \frac{PL}{4EI} x^{2}, & \text{if } 0 \leq x \leq 2L \\ \frac{P}{6EI} x^{3} - \frac{3PL}{2EI} x^{2} + \frac{7PL^{2}}{2EI} x - \frac{7PL^{3}}{3EI}, & \text{otherwise} \end{cases} \tag{26}$$

In this experiment, the values for $E$, $L$, and $I$ are set to typical values found in steel beams and $P$ is set to 2000 N. For training purposes, a small sample of 10 points is drawn between $0.5L$ and $1.3L$. To simulate real-world measurements, a small Gaussian noise $N(0, 0.001)$ is added to the deflection values, resulting in the observed deflections, $y_{\text{train}}(x)$. It should be noted that the problem setting indicates that the system is only partially observable (i.e., between $0.5L$ and $1.3L$), and the other portion of the beam is non-observable. This is not uncommon in engineering practice due to the inaccessibility of the structure for equipment/sensors or due to the limited number of sensors/resources to measure the entire structure.
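The piecewise deflection in Equation 26 and the sampling of the training points can be sketched as follows. The load $P = 2000$ N, the 10 points and the range $0.5L$ to $1.3L$ come from the text; the values `E = 200e9`, `I = 1e-6` and `L = 1.0` are hypothetical placeholders (the source only says typical steel values), and reading 0.001 as a standard deviation is also an assumption.

```python
import numpy as np

def beam_deflection(x, P=2000.0, E=200e9, I=1e-6, L=1.0):
    """Piecewise beam deflection of Equation 26 (E, I, L are placeholder values)."""
    x = np.asarray(x, dtype=float)
    EI = E * I
    first = -P / (8 * EI) * x**3 + P * L / (4 * EI) * x**2
    second = (P / (6 * EI) * x**3 - 3 * P * L / (2 * EI) * x**2
              + 7 * P * L**2 / (2 * EI) * x - 7 * P * L**3 / (3 * EI))
    return np.where((x >= 0) & (x <= 2 * L), first, second)

rng = np.random.default_rng(0)
x_train = rng.uniform(0.5, 1.3, size=10)   # observable region, with L = 1
y_train = beam_deflection(x_train) + rng.normal(0.0, 0.001, size=10)  # assumed std of 0.001
```

Both branches evaluate to zero at $x = 2L$ and the first branch has zero value and zero slope at $x = 0$, which is consistent with the boundary conditions used as constraints in this experiment.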

The Bayesian Neural Network (BNN) used for regression predicts both the mean and the variance of the target variable. This network has an input layer with 1 neuron (representing the position $x$), and an output layer with 2 neurons, outputting both the predicted mean deflection and the log-transformed noise level. Further, it has 1 hidden layer, consisting of 2048 units. The GELU activation function $x \cdot \Phi(x)$ is used for this task, where $\Phi(x)$ is the Cumulative Distribution Function (CDF) of the Gaussian. This activation helps significantly improve performance in the mean sense for the test set, evaluated from $L = 0$ to $L = 3$.

Constraint Implementation

To enhance the performance of the BENN model, specific constraints are applied where the direct measurements are not available due to the reasons mentioned above. These constraints are formulated to ensure that the model’s predictions adhere to known physical principles and boundary conditions associated with beam deflection. Two types of constraints are implemented: value constraints and derivative constraints.

  1. Value Constraint: The value constraint is applied to ensure that the predicted deflection at specific points matches known values. This is particularly relevant at the beam's left end and the pin support at $x = 2.0$, where the deflection is zero:

     $$f(x = 0.0; \theta) = 0, \quad f(x = 2.0; \theta) = 0. \tag{27}$$

  2. Derivative Constraint: The derivative constraint is applied to ensure that the rate of change of the deflection at a certain point matches a known value. In this case, the derivative is constrained to be zero at $x = 0.0$, corresponding to the beam's left end, using finite differences.

Figure 4: Beam configuration. The observable regions are given in the dashed boxes.

Figure 5: Predicted beam deflection with BENN and BNN.

Figure 5 shows that the proposed BENN model prediction significantly outperforms the BNN model.

3.3 Microstructure Generation

In materials science, predicting the properties of heterogeneous materials such as alloys, composites, polymers, and porous media is critical. These materials exhibit complex microstructures that influence their mechanical, thermal, electromagnetic, and other physical properties. Thus, efficient and scalable microstructure generation methods are critical for structure-property analysis. For example, Gao et al. (2021) propose a method using a mixture random field model to generate a non-Gaussian field. This statistical method can generate binary phase microstructures that can be either anisotropic or isotropic, and it requires the specification of two-point correlation functions. It belongs to the family of statistical descriptor approaches, and the generated microstructure has an interpretable physical meaning. However, this approach can only handle second-order statistical descriptors and cannot generate some microstructures that show very complex patterns and high-order statistics. Alternatively, Neural Network-based microstructure generation can handle arbitrarily complex microstructure patterns due to the universal approximation nature of NNs. For example, autoencoders, leveraging advancements in machine learning, provide a modern computational tool that offers significant benefits in the simulation and analysis of microstructures.

Autoencoders facilitate the fast and efficient simulation of microstructures, reducing reliance on extensive experimental datasets and accelerating the material design process. This capability is crucial for designing materials with optimized properties tailored for specific applications, essential in sectors like aerospace, automotive, and biomedical engineering. Furthermore, the computational generation of microstructures allows for the quantification and management of the inherent uncertainties in materials, which is vital for developing reliable materials and implementing robust design principles. Moreover, generating microstructures using autoencoders fits seamlessly into broader computational frameworks such as Integrated Computational Materials Engineering (ICME), as discussed by Gao et al. (2021). However, traditional autoencoders do not necessarily reproduce the known statistical descriptors from classical microstructure quantification. Thus, the proposed autoencoder-based approach aims to incorporate the classical statistical descriptors as constraints in the NN architecture using the proposed BENN methodology.

3.3.1 Bayesian Entropy Convolutional Variational Auto-Encoder (BE-CVAE)

The proposed architecture integrates a Convolutional Variational Autoencoder (CVAE) focusing on image generation while accommodating specific constraints using the BE framework. The encoder is a sequence of convolutional layers, each followed by a Rectified Linear Unit (ReLU) activation function, MaxPooling for spatial downsampling, and Batch Normalization. The convolutional layers extract hierarchical features to finally obtain the latent space $Z$.

The latent space is characterized by two fully connected layers, one for the mean $\mu$ and the other for the log-variance $\log(\sigma^{2})$ of a Gaussian distribution. The model uses the reparameterization trick, generating a sample $z$ from the latent space by $z = \mu + \sigma \odot \epsilon$, where $\epsilon$ is random noise sampled from a standard normal distribution.
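A minimal sketch of the standard reparameterization step is given below, written in NumPy for illustration only (a training implementation would use an automatic-differentiation framework so that gradients flow through the mean and log-variance). The function name `reparameterize` and the example values are hypothetical.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps, with sigma = exp(0.5 * log_var) and eps ~ N(0, I)."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -2.0])
# Many draws, to check empirically that samples follow N(mu, sigma^2).
z = np.stack([reparameterize(mu, log_var, rng) for _ in range(20000)])
```

Because the randomness enters only through `eps`, the sample is a deterministic function of `mu` and `log_var`, which is what makes the sampling step trainable by backpropagation.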

The decoder reconstructs the input image using transposed convolutional layers with ReLU activations and Batch Normalization. A Sigmoid function in the final layer outputs pixel values in the [0,1] range. Post-decoding, the output is resized to the original input dimensions using interpolation.

3.3.2 Dataset

The dataset is generated using a non-Gaussian random field, without using techniques such as simulated annealing, which are computationally intensive. The use of this technique allows for the simultaneous generation of both the microstructure and the latent material property field. Using this approach, up to 150 training samples are generated for the microstructure generative model experiments.

3.3.3 Constraint - Two-point correlation function

The Two Point Correlation Function (TPCF) serves as a fundamental constraint in the analysis and synthesis of binary microstructure images. This statistical tool quantifies the spatial correlation between pairs of points within a given distance in a microstructure, providing insight into its heterogeneity and spatial distribution features. In the proposed model, the TPCF is calculated for each generated microstructure image. The function evaluates the probability of finding two points, separated by a certain distance, that are both in the same state (either material or void). The TPCF constraint ensures that the generated microstructures possess spatial characteristics similar to those of the training set. This includes features like the distribution of phases, the degree of homogeneity or clustering, and the overall geometric properties of the material. By applying this constraint, it is aimed to produce microstructures that are not only statistically representative of the real material system but also maintain critical physical and mechanical properties.

Ensuring fidelity to the TPCF is particularly significant in materials science and engineering, where the microstructural arrangement greatly influences the macroscopic properties of materials. The accurate reproduction of TPCF in synthetic microstructures is thus essential for reliable material property predictions and subsequent applications.

The Two Point Correlation Function (TPCF) in this study is computed using a Fourier Transform autocorrelation approach, which is an efficient method for analyzing spatial patterns in microstructures. This method is based on the principle that the Fourier transform of an autocorrelation function is the squared magnitude of the Fourier transform of the original function. The steps of the computation, described in Suankulova (2020), are summarized as follows:

  1. Fourier Transform: The Fourier Transform of the binary microstructure is computed. This step is facilitated by the use of the Fast Fourier Transform (FFT), which is computationally efficient for digital images.

  2. Power Spectrum: Next, the power spectrum of the Fourier-transformed image is obtained by calculating its squared magnitude. This power spectrum reflects the frequency components of the spatial pattern of the microstructure.

  3. Inverse Fourier Transform: The inverse Fourier Transform is applied to the power spectrum. This operation converts the frequency-domain representation back into the spatial domain, resulting in an autocorrelation function.

  4. Radial Averaging: Finally, for the TPCF, radial averaging is performed on the autocorrelation function. This process involves averaging the values over circles (in 2D) of radius $r$, which provides the final TPCF.
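The four steps above can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions rather than the code used in this study: the function name `two_point_correlation` is hypothetical, the image is treated as periodic, radii are binned to the nearest integer pixel distance, and the result is the probability that two points a distance $r$ apart both lie in the phase encoded as 1.

```python
import numpy as np

def two_point_correlation(img):
    """Radially averaged two-point correlation of a 2D binary image via FFT."""
    img = np.asarray(img, dtype=float)
    spectrum = np.fft.fft2(img)                         # step 1: Fourier transform
    power = np.abs(spectrum) ** 2                       # step 2: power spectrum
    autocorr = np.fft.ifft2(power).real / img.size      # step 3: periodic autocorrelation
    autocorr = np.fft.fftshift(autocorr)                # move zero lag to the array centre
    ny, nx = img.shape
    yy, xx = np.indices((ny, nx))
    r_bin = np.rint(np.hypot(yy - ny // 2, xx - nx // 2)).astype(int)
    # Step 4: average the autocorrelation over rings of (rounded) radius r.
    sums = np.bincount(r_bin.ravel(), weights=autocorr.ravel())
    counts = np.bincount(r_bin.ravel())
    max_r = min(ny, nx) // 2
    return sums[:max_r] / counts[:max_r]

img = (np.random.default_rng(0).random((32, 32)) > 0.6).astype(float)
s2 = two_point_correlation(img)
```

At zero separation the function reduces to the volume fraction of the phase encoded as 1, which gives a quick sanity check on the normalization.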

This Fourier Transform approach to computing the TPCF is advantageous due to its computational efficiency, particularly for large and complex microstructures. It also inherently accommodates periodic boundary conditions, which is beneficial in materials with repeating structures.

The resultant TPCF from this method provides insights into the heterogeneity and spatial distribution of features within the microstructure, which is crucial for understanding how these features influence the material’s macroscopic properties.

To facilitate gradient-based optimization in the process of image binarization, a differentiable binarization function is employed, defined as follows:

$$\text{b}(x) = \sigma(s \cdot (x - \theta)), \tag{28}$$

where $\sigma$ denotes the sigmoid function, $s$ is the steepness parameter, $\theta$ is the threshold, and $x$ is the input value. The sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$ is used to provide a smooth transition between 0 and 1. The steepness parameter $s$, set to a high value (e.g., 100), ensures a sharp transition around the threshold $\theta$. This approach allows for an approximate binary representation while retaining differentiability, which is essential for gradient-based optimization methods.
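A short sketch of Equation 28 is shown below. The steepness of 100 follows the example value in the text, while the threshold of 0.5 and the function name `soft_binarize` are assumptions for illustration.

```python
import numpy as np

def soft_binarize(x, threshold=0.5, steepness=100.0):
    """Differentiable approximation to thresholding: sigmoid(s * (x - theta))."""
    x = np.asarray(x, dtype=float)
    # Inputs are expected in [0, 1] (e.g. sigmoid decoder outputs), so exp stays finite.
    return 1.0 / (1.0 + np.exp(-steepness * (x - threshold)))

b = soft_binarize([0.1, 0.5, 0.9])
```

Values well below the threshold map very close to 0, values well above it map very close to 1, and the output is exactly 0.5 at the threshold itself.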

The TPCF is now ready to be used as a functional constraint. This amounts to either solving multiple value constraints for the TPCF sequentially, or computing the overall absolute deviation from the expected curve at once. The latter option is computationally cheaper than evaluating multiple value constraints per iteration.

3.3.4 Constraint - Porosity

The porosity of a microstructure is a critical characteristic because it directly influences the material’s mechanical, thermal, and transport properties. High porosity can lead to reduced strength and increased permeability, which are important factors in applications such as structural components and filtration systems. Accurate control and prediction of porosity are important for material design workflows that have specific performance criteria. The porosity function is given by:

$$e = \frac{V_{p}}{V_{p} + V_{s}}, \tag{29}$$

where $V_{p}$ is the total number of voxels in the void phase and $V_{s}$ is the total number of voxels in the solid phase.
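For a binary image, Equation 29 reduces to a voxel count. The sketch below assumes that the void phase is encoded as 0; this encoding and the function name `porosity` are illustrative assumptions.

```python
import numpy as np

def porosity(binary_img, void_value=0):
    """Fraction of voxels in the void phase: V_p / (V_p + V_s)."""
    binary_img = np.asarray(binary_img)
    v_p = np.count_nonzero(binary_img == void_value)  # void voxels
    v_s = binary_img.size - v_p                       # solid voxels
    return v_p / (v_p + v_s)

example = porosity([[0, 1], [1, 1]])
```

During training a hard count is not differentiable, so one option is to replace it with the mean of a softly binarized image; that substitution is a design choice of such a sketch, not something stated in the source.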

3.3.5 Results

The proposed model was first tested to compare how fast convergence occurred in generating quality samples. The baseline case with no constraints took far longer to converge compared to the ones using constraints. Furthermore, the model that used both the porosity and the TPCF constraints converged significantly faster than both the baseline and the model that only incorporated porosity constraints. Next, the effect of training data on performance is studied. The performance of the generative model is measured by how well the generated structures conform to the expected TPCF function. Table 2 shows that both constrained models comply better with the TPCF requirements than the unconstrained baseline at every training-set size, and that adding the porosity constraint lowers the error further at 25, 50 and 100 samples, although not at 150 samples (0.34 versus 0.29). As expected, an improvement is also generally seen when the number of training samples is increased. It is to be noted, however, that the evolution of constraint compliance during training paints a more nuanced picture in high-dimensional problems. Although the resulting absolute errors from the TPCF and porosity constraints are lower than those seen in the baseline model, they are still non-zero. This is because the TPCF and porosity constraints end up guiding the model towards faster convergence and reconstructions that adhere better to these two constraints.

Figure 6: Microstructure generation using the proposed model after training for 260 epochs with (top row) no constraints (the baseline CVAE model), (middle row) the TPCF constraint, and (bottom row) the TPCF and porosity constraints. Note that the model converges faster with two constraints than with the baseline or with the TPCF constraint alone.
| Training samples | No constraint | TPCF | TPCF + Porosity |
| 25               | 0.85          | 0.67 | 0.60            |
| 50               | 0.59          | 0.46 | 0.38            |
| 100              | 0.59          | 0.41 | 0.39            |
| 150              | 0.47          | 0.29 | 0.34            |
Table 2: TPCF constraint compliance, measured as the L1 error on generated samples, for different numbers of training samples and constraints.
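
One way such an L1 compliance error could be computed is sketched below. It assumes a binary phase-indicator array, a periodic two-point autocorrelation evaluated with FFTs, and a mean absolute difference taken over all offsets. The function names and these choices are illustrative assumptions; the estimator and normalization behind Table 2 may differ.

```python
import numpy as np


def two_point_correlation(phase: np.ndarray) -> np.ndarray:
    """Periodic two-point autocorrelation of a binary indicator array.

    Entry [d] is the fraction of positions x for which both x and x + d
    (with wrap-around) lie in the marked phase.
    """
    indicator = phase.astype(float)
    spectrum = np.fft.fftn(indicator)
    # Wiener-Khinchin: the autocorrelation is the inverse FFT of the power spectrum.
    autocorr = np.fft.ifftn(spectrum * np.conj(spectrum)).real
    return autocorr / indicator.size


def tpcf_l1_error(generated: np.ndarray, reference: np.ndarray) -> float:
    """Mean absolute difference between the two correlation maps (assumed metric)."""
    diff = two_point_correlation(generated) - two_point_correlation(reference)
    return float(np.mean(np.abs(diff)))


# Hypothetical 2x2 structures: 1 marks the phase of interest.
checkerboard = np.array([[1, 0], [0, 1]])
stripes = np.array([[1, 1], [0, 0]])
print(tpcf_l1_error(checkerboard, checkerboard))  # identical structures -> 0.0
print(tpcf_l1_error(checkerboard, stripes))
```

At zero offset the correlation reduces to the volume fraction of the marked phase, which gives a quick sanity check on the normalization.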

4 Conclusion

In this paper, the Bayesian Entropy Neural Networks (BENN) framework was introduced, and its application in various settings, including 1D regression, beam deflection modeling, and microstructure generation, was demonstrated. The integration of the Maximum Entropy (MaxEnt) principle with the Bayesian framework allowed for the imposition of constraints on predictions, thereby offering a novel approach to constrained deep learning. Note that this method is a general framework that explains the approaches used in similar works such as Huang et al. (2022) and Yang et al. (2020a).

The experiments with 1D regression emphasized the BENN framework’s versatility in handling value, derivative, and variance constraints. It was observed that the model could effectively accommodate external constraints, even in areas with sparse or absent training data. This was particularly evident in experiments involving conflicting constraints and variance control, illustrating BENN’s ability to manage complex and uncertain information. The model can, therefore, be used for problems that require adherence to boundary conditions through constraints.

An interesting demonstration of BENN in this work was in the domain of microstructure generation. The integration of constraints such as the Two-Point Correlation Function (TPCF) and porosity into a Convolutional Variational Autoencoder showed substantial improvements in generating microstructures that are not only statistically representative but also adhere more closely to critical physical properties. This highlights the framework’s potential in materials science, where accurate microstructure characterization is crucial.

This work leaves several areas for improvement and further investigation. The computational complexity of simultaneously optimizing the model parameters and the Lagrangian multipliers is a challenge, particularly for large-scale problems. Another avenue for advancement lies in the integration of epistemic uncertainty constraints within the BENN framework. Epistemic, or model, uncertainty arises from a lack of knowledge about the best model to represent a process. In many real-world applications, especially in fields like climate modeling and epidemic modeling, understanding and quantifying epistemic uncertainty is crucial for making reliable predictions and decisions. Incorporating these constraints could enhance the model’s ability to express uncertainty in predictions, especially in scenarios with limited data or where the data does not capture the entire spectrum of the underlying distribution. This could involve developing new methodologies for characterizing and quantifying epistemic uncertainty and devising novel ways to embed this information into the learning process. Finally, future work that applies the BENN framework to a wider array of practical applications would not only demonstrate its versatility but also uncover specific challenges and opportunities for improvement.

5 Acknowledgment

This work was partially supported by funds from the National Science Foundation (Award Number: 2331781) and by funds from Arizona State University. The support is gratefully acknowledged.

References

  • Gao et al. (2021) Gao, Y., Jiao, Y., Liu, Y., 2021. Ultra-efficient reconstruction of 3d microstructure and distribution of properties of random heterogeneous materials containing multiple phases. Acta Materialia 204, 116526.
  • Giffin et al. (2007) Giffin, A., Caticha, A., 2007. Updating probabilities with data and moments. arXiv: Data Analysis, Statistics and Probability doi:10.1063/1.2821302.
  • Goldberg (2016) Goldberg, Y., 2016. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research 57, 345–420.
  • Hosseini-Asl et al. (2015) Hosseini-Asl, E., Zurada, J.M., Nasraoui, O., 2015. Deep learning of part-based representation of data using sparse autoencoders with nonnegativity constraints. IEEE transactions on neural networks and learning systems 27, 2486–2498.
  • Huang et al. (2022) Huang, J., Pang, Y., Liu, Y., Yan, H., 2022. Posterior regularized bayesian neural network incorporating soft and hard knowledge constraints. Cornell University - arXiv doi:10.48550/arxiv.2210.08608.
  • Khan et al. (2020) Khan, A., Sohail, A., Zahoora, U., Qureshi, A.S., 2020. A survey of the recent architectures of deep convolutional neural networks. Artificial intelligence review 53, 5455–5516.
  • Lagaris et al. (1998) Lagaris, I.E., Likas, A., Fotiadis, D.I., 1998. Artificial neural networks for solving ordinary and partial differential equations. IEEE transactions on neural networks 9, 987–1000.
  • Lu and Zhang (2022) Lu, Q., Zhang, W., 2022. Integrating dynamic bayesian network and physics-based modeling for risk analysis of a time-dependent power distribution system during hurricanes. Reliability Engineering & System Safety 220, 108290. URL: https://www.sciencedirect.com/science/article/pii/S0951832021007614, doi:10.1016/j.ress.2021.108290.
  • de Mathelin et al. (2023) de Mathelin, A., Deheeger, F., Mougeot, M., Vayatis, N., 2023. Maximum weight entropy. arXiv preprint arXiv:2309.15704.
  • Platt et al. (1987) Platt, J., Barr, A.H., 1987. Constrained differential optimization. Neural Information Processing Systems.
  • Raissi et al. (2019) Raissi, M., Perdikaris, P., Karniadakis, G.E., 2019. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics doi:10.1016/j.jcp.2018.10.045.
  • Sam et al. (2024) Sam, D., Pukdee, R., Jeong, D.P., Byun, Y., Kolter, J.Z., 2024. Bayesian neural networks with domain knowledge priors. arXiv preprint arXiv:2402.13410.
  • Sirignano and Spiliopoulos (2017) Sirignano, J.A., Spiliopoulos, K.V., 2017. Dgm: A deep learning algorithm for solving partial differential equations. J. Comput. Phys. 375, 1339–1364. URL: https://api.semanticscholar.org/CorpusID:126213877.
  • Suankulova (2020) Suankulova, K., 2020. Correlation functions in determining the structure of chromosomes.
  • Tran et al. (2022) Tran, B.H., Rossi, S., Milios, D., Filippone, M., 2022. All you need is a good functional prior for bayesian deep learning. arXiv:2011.12829.
  • Wang et al. (2021a) Wang, Y., Gao, Y., Liu, Y., Ghosh, S., Subber, W., Pandita, P., Wang, L., 2021a. Bayesian-entropy gaussian process for constrained metamodeling. Reliability Engineering and System Safety doi:10.1016/j.ress.2021.107762.
  • Wang et al. (2021b) Wang, Y., Gao, Y., Liu, Y., Ghosh, S., Subber, W., Pandita, P., Wang, L., 2021b. Bayesian-entropy gaussian process for constrained metamodeling. Reliability Engineering and System Safety doi:10.1016/j.ress.2021.107762.
  • Wang et al. (2020) Wang, Y., Liu, Y., 2020. Bayesian entropy network for fusion of different types of information. Reliability Engineering and System Safety doi:10.1016/j.ress.2019.106747.
  • Wilson and Izmailov (2020) Wilson, A.G., Izmailov, P., 2020. Bayesian deep learning and a probabilistic perspective of generalization. Advances in neural information processing systems 33, 4697–4708.
  • Yang et al. (2020a) Yang, W., Lorch, L., Graule, M.A., Lakkaraju, H., Doshi-Velez, F., 2020a. Incorporating interpretable output constraints in bayesian neural networks. arXiv: Learning.
  • Yang et al. (2020b) Yang, W., Lorch, L., Graule, M.A., Lakkaraju, H., Doshi-Velez, F., 2020b. Incorporating interpretable output constraints in bayesian neural networks. arXiv: Learning.
  • Zhu et al. (2019) Zhu, Y., Zabaras, N., Koutsourelakis, P.S., Perdikaris, P., 2019. Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data. Journal of Computational Physics 394, 56–81.