ULLER: A Unified Language
for Learning and Reasoning

Emile van Krieken^∗
University of Edinburgh

Samy Badreddine^∗
Sony AI
Fondazione Bruno Kessler
UniTrento

Robin Manhaeve
KU Leuven

Eleonora Giunchiglia
TU Wien

Abstract

The field of neuro-symbolic artificial intelligence (NeSy), which combines learning and reasoning, has recently experienced significant growth. There now are a wide variety of NeSy frameworks, each with its own specific language for expressing background knowledge and how to relate it to neural networks. This heterogeneity hinders accessibility for newcomers and makes comparing different NeSy frameworks challenging. We propose a language for NeSy, which we call ULLER, a Unified Language for LEarning and Reasoning. ULLER encompasses a wide variety of settings, while ensuring that knowledge described in it can be used in existing NeSy systems. ULLER has a first-order logic syntax specialised for NeSy for which we provide example semantics including classical FOL, fuzzy logic, and probabilistic logic. We believe ULLER is a first step towards making NeSy research more accessible and comparable, paving the way for libraries that streamline training and evaluation across a multitude of semantics, knowledge bases, and NeSy systems.

^∗ Equal contribution.
Correspondence to [email protected] and [email protected].
Accepted at the 18th International Conference on Neural-Symbolic Learning and Reasoning.

1 Introduction

Deep learning has driven innovation in many fields for the past decade. Among the many reasons behind its central role is the ease with which it can be applied to a multitude of problems. Recently, neuro-symbolic (NeSy) methods (see, e.g., [30, 4, 50, 21, 35, 57, 24]), which belong to the NeSy subfield of informed machine learning [20, 53], have overcome some well-known problems affecting deep learning models by exploiting background knowledge available for the problem at hand. For example, knowledge can help to train models with fewer data points and/or incomplete supervision, to create models that comply by design with a set of requirements, and to make models more robust in out-of-distribution prediction tasks.

However, background knowledge makes it more challenging to obtain the “frictionless reproducibility” [17] that characterises machine learning (ML). Indeed, shared datasets and clear evaluation metrics allow ML practitioners to quickly get started with evaluating new methods and comparing them to existing work. To achieve this goal for NeSy research, we also need frictionless sharing of knowledge. Current NeSy frameworks all have different approaches to encoding the background knowledge: some use logical languages, like first-order [4, 35] and propositional logic [2, 56, 21], logic programming [30] or answer set programming [57], with a wide array of different syntaxes, while other methods use plain Python programs [14]. See Section 5 for an overview. To compare the performance of different NeSy systems, a researcher needs to specify the same knowledge in many languages. This is a significant barrier both for researchers new to the field and for experts, as it is a time-consuming and error-prone task.

ULLER, a Unified Language for LEarning and Reasoning

We take a first step towards frictionless sharing of knowledge in the NeSy field by proposing a Unified Language for Learning and Reasoning (ULLER, pronounced “OOH-ler” like the god in Norse mythology). ULLER allows us to express the knowledge used in informed machine learning. Our long-term goal is to create a Python library implementing ULLER to be shared among the significant NeSy systems. First, the user expresses the knowledge in ULLER. Then, they load the data, after which they call different NeSy systems with a single line of code to train neural networks, or to use the knowledge at prediction time. This allows the NeSy community $(\mathit{i})$ to define benchmarks that include both data and knowledge, $(\mathit{ii})$ to easily compare the available NeSy systems on such benchmarks, and $(\mathit{iii})$ to lower the entry barrier to NeSy research for the broader machine learning community.

Achieving the above requires decoupling the syntax of the knowledge representation from the semantics given by the NeSy system of interest. The syntax of ULLER, defined in Section 2, is based on first-order logic (FOL). However, we additionally introduce statements for ULLER. Statements simplify the process of writing down function application and composition, and hence dealing with data sampling and processing pipelines. We opt for a FOL syntax for three reasons. First, it generalises propositional logic while being a common language for declaring general constraints. Second, FOL with statements is familiar to ML researchers, who are mostly used to writing procedural statements like in Python, while having a well-defined semantics for logicians. Finally, FOL is highly expressive: we believe that it can express all knowledge currently used in NeSy methods.

The semantics of ULLER (Section 3) depends on $(\mathit{i})$ an interpretation, often referred to as a “symbol grounding” [23], which maps symbols to meanings, and $(\mathit{ii})$ a “NeSy system”, which takes knowledge and its interpretation, and computes loss functions and outputs accordingly. We formalise the differences between NeSy systems by what they compute given a program in ULLER and an interpretation. For classical boolean semantics, ULLER is equivalent to standard FOL. However, we also provide example semantics for several common NeSy frameworks based on fuzzy logic (such as Logic Tensor Networks [4]) and probabilistic logic (such as Semantic Loss [56] and DeepProbLog [30]). This highlights the flexibility of our language, as it can be used to express knowledge in many formalisms.

2 Syntax of ULLER

Let $\mathcal{V}$ be a set of variable symbols, $\mathcal{C}$ be a set of constant symbols, $\mathcal{D}$ be a set of domain symbols, $\mathcal{P}$ be a set of predicate symbols, $\mathcal{T}$ be a set of property symbols, and $\mathcal{F}$ be a set of function symbols. We then define the syntax of ULLER $\mathcal{L}_{\text{{ULLER}}}$ as a context-free grammar:

$\displaystyle F$	$\displaystyle::=\forall x\in D\ (F)\ \mid\ \exists x\in D\ (F)$	(1)
$\displaystyle F$	$\displaystyle::=F\land F\ \mid\ F\lor F\ \mid\ F\Rightarrow F\ \mid\ \neg F\ \mid\ \mathrm{P}(T,\dots,T)\ \mid\ (F)$
$\displaystyle F$	$\displaystyle::=x:=f(T,\dots,T)\ (F)$
$\displaystyle T$	$\displaystyle::=x\ \mid\ c\ \mid\ T.\text{prop}\ \mid\ T+T\ \mid\ T-T\ \mid\ \dots$

where $D\in\mathcal{D}$ , $x\in\mathcal{V}$ , $c\in\mathcal{C}$ , $f,+,-\in\mathcal{F}$ , $\mathrm{P}\in\mathcal{P}$ and $\text{prop}\in\mathcal{T}$ . The nonterminal symbol $F$ is a formula and $T$ is a term. We call $x:=f(T,\dots,T)\ (F)$ a statement, which we discuss in Section 2.1. Notice that, except for basic arithmetic operations ( $+$ , $-$ , …), functions only appear in statements.
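To make the grammar concrete, the following is a minimal sketch of how a fragment of it could be represented as an abstract syntax tree in Python. All class names (`Var`, `Prop`, `Let`, and so on) are hypothetical and not part of ULLER; the sketch also omits $\lor$, $\Rightarrow$ and $\exists$ for brevity. The example formula is a simplified variant of Example 2.1 without the preprocessing steps.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Hypothetical class names: the paper fixes the grammar, not a Python API.

@dataclass(frozen=True)
class Var:            # x
    name: str

@dataclass(frozen=True)
class Const:          # c
    name: str

@dataclass(frozen=True)
class Prop:           # T.prop
    term: "Term"
    prop: str

@dataclass(frozen=True)
class BinOp:          # T + T, T - T, ...
    op: str
    left: "Term"
    right: "Term"

Term = Union[Var, Const, Prop, BinOp]

@dataclass(frozen=True)
class Pred:           # P(T, ..., T)
    name: str
    args: Tuple[Term, ...]

@dataclass(frozen=True)
class Not:            # not F
    body: "Formula"

@dataclass(frozen=True)
class And:            # F and F
    left: "Formula"
    right: "Formula"

@dataclass(frozen=True)
class Forall:         # forall x in D (F)
    var: str
    domain: str
    body: "Formula"

@dataclass(frozen=True)
class Let:            # statement: x := f(T, ..., T) (F)
    var: str
    func: str
    args: Tuple[Term, ...]
    body: "Formula"

Formula = Union[Pred, Not, And, Forall, Let]

# forall x in T ( n1 := f(x.im1), n2 := f(x.im2) ( n1 + n2 = x.sum ) )
x = Var("x")
mnist_add = Forall(
    "x", "T",
    Let("n1", "f", (Prop(x, "im1"),),
        Let("n2", "f", (Prop(x, "im2"),),
            Pred("=", (BinOp("+", Var("n1"), Var("n2")), Prop(x, "sum"))))),
)
```

Note that function symbols occur only inside `Let` nodes, mirroring the restriction that functions appear only in statements, and that the comma-separated statement list is already desugared into nested `Let` nodes.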

The syntax of ULLER does not include a special syntactic construct for neural networks. Instead, we treat them as functions, where the intended meaning is given by the semantics specified by the NeSy system. We therefore hide from the user how the NeSy system uses the neural networks, so the focus is on specifying constraints rather than implementation details.

Syntactic Sugar. For the quantifiers, we use $\forall x_{1}\in D_{1},x_{2}\in D_{2}(F)$ as syntactic sugar for $\forall x_{1}\in D_{1}\ (\forall x_{2}\in D_{2}\ (F))$ . We use $x_{1}:=f_{1}(T,\dots,T),x_{2}:=f_{2}(T,\dots,T)\ (F)$ as syntactic sugar for $x_{1}:=f_{1}(T,\dots,T)\ (x_{2}:=f_{2}(T,\dots,T)\ (F))$ . Finally, we also allow for binary predicates in infix notation, such as $T\leq T$ .

Typing. ULLER is a dynamically typed language. We guarantee neither syntactically nor via a type checker that functions and predicates only take arguments from the domains defined in their interpretations. This mimics the design of Python's type system.

2.1 Statements

A key design choice of ULLER is the use of special statements $x:=f(T,...,T)\ (F)$ to declare (possibly random) variables obtained by applying (possibly non-deterministic) functions. Unlike in standard FOL, the function symbols $f$ appear only in statements, and not in the definition of terms $T$ . Statements simplify the composition of functions. They give a syntax that is familiar to ML researchers who are used to writing Python, and that clearly separates the machine learning pipeline that processes data from the constraints on the data given by the logic. We motivate statements with the following two examples.

Example 2.1 (Procedural composition of functions).

Consider the MNISTAdd example from Appendix A. To emphasise the ease of composing functions in ULLER, consider a scenario where the classifier $f$ expects greyscale images while the data points in the dataset $T$ are RGB images. We can easily apply transformations and formulate the new condition using ULLER statements:

\displaystyle\begin{split}
&\forall x\in T(\\
&\quad x_{1}^{\prime}:=\mathrm{greyscale}(x.\mathrm{im1}),x_{1}^{\prime\prime}:=\mathrm{normalise}(x_{1}^{\prime}),\\
&\quad x_{2}^{\prime}:=\mathrm{greyscale}(x.\mathrm{im2}),x_{2}^{\prime\prime}:=\mathrm{normalise}(x_{2}^{\prime}),\\
&\quad n_{1}:=f(x_{1}^{\prime\prime}),n_{2}:=f(x_{2}^{\prime\prime})\\
&\qquad(n_{1}+n_{2}=x.\mathrm{sum})\\
&)
\end{split}

(2)

$\triangleleft$

Example 2.2 (Scoping independence).

Another key feature of ULLER statements is that they explicitly delimit the scopes of variables, giving control over the memoisation and independence assumptions. Consider a non-deterministic function $\mathrm{dice}()$ which associates a probability to each outcome of a six-sided die throw. Consider the following program written in ULLER:

x:=\mathrm{dice}()\ (x=6\land\mathrm{even}(x)).

(3)

The formula asks whether a die-throw outcome is both a six and even. For a fair die, the probability of the formula is $\frac{1}{6}$ under probabilistic semantics.

Now consider the alternative ULLER program:

(x:=\mathrm{dice}()\ (x=6))\land(x:=\mathrm{dice}()\ (\mathrm{even}(x))).

(4)

In this program, we throw two independent dice, and check if the first lands on six and the second is even. For fair dice, the probability of this formula is $\frac{1}{6}\cdot\frac{1}{2}=\frac{1}{12}$ .

Consider a similar program in regular FOL (which is not allowed in ULLER):

(\mathrm{dice}()=6)\land\mathrm{even}(\mathrm{dice}())

(5)

Here, it is ambiguous whether the outcomes of the dice are shared, like in the ULLER program of (3), or not, like in (4). Many probabilistic NeSy frameworks choose the first option and memoise the outcome of the function. We argue that this behaviour should not be a default assumption of the NeSy system. Instead, dependence and memoisation scopes should be explicitly defined by the program. ULLER statements give researchers control over these scopes. $\triangleleft$
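The two probabilities in Example 2.2 can be checked by direct enumeration. The sketch below assumes the probabilistic semantics of Section 3.4, where a statement over a discrete codomain is an expectation over outcomes (Equation 19) and conjunction is a product (Equation 17). The helper `expect` and the lambdas are illustrative names, not part of ULLER.

```python
from fractions import Fraction

# Fair six-sided die as a distribution: outcome -> probability.
dice = {k: Fraction(1, 6) for k in range(1, 7)}

def expect(dist, body):
    """Expectation of body(a) for a drawn from dist (discrete case)."""
    return sum(p * body(a) for a, p in dist.items())

is_six = lambda x: int(x == 6)
is_even = lambda x: int(x % 2 == 0)

# Program (3): one statement, so both predicates see the same outcome x.
shared = expect(dice, lambda x: is_six(x) * is_even(x))

# Program (4): two statements, so two independent outcomes.
independent = expect(dice, is_six) * expect(dice, is_even)

print(shared, independent)  # 1/6 1/12
```

The only difference between the two computations is where the expectation is taken, which is exactly the scope delimited by the statement.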

3 Semantics of ULLER

In this section, we define the semantics of ULLER. In Section 3.1 we discuss how ULLER interprets the symbols in the language, such as the function and domain symbols. Then, in Section 3.2, we discuss how some example NeSy systems interpret the formulas in ULLER.

3.1 Interpretation of the Symbols

To assign meaning to ULLER programs, we need to interpret the non-logical symbols in $\mathcal{L}_{\text{{ULLER}}}$ , that is, $\mathcal{D}$ , $\mathcal{P}$ , $\mathcal{F}$ , and $\mathcal{C}$ , using an interpretation function $I$ .

Definition 3.1.

An interpretation $I$ is a function assigning a meaning to the symbols in $\mathcal{L}_{\text{{ULLER}}}$ under the following rules, where $\Omega_{1},...,\Omega_{n},\Omega_{n+1}$ are sets.

1. The interpretation of a domain $D\in\mathcal{D}$ is a set $\Omega$ .
2. The interpretation of a predicate $\mathrm{P}$ of arity $n$ is a function of $n$ domains to $\{0,1\}$ . That is, $I(\mathrm{P}):\Omega_{1}\times...\times\Omega_{n}\rightarrow\{0,1\}$ .
3. The interpretation of the predicate $\mathrm{true}\in\mathcal{P}$ is the identity function on $\{0,1\}$ , that is, $I(\mathrm{true}):\{0,1\}\rightarrow\{0,1\}$ such that $I(\mathrm{true})(x)=x$ .
4. The interpretation of a constant $c$ is an element of a domain $I(c)\in\Omega_{i}$ .
5. The interpretation of a function $f$ of arity $n$ is a conditional probability distribution^1 $I(f):\Omega_{1}\times...\times\Omega_{n}\rightarrow(\Omega_{n+1}\rightarrow[0,1])$ . That is, for any set of inputs $x_{1}\in\Omega_{1},...,x_{n}\in\Omega_{n}$ , $I(f)(x_{1},...,x_{n})$ is a probability distribution on the domain $\Omega_{n+1}$ . If for all $x_{1}\in\Omega_{1},...,x_{n}\in\Omega_{n}$ the probability distribution $I(f)(x_{1},...,x_{n})$ is a deterministic distribution, we say that $I(f)$ is a deterministic function.

^1 To be precise, our definition is equivalent to a probability kernel or Markov kernel, which is a formalisation of the concept of a conditional probability distribution in measure theory.

Next, we give a probabilistic interpretation to both domains and functions. In particular, we treat functions, such as neural networks, as a conditional distribution given assignments $x_{1}\in\Omega_{1},...,x_{n}\in\Omega_{n}$ to input variables. This allows us to represent the uncertainty of the neural networks, which NeSy systems using, for example, probabilistic and fuzzy semantics can use to compute probabilities and fuzzy truth values. We will also frequently want to use regular (deterministic) functions $f:\Omega_{1}\times...\times\Omega_{n}\rightarrow\Omega_{n+1}$ . A regular function is a special case of a conditional distribution that we refer to as a deterministic function. We define deterministic functions with a conditional distribution using the Dirac delta distribution at $f(x_{1},...,x_{n})$ for continuous distributions, and a distribution that assigns 1 to the output value $f(x_{1},...,x_{n})$ for finite domains, and 0 to the other values.
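As an illustration of the finite-domain case, the sketch below lifts a regular Python function to a conditional distribution that assigns probability 1 to its output and 0 to every other value. The names `deterministic`, `kernel` and `is_deterministic_at` are hypothetical, and distributions are represented as plain dictionaries from outcomes to probabilities.

```python
from typing import Callable, Dict, Hashable, Iterable

Dist = Dict[Hashable, float]

def deterministic(f: Callable[..., Hashable], codomain: Iterable) -> Callable[..., Dist]:
    """Lift a regular function to a conditional distribution on a finite codomain."""
    values = list(codomain)

    def kernel(*xs) -> Dist:
        y = f(*xs)
        # Probability 1 on the output value, 0 on all other values.
        return {a: (1.0 if a == y else 0.0) for a in values}

    return kernel

def is_deterministic_at(dist: Dist) -> bool:
    """True if the distribution puts all its mass on a single value."""
    return sorted(dist.values()) == [0.0] * (len(dist) - 1) + [1.0]

add_mod3 = deterministic(lambda a, b: (a + b) % 3, codomain=range(3))
print(add_mod3(2, 2))  # {0: 0.0, 1: 1.0, 2: 0.0}
```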

3.2 Semantics of neuro-symbolic systems

Figure 1: The meaning of an example ULLER formula under classical, probabilistic and fuzzy semantics. We interpret the function symbols as conditional distributions. The traffic-light function (shown as an icon in the original figure), of type $X\rightarrow(\{\text{red},\text{orange},\text{green}\}\rightarrow[0,1])$ , detects the concepts for red, orange, and green in the crossing representations. The driving function (also shown as an icon), of type $(X\times\{\text{red},\text{orange},\text{green}\})\rightarrow(\{0,1\}\rightarrow[0,1])$ , takes the decision of continuing given an extracted light-colour concept and the rest of the crossing scene. With abuse of notation, we omit $I()$ and $\llbracket\rrbracket$ .

We next define the meaning of a formula in $\mathcal{L}_{\text{{ULLER}}}$ , which requires both an interpretation $I$ and a NeSy system ${\llbracket\rrbracket}$ . Here, ${\llbracket\rrbracket}$ is a function that interprets the semantics of the program statements in $\mathcal{L}_{\text{{ULLER}}}$ . We also need a variable assignment $\eta:\mathcal{V}\rightarrow\mathcal{O}$ that maps variables $v\in\mathcal{V}$ to an element of a domain $\mathcal{O}=\cup_{i}\Omega_{i}$ , where $\Omega_{i}=I(D_{i})$ is a set associated to a domain $D_{i}\in\mathcal{D}$ .

Definition 3.2.

A NeSy structure is a tuple $(I,\eta,\mathcal{B},{\llbracket\rrbracket}_{I,\eta})$ where $I$ is an interpretation, $\eta:\mathcal{V}\rightarrow\mathcal{O}$ is a variable assignment, $\mathcal{B}$ is a set of outputs and ${\llbracket\rrbracket}_{I,\eta}:\mathcal{L}_{\text{{ULLER}}}\rightarrow\mathcal{B}\cup\mathcal{O}$ is a neuro-symbolic system, that is, a function that assigns an output in $\mathcal{B}$ to each formula in $\mathcal{L}_{\text{{ULLER}}}$ and a domain element in $\mathcal{O}$ to each term $T$ . If the interpretation and variable assignment are clear from the context, we write ${\llbracket\rrbracket}$ for ${\llbracket\rrbracket}_{I,\eta}$ .

We discuss several NeSy systems and their semantics for the NeSy language in the following sections, and provide a visual overview in Figure 1. Each NeSy system is defined over some set of outputs $\mathcal{B}$ . For example, classical logic is defined over the output $\{0,1\}$ , while fuzzy logics are defined over the interval $[0,1]$ . A neuro-symbolic system ${\llbracket\rrbracket}_{I,\eta}$ defines the semantics of a language expression. When a language expression is a term $T$ , ${\llbracket\rrbracket}_{I,\eta}$ returns an element of the universe $\mathcal{O}$ . When the language expression is a formula $F$ , it returns an element in $\mathcal{B}$ .

Notation. We use $\eta[x\mapsto a]$ to update a variable assignment $\eta$ with the assignment of $a$ to $x$ :

$\displaystyle\eta[x\mapsto a](x^{\prime})=\begin{cases}a&\text{if }x^{\prime}=x\\ \eta(x^{\prime})&\text{if }x^{\prime}\neq x\end{cases}$	(6)

We also define ${p}_{f}(a|T_{1},...,T_{n})=I(f)(\llbracket T_{1}\rrbracket,...,\llbracket T_{n}\rrbracket)(a)$ , which computes the probability of the element $a\in\Omega_{n+1}$ under the distribution $I(f)$ , conditioned on the interpretations $\llbracket T_{1}\rrbracket,...,\llbracket T_{n}\rrbracket$ of the terms $T_{1}$ to $T_{n}$ . In the coming sections, we will frequently use this shorthand to talk about the semantics of the different NeSy systems.

3.3 Classical semantics

We first define the semantics of the NeSy language if we “choose” an option deterministically from a conditional distribution. Then, under the classical (boolean) semantics of the logical symbols, ULLER is a regular first-order logic: It becomes exactly as expressive as FOL. In this paper, we will make the deterministic choice from a distribution by taking the mode, that is, the most likely output of the conditional distribution. However, other choices are also possible.

Definition 3.3.

The classical structure $(I,\eta,\{0,1\},\llbracket\rrbracket^{\text{C}}_{I,\eta})$ is defined on boolean outputs $\{0,1\}$ as:

$\displaystyle\llbracket\forall x\in D\ (F)\rrbracket^{\text{C}}_{I,\eta}$	$\displaystyle=\min_{a\in I(D)}\llbracket F\rrbracket^{\text{C}}_{I,\eta[x\mapsto a]}$	(7)
$\displaystyle\llbracket\exists x\in D\ (F)\rrbracket^{\text{C}}$	$\displaystyle=\llbracket\neg\forall x\in D\ (\neg F)\rrbracket^{\text{C}}$	(8)
$\displaystyle\llbracket F_{1}\wedge F_{2}\rrbracket^{\text{C}}=\min(\llbracket F_{1}\rrbracket^{\text{C}},\llbracket F_{2}\rrbracket^{\text{C}})$	$\displaystyle,\ \llbracket F_{1}\lor F_{2}\rrbracket^{\text{C}}=\llbracket\neg(\neg F_{1}\land\neg F_{2})\rrbracket^{\text{C}}$	(9)
$\displaystyle\llbracket\neg F\rrbracket^{\text{C}}=1-\llbracket F\rrbracket^{\text{C}}$	$\displaystyle,\ \llbracket F_{1}\Rightarrow F_{2}\rrbracket^{\text{C}}=\llbracket\neg F_{1}\lor F_{2}\rrbracket^{\text{C}}$	(10)
$\displaystyle\llbracket P(T_{1},\dots,T_{n})\rrbracket^{\text{C}}_{I,\eta}$	$\displaystyle=I(P)(\llbracket T_{1}\rrbracket^{\text{C}},\dots,\llbracket T_{n}\rrbracket^{\text{C}})$	(11)
$\displaystyle\llbracket x\rrbracket^{\text{C}}_{I,\eta}=\eta(x)$	$\displaystyle,\ \llbracket c\rrbracket^{\text{C}}=I(c)$	(12)
$\displaystyle\llbracket T_{1}+T_{2}\rrbracket^{\text{C}}$	$\displaystyle=\llbracket T_{1}\rrbracket^{\text{C}}+\llbracket T_{2}\rrbracket^{\text{C}}$	(13)
$\displaystyle\llbracket T.\mathrm{prop}\rrbracket^{\text{C}}$	$\displaystyle=\mathrm{get}(\llbracket T\rrbracket^{\text{C}},\mathrm{prop})$	(14)
$\displaystyle\llbracket x:=f(T_{1},...,T_{n})(F)\rrbracket^{\text{C}}_{I,\eta}$	$\displaystyle=\llbracket F\rrbracket^{\text{C}}_{I,\eta[x\mapsto\operatorname*{argmax}_{a\in\Omega_{n+1}}{p}_{f}(a|T_{1},...,T_{n})]}$	(15)

In Equation 14, $\mathrm{get}(\llbracket T\rrbracket^{\text{C}},\mathrm{prop})$ is a deterministic function that retrieves the value of an object property.

Equation 15 demands some explanation. The $\arg\max$ takes the probability distribution given by the interpretation of the function $f$ and chooses a value from the codomain $\Omega_{n+1}$ . In the classical structure, this choice is made deterministically by picking the mode of the distribution: the most likely element $a$ . Then we assign this element $a$ to the variable $x$ through the variable assignment $\eta[x\mapsto a]$ , and evaluate the rest of the formula $F$ under this new assignment.
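A minimal sketch of Equation 15, assuming a finite codomain and a distribution given as a dictionary from outcomes to probabilities. The function names and the example distribution `digit_dist` are hypothetical. Ties between equally likely outcomes are broken by dictionary order here, which is an implementation choice the definition leaves open.

```python
def mode(dist):
    """Most likely outcome of a finite distribution (first one on ties)."""
    return max(dist, key=dist.get)

def classical_statement(eta, x, dist, body):
    """Equation 15: bind x to the mode of dist, then evaluate the body."""
    return body({**eta, x: mode(dist)})

def classical_forall(eta, x, domain, body):
    """Equation 7: minimum of the body over all elements of the domain."""
    return min(body({**eta, x: a}) for a in domain)

# Hypothetical classifier output for one image.
digit_dist = {0: 0.1, 7: 0.7, 9: 0.2}

# n := f(img) (n = 7)
result = classical_statement({}, "n", digit_dist, lambda eta: int(eta["n"] == 7))
print(result)  # 1
```

The expression `{**eta, x: a}` plays the role of the updated assignment $\eta[x\mapsto a]$ : it copies the assignment and overrides only the binding of `x`.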

Importantly, the classical semantics allows us to prove whether a neuro-symbolic system “is faithful” to classical logic when all functions are deterministic. We formally introduce this notion by noting that we can transform any program into a deterministic program by choosing the mode of the distribution like in Equation 15.

Definition 3.4.

For some interpretation $I$ , the mode interpretation $\hat{I}$ is another interpretation such that for all function symbols $f\in\mathcal{F}$ , $\hat{I}(f)$ returns the mode of ${p}_{f}$ . That is, $\hat{p}_{f}(a|T_{1},...,T_{n})=\delta(a-\arg\max_{a^{\prime}}{p}_{f}(a^{\prime}|T_{1},...,T_{n}))$ , where $\delta$ is the Dirac delta distribution. Then a neuro-symbolic system ${\llbracket\rrbracket}$ is classical in the limit if for all language statements $L\in\mathcal{L}_{\text{{ULLER}}}$ , $\llbracket L\rrbracket_{\hat{I},\eta}=\llbracket L\rrbracket^{\text{C}}_{I,\eta}$ .
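The definition can be illustrated on a single statement over a finite codomain. The sketch below builds the mode version of a distribution and checks that an expectation-based statement semantics (as in Section 3.4) then agrees with the classical one. This is a check of one instance under hypothetical names, not a proof of the property.

```python
def mode_dist(dist):
    """Mode interpretation: all probability mass on the most likely outcome."""
    m = max(dist, key=dist.get)
    return {a: (1.0 if a == m else 0.0) for a in dist}

def prob_statement(dist, body):
    """Expectation of the body over the outcomes of dist."""
    return sum(p * body(a) for a, p in dist.items())

def classical_statement(dist, body):
    """Evaluate the body at the mode of dist."""
    return body(max(dist, key=dist.get))

# Hypothetical classifier output and body formula n = 7.
dist = {0: 0.1, 7: 0.7, 9: 0.2}
body = lambda n: int(n == 7)

# Under the original interpretation the two semantics differ ...
assert prob_statement(dist, body) != classical_statement(dist, body)
# ... but under the mode interpretation they coincide.
assert prob_statement(mode_dist(dist), body) == classical_statement(dist, body)
```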

3.4 Probabilistic Semantics

Probabilistic semantics, also known as weighted model counting or possible world semantics in the literature, computes the probability that a formula is true. This is done by iterating over all possible assignments to the variables. We give a straightforward implementation of a probabilistic semantics for ULLER in Definition 3.5, but note that other probabilistic semantics exist which would require different NeSy systems.

In the upcoming definitions, we will not redefine semantics whenever it is equal to the classical semantics, up to domain differences. For instance, we will not repeat constants and variable semantics.

Definition 3.5.

The probabilistic structure $(I,\eta,[0,1],\llbracket\rrbracket^{\text{P}})$ is defined on probabilities $[0,1]$ as:

$\displaystyle\llbracket\forall x\in D\ (F)\rrbracket^{\text{P}}$	$\displaystyle=\prod_{a\in I(D)}\llbracket F\rrbracket^{\text{P}}_{I,\eta[x\mapsto a]}$	(16)
$\displaystyle\llbracket F_{1}\wedge F_{2}\rrbracket^{\text{P}}$	$\displaystyle=\llbracket F_{1}\rrbracket^{\text{P}}\cdot\llbracket F_{2}\rrbracket^{\text{P}}$	(17)
$\displaystyle\llbracket x:=f(T_{1},...,T_{n})\ (F)\rrbracket^{\text{P}}$	$\displaystyle=\mathbb{E}_{a\sim{p}_{f}(\cdot|T_{1},...,T_{n})}\left[\llbracket F\rrbracket^{\text{P}}_{I,\eta[x\mapsto a]}\right]$	(18)

In probabilistic semantics, a function $f(x)$ is interpreted as a conditional distribution conditioned on $x$ . In this case, we require computing the expectation of the formulas being true under the interpreted functions. This happens in Equation 18. We also define universal aggregation as a product of independent probabilities in Equation 16, reflecting the common i.i.d. assumption in machine learning. This assumption may not hold [43], in which case users can define more sophisticated NeSy systems to model a different universal aggregation behaviour.

The computation of the expectation depends on whether the output domain $\Omega_{n+1}$ is discrete or continuous. For discrete domains, Equation 18 equals

\llbracket x:=f(T_{1},...,T_{n})\ (F)\rrbracket^{\text{P}}=\displaystyle\sum_{a\in\Omega_{n+1}}{p}_{f}(a|T_{1},...,T_{n})\cdot\llbracket F\rrbracket^{\text{P}}_{I,\eta[x\mapsto a]},

(19)

while for continuous domains it equals

\llbracket x:=f(T_{1},...,T_{n})\ (F)\rrbracket^{\text{P}}=\displaystyle\int_{a\in\Omega_{n+1}}{p}_{f}(a|T_{1},...,T_{n})\cdot\llbracket F\rrbracket^{\text{P}}_{I,\eta[x\mapsto a]}\text{d}a.

(20)

We should note that probabilistic semantics in most practical cases will be intractable because of the exponential recursion introduced in Equation 19, not to mention the usually intractable integral in Equation 20 [5]. We can speed this up with techniques that compile formulas into representations where computing the probability of the formula is tractable [8, 11]. The probabilistic semantics is classical in the limit (Appendix C.1), and is connected to the standard weighted model counting semantics used in, for example, Semantic Loss [56], SPL [2] and DeepProbLog [30]. See Appendix D for details.
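To illustrate where the exponential cost comes from, the sketch below evaluates a chain of $k$ nested statements $n_{1}:=f(x_{1}),\dots,n_{k}:=f(x_{k})\ (n_{1}+\dots+n_{k}=s)$ by brute force. The nested sums of Equation 19 are flattened into a single loop over all joint outcomes, which is equivalent by distributivity and visits $|\Omega_{n+1}|^{k}$ terms. The function name and the two example distributions are hypothetical.

```python
from itertools import product

def prob_sum_equals(dists, target):
    """Brute-force probability that independently drawn outcomes sum to target.

    Returns the probability and the number of joint outcomes visited.
    """
    total, terms = 0.0, 0
    for outcome in product(*(d.items() for d in dists)):
        weight = 1.0
        for _, p in outcome:
            weight *= p                     # product of the p_f factors
        values = [a for a, _ in outcome]
        total += weight * float(sum(values) == target)
        terms += 1
    return total, terms

# Two hypothetical classifier outputs over the digits {0, 1}.
d1 = {0: 0.2, 1: 0.8}
d2 = {0: 0.5, 1: 0.5}
p, terms = prob_sum_equals([d1, d2], target=1)
print(terms)  # 4
```

With ten-class digit classifiers and $k$ images the loop would visit $10^{k}$ terms, which is why compilation techniques are used in practice.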

We can generalise the probabilistic semantics to algebraic model counting [25, 15] by considering semirings $\mathcal{B}$ together with a product and a sum operation. This, for example, allows us to compute the most likely assignment to the variables in a formula, or to compute the gradient of the probabilistic semantics using dual numbers.

3.5 Fuzzy Semantics

Our definition for a fuzzy semantics is very similar to that of the probabilistic semantics. The two differences are using t-norms and t-conorms to connect fuzzy truth values, and the interpretation of sampling from boolean distributions.

Definition 3.6.

The fuzzy structure $(I_{F},\eta,[0,1],\llbracket\rrbracket^{\text{F}})$ , where $I_{F}$ is an interpretation $I$ except that the predicate symbol $\mathrm{true}$ is interpreted as the identity function on $[0,1]$ , is defined on fuzzy truth values $[0,1]$ as:

$\displaystyle\llbracket\forall x\in D\ (F)\rrbracket^{\text{F}}_{I_{F},\eta}$	$\displaystyle=\bigotimes_{a\in I(D)}\llbracket F\rrbracket^{\text{F}}_{I_{F},\eta[x\mapsto a]}$	(21)
$\displaystyle\llbracket\exists x\in D\ (F)\rrbracket^{\text{F}}_{I_{F},\eta}$	$\displaystyle=\bigoplus_{a\in I(D)}\llbracket F\rrbracket^{\text{F}}_{I_{F},\eta[x\mapsto a]}$	(22)
$\displaystyle\llbracket F_{1}\wedge F_{2}\rrbracket^{\text{F}}$	$\displaystyle=\llbracket F_{1}\rrbracket^{\text{F}}\operatorname{\otimes}\ \llbracket F_{2}\rrbracket^{\text{F}},\quad\llbracket F_{1}\lor F_{2}\rrbracket^{\text{F}}=\llbracket F_{1}\rrbracket^{\text{F}}\operatorname{\oplus}\ \llbracket F_{2}\rrbracket^{\text{F}}$	(23)
$\displaystyle\llbracket\mathrm{true}(x)\rrbracket^{\text{F}}_{I_{F},\eta}$	$\displaystyle=\eta(x),\quad\text{if }\eta(x)\in[0,1]$	(24)
$\displaystyle\llbracket x\coloneqq f(T_{1},...,T_{n})(F)\rrbracket^{\text{F}}$	$\displaystyle=\begin{cases}\llbracket F\rrbracket^{\text{F}}_{I_{F},\eta[x\mapsto{p}_{f}(1|T_{1},...,T_{n})]}&\text{if }\Omega_{n+1}=\{0,1\}\\ \displaystyle\bigoplus_{a\in\Omega_{n+1}}{p}_{f}(a|T_{1},...,T_{n})\otimes\llbracket F\rrbracket^{\text{F}}_{I_{F},\eta[x\mapsto a]}&\text{if $\Omega_{n+1}$ is finite}\end{cases}$	(25)

where $\operatorname{\otimes}:[0,1]\times[0,1]\rightarrow[0,1]$ is a fuzzy t-norm and $\operatorname{\oplus}:[0,1]\times[0,1]\rightarrow[0,1]$ is a fuzzy t-conorm [4, 49].

In the first case of Equation 25, fuzzy semantics treats a distribution over the boolean codomain $\Omega_{n+1}=\{0,1\}$ as a single truth value ${p}_{f}(1|T_{1},...,T_{n})$ . The second case is defined for discrete, non-boolean codomains. Here, fuzzy semantics reasons disjunctively over all possible outcomes $a\in\Omega_{n+1}$ by interpreting the probability ${p}_{f}(a|T_{1},\dots,T_{n})\in[0,1]$ as a truth degree. This truth degree is then conjoined with the interpretation of the rest of the formula $F$ . Intuitively, the semantics asks whether there “exists $a$ such that $f(T_{1},\dots,T_{n})$ maps to $a$ and $a$ satisfies the rest of the formula $F$ ”. We do not give a semantics for continuous or infinite domains in the fuzzy semantics, as we do not know of a standard definition in the neuro-symbolic literature. The fuzzy semantics is classical in the limit (see Appendix C.2), and is closely related to differentiable fuzzy logics such as Logic Tensor Networks [49, 4] (see Appendix E).
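The two cases of Equation 25 can be sketched as follows. The definition leaves $\otimes$ and $\oplus$ abstract; this sketch assumes the product t-norm and the probabilistic sum t-conorm, and all function names are illustrative. In the boolean case the body receives a truth value, while in the finite case it receives an outcome and returns a truth value.

```python
from fractions import Fraction
from functools import reduce

def t_norm(a, b):
    """Product t-norm (an assumed choice)."""
    return a * b

def t_conorm(a, b):
    """Probabilistic sum t-conorm (an assumed choice)."""
    return a + b - a * b

def fuzzy_statement(dist, body):
    """Equation 25 for a distribution given as {outcome: probability}."""
    if set(dist) == {0, 1}:
        # Boolean codomain: bind x to the truth value p_f(1 | ...).
        return body(dist[1])
    # Finite codomain: disjunction over outcomes of (probability AND body).
    return reduce(t_conorm, (t_norm(p, body(a)) for a, p in dist.items()), Fraction(0))

# Boolean case with the body true(x): the truth value is passed through.
rain = {0: Fraction(3, 10), 1: Fraction(7, 10)}
rain_truth = fuzzy_statement(rain, lambda v: v)

# Finite case: x := dice() (even(x)) for a fair die.
dice = {k: Fraction(1, 6) for k in range(1, 7)}
even_truth = fuzzy_statement(dice, lambda x: int(x % 2 == 0))

print(rain_truth, even_truth)  # 7/10 91/216
```

With these operators the fuzzy value of $\mathrm{even}(x)$ is $1-(5/6)^{3}=91/216$ rather than the probability $1/2$ , which shows that fuzzy and probabilistic semantics can disagree on the same program.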

In addition to fuzzy logics with t-norms and t-conorms for conjunction and disjunction, other NeSy frameworks such as DL2 [18] and STL [52] can also be implemented with this semantics. While fuzzy logic acts on truth values in $[0,1]$ , DL2 acts on truth values in $[-\infty,0]$ and STL in $[-\infty,\infty]$ . They choose appropriate differentiable operators to implement the conjunction and disjunction. We refer the reader to [42] for details.

3.6 Sampling semantics

The sampling semantics ${\llbracket\rrbracket}^{\text{S}}$ is a simple modification to the classical semantics. It samples a value from each conditional distribution and uses that value to evaluate the formula. Therefore, the only difference between ${\llbracket\rrbracket}^{\text{S}}$ and the classical semantics in Definition 3.3 is in Equation 15:

\llbracket x:=f(T_{1},...,T_{n})\ (F)\rrbracket^{\text{S}}=\llbracket F\rrbracket^{\text{S}}_{I,\eta[x\mapsto\mathsf{sample}({p}_{f}(\cdot|T_{1},...,T_{n}))]}

(26)

Here, $\mathsf{sample}$ is a (random) function that takes a probability distribution and samples a value from the codomain $\Omega_{n+1}$ under the distribution ${p}_{f}(\cdot|T_{1},...,T_{n})$ . We can repeat the computation of the sampling semantics ${\llbracket\rrbracket}^{\text{S}}$ to reduce variance, like in standard Monte Carlo methods. This semantics can be combined with gradient estimation methods to learn the parameters of neural networks [39, 51]. A recent implementation of gradient estimation in the context of NeSy is the CatLog derivative trick [14], but any type of estimator based on the score function (commonly known as REINFORCE) can be used [26]. See Appendix B for a short discussion.
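A minimal sketch of Equation 26 and of the repeated evaluation mentioned above, assuming a finite codomain. The helper names and the seeded generator are illustrative choices; the example reuses program (3) from Example 2.2, whose probability is $\frac{1}{6}$ .

```python
import random

def sample(dist, rng):
    """Draw one outcome from a finite distribution {outcome: probability}."""
    outcomes, weights = zip(*dist.items())
    return rng.choices(outcomes, weights=weights, k=1)[0]

def sampling_statement(dist, body, rng):
    """Equation 26: bind x to a sampled outcome, then evaluate the body."""
    return body(sample(dist, rng))

def monte_carlo(dist, body, n, seed=0):
    """Average of n independent evaluations of the sampling semantics."""
    rng = random.Random(seed)
    return sum(sampling_statement(dist, body, rng) for _ in range(n)) / n

dice = {k: 1 / 6 for k in range(1, 7)}

# x := dice() (x = 6 and even(x))
estimate = monte_carlo(dice, lambda x: int(x == 6 and x % 2 == 0), n=20_000)
print(estimate)  # close to 1/6; the exact value depends on the seed
```

Each single evaluation returns a value in $\{0,1\}$ , as in the classical semantics; only the average over repetitions approximates the probabilistic semantics.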

4 Learning and Reasoning

This section describes how to use ULLER for neuro-symbolic learning and reasoning. For a learning setting, we extend the definition of an interpretation (Definition 3.1) to a parameterised interpretation. A parameterised interpretation allows us to implement neural networks with learnable parameters. For instance, a function $\text{model}()$ can be interpreted as a neural network $I_{\boldsymbol{\theta}}(\text{model})=\mathit{NN}_{\boldsymbol{\theta}}$ .

Definition 4.1.

A parameterised interpretation is an interpretation $I_{\boldsymbol{\theta}}$ that is uniquely defined by a set of parameters ${\boldsymbol{\theta}}\in\mathbb{R}^{d}$ .

Let $F\in\mathcal{L}_{\text{{ULLER}}}$ denote a ULLER formula that has a quantifier ranging over a dataset symbol $T$ (for instance, Example 2.1). Learning a parameterised interpretation typically involves searching for an optimal set of parameters ${\boldsymbol{\theta}}^{*}\in\mathbb{R}^{d}$ that maximises the output of the neuro-symbolic system on $F$ over a dataset $\Omega_{T}$ . In most machine learning settings, we are interested in minimising a loss function over a random minibatch $x_{1},...,x_{n}\sim\Omega_{T}$ . We can define such a loss function $L({\boldsymbol{\theta}})$ and the corresponding minimisation problem for finding parameters ${\boldsymbol{\theta}}^{*}$ with

L({\boldsymbol{\theta}})=-\llbracket F\rrbracket_{I_{{\boldsymbol{\theta}}}\cup\{T\mapsto\{x_{1},...,x_{n}\}\},\{\}},\quad{\boldsymbol{\theta}}^{*}=\arg\min_{{\boldsymbol{\theta}}\in\mathbb{R}^{d}}L({\boldsymbol{\theta}}).

(27)

To allow for minibatching, we interpret the domain symbol $T$ as the minibatch $\{x_{1},...,x_{n}\}$ . We can easily implement variations of this loss. For instance, we can combine multiple formulas and give each different weights. Notice that, for probabilistic and fuzzy semantics, $L({\boldsymbol{\theta}})$ is differentiable, allowing us to use common optimisers. However, not all NeSy structures can be optimised over: This loss only makes sense when a semantics returns a value in an ordered set $\mathcal{B}$ , but we also allow NeSy structures to return other kinds of values.
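As a toy illustration of this loss under the probabilistic semantics, the sketch below uses a one-parameter logistic model in place of a neural network and the formula $\forall x\in T\ (y:=\mathrm{model}(x.\mathrm{feat})\ (y=x.\mathrm{label}))$ . The model, the minibatch, the property names and the finite-difference gradient are all assumptions made for the example; a real implementation would rely on automatic differentiation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def semantics(theta, batch):
    """Probabilistic value of: forall x in T (y := model(x.feat) (y = x.label)).

    The quantifier is a product over the minibatch (Equation 16) and each
    statement contributes the probability of the correct label (Equation 19).
    """
    value = 1.0
    for feat, label in batch:
        p1 = sigmoid(theta * feat)          # p_model(1 | feat)
        value *= p1 if label == 1 else 1.0 - p1
    return value

def loss(theta, batch):
    """Negated semantics, to be minimised over theta."""
    return -semantics(theta, batch)

# Hypothetical minibatch of (feature, label) pairs.
batch = [(1.0, 1), (2.0, 1), (-1.5, 0)]

theta, lr, eps = 0.0, 0.5, 1e-5
for _ in range(200):
    # Central finite difference as a stand-in for backpropagation.
    grad = (loss(theta + eps, batch) - loss(theta - eps, batch)) / (2 * eps)
    theta -= lr * grad

print(loss(0.0, batch), loss(theta, batch))
```

In practice one often minimises the negative logarithm of the semantics instead, which turns the product over the minibatch into a sum; this is a common variation rather than something Equation 27 prescribes.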

A different pattern, more related to reasoning, is to find the input $x$ that maximises (or minimises) the neuro-symbolic system:

x^{*}=\arg\max_{x\in X}\llbracket F\rrbracket_{I_{\boldsymbol{\theta}}\cup\{T\mapsto\{x\}\},\{\}}

(28)

This strategy can be combined with adversarial learning to first find the input that most violates the background knowledge, and then correct that input [34].
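The search over inputs can be sketched for a finite candidate set. The example below is only illustrative: the classifier `p_positive`, the formula $a:=\text{classifier}(x)\ (\mathrm{true}(a)\Rightarrow x>0)$, and the choice of the Reichenbach implication $1-a+a\cdot b$ for the fuzzy semantics are all assumptions made for this sketch.

```python
import math


def p_positive(x):
    # Hypothetical stand-in for a trained network: p(class = 1 | x)
    return 1.0 / (1.0 + math.exp(-(x + 1.0)))


def truth(x):
    """Fuzzy truth of: a := classifier(x) (true(a) => x > 0).

    Uses the Reichenbach implication 1 - a + a * b (an assumption).
    """
    a = p_positive(x)
    b = 1.0 if x > 0 else 0.0
    return 1.0 - a + a * b


candidates = [-3.0, -0.5, 0.0, 2.0]     # finite input domain X
x_best = max(candidates, key=truth)     # input that best satisfies F
x_worst = min(candidates, key=truth)    # input that most violates F
print(x_best, x_worst)
```

For continuous input domains, the same objective would typically be optimised with gradient ascent or descent on $x$ instead of enumeration.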

5 Related Work

The last decade has seen the rise of neuro-symbolic frameworks that allow for specifying knowledge about the behaviour of neural networks symbolically [31]. However, unlike ULLER they are restricted to a single semantics, usually variations of probabilistic (Section 3.4) or fuzzy semantics (Section 3.5). The majority of current frameworks use the syntactic neural predicate construct as discussed in Section 3.1. DeepProbLog [30] is a probabilistic logic programming language [12] with neural predicates. Variations of its syntax are used in multiple follow-up works [13, 54, 28]. Scallop [24] chooses to restrict its language to Datalog to improve scalability, among others [29]. For ULLER, we choose to use an expressive first-order language, leaving scalable inference to the implementation of the NeSy system. Other NeSy frameworks are based on Answer Set Programming [57, 41, 3], relational languages [36, 33, 9], temporal logics [46] and description logics [55, 44, 45], while Logic Tensor Networks [4] is also based on first-order logic, among others [32, 16]. Finally, many commonly used NeSy frameworks are restricted to propositional logic [56, 2, 27, 10, 21, 18].

Logic of Differentiable Logics (LDL) [42] defines a first-order language to compare formal properties of several NeSy frameworks. Compared to ULLER, LDL is strongly typed, while ULLER is weakly typed, and LDL does not model probabilistic semantics. In LDL, uncertainty comes from predicates rather than functions, and LDL does not have a syntactic construct like ULLER's statement blocks. Pylon [1] is a Python library similar in goal to ULLER. It allows for expressing propositional logic (CNF) formulas, which can then be compiled into a Semantic Loss or fuzzy loss functions. However, by being restricted to a propositional language, Pylon is limited in expressiveness, and requires the user to manually ground out formulas.

ULLER is also heavily inspired by probabilistic programming languages [22] such as Stan [7] that specify probabilistic models in a high-level language. In particular, ULLER can be considered a first-order probabilistic programming language (FOPPL) [47] defined on boolean outputs. These boolean outputs represent the conditioning (observations) of the probabilistic model. By being first-order, the language is restricted to having a finite number of random variables. Other FOPPL languages centred on neural networks include Pyro [6] and ProbTorch [40]. These languages enforce a probabilistic semantics corresponding to that of ULLER defined in Section 3.4. However, ULLER does not enforce this semantics and also supports, for instance, fuzzy semantics. We leave an in-depth analysis of the relations between ULLER and aforementioned probabilistic programming languages for future work.

Other related work attempts to define building blocks for neuro-symbolic AI [48] or to categorise existing approaches [38]. We instead focus on a particular set of informed machine learning approaches, and develop a unifying language to allow communicating with them.

6 Conclusion

We introduced ULLER, a Unified Language for LEarning and Reasoning. ULLER is a first-order logic language designed for neuro-symbolic learning and reasoning, with a special statement syntax for constraining neural networks. We showed how to implement the common fuzzy and probabilistic semantics in ULLER, allowing for easy comparison between different NeSy systems. For future work, we want to implement ULLER as an easy-to-use Python library to increase the “frictionless reproducibility” of NeSy research. In this library, a researcher can easily write and share knowledge, and develop new NeSy benchmarks. We also believe such a library is a good avenue for reducing the barrier of entry into NeSy research.

Acknowledgements

We would like to thank Frank van Harmelen, Tarek Richard Besold, Luciano Serafini, Antonio Vergari, Pasquale Minervini, Thiviyan Thanapalasingam, Guy van den Broeck, Connor Pryor, Patrick Koopmann, and Mihaela Stoian for fruitful discussions during the writing of this paper. We also thank the anonymous reviewers of NeSy 2024 for their valuable feedback. This work was supported by the EU H2020 ICT48 project “TAILOR” under contract #952215. Emile van Krieken was funded by ELIAI (The Edinburgh Laboratory for Integrated Artificial Intelligence), EPSRC (grant no. EP/W002876/1).

References

[1] Ahmed, K., Li, T., Ton, T., Guo, Q., Chang, K., Kordjamshidi, P., Srikumar, V., den Broeck, G.V., Singh, S.: PYLON: A pytorch framework for learning with constraints. In: Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022. pp. 13152–13154. AAAI Press (2022). https://doi.org/10.1609/AAAI.V36I11.21711, https://doi.org/10.1609/aaai.v36i11.21711
[2] Ahmed, K., Teso, S., Chang, K., den Broeck, G.V., Vergari, A.: Semantic probabilistic layers for neuro-symbolic learning. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022 (2022), http://papers.nips.cc/paper_files/paper/2022/hash/c182ec594f38926b7fcb827635b9a8f4-Abstract-Conference.html
[3] Aspis, Y., Broda, K., Lobo, J., Russo, A.: Embed2Sym - Scalable Neuro-Symbolic Reasoning via Clustered Embeddings. In: Proceedings of the Nineteenth International Conference on Principles of Knowledge Representation and Reasoning. pp. 421–431. International Joint Conferences on Artificial Intelligence Organization, Haifa, Israel (Jul 2022). https://doi.org/10.24963/kr.2022/44
[4] Badreddine, S., d’Avila Garcez, A., Serafini, L., Spranger, M.: Logic Tensor Networks. Artificial Intelligence 303, 103649 (Feb 2022). https://doi.org/10.1016/j.artint.2021.103649
[5] Belle, V., Passerini, A., Van den Broeck, G.: Probabilistic inference in hybrid domains by weighted model integration. In: Proceedings of 24th International Joint Conference on Artificial Intelligence (IJCAI). vol. 2015, pp. 2770–2776. IJCAI-INT JOINT CONF ARTIF INTELL (2015)
[6] Bingham, E., Chen, J.P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P.A., Horsfall, P., Goodman, N.D.: Pyro: Deep universal probabilistic programming. Journal of Machine Learning Research 20, 28:1–28:6 (2019)
[7] Carpenter, B., Gelman, A., Hoffman, M.D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M.A., Guo, J., Li, P., Riddell, A.: Stan: A probabilistic programming language. Journal of statistical software 76 (2017)
[8] Chavira, M., Darwiche, A.: On probabilistic inference by weighted model counting. Artificial Intelligence 172(6), 772–799 (Apr 2008). https://doi.org/10.1016/j.artint.2007.11.002
[9] Cohen, W.W.: TensorLog: A Differentiable Deductive Database. arXiv:1605.06523 [cs] (Jul 2016)
[10] Daniele, A., van Krieken, E., Serafini, L., van Harmelen, F.: Refining neural network predictions using background knowledge. Mach. Learn. 112(9), 3293–3331 (2023). https://doi.org/10.1007/S10994-023-06310-3, https://doi.org/10.1007/s10994-023-06310-3
[11] Darwiche, A.: SDD: A new canonical representation of propositional knowledge bases. IJCAI International Joint Conference on Artificial Intelligence pp. 819–826 (2011). https://doi.org/10.5591/978-1-57735-516-8/IJCAI11-143
[12] De Raedt, L., Kimmig, A.: Probabilistic (logic) programming concepts. Machine Learning 100(1), 5–47 (Jul 2015). https://doi.org/10.1007/s10994-015-5494-z
[13] De Smet, L., Martires, P.Z.D., Manhaeve, R., Marra, G., Kimmig, A., De Raedt, L.: Neural Probabilistic Logic Programming in Discrete-Continuous Domains (Mar 2023). https://doi.org/10.48550/arXiv.2303.04660
[14] De Smet, L., Sansone, E., Zuidberg Dos Martires, P.: Differentiable sampling of categorical distributions using the catlog-derivative trick. Advances in Neural Information Processing Systems 36 (2024)
[15] Derkinderen, V., Manhaeve, R., Dos Martires, P.Z., De Raedt, L.: Semirings for probabilistic and neuro-symbolic logic programming. International Journal of Approximate Reasoning p. 109130 (2024)
[16] Diligenti, M., Gori, M., Sacca, C.: Semantic-based regularization for learning and inference. Artificial Intelligence 244, 143–165 (2017)
[17] Donoho, D.: Data science at the singularity. arXiv preprint arXiv:2310.00865 (2023)
[18] Fischer, M., Balunovic, M., Drachsler-Cohen, D., Gehr, T., Zhang, C., Vechev, M.: Dl2: training and querying neural networks with logic. In: International Conference on Machine Learning. pp. 1931–1941. PMLR (2019)
[19] Foerster, J., Farquhar, G., Al-Shedivat, M., Rocktäschel, T., Xing, E., Whiteson, S.: DiCE: The infinitely differentiable monte carlo estimator. In: International Conference on Machine Learning. pp. 1529–1538 (2018)
[20] Giunchiglia, E., Stoian, M.C., Lukasiewicz, T.: Deep Learning with Logical Constraints. In: Raedt, L.D. (ed.) Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022. pp. 5478–5485. ijcai.org (2022). https://doi.org/10.24963/ijcai.2022/767
[21] Giunchiglia, E., Tatomir, A., Stoian, M.C., Lukasiewicz, T.: CCN+: A neuro-symbolic framework for deep learning with requirements. International Journal of Approximate Reasoning p. 109124 (2024). https://doi.org/10.1016/j.ijar.2024.109124
[22] Gordon, A.D., Henzinger, T.A., Nori, A.V., Rajamani, S.K.: Probabilistic programming. In: Future of software engineering proceedings, pp. 167–181 (2014)
[23] Harnad, S.: The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3), 335–346 (1990)
[24] Huang, J., Li, Z., Chen, B., Samel, K., Naik, M., Song, L., Si, X.: Scallop: From Probabilistic Deductive Databases to Scalable Differentiable Reasoning. In: Advances in Neural Information Processing Systems (May 2021)
[25] Kimmig, A., Van den Broeck, G., De Raedt, L.: Algebraic model counting. Journal of Applied Logic 22, 46–62 (2017)
[26] Kool, W., van Hoof, H., Welling, M.: Buy 4 REINFORCE samples, get a baseline for free! p. 14 (2019)
[27] van Krieken, E., Thanapalasingam, T., Tomczak, J.M., van Harmelen, F., ten Teije, A.: A-nesi: A scalable approximate method for probabilistic neurosymbolic inference. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 (2023), http://papers.nips.cc/paper_files/paper/2023/hash/4d9944ab3330fe6af8efb9260aa9f307-Abstract-Conference.html
[28] Maene, J., Raedt, L.D.: Soft-Unification in Deep Probabilistic Logic. In: Thirty-Seventh Conference on Neural Information Processing Systems (Nov 2023)
[29] Magnini, M., Ciatto, G., Omicini, A.: On the Design of PSyKI: A Platform for Symbolic Knowledge Injection into Sub-symbolic Predictors. In: Calvaresi, D., Najjar, A., Winikoff, M., Främling, K. (eds.) Explainable and Transparent AI and Multi-Agent Systems, vol. 13283, pp. 90–108. Springer International Publishing, Cham (2022). https://doi.org/10.1007/978-3-031-15565-9-6
[30] Manhaeve, R., Dumancic, S., Kimmig, A., Demeester, T., De Raedt, L.: DeepProbLog: Neural probabilistic logic programming. In: Proceedings of NeurIPS (2018)
[31] Marra, G., Dumančić, S., Manhaeve, R., De Raedt, L.: From Statistical Relational to Neural Symbolic Artificial Intelligence: A Survey. arXiv:2108.11451 [cs] (Aug 2021)
[32] Marra, G., Giannini, F., Diligenti, M., Gori, M.: Lyrics: A general interface layer to integrate logic inference and deep learning. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 283–298. Springer (2019)
[33] Marra, G., Kuželka, O.: Neural markov logic networks. In: Uncertainty in Artificial Intelligence. pp. 908–917. PMLR (2021)
[34] Minervini, P., Riedel, S.: Adversarially regularising neural NLI models to integrate logical background knowledge. In: Korhonen, A., Titov, I. (eds.) Proceedings of the 22nd Conference on Computational Natural Language Learning. pp. 65–74. Association for Computational Linguistics, Brussels, Belgium (Oct 2018). https://doi.org/10.18653/v1/K18-1007, https://aclanthology.org/K18-1007
[35] Pryor, C., Dickens, C., Augustine, E., Albalak, A., Wang, W.Y., Getoor, L.: NeuPSL: Neural Probabilistic Soft Logic. In: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence. pp. 4145–4153. International Joint Conferences on Artificial Intelligence Organization, Macau, SAR China (Aug 2023). https://doi.org/10.24963/ijcai.2023/461
[36] Pryor, C., Dickens, C., Augustine, E., Albalak, A., Wang, W.Y., Getoor, L.: Neupsl: Neural probabilistic soft logic. In: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China. pp. 4145–4153. ijcai.org (2023). https://doi.org/10.24963/IJCAI.2023/461, https://doi.org/10.24963/ijcai.2023/461
[37] Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62(1–2), 107–136 (Jan 2006). https://doi.org/10.1007/s10994-006-5833-1, http://dx.doi.org/10.1007/s10994-006-5833-1
[38] Sarker, M.K., Zhou, L., Eberhart, A., Hitzler, P.: Neuro-symbolic artificial intelligence. AI Communications 34(3), 197–209 (2021). https://doi.org/10.3233/AIC-210084
[39] Schulman, J., Heess, N., Weber, T., Abbeel, P.: Gradient estimation using stochastic computation graphs. In: Advances in Neural Information Processing Systems (2015)
[40] Siddharth, N., Paige, B., van de Meent, J.W., Desmaison, A., Goodman, N.D., Kohli, P., Wood, F., Torr, P.: Learning disentangled representations with semi-supervised deep generative models. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30. pp. 5927–5937. Curran Associates, Inc. (2017)
[41] Skryagin, A., Stammer, W., Ochs, D., Dhami, D.S., Kersting, K.: SLASH: Embracing Probabilistic Circuits into Neural Answer Set Programming. arXiv:2110.03395 [cs] (Oct 2021)
[42] Slusarz, N., Komendantskaya, E., Daggitt, M.L., Stewart, R., Stark, K.: Logic of differentiable logics: Towards a uniform semantics of dl. In: Proceedings of 24th International Conference on Logic. vol. 94, pp. 473–493 (2023)
[43] Stol, M.C., Mileo, A.: Iid relaxation by logical expressivity: A research agenda for fitting logics to neurosymbolic requirements (2024)
[44] Tang, Z., Hinnerichs, T., Peng, X., Zhang, X., Hoehndorf, R.: Falcon: faithful neural semantic entailment over alc ontologies. arXiv preprint arXiv:2208.07628 (2022)
[45] Tang, Z., Pei, S., Peng, X., Zhuang, F., Zhang, X., Hoehndorf, R.: TAR: Neural Logical Reasoning across TBox and ABox (Aug 2022)
[46] Umili, E., Capobianco, R., De Giacomo, G.: Grounding ltlf specifications in image sequences. In: Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning. vol. 19, pp. 668–678 (2023)
[47] van de Meent, J.W., Paige, B., Yang, H., Wood, F.: An Introduction to Probabilistic Programming (Oct 2021). https://doi.org/10.48550/arXiv.1809.10756
[48] van Harmelen, F., ten Teije, A.: A Boxology of Design Patterns for Hybrid Learning and Reasoning Systems. Journal of Web Engineering 18(1), 97–124 (2019). https://doi.org/10.13052/jwe1540-9589.18133
[49] van Krieken, E., Acar, E., van Harmelen, F.: Analyzing differentiable fuzzy logic operators. Artificial Intelligence 302, 103602 (2022). https://doi.org/10.1016/j.artint.2021.103602
[50] van Krieken, E., Thanapalasingam, T., Tomczak, J., van Harmelen, F., Ten Teije, A.: A-NeSI: A scalable approximate method for probabilistic neurosymbolic inference. In: Oh, A., Neumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems. vol. 36, pp. 24586–24609. Curran Associates, Inc. (2023)
[51] van Krieken, E., Tomczak, J., Ten Teije, A.: Storchastic: A framework for general stochastic automatic differentiation. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems. vol. 34, pp. 7574–7587. Curran Associates, Inc. (2021)
[52] Varnai, P., Dimarogonas, D.V.: On robustness metrics for learning stl tasks. In: 2020 American Control Conference (ACC). pp. 5394–5399. IEEE (2020)
[53] von Rueden, L., Mayer, S., Beckh, K., Georgiev, B., Giesselbach, S., Heese, R., Kirsch, B., Pfrommer, J., Pick, A., Ramamurthy, R., Walczak, M., Garcke, J., Bauckhage, C., Schuecker, J.: Informed Machine Learning – A Taxonomy and Survey of Integrating Prior Knowledge into Learning Systems. IEEE Transactions on Knowledge and Data Engineering 35(1), 614–633 (Jan 2023). https://doi.org/10.1109/TKDE.2021.3079836
[54] Winters, T., Marra, G., Manhaeve, R., De Raedt, L.: Deepstochlog: Neural stochastic logic programming. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 10090–10100 (2022)
[55] Wu, X., Zhu, X., Zhao, Y., Dai, X.: Differentiable Fuzzy $\mathcal{ALC}$: A Neural-Symbolic Representation Language for Symbol Grounding (Dec 2022)
[56] Xu, J., Zhang, Z., Friedman, T., Liang, Y., den Broeck, G.V.: A semantic loss function for deep learning with symbolic knowledge. In: Dy, J.G., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. Proceedings of Machine Learning Research, vol. 80, pp. 5498–5507. PMLR (2018), http://proceedings.mlr.press/v80/xu18h.html
[57] Yang, Z., Ishay, A., Lee, J.: NeurASP: Embracing neural networks into answer set programming. In: Bessiere, C. (ed.) Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20. pp. 1755–1762. International Joint Conferences on Artificial Intelligence Organization (Jul 2020). https://doi.org/10.24963/ijcai.2020/243

Appendix A Practical Examples

Example A.1 (MNIST Addition).

Suppose we want to express the standard (single-digit) MNIST addition program using ULLER. In this setting, we have a domain $T$ that represents a training dataset $I(T)$ . In Section 4, we discuss how this training dataset can also be a minibatch of examples.

Each data point $x$ consists of a pair of images (which we access with the properties $\mathrm{im1}$ and $\mathrm{im2}$ ) associated with a label representing the value of their sum (which we access via the property $\mathrm{sum}$ ). Finally, we have a function $f$ that we interpret as a neural network classifying MNIST images. Then, to state that for every data point the sum of the digits predicted by the neural network for the two images should equal the labelled sum, we can write:

\begin{aligned}\forall x\in T\ (&n1:=f(x.\mathrm{im1}),\ n2:=f(x.\mathrm{im2})\\&(n1+n2=x.\mathrm{sum}))\end{aligned}
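To make the formula concrete, the sketch below evaluates its body for a single data point under a probabilistic reading, assuming that each statement is interpreted as an expectation over the digit distribution returned by $f$ and that the two distributions are independent. The function name `addition_truth` and the example distributions are hypothetical.

```python
def addition_truth(p1, p2, target):
    """Probability that n1 + n2 == target, with n1 ~ p1 and n2 ~ p2.

    p1 and p2 are lists of 10 probabilities, standing in for the outputs
    of f(x.im1) and f(x.im2); independence is assumed.
    """
    total = 0.0
    for n1, q1 in enumerate(p1):
        for n2, q2 in enumerate(p2):
            if n1 + n2 == target:
                total += q1 * q2
    return total


# Hypothetical classifier outputs for the two images of one data point.
p1 = [0.0, 0.0, 0.0, 0.9, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0]  # mostly "3"
p2 = [0.0, 0.0, 0.2, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0]  # mostly "4"

# Pairs (3, 4) and (5, 2) both sum to 7: 0.9 * 0.8 + 0.1 * 0.2
print(addition_truth(p1, p2, 7))
```

The truth value of the whole formula would then aggregate these per-example values over the dataset or minibatch $I(T)$.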

Example A.2 (Smokes Friends Cancer).

In this classical example of Statistical Relational Learning introduced by [37], uncertain facts in a population group are modeled using the neural predicates $\mathrm{Friends}(x,y)$ for friendship, $\mathrm{Smokes}(x)$ for smoking, and $\mathrm{Cancer}(x)$ for cancer. As ULLER relies on functions rather than predicates to model uncertainty, we must use $\mathrm{true}(a)$ to formalise the problem in our language as explained in Section 3.1. For simplicity, we use $(A\Leftrightarrow B)\equiv((A\Rightarrow B)\land(B\Rightarrow A))$ to denote logical equivalences.

Here is an example of a knowledge base for this problem. Friends of friends are friends:

\begin{aligned}\forall x\in\mathrm{People},\ &y\in\mathrm{People},\ z\in\mathrm{People}\\&(a_{1}:=\mathrm{Friends}(x,y),\ a_{2}:=\mathrm{Friends}(y,z),\ a_{3}:=\mathrm{Friends}(x,z)\\&((\mathrm{true}(a_{1})\land\mathrm{true}(a_{2}))\Rightarrow\mathrm{true}(a_{3})))\end{aligned}

If two people are friends, either both smoke or neither does:

\begin{aligned}\forall x\in\mathrm{People},\ &y\in\mathrm{People}\\&(a_{1}:=\mathrm{Friends}(x,y),\ a_{2}:=\mathrm{Smokes}(x),\ a_{3}:=\mathrm{Smokes}(y)\\&(\mathrm{true}(a_{1})\Rightarrow(\mathrm{true}(a_{2})\Leftrightarrow\mathrm{true}(a_{3}))))\end{aligned}

Friendless people smoke:

\begin{aligned}\forall x\in\mathrm{People}\ &(\neg\exists y\in\mathrm{People}\ (a_{1}:=\mathrm{Friends}(x,y)\ (\mathrm{true}(a_{1})))\\&\phantom{(}\Rightarrow a_{2}:=\mathrm{Smokes}(x)\ (\mathrm{true}(a_{2})))\end{aligned}

Smoking causes cancer:

\begin{aligned}\forall x\in\mathrm{People}\ &(a_{1}:=\mathrm{Smokes}(x),\ a_{2}:=\mathrm{Cancer}(x)\\&(\mathrm{true}(a_{1})\Rightarrow\mathrm{true}(a_{2})))\end{aligned}

Notice that, according to the definitions of Section 3.1, the probabilistic interpretation of the above formula will assume conditional independence between $a_{1}\sim{p}_{\mathrm{Smokes}}(\cdot|x)$ and $a_{2}\sim{p}_{\mathrm{Cancer}}(\cdot|x)$ . To model a dependence of cancer on smoking, i.e. $a_{2}\sim{p}_{\mathrm{Cancer}}(\cdot|x,a_{1})$ , we can make the probability explicitly depend on the previous variable:

\begin{aligned}\forall x\in\mathrm{People}\ &(a_{1}:=\mathrm{Smokes}(x),\ a_{2}:=\mathrm{Cancer}(x,a_{1})\\&(\mathrm{true}(a_{1})\Rightarrow\mathrm{true}(a_{2})))\end{aligned}
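The difference between the two variants can be checked numerically for a single person $x$. The sketch below assumes a probabilistic reading in which each statement is an expectation over a Boolean outcome. The function names and the probabilities are made up for illustration.

```python
def implies(a, b):
    return (not a) or b


def truth_dependent(p_smokes, p_cancer_given):
    """E_{a1 ~ p_Smokes} E_{a2 ~ p_Cancer(. | a1)} [a1 => a2] for one person.

    p_cancer_given maps the sampled value of a1 to P(cancer = True | a1).
    """
    total = 0.0
    for a1 in (False, True):
        w1 = p_smokes if a1 else 1.0 - p_smokes
        for a2 in (False, True):
            p_c = p_cancer_given[a1]
            w2 = p_c if a2 else 1.0 - p_c
            total += w1 * w2 * implies(a1, a2)
    return total


def truth_independent(p_smokes, p_cancer):
    # Independence is the special case where a1 does not change p_Cancer.
    return truth_dependent(p_smokes, {False: p_cancer, True: p_cancer})


# Hypothetical numbers: 1 - 0.4 * 0.75 and 1 - 0.4 * 0.5 respectively.
print(truth_independent(0.4, 0.25))
print(truth_dependent(0.4, {False: 0.0625, True: 0.5}))
```

In both cases only the branch where $a_{1}$ is true and $a_{2}$ is false lowers the truth value, so the dependent variant changes the result only through $p_{\mathrm{Cancer}}(\cdot|x,a_{1}=\text{true})$.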

Next, we have labelled examples for each relationship. For example, for $\mathrm{Friends}()$ , drawing examples from a dataset $T_{\mathrm{Friends}}$ :

\forall t\in T_{\mathrm{Friends}}\ (l:=\mathrm{Friends}(t.\mathrm{x},t.\mathrm{y})\ (l=t.\mathrm{label}))

Appendix B Gradient estimation

The sampling semantics in Equation 26 is a simple way to estimate the truth value of a formula. However, since sampling is not a differentiable operation, it is not possible to use this semantics to train the neural networks. Instead, we can use the score function gradient estimation method [39] to estimate the gradient of the truth value of a formula with respect to the parameters of the neural networks. However, this requires adapting the evaluation of the formula to incorporate score function terms.

One way to implement gradient estimation methods for simple ULLER programs is to use the DiCE estimator [19], which introduces the MagicBox operator ${\drawdie{2}}(x)=\exp(x-\bot(x))$ , where $\bot$ is the StopGradient operator used in deep learning frameworks. This operator allows us to add a term that only appears when we differentiate it, and equals 1 during the forward pass. To incorporate DiCE into ULLER, we modify Equation 15 as follows:

\llbracket x:=f(T_{1},\dots,T_{n})(F)\ \rrbracket^{\text{S}}=\llbracket F\rrbracket^{\text{C}}_{I,{\mathcal{A}}(\eta,S)}\cdot\prod_{i=1}^{n}{\drawdie{2}}(\log{p}_{f_{i}}({\mathcal{A}}(\eta,S)[x_{i}]))

(29)

Extensions of the DiCE estimator can be used to implement a wide variety of gradient estimation methods [51].
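The effect of a single MagicBox factor can be worked out directly from its definition. The derivation below is a sketch for one sampled value $a\sim p_{\boldsymbol{\theta}}$ and a sampled truth value $v$ that does not itself depend on ${\boldsymbol{\theta}}$. Both symbols are introduced here only for illustration.

```latex
% Forward pass: the stop-gradient copy cancels the argument.
{\drawdie{2}}(x)=\exp(x-\bot(x))=\exp(0)=1

% Backward pass: \bot(x) is treated as a constant, so by the chain rule
\nabla_{\boldsymbol{\theta}}\,{\drawdie{2}}(\log p_{\boldsymbol{\theta}}(a))
  ={\drawdie{2}}(\log p_{\boldsymbol{\theta}}(a))\,\nabla_{\boldsymbol{\theta}}\log p_{\boldsymbol{\theta}}(a)
  =\nabla_{\boldsymbol{\theta}}\log p_{\boldsymbol{\theta}}(a)

% Hence multiplying a sampled value v by the MagicBox factor gives
\nabla_{\boldsymbol{\theta}}\bigl(v\cdot{\drawdie{2}}(\log p_{\boldsymbol{\theta}}(a))\bigr)
  =v\,\nabla_{\boldsymbol{\theta}}\log p_{\boldsymbol{\theta}}(a)
```

The last line is the score function term for one sample, so the forward value of the formula is unchanged while differentiation yields the score function estimate.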

Appendix C Classical in the limit

C.1 Probabilistic semantics

The probabilistic semantics is classical in the limit. To show this, we note that in the limit all truth values lie in $\{0,1\}$ instead of the probabilities $[0,1]$ . On this domain, the product is equal to the $\min$ function. We can use this to rewrite the interpretation of every construct except statements into the classical semantics.

Next, take for a statement $x:=f(T_{1},...,T_{n})(F)$ the induction assumption that $\llbracket F\rrbracket^{\text{P}}_{\hat{I},\eta}=\llbracket F\rrbracket^{\text% {C}}_{I,\eta}$ , where $\hat{I}$ is defined as in Definition 3.4. Then the interpretation of a statement is:

\mathbb{E}_{a\sim{p}_{\hat{f}}(\cdot|T_{1},...,T_{n})}[\llbracket F\rrbracket^{\text{P}}_{\hat{I},\eta[x\mapsto a]}]=\llbracket F\rrbracket^{\text{P}}_{\hat{I},\eta[x\mapsto\arg\max_{a\in\Omega_{n+1}}{p}_{f}(a|T_{1},...,T_{n})]}

Here, we reduce the expectation by noting that since $p_{\hat{f}}(a|T_{1},...,T_{n})=\delta(a-\arg\max_{a^{\prime}}{p}_{f}(a^{\prime}|T_{1},...,T_{n}))$ , exactly one element gets probability 1 (or a single element has non-zero probability, in the case of continuous distributions). This single element is chosen on the right-hand side. Then, we use the induction assumption to find that this is equal to the classical semantics of statements given in Equation 15.

C.2 Fuzzy semantics

Using the axioms of t-norms, we find that the fuzzy semantics is also classical in the limit. This again can be proven by induction. For Equations 21 and 23, we use the boundary condition of t-norms, which states that $x\operatorname{\otimes}1=x$ for $x\in[0,1]$ . Therefore, if $x=0$ , $0\operatorname{\otimes}1=0$ and if $x=1$ , $1\operatorname{\otimes}1=1$ , meaning t-norms act as the $\min$ operator under the domain $\{0,1\}$ .
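This claim can be checked exhaustively for specific t-norms. The sketch below does so for three standard choices (product, Gödel, and Łukasiewicz). The dictionary `T_NORMS` and the helper name are introduced only for this illustration.

```python
from itertools import product

# Three standard t-norms on [0, 1].
T_NORMS = {
    "product": lambda x, y: x * y,
    "goedel": min,
    "lukasiewicz": lambda x, y: max(0, x + y - 1),
}


def agrees_with_min_on_booleans(t_norm):
    # Compare the t-norm with min on all four pairs from {0, 1} x {0, 1}.
    return all(t_norm(x, y) == min(x, y) for x, y in product((0, 1), repeat=2))


all_agree = all(agrees_with_min_on_booleans(t) for t in T_NORMS.values())
print(all_agree)

# On interior values the t-norms differ, e.g. the product t-norm at (0.5, 0.5):
print(T_NORMS["product"](0.5, 0.5), min(0.5, 0.5))
```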

Next, consider the program fragment $x:=f(T_{1},...,T_{n})\ (F)$ and take the induction assumption $\llbracket F\rrbracket^{\text{F}}_{\hat{I},\eta}=\llbracket F\rrbracket^{\text% {C}}_{I,\eta}$ . First, assume the domain $\Omega_{n+1}=\{0,1\}$ and assume $\arg\max_{a\in\{0,1\}}{p}_{f}(a|T_{1},...,T_{n})=1$ . Then, the interpretation of the statement is $\llbracket F\rrbracket^{\text{F}}_{\hat{I},\eta[x\mapsto{p}_{\hat{f}}(a=1|T_{1% },...,T_{n})]}=\llbracket F\rrbracket^{\text{F}}_{\hat{I},\eta[x\mapsto 1]}$ , since the Dirac distribution will put all its mass on the output $1$ . Similarly, if $\arg\max_{a\in\{0,1\}}{p}_{f}(a|T_{1},...,T_{n})=0$ , then the interpretation is $\llbracket F\rrbracket^{\text{F}}_{\hat{I},\eta[x\mapsto 0]}$ . Then we can simply use the induction assumption.

Finally, if $\Omega_{n+1}\neq\{0,1\}$ , then we know there is a unique output $a\in\Omega_{n+1}$ such that $p_{\hat{f}}(a|T_{1},...,T_{n})=1$ , while for the other outputs $a^{\prime}\neq a$ we have $p_{\hat{f}}(a^{\prime}|T_{1},...,T_{n})=0$ . Then, using associativity and commutativity of the t-conorm $\oplus$ , the interpretation of the statement is

\begin{aligned}&\llbracket F\rrbracket^{\text{F}}_{\hat{I},\eta[x\mapsto a]}\operatorname{\otimes}{p}_{\hat{f}}(a|T_{1},...,T_{n})\oplus\bigoplus_{a^{\prime}\in\Omega_{n+1}\setminus\{a\}}\llbracket F\rrbracket^{\text{F}}_{\hat{I},\eta[x\mapsto a^{\prime}]}\operatorname{\otimes}{p}_{\hat{f}}(a^{\prime}|T_{1},...,T_{n})\\={}&\llbracket F\rrbracket^{\text{F}}_{\hat{I},\eta[x\mapsto a]}\operatorname{\otimes}1\oplus\bigoplus_{a^{\prime}\in\Omega_{n+1}\setminus\{a\}}\llbracket F\rrbracket^{\text{F}}_{\hat{I},\eta[x\mapsto a^{\prime}]}\operatorname{\otimes}0\\={}&\llbracket F\rrbracket^{\text{F}}_{\hat{I},\eta[x\mapsto a]}\oplus\bigoplus_{a^{\prime}\in\Omega_{n+1}\setminus\{a\}}0=\llbracket F\rrbracket^{\text{F}}_{\hat{I},\eta[x\mapsto a]}\end{aligned}

where we again use the boundary conditions of the t-norm $\operatorname{\otimes}$ ( $1\operatorname{\otimes}x=x$ ) and t-conorm ( $0\oplus x=x$ ).

Appendix D Relation of Probabilistic semantics to the Semantic Loss

Here, we show why the probabilistic semantics is equivalent to the weighted model counting semantics used in, for instance, the Semantic Loss. Let $F$ be a closed formula without any statements $x:=f(T_{1},\dots,T_{n})(F^{\prime})$ that only involves variables $x_{1},...,x_{n}$ over finite domains. The weighted model count (WMC) is the evaluation of the classical semantics weighted by probabilities of the assignments to variables. These probabilities are often assumed to be independent, although our framework also allows for the probabilities to depend on previous variables. This is illustrated in Example A.2. The definition of the WMC is

\begin{align}\mathsf{WMC}&=\sum_{a_{1}\in\Omega_{1}}...\sum_{a_{n}\in\Omega_{n}}\prod_{i=1}^{n}{p}_{f_{i}}(a_{i})\llbracket F\rrbracket^{\text{C}}_{I,\{x_{1}\mapsto a_{1},...,x_{n}\mapsto a_{n}\}}\notag\\&=\sum_{a_{1}\in\Omega_{1}}{p}_{f_{1}}(a_{1})...\sum_{a_{n}\in\Omega_{n}}{p}_{f_{n}}(a_{n})\llbracket F\rrbracket^{\text{C}}_{I,\{x_{1}\mapsto a_{1},...,x_{n}\mapsto a_{n}\}}.\tag{30}\end{align}

Next, we rewrite this into a program $x_{1}:=f_{1}(),...,x_{n}:=f_{n}()\ (F)$ such that the probabilistic semantics in Definition 3.5 is equal to the weighted model count. For ease of notation, let us denote by $S_{i}$ each statement $x_{i}:=f_{i}()$ for $i=1,...,n$ . Then, we find the probabilistic semantics of the program by sequentially expanding the interpretation of the statements:

\begin{align}\llbracket S_{1},...,S_{n}(F)\rrbracket^{\text{P}}_{I,\{\}}&=\sum_{a_{1}\in\Omega_{1}}{p}_{f_{1}}(a_{1})\cdot\llbracket S_{2},...,S_{n}(F)\rrbracket^{\text{P}}_{I,\{x_{1}\mapsto a_{1}\}}\tag{31}\\&\ \ \vdots\notag\\&=\sum_{a_{1}\in\Omega_{1}}{p}_{f_{1}}(a_{1})...\sum_{a_{n}\in\Omega_{n}}{p}_{f_{n}}(a_{n})\llbracket F\rrbracket^{\text{C}}_{I,\{x_{1}\mapsto a_{1},...,x_{n}\mapsto a_{n}\}}\notag\\&=\mathsf{WMC}\notag\end{align}

where in the last step we use that, since the domains are finite and $F$ does not contain statements, the probabilistic semantics of $F$ is equal to the classical one.
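The equality can also be checked numerically on a small instance. The sketch below assumes independent variables, each given as a dictionary from values to probabilities, and a Boolean formula given as a Python function. It compares brute-force enumeration of Equation 30 with the sequential expansion of statements. All names are hypothetical.

```python
from itertools import product


def wmc(dists, formula):
    """Weighted model count: enumerate all assignments and weight the
    classical truth value by the product of the variables' probabilities."""
    total = 0.0
    for assignment in product(*[d.keys() for d in dists]):
        weight = 1.0
        for d, a in zip(dists, assignment):
            weight *= d[a]
        total += weight * formula(*assignment)
    return total


def prob_semantics(dists, formula, bound=()):
    """Expand the statements x_1 := f_1(), ..., x_n := f_n() one at a time,
    taking an expectation over each; the innermost formula is classical."""
    if len(bound) == len(dists):
        return float(formula(*bound))
    d = dists[len(bound)]
    return sum(p * prob_semantics(dists, formula, bound + (a,)) for a, p in d.items())


# Two independent Boolean variables and the formula x1 or x2.
dists = [{0: 0.3, 1: 0.7}, {0: 0.6, 1: 0.4}]
formula = lambda x1, x2: bool(x1 or x2)

print(wmc(dists, formula), prob_semantics(dists, formula))
```

Both calls compute $1-0.3\cdot 0.6$ for this instance. Making the distribution of a later variable depend on `bound` would give the dependent case mentioned above.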

Appendix E Relation of Fuzzy Semantics to Differentiable Fuzzy Logics

Fuzzy logics are actively used in NeSy [4, 49, 10, 21]. We show how existing NeSy systems using fuzzy logics arise from the fuzzy semantics of ULLER. Existing fuzzy logic systems align with our interpretations of terms and logical operators, but differ in their use of fuzzy predicates, which are interpreted as functions to $[0,1]$ , that is, $I_{\mathrm{NeSy}}(P):\Omega_{1}\times\dots\times\Omega_{n}\rightarrow[0,1]$ . Then, the truth value of a formula is computed by evaluating the formula with the fuzzy semantics.

We can emulate this in our fuzzy semantics with the $\mathrm{true}()$ predicate, and show the correspondence with a proof by induction. For each neural predicate $I_{\mathrm{NeSy}}(P_{i}):\Omega^{i}_{1}\times\dots\times\Omega^{i}_{n_{i}}\rightarrow[0,1]$ , we define a ULLER function $I(f_{i}):\Omega^{i}_{1}\times\dots\times\Omega^{i}_{n_{i}}\rightarrow(\{0,1\}\rightarrow[0,1])$ such that:

I_{\mathrm{NeSy}}(P_{i})(T^{i}_{1},\dots,T^{i}_{n_{i}})=I(f_{i})(T^{i}_{1},\dots,T^{i}_{n_{i}})(1)

(32)

Let $F$ be a first-order logic formula with no statements nor functions, and $\llbracket F\rrbracket^{\text{NeSy}}$ be its interpretation in a fuzzy NeSy system. Let $F$ contain $k$ neural atoms $P_{i}(T^{i}_{1},\dots,T^{i}_{n_{i}})$ , $i=1\dots k$ . Let $S_{1},\dots,S_{k}\ (F^{\prime})$ be a ULLER program with $k$ statements where $S_{i}$ defines $x_{i}:=f_{i}(T^{i}_{1},\dots,T^{i}_{n_{i}})$ , $i=1,\dots,k$ , and $F^{\prime}$ is $F$ where we replace every mention of $P_{i}(T^{i}_{1},\dots,T^{i}_{n_{i}})$ by $\mathrm{true}(x_{i})$ . We have:

\begin{align}\llbracket S_{1},\dots,S_{k}\ (F^{\prime})\rrbracket^{\text{F}}&=\llbracket F^{\prime}\rrbracket^{\text{F}}_{I,\eta[x_{1}\mapsto p_{f_{1}}(1|T^{1}_{1},\dots,T^{1}_{n_{1}}),\dots,x_{k}\mapsto p_{f_{k}}(1|T^{k}_{1},\dots,T^{k}_{n_{k}})]}\tag{33}\\&=\llbracket F^{\prime}\rrbracket^{\text{F}}_{I,\eta[x_{1}\mapsto I_{\mathrm{NeSy}}(P_{1})(T^{1}_{1},\dots,T^{1}_{n_{1}}),\dots,x_{k}\mapsto I_{\mathrm{NeSy}}(P_{k})(T^{k}_{1},\dots,T^{k}_{n_{k}})]}\tag{34}\\&=\llbracket F\rrbracket^{\text{NeSy}}\tag{35}\end{align}

Equality (34) stems from definition (32). We derive equality (35) by induction. First, note that according to the definition of $I(\mathrm{true})$ and the assignment in (34), we have:

\llbracket\mathrm{true}(x_{i})\rrbracket^{\text{F}}=I_{\mathrm{NeSy}}(P_{i})(T^{i}_{1},\dots,T^{i}_{n_{i}})=\llbracket P_{i}(T^{i}_{1},\dots,T^{i}_{n_{i}})\rrbracket^{\text{NeSy}}\quad\text{for }i=1,\dots,k

(36)

If our semantics use the same t-norm operator $\operatorname{\otimes}$ as the NeSy system, then:

\llbracket F_{1}^{\prime}\land F_{2}^{\prime}\rrbracket^{\text{F}}=\llbracket F_{1}^{\prime}\rrbracket^{\text{F}}\operatorname{\otimes}\llbracket F_{2}^{\prime}\rrbracket^{\text{F}}=\llbracket F_{1}\rrbracket^{\text{NeSy}}\operatorname{\otimes}\llbracket F_{2}\rrbracket^{\text{NeSy}}=\llbracket F_{1}\land F_{2}\rrbracket^{\text{NeSy}}

(37)

where in the second equality we use the induction hypothesis $\llbracket F_{1}^{\prime}\rrbracket^{\text{F}}=\llbracket F_{1}\rrbracket^{\text{NeSy}}$ and $\llbracket F_{2}^{\prime}\rrbracket^{\text{F}}=\llbracket F_{2}\rrbracket^{\text{NeSy}}$ . The same can be derived for the other logical connectives. It follows that we can emulate any formula $F$ built with the neural predicates $P_{i}(T^{i}_{1},\dots,T^{i}_{n_{i}})$ , by building a formula $F^{\prime}$ with the equivalently interpreted $\mathrm{true}(x_{i})$ (see Equation (34)) and the same logical constructs, such that $\llbracket F\rrbracket^{\text{NeSy}}=\llbracket S_{1},\dots,S_{k}\ (F^{\prime})\rrbracket^{\text{F}}$ .
