PDFA Distillation via String Probability Queries

Robert Baumgartner
Sicco Verwer
\addrDepartment for Electrical Engineering, Mathematics, and Computer Science
Algorithmics Group
Technical University of Delft, Netherlands
Abstract

Probabilistic deterministic finite automata (PDFA) are discrete event systems modeling conditional probabilities over languages: Given an already seen sequence of tokens they return the probability of tokens of interest to appear next. These types of models have gained interest in the domain of explainable machine learning, where they are used as surrogate models for neural networks trained as language models. In this work we present an algorithm to distill PDFA from neural networks. Our algorithm is a derivative of the L# algorithm and capable of learning PDFA from a new type of query, in which the algorithm infers conditional probabilities from the probability of the queried string to occur. We show its effectiveness on a recent public dataset by distilling PDFA from a set of trained neural networks.

Keywords: Active Automata Learning, PDFA Distillation, Probability Inference

1 Introduction

Neural networks (NNs) are a powerful means for sequence modeling, where they have attracted strong interest from both commercial companies and the research community. A commonly recognised drawback of their expressive power is that their decision processes are inherently difficult for humans to understand (see e.g. Guidotti et al. (2018)). Attempts to explain neural networks that are trained for sequence modeling, i.e. that return the conditional probability $P(a|x)$ over a corpus with $a \in \Sigma$ and $x \in \Sigma^*$, have led to several works on distilling probabilistic finite automata (PFA) and probabilistic deterministic finite automata (PDFA) as surrogate models from such NNs (Weiss et al., 2019; Okudono et al., 2019; Eyraud and Ayache, 2021; Mayr et al., 2022, 2023). Advantages of PFA and PDFA are that, once distilled, they provide a computationally cheap model compared with the actual neural networks. Moreover, they are naturally visualizable, can be interpreted by an individual with expert knowledge, and can in theory be learned under PAC constraints (Clark and Thollard, 2004).

Algorithms to distill PFAs and PDFAs from sequence-modeling NNs all work by querying the NN for conditional probabilities $P(a|x)$. Three kinds of approaches have been taken so far. The first group (Weiss et al., 2019) adapts the L* algorithm by constructing a table from the responses of the network. The main difference to the L* algorithm's observation table (Angluin, 1987) is that this table contains real-valued numbers rather than binary membership answers. A notion of similarity over rows is then defined, and the automaton is constructed similarly to L*. The second group constructs a similar kind of table, but uses spectral learning (Balle et al., 2014) to extract an automaton from it (Okudono et al., 2019; Eyraud and Ayache, 2021). A drawback of this group is that the resulting models are non-deterministic, making them hard to interpret. Recently a third group emerged, which builds an observation tree from the queries rather than a table and subsequently minimizes this tree (Mayr et al., 2022, 2023).

We build on the idea of learning on the observation tree and add another type of query. Instead of asking for the next-symbol probabilities $P(a|x)$, we ask for the full string probability $P(xa)$ directly. We employ an observation tree that stores the answers, use it to infer the conditional probabilities $P(a|x)$, and use the inferred probabilities to minimize the tree. During minimization, a merge heuristic enforces an error bound $\mu > 0$ so that the predicted probabilities for each already seen sequence $x$ stay within that bound. We evaluate our algorithm on the TAYSIR competition dataset (Eyraud et al., 2023). All code is made available through our own public repository (https://github.com/tudelft-cda-lab/FlexFringe). A mathematical motivation for how we infer probabilities is provided in appendix A.4.

2 PDFAs

A PDFA is typically described by a 5-tuple $\mathcal{A} = \{q_0, Q, \Sigma, \tau, \pi\}$, where $q_0$ represents the unique starting state and $Q$ denotes the set of all states $q$. $\Sigma$ is a finite set of tokens, and an individual token is written as $a$. $\Sigma^*$ is the set of all possible strings over $\Sigma$ with finite length, $x$ denotes an arbitrary string in $\Sigma^*$, and we write $|x|$ to denote the length of string $x$. Concatenations are simply written in the form $a_0 a_1 \ldots$ for strings of tokens, or $ax$ and $xa$ for a symbol $a$ preceding and following a string $x$ respectively. $\lambda$ denotes the empty string with length $|\lambda| = 0$, and $\lambda x = x = x \lambda$.

Traversing the automaton is done via $\tau: Q \times \Sigma \rightarrow Q$. $\tau$ can be extended to strings recursively via $\tau(q, \lambda) = q$ and $\tau(q, ax) = \tau(\tau(q, a), x)$. We say that a state $q'$ is reachable from state $q$ if there exists at least one string $x$ s.t. $q' = \tau(q, x)$. We write $Q_{\tau(q)}$ to denote the set of all states reachable from $q$, and we call $X_{\tau(q)}$ the set of shortest strings to reach them. A shortest string $x$ in $X_{\tau(q)}$ is a string s.t. $q' = \tau(q, x)$ and there is no string $x'$ s.t. $q' = \tau(q, x')$ and $|x'| < |x|$.
We call a PDFA an observation tree $\mathcal{OT}$ iff each state $q$ has exactly one sequence $x_q^A$ s.t. $q = \tau(q_0, x_q^A)$. We call $x_q^A$ the access sequence of $q$. We note that on an observation tree $\mathcal{OT}$, $X_{\tau(q)}$ is well defined for each state of the tree.

Finally, $\pi: Q \times \Sigma \rightarrow [0, 1]$ and $\pi: Q \rightarrow [0, 1]$ assign probabilities to transitions within the automaton. We call $\pi(q)$ the stopping probability of state $q$, denoting the event that a string reaches $q$ with its last possible transition. A PDFA requires $\forall q \in Q: \sum_{a \in \Sigma} \pi(q, a) + \pi(q) = 1$. The probability of a string $x = a_0 a_1 a_2 \ldots a_n$ can then be computed via $\pi(x) = \pi(q_0, a_0) \cdot \pi(\tau(q_0, a_0), a_1) \cdot \ldots \cdot \pi(\tau(q_{n-1}, a_{n-1}), a_n) \cdot \pi(q_{n+1})$, where $q_{i+1} = \tau(q_i, a_i)$.
Note here that we introduced shorthand notation for the string probability as $\pi(x)$. Fig. 1 depicts a simple example of a PDFA.

[State diagram. Node labels (state/stopping probability): $q_0/0.1$, $q_1/0.3$, $q_2/0.1$. Edge labels (symbol/probability): $a/0.3$, $b/0.6$, $a/0.2$, $a/0.2, b/0.7$, $b/0.5$.]
Figure 1: An example of a PDFA consisting of three states. $q_0$ is the initial state, and the stopping probabilities are $\pi(q_0) = 0.1$, $\pi(q_1) = 0.3$, and $\pi(q_2) = 0.1$.
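
To make the string probability $\pi(x)$ concrete, the following minimal Python sketch evaluates it on a three-state PDFA that reuses the probabilities of Fig. 1. The transition targets in `TAU` are an assumption made for illustration (they cannot be recovered from the text), as are the names `TAU`, `PI_TRANS`, `PI_STOP`, and `string_probability`.

```python
from typing import Dict, Tuple

# tau(q, a): transition targets. These targets are ASSUMED for illustration.
TAU: Dict[Tuple[str, str], str] = {
    ("q0", "a"): "q1", ("q0", "b"): "q2",
    ("q1", "a"): "q0", ("q1", "b"): "q2",
    ("q2", "a"): "q2", ("q2", "b"): "q2",
}
# pi(q, a): transition probabilities, taken from the edge labels of Fig. 1.
PI_TRANS: Dict[Tuple[str, str], float] = {
    ("q0", "a"): 0.3, ("q0", "b"): 0.6,
    ("q1", "a"): 0.2, ("q1", "b"): 0.5,
    ("q2", "a"): 0.2, ("q2", "b"): 0.7,
}
# pi(q): stopping probabilities, taken from the caption of Fig. 1.
PI_STOP: Dict[str, float] = {"q0": 0.1, "q1": 0.3, "q2": 0.1}


def string_probability(x: str, start: str = "q0") -> float:
    """Multiply the transition probabilities along x, then the stopping
    probability of the state in which x ends."""
    q, p = start, 1.0
    for a in x:
        p *= PI_TRANS[(q, a)]
        q = TAU[(q, a)]
    return p * PI_STOP[q]


if __name__ == "__main__":
    for x in ["", "a", "ab"]:
        print(repr(x), string_probability(x))
```

Each state's stopping probability and outgoing transition probabilities sum to one, as the PDFA definition requires.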

3 Learning algorithm

In adaptation of the MAT framework (Angluin, 1987), our algorithm distinguishes two abstract entities: the system under learning (SUL), abstracted away by the teacher, and the learner, whose goal is to find a model representative of the SUL. We assume the teacher provides answers of the form $P(x)$: given a string $x$, it returns the probability for $x$ to occur. Furthermore, the learner can ask the teacher for equivalence: given a hypothesis $\mathcal{H}$, does it model the behavior of the SUL $\mathcal{T}$ sufficiently closely?

The core idea of our learner is to build an observation tree that stores the observed probabilities. The algorithm propagates new string probabilities throughout the relevant branches of the observation tree to compute the already seen probability mass in each node, and periodically updates $\pi$ with the new estimates. We first explain the observation tree and how the algorithm constructs it, and then explain how it turns it into a hypothesis.

3.1 Growing the observation tree

In order for the learner to find a surrogate model for the SUL, it generates one or more hypotheses and tests each of them. Similar to the L# algorithm (Vaandrager et al., 2022), it first builds an observation tree $\mathcal{OT}$, representing the set of already asked input strings as well as the answers the teacher provided. Initially the observation tree consists of a single node $q_0$ with $x_{q_0}^A = \lambda$. The tree is then grown from $q_0$ by asking queries and creating the respective new nodes, which we encapsulate in the $ExtendFringe$ operation (Alg. 1). Upon creating a node within the observation tree, the learner assigns it four attributes:

  1. A field for modeling the probabilities $\pi(q, a)$.

  2. A field for modeling the probability $\pi(q)$.

  3. An attribute for saving the probability $P(x_q^A)$, which we call the access probability of $q$ and write $P_A(q)$.

  4. A weight attribute used to estimate probabilities for each $a \in \Sigma$, denoted by $m(q, a)$.
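
As a rough illustration, the four attributes could be held in a small Python record such as the one below. This is only a sketch: the class name `OTNode` and all field names are hypothetical, and the `access_string` and `children` fields are added here for convenience and are not among the four attributes listed above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class OTNode:
    """One node q of the observation tree (hypothetical layout)."""
    access_string: str = ""                                       # x_q^A (convenience field)
    trans_prob: Dict[str, float] = field(default_factory=dict)    # 1. pi(q, a)
    stop_prob: float = 0.0                                        # 2. pi(q)
    access_prob: float = 0.0                                      # 3. P_A(q) = P(x_q^A)
    mass: Dict[str, float] = field(default_factory=dict)          # 4. m(q, a)
    children: Dict[str, "OTNode"] = field(default_factory=dict)   # tau(q, a) (convenience field)


root = OTNode()  # the initial tree: a single node with access string lambda
```

Using `default_factory` gives every node its own dictionaries, so updating the weights of one node never leaks into another.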

Initialization of a node is done according to Alg. 2, which initializes its attributes. Additionally, each newly created and initialized node provides better estimates for the probabilities of the nodes that are part of $T(x_q^A)$. Therefore, each time a new node is created and initialized, these nodes are updated through the subroutine depicted in Alg. 3. At certain points the algorithm needs to update $\pi(q)$ and $\pi(q, a)$, which it does via Alg. 5. Our goal is to keep the probabilities $\pi(x_q^A)$ in the observation tree precise. A depth-first-search (DFS) routine (Alg. 4) ensures this condition. We provide mathematical motivation for our operations in appendix A.4.

Algorithm 1 Extend fringe
Input: set of fringe nodes $\mathcal{F}$, alphabet $\Sigma$
Output: set of new fringe nodes $\mathcal{F}_n$
$\mathcal{F}_n \leftarrow$ empty set
for all nodes $q$ in $\mathcal{F}$ do
     for all $a \in \Sigma$ do
          Create node $q'$ satisfying $q' = \tau(q, a)$
          $InitializeNode(q')$ $\triangleright$ Algorithm 2
          $\mathcal{F}_n \leftarrow \mathcal{F}_n \cup \{q'\}$
     end for
end for
return $\mathcal{F}_n$
Algorithm 2 Initialize node
Input: node $q$, alphabet $\Sigma$
$x_q^A \leftarrow q.getAccessString()$ $\triangleright$ Can be stored or computed
$\pi(q) \leftarrow P(x_q^A)$
$P_A(q) \leftarrow P(x_q^A)$
for all $a \in \Sigma$ do
     $m(q, a) \leftarrow P(x_q^A a)$
     $\pi(q, a) \leftarrow P(x_q^A a)$
end for
$UpdatePath(q, x_q^A, P(x_q^A))$ $\triangleright$ Algorithm 3
Algorithm 3 Update path
Input: node $q$, string $x_q^A$, probability $P(x_q^A)$
$\sigma \leftarrow \lambda$
for $i \leftarrow 0$ up to $|x_q^A| - 1$ do
     $a \leftarrow a_i$ $\triangleright$ $x_q^A = a_0 a_1 a_2 \ldots$
     $q' \leftarrow \tau(q_0, \sigma)$
     $m(q', a) \leftarrow m(q', a) + P(x_q^A)$
     $\sigma \leftarrow \sigma a$
end for
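
Algorithms 1 to 3 can be transcribed into Python roughly as follows. This is a sketch under stated assumptions, not the FlexFringe implementation: tokens are single characters, the teacher is a plain callable `P` returning a string probability, and the names `Node`, `update_path`, `initialize_node`, `extend_fringe`, and `toy_P` are hypothetical. The toy teacher is a one-state PDFA with stopping probability 0.2 and transition probabilities 0.5 for `a` and 0.3 for `b`.

```python
from typing import Callable, Dict, List


class Node:
    def __init__(self, access: str) -> None:
        self.access = access                   # access sequence x_q^A
        self.pi_stop = 0.0                     # pi(q)
        self.pi: Dict[str, float] = {}         # pi(q, a)
        self.p_access = 0.0                    # P_A(q)
        self.m: Dict[str, float] = {}          # m(q, a)
        self.children: Dict[str, "Node"] = {}  # tau(q, a)


def update_path(root: Node, x: str, p_x: float) -> None:
    """Alg. 3: add P(x) to m(q', a_i) for every prefix node q' on the path of x."""
    q = root
    for a in x:
        q.m[a] += p_x
        q = q.children[a]


def initialize_node(q: Node, root: Node, alphabet: List[str],
                    P: Callable[[str], float]) -> None:
    """Alg. 2: fill the attributes of q from string probability queries."""
    p = P(q.access)
    q.pi_stop = p
    q.p_access = p
    for a in alphabet:
        q.m[a] = P(q.access + a)
        q.pi[a] = P(q.access + a)
    update_path(root, q.access, p)


def extend_fringe(fringe: List[Node], root: Node, alphabet: List[str],
                  P: Callable[[str], float]) -> List[Node]:
    """Alg. 1: create and initialize one child per fringe node and symbol."""
    new_fringe: List[Node] = []
    for q in fringe:
        for a in alphabet:
            child = Node(q.access + a)
            q.children[a] = child  # link first so that update_path can reach it
            initialize_node(child, root, alphabet, P)
            new_fringe.append(child)
    return new_fringe


def toy_P(x: str) -> float:
    """Toy teacher (assumption): a one-state PDFA."""
    return 0.2 * 0.5 ** x.count("a") * 0.3 ** x.count("b")


root = Node("")
initialize_node(root, root, ["a", "b"], toy_P)
fringe = extend_fringe([root], root, ["a", "b"], toy_P)
```

The root is initialized before the first call to `extend_fringe`, so every weight $m(q', a)$ touched by `update_path` already exists.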
Algorithm 4 DFS update
Input: root node $q_0$, alphabet $\Sigma$
$S_Q \leftarrow$ empty stack
$S_Q.push(q_0)$
$S_P \leftarrow$ empty stack
$S_P.push(1)$
while $S_Q$ not empty do
     $q \leftarrow S_Q.pop()$
     $p \leftarrow S_P.pop()$
     $\pi(q) \leftarrow \frac{P_A(q)}{p}$ $\triangleright$ Guarantees that $\pi(x_q^A)$ is correct for all $q$ in the tree
     $NormalizeNode(q)$ $\triangleright$ Algorithm 5
     for all $a \in \Sigma$ do
          if $\tau(q, a)$ defined then
               $S_Q.push(\tau(q, a))$
               $S_P.push(p \cdot \pi(q, a))$
          end if
     end for
end while
Algorithm 5 Normalize node
Input: node $q$, alphabet $\Sigma$
$s \leftarrow 0$
for all $a \in \Sigma$ do
     $s \leftarrow s + m(q, a)$
end for
$f \leftarrow \frac{1 - \pi(q)}{s}$
for all $a \in \Sigma$ do
     $\pi(q, a) \leftarrow f \cdot m(q, a)$
end for
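
The interplay of Algs. 4 and 5 can be sketched in Python as below. The sketch assumes single-character tokens and strictly positive values of $p$ and $s$ (the pseudocode contains no guard against division by zero, and neither does this sketch). It uses one stack of (node, probability) pairs in place of the two parallel stacks $S_Q$ and $S_P$. The names `Node`, `normalize_node`, and `dfs_update` are hypothetical, and the weights and access probabilities in the example are hand-picked.

```python
from typing import Dict, List, Tuple


class Node:
    def __init__(self, p_access: float, m: Dict[str, float]) -> None:
        self.p_access = p_access               # P_A(q)
        self.m = dict(m)                       # m(q, a)
        self.pi_stop = 0.0                     # pi(q)
        self.pi: Dict[str, float] = {}         # pi(q, a)
        self.children: Dict[str, "Node"] = {}  # tau(q, a)


def normalize_node(q: Node, alphabet: List[str]) -> None:
    """Alg. 5: rescale the weights so that pi(q) + sum_a pi(q, a) = 1."""
    s = sum(q.m[a] for a in alphabet)
    f = (1.0 - q.pi_stop) / s  # assumes s > 0
    for a in alphabet:
        q.pi[a] = f * q.m[a]


def dfs_update(root: Node, alphabet: List[str]) -> None:
    """Alg. 4: set pi(q) so that the model reproduces P_A(q) for every node."""
    stack: List[Tuple[Node, float]] = [(root, 1.0)]
    while stack:
        q, p = stack.pop()
        q.pi_stop = q.p_access / p  # p: product of pi(., .) along the path to q
        normalize_node(q, alphabet)
        for a in alphabet:
            if a in q.children:
                stack.append((q.children[a], p * q.pi[a]))


# Hand-picked example: a root with one child reached via "a".
root = Node(p_access=0.2, m={"a": 0.2, "b": 0.12})
child = Node(p_access=0.1, m={"a": 0.05, "b": 0.03})
root.children["a"] = child
dfs_update(root, ["a", "b"])
```

After the update, the product `root.pi["a"] * child.pi_stop` equals the stored access probability of the child, which is the condition the comment in Alg. 4 refers to.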

3.2 Finding a hypothesis candidate

Just like the L# algorithm, we employ the red-blue-framework (Lang et al., 1998) when growing the observation tree. We separate the tree into three distinct parts: a core of identified red nodes, a fringe of blue nodes, and a set of white nodes that are neither red nor blue. Initially, the algorithm starts with only a red state $q_0$, to which it adds a fringe through Alg. 1. These newly created nodes are marked blue. A blue node is a node that is not red, but whose parent node is red. The algorithm then triggers the DFS routine to estimate the probabilities and tries to minimize the automaton.

We utilize techniques from state machine learning by proposing a merge check. We say that a pair of a red node $q_r$ and a blue node $q_b$ is consistent under a threshold $\mu \in [0, 1)$ iff

$$d(q_r, q_b) = \left| \frac{\pi(x^A_{q_b})}{\pi(q_b)} \cdot \pi(q_r) - P(x^A_{q_b}) \right| \leq \mu. \tag{1}$$

In order to build deterministic models we perform a determinization process whenever we merge two states (explained in e.g. Verwer et al.). We call $q_r$ and $q_b$ mergeable iff for all $q_b' \in Q_{\tau(q_b)}$, reached from $q_b$ via $x \in X_{\tau(q_b)}$, it holds that either $\tau(q_r, x)$ is not defined, or $q_r' = \tau(q_r, x)$ and $d(q_r', q_b') \leq \mu$. This way we ensure that our error bound holds for all strings the algorithm has seen so far. We note that the red-blue-framework ensures that the nodes $Q_{\tau(q_b)}$ form a tree and thus do not form loops.
The resulting access sequence of a merge is $x_{q_r}^A$.
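
The consistency distance of Eq. 1 and the recursive mergeability condition can be sketched in Python as follows. This is an illustration under assumptions: each node stores the model probability $\pi(x_q^A)$ of its access string directly, $\pi(q_b) > 0$, and the names `Node`, `distance`, and `mergeable` are hypothetical. The recursion terminates because the nodes below a blue node form a tree.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    pi_string: float  # pi(x_q^A): model probability of the access string
    pi_stop: float    # pi(q), assumed > 0 in this sketch
    p_access: float   # P(x_q^A) as answered by the teacher
    children: Dict[str, "Node"] = field(default_factory=dict)


def distance(q_r: Node, q_b: Node) -> float:
    """Eq. 1: swap the stopping probability of q_b for that of q_r and
    compare the resulting string probability with the teacher's answer."""
    return abs(q_b.pi_string / q_b.pi_stop * q_r.pi_stop - q_b.p_access)


def mergeable(q_r: Node, q_b: Node, mu: float) -> bool:
    """True iff every node below q_b that has a counterpart below q_r is
    consistent with that counterpart under the threshold mu."""
    if distance(q_r, q_b) > mu:
        return False
    for a, child_b in q_b.children.items():
        child_r = q_r.children.get(a)
        # An undefined tau(q_r, x) imposes no constraint on this branch.
        if child_r is not None and not mergeable(child_r, child_b, mu):
            return False
    return True


red = Node(pi_string=0.2, pi_stop=0.2, p_access=0.2)
other_red = Node(pi_string=0.3, pi_stop=0.6, p_access=0.3)
blue = Node(pi_string=0.1, pi_stop=0.2, p_access=0.1)

print(mergeable(red, blue, mu=0.05))        # the two stopping probabilities agree
print(mergeable(other_red, blue, mu=0.05))  # the error of 0.2 exceeds mu
```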

Every time the algorithm minimizes the observation tree, it assumes a red root node $q_0$, a fringe of blue nodes $q'$ s.t. $q' = \tau(q_0, a)$ $\forall a \in \Sigma$, and a possibly empty set of white nodes (we guarantee these starting conditions through the algorithm; a description follows later). The goal of the algorithm is to find a complete basis $\mathcal{B}$, i.e. a set of red nodes $\mathcal{B}$ s.t. $\forall q \in \mathcal{B}, \forall a \in \Sigma: \tau(q, a) \in \mathcal{B}$. This is done by minimizing the observation tree through an iterative procedure, where each iteration processes the current set of blue nodes. To do this, the algorithm compares each of the currently blue nodes with each of the currently red nodes and chooses an operation for each blue node. The algorithm remembers the chosen operation for each node and performs them at the end of the iteration. If a blue node can be merged with a red node, the algorithm remembers this merge as the operation. If a blue node can be merged with multiple red nodes, the merge that introduces minimal error according to Eq. 1 is preferred. We consider a blue node that cannot be merged with any current red node a newly identified state and turn it red. We describe the minimization of one layer in Alg. 6.

Algorithm 6 Merge layer
Input: set of red nodes $\mathcal{R}$, set of blue nodes $\mathcal{B}$, threshold $\mu$
$H_O \leftarrow$ empty hash-table $\triangleright$ Stores the optimal operation for each blue node
for all nodes $q \in \mathcal{B}$ do
     $s_{min} \leftarrow \infty$ $\triangleright$ If multiple merges are possible, use $s_{min}$ to select one
     for all nodes $q' \in \mathcal{R}$ do
          if $q$ and $q'$ mergeable under $\mu$ and $score(q, q') < s_{min}$ then $\triangleright$ $score(q, q')$: Eq. 1
               $H_O.insert(q, merge(q, q'))$
               $s_{min} \leftarrow score(q, q')$
          end if
     end for
     if no operation in $H_O$ for $q$ then $\triangleright$ True if $q$ could not be merged
          $H_O.insert(q, turn\_red(q))$
     end if
end for
for all nodes $q \in \mathcal{B}$ do
     $o \leftarrow H_O.search(q)$
     Perform $o$
end for

If every blue node of one layer was mergeable with at least one red node, the resulting set of red nodes forms a complete basis. In this case the current model is forwarded to the hypothesis testing procedure. If all layers of the current automaton have been merged and no complete basis has been found, the algorithm undoes all performed operations to recover the original observation tree and extends the fringe. This procedure is repeated until a complete basis has been found or until an early stop criterion has been reached (see appendix A).
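The per-layer selection described above can be sketched in a few lines of Python. This is a minimal illustration and not the FlexFringe implementation: `plan_layer`, `is_complete_basis`, and the callables `mergeable` and `score` are hypothetical names standing in for the consistency check under $\mu$ and for Eq. 1. The sketch assumes that a lower score is preferred and tracks the best score with `None` instead of an initial numeric value.

```python
from typing import Callable, Dict, Hashable, Iterable, Optional, Tuple

Node = Hashable
# ("merge", blue, red) or ("turn_red", blue, None)
Operation = Tuple[str, Node, Optional[Node]]


def plan_layer(
    red: Iterable[Node],
    blue: Iterable[Node],
    mergeable: Callable[[Node, Node], bool],
    score: Callable[[Node, Node], float],
) -> Dict[Node, Operation]:
    """Choose one operation per blue node without touching the automaton."""
    red = list(red)
    plan: Dict[Node, Operation] = {}
    for q in blue:
        best: Optional[float] = None
        for r in red:
            if not mergeable(q, r):
                continue
            s = score(q, r)
            # Assumption: among several possible merges the lowest score wins.
            if best is None or s < best:
                best = s
                plan[q] = ("merge", q, r)
        if q not in plan:
            # No red node was compatible, so the blue node is promoted.
            plan[q] = ("turn_red", q, None)
    return plan


def is_complete_basis(plan: Dict[Node, Operation]) -> bool:
    """A layer yields a complete basis if every blue node could be merged."""
    return all(op[0] == "merge" for op in plan.values())


# Toy layer: b0 is mergeable with both red nodes, b1 with none.
scores = {("b0", "r0"): 0.3, ("b0", "r1"): 0.1}
plan = plan_layer(
    red=["r0", "r1"],
    blue=["b0", "b1"],
    mergeable=lambda q, r: (q, r) in scores,
    score=lambda q, r: scores[(q, r)],
)
```

Collecting the operations first and performing them afterwards mirrors the two loops of the listing: the decision for one blue node is not influenced by merges performed for another node of the same layer.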

3.3 Hypothesis testing and counterexample processing

Once a valid hypothesis $\mathcal{H}$ has been found the learner asks the teacher for equivalence (see section 3). We consider $\mathcal{T}$ and $\mathcal{H}$ equivalent iff $\forall x \in \Sigma^*: |P(x) - \pi(x)| \leq \mu$.

Asking the equivalence query can have two possible outcomes: 1. The teacher deems $\mathcal{H}$ and $\mathcal{T}$ to be equivalent. In this case it responds with ‘yes’, and the learner returns $\mathcal{H}$ and terminates the algorithm. 2. The teacher deems $\mathcal{H}$ and $\mathcal{T}$ not equivalent. In this case it responds with ‘no’ and returns a counterexample $x_{cex}$ to the learner s.t. $|P(x_{cex}) - \pi(x_{cex})| > \mu$. In the latter case the learner gets the chance to process the counterexample to improve $\mathcal{H}$.

We follow a simple strategy. We reset $\mathcal{H}$ to the original observation tree, and subsequently parse the tree via $\tau(x_{cex})$. Because we are guaranteed that all previously asked strings do not violate our equivalence condition (see subsection 3.2), parsing via $\tau(x_{cex})$ will eventually reach a node $q_i$ s.t. $\tau(q_i, x_{cex,i})$ will not be defined. Here, $x_{cex,i}$ is the $i$-th token of $x_{cex}$. In order to process the counterexample the learner iteratively creates nodes s.t. $\tau(q_j, x_{cex,j})$, $j \geq i$, is defined, and initializes them via Alg. 2. Then, the learner continues searching for a complete basis. The overall algorithm flow is depicted in Alg. 7.
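The counterexample step can be illustrated with a small sketch. It assumes a plain dictionary-based observation tree; `TreeNode`, `process_counterexample`, and the callback `init_node` (a placeholder for the initialization of Alg. 2) are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class TreeNode:
    """Node of a toy observation tree, identified by its access sequence."""
    access: Tuple[str, ...] = ()
    children: Dict[str, "TreeNode"] = field(default_factory=dict)


def process_counterexample(
    root: TreeNode,
    x_cex: Sequence[str],
    init_node: Callable[[TreeNode], None],
) -> List[TreeNode]:
    """Walk the tree along x_cex and create every missing node on the path."""
    node = root
    created: List[TreeNode] = []
    for i, token in enumerate(x_cex):
        child = node.children.get(token)
        if child is None:
            # The transition is undefined here, so the tree is extended.
            child = TreeNode(access=tuple(x_cex[: i + 1]))
            node.children[token] = child
            init_node(child)  # placeholder for the initialization of Alg. 2
            created.append(child)
        node = child
    return created


# The tree already contains the path "a"; the counterexample is "abb".
root = TreeNode()
root.children["a"] = TreeNode(access=("a",))
initialized: List[TreeNode] = []
created = process_counterexample(root, "abb", initialized.append)
```

In this toy run the first undefined transition is met after the prefix "a", so two nodes are created, one for "ab" and one for "abb".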

Algorithm 7 Main routine
Threshold $\mu$, alphabet $\Sigma$, access to SUL $\mathcal{T}$ via teacher
$\mathcal{H} \leftarrow q_0$
$\mathcal{F} \leftarrow \{q_0\}$ $\triangleright$ $\mathcal{F}$ is set of fringe nodes
while No termination signal do $\triangleright$ Explained in appendix A.1
     $\mathcal{F} \leftarrow ExtendFringe(\mathcal{F}, \Sigma)$
     while Blue nodes $\mathcal{B}$ remain and no complete basis found do
         $\mathcal{R} \leftarrow$ set of currently red nodes
         $MergeLayer(\mathcal{R}, \mathcal{B}, \mu)$
     end while
     if Complete basis found then
         Ask teacher equivalence query
         if $\mathcal{H}$ and $\mathcal{T}$ equivalent then return $\mathcal{H}$
         end if
         Reset $\mathcal{H}$ to observation tree
         Process counterexample $x_{cex}$ $\triangleright$ See section 3.3
     else
         Reset $\mathcal{H}$ to observation tree
     end if
end while
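The control flow of the main routine can be written as a short Python skeleton. This is only a sketch of the loop structure: every step is passed in as a callable, and all names (`main_routine`, `merge_until_basis`, `should_stop`, and so on) are hypothetical placeholders rather than functions of an existing library.

```python
from typing import Callable, Optional, Sequence


def main_routine(
    extend_fringe: Callable[[], None],
    merge_until_basis: Callable[[], bool],
    equivalence_query: Callable[[], Optional[Sequence[str]]],
    reset_to_tree: Callable[[], None],
    process_counterexample: Callable[[Sequence[str]], None],
    should_stop: Callable[[], bool],
) -> bool:
    """Return True if the teacher accepted a hypothesis, False on early stop."""
    while not should_stop():
        extend_fringe()
        # Inner loop of the listing: merge layer by layer until a complete
        # basis is found or no blue nodes remain.
        if merge_until_basis():
            cex = equivalence_query()  # None means the teacher answered 'yes'
            if cex is None:
                return True
            reset_to_tree()
            process_counterexample(cex)
        else:
            reset_to_tree()
    return False


# Stubbed run: no basis in round 1, a counterexample in round 2, success in round 3.
log = []
merge_results = iter([False, True, True])
counterexamples = iter([("a",), None])

accepted = main_routine(
    extend_fringe=lambda: log.append("extend"),
    merge_until_basis=lambda: next(merge_results),
    equivalence_query=lambda: next(counterexamples),
    reset_to_tree=lambda: log.append("reset"),
    process_counterexample=lambda cex: log.append(("cex", tuple(cex))),
    should_stop=lambda: False,
)
```

Passing the steps as callables keeps the skeleton independent of how the observation tree and the teacher are represented.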

4 Experiments and results

We tested our algorithm on the TAYSIR competition (Eyraud et al., 2023). For obvious reasons we focused on the part of the dataset that allows the inference of PDFA, namely track 2. We implemented our algorithm in FlexFringe (https://github.com/tudelft-cda-lab/FlexFringe) and ran it on the respective models. In order to keep the models small and concise, but also for faster inference, we kept the hyperparameter $\mu$ relatively large at a value of 0.0001. All experiments were run on a notebook with Ubuntu 22.04 64-bit, 16GB RAM, and an Intel i7 CPU @ 2.60GHz x 12. The maximum depth that we explored was set to 6.

Table 1 shows the results that we achieved. For each scenario of track 2 of the competition we list the size of the alphabet and compare the size of the resulting PDFA and the achieved mean-squared-error (MSE) with those of the winners of the competition, who also used automata learning. Additionally, we provide the run-times in Table 2.

We can see that we achieve very low MSE already with very few states, indicating that these PDFA are already capable of modeling the language well. Scenario 10 was the case in which the inferred observation tree reached the maximum depth of 6, which explains its larger run-time.

| Scenario | $|\Sigma|$ | MSE Winners | MSE pL# | $n$ Winners | $n$ pL# |
|---|---|---|---|---|---|
| 1 | 33 | 0.175e-6 | 0.679e-6 | 866 | 21 |
| 2 | 20 | 0.97e-8 | 0.645e-5 | 131 | 91 |
| 3 | 7 | 0.3e-10 | 0.749e-6 | 110 | 30 |
| 4 | 15 | 0.6e-11 | 0.146e-7 | 105 | 10 |
| 5 | 20 | 0.7e-13 | 0.684e-7 | 123 | 11 |
| 6 | 33 | 0.1971e-6 | 0.843e-6 | 318 | 20 |
| 7 | 20 | 0 | 0.364e-10 | 170 | 16 |
| 8 | 66 | 0.443e-7 | 0.123e-5 | 162 | 16 |
| 9 | 20 | 0 | 0.674e-10 | 55 | 17 |
| 10 | 33 | 0.124e-6 | 0.21e-4 | 1412 | 38 |

Table 1: Results on track 2 of the TAYSIR competition set. In reference to the L# algorithm we call our method pL# here. $n$ indicates the number of states of the resulting hypothesis model.
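
| Scenario | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Run-time | 6m31s | 10m9s | 11m8s | 2m57s | 29m22s | 6m19s | 5m24s | 8m30s | 5m32s | 105m56s |

Table 2: Run-times of the experiments conducted.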

5 Discussion and conclusion

In this work we presented a novel active automata learning algorithm. Inspired by the L# algorithm, it learns directly on an observation tree. Capable of asking a new type of query, namely string probability queries, it learns PDFA as a surrogate model of a system under learning. We showed the effectiveness of the algorithm experimentally by distilling PDFA from trained neural networks. Using our method we achieved low inference errors with relatively few states. We also provided a mathematical motivation and analyzed the convergence behavior of the inferred probabilities. A possible application of our algorithm lies in reverse engineering, where we can infer next-symbol-probabilities from only whole-string-probabilities when next-symbol-probabilities are not available.


Acknowledgments and Disclosure of Funding

This work is supported by NWO TTW VIDI project 17541 - Learning state machines from infrequent software traces (LIMIT).

Appendix A Practical considerations

A.1 Early stopping criteria

Under normal circumstances the algorithm will terminate and return a hypothesis once it has found a complete basis for which the teacher cannot determine a counterexample. If the SUL is very complicated, however, the algorithm can potentially run for a very long time. In this case early stop criteria are desired. We introduced a maximum number of $ExtendFringe$-operations the algorithm is allowed to do, which is similar to a maximum depth to which the observation tree can grow. An exception is given by the counterexample processing routine, which can grow paths deeper than this depth. Other methods would be to limit the time the learner runs, or to set a limit on the probability mass covered by the membership queries. Whenever such a criterion is met the algorithm returns an early hypothesis.
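The three criteria mentioned above can be combined in a small helper. The following is a sketch under assumptions: the class `EarlyStop` and its fields are hypothetical, and a criterion set to `None` is simply treated as disabled.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EarlyStop:
    """Bundle of optional early stopping criteria; None disables a criterion."""
    max_extensions: Optional[int] = None   # limit on ExtendFringe operations
    max_seconds: Optional[float] = None    # limit on wall-clock run-time
    max_mass: Optional[float] = None       # limit on covered probability mass
    started: float = field(default_factory=time.monotonic)

    def triggered(self, extensions: int, covered_mass: float) -> bool:
        """Return True as soon as any enabled criterion is met."""
        if self.max_extensions is not None and extensions >= self.max_extensions:
            return True
        if (
            self.max_seconds is not None
            and time.monotonic() - self.started >= self.max_seconds
        ):
            return True
        if self.max_mass is not None and covered_mass >= self.max_mass:
            return True
        return False


# A depth limit of 6 fringe extensions, as used in the experiments.
depth_limit = EarlyStop(max_extensions=6)
```

The learner would call `triggered` once per iteration of the outer loop and return its current hypothesis when the result is `True`.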

A.2 Exploding probability estimates

A problem that arose is that in some instances the estimated probabilities $\pi(q)$ became larger than one. This problem arises due to the fact that probabilities that occur in the access sequence of that node get underestimated, forcing $\pi(q)$ to become larger to meet the criterion $P(x_q^A) = \pi(x_q^A) \cdot \pi(q)$. We want to mention two strategies to deal with this problem. Firstly, the learner can simply continue adding nodes to the observation tree until $\pi(q) \leq 1$ $\forall q \in Q$. We show convergence in section A.4.

For the second approach we clip the stopping probability to a maximum value $\pi(q) \leq 1 - \epsilon$, $\epsilon > 0$, and continue learning. A drawback of this approach is that we cannot guarantee $P(x_q^A) = \pi(x_q^A)$ to hold anymore.
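A small numeric example shows how an underestimated access path inflates the stopping probability and how clipping bounds it. The function names and the numbers are illustrative assumptions. Here $\pi(x_q^A)$ is read as the product of the estimated transition probabilities along the access sequence of $q$.

```python
def estimate_stopping_probability(p_access_string: float, pi_access_path: float) -> float:
    """Solve P(x_q^A) = pi(x_q^A) * pi(q) for pi(q)."""
    return p_access_string / pi_access_path


def clip_stopping_probability(pi_q: float, epsilon: float = 1e-6) -> float:
    """Clip pi(q) to at most 1 - epsilon, with 0 < epsilon < 1."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    return min(pi_q, 1.0 - epsilon)


# The teacher reports P(x_q^A) = 0.06, but the path is estimated at only 0.05.
raw = estimate_stopping_probability(0.06, 0.05)
clipped = clip_stopping_probability(raw, epsilon=0.01)
```

After clipping, the product of the path estimate and the clipped value is smaller than the queried string probability, which is the loss of exactness noted above.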

A.3 Equivalence testing

In practice it is not feasible to determine equivalence exactly. To approximate equivalence, usually an input string strategy is employed, and the challenge is to achieve good coverage with the tested strings. We resorted to simple random string testing: if for a current hypothesis $\mathcal{H}$ the equivalence oracle failed to find a counterexample within 10000 randomly generated strings, we considered the systems equal and output the hypothesis. Albeit simple, this method already yields good results.
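A random-testing oracle of this kind might look as follows. The sketch makes several assumptions that the text does not specify: string lengths are drawn uniformly up to a hypothetical `max_len`, tokens are drawn uniformly from the alphabet, and both systems are given as functions from a token tuple to a string probability.

```python
import random
from typing import Callable, Optional, Sequence, Tuple

String = Tuple[str, ...]


def random_equivalence_query(
    target_prob: Callable[[String], float],
    hypothesis_prob: Callable[[String], float],
    alphabet: Sequence[str],
    mu: float,
    n_strings: int = 10000,
    max_len: int = 20,
    seed: int = 0,
) -> Optional[String]:
    """Return a string x with |P(x) - pi(x)| > mu, or None if none was sampled."""
    rng = random.Random(seed)
    for _ in range(n_strings):
        length = rng.randint(0, max_len)  # assumption: uniform length sampling
        x = tuple(rng.choice(alphabet) for _ in range(length))
        if abs(target_prob(x) - hypothesis_prob(x)) > mu:
            return x
    return None


ALPHABET = ("a", "b")


def target(x: String) -> float:
    """Toy SUL: stop with probability 0.5, else emit 'a' or 'b' with 0.25 each."""
    return 0.5 * 0.25 ** len(x)


def biased(x: String) -> float:
    """Toy hypothesis that is wrong on every string starting with 'a'."""
    return target(x) + (0.1 if x[:1] == ("a",) else 0.0)


no_cex = random_equivalence_query(target, target, ALPHABET, mu=1e-4)
cex = random_equivalence_query(target, biased, ALPHABET, mu=1e-4)
```

Returning `None` corresponds to the 'yes' answer of the teacher; any returned tuple can be handed to the counterexample processing of section 3.3.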

A.4 Convergence of estimated probabilities

Lemma 1 After each DFS search on the observation tree, the probabilities $\pi(x)$ of each sequence $x$ for which $\tau(x)$ is defined on the observation tree satisfy $\pi(x) = P(x)$.

The proof of this follows by design of the algorithm. The next one gives us some idea about the behavior of the distributions in the limit.

Lemma 2 Assuming the teacher returns correct probabilities $P(x)$, $\pi(q_0, a)$ will converge to the real probabilities $P(a\Sigma^*)$ for each $a \in \Sigma$. Furthermore, after initialization of the root node $\pi(q_0)$ remains correct at all times.

The proof follows from the definition of probabilities. For the root it holds that for a given $a \in \Sigma$: $P(a\Sigma^*) = \sum_{x \in \Sigma^*} P(ax)$. Algorithm 3 ensures that this condition is fulfilled as the number of states $n$ in the observation tree grows, $n \rightarrow \infty$. The second statement follows simply from the definition of $\pi(q)$ and from the fact that for $q_0$ it is always $x_{q_0}^A = \lambda$. Note that as a side consequence of this we get $\sum_{a \in \Sigma} m(q_0, a) + \pi(q_0) = 1$ in the limit.
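The identity $P(a\Sigma^*) = \sum_{x \in \Sigma^*} P(ax)$ can be checked numerically on a toy distribution. The following sketch does not reproduce Algorithm 3. It assumes a hypothetical single-state PDFA (stopping probability 0.5, next-symbol probabilities 0.3 and 0.2) and sums string probabilities over all suffixes up to a growing length bound.

```python
from itertools import product
from typing import Tuple

ALPHABET = ("a", "b")
STOP = 0.5                    # stopping probability of the single state
NEXT = {"a": 0.3, "b": 0.2}   # next-symbol probabilities of the single state


def string_probability(x: Tuple[str, ...]) -> float:
    """P(x): emit every token of x, then stop."""
    p = STOP
    for token in x:
        p *= NEXT[token]
    return p


def prefix_mass(a: str, max_suffix_len: int) -> float:
    """Sum of P(a x) over all suffixes x with |x| <= max_suffix_len."""
    total = 0.0
    for n in range(max_suffix_len + 1):
        for suffix in product(ALPHABET, repeat=n):
            total += string_probability((a,) + suffix)
    return total


# Partial sums for growing suffix length; they approach P(a Sigma^*) = 0.3.
partial = [prefix_mass("a", n) for n in range(13)]
```

The partial sums grow monotonically towards 0.3, the probability that a string starts with `a`. Together with the analogous limit 0.2 for `b` and the stopping probability 0.5 they add up to one, in line with the side consequence noted above.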

Lemma 3 We again assume the teacher returns correct probabilities $P(x)$, $\forall x \in \Sigma^*$. Then, for each node $q \in Q$ of the observation tree with access sequence $x_q^A$ its respective attribute $m(q, a)$ will converge to $P(x_q^A a \Sigma^*) = \sum_{x' \in \Sigma^*} P(x_q^A a x')$, again $\forall a \in \Sigma$.

Again, the proof follows from the definition. With these two we can now proceed to the following, more interesting theorem.

Theorem 1 Given the algorithm as described in this work, the probabilities $\pi(q)$ and $\pi(q, a)$ of each node $q$ of the observation tree converge to the real probabilities as the number of nodes $n$ in the observation tree grows towards the limit, $n \rightarrow \infty$.

Starting from the root node, we know that the stopping probability $\pi(q_0)$ is correct. We also know that $P(a|q_0)$ is the real probability $P(a\Sigma^*)$ for all $a \in \Sigma$. We choose an arbitrary element $y \in \Sigma$. Then, with $q_0$ modeling the probabilities correctly in the limit, we know that for $y$ and the child node $q_1 = \tau(q_0, y)$ the stopping probability $\pi(q_1)$ must also be correct, since $P(y|q_0)$ is correct. It remains to show that the distribution $\pi(q_1, a)$ will converge to its true distribution. We start again with the observation that $\pi(y|q_0) = P(y\Sigma^*)$, i.e. it is the sum of the probabilities of all strings that start with $y$, and we note that for any $a \in \Sigma$ we have

$$P(a|q_1) = P(a|y\Sigma^*) \cdot P(y\Sigma^*). \quad (2)$$

We further require for $q_1$ that $\sum_{a \in \Sigma} \pi(a|q_1) + \pi(q_1) = 1$. We can expand the term with Eq. 2 and arrive at

$$\frac{\sum_{a \in \Sigma} P(yaX)}{P(yX|q_0)} = 1 - \pi(q_1). \quad (3)$$

From this we can see that the normalization operation of Alg. 5 is correct in the limit. The rest of the proof follows by induction.
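The normalization condition can also be seen from a short mass-splitting argument. The following derivation is a sketch under two assumptions that the text does not state explicitly: $X$ in Eq. 3 is read as the set of all continuations $\Sigma^*$, and $P(y\Sigma^*) > 0$. Every string starting with $y$ is either $y$ itself or continues with some token $a$:

```latex
\begin{align*}
P(y\Sigma^*) &= P(y) + \sum_{a \in \Sigma} P(ya\Sigma^*) \\
1 &= \frac{P(y)}{P(y\Sigma^*)} + \frac{\sum_{a \in \Sigma} P(ya\Sigma^*)}{P(y\Sigma^*)} \\
\frac{\sum_{a \in \Sigma} P(ya\Sigma^*)}{P(y\Sigma^*)} &= 1 - \pi(q_1),
\qquad \text{with } \pi(q_1) = \frac{P(y)}{P(y\Sigma^*)}
\end{align*}
```

The first fraction in the second line is the probability of stopping after $y$, and each summand of the second fraction is the conditional probability of continuing with $a$, so the outgoing probabilities of $q_1$ and its stopping probability sum to one.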

Appendix B Abbreviations

  • SUL: System under learning

  • P(D)FA: Probabilistic (Deterministic) Finite Automaton

  • RNN: Recurrent Neural Network

  • DFS: Depth First Search

Appendix C Notation

  • $\Sigma$: Alphabet

  • $a$: Token in $\Sigma$

  • $x$: String over $\Sigma^*$

  • $|x|$: Length of string $x$

  • $\lambda$: Empty string

  • $Q$: Set of states of automaton

  • $q_i$: A state in $Q$, indexed by $i$

  • $q_0$: A unique initial state in $Q$

  • $\tau$: Transition function $Q \times \Sigma \rightarrow Q$

  • $Q_{\tau(q)}$: The set of all reachable states from state $q$

  • $X_{\tau(q)}$: The set of shortest strings to reach states in $Q_{\tau(q)}$ s.t. each state $q' \in Q_{\tau(q)}$ has exactly one associated $x \in X_{\tau(q)}$: $\tau(q, x) = q'$

  • $\pi$: Probability $Q \times \Sigma \rightarrow [0, 1]$

  • $x^A_q$: The unique access sequence of node $q$ in the observation tree

  • $\mu$: Error bound for learning

  • $\mathcal{H}$: A hypothesis

  • $\mathcal{T}$: The target/SUL


References

  • Angluin (1987) Dana Angluin. Learning regular sets from queries and counterexamples. Inf. Comput., 75(2):87–106, nov 1987. ISSN 0890-5401. doi: 10.1016/0890-5401(87)90052-6. URL https://doi.org/10.1016/0890-5401(87)90052-6.
  • Balle et al. (2014) Borja Balle, Xavier Carreras, Franco M. Luque, and Ariadna Quattoni. Spectral learning of weighted automata: A forward-backward perspective. Machine Learning, pages 33–63, 07 2014. doi: 10.1007/s10994-013-5416-x.
  • Clark and Thollard (2004) Alexander Clark and Franck Thollard. PAC-learnability of probabilistic deterministic finite state automata. J. Mach. Learn. Res., 5:473–497, dec 2004. ISSN 1532-4435.
  • Eyraud et al. (2023) Rémi Eyraud, Dakotah Lambert, Badr Tahri Joutei, Aidar Gaffarov, Mathias Cabanne, Jeffrey Heinz, and Chihiro Shibata. TAYSIR competition: Transformer+RNN: Algorithms to yield simple and interpretable representations. In François Coste, Faissal Ouardi, and Guillaume Rabusseau, editors, Proceedings of 16th edition of the International Conference on Grammatical Inference, volume 217 of Proceedings of Machine Learning Research, pages 275–290. PMLR, 10–13 Jul 2023. URL https://proceedings.mlr.press/v217/eyraud23a.html.
  • Eyraud and Ayache (2021) Rémi Eyraud and Stéphane Ayache. Distillation of weighted automata from recurrent neural networks using a spectral approach. Machine Learning, 04 2021. doi: 10.1007/s10994-021-05948-1.
  • Guidotti et al. (2018) Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM Comput. Surv., 51(5), aug 2018. ISSN 0360-0300. doi: 10.1145/3236009. URL https://doi.org/10.1145/3236009.
  • Lang et al. (1998) Kevin J. Lang, Barak A. Pearlmutter, and Rodney A. Price. Results of the abbadingo one DFA learning competition and a new evidence-driven state merging algorithm. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 1433, pages 1–12. Springer Verlag, 1998. ISBN 3540647767. doi: 10.1007/bfb0054059.
  • Mayr et al. (2022) Franz Mayr, Sergio Yovine, Federico Pan, Nicolas Basset, and Thao Dang. Towards efficient active learning of PDFA, 2022.
  • Mayr et al. (2023) Franz Mayr, Sergio Yovine, Matías Carrasco, Federico Pan, and Federico Vilensky. A congruence-based approach to active automata learning from neural language models. In François Coste, Faissal Ouardi, and Guillaume Rabusseau, editors, Proceedings of 16th edition of the International Conference on Grammatical Inference, volume 217 of Proceedings of Machine Learning Research, pages 250–264. PMLR, 10–13 Jul 2023. URL https://proceedings.mlr.press/v217/mayr23a.html.
  • Okudono et al. (2019) Takamasa Okudono, Masaki Waga, Taro Sekiyama, and Ichiro Hasuo. Weighted automata extraction from recurrent neural networks via regression on state spaces, 2019.
  • Vaandrager et al. (2022) Frits Vaandrager, Bharat Garhewal, Jurriaan Rot, and Thorsten Wißmann. A new approach for active automata learning based on apartness. In Dana Fisman and Grigore Rosu, editors, Tools and Algorithms for the Construction and Analysis of Systems, pages 223–243, Cham, 2022. Springer International Publishing. ISBN 978-3-030-99524-9.
  • Verwer et al. (2007) Sicco Verwer, Mathijs de Weerdt, and Cees Witteveen. An algorithm for learning real-time automata. 2007. URL https://api.semanticscholar.org/CorpusID:15561445.
  • Weiss et al. (2019) Gail Weiss, Yoav Goldberg, and Eran Yahav. Learning deterministic weighted automata with queries and counterexamples. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/d3f93e7766e8e1b7ef66dfdd9a8be93b-Paper.pdf.