License: arXiv.org perpetual non-exclusive license
arXiv:2302.05793v3 [cs.LG] 17 Feb 2024

Distributional GFlowNets with Quantile Flows

Dinghuai Zhang* (Mila, University of Montreal)
Ling Pan* (Hong Kong University of Science and Technology)
Ricky T. Q. Chen (Meta AI, Fundamental AI Research)
Aaron Courville (Mila, University of Montreal)
Yoshua Bengio (Mila, University of Montreal)

Abstract

Generative Flow Networks (GFlowNets) are a new family of probabilistic samplers where an agent learns a stochastic policy for generating complex combinatorial structure through a series of decision-making steps. There have been recent successes in applying GFlowNets to a number of practical domains where diversity of the solutions is crucial, while reinforcement learning aims to learn an optimal solution based on the given reward function only and fails to discover diverse and high-quality solutions. However, the current GFlowNet framework is relatively limited in its applicability and cannot handle stochasticity in the reward function. In this work, we adopt a distributional paradigm for GFlowNets, turning each flow function into a distribution, thus providing more informative learning signals during training. By parameterizing each edge flow through their quantile functions, our proposed quantile matching GFlowNet learning algorithm is able to learn a risk-sensitive policy, an essential component for handling scenarios with risk uncertainty. Moreover, we find that the distributional approach can achieve substantial improvement on existing benchmarks compared to prior methods due to our enhanced training algorithm, even in settings with deterministic rewards.

The asterisk mark * denotes equal contributions. Correspondence to: Dinghuai Zhang <[email protected]>.

1 Introduction

The success of reinforcement learning (RL) (Sutton & Barto, 2005) has been built on learning intelligent agents that are capable of making long-horizon sequential decisions (Mnih et al., 2015; Vinyals et al., 2019). These strategies are often learned by maximizing rewards with the aim of finding a single optimal solution. That being said, practitioners have also found that generating diverse solutions, rather than just a single optimum, has many real-world applications, such as exploration in RL (Hazan et al., 2018; Zhang et al., 2022b), drug discovery (Huang et al., 2016; Zhang et al., 2021; Jumper et al., 2021), and material design (Zakeri & Syri, 2015; Zitnick et al., 2020). One promising approach to search for a diverse set of high-quality candidates is to sample proportionally to the reward function (Bengio et al., 2021a).

Recently, GFlowNets (Bengio et al., 2021a; b) have been proposed as a novel probabilistic framework to tackle this problem. Taking inspiration from RL, a GFlowNet policy takes a series of decision-making steps to generate a composite object $\mathbf{x}$ with probability proportional to its return $R(\mathbf{x})$. The number of particles in each “flow” intuitively denotes the scale of probability along the corresponding path. The use of parametric policies enables GFlowNets to generalize to unseen states and trajectories, making them more desirable than traditional Markov chain Monte Carlo (MCMC) methods (Zhang et al., 2022c), which are known to suffer from mode-mixing issues (Desjardins et al., 2010; Bengio et al., 2012). With their unique ability to support off-policy training, GFlowNets have been demonstrated to be superior to variational inference methods (Malkin et al., 2022b).

Figure 1: Illustration of a distributional GFlowNet with stochastic edge flows. Each circle denotes a state; concentric circles on the right side denote terminal states to which rewards are assigned. $\mathbf{s}_0 \to \mathbf{s}_1 \to \mathbf{s}_2$ is a complete trajectory which starts from the initial state $\mathbf{s}_0$ and ends at a terminal state $\mathbf{s}_2$. In order to cope with a stochastic reward, we represent every edge flow as a random variable, denoted by gray probability curve icons.

Yet the current GFlowNet frameworks can only learn from a deterministic reward oracle, which is too stringent an assumption for realistic scenarios. Realistic environments are often stochastic (e.g. due to noisy observations), where the need for uncertainty modeling (Kendall & Gal, 2017; Guo et al., 2017; Teng et al., 2022) emerges. In this work, we propose adopting a probabilistic approach to model the flow function (see Figure 2) in order to account for this stochasticity. Analogous to distributional RL (Bellemare et al., 2017) approaches, we think of each edge flow as a random variable, and parameterize its quantile function. We then use quantile regression to train the GFlowNet model based on a temporal-difference-like flow constraint. The proposed GFlowNet learning algorithm, dubbed quantile matching (QM), is able to match stochastic reward functions. QM can also output risk-sensitive policies under user-provided distortion risk measures, which allow it to behave more similarly to human decision-making. The proposed method also provides a stronger learning signal during training, which additionally allows it to outperform existing GFlowNet training approaches on standard benchmarks with just deterministic environments. Our code is openly available at https://github.com/zdhNarsil/Distributional-GFlowNets.

To summarize, the contributions of this work are:

  • We propose quantile matching (QM), a novel distributional GFlowNet training algorithm, for handling stochastic reward settings.

  • A risk-sensitive policy can be obtained from QM, under provided distortion risk measures.

  • The proposed method outperforms existing GFlowNet methods even on deterministic benchmarks.

2 Preliminaries

2.1 GFlowNets

Generative Flow Networks (GFlowNets; Bengio et al., 2021a; b) are a family of probabilistic models that generate composite objects through a sequence of decision-making steps. The stochastic policies are trained to generate complex objects $\mathbf{x}$ in a combinatorial space $\mathcal{X}$ with probability proportional to a given reward function $R(\mathbf{x})$, where $R: \mathcal{X} \to \mathbb{R}_+$. The sequential nature of GFlowNets comes from the fact that the learned policy incrementally modifies a partially constructed state $\mathbf{s} \in \mathcal{S} \supseteq \mathcal{X}$ with some action $(\mathbf{s} \to \mathbf{s}') \in \mathcal{A} \subseteq \mathcal{S} \times \mathcal{S}$. More specifically, let $\mathcal{G} = (\mathcal{S}, \mathcal{A})$ be a directed acyclic graph (DAG); a GFlowNet sequentially samples a trajectory $\tau = (\mathbf{s}_0 \to \mathbf{s}_1 \to \ldots)$ within that DAG with a forward policy $P_F(\mathbf{s}_{t+1} | \mathbf{s}_t)$. Here the state $\mathbf{s}$ and the action $(\mathbf{s} \to \mathbf{s}')$ are respectively a vertex and an edge in the GFlowNet trajectory DAG $\mathcal{G}$.
We also typically assume that in such DAGs, the relationship between action and future state is a one-to-one correspondence. This is unlike typical RL setups (where the environment is generally stochastic) and is more appropriate for internal actions like attention, thinking, reasoning, or generating answers to a question or candidate solutions to a problem. We say $\mathbf{s}'$ is a child of $\mathbf{s}$ and $\mathbf{s}$ is a parent of $\mathbf{s}'$ if $(\mathbf{s} \to \mathbf{s}')$ is an edge in $\mathcal{G}$. We call states without children terminal states. Notice that any object $\mathbf{x} \in \mathcal{X}$ is a terminal state. We also define a special initial state $\mathbf{s}_0$ (which has no parent) as an abstraction for the first state of any object-generating path.

A trajectory $\tau = (\mathbf{s}_0 \to \mathbf{s}_1 \to \ldots \to \mathbf{s}_n)$ is complete if it starts at the initial state $\mathbf{s}_0$ and ends at a terminal state $\mathbf{s}_n \in \mathcal{X}$. We define the trajectory flow on the set of all complete trajectories $\mathcal{T}$ to be a non-negative function $F: \mathcal{T} \to \mathbb{R}_+$. It is called a flow, as $F(\tau)$ can be thought of as the amount of particles flowing from the initial state to a terminal state along the trajectory $\tau$, similar to the classical notion of network flows (Ford & Fulkerson, 1956). The flow function is an unnormalized measure over $\mathcal{T}$, and we can define a distribution over complete trajectories $P_F(\tau) = F(\tau)/Z$, where $Z \in \mathbb{R}_+$ is the partition function. The flow is Markovian if there exists a forward policy that satisfies the following factorization:

$$P_F(\tau = (\mathbf{s}_0 \to \ldots \to \mathbf{s}_n)) = \prod_{t=0}^{n-1} P_F(\mathbf{s}_{t+1} | \mathbf{s}_t). \tag{1}$$

Any trajectory distribution arising from a forward policy satisfies the Markov property. On the other hand, Bengio et al. (2021b) show the less obvious fact that any Markovian trajectory distribution arises from a unique forward policy.

We use $P_T(\mathbf{x}) = \sum_{\tau \to \mathbf{x}} P_F(\tau)$ to denote the terminating probability, namely the marginal likelihood of generating the object $\mathbf{x}$, where the summation runs over all trajectories that terminate at $\mathbf{x}$. The learning problem considered by GFlowNets is fitting the flow such that it samples objects with probability proportional to a given reward function, i.e., $P_T(\mathbf{x}) \propto R(\mathbf{x})$. This can be represented by the reward matching constraint:

$$R(\mathbf{x}) = \sum_{\tau = (\mathbf{s}_0 \to \ldots \to \mathbf{s}_n),\, \mathbf{s}_n = \mathbf{x}} F(\tau). \tag{2}$$

It is easy to see that the normalizing factor should satisfy $Z = \sum_{\mathbf{x} \in \mathcal{X}} R(\mathbf{x})$. Nonetheless, this computation is non-trivial, as it comes down to summation or enumeration over an exponentially large combinatorial space. GFlowNets therefore provide a way to approximate two intractable computations: sampling (given an unnormalized probability function, like MCMC methods) and marginalizing (in the simplest scenario, estimating the partition function $Z$, but this can be extended to estimate general marginal probabilities in joint distributions) (Bengio et al., 2021b).
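As a small illustration of the reward matching constraint and the partition function, the sketch below enumerates every complete trajectory of a toy DAG. The graph, the state names, and the reward values are assumptions made up for this example; the even split of each reward across trajectories is just one of many trajectory flows that satisfy Equation 2.

```python
from collections import defaultdict

# Hypothetical toy DAG: s0 -> {a, b}; a -> {x1, x2}; b -> {x2}.
children = {"s0": ["a", "b"], "a": ["x1", "x2"], "b": ["x2"], "x1": [], "x2": []}
reward = {"x1": 1.0, "x2": 3.0}  # assumed rewards of the terminal states


def complete_trajectories(state="s0", prefix=()):
    """Enumerate all paths from `state` to a terminal state."""
    path = prefix + (state,)
    if not children[state]:
        yield path
        return
    for child in children[state]:
        yield from complete_trajectories(child, path)


trajs = list(complete_trajectories())

# One valid trajectory flow: split R(x) evenly over the trajectories ending at x,
# so that the reward matching constraint (Equation 2) holds by construction.
n_paths = defaultdict(int)
for tau in trajs:
    n_paths[tau[-1]] += 1
F = {tau: reward[tau[-1]] / n_paths[tau[-1]] for tau in trajs}

Z = sum(F.values())  # partition function, equal to the sum of all rewards
P_T = defaultdict(float)
for tau, flow in F.items():
    P_T[tau[-1]] += flow / Z  # terminating probability of each object

print(Z, dict(P_T))
```

Enumeration is only feasible here because the toy graph has three trajectories; in the combinatorial spaces discussed above, this sum is exactly the intractable part.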

2.2 Distributional modeling in control

Distributional reinforcement learning (Bellemare et al., 2023) is an RL approach that models the distribution of returns instead of their expected value. Mathematically, it considers the Bellman equation of a policy $\pi$ as

$$Z^{\pi}(\mathbf{x}, \mathbf{a}) \stackrel{d}{=} r(\mathbf{x}, \mathbf{a}) + \gamma Z^{\pi}(\mathbf{x}', \mathbf{a}'), \tag{3}$$

where $\stackrel{d}{=}$ denotes equality between two distributions, $\gamma \in [0, 1]$ is the discount factor, $\mathbf{x}, \mathbf{a}$ are the state and action in RL, $(r(\mathbf{x}, \mathbf{a}), \mathbf{x}')$ are the reward and next state after interacting with the environment, $\mathbf{a}'$ is the next action selected by policy $\pi$ at $\mathbf{x}'$, and $Z^{\pi}$ denotes a random variable for the distribution of the Q-function value.

The key idea behind distributional RL is that it allows the agent to represent its uncertainty about the returns it can expect from different actions. In traditional RL, the agent only knows the expected reward of taking a certain action in a certain state, but it doesn’t have any information about how likely different rewards are. In contrast, distributional RL methods estimate the entire return distribution, and use it to make more informed decisions.

3 Formulation

3.1 GFlowNet learning criteria

Flow matching algorithm.

It is not computationally efficient to directly model the trajectory flow function, as it would require learning a function with a high-dimensional input (i.e., the trajectory). Instead, taking advantage of the Markovian property, we define the state flow and edge flow functions $F(\mathbf{s}) = \sum_{\tau \ni \mathbf{s}} F(\tau)$ and $F(\mathbf{s} \to \mathbf{s}') = \sum_{\tau = (\ldots \to \mathbf{s} \to \mathbf{s}' \to \ldots)} F(\tau)$. The edge flow is proportional to the marginal likelihood that a trajectory sampled from the GFlowNet includes the edge transition. By the conservation law of the flow particles, we naturally obtain the flow matching constraint of GFlowNets

$$\sum_{\mathbf{s}: (\mathbf{s} \to \mathbf{s}') \in \mathcal{A}} F(\mathbf{s} \to \mathbf{s}') = \sum_{\mathbf{s}'': (\mathbf{s}' \to \mathbf{s}'') \in \mathcal{A}} F(\mathbf{s}' \to \mathbf{s}''), \tag{4}$$

for any $\mathbf{s}' \in \mathcal{S}$. This indicates that for every vertex, the in-flow (left-hand side) equals the out-flow (right-hand side). Furthermore, both equal the state flow $F(\mathbf{s}')$. When $\mathbf{s}'$ in Equation 4 is a terminal state, this reduces to the special case of the aforementioned reward matching.

Leveraging the generalization power of modern machine learning models, one could learn a parametric model $F_{\boldsymbol{\theta}}(\mathbf{s}, \mathbf{s}')$ to represent the edge flow. The general idea of GFlowNet training objectives is to turn constraints like Equation 4 into losses that, when globally minimized, enforce these constraints. To approximately satisfy the flow matching constraint (Equation 4), the parameter $\boldsymbol{\theta}$ can be trained to minimize the following flow matching (FM) objective for all intermediate states $\mathbf{s}'$:

$$\mathcal{L}_{\text{FM}}(\mathbf{s}'; \boldsymbol{\theta}) = \left[\log \frac{\sum_{(\mathbf{s} \to \mathbf{s}') \in \mathcal{A}} F_{\boldsymbol{\theta}}(\mathbf{s}, \mathbf{s}')}{\sum_{(\mathbf{s}' \to \mathbf{s}'') \in \mathcal{A}} F_{\boldsymbol{\theta}}(\mathbf{s}', \mathbf{s}'')}\right]^2. \tag{5}$$

In practice, the model is trained with trajectories from some training distribution $\pi(\tau)$ with full support, using the stochastic gradient $\mathbb{E}_{\tau \sim \pi(\tau)}\left[\sum_{\mathbf{s} \in \tau} \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\text{FM}}(\mathbf{s}; \boldsymbol{\theta})\right]$, where $\pi(\tau)$ could be the trajectory distribution sampled by the GFlowNet (i.e., $P_F(\tau)$, which corresponds to on-policy training), or (for off-policy training and better exploration) a tempered version of $P_F(\tau)$ or a mixture between $P_F(\tau)$ and a uniform policy.
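To make the flow matching objective of Equation 5 concrete, the following sketch evaluates it on a hand-written table of edge flows. The toy graph, the flow values, and the helper name `fm_loss` are assumptions for illustration only; in practice the edge flows would come from a learned model $F_{\boldsymbol{\theta}}$ rather than a dictionary.

```python
import math

# Hypothetical edge flows F_theta(s, s') on a toy DAG (values chosen for illustration).
edge_flow = {
    ("s0", "a"): 2.5, ("s0", "b"): 1.5,
    ("a", "x1"): 1.0, ("a", "x2"): 1.5, ("b", "x2"): 1.5,
}


def fm_loss(state, flows):
    """Squared log-ratio of in-flow to out-flow at an intermediate state (Equation 5)."""
    inflow = sum(f for (s, s_next), f in flows.items() if s_next == state)
    outflow = sum(f for (s, s_next), f in flows.items() if s == state)
    return math.log(inflow / outflow) ** 2


# The table above conserves flow at both intermediate states, so the loss vanishes.
balanced = [fm_loss(s, edge_flow) for s in ("a", "b")]

# Doubling one incoming flow breaks conservation at state "a".
perturbed_flow = dict(edge_flow)
perturbed_flow[("s0", "a")] = 5.0
perturbed = fm_loss("a", perturbed_flow)

print(balanced, perturbed)
```

Because the loss is a squared log-ratio, doubling the in-flow costs $(\log 2)^2$ regardless of the absolute scale of the flows.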

Trajectory balance algorithm.

With the knowledge of edge flow, the corresponding forward policy is given by

$$P_F(\mathbf{s}' | \mathbf{s}) = \frac{F(\mathbf{s} \to \mathbf{s}')}{F(\mathbf{s})} \propto F(\mathbf{s} \to \mathbf{s}'). \tag{6}$$

Similarly, the backward policy $P_B(\mathbf{s} | \mathbf{s}')$ is defined to be $F(\mathbf{s} \to \mathbf{s}') / F(\mathbf{s}') \propto F(\mathbf{s} \to \mathbf{s}')$, a distribution over the parents of state $\mathbf{s}'$. The backward policy is not directly used by the GFlowNet when generating objects, but its benefit can be seen in the following learning paradigm.
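The two policies can be read off an edge-flow table by normalizing over outgoing or incoming edges. The sketch below does this for an assumed toy DAG; the flow values and the function names `forward_policy` and `backward_policy` are hypothetical and serve only to illustrate Equation 6 and its backward counterpart.

```python
# Hypothetical edge flows on a toy DAG; they satisfy flow conservation at "a" and "b".
edge_flow = {
    ("s0", "a"): 2.5, ("s0", "b"): 1.5,
    ("a", "x1"): 1.0, ("a", "x2"): 1.5, ("b", "x2"): 1.5,
}


def forward_policy(state):
    """P_F(s' | s): normalize the flows on edges leaving `state` (Equation 6)."""
    out = {child: f for (s, child), f in edge_flow.items() if s == state}
    total = sum(out.values())  # equals the state flow F(s)
    return {child: f / total for child, f in out.items()}


def backward_policy(state):
    """P_B(s | s'): normalize the flows on edges entering `state`."""
    inc = {parent: f for (parent, s), f in edge_flow.items() if s == state}
    total = sum(inc.values())  # equals the state flow F(s')
    return {parent: f / total for parent, f in inc.items()}


pf_s0 = forward_policy("s0")   # distribution over the children of s0
pb_x2 = backward_policy("x2")  # distribution over the parents of x2
print(pf_s0, pb_x2)
```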

Analogously to the decomposition in Equation 1, the probability of a complete trajectory can also be decomposed into a product of backward policy probabilities

$$P_B(\tau = (\mathbf{s}_0 \to \ldots \to \mathbf{s}_n \mid \mathbf{s}_n = \mathbf{x})) = \prod_{t=0}^{n-1} P_B(\mathbf{s}_t | \mathbf{s}_{t+1}), \tag{7}$$

as shown in Bengio et al. (2021b). In order to balance the forward and backward models at the trajectory level, Malkin et al. (2022a) propose the following trajectory balance (TB) constraint,

$$Z \prod_{t=0}^{n-1} P_F(\mathbf{s}_{t+1} | \mathbf{s}_t) = R(\mathbf{x}) \prod_{t=0}^{n-1} P_B(\mathbf{s}_t | \mathbf{s}_{t+1}), \tag{8}$$

where $(\mathbf{s}_0 \to \ldots \to \mathbf{s}_n)$ is a complete trajectory and $\mathbf{s}_n = \mathbf{x}$. Suppose we have a parameterization with $\boldsymbol{\theta}$ consisting of the estimated forward policy $P_F(\cdot | \mathbf{s}; \boldsymbol{\theta})$, the backward policy $P_B(\cdot | \mathbf{s}'; \boldsymbol{\theta})$, and a learnable global scalar $Z_{\boldsymbol{\theta}}$ for estimating the true partition function. Then we can turn Equation 8 into the $\mathcal{L}_{\text{TB}}$ objective to optimize the parameters:

$$\mathcal{L}_{\text{TB}}(\tau; \boldsymbol{\theta}) = \left[\log \frac{Z_{\boldsymbol{\theta}} \prod_{t=0}^{n-1} P_F(\mathbf{s}_{t+1} | \mathbf{s}_t; \boldsymbol{\theta})}{R(\mathbf{x}) \prod_{t=0}^{n-1} P_B(\mathbf{s}_t | \mathbf{s}_{t+1}; \boldsymbol{\theta})}\right]^2, \tag{9}$$

where $\tau = (\mathbf{s}_0 \to \ldots \to \mathbf{s}_n = \mathbf{x})$. The model is then trained with the stochastic gradient $\mathbb{E}_{\tau \sim \pi(\tau)}\left[\nabla_{\boldsymbol{\theta}} \mathcal{L}_{\text{TB}}(\tau; \boldsymbol{\theta})\right]$. Trajectory balance (Malkin et al., 2022a) is an extension of detailed balance (Bengio et al., 2021b) to the trajectory level that aims to improve credit assignment, but it may incur large variance, as demonstrated on standard benchmarks (Madan et al., 2022). TB is categorized as a Monte Carlo method, while other GFlowNet objectives (e.g., flow matching, detailed balance, and sub-trajectory balance) are temporal-difference methods, which leverage the benefits of both Monte Carlo and dynamic programming.
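The trajectory balance objective of Equation 9 is usually easier to evaluate in log space. The sketch below is a minimal illustration under assumed values: the probabilities correspond to a single hypothetical trajectory $\mathbf{s}_0 \to a \to x_2$ in a toy DAG with total reward 4 and $R(x_2) = 3$, and the function name `tb_loss` is not taken from the authors' code.

```python
import math


def tb_loss(log_z, forward_probs, backward_probs, reward):
    """Trajectory balance loss (Equation 9) for one complete trajectory."""
    # log Z + sum_t log P_F(s_{t+1} | s_t)
    lhs = log_z + sum(math.log(p) for p in forward_probs)
    # log R(x) + sum_t log P_B(s_t | s_{t+1})
    rhs = math.log(reward) + sum(math.log(p) for p in backward_probs)
    return (lhs - rhs) ** 2


# Assumed values for the trajectory s0 -> a -> x2.
forward_probs = [0.625, 0.6]   # P_F(a | s0), P_F(x2 | a)
backward_probs = [1.0, 0.5]    # P_B(s0 | a), P_B(a | x2)
reward_x2 = 3.0

consistent = tb_loss(math.log(4.0), forward_probs, backward_probs, reward_x2)
mismatched = tb_loss(math.log(8.0), forward_probs, backward_probs, reward_x2)
print(consistent, mismatched)
```

With the correct partition function both sides of Equation 8 equal 1.5 and the loss vanishes up to rounding; overestimating $Z$ by a factor of two gives a loss of $(\log 2)^2$ on every trajectory, which is one way to see why the whole trajectory shares a single error signal.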

3.2 Quantile flows

For learning with a deterministic reward function, the GFlowNet policy is stochastic while the edge flow function is deterministic, as per Equation 6. Nonetheless, such modeling cannot capture the potential uncertainty in an environment with a stochastic reward function. The following proposition analyzes the resulting behavior.

Proposition 1 (informal).

Consider the reward $R(\mathbf{x})$ for object $\mathbf{x}$ to be a random variable. Then, given sufficiently large capacity and computational resources, the GFlowNet obtained after training would generate objects with probability proportional to $\exp\left(\mathbb{E}[\log R(\mathbf{x})]\right)$.
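One way to build intuition for the geometric-mean form in Proposition 1, assuming the terminal flow is fit with a squared error in log space as in Equations 5 and 9, is to minimize $\mathbb{E}[(\log F - \log R)^2]$ over a scalar $F$ numerically. The reward distributions below are made up for illustration, and this grid search is only a sanity check, not the proof referred to in the text.

```python
import numpy as np

# Assumed toy reward distributions: (possible values, probabilities) per object.
rewards = {
    "x1": (np.array([1.0, 9.0]), np.array([0.5, 0.5])),  # noisy reward
    "x2": (np.array([4.0]), np.array([1.0])),            # deterministic reward
}

log_f_grid = np.linspace(-1.0, 4.0, 50001)  # candidate values of log F

fitted_flow = {}
for x, (values, probs) in rewards.items():
    # Expected squared log-error for every candidate log F.
    sq_err = (log_f_grid[:, None] - np.log(values)[None, :]) ** 2
    expected_loss = sq_err @ probs
    fitted_flow[x] = float(np.exp(log_f_grid[np.argmin(expected_loss)]))

mean_reward = {x: float(values @ probs) for x, (values, probs) in rewards.items()}
print(fitted_flow, mean_reward)
```

In this toy case the noisy object has mean reward 5 but its fitted flow is close to 3, the geometric mean of 1 and 9, so it would be sampled less often than the deterministic object with reward 4 even though its expected reward is higher.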

The proof is deferred to Section B.1. In this work, we propose to treat the modeling of the flow function in a probabilistic manner: we see the state and edge flows as probability distributions rather than scalar values. Following the notation of Bellemare et al. (2017), we use $Z(\mathbf{s})$ and $Z(\mathbf{s} \to \mathbf{s}')$ to represent the random variables for state and edge flow values. The marginalization property of flow matching still holds, but on a distributional level:

$$Z(\mathbf{s}') \stackrel{d}{=} \sum_{(\mathbf{s} \to \mathbf{s}') \in \mathcal{A}} Z(\mathbf{s} \to \mathbf{s}') \stackrel{d}{=} \sum_{(\mathbf{s}' \to \mathbf{s}'') \in \mathcal{A}} Z(\mathbf{s}' \to \mathbf{s}''), \tag{10}$$

where $\stackrel{d}{=}$ denotes equality between the distributions of two random variables. We thus aim to extend the flow matching constraint to a distributional one.

Among the different parametric modeling approaches for scalar random variables, it is effective to model the quantile function (Müller, 1997). The quantile function $Q_Z(\beta): [0,1]\to\mathbb{R}$ is the generalized inverse of the cumulative distribution function (CDF), where $Z$ is the random variable being represented and $\beta$ is a scalar in $[0,1]$. Without ambiguity, we also denote the $\beta$-quantile of $Z$'s distribution by $Z_\beta$. For simplicity, we assume all quantile functions are continuous in this work. The quantile function fully characterizes a distribution. For instance, the expectation can be calculated with a uniform quadrature of the quantile function: $\mathbb{E}[Z] = \int_0^1 Q_Z(\beta)\,\mathrm{d}\beta$.
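As a small numerical illustration of the identity $\mathbb{E}[Z] = \int_0^1 Q_Z(\beta)\,\mathrm{d}\beta$, the sketch below integrates the quantile function of an exponential distribution with a uniform midpoint rule. The choice of distribution, the function names, and the number of quadrature points are assumptions made for illustration only.

```python
import math

def exponential_quantile(beta, rate=2.0):
    """Quantile function (inverse CDF) of an Exponential(rate) variable."""
    return -math.log(1.0 - beta) / rate

def mean_from_quantiles(quantile_fn, num_points=100_000):
    """Approximate E[Z] as the integral of Q_Z over [0, 1] (uniform midpoint rule)."""
    step = 1.0 / num_points
    return sum(quantile_fn((i + 0.5) * step) for i in range(num_points)) * step

# The exact mean of Exponential(rate=2) is 1 / rate = 0.5.
estimate = mean_from_quantiles(exponential_quantile)
```

The midpoint rule never evaluates the quantile function at $\beta = 1$, where it diverges for unbounded distributions.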

Provided that we use neural networks to parameterize the edge flow quantiles (similar to the flow matching parameterization), are we able to represent the distribution of the marginal state flows? Luckily, the following quantile mixture proposition from Dhaene et al. (2006) provides an affirmative answer.

Proposition 2 (quantile additivity).

For any set of $M$ one-dimensional random variables $\{Z^m\}_{m=1}^M$ which share the same randomness through a common $\beta\in[0,1]$ in the way that $Z^m = Q^m(\beta), \forall m$, where $Q^m(\cdot)$ is the quantile function of the $m$-th random variable, there exists a random variable $Z^0$ such that $Z^0 \stackrel{d}{=} \sum_{m=1}^M Z^m$ and its quantile function satisfies $Q^0(\cdot) = \sum_{m=1}^M Q^m(\cdot)$.

We relegate the proof to Section B.2.
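Proposition 2 can be checked empirically. The sketch below drives two variables with a shared $\beta$, then compares empirical quantiles of their sum with the sum of the individual quantile functions. The two distributions, the sample size, and the helper names are hypothetical choices for this illustration, not part of the paper.

```python
import math
import random

def q_exponential(beta):
    """Quantile function of Exponential(1)."""
    return -math.log(1.0 - beta)

def q_uniform(beta):
    """Quantile function of Uniform(2, 5)."""
    return 2.0 + 3.0 * beta

def empirical_quantile(sorted_values, level):
    """Order-statistic estimate of the `level`-quantile from a sorted sample."""
    index = min(int(level * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]

rng = random.Random(0)
# Shared randomness: both variables are functions of the same beta.
betas = [rng.random() for _ in range(200_000)]
sums = sorted(q_exponential(b) + q_uniform(b) for b in betas)

levels = [0.1, 0.5, 0.9]
empirical = [empirical_quantile(sums, lv) for lv in levels]
predicted = [q_exponential(lv) + q_uniform(lv) for lv in levels]
```

Up to sampling noise, `empirical` and `predicted` agree. This would not hold in general if the two variables were sampled independently.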

Remark 3.

This additive property of the quantile function is essential to an efficient implementation of the distributional matching algorithm. Other distribution representation methods, by contrast, may need a considerable amount of computation to handle the summation over a series of distributions. For example, for the discrete categorical representation (Bellemare et al., 2017; Barth-Maron et al., 2018), the summation of $M$ distributions would need $M-1$ convolution operations, which is highly time-consuming.
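The contrast in Remark 3 can be sketched as follows. We assume, for illustration only, categorical distributions supported on integer atoms $0, 1, 2, \dots$ and independent summands, so that the sum is a chain of $M-1$ convolutions. With quantile values evaluated at shared levels, the sum is a single elementwise addition. All function names are hypothetical.

```python
def convolve(p, q):
    """PMF of X + Y for independent X ~ p and Y ~ q on integer atoms 0, 1, 2, ..."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out

def sum_categorical(pmfs):
    """Sum M categorical variables: requires M - 1 convolutions."""
    total = pmfs[0]
    for pmf in pmfs[1:]:
        total = convolve(total, pmf)
    return total

def sum_quantiles(quantile_values):
    """quantile_values[m][i] is the beta_i-quantile of the m-th variable.

    Under the shared-beta coupling of Proposition 2, the quantiles of the
    sum are obtained by adding the quantile values level by level.
    """
    return [sum(column) for column in zip(*quantile_values)]
```

Each convolution costs time proportional to the product of the two support sizes, and the support grows with every summand. The quantile sum stays linear in the number of levels.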

Quantile matching algorithm.

We propose to model the $\beta$-quantile of the edge flow of $\mathbf{s}\to\mathbf{s}'$ as $Z_\beta^{\log}(\mathbf{s}\to\mathbf{s}';\boldsymbol{\theta})$ with network parameter $\boldsymbol{\theta}$ on the log scale for better learning stability. A temporal-difference-like (Sutton & Barto, 2005) error $\delta$ is then constructed following Bengio et al. (2021a)'s Flow Matching loss (Equation 5), but across all quantiles:

$$\delta^{\beta,\tilde{\beta}}(\mathbf{s}';\boldsymbol{\theta}) = \log \sum_{(\mathbf{s}'\to\mathbf{s}'')\in\mathcal{A}} \exp Z_{\tilde{\beta}}^{\log}(\mathbf{s}'\to\mathbf{s}'';\boldsymbol{\theta}) - \log \sum_{(\mathbf{s}\to\mathbf{s}')\in\mathcal{A}} \exp Z_{\beta}^{\log}(\mathbf{s}\to\mathbf{s}';\boldsymbol{\theta}), \tag{11}$$

where $\beta,\tilde{\beta}\in[0,1]$ and $\boldsymbol{\theta}$ is the model parameter. This calculation is still valid because both $\log$ and $\exp$ are monotonic operations and thus do not affect quantiles.
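The error in Equation 11 is a difference of two log-sum-exp terms. A minimal sketch, assuming the log-scale quantile values on the outgoing and incoming edges have already been produced by some model, could look as follows. The function and argument names are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def quantile_flow_error(log_out_quantiles, log_in_quantiles):
    """delta^{beta, beta~}(s') as in Equation 11.

    log_out_quantiles: log-scale beta~-quantiles of flows on edges leaving s'
    log_in_quantiles:  log-scale beta-quantiles of flows on edges entering s'
    """
    return logsumexp(log_out_quantiles) - logsumexp(log_in_quantiles)

# Incoming quantile flows 1 and 3, outgoing quantile flows 2 and 2:
# both sides sum to 4, so the error vanishes.
delta_balanced = quantile_flow_error(
    [math.log(2.0), math.log(2.0)], [math.log(1.0), math.log(3.0)]
)
# Outgoing flow twice as large as incoming flow: error is log 2.
delta_excess = quantile_flow_error([math.log(8.0)], [math.log(4.0)])
```

Subtracting the maximum before exponentiating is a common stabilization when flows span many orders of magnitude on the log scale.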

Notice that we aim to learn the quantiles rather than the average, thus we resort to quantile regression (Koenker, 2005) and minimize the pinball error $\rho_\beta(\delta) \triangleq \left|\beta - \mathbbm{1}\{\delta<0\}\right| \ell(\delta)$, where $\ell(\cdot)$ is usually the $\ell_1$ norm or its smooth alternative. In summary, we propose the following quantile matching (QM) objective for GFlowNet learning:

$$\mathcal{L}_{\text{QM}}(\mathbf{s};\boldsymbol{\theta}) = \frac{1}{\tilde{N}} \sum_{i=1}^{N} \sum_{j=1}^{\tilde{N}} \rho_{\beta_i}\left(\delta^{\beta_i,\tilde{\beta}_j}(\mathbf{s};\boldsymbol{\theta})\right), \tag{12}$$

where $\beta_i, \tilde{\beta}_j$ are sampled i.i.d. from the uniform distribution $\mathcal{U}[0,1], \forall i,j$. Here $N,\tilde{N}$ are two integer-valued hyperparameters. The average over $\tilde{\beta}_j$ makes the distribution matching valid; see the analysis in Section B.3. During the inference (i.e., sampling for generation) phase, the forward policy is estimated through numerical integration:

$$P_F(\mathbf{s}'|\mathbf{s}) = \frac{\mathbb{E}\left[Z(\mathbf{s}\to\mathbf{s}')\right]}{\sum_{(\mathbf{s}\to\tilde{\mathbf{s}})\in\mathcal{A}} \mathbb{E}\left[Z(\mathbf{s}\to\tilde{\mathbf{s}})\right]} \propto \mathbb{E}\left[Z(\mathbf{s}\to\mathbf{s}')\right] \approx \frac{1}{N}\sum_{i=1}^{N} \exp\left(Z^{\log}_{\beta_i}(\mathbf{s}\to\mathbf{s}';\boldsymbol{\theta})\right), \tag{13}$$

where $\beta_i\sim\mathcal{U}[0,1], \forall i$. We summarize the algorithmic details in Algorithm 1. We remark that due to Jensen's inequality, this approximation is smaller than its true value.
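The Monte Carlo estimate in Equation 13 can be sketched as below. Here `log_quantile_fn` is a hypothetical stand-in for the learned $Z^{\log}_\beta(\mathbf{s}\to\mathbf{s}';\boldsymbol{\theta})$ at a fixed state, and sharing the same sampled $\beta_i$ across all children is an assumption of this sketch rather than something the text specifies.

```python
import math
import random

def forward_policy(log_quantile_fn, children, num_samples=8, rng=random):
    """Estimate P_F(. | s) over `children` following Equation 13.

    log_quantile_fn(child, beta) returns the log-scale beta-quantile of
    the edge flow from the current state to `child`.
    """
    betas = [rng.random() for _ in range(num_samples)]
    mean_flows = [
        sum(math.exp(log_quantile_fn(child, b)) for b in betas) / num_samples
        for child in children
    ]
    total = sum(mean_flows)
    return [flow / total for flow in mean_flows]

# Toy quantile function (hypothetical): child "b" carries three times
# the flow of child "a" at every quantile level.
scale = {"a": 1.0, "b": 3.0}

def toy_log_quantile(child, beta):
    return math.log(scale[child]) + 0.1 * (beta - 0.5)

probs = forward_policy(toy_log_quantile, ["a", "b"], rng=random.Random(0))
```

Because the flows in the toy example differ only by a constant factor, the normalized probabilities are 0.25 and 0.75 regardless of which $\beta_i$ are drawn.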

Algorithm 1 GFlowNet quantile matching (QM) algorithm
Require: GFlowNet quantile flow $Z_\beta(\mathbf{s}\to\mathbf{s}';\boldsymbol{\theta})$ with parameters $\boldsymbol{\theta}$, target reward oracle.
repeat
   Sample trajectory $\tau$ with the forward policy $P_F(\cdot|\cdot)$ estimated by Equation 13;
   $\triangle\boldsymbol{\theta} \leftarrow \sum_{\mathbf{s}\in\tau} \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\text{QM}}(\mathbf{s};\boldsymbol{\theta})$ (as per Equation 12);
   Update 𝜽𝜽{\boldsymbol{\theta}}bold_italic_θ with some optimizer;
until some convergence condition
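The per-state loss used in the update step of Algorithm 1 can be sketched without any deep learning framework. The sketch below evaluates Equation 12 for one state, using the Huber function as one possible smooth alternative to the $\ell_1$ norm. The callables `log_in_fn` and `log_out_fn`, the threshold `kappa`, and the default sample counts are hypothetical. A real implementation would compute gradients with an automatic differentiation library.

```python
import math
import random

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def pinball(delta, beta, kappa=1.0):
    """|beta - 1{delta < 0}| * l(delta), with l a Huber function scaled by 1 / kappa."""
    abs_delta = abs(delta)
    if abs_delta <= kappa:
        smooth = 0.5 * delta * delta / kappa
    else:
        smooth = abs_delta - 0.5 * kappa
    return abs(beta - (1.0 if delta < 0 else 0.0)) * smooth

def qm_loss(log_in_fn, log_out_fn, n=4, n_tilde=4, rng=random):
    """Monte Carlo estimate of Equation 12 for a single state.

    log_in_fn(beta)  -> log-scale beta-quantiles on the incoming edges
    log_out_fn(beta) -> log-scale beta-quantiles on the outgoing edges
    """
    betas = [rng.random() for _ in range(n)]
    betas_tilde = [rng.random() for _ in range(n_tilde)]
    total = 0.0
    for b in betas:
        log_inflow = logsumexp(log_in_fn(b))
        for bt in betas_tilde:
            delta = logsumexp(log_out_fn(bt)) - log_inflow  # Equation 11
            total += pinball(delta, b)
    return total / n_tilde

# Deterministic, balanced flows: inflow 1 + 3 equals outflow 2 + 2.
balanced_loss = qm_loss(
    lambda beta: [math.log(1.0), math.log(3.0)],
    lambda beta: [math.log(2.0), math.log(2.0)],
    rng=random.Random(0),
)
```

When every quantile of the inflow matches every quantile of the outflow, as in the deterministic example, the loss is zero.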

The proposed GFlowNet learning algorithm is independent of the specific quantile function modeling method. In the literature, both explicit (Dabney et al., 2017) and implicit (Dabney et al., 2018) methods have been investigated. In practice we choose the implicit quantile network (IQN) implementation due to its light-weight property and powerful expressiveness. We discuss the details about implementation and efficiency, and conduct an ablation study in Section C.1.

4 Risk Sensitive Flows

The real world is full of uncertainty. To cope with stochasticity, financial mathematicians use various kinds of risk measures to value the amount of assets to reserve. Concretely speaking, a risk measure is a mapping from a random variable to a real number with certain properties (Artzner et al., 1999). Commonly adopted risk measures such as the mean or median do not convey the risk / stochasticity information well. In this work, we consider a special family of risk measures, namely the distortion risk measure (Hardy, 2002; Balbás et al., 2009):

$$\mathbb{E}^g\left[Z\right] = \int_0^1 Q_Z(g(\beta))\,\mathrm{d}\beta, \tag{14}$$

where $g:[0,1]\to[0,1]$ is a monotone distortion function. (Footnote 3: In some literature, a different but equivalent notation is used: $\int_0^1 Q_Z(\beta)\,\mathrm{d}g(\beta) = \int_0^1 Q_Z(g^{-1}(\beta))\,\mathrm{d}\beta$.)

Different distortion functions indicate different risk-sensitive effects. In this work, we focus on the following classes of distortion functions. The first, the cumulative probability weighting (CPW) function (Tversky & Kahneman, 1992; Gonzalez & Wu, 1999), reads

$$g(\beta;\eta) = \frac{\beta^{\eta}}{\left(\beta^{\eta} + (1-\beta)^{\eta}\right)^{1/\eta}}, \tag{15}$$

where $\eta>0$ is a scalar parameter controlling its behaviour. We then consider another distortion risk measure family proposed by Wang (2000) as follows,

$$g(\beta;\eta) = \Phi\left(\Phi^{-1}(\beta)+\eta\right), \tag{16}$$

where $\Phi(\cdot)$ is the CDF of the standard normal distribution. When $\eta>0$, this distortion function is concave (it lies above the identity, $g(\beta)\ge\beta$) and thus produces risk-seeking behaviours, and vice versa for $\eta<0$. Last but not least, we consider the conditional value-at-risk (Rockafellar et al., 2000, CVaR): $g(\beta;\eta)=\eta\beta$, where $\eta\in[0,1]$. CVaR measures the mean of the lowest $100\times\eta$ percent of the data and is suitable for risk-averse modeling.
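The three distortion functions above are simple enough to write down directly. The following sketch uses the standard library's `statistics.NormalDist` for $\Phi$ and $\Phi^{-1}$. The function names are our own, and `wang` assumes $0 < \beta < 1$ because the inverse normal CDF is undefined at the endpoints.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def cpw(beta, eta):
    """Cumulative probability weighting, Equation 15 (eta > 0)."""
    numerator = beta ** eta
    return numerator / (numerator + (1.0 - beta) ** eta) ** (1.0 / eta)

def wang(beta, eta):
    """Wang transform, Equation 16; requires 0 < beta < 1."""
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(beta) + eta)

def cvar(beta, eta):
    """CVaR distortion: maps beta into [0, eta], the lowest eta-fraction of levels."""
    return eta * beta
```

For instance, `wang(0.5, -0.75)` falls below 0.5, so a uniformly sampled level is pushed towards lower quantiles, whereas `wang(0.5, 0.75)` lies above 0.5.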

Provided a distortion risk measure $g$, Equation 13 now reads $P_F^g(\mathbf{s}'|\mathbf{s}) \propto \mathbb{E}^g\left[Z(\mathbf{s}\to\mathbf{s}')\right]$, which can be estimated by Equation 17, where $\beta_i\sim\mathcal{U}[0,1], \forall i$.

$$\frac{1}{N}\sum_{i=1}^{N} \exp\left(Z^{\log}_{g(\beta_i)}(\mathbf{s}\to\mathbf{s}';\boldsymbol{\theta})\right). \tag{17}$$
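A minimal sketch of the estimator in Equation 17 follows. It evaluates a log-scale quantile function at distorted levels $g(\beta_i)$ and averages the exponentials. The toy quantile function, for which the flow is uniform on $[1, 2]$, and all names are hypothetical and chosen so that the result can be checked by hand.

```python
import math
import random

def distorted_flow_estimate(log_quantile_fn, distortion, num_samples=1000, rng=random):
    """Estimate E^g[Z] as in Equation 17 from a log-scale quantile function."""
    betas = [rng.random() for _ in range(num_samples)]
    return sum(math.exp(log_quantile_fn(distortion(b))) for b in betas) / num_samples

def toy_log_quantile(beta):
    """Log-scale quantile of a flow that is uniform on [1, 2]: Q(beta) = 1 + beta."""
    return math.log(1.0 + beta)

# Risk-neutral estimate: identity distortion, analytically 1.5.
neutral = distorted_flow_estimate(toy_log_quantile, lambda b: b, rng=random.Random(0))
# CVaR(0.1): only the lowest 10% of quantile levels, analytically 1.05.
averse = distorted_flow_estimate(toy_log_quantile, lambda b: 0.1 * b, rng=random.Random(0))
```

The risk-averse estimate is lower than the risk-neutral one because it averages only the lower tail of the flow distribution.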
Risk-averse quantile matching.
Figure 2: A risky hypergrid environment.

We now consider a risk-sensitive task adapted from the hypergrid domain. The risky hypergrid is a variant of the hypergrid task studied in Bengio et al. (2021a) (a more detailed description can be found in Section 5.1), whose goal is to discover diverse modes while avoiding risky regions in a $D$-dimensional map with size $H$. We illustrate a $2$-dimensional hypergrid in Figure 2 as an example, where yellow denotes a high-reward region (i.e., non-risky modes) and green denotes a region with stochasticity (i.e., risky modes). Entering a risky region incurs a very low reward with a small probability. Beyond that, a risky mode behaves the same as the normal modes (the upper-left and bottom-right modes in Figure 2). As investigated by Deming et al. (1945); Arrow (1958); Hellwig (2004), humans tend to be conservative during decision-making. Therefore, in this task we propose to combine the QM algorithm with a risk-averse distortion risk measure to avoid entering risky regions while maintaining performance.

Figure 3: Experiment results on stochastic risky hypergrid problems with different risk-sensitive policies. Panels: (a) small, (b) large, (c) small, (d) large. Top: CVaR$(0.1)$ and Wang$(-0.75)$ induce risk-averse policies, thus achieving smaller violation rates. Bottom: Risk-sensitive methods achieve performance similar to the other baselines with regard to the number of non-risky modes captured, indicating that the proposed conservative method does not hurt the standard performance.
Figure 4: Experiment results on the hypergrid tasks for different scale levels. Top: the $\ell_1$ error between the learned distribution density and the true target density. Bottom: the number of discovered modes across the training process. The proposed quantile matching algorithm achieves the best results across different hypergrid scales under both quantitative metrics.

We compare the risk-sensitive quantile matching with its risk-neutral variant and the standard GFlowNet (flow matching). We quantify their performance by the violation rate, i.e., the probability of entering the risky regions, on tasks of different sizes (small and large). We also evaluate each method in terms of the number of standard (non-risky) modes discovered during the course of training. As shown in Figure 3(a-b), CVaR$(0.1)$ and Wang$(-0.75)$ lead to smaller violation rates; CVaR$(0.1)$ is the most conservative and achieves the smallest violation rate, as it only focuses on the lowest $10\%$ of the data. Notice that CPW$(0.71)$ performing similarly to risk-neutral QM and FM (i.e., the baseline) matches the theory, since its distortion function is neither concave nor convex. This indicates that the risk-sensitive flows can effectively capture the uncertainty in this environment and prevent the agent from going into dangerous areas. Figure 3(c-d) shows the number of non-risky modes discovered by each algorithm; all of them discover all the modes and thus achieve competitive performance. The results validate that the risk-averse CVaR quantile matching algorithm is able to discover high-quality and diverse solutions while avoiding the risky regions. We relegate more details to Section C.2.

5 Benchmarking Experiments

The proposed method has been demonstrated to be able to capture the uncertainty in stochastic environments. In this section, we evaluate its performance on deterministic structured generation benchmarks. These tasks are challenging due to their exponentially large combinatorial search space, thus requiring efficient exploration and good generalization from past experience. In this work, all experiments are run on NVIDIA Tesla V100 Volta GPUs, and results are averaged across $4$ random seeds.

5.1 Hypergrid

We investigate the hypergrid task from Bengio et al. (2021a). The state space is a $D$-dimensional hypergrid of size $H\times\cdots\times H$, with $H$ being the side length of the grid, and the agent is required to plan over a long horizon and learn from sparse reward signals. The agent starts at one corner and navigates by incrementing one of the coordinates by $1$ at each step. A special termination action is also available at each state. The agent receives a reward when it decides to stop. The reward function is defined by

$$R\left(\mathbf{x}\right) = R_0 + R_1 \prod_{d=1}^{D} \mathbb{I}\left[\left|\frac{\mathbf{x}_d}{H-1}-0.5\right| \in (0.25, 0.5]\right] + R_2 \prod_{d=1}^{D} \mathbb{I}\left[\left|\frac{\mathbf{x}_d}{H-1}-0.5\right| \in (0.3, 0.4)\right], \tag{18}$$

where $\mathbb{I}$ is the indicator function and $R_0=0.001, R_1=0.5, R_2=2$. From the formula, we can see that there are $2^D$ modes for each task, where a mode refers to a local region (which could contain one or multiple states) that achieves the maximum reward value.
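Equation 18 translates directly into code. The sketch below is an illustrative implementation with hypothetical function and variable names. It also enumerates a small $8\times 8$ grid to confirm that the maximum reward is attained in $2^D = 4$ places when $D=2$. At this size each mode happens to be a single state.

```python
from itertools import product

def hypergrid_reward(x, H, R0=0.001, R1=0.5, R2=2.0):
    """Reward of Equation 18 for a state x (integer coordinates in 0..H-1)."""
    offsets = [abs(x_d / (H - 1) - 0.5) for x_d in x]
    in_outer = all(0.25 < d <= 0.5 for d in offsets)  # first indicator product
    in_inner = all(0.3 < d < 0.4 for d in offsets)    # second indicator product
    return R0 + R1 * in_outer + R2 * in_inner

# Enumerate a small 2-dimensional grid and collect the maximum-reward states.
H, D = 8, 2
rewards = {x: hypergrid_reward(x, H) for x in product(range(H), repeat=D)}
best = max(rewards.values())
mode_states = sorted(x for x, r in rewards.items() if r == best)
```

With $H=8$, the maximum-reward states are `(1, 1)`, `(1, 6)`, `(6, 1)` and `(6, 6)`, one near each corner of the grid.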

The environment is designed to test the GFlowNet's ability to discover diverse modes and generalize from past experience. We use the $\ell_1$ error between the learned probability density function and the ground-truth probability density function as an evaluation metric. In addition, the number of modes discovered during the training phase is used to quantify the exploration ability. We compare the QM algorithm with the previous FM and TB methods, and we also include non-GFlowNet baselines such as MCMC.

Figure 4 demonstrates the efficacy of QM on tasks with different scale levels, from $8\times 8\times 8$ to $20\times 20\times 20\times 20$. We notice that TB has an advantage over FM on small-scale problems ($8\times 8\times 8$) in the sense of lower error, but is not as good as FM and QM on larger-scale tasks. Regarding the speed of mode discovery, QM is the fastest algorithm with regard to the time used to reach all the diverse modes. We also test PPO (Schulman et al., 2017) on this problem, but find that it hardly converges at our scale level in the sense of the measured error, thus we do not plot its curve. We also examine the exploration ability under extremely sparse scenarios in Figure 9; see Section C.2 for details. To demonstrate the advantage over distributional RL baselines, we compare a distributional RL method with QM on the hypergrid in Figure 8.

5.2 Sequence generation

Figure 5: The number of modes reached by each algorithm across the whole training process for the sequence generation task. QM outperforms other baselines in terms of sample efficiency.

In this task, we aim to generate binary bit sequences in an autoregressive way. The length of the sequence is fixed to $120$ and the vocabulary for each token is simply the binary set, making the space $\{0,1\}^{120}$. The reward function is defined as the exponential of the negative distance to a predefined multimodal sequence set, whose definition can be found in the Appendix. Each action appends one token to the existing partial sequence. Therefore, generation is autoregressive, and each GFlowNet state corresponds to only one generation path. We compare the proposed quantile matching algorithm with previous GFlowNet algorithms (TB and FM), together with RL-based (A2C (Mnih et al., 2016), Soft Actor-Critic (Haarnoja et al., 2018)) and MCMC-based (Xie et al., 2021) methods. In this problem setup, it is intractable to enumerate all possible states to calculate the error between the learned and target density functions, thus we only report the number of discovered modes. We define finding a mode as reaching an edit distance of less than 28 to it. Figure 5 shows the number of modes that each method finds with regard to exploration steps, demonstrating that QM achieves the best sample efficiency in the sense of finding diverse modes. We take the baseline results from Malkin et al. (2022a). We relegate other information to Section C.3, including the average reward for top candidates in Figure 10.
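The mode-discovery criterion can be sketched as follows. We assume here that the edit distance is the Levenshtein distance. The mode set used in the example is a hypothetical placeholder, since the actual set is defined in the Appendix.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (token_a != token_b)  # substitution or match
            ))
        previous = current
    return previous[-1]

def nearest_mode(sequence, modes, threshold=28):
    """Index of the closest mode if its edit distance is below `threshold`, else None."""
    distances = [edit_distance(sequence, mode) for mode in modes]
    best = min(range(len(modes)), key=distances.__getitem__)
    return best if distances[best] < threshold else None

# Hypothetical mode set of two length-120 bit strings.
example_modes = ["0" * 120, "1" * 120]
```

Counting the distinct indices returned by `nearest_mode` over all sampled sequences then gives the number of modes discovered so far.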

5.3 Molecule optimization

We then investigate a more realistic molecule synthesis setting. The goal in this task is to synthesize diverse molecules with desired chemical properties (Figure 6(a)). Each state denotes a molecule graph structure, and the action space is a vocabulary of building blocks specified by junction tree modeling (Kumar et al., 2012; Jin et al., 2018). We follow the experimental setups, including the reward specification and episode constraints, in Bengio et al. (2021a). We compare the proposed algorithm with FM, TB, as well as MARS (an MCMC-based method) and PPO (an RL-based method).

In the real-world drug-design industry, many other properties such as drug-likeness (Bickerton et al., 2012), synthesizability (Shi et al., 2020), or toxicity should be taken into account. To this end, a promising method should be able to find diverse candidates for post selection. Consequently, we quantify the ability to search for diverse molecules by the number of modes discovered conditioned on the reward being larger than $7.5$ (see details in Section C.4), and show the result in Figure 6(b), where our QM surpasses the other baselines by a large margin. Further, we evaluate diversity by measuring the Tanimoto similarity in Figure 6(c), which demonstrates that QM is able to find the most diverse molecules. Figure 6(d) displays the average reward of the top-$100$ candidates, confirming that the proposed QM method manages to find high-quality drug structures.
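For reference, the Tanimoto similarity between two binary fingerprints is the size of their intersection divided by the size of their union. The sketch below represents a fingerprint as a set of active bit indices. This representation and the averaging over all pairs are assumptions for illustration. In practice, fingerprints would typically come from a cheminformatics toolkit.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def mean_pairwise_tanimoto(fingerprints):
    """Average similarity over all unordered pairs; lower values mean a more diverse set."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
```

A set of candidates whose fingerprints barely overlap yields a mean similarity close to zero, which corresponds to the "lower is better" reading of the diversity metric.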

6 Related Work

GFlowNets.

Since the proposal of Bengio et al. (2021a; b), the field has witnessed an increasing number of works on different aspects of GFlowNets. Malkin et al. (2022b); Zimmermann et al. (2022) analyze the connection with variational methods, with expected gradients coinciding in the on-policy case ($\pi=P_F$), and show that GFlowNets outperform variational inference with off-policy training samples; Pan et al. (2022; 2023b) develop frameworks to enable the usage of intermediate signals in GFlowNets to improve credit assignment efficiency; Jain et al. (2022b) investigate the possibility of doing multi-objective generation with GFlowNets; Pan et al. (2023c) introduce world modeling into GFlowNets; Pan et al. (2023a) propose an unsupervised learning method for training GFlowNets; Ma et al. (2023) study how to utilize isomorphism tests to reduce the flow bias in GFlowNet training. From the probabilistic modeling point of view, Zhang et al. (2022c) jointly learn an energy-based model and a GFlowNet for generative modeling, and verify its efficacy on a series of discrete data modeling tasks. That work also proposes a back-and-forth proposal, which is adopted by Kim et al. (2023) for local search. Further, Zhang et al. (2022a) analyze the modeling of different generative models, and theoretically point out that many of them are actually special cases of GFlowNets; that work also builds the connection between diffusion modeling and GFlowNets, which is adopted by continuous sampling works (Lahlou et al., 2023; Zhang et al., 2023b). Different from the above efforts, this work aims at the open problem of learning GFlowNets with a stochastic reward function. GFlowNets also show promising potential in many object generation application areas. Jain et al. (2022a) use them in biological sequence design; Deleu et al. (2022); Nishikawa-Toomey et al. (2022) leverage them for causal structure learning; Liu et al. (2022) employ them to sample structured subnetwork modules for better predictive robustness; Zhang et al. (2023a; c) utilize them for combinatorial optimization problems; Zhou et al. (2023) learn a GFlowNet for phylogenetic inference problems.

Figure 6: Molecule synthesis experiment. (a) Illustration of the GFlowNet policy. Figure adapted from Pan et al. (2022). (b) The number of modes captured by algorithms. (c) Tanimoto similarity (lower is better). (d) Average reward across the top-100 molecules.
Distributional modeling.

The full distribution contains much more information than the first-order moment (Yang et al., 2013; Imani & White, 2018). Learning from the distribution therefore benefits from more informative signals. Interestingly, a similar mechanism has also turned out to exist in human brains (Dabney et al., 2020). Specifically, in the distributional RL literature (Bellemare et al., 2023), one minimizes the distributional Bellman error in order to achieve Equation 3. Many different implementations have been proposed: categorical DQN (Bellemare et al., 2017), quantile regression DQN (Dabney et al., 2017), implicit quantile network based DQN (Dabney et al., 2018; Yang et al., 2019), and expectile regression based DQN (Rowland et al., 2019). Distributional modeling can be combined with different types of methods, such as Q-learning (the above-mentioned works), actor-critic (Ma et al., 2020), or policy gradient (Barth-Maron et al., 2018). The well-recognized Rainbow algorithm (Hessel et al., 2017) also adopts categorical DQN as an important component. In model-based RL methods, distributional modeling has also been found to boost performance (Hafner et al., 2023). In the domain of generative modeling, the idea of IQN is integrated into autoregressive models by Ostrovski et al. (2018).

7 Conclusion and Discussion

In summary, we would like to highlight that our proposed method, Quantile Matching, is not a straightforward combination of FM and distributional RL, but has considerable technical novelty.

Importance of the problem. We identify an important limitation of current GFlowNet formulations: they fail to handle stochastic rewards well, although such rewards arise in a wide range of real-world tasks, and this may limit their application. As a consequence, they fail to take the risks associated with actions into consideration, which is important in real-world applications (e.g., healthcare). Novelty of the algorithm. Different from the Bellman equation in RL, the flow consistency constraint must consider all possible parents and children of a state $\mathbf{s}$, and directly employing the techniques of Bellemare et al. (2017) does not permit efficient computation, as detailed in Remark 3. We propose quantile matching, which is justified by Proposition 2 and allows flexible computation.

What makes quantile matching prominent? Apart from the risk-sensitive modeling advantage brought by implicit quantile modeling, it is an exciting surprise that the proposed QM also surpasses previous methods in deterministic reward settings. We hypothesize the following reasons why QM benefits GFlowNet training. (a) More informative learning signals for better generalization. As more complex models, the extra non-linear quantile flows encourage the capture of additional information beyond the expected values, acting as regularization with auxiliary tasks (Lyle et al., 2019). In practical setups where it is intractable for a GFlowNet learner to see all possible trajectories, generalization matters a great deal. It is therefore important to extract as much useful, generalizable information as possible from a small number of training samples. (b) Regularization effects. It has been previously observed that GFlowNets can overfit to past trajectories and thus produce biased estimates of some flow values (Bengio et al., 2021a). Since we maintain a distribution of flows, this helps propagate useful information and improve the prediction of flow values, thus mitigating this overestimation issue and benefiting the optimization process (Imani & White, 2018). (c) Pseudo uncertainty. In settings where the actual rewards are uncertain (e.g., they are estimated from data), it makes sense to propagate reward distributions. However, even in deterministic environments, due to the complex representation of states, two different states may be incorrectly represented as the same by the network. This results in a pseudo uncertainty in the environment, similar to partial observability (McCallum & Ballard, 1996), which leads to the so-called state aliasing in control.

Acknowledgments

The authors would like to thank Marc Bellemare, Moksh Jain, and Emmanuel Bengio. Furthermore, Dinghuai Zhang would like to thank Sagrada Familia for reminding him of the existence of beauty in this world.

References

  • Arrow (1958) Kenneth J. Arrow. Essays in the theory of risk-bearing. 1958.
  • Artzner et al. (1999) Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical Finance, 9, 1999.
  • Balbás et al. (2009) Alejandro Balbás, José Garrido, and Silvia Mayoral. Properties of distortion risk measures. Methodology and Computing in Applied Probability, 11:385–399, 2009.
  • Barth-Maron et al. (2018) Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan, TB Dhruva, Alistair Muldal, Nicolas Manfred Otto Heess, and Timothy P. Lillicrap. Distributed distributional deterministic policy gradients. ArXiv, abs/1804.08617, 2018.
  • Bellemare et al. (2017) Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, 2017.
  • Bellemare et al. (2023) Marc G. Bellemare, Will Dabney, and Mark Rowland. Distributional Reinforcement Learning. MIT Press, 2023. http://www.distributional-rl.org.
  • Bengio et al. (2021a) Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, and Yoshua Bengio. Flow network based generative models for non-iterative diverse candidate generation. Neural Information Processing Systems (NeurIPS), 2021a.
  • Bengio et al. (2012) Yoshua Bengio, Grégoire Mesnil, Yann Dauphin, and Salah Rifai. Better mixing via deep representations. In International Conference on Machine Learning, 2012.
  • Bengio et al. (2021b) Yoshua Bengio, Tristan Deleu, Edward J. Hu, Salem Lahlou, Mo Tiwari, and Emmanuel Bengio. GFlowNet foundations. arXiv preprint 2111.09266, 2021b.
  • Bickerton et al. (2012) G. Richard J. Bickerton, Gaia Valeria Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L. Hopkins. Quantifying the chemical beauty of drugs. Nature chemistry, 4 2:90–8, 2012.
  • Dabney et al. (2017) Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In AAAI Conference on Artificial Intelligence, 2017.
  • Dabney et al. (2018) Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. ArXiv, abs/1806.06923, 2018.
  • Dabney et al. (2020) Will Dabney, Zeb Kurth-Nelson, Naoshige Uchida, Clara Kwon Starkweather, Demis Hassabis, Rémi Munos, and Matthew M. Botvinick. A distributional code for value in dopamine-based reinforcement learning. Nature, 577:671–675, 2020.
  • Deleu et al. (2022) Tristan Deleu, António Góis, Chris C. Emezue, Mansi Rankawat, Simon Lacoste-Julien, Stefan Bauer, and Yoshua Bengio. Bayesian structure learning with generative flow networks. Uncertainty in Artificial Intelligence (UAI), 2022.
  • Deming et al. (1945) William Edwards Deming, John von Neumann, and Oscar Morgenstern. Theory of games and economic behavior. Journal of the American Statistical Association, 40:263, 1945.
  • Desjardins et al. (2010) Guillaume Desjardins, Aaron C. Courville, Yoshua Bengio, Pascal Vincent, and Olivier Delalleau. Tempered markov chain monte carlo for training of restricted boltzmann machines. In International Conference on Artificial Intelligence and Statistics, 2010.
  • Dhaene et al. (2006) Jan Dhaene, Steven Vanduffel, Marc J. Goovaerts, Rob Kaas, Q. Tang, and David Vyncke. Risk measures and comonotonicity: A review. Stochastic Models, 22:573 – 606, 2006. URL https://api.semanticscholar.org/CorpusID:11340721.
  • Ford & Fulkerson (1956) Lester Randolph Ford and Delbert Ray Fulkerson. Maximal flow through a network. Canadian Journal of Mathematics, 8:399 – 404, 1956.
  • Gilmer et al. (2017) Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. ArXiv, abs/1704.01212, 2017.
  • Gonzalez & Wu (1999) Richard Gonzalez and George Wu. On the shape of the probability weighting function. Cognitive Psychology, 38:129–166, 1999.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. ArXiv, abs/1706.04599, 2017.
  • Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, P. Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.
  • Hafner et al. (2023) Danijar Hafner, J. Pasukonis, Jimmy Ba, and Timothy P. Lillicrap. Mastering diverse domains through world models. ArXiv, abs/2301.04104, 2023.
  • Hardy (2002) Mary R. Hardy. Distortion risk measures : Coherence and stochastic dominance. 2002.
  • Hazan et al. (2018) Elad Hazan, Sham M. Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. ArXiv, abs/1812.02690, 2018.
  • Hellwig (2004) Martin F. Hellwig. Risk aversion in the small and in the large. Max Planck Institute for Research on Collective Goods Research Paper Series, 2004.
  • Hessel et al. (2017) Matteo Hessel, Joseph Modayil, H. V. Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Gheshlaghi Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. ArXiv, abs/1710.02298, 2017.
  • Huang et al. (2016) Po-Ssu Huang, Scott E. Boyken, and David Baker. The coming of age of de novo protein design. Nature, 537:320–327, 2016.
  • Imani & White (2018) Ehsan Imani and Martha White. Improving regression performance with distributional losses. In International Conference on Machine Learning, 2018.
  • Jain et al. (2022a) Moksh Jain, Emmanuel Bengio, Alex García, Jarrid Rector-Brooks, Bonaventure F. P. Dossou, Chanakya Ajit Ekbote, Jie Fu, Tianyu Zhang, Micheal Kilgour, Dinghuai Zhang, Lena Simine, Payel Das, and Yoshua Bengio. Biological sequence design with GFlowNets. International Conference on Machine Learning (ICML), 2022a.
  • Jain et al. (2022b) Moksh Jain, Sharath Chandra Raparthy, Alex Hernández-García, Jarrid Rector-Brooks, Yoshua Bengio, Santiago Miret, and Emmanuel Bengio. Multi-objective gflownets. ArXiv, abs/2210.12765, 2022b.
  • Jin et al. (2018) Wengong Jin, Regina Barzilay, and T. Jaakkola. Junction tree variational autoencoder for molecular graph generation. ArXiv, abs/1802.04364, 2018.
  • Jumper et al. (2021) John M. Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Zídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Andy Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David A. Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis. Highly accurate protein structure prediction with alphafold. Nature, 596:583 – 589, 2021.
  • Karvanen (2006) Juha Karvanen. Estimation of quantile mixtures via l-moments and trimmed l-moments. Comput. Stat. Data Anal., 51:947–959, 2006.
  • Kendall & Gal (2017) Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In NIPS, 2017.
  • Kim et al. (2023) Minsu Kim, Taeyoung Yun, Emmanuel Bengio, Dinghuai Zhang, Yoshua Bengio, Sungsoo Ahn, and Jinkyoo Park. Local search gflownets, 2023.
  • Koenker (2005) Roger W. Koenker. Quantile regression. 2005.
  • Kumar et al. (2012) A. Kumar, Arnout R. D. Voet, and K Y J Zhang. Fragment based drug design: from experimental to computational approaches. Current medicinal chemistry, 19 30:5128–47, 2012.
  • Lahlou et al. (2023) Salem Lahlou, Tristan Deleu, Pablo Lemos, Dinghuai Zhang, Alexandra Volokhova, Alex Hernández-García, Léna Néhale Ezzine, Yoshua Bengio, and Nikolay Malkin. A theory of continuous generative flow networks. In International Conference on Machine Learning, pp. 18269–18300. PMLR, 2023.
  • Liu et al. (2022) Dianbo Liu, Moksh Jain, Bonaventure F. P. Dossou, Qianli Shen, Salem Lahlou, Anirudh Goyal, Nikolay Malkin, Chris C. Emezue, Dinghuai Zhang, Nadhir Hassen, Xu Ji, Kenji Kawaguchi, and Yoshua Bengio. Gflowout: Dropout with generative flow networks. ArXiv, abs/2210.12928, 2022.
  • Lyle et al. (2019) Clare Lyle, Pablo Samuel Castro, and Marc G. Bellemare. A comparative analysis of expected and distributional reinforcement learning. ArXiv, abs/1901.11084, 2019.
  • Ma et al. (2023) Jiangyan Ma, Emmanuel Bengio, Yoshua Bengio, and Dinghuai Zhang. Baking symmetry into gflownets, 2023.
  • Ma et al. (2020) Xiaoteng Ma, Li Xia, Zhengyuan Zhou, Jun Yang, and Qianchuan Zhao. Dsac: Distributional soft actor critic for risk-sensitive reinforcement learning. arXiv: Learning, 2020.
  • Madan et al. (2022) Kanika Madan, Jarrid Rector-Brooks, Maksym Korablyov, Emmanuel Bengio, Moksh Jain, Andrei Cristian Nica, Tom Bosc, Yoshua Bengio, and Nikolay Malkin. Learning gflownets from partial episodes for improved convergence and stability. ArXiv, abs/2209.12782, 2022.
  • Malkin et al. (2022a) Nikolay Malkin, Moksh Jain, Emmanuel Bengio, Chen Sun, and Yoshua Bengio. Trajectory balance: Improved credit assignment in GFlowNets. arXiv preprint 2201.13259, 2022a.
  • Malkin et al. (2022b) Nikolay Malkin, Salem Lahlou, Tristan Deleu, Xu Ji, Edward J. Hu, Katie Elizabeth Everett, Dinghuai Zhang, and Yoshua Bengio. Gflownets and variational inference. ArXiv, abs/2210.00580, 2022b.
  • McCallum & Ballard (1996) Andrew McCallum and Dana H. Ballard. Reinforcement learning with selective perception and hidden state. 1996.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charlie Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
  • Mnih et al. (2016) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. ArXiv, abs/1602.01783, 2016.
  • Müller (1997) Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29:429 – 443, 1997.
  • Nishikawa-Toomey et al. (2022) Mizu Nishikawa-Toomey, Tristan Deleu, Jithendaraa Subramanian, Yoshua Bengio, and Laurent Charlin. Bayesian learning of causal structure and mechanisms with gflownets and variational bayes. ArXiv, abs/2211.02763, 2022.
  • Ostrovski et al. (2018) Georg Ostrovski, Will Dabney, and Rémi Munos. Autoregressive quantile networks for generative modeling. ArXiv, abs/1806.05575, 2018.
  • Pan et al. (2022) L. Pan, Dinghuai Zhang, Aaron C. Courville, Longbo Huang, and Yoshua Bengio. Generative augmented flow networks. ArXiv, abs/2210.03308, 2022.
  • Pan et al. (2023a) Ling Pan, Moksh Jain, Kanika Madan, and Yoshua Bengio. Pre-training and fine-tuning generative flow networks. ArXiv, abs/2310.03419, 2023a. URL https://api.semanticscholar.org/CorpusID:263671952.
  • Pan et al. (2023b) Ling Pan, Nikolay Malkin, Dinghuai Zhang, and Yoshua Bengio. Better training of gflownets with local credit and incomplete trajectories. arXiv preprint arXiv:2302.01687, 2023b.
  • Pan et al. (2023c) Ling Pan, Dinghuai Zhang, Moksh Jain, Longbo Huang, and Yoshua Bengio. Stochastic generative flow networks. arXiv preprint arXiv:2302.09465, 2023c.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Rockafellar et al. (2000) R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of risk, 2:21–42, 2000.
  • Rowland et al. (2019) Mark Rowland, Robert Dadashi, Saurabh Kumar, Rémi Munos, Marc G. Bellemare, and Will Dabney. Statistics and samples in distributional reinforcement learning. In International Conference on Machine Learning, 2019.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. ArXiv, abs/1707.06347, 2017.
  • Shi et al. (2020) Chence Shi, Minkai Xu, Hongyu Guo, Ming Zhang, and Jian Tang. A graph to graphs framework for retrosynthesis prediction. ArXiv, abs/2003.12725, 2020.
  • Sutton & Barto (2005) Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. IEEE Transactions on Neural Networks, 16:285–286, 2005.
  • Tancik et al. (2020) Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. ArXiv, abs/2006.10739, 2020.
  • Teng et al. (2022) Jiaye Teng, Chuan Wen, Dinghuai Zhang, Yoshua Bengio, Yang Gao, and Yang Yuan. Predictive inference with feature conformal prediction. ArXiv, abs/2210.00173, 2022.
  • Tversky & Kahneman (1992) Amos Tversky and Daniel Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5:297–323, 1992.
  • Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, L. Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander Sasha Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom Le Paine, Caglar Gulcehre, Ziyun Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy P. Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, pp.  1–5, 2019.
  • Wang (2000) Shaun S. Wang. A class of distortion operators for pricing financial and insurance risks. Journal of Risk and Insurance, 67:15, 2000.
  • Xie et al. (2021) Yutong Xie, Chence Shi, Hao Zhou, Yuwei Yang, Weinan Zhang, Yong Yu, and Lei Li. Mars: Markov molecular sampling for multi-objective drug discovery. ArXiv, abs/2103.10432, 2021.
  • Yang et al. (2019) Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully parameterized quantile function for distributional reinforcement learning. In Neural Information Processing Systems, 2019.
  • Yang et al. (2013) Jiyan Yang, Xiangrui Meng, and Michael W. Mahoney. Quantile regression for large-scale applications. SIAM J. Sci. Comput., 36, 2013.
  • Zakeri & Syri (2015) Behnam Zakeri and Sanna Syri. Electrical energy storage systems: A comparative life cycle cost analysis. Renewable and sustainable energy reviews, 42:569–596, 2015.
  • Zhang et al. (2023a) David Zhang, Corrado Rainone, Markus Peschl, and Roberto Bondesan. Robust scheduling with GFlowNets. International Conference on Learning Representations (ICLR), 2023a. To appear.
  • Zhang et al. (2021) Dinghuai Zhang, Jie Fu, Yoshua Bengio, and Aaron C. Courville. Unifying likelihood-free inference with black-box optimization and beyond. In International Conference on Learning Representations, 2021.
  • Zhang et al. (2022a) Dinghuai Zhang, Ricky T. Q. Chen, Nikolay Malkin, and Yoshua Bengio. Unifying generative models with gflownets. ArXiv, abs/2209.02606, 2022a.
  • Zhang et al. (2022b) Dinghuai Zhang, Aaron C. Courville, Yoshua Bengio, Qinqing Zheng, Amy Zhang, and Ricky T. Q. Chen. Latent state marginalization as a low-cost approach for improving exploration. ArXiv, abs/2210.00999, 2022b.
  • Zhang et al. (2022c) Dinghuai Zhang, Nikolay Malkin, Z. Liu, Alexandra Volokhova, Aaron C. Courville, and Yoshua Bengio. Generative flow networks for discrete probabilistic modeling. International Conference on Machine Learning (ICML), 2022c.
  • Zhang et al. (2023b) Dinghuai Zhang, Ricky T. Q. Chen, Cheng-Hao Liu, Aaron C. Courville, and Yoshua Bengio. Diffusion generative flow samplers: Improving learning signals through partial trajectory optimization. ArXiv, abs/2310.02679, 2023b.
  • Zhang et al. (2023c) Dinghuai Zhang, Hanjun Dai, Nikolay Malkin, Aaron Courville, Yoshua Bengio, and Ling Pan. Let the flows tell: Solving graph combinatorial optimization problems with gflownets. arXiv preprint arXiv:2305.17010, 2023c.
  • Zhou et al. (2023) Mingyang Zhou, Zichao Yan, Elliot Layne, Nikolay Malkin, Dinghuai Zhang, Moksh Jain, Mathieu Blanchette, and Yoshua Bengio. Phylogfn: Phylogenetic inference with generative flow networks. arXiv preprint arXiv:2310.08774, 2023.
  • Zimmermann et al. (2022) Heiko Zimmermann, Fredrik Lindsten, J.-W. van de Meent, and Christian A. Naesseth. A variational perspective on generative flow networks. ArXiv, abs/2210.07992, 2022.
  • Zitnick et al. (2020) C Lawrence Zitnick, Lowik Chanussot, Abhishek Das, Siddharth Goyal, Javier Heras-Domingo, Caleb Ho, Weihua Hu, Thibaut Lavril, Aini Palizhati, Morgane Riviere, et al. An introduction to electrocatalyst design using machine learning for renewable energy storage. arXiv preprint arXiv:2010.09435, 2020.

Appendix A Summary of Notation

Symbol                                Description
$\mathcal{S}$                         state space
$\mathcal{X}$                         object (terminal state) space, subset of $\mathcal{S}$
$\mathcal{A}$                         action / transition space (edges $\mathbf{s} \to \mathbf{s}'$)
$\mathcal{G}$                         directed acyclic graph $(\mathcal{S}, \mathcal{A})$
$\mathcal{T}$                         set of complete trajectories
$\mathbf{s}$                          state in $\mathcal{S}$
$\mathbf{s}_0$                        initial state, element of $\mathcal{S}$
$\mathbf{x}$                          terminal state in $\mathcal{X}$
$\tau$                                trajectory in $\mathcal{T}$
$F: \mathcal{T} \to \mathbb{R}$       Markovian flow
$F: \mathcal{S} \to \mathbb{R}$       state flow
$F: \mathcal{A} \to \mathbb{R}$       edge flow
$P_F$                                 forward policy (distribution over children)
$P_B$                                 backward policy (distribution over parents)
$Z$                                   scalar, equal to $\sum_{\tau \in \mathcal{T}} F(\tau)$ for a Markovian flow

Appendix B Missing Details about Methodology

B.1 Proposition 1

Proof.

For the flow matching algorithm, the reward matching training loss in practice is on the log scale:

$$\left(\log \sum_{(\mathbf{s}\to\mathbf{x})\in\mathcal{A}} F(\mathbf{s}\to\mathbf{x}) - \log R(\mathbf{x})\right)^{2}.$$

We assume sufficiently large network capacity and sufficient computational resources, so that the trained GFlowNet attains the optimum of its objective. Since the reward function is stochastic, the squared loss drives the log of the in-flow to fit the expectation of the log reward, $\mathbb{E}[\log R(\mathbf{x})]$, which is the minimizer of the expected squared error. The optimization therefore has the same optimal solution as minimizing $\left(\log \sum_{(\mathbf{s}\to\mathbf{x})\in\mathcal{A}} F(\mathbf{s}\to\mathbf{x}) - \mathbb{E}\left[\log R(\mathbf{x})\right]\right)^{2}$. By the property of the flow matching algorithm (Bengio et al., 2021a, Proposition 3), the GFlowNet learns to sample with probability proportional to the reward defined by $\exp\left(\mathbb{E}[\log R(\mathbf{x})]\right)$.

For the trajectory balance algorithm, since the loss can be written as

$$\left(\log \frac{P_F(\tau)}{P_B(\tau \mid \mathbf{x})} - \log R(\mathbf{x})\right)^{2},$$

the same reasoning shows that it is equivalent to minimizing $\left(\log \frac{P_F(\tau)}{P_B(\tau \mid \mathbf{x})} - \mathbb{E}\left[\log R(\mathbf{x})\right]\right)^{2}$. By the property of the TB algorithm (Malkin et al., 2022a, Proposition 1), the GFlowNet learns to sample with probability proportional to the reward defined by $\exp\left(\mathbb{E}[\log R(\mathbf{x})]\right)$. The same reasoning applies to the detailed balance algorithm (Bengio et al., 2021b).
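
The argument can be checked numerically on a toy example. The following is a minimal sketch, not the paper's code: the two terminal states `x_a` and `x_b`, their log-normal reward noise, and the helper `fit_log_flow` are hypothetical choices made for illustration. Both states have the same expected reward $\mathbb{E}[R] = 1$, yet a squared loss on the log scale converges to $\mathbb{E}[\log R]$, so the implied sampling probabilities are proportional to $\exp(\mathbb{E}[\log R])$ rather than to $\mathbb{E}[R]$.

```python
import math
import random

random.seed(0)

def fit_log_flow(mu, sigma, steps=20000):
    """Minimize E[(log_flow - log R)^2] by SGD with a fresh noisy reward each step.

    Assumption for illustration: log R ~ Normal(mu, sigma).
    """
    log_flow = 0.0
    for t in range(1, steps + 1):
        log_r = random.gauss(mu, sigma)   # one stochastic observation of log R(x)
        grad = 2.0 * (log_flow - log_r)   # gradient of the squared log-scale loss
        log_flow -= grad / (2.0 * t)      # step size 1/(2t) yields the running mean
    return log_flow

# Hypothetical terminal states with equal mean reward E[R] = exp(mu + sigma^2 / 2) = 1.
log_flow_a = fit_log_flow(mu=0.0, sigma=0.0)    # x_a: deterministic reward R = 1
log_flow_b = fit_log_flow(mu=-0.5, sigma=1.0)   # x_b: noisy reward with E[R] = 1

# Sampling probability of x_a implied by the learned terminal log-flows.
p_a = math.exp(log_flow_a) / (math.exp(log_flow_a) + math.exp(log_flow_b))
print(f"learned log-flows: {log_flow_a:.3f}, {log_flow_b:.3f}; P(x_a) = {p_a:.3f}")
```

Under these assumptions `log_flow_b` approaches $\mathbb{E}[\log R] = -0.5$ rather than $\log \mathbb{E}[R] = 0$, so `x_a` is sampled more often than `x_b` even though both states have the same mean reward.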

B.2 Proposition 2

We first rephrase the proposition as follows.

Proposition.

For any set of $M+1$ quantile functions $\{Q_m(\cdot)\}_{m=0}^{M}$ that satisfies $Q_0(\cdot) = \sum_{m=1}^{M} Q_m(\cdot)$, there exists a set of random variables $\{Z_m\}_{m=0}^{M}$ such that $Q_m(\cdot)$ is the quantile function of $Z_m$ for all $m$, and $Z_0 \stackrel{d}{=} \sum_{m=1}^{M} Z_m$.

The following proof is inspired by Karvanen (2006).

Proof.

Let $\beta$ be uniformly distributed on $[0,1]$ and define $Z_m(\beta) := Q_m(\beta)$ for $m = 1, \ldots, M$, so that $Q_m$ is the quantile function of $Z_m$; let $Z_0$ be any random variable with quantile function $Q_0$. Then, for all $z \in \mathbb{R}$,

$$\begin{aligned}
\mathbb{P}\left(\sum_{m=1}^{M} Z_m \leq z\right) &= \mathbb{P}\left(\left\{\beta \in [0,1] : \sum_{m=1}^{M} Z_m(\beta) \leq z\right\}\right) = \mathbb{P}\left(\left\{\beta \in [0,1] : \sum_{m=1}^{M} Q_m(\beta) \leq z\right\}\right) \\
&= \sup\left\{\beta \in [0,1] : z \geq \sum_{m=1}^{M} Q_m(\beta)\right\} = \inf\left\{\beta \in [0,1] : z \leq \sum_{m=1}^{M} Q_m(\beta)\right\} \\
&= \inf\left\{\beta \in [0,1] : z \leq Q_0(\beta)\right\} = \mathbb{P}\left(Z_0 \leq z\right).
\end{aligned}$$

This indicates that $Z_0 \stackrel{d}{=} \sum_{m=1}^{M} Z_m$. ∎

For the statement in Proposition 2, since we assume all quantile functions are continuous in this work, the sum of several continuous monotonic functions, $\sum_{m=1}^{M} Q_m(\cdot)$, is also a continuous monotonic function and is thus the quantile function of some random variable. The above argument then applies.
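
The construction can be illustrated numerically. The sketch below is an illustration under assumptions, not part of the paper: the two quantile functions (`q_uniform`, `q_exponential`) and the helper `empirical_quantile` are hypothetical. Driving both variables with the same uniform draw $\beta$ (the coupling used in the proof) gives a sum whose quantile function is $Q_1 + Q_2$. An independent coupling generally does not, which is why the proposition only claims the existence of suitable random variables.

```python
import math
import random

random.seed(0)

def q_uniform(beta):
    """Quantile function of Uniform[0, 1]."""
    return beta

def q_exponential(beta):
    """Quantile function of Exponential(1)."""
    return -math.log(1.0 - beta)

def empirical_quantile(samples, beta):
    """Order-statistic estimate of the beta-quantile."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(beta * len(ordered)))]

n = 50000
# Coupling from the proof: every Z_m is evaluated at the same uniform draw.
shared = [random.random() for _ in range(n)]
comonotone_sum = [q_uniform(b) + q_exponential(b) for b in shared]
# Contrast: each Z_m gets its own independent uniform draw.
independent_sum = [
    q_uniform(random.random()) + q_exponential(random.random()) for _ in range(n)
]

beta = 0.9
q0 = q_uniform(beta) + q_exponential(beta)   # Q_0(beta) = Q_1(beta) + Q_2(beta)
gap_comonotone = abs(empirical_quantile(comonotone_sum, beta) - q0)
gap_independent = abs(empirical_quantile(independent_sum, beta) - q0)
print(f"|quantile - Q_0|: comonotone {gap_comonotone:.3f}, independent {gap_independent:.3f}")
```

The gap for the shared-draw coupling shrinks toward zero as `n` grows, while the gap for the independent coupling does not.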

B.3 Regarding the Quantile Regression Objective

Say $Z$ is a random variable and we want to get its $\beta$-quantile. To achieve this we should find $x$ to solve

$$\min_{x}\mathbb{E}_{Z}\left[\rho_{\beta}(Z-x)\right]\approx\frac{1}{\tilde{N}}\sum_{j=1}^{\tilde{N}}\rho_{\beta}(Z_{\tilde{\beta}_{j}}-x),\quad\tilde{\beta}_{j}\sim\mathcal{U}[0,1],$$

where $Z_{\tilde{\beta}_{j}}$ denotes the $\tilde{\beta}_{j}$-quantile of $Z$. Note that the target level $\beta$ is separate from the sampled levels $\{\tilde{\beta}_{j}\}_{j=1}^{\tilde{N}}$, which only serve to draw samples of $Z$.
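As a small illustration of this objective, the sketch below minimizes the Monte Carlo estimate over $x$ and compares the minimizer with the true quantile. It assumes the pinball form $\rho_\beta(u)=u(\beta-\mathbb{1}[u<0])$ and a hypothetical test distribution $Z\sim\mathcal{U}[0,10]$, whose quantile function is $10\beta$. The helper names are ours.

```python
import random

def pinball(u, beta):
    """Quantile regression loss rho_beta(u) = u * (beta - 1[u < 0])."""
    return u * (beta - (1.0 if u < 0 else 0.0))

def mean_pinball(x, samples, beta):
    """Monte Carlo estimate of E_Z[rho_beta(Z - x)]."""
    return sum(pinball(z - x, beta) for z in samples) / len(samples)

def quantile_by_pinball(samples, beta):
    # The estimate is piecewise linear and convex in x with kinks at the
    # samples, so a minimizer can be found by searching over the samples.
    return min(samples, key=lambda x: mean_pinball(x, samples, beta))

rng = random.Random(0)
# Z_{beta_tilde_j} with beta_tilde_j ~ U[0, 1]; here Q(b) = 10 * b (assumed).
samples = [10.0 * rng.random() for _ in range(500)]
estimate = quantile_by_pinball(samples, beta=0.25)
print(estimate)  # should be close to the true 0.25-quantile, 2.5
```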

Let us now turn to Equation 12. According to the above analysis, $\frac{1}{\tilde{N}}\sum_{j=1}^{\tilde{N}}\rho_{\beta_{i}}(\delta^{\beta_{i},\tilde{\beta}_{j}}(\mathbf{s};\boldsymbol{\theta}))$ guides the in-flow to learn the $\beta_{i}$-quantile of the out-flow. Summing over different $\beta_{i}\sim\mathcal{U}[0,1]$, $i=1,\ldots,N$, then matches the in-flow to the out-flow as distributions, since two distributions with the same quantile function are the same distribution. Note that $\{\beta_{i}\}_{i=1}^{N}$ are separate from $\{\tilde{\beta}_{j}\}_{j=1}^{\tilde{N}}$.
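The matching behaviour can be sketched in a toy setting. The code below is not the flow-based residual of Equation 12: as a stand-in, it assumes the residual is simply a sampled target quantile minus a learnable scalar estimate, uses a hypothetical target distribution $\mathcal{U}[0,2]$, and keeps the levels $\beta_i$ fixed instead of resampling them. It only illustrates that averaging the pinball loss over fresh $\tilde{\beta}_j$ drives each estimate towards the $\beta_i$-quantile of the target.

```python
import random

def pinball_grad_wrt_x(z, x, beta):
    """Derivative in x of rho_beta(z - x), rho_beta(u) = u * (beta - 1[u < 0])."""
    return -(beta - (1.0 if z - x < 0 else 0.0))

def target_quantile(beta):
    # Assumed target ("out-flow") distribution: U[0, 2], so Q(beta) = 2 * beta.
    return 2.0 * beta

def fit_quantiles(betas, steps=3000, n_tilde=8, lr=0.02, seed=0):
    """Stochastic subgradient descent on the averaged pinball loss."""
    rng = random.Random(seed)
    estimates = [0.0 for _ in betas]
    for _ in range(steps):
        # Fresh beta_tilde_j ~ U[0, 1] give samples of the target distribution.
        targets = [target_quantile(rng.random()) for _ in range(n_tilde)]
        for i, beta in enumerate(betas):
            grad = sum(pinball_grad_wrt_x(z, estimates[i], beta) for z in targets) / n_tilde
            estimates[i] -= lr * grad
    return estimates

betas = [0.1, 0.5, 0.9]
learned = fit_quantiles(betas)
for b, x in zip(betas, learned):
    print(b, round(x, 3), target_quantile(b))
```

Each learned value should land near $2\beta_i$, so the set of estimates traces out the target quantile function.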

Appendix C More about Experiments

C.1 Details and Ablation about Quantile Modeling

We explain the modeling of the quantile function in this subsection. For explicit modeling, the neural network directly outputs the $(i/M\times 100)$-th percentiles of some distribution, $1\leq i\leq M$. This means the network output head produces $M$-dimensional vectors. In contrast, for implicit modeling, the neural network takes an additional input $\beta\in[0,1]$ and outputs the $\beta$-quantile value (i.e., a scalar output). The latter choice provides more flexibility, as we can query an arbitrary quantile of the modeled distribution. Both modeling methods share the same quantile-regression-based algorithm, as described in Section 3.2. For implicit modeling, we use the Fourier feature (Dabney et al., 2018; Tancik et al., 2020) to augment the scalar $\beta$ input: $\phi(\beta)_{j}=\text{ReLU}\left(\sum_{i=0}^{D_{F}}\cos(\pi i\beta)w_{ij}+b_{j}\right)$, $j=1,\ldots,D_{F}$, where $D_{F}$ is the dimensionality of the Fourier feature, which is a hyperparameter.
We then use element-wise multiplication to combine the Fourier feature with the processed state representation for downstream processing; see the following subsections for the modeling details of each task. We compare the two approaches in Figure 7(b) on a $16\times 16\times 16$ hypergrid, with $M=200$ for explicit modeling and $N=8$ with $256$-dimensional Fourier features for implicit modeling. Implicit modeling performs better, so we use it in all the other experiments in this work. Regarding computation, although implicit modeling appears to need more network evaluations, the actual runtime stays similar, since the multiple calls to the implicit quantile network can be parallelized through batch-level operations, thanks to the efficient implementation of batched network inference in PyTorch (Paszke et al., 2019).
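The embedding formula above can be written out directly. This is a minimal pure-Python sketch, not the training code: the weights are random placeholders, $D_F=4$ is chosen only to keep the example small, and `state_repr` is a hypothetical processed state feature of matching width. The sum runs over $i=0,\ldots,D_F$ as in the formula, so the weight matrix has $D_F+1$ rows.

```python
import math
import random

def cosine_features(beta, d_f):
    """cos(pi * i * beta) for i = 0, ..., d_f (d_f + 1 values)."""
    return [math.cos(math.pi * i * beta) for i in range(d_f + 1)]

def embed_beta(beta, weights, bias):
    """phi(beta)_j = ReLU(sum_i cos(pi * i * beta) * w_ij + b_j)."""
    feats = cosine_features(beta, len(bias))
    out = []
    for j in range(len(bias)):
        pre = sum(feats[i] * weights[i][j] for i in range(len(feats))) + bias[j]
        out.append(max(0.0, pre))
    return out

def combine(state_repr, beta_embedding):
    """Element-wise product of state representation and Fourier feature."""
    return [s * e for s, e in zip(state_repr, beta_embedding)]

rng = random.Random(0)
D_F = 4  # illustrative; the experiments use 256
weights = [[rng.gauss(0.0, 1.0) for _ in range(D_F)] for _ in range(D_F + 1)]
bias = [0.0] * D_F
phi = embed_beta(0.3, weights, bias)
state_repr = [1.0, -2.0, 0.5, 3.0]  # hypothetical processed state feature
fused = combine(state_repr, phi)
print(fused)
```

In a batched implementation, the same computation would be applied to all sampled $\beta$ values at once, which is what keeps the runtime of implicit modeling comparable to explicit modeling.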

We also conduct an ablation study on the implicit quantile network implementation. For the $N$ and $\tilde{N}$ described in Algorithm 1, we try different values on a $16\times 16\times 16$ hypergrid with $256$-dimensional Fourier features. Figure 7(c) indicates that the performance of the QM algorithm is robust to the choice of $N$ (we always set $N=\tilde{N}$ for simplicity). For the dimension of the Fourier feature, we conduct an ablation study shown in Figure 7(d), again on a $16\times 16\times 16$ hypergrid with $N=8$. The result shows that QM is also robust to the choice of the Fourier feature dimension.

C.2 Hypergrid

Figure 7: Hypergrid figures. (a) Illustration of the target reward function for an $8\times 8$ hypergrid, where a darker colour means a higher reward. Figure adapted from Malkin et al. (2022a). (b) Ablation study between implicit and explicit modeling of the quantile function; the implicit way achieves better sample efficiency. (c) Ablation study on the number of sampled $\beta$ levels. (d) Ablation study on the number of Fourier features used.
Figure 8: Experiment results of IQN on the hypergrid tasks at different scale levels, panels (a) to (h) (the red dashed line corresponds to the final result of Quantile Matching). The first row shows the empirical $L_{1}$ error, while the second row shows the number of modes. As shown, IQN underperforms Quantile Matching by a large margin in terms of both convergence and diversity.

An illustration of the reward landscape for $D=2$, $H=8$ is shown in Figure 7(a). The probability density function of the learned model is empirically estimated from the past $200000$ visited states. The GFlowNet networks are three-layer MLPs with $256$ hidden dimensions and Leaky ReLU activations, taking one-hot state representations as inputs. FM uses an MLP to model the edge flows. TB uses an MLP to output the logits of the forward and backward policies at the same time. QM uses a three-layer MLP similar to that of FM: the state input first goes through one layer, is multiplied element-wise with the Fourier feature, and then goes through two linear layers. All methods are optimized with Adam. Regarding hyperparameters, we do not do much sweeping: QM uses the same learning rate as FM, which is $1\times 10^{-4}$; in addition, QM uses $N=\tilde{N}=8$ and $256$-dimensional Fourier features. Other baselines such as TB, MCMC, and PPO use the same configuration as in Malkin et al. (2022a).

Figure 9: Hypergrid results with extremely sparse signals for $3$ GFlowNet methods: (a) quantile matching, (b) flow matching, (c) trajectory balance. We find that TB is easily affected by sparse reward setups and gives highly unstable performance, while QM behaves stably across different levels of reward landscape.
Risky hypergrid domain.

We set $H=8$ and $D=2$ or $4$ for the small or large environment, respectively. An illustration of the risky hypergrid with $D=2$ is shown in Figure 2. Reaching the risky regions (located at the bottom-left and top-right corners of the grid for $D=2$, and symmetrically for $D=4$) triggers a low reward of $0.1$ with probability $30\%$; otherwise the agent obtains the original reward of $2.6$ or $0.6$. Risk-neutral agents (flow matching and quantile matching) may still enter the risky regions, while risk-averse agents should avoid these areas as much as possible. We track the number of modes discovered by each method during training and also evaluate the violation rate, which is computed as the number of times the agent entered the risky regions over a number of past samples. Experiments in the risky hypergrid domain follow the same hyperparameter setup as described in the above section.
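The stochastic reward and the violation rate can be sketched as follows. The probability $0.3$ and the low reward $0.1$ come from the text; the exact extent of the risky regions is not specified here, so `in_risky_region` with its `width` parameter is a hypothetical stand-in (all coordinates in the lowest or all in the highest `width` cells).

```python
import random

RISKY_REWARD = 0.1  # low reward triggered inside risky regions
RISK_PROB = 0.3     # probability that the low reward is triggered

def in_risky_region(state, H=8, width=2):
    # Hypothetical region definition: two opposite corners of the grid.
    low = all(c < width for c in state)
    high = all(c >= H - width for c in state)
    return low or high

def risky_reward(state, base_reward, rng, H=8):
    """Stochastic reward: low reward with probability RISK_PROB in risky regions."""
    if in_risky_region(state, H) and rng.random() < RISK_PROB:
        return RISKY_REWARD
    return base_reward

def violation_rate(states, H=8):
    """Fraction of sampled terminal states that fall in a risky region."""
    return sum(in_risky_region(s, H) for s in states) / len(states)

rng = random.Random(0)
draws = [risky_reward((0, 0), 2.6, rng) for _ in range(10000)]
mean_risky = sum(draws) / len(draws)
print(mean_risky)  # expected value is 0.3 * 0.1 + 0.7 * 2.6 = 1.85
print(violation_rate([(0, 0), (7, 7), (3, 4), (0, 7)]))
```

A risk-neutral learner only sees the mean of this reward, while a distributional learner also has access to its lower quantiles, which is what a risk-averse policy can exploit.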

Exploration with extremely sparse signals.

We also investigate the setup with an extremely sparse learning signal, where we assign a very small value (from $1\times 10^{-4}$ to $1\times 10^{-9}$) to $R_{0}$ in Equation 18. In this part we use a $3$-dimensional grid with $H=8$. When $R_{0}$ is extremely small, the agent can hardly get any learning signal most of the time, as the reward is near zero in most areas of the hypergrid (the reward is high near the modes, but in high dimensions the modes occupy a very small area). Our results show that the proposed QM method is much more robust to changes in the sparsity of the reward landscape, while the exploration ability of both FM and TB is easily affected by sparse rewards.

C.3 Sequence generation

The reward is defined as $R(\mathbf{x})=\max_{\mathbf{m}\in M}\exp\{-\text{dist}(\mathbf{x},\mathbf{m})\}$, where $M$ is a pre-generated set of sequences and the distance is the Levenshtein distance. The set is constructed by randomly combining symbols from $\{\text{'00000000'},\text{'11111111'},\text{'11110000'},\text{'00001111'},\text{'00111100'}\}$. We re-generate the set of target sequences with the same generation protocol as Malkin et al. (2022a). For exploration, the forward policy is replaced by a uniform distribution with $0.5\%$ probability for all methods. The reward exponent hyperparameter is set to $3$. Regarding the network implementation, we use a transformer with $3$ hidden layers and $8$ attention heads. For evaluation, we also plot the top-$100$ reward curves of the three GFlowNet methods in Figure 10(a), using our own experimental results. We do not use the correlation between log reward and model log-likelihood on a given dataset as in Malkin et al. (2022a), because we find that the dataset does not cover the diverse modes appropriately, so the correlation sometimes even peaks at initialization; see Figure 10(b).
Although the final rewards are similar among all three methods, the proposed QM reaches a plateau in the shortest time.
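The reward definition above can be sketched with a standard dynamic-programming Levenshtein distance. The mode set `modes` below is a tiny hypothetical example rather than the generated set $M$, and the reward exponent is not applied.

```python
import math

def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def reward(x, modes):
    """R(x) = max over modes m of exp(-dist(x, m))."""
    return max(math.exp(-levenshtein(x, m)) for m in modes)

# Hypothetical mode set built from two of the symbols listed above.
modes = ["0000000011111111", "1111000000111100"]
print(reward("0000000011111111", modes))  # exact match with a mode
print(reward("0000000011111110", modes))  # one substitution away from a mode
```

The reward decays exponentially with the edit distance to the nearest mode, so sequences only a few edits away from a mode still carry a usable learning signal.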

All methods are optimized with the Adam optimizer for $50000$ training steps with a minibatch size of $16$. We use a fixed random action probability of $0.005$. For all the baselines, we take the results from Malkin et al. (2022a). For quantile matching, we use a two-layer MLP to process the Fourier feature of $\beta$ and then compute its element-wise product with the transformer encoding feature; as for hyperparameters, we use the same learning rate ($5\times 10^{-4}$) as FM, $N=\tilde{N}=16$, and $256$-dimensional Fourier features.

Figure 10: (a) Top-$100$ average reward for the sequence generation task. (b) Instability of the correlation between model log likelihood and log of true reward for the sequence generation task. (c) Top-$10$ average reward for the molecule synthesis task.

C.4 Molecule synthesis

We train a proxy model to predict the normalized negative binding energy to the 4JNC inhibitor of the soluble epoxide hydrolase (sEH) protein, which serves as the reward. The number of actions ranges between $100$ and $2000$, making $|\mathcal{X}|\approx 10^{16}$. For the neural network architecture, we use a message passing neural network (MPNN; Gilmer et al., 2017) for all models. Apart from the results in the main text, we also show top-$10$ reward plots in Figure 10(c).

To evaluate the number of modes, we maintain a set and add a newly discovered molecule to it if its reward is larger than $7.5$ and its Tanimoto similarity with every previous element of the set is smaller than $0.7$. Note that this criterion is stricter than simply counting the number of different Bemis-Murcko scaffolds that reach the reward threshold (Bengio et al., 2021a; Pan et al., 2022). Pan et al. (2022) experiment on a modified and thus different task, where the signal is sparsified to make the importance of active exploration prominent; we therefore do not include it in this work. Counting Tanimoto-separated modes is more applicable to de novo molecule design, while the scaffold-based metric is more appropriate for lead optimization. Further, counting Tanimoto-separated modes is a stricter metric, as can be seen from Figure 14 and Figure 15 of Bengio et al. (2021a). We measure the Tanimoto similarity for the top-$1000$ molecules as in Bengio et al. (2021a). TB uses a uniform backward policy as in Malkin et al. (2022a), as it provides better results. For baseline setups we simply follow the hyperparameters from Bengio et al. (2021a) and Malkin et al. (2022a). For quantile matching we use the same Adam learning rate ($5\times 10^{-4}$) as FM, $N=\tilde{N}=16$, and $256$-dimensional Fourier features.
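The mode-counting rule described above can be sketched as follows. Fingerprints are represented here as plain Python sets of on-bit indices, which is a simplifying assumption; in practice they would come from a cheminformatics toolkit. The thresholds $7.5$ and $0.7$ are the ones stated in the text, and the candidate molecules are made up for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity |A & B| / |A | B| between two sets of on-bits."""
    union = fp_a | fp_b
    if not union:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / len(union)

def update_modes(modes, fingerprint, reward, reward_threshold=7.5, sim_threshold=0.7):
    """Add the molecule as a new mode if it passes both criteria."""
    if reward <= reward_threshold:
        return False
    if all(tanimoto(fingerprint, m) < sim_threshold for m in modes):
        modes.append(fingerprint)
        return True
    return False

# Hypothetical candidates: (fingerprint on-bits, proxy reward).
candidates = [
    ({1, 2, 3, 4}, 8.0),     # first high-reward molecule: new mode
    ({1, 2, 3, 4, 5}, 8.1),  # similarity 0.8 to the first: rejected
    ({10, 11, 12}, 7.9),     # dissimilar and high reward: new mode
    ({20, 21}, 7.0),         # reward below threshold: rejected
]
modes = []
accepted = [update_modes(modes, fp, r) for fp, r in candidates]
print(len(modes), accepted)
```

Because each accepted molecule is compared against all existing modes, the count depends on the order of discovery, which is natural when modes are tracked during training.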

There are two ways of counting the modes of molecules: (i) distinguishing molecules by their Tanimoto similarity; (ii) checking whether the Bemis-Murcko scaffolds (i.e., simplified representations denoting the core structure of molecules) of the molecules differ. For de novo drug discovery, one should use the first way because it captures more biological information, while the second way is mostly used in lead optimization. Therefore, we follow the first option for reporting the results. We remark that in Bengio et al. (2021a) the authors chose the second way due to unfamiliarity with the domain at the time of project preparation.