1.1 Stochastic Differential Equation (Noising) View of DBMs
Standard diffusion-based models (DBMs) use a fixed forward noising process defined by the stochastic differential equation (SDE) (Sohl-Dickstein et al., 2015; Song & Ermon, 2019):
\[d \mathbf{x} = \mathbf{f}(\mathbf{x}, t)\, dt + \mathbf{G}(\mathbf{x}, t) \cdot d \mathbf{W},\]to transform an initial data distribution \(p_0(\mathbf{x}) = p_{\mathrm{data}}(\mathbf{x})\) into an isotropic Gaussian. Here, one typically assumes linear drift and monotonically increasing isotropic noise (i.e., \(\mathbf{f} = f(t)\, \mathbf{x}\) and \(\mathbf{G} = \sigma(t)\, \mathbf{I}\)), such that when \(\int_0^T \sigma(t)\, dt \gg \sigma_{\mathrm{data}}\), \(p_T(\mathbf{x})\) becomes essentially indistinguishable from a Gaussian distribution (Karras et al., 2022). DBMs work by learning an approximation to the corresponding reverse SDE (Anderson, 1982):
\[d \mathbf{x} = \lbrace \mathbf{f}(\mathbf{x}, t) - \nabla \cdot [\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^\top] - \mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^\top \nabla_{\mathbf{x}}\log p_t(\mathbf{x}) \rbrace dt + \mathbf{G}(\mathbf{x}, t) \cdot d\mathbf{\bar{W}},\]where \( \mathbf{\bar{W}} \) is time-reversed Brownian motion. In practice, this requires approximating the score function \(\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\): samples are incrementally noised according to the schedule \(\sigma(t)\) of the forward process, and a network is trained to denoise them, which yields an estimate of the score. The fully trained model then generates samples from the target distribution by starting from \(\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \sigma^2(T) \mathbf{I}) \) and integrating the reverse SDE backwards in time. Note that this process is noisy and mixing (left panel of the animation above).
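To make the forward/reverse pair concrete, here is a minimal sketch (not the training procedure itself) that simulates both SDEs with Euler-Maruyama for a one-dimensional Gaussian toy dataset. It assumes the variance-exploding choice \(\mathbf{f} = \mathbf{0}\), \(\sigma(t) = t\) (so \(\mathbf{G} = \sqrt{2t}\,\mathbf{I}\)) from Karras et al., 2022; because \(p_t\) is Gaussian here, the score is available in closed form and stands in for the learned network. All names and hyperparameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_data, T, n_steps, n = 1.0, 10.0, 1000, 5000
dt = T / n_steps

def g(t):                      # diffusion coefficient for sigma(t) = t,
    return np.sqrt(2.0 * t)    # since G^2 = d[sigma^2(t)]/dt = 2t

def score(x, t):               # closed-form score of p_t = N(0, (sigma_data^2 + t^2) I)
    return -x / (sigma_data**2 + t**2)

# Forward noising: Euler-Maruyama from p_data = N(0, sigma_data^2) toward the prior
x = sigma_data * rng.standard_normal(n)
for i in range(n_steps):
    t = i * dt
    x += g(t) * np.sqrt(dt) * rng.standard_normal(n)   # f = 0, pure noise injection

# Reverse SDE: start from the Gaussian prior and integrate backwards in time
x = np.sqrt(sigma_data**2 + T**2) * rng.standard_normal(n)
for i in range(n_steps, 0, -1):
    t = i * dt
    x += g(t)**2 * score(x, t) * dt + g(t) * np.sqrt(dt) * rng.standard_normal(n)

print(np.std(x))   # ~ sigma_data: samples again match the data distribution
```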
1.2 Distributional View of DBMs
As previously shown (Song et al., 2021; Karras et al., 2022), this forward diffusion process gives rise to a series of marginal distributions \(p_t(\mathbf{x})\) satisfying a Fokker-Planck equation (FPE):
\[\partial_t p_t(\mathbf{x}) = -\sum_i \partial_i[f_i(\mathbf{x}, t) p_t(\mathbf{x})] + \frac{1}{2}\sum_{ij} \partial_i \partial_j \left[\sum_k G_{ik}(\mathbf{x}, t)G_{jk}(\mathbf{x}, t)p_t(\mathbf{x})\right]\]where \(\partial_i \equiv \frac{\partial}{\partial x_i}\). Under the “variance preserving” noise schedule of Song et al., 2021, this Fokker-Planck equation has an isotropic Gaussian distribution as its stationary solution. That is, in DBMs, the forward process is a means of gradually transforming the data into an easy-to-sample form (as with normalizing flows), and the reverse process is a means of data generation. Therefore, one can analyze DBMs through the lens of their respective Fokker-Planck equations and the series of marginal distributions these induce over time; that is, we can view these models from a distributional perspective (middle panel of the animation).
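As a quick sanity check on that stationary-solution claim, the sketch below simulates a variance-preserving SDE with constant \(\beta\), \(d\mathbf{x} = -\tfrac{1}{2}\beta \mathbf{x}\, dt + \sqrt{\beta}\, d\mathbf{W}\), and verifies that its empirical moments relax to those of \(\mathcal{N}(0, \mathbf{I})\). Real schedules use a time-dependent \(\beta(t)\); the constant value and starting point here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, dt, n_steps, n = 1.0, 1e-2, 5000, 10000

# VP SDE with constant beta: dx = -0.5*beta*x dt + sqrt(beta) dW.
# Its Fokker-Planck equation has N(0, 1) as stationary solution.
x = 5.0 + 2.0 * rng.standard_normal(n)   # start far from the stationary density
for _ in range(n_steps):
    x += -0.5 * beta * x * dt + np.sqrt(beta * dt) * rng.standard_normal(n)

print(x.mean(), x.var())   # -> approximately 0 and 1
```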
1.3 Probability Flow ODE View of DBMs
However, a different process with no noise term can generate marginal distributions satisfying the same FPE governing the forward noising process (Song et al., 2021). This is the so-called probability flow ODE (pfODE) view of DBMs:
\[d \mathbf{x} = \left\lbrace \mathbf{f}(\mathbf{x}, t) - \frac{1}{2}\nabla \cdot [\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^\top] - \frac{1}{2}\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^\top \nabla_{\mathbf{x}}\log p_t(\mathbf{x}) \right\rbrace dt\]Unlike the noising process induced by the forward SDE, this process is deterministic, and data points evolve smoothly, resulting in a flow that preserves local neighborhoods (right panel of the animation). Moreover, the pfODE is uniquely defined by \(\mathbf{f}(\mathbf{x}, t)\), \(\mathbf{G}(\mathbf{x}, t)\), and the score function. In this work, we show how this pfODE, constructed using a score function estimated by training the corresponding DBM, can be used to map points from \(p_{\mathrm{data}}(\mathbf{x})\) to a compressed latent space in a manner that affords accurate uncertainty quantification.
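The determinism and invertibility of this flow are easy to demonstrate on the same Gaussian toy example used above, where the score is known in closed form: integrating the pfODE forwards encodes data points into the latent (Gaussian) space, and integrating it backwards recovers them up to solver tolerance. This is a sketch under those toy assumptions, not the full model.

```python
import numpy as np
from scipy.integrate import solve_ivp

sigma_data, T = 1.0, 10.0

def score(x, t):                 # closed-form score of p_t = N(0, (sigma_data^2 + t^2) I)
    return -x / (sigma_data**2 + t**2)

def pfode(t, x):                 # f = 0, G = sqrt(2t) I  =>  dx/dt = -0.5 * (2t) * score
    return -t * score(x, t)

rng = np.random.default_rng(0)
x0 = sigma_data * rng.standard_normal(8)              # toy "data" points

# Encode: integrate the pfODE forwards from t=0 to t=T (deterministic, no mixing)
z = solve_ivp(pfode, (0.0, T), x0, rtol=1e-8, atol=1e-8).y[:, -1]

# Decode: integrate backwards; recovers the original points up to solver error
x1 = solve_ivp(pfode, (T, 0.0), z, rtol=1e-8, atol=1e-8).y[:, -1]
print(np.max(np.abs(x1 - x0)))   # ~ 0: the flow is invertible
```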
1.4 Connections to Flow Matching
This connection between the marginals satisfying the SDEs of diffusion processes and deterministic flows described by an equivalent ODE has also been recently explored in the context of flow matching models (Lipman et al., 2023; Liu et al., 2023; Albergo & Vanden-Eijnden, 2023). In a nutshell, flow matching models use a simple, time-differentiable “interpolant” function to specify conditional families of distributions that continuously map between specified initial and final densities. That is, the interpolant functions define flows that map samples from a base distribution \(\rho_0(\mathbf{x})\) to samples from a target distribution \(\rho_1(\mathbf{x})\). Typically, these approaches rely on a simple quadratic objective that attempts to match the conditional flow field, which can be computed in closed form without needing to integrate the corresponding ODE.
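For concreteness, here is a minimal conditional flow matching sketch using the linear interpolant \(\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1\) (as in Liu et al., 2023), whose conditional velocity target \(\mathbf{x}_1 - \mathbf{x}_0\) is exactly the closed-form quantity mentioned above; the network, toy distributions, and hyperparameters are illustrative placeholders.

```python
import torch
import torch.nn as nn

# Velocity field v_theta(x, t); a tiny MLP stands in for the real network.
net = nn.Sequential(nn.Linear(3, 64), nn.SiLU(),
                    nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    x0 = torch.randn(256, 2)                       # base rho_0: standard Gaussian
    x1 = torch.randn(256, 2) * 0.5 + 2.0           # target rho_1: toy data
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1                     # linear interpolant
    u = x1 - x0                                    # conditional velocity, closed form
    v = net(torch.cat([xt, t], dim=1))
    loss = ((v - u) ** 2).mean()                   # simple quadratic matching objective
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: Euler-integrate dx/dt = v_theta(x, t) from t=0 to t=1
with torch.no_grad():
    x = torch.randn(1000, 2)
    for i in range(100):
        t = torch.full((1000, 1), i / 100.0)
        x = x + net(torch.cat([x, t], dim=1)) / 100.0
```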
As shown in Appendix A.5 of our paper, the pfODEs obtained using our proposed scaling and noising schedules are equivalent to the ODEs obtained by using the “Gaussian paths formulation” from Lipman et al., 2023 when the latter are generalized to full covariance matrices. As a result, our models are amenable to training using flow matching techniques, suggesting that faster training and inference schemes may be possible by leveraging connections between flow matching and optimal transport (Tong et al., 2023; Pooladian et al., 2023; Tong et al., 2024; Albergo & Vanden-Eijnden, 2023).