Ch. 1: Preliminaries

Diffusion models are fundamentally based on stochastic differential equations (SDEs), which combine deterministic dynamics with randomness. While we understand that SDEs govern how diffusion models evolve and that ordinary differential equations (ODEs) are crucial for sampling, the mathematical foundations - particularly random variables and stochastic processes - deserve careful examination.

This blog post aims to build a rigorous understanding of these foundational concepts. We'll explore key definitions, theorems, and derivations from Chapter 2 of the book, focusing on the most essential elements while providing detailed explanations where needed. By the end, we'll try to connect it all back to diffusion models.

Random Variable

Definition 1.1: If \(\Omega\) is a given set, then a \(\sigma\)-algebra \(\mathcal{F}\) on \(\Omega\) is a family of subsets of \(\Omega\) with the following three properties:

  1. \(\emptyset \in \mathcal{F}\)
  2. If \(F \in \mathcal{F}\) then \(F^c \in \mathcal{F}\), where \(F^c = \Omega \setminus F\) (closure under complement)
  3. If \(A_1, A_2, \ldots \in \mathcal{F}\), then \(\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}\) (closure under countable unions)

Let $P: \mathcal{F} \rightarrow [0,1]$ be a probability measure on the measurable space $(\Omega, \mathcal{F})$. Then $(\Omega, \mathcal{F}, P)$ is called a probability space. Intuitively,

\[P(F) = \text{the probability that the event } F \text{ occurs}\]
Toy example: Let's define our sample space \(\Omega\) as all possible outcomes when rolling a fair six-sided die:
\[ \Omega = \{1, 2, 3, 4, 5, 6\} \] A \(\sigma\)-algebra \(\mathcal{F}\) is a collection of subsets of \(\Omega\) that satisfies the three properties above. A simple example of a \(\sigma\)-algebra is:
\[ \mathcal{F}=\{\emptyset,\Omega,\{1,3,5\},\{2,4,6\}\} \] **Task:** Try to verify the three properties by hand.
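The manual verification in the task above can also be automated. Here is a small Python sketch of my own (the helper name `is_sigma_algebra` is made up for this post); on a finite sample space, countable unions reduce to finite unions, so checking pairwise unions suffices:

```python
from itertools import combinations

def is_sigma_algebra(omega, family):
    """Check the three sigma-algebra axioms on a finite sample space.

    On a finite space, closure under pairwise unions implies closure
    under all finite (hence countable) unions by induction.
    """
    family = {frozenset(s) for s in family}
    omega = frozenset(omega)
    # 1. The empty set belongs to the family.
    if frozenset() not in family:
        return False
    # 2. Closure under complement.
    if any(omega - f not in family for f in family):
        return False
    # 3. Closure under (pairwise, hence finite) unions.
    if any(a | b not in family for a, b in combinations(family, 2)):
        return False
    return True

omega = {1, 2, 3, 4, 5, 6}
F = [set(), omega, {1, 3, 5}, {2, 4, 6}]
print(is_sigma_algebra(omega, F))  # True
```

Dropping \(\{2,4,6\}\) from \(\mathcal{F}\) breaks closure under complement, and the check returns `False`.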

Understanding the Borel sets: Given any collection $\mathcal{U}$ of subsets of $\Omega$, we can define the smallest $\sigma$-algebra $\mathcal{H}_\mathcal{U}$ containing $\mathcal{U}$:

\[\mathcal{H}_\mathcal{U} = \bigcap\{\mathcal{H};~\mathcal{H}~\sigma \text{-algebra of } \Omega,~\mathcal{U} \subseteq \mathcal{H}\}.\]

The intersection of all $\sigma$-algebras containing $\mathcal{U}$ is itself a $\sigma$-algebra, and it is the smallest $\sigma$-algebra containing $\mathcal{U}$. When $\Omega$ is a topological space and $\mathcal{U}$ is its collection of open sets, $\mathcal{B} = \mathcal{H}_\mathcal{U}$ is called the Borel $\sigma$-algebra generated by $\mathcal{U}$, and its members are the Borel sets.

The Stochastic Process

Definition 1.2: A stochastic process is a parameterized collection of random variables

\[\{X_t\}_{t \in T}\]

defined on a probability space $(\Omega, \mathcal{F}, P)$ and assuming values in $\mathbb{R}^n$. Here, $T$ can be any interval $[a, b]$ (in diffusion models, for example, $[0, 1000]$). For each fixed $t \in T$, $\omega \rightarrow X_t(\omega)$ is a random variable, while for each fixed $\omega \in \Omega$, $t \rightarrow X_t(\omega)$ is a sample path. Hence, for simplicity, we can view the process as a function of both arguments:

\[(t, \omega) \rightarrow X(t, \omega)\]
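To make the two views concrete, here is a small illustrative sketch (mine, not from the book): a Gaussian random walk stands in for a generic process, with the random seed playing the role of \(\omega\). Fixing \(\omega\) gives a sample path in \(t\); fixing \(t\) gives a random variable over \(\omega\):

```python
import numpy as np

def X(t_grid, omega):
    """Sample path of a Gaussian random walk, X(t, omega).

    The seed `omega` determines the whole path, so calling X twice with
    the same omega reproduces the same trajectory.
    """
    rng = np.random.default_rng(omega)
    dt = np.diff(t_grid, prepend=0.0)          # time increments
    increments = rng.normal(0.0, np.sqrt(dt))  # N(0, dt) steps
    return np.cumsum(increments)

t_grid = np.linspace(0.0, 1.0, 101)

# Fixed omega: t -> X_t(omega) is a single trajectory.
path = X(t_grid, omega=0)
print(path.shape)  # (101,)

# Fixed t = 0.5: omega -> X_t(omega) is a random variable;
# its empirical mean over many omegas is close to 0.
at_t_half = [X(t_grid, omega=w)[50] for w in range(1000)]
print(np.mean(at_t_half))
```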

Kolmogorov's Extension Theorem is a fundamental result in probability theory that allows us to construct stochastic processes (random variables indexed over time) from a consistent collection of finite-dimensional distributions.

Intuitively, this theorem states that

  1. K1 (Consistency Condition) (a.k.a. the Chapman-Kolmogorov equation):
    • The probability measures must be consistent under marginalization. That is, if we integrate out an intermediate time, the resulting distribution must match the lower-dimensional distribution.
    • Mathematically,
\[\int_{\mathbb{R}^n} p(t, x, z)\,p(s, z, y)\,dz = p(t+s, x, y).\]
  2. K2 (Total Probability Condition):
    • The total probability of transitioning from any initial point to any final point over all space must be 1:
\[\int_{\mathbb{R}^n} p(t,x,y)\,dy = 1, \quad \forall t \geq 0.\]
  3. If K1 and K2 hold, then there exists a probability space $(\Omega,\mathcal{F},P)$ and a stochastic process $\{X_t\}$ such that these finite-dimensional distributions correspond to the probabilities of events in this space.

This result is crucial because it guarantees that if we define random variables consistently at finite time steps, we can always extend this definition to an infinite process.

The statement above is simplified to make the next section easier to follow; Theorem 2.1.5 in the book gives the full measure-theoretic form.

Brownian Motion

Brownian motion is a canonical example of a stochastic process, written $B_t(\omega)$, originally proposed to model the erratic motion of pollen grains; it can be constructed using Kolmogorov's Extension Theorem. Specifically, its transition density is

\[p(t, x, y) = (2\pi t)^{-n/2} \cdot \exp\left(-\frac{|x-y|^2}{2t}\right), \text{ where } y \in \mathbb{R}^n, t>0 \text{ and fixed } x\in \mathbb{R}^n.\]

Then we can define probability measure as:

\[v_{t_1, ..., t_k} (F_1 \times ... \times F_k) = \int_{F_1} ... \int_{F_k} p(t_1, x, y_1)p(t_2-t_1, y_1, y_2)...p(t_k-t_{k-1}, y_{k-1}, y_k)dy_1...dy_k\]

where $0 \leq t_1 \leq t_2 \leq \ldots \leq t_k$. It is worth noting that at $t=0$ we have $p(0,x,y)\,dy = \delta_x(dy)$, the Dirac measure at $x$.
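The measure \(v_{t_1,\ldots,t_k}\) suggests a direct sampling scheme, sketched below in one dimension (my own illustration, not from the book): starting from \(x\), each coordinate is drawn from the Gaussian transition density \(p(t_i - t_{i-1}, y_{i-1}, \cdot)\). We then check two standard facts empirically: \(\mathrm{Var}(B_t) = t\) and \(\mathrm{Cov}(B_s, B_t) = \min(s, t)\) for \(B_0 = 0\).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_brownian(times, x=0.0, n_samples=100_000):
    """Draw n_samples realizations of (B_{t_1}, ..., B_{t_k}) with B_0 = x.

    Each step adds an independent N(0, t_i - t_{i-1}) increment, which is
    exactly sampling from p(t_i - t_{i-1}, y_{i-1}, y_i).
    """
    times = np.asarray(times, dtype=float)
    dts = np.diff(times, prepend=0.0)  # t_1, t_2 - t_1, ...
    steps = rng.normal(0.0, np.sqrt(dts), size=(n_samples, len(times)))
    return x + np.cumsum(steps, axis=1)

B = sample_brownian([0.25, 0.5, 1.0])
print(B[:, 2].var())             # close to 1.0:  Var(B_1) = 1
print(np.mean(B[:, 0] * B[:, 2]))  # close to 0.25: Cov(B_s, B_t) = min(s, t)
```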

Let’s verify that the Brownian motion equation satisfies the key criteria of Kolmogorov’s Extension Theorem:

Step 1: (Chapman-Kolmogorov equation) Let's verify that the transition probability density satisfies the Chapman-Kolmogorov equation by substituting the Gaussian density into K1: \[ \begin{aligned} \text{LHS} &= \int_{\mathbb{R}^n} p(t,x,z)p(s,z,y)dz \\ &= \int_{\mathbb{R}^n} (2\pi t)^{-n/2} \exp\left(-\frac{|x-z|^2}{2t}\right) (2\pi s)^{-n/2} \exp\left(-\frac{|z-y|^2}{2s}\right) dz \\ &= (2\pi t)^{-n/2} (2\pi s)^{-n/2} \exp\left(-\frac{|x-y|^2}{2(t+s)}\right) \times \int_{\mathbb{R}^n} \exp \left(-\frac{(s+t)}{2ts}|z-\alpha|^2\right) dz \\ & \quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad \text{ here, } \alpha \text{ is constant w.r.t. } z\\ &= p(t+s,x,y) = \text{RHS} \end{aligned} \] (Note that we skipped one step simplifying the Gaussian integral in the second term. If needed, comment down below and I can provide it.)
Step 2: (Total Probability) For the total probability condition, we need to show that integrating over all possible values of y equals 1: \[ \begin{aligned} \int_{\mathbb{R}^n} p(t,x,y)dy &= \int_{\mathbb{R}^n} (2\pi t)^{-n/2} \exp\left(-\frac{|x-y|^2}{2t}\right)dy \\ &= 1 \end{aligned} \] This equality holds because we are integrating a properly normalized Gaussian density function. Therefore, the Brownian motion transition density satisfies both key conditions of Kolmogorov's Extension Theorem, confirming that it defines a valid stochastic process.
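Both conditions can also be sanity-checked numerically. The snippet below is my own quick check in one dimension, approximating the integrals in K1 and K2 with a plain Riemann sum over a wide grid:

```python
import numpy as np

# 1-D Gaussian transition density of Brownian motion.
def p(t, x, y):
    return np.exp(-(x - y) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)

z = np.linspace(-30.0, 30.0, 200_001)  # quadrature grid (tails are ~0)
dz = z[1] - z[0]

t, s, x, y = 0.7, 1.3, -0.4, 1.1

# K1: integrate p(t, x, z) p(s, z, y) over z and compare to p(t+s, x, y).
lhs = np.sum(p(t, x, z) * p(s, z, y)) * dz
rhs = p(t + s, x, y)
print(abs(lhs - rhs) < 1e-6)   # True: Chapman-Kolmogorov holds

# K2: the density integrates to 1 over all of space.
total = np.sum(p(t, x, z)) * dz
print(abs(total - 1.0) < 1e-6)  # True: total probability is 1
```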

Theorem 2.3: Kolmogorov's Continuity Theorem

Kolmogorov’s Continuity Theorem provides conditions under which a stochastic process has continuous paths. It is a fundamental result in the theory of stochastic processes and is particularly useful in the study of Brownian motion and diffusion processes.

Theorem: Let \(X = \{X_t : t \in T\}\) be a stochastic process. Suppose there exist constants $\alpha > 0$, $\beta > 0$, and $C > 0$ such that for all $s, t \in T$,

\[\mathbb{E}[|X_t - X_s|^\alpha] \leq C |t - s|^{1 + \beta}\]

Then, there exists a modification of the process $X$ that has continuous paths with probability 1.

Explanation: The theorem essentially states that if the increments of a stochastic process satisfy a certain moment condition, then the process can be modified to have continuous paths. This is particularly important for ensuring that models like Brownian motion, which are used to describe continuous phenomena, are mathematically well-defined.

This theorem is a cornerstone in the study of stochastic processes, providing the necessary conditions for the existence of continuous sample paths, which are crucial for modeling real-world phenomena where continuity is expected.
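As a concrete check (my own one-dimensional computation, not from the book), Brownian motion satisfies the hypothesis with \(\alpha = 4\), \(\beta = 1\), \(C = 3\), because its increments are Gaussian and the fourth moment of \(\mathcal{N}(0, \sigma^2)\) is \(3\sigma^4\):

```latex
% B_t - B_s ~ N(0, t - s), and E[Z^4] = 3 sigma^4 for Z ~ N(0, sigma^2):
\[
\mathbb{E}\!\left[|B_t - B_s|^4\right]
  = 3\,(t - s)^2
  = 3\,|t - s|^{1 + 1},
\]
% which is exactly the moment condition with alpha = 4, beta = 1, C = 3,
% so Brownian motion admits a modification with continuous paths.
```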

Please refer to the derivation by Prof. Fabrice Baudoin; an intuitive explanation follows.
The Kolmogorov Continuity Theorem helps us understand when a stochastic process has continuous paths. Here's the intuitive idea:
1. We start by looking at the process at discrete time points (like a grid).
2. Using Chebyshev's inequality, we can show that large jumps between consecutive grid points are very unlikely:
\[ P\left( \max_{1\leq k \leq 2^n} \left| X_{k/2^n} - X_{(k-1)/2^n} \right| \geq 2^{-\gamma n} \right) \leq C\, 2^{-n(\beta - \alpha\gamma)} \]
3. The Borel-Cantelli lemma then helps us prove that as we make our grid finer and finer (\(n\rightarrow \infty\)), the maximum size of these jumps becomes smaller and smaller:
\[ \left| X_t - X_s \right| \leq \frac{2C}{1-2^{-\gamma}}2^{-n\gamma} \]
4. This means that:
- The process becomes "smoother" as we look at it more closely
- Any discontinuities would have to be infinitely small
- Therefore, the process is continuous almost surely

This is particularly important for Brownian motion, as it guarantees that particle paths are continuous - there are no sudden teleportations or jumps in the motion.

Connections between Diffusion and Brownian motion

Brownian motion is a specific stochastic process that models random, continuous motion, like the movement of particles in a fluid. Brownian motion is a basic building block used in many areas, such as modeling randomness, diffusion processes, and stochastic differential equations (SDEs). It has properties like $B_0 = 0$, independent increments, Gaussian increments $B_t - B_s \sim \mathcal{N}(0, (t-s)I)$, and (almost surely) continuous sample paths.

Diffusion models are general frameworks used to model the evolution of a system over time, incorporating randomness. Diffusion models describe a forward process that adds noise to data (like images) step by step and a reverse process that removes the noise to recover the original data.

Importantly, both can be expressed as SDEs driven by Brownian motion:

\[dX_t = \sigma\, dB_t ~~~~\text{(pure random motion)}\] \[dX_t = f(X_t, t)dt + g(X_t, t)dB_t ~~~~\text{(generalized)}\]
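The generalized SDE above can be simulated with the Euler-Maruyama scheme. The sketch below is a minimal illustration of mine; the drift \(f\) and diffusion \(g\) (a mean-reverting Ornstein-Uhlenbeck process) are illustrative choices, not the forward process of any particular diffusion model:

```python
import numpy as np

def euler_maruyama(f, g, x0, t_end=1.0, n_steps=1000, seed=0):
    """Simulate dX_t = f(X_t, t) dt + g(X_t, t) dB_t on [0, t_end]."""
    rng = np.random.default_rng(seed)
    dt = t_end / n_steps
    x, t = x0, 0.0
    path = [x0]
    for _ in range(n_steps):
        dB = rng.normal(0.0, np.sqrt(dt))      # Brownian increment ~ N(0, dt)
        x = x + f(x, t) * dt + g(x, t) * dB    # one Euler-Maruyama step
        t += dt
        path.append(x)
    return np.array(path)

f = lambda x, t: -2.0 * x   # mean-reverting drift
g = lambda x, t: 0.5        # constant diffusion (the sigma dB_t case)
path = euler_maruyama(f, g, x0=1.0)
print(path.shape)  # (1001,)
```

Setting `f = lambda x, t: 0.0` recovers the pure random motion \(dX_t = \sigma\, dB_t\), i.e. scaled Brownian motion.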

Conclusion

In this blog post, we explored fundamental concepts in probability theory and stochastic processes that form the mathematical foundation of diffusion models. We covered key ideas like probability spaces, random variables, and stochastic processes, with a particular focus on Brownian motion. We also began to see how these mathematical tools connect to modern diffusion models used in machine learning.

While this was an introductory look at these concepts, they are essential for understanding how diffusion models work at a deeper level. As we continue through the book in future posts, we will build on these fundamentals to develop a more comprehensive understanding of diffusion models and their theoretical underpinnings.