Memorization and Generalization in Generative Diffusion under the Manifold Hypothesis

Achilli, Beatrice; Ambrogioni, Luca; Lucibello, Carlo; Mezard, Marc; Ventura, Enrico
2025

Abstract

We study the memorization and generalization capabilities of Diffusion Models (DMs) when data lies on a structured latent manifold. Specifically, we consider a set of $P$ data points in $N$ dimensions confined to a latent subspace of dimension $D = \alpha_D N$, following the Hidden Manifold Model (HMM). We analyze the reverse diffusion process using the empirical score function as a proxy, and characterize it in the high-dimensional limit $P = \exp(\alpha N)$, $N \gg 1$, by exploiting a connection with the Random Energy Model (REM). We show that a characteristic time $t_o$ marks the emergence of traps in the time-dependent potential, which, however, do not affect typical trajectories. The size of their basins of attraction is computed at all times. We derive the collapse time $t_c < t_o$, at which trajectories fall into the basin of a training point, signaling memorization. An explicit formula for $t_c$ as a function of $P$ and $\alpha_D$ shows that the curse of dimensionality is avoided for structured data ($\alpha_D \ll 1$), even with nonlinear manifolds. We also prove that collapse corresponds to the condensation transition in the REM. Generalization is quantified via the Kullback-Leibler divergence between the exact distribution and the reverse one at time $t$. We find a distinct time $t_g < t_c < t_o$ minimizing this divergence. Surprisingly, the best generalization occurs inside the memorization phase. We conclude that generalization in DMs improves with data structure, as $t_g \to 0$ faster than $t_c$ when $\alpha_D \to 0$.
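To make the setup concrete, the following is a minimal numerical sketch of the two ingredients named in the abstract: data generated from a Hidden Manifold Model and reverse diffusion driven by the empirical score of the training set. The tanh nonlinearity, the Ornstein-Uhlenbeck parametrization of the forward process, and all parameter values ($N$, $D$, $P$, integration horizon, step count) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# --- Hidden Manifold Model data (illustrative parameter choices) ---
# P points in N dimensions, confined to a D-dimensional latent manifold
# through a fixed random feature map with a tanh nonlinearity.
rng = np.random.default_rng(0)
N, D, P = 200, 20, 1000                      # alpha_D = D / N = 0.1
F = rng.standard_normal((N, D))              # random feature matrix
Z = rng.standard_normal((P, D))              # latent coordinates
A = np.tanh(Z @ F.T / np.sqrt(D))            # training points a^mu, shape (P, N)

def empirical_score(x, t):
    """Score of the Gaussian-mixture density obtained by diffusing the P
    training points with an Ornstein-Uhlenbeck forward process:
    p_t(x) = (1/P) sum_mu N(x; a^mu e^{-t}, Gamma_t I), Gamma_t = 1 - e^{-2t}."""
    gamma = 1.0 - np.exp(-2.0 * t)
    means = A * np.exp(-t)                            # (P, N)
    d2 = ((x - means) ** 2).sum(axis=1)               # squared distance to each mean
    logw = -d2 / (2.0 * gamma)
    w = np.exp(logw - logw.max())
    w /= w.sum()                                       # softmax responsibilities
    return (w[:, None] * (means - x)).sum(axis=0) / gamma

def reverse_diffusion(T=5.0, n_steps=500):
    """Euler-Maruyama integration of the reverse SDE
    dx = [x + 2 * score(x, t)] ds + sqrt(2) dW, with t = T - s."""
    dt = T / n_steps
    x = rng.standard_normal(N)                         # start from the stationary Gaussian
    for k in range(n_steps):
        t = T - k * dt
        x = x + (x + 2.0 * empirical_score(x, t)) * dt \
              + np.sqrt(2.0 * dt) * rng.standard_normal(N)
    return x

sample = reverse_diffusion()
# Distance of the generated sample to its closest training point:
# a rough diagnostic of collapse onto the training set (memorization).
print(np.min(np.linalg.norm(A - sample, axis=1)))
```

The final nearest-neighbor distance is only a crude probe of the collapse phenomenon described above: in the regime analyzed in the paper, trajectories integrated past the collapse time fall into the basin of a single training point.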
Files in this record:
File: 2503.09518v1.pdf (open access)
Description: article
Type: Pre-print document
License: Public domain
Size: 395.34 kB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/11565/4077301