Memorization and Generalization in Generative Diffusion under the Manifold Hypothesis
Achilli, Beatrice; Lucibello, Carlo; Mézard, Marc; Ventura, Enrico
2025
Abstract
We study the memorization and generalization capabilities of Diffusion Models (DMs) when data lies on a structured latent manifold. Specifically, we consider a set of $P$ data points in $N$ dimensions confined to a latent subspace of dimension $D = \alpha_D N$, following the Hidden Manifold Model (HMM). We analyze the reverse diffusion process using the empirical score function as a proxy, and characterize it in the high-dimensional limit $P = \exp(\alpha N)$, $N \gg 1$, by exploiting a connection with the Random Energy Model (REM). We show that a characteristic time $t_o$ marks the emergence of traps in the time-dependent potential, which however do not affect typical trajectories. The size of their basins of attraction is computed at all times. We derive the collapse time $t_c < t_o$, at which trajectories fall into the basin of a training point, signaling memorization. An explicit formula for $t_c$ as a function of $P$ and $\alpha_D$ shows that the curse of dimensionality is avoided for structured data ($\alpha_D \ll 1$), even with nonlinear manifolds. We also prove that collapse corresponds to the condensation transition in the REM. Generalization is quantified via the Kullback-Leibler divergence between the exact distribution and the reverse one at time $t$. We find a distinct time $t_g < t_c < t_o$ minimizing this divergence. Surprisingly, the best generalization occurs inside the memorization phase. We conclude that generalization in DMs improves with data structure, as $t_g \to 0$ faster than $t_c$ when $\alpha_D \to 0$.
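As a rough illustration of the setup described in the abstract, the sketch below generates training points from a Hidden Manifold Model and runs a reverse diffusion driven by the empirical score (the score of the Gaussian mixture centered on the shrunk training points). The Ornstein-Uhlenbeck forward process, the `tanh` nonlinearity, the dimensions `N`, `D`, `P`, and the helper names `empirical_score` and `reverse_diffusion` are illustrative assumptions taken from the standard diffusion literature, not the authors' code or exact conventions.

```python
# Minimal sketch (assumed conventions, not the paper's implementation):
# reverse diffusion driven by the empirical score of P training points
# drawn from a Hidden Manifold Model.
import numpy as np

rng = np.random.default_rng(0)

N, D, P = 50, 5, 200   # ambient dim, latent dim (D = alpha_D * N), number of data points

# Hidden Manifold Model (assumed form): a^mu = tanh(F z^mu / sqrt(D)), z^mu Gaussian in R^D
F = rng.standard_normal((N, D))
Z = rng.standard_normal((P, D))
A = np.tanh(Z @ F.T / np.sqrt(D))      # training set, shape (P, N)

def empirical_score(x, t):
    """Score of the Gaussian mixture centered on the shrunk data points.

    Assumed OU forward process: x_t = a e^{-t} + sqrt(1 - e^{-2t}) * noise,
    so P_t(x) = (1/P) sum_mu N(x; a^mu e^{-t}, Delta_t I), Delta_t = 1 - e^{-2t}.
    """
    delta = 1.0 - np.exp(-2.0 * t)
    means = A * np.exp(-t)                           # (P, N)
    diff = x[None, :] - means                        # (P, N)
    logw = -np.sum(diff**2, axis=1) / (2.0 * delta)  # unnormalized log posterior weights
    w = np.exp(logw - logw.max())
    w /= w.sum()                                     # softmax over training points
    return -(x - w @ means) / delta                  # grad_x log P_t(x)

def reverse_diffusion(t_start=3.0, dt=1e-3):
    """Integrate the reverse SDE dx = (x + 2 s(x, t)) dt + sqrt(2) dW backward in time."""
    x = rng.standard_normal(N)   # start near the stationary Gaussian of the OU process
    t = t_start
    while t > dt:
        s = empirical_score(x, t)
        x += (x + 2.0 * s) * dt + np.sqrt(2.0 * dt) * rng.standard_normal(N)
        t -= dt
    return x

sample = reverse_diffusion()
# Memorization check: distance of the generated sample to its nearest training point.
print("nearest-training-point distance:", np.min(np.linalg.norm(A - sample, axis=1)))
```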
| File | Description | Type | License | Size | Format |
|---|---|---|---|---|---|
| 2503.09518v1.pdf (open access) | article | Pre-print document | Public domain | 395.34 kB | Adobe PDF |