Bilinear Sequence Regression: A Model for Learning from Long Sequences of High-Dimensional Tokens
Biggio, Luca
2025
Abstract
Modern AI systems have achieved remarkable results by learning from long sequences of complex inputs like natural language or biological data. Yet, the theoretical understanding of how neural networks learn from such structured data remains limited. In this work, we introduce a simplified but powerful model to study this learning process from first principles. The model—bilinear sequence regression (BSR)—captures some of the core features of advanced architectures like transformers while remaining analytically solvable. The BSR model defines outputs as structured functions of sequences of high-dimensional vectors, enabling precise analysis of learning behavior. Using techniques from statistical physics and probabilistic inference, we determine the optimal prediction accuracy in the “thermodynamic limit,” where both the sequence length and token dimension become large. This analysis uncovers a sharp phase transition in learning: There exists a threshold in data volume below which learning fails and above which it suddenly becomes successful. We also analyze gradient descent learning dynamics and provide numerical evidence that a variant of this algorithm can reach the optimal performance predicted by our theory. BSR provides a rigorous and flexible framework for understanding how neural networks learn from sequential data. It offers insight into why architectures like transformers perform so well and helps to identify the fundamental conditions under which they succeed or fail.

| File | Size | Format |
|---|---|---|
| prx.pdf (open access; type: publisher's layout PDF; license: Creative Commons) | 1.84 MB | Adobe PDF |
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.


