Large-width asymptotics and training dynamics of α-Stable ReLU neural networks

Favaro, Stefano; Fortini, Sandra; Peluchetti, Stefano
2024

Abstract

Large-width asymptotic properties of neural networks (NNs) with Gaussian distributed weights have been extensively investigated in the literature, with major results characterizing their large-width asymptotic behavior in terms of Gaussian processes and their large-width training dynamics in terms of the neural tangent kernel (NTK). In this paper, we study large-width asymptotics and training dynamics of α-Stable ReLU-NNs, namely NNs with ReLU activation function and α-Stable distributed weights, with α ∈ (0, 2). For α ∈ (0, 2], α-Stable distributions form a broad class of heavy-tailed distributions, with the special case α = 2 corresponding to the Gaussian distribution. Firstly, we show that, as the NN's width goes to infinity, a rescaled α-Stable ReLU-NN converges weakly (in distribution) to an α-Stable process, which generalizes the Gaussian process. In contrast to the Gaussian setting, our result shows that the activation function affects the scaling of the α-Stable NN; more precisely, to achieve the infinite-width α-Stable process, the ReLU activation requires an additional logarithmic term in the scaling relative to sub-linear activations. Secondly, we characterize the large-width training dynamics of α-Stable ReLU-NNs in terms of an infinite-width random kernel, referred to as the α-Stable NTK, and we show that gradient descent achieves zero training error at a linear rate, for a sufficiently large width, with high probability. Unlike the NTK arising in the Gaussian setting, the α-Stable NTK is a random kernel; more precisely, the randomness of the α-Stable ReLU-NN at initialization does not vanish in the large-width training dynamics.
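To make the objects in the abstract concrete, the sketch below samples a one-hidden-layer ReLU network with symmetric α-Stable weights at initialization, using SciPy's levy_stable. The (n log n)^(-1/α) rescaling is our hedged, illustrative reading of the "additional logarithmic term" mentioned above, not the paper's exact statement; function names and constants here are ours, chosen for illustration only.

```python
import numpy as np
from scipy.stats import levy_stable

# Minimal sketch: one-hidden-layer alpha-Stable ReLU network at initialization.
# The (n * log n)**(-1/alpha) rescaling is an illustrative reading of the
# abstract's "additional logarithmic term"; the paper's exact normalization
# and constants may differ.

def stable_relu_nn(x, n, alpha, seed=0):
    """Rescaled output of a width-n ReLU network with symmetric
    alpha-Stable weights (beta=0), evaluated at scalar input x."""
    rng = np.random.default_rng(seed)
    # beta=0 gives the symmetric alpha-Stable distribution
    w1 = levy_stable.rvs(alpha, 0.0, size=n, random_state=rng)  # input -> hidden
    b1 = levy_stable.rvs(alpha, 0.0, size=n, random_state=rng)  # hidden biases
    w2 = levy_stable.rvs(alpha, 0.0, size=n, random_state=rng)  # hidden -> output
    hidden = np.maximum(w1 * x + b1, 0.0)       # ReLU activation
    scale = (n * np.log(n)) ** (-1.0 / alpha)   # assumed rescaling (see lead-in)
    return scale * np.sum(w2 * hidden)

# As the width n grows, the rescaled output should remain non-degenerate and
# heavy-tailed, consistent with the alpha-Stable infinite-width limit above.
samples = [stable_relu_nn(1.0, n=5000, alpha=1.5, seed=s) for s in range(200)]
print(np.percentile(np.abs(samples), [50, 90, 99]))
```

Empirically, the gap between the median and the upper percentiles printed above is much wider than it would be for Gaussian weights, reflecting the heavy tails that persist in the infinite-width limit.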
Files in this record:

2651_Large_width_asymptotics_a.pdf (open access)

Description: Article
Type: Publisher's PDF (publisher's layout)
License: Creative Commons
Size: 5.48 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11565/4069436