Entropic gradient descent algorithms and wide flat minima

IRIS

The properties of flat minima in the empirical risk landscape of neural networks have been debated for some time. Increasing evidence suggests that they generalize better than sharp ones. In this work we first discuss the relationship between two alternative measures of flatness: the local entropy, which is useful for analysis and algorithm development, and the local energy, which is easier to compute and was shown empirically, in extensive tests on state-of-the-art networks, to be the best predictor of generalization capabilities. We show semi-analytically in simple controlled scenarios that these two measures correlate strongly with each other and with generalization. We then extend the analysis to the deep learning scenario through extensive numerical validation. We study two algorithms, entropy-stochastic gradient descent and replicated-stochastic gradient descent, that explicitly include the local entropy in the optimization objective. We devise a training schedule by which we consistently find flatter minima (according to both flatness measures) and improve the generalization error for common architectures (e.g. ResNet, EfficientNet).
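To make the idea of an easy-to-compute flatness measure concrete, here is a minimal sketch of a local-energy-style estimate. The abstract does not define the local energy, so the definition used here (average increase of the loss under multiplicative Gaussian noise on the weights) is an assumption, and the names `local_energy`, `quadratic`, `sigma`, and the toy quadratic losses are hypothetical illustrations rather than the authors' code.

```python
import random


def local_energy(loss, w, sigma, n_samples=1000, seed=0):
    """Monte Carlo estimate of the mean loss increase around w.

    Assumption: each weight is perturbed multiplicatively,
    w_i -> w_i * (1 + sigma * z_i) with z_i ~ N(0, 1).
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    base = loss(w)
    total = 0.0
    for _ in range(n_samples):
        perturbed = [wi * (1.0 + sigma * rng.gauss(0.0, 1.0)) for wi in w]
        total += loss(perturbed) - base
    return total / n_samples


def quadratic(curvature, center):
    """Toy loss with a single minimum at `center`; larger curvature = sharper."""
    def loss(w):
        return curvature * sum((wi - ci) ** 2 for wi, ci in zip(w, center))
    return loss


center = [1.0, -1.0]
sharp = quadratic(50.0, center)  # narrow minimum
flat = quadratic(1.0, center)    # wide minimum

e_sharp = local_energy(sharp, center, sigma=0.1)
e_flat = local_energy(flat, center, sigma=0.1)
print(f"sharp minimum: {e_sharp:.4f}, flat minimum: {e_flat:.4f}")
```

In this toy setting the wide minimum yields the smaller estimate, which is the sense in which a lower value indicates a flatter minimum. For a real network, `loss` would be the training error evaluated at the perturbed weights.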

Pittorino, Fabrizio; Lucibello, Carlo; Feinauer, Christoph; Perugini, Gabriele; Baldassi, Carlo; Demyanenko, Elizaveta; Zecchina, Riccardo

2021

Year	2021
Date of first online publication	2021
DOI	https://dx.doi.org/10.1088/1742-5468/ac3ae8
Journal	Journal of Statistical Mechanics: Theory and Experiment
URL	https://doi.org/10.1088/1742-5468/ac3ae8
All authors	Pittorino, Fabrizio; Lucibello, Carlo; Feinauer, Christoph; Perugini, Gabriele; Baldassi, Carlo; Demyanenko, Elizaveta; Zecchina, Riccardo
Publication type	01 - Article in academic journal

Files in this record:

File	Size	Format
Pittorino_2021_J._Stat._Mech._2021_124015.pdf (not available; description: pdf; type: publisher's layout PDF; license: not public, private/restricted access)	1.48 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11565/4044127
