Entropy-SGD: biasing gradient descent into wide valleys

Chaudhari, Pratik;Choromanska, Anna;Soatto, Stefano;LeCun, Yann;Baldassi, Carlo;Borgs, Christian;Chayes, Jennifer;Sagun, Levent;Zecchina, Riccardo

2019

Abstract

This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in the sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and, under certain assumptions, demonstrate improved generalization over SGD using uniform stability. Our experiments on convolutional and recurrent networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.
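The two nested loops described in the abstract can be illustrated with a minimal sketch. This is not taken from this record: it assumes the local entropy is built from the loss f plus a quadratic coupling (gamma/2)·||x − x'||², so that its gradient is estimated as gamma·(x − mu), where mu is a running average of the Langevin iterates x'. The function name `entropy_sgd` and all hyperparameter names and values are hypothetical, and a full gradient of a toy loss stands in for a mini-batch gradient.

```python
import numpy as np

def entropy_sgd(grad_f, x0, outer_steps=200, inner_steps=20, lr=0.1,
                inner_lr=0.1, gamma=1.0, noise=1e-3, alpha=0.75, seed=0):
    """Toy two-loop sketch; all names and defaults are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(outer_steps):
        x_prime = x.copy()  # Langevin iterate, restarted at the current weights
        mu = x.copy()       # running average of the Langevin iterates
        for _ in range(inner_steps):
            # Gradient of f(x') + (gamma / 2) * ||x - x'||^2 with respect to x'
            g = grad_f(x_prime) + gamma * (x_prime - x)
            # Langevin step: gradient descent plus scaled Gaussian noise
            eps = rng.standard_normal(x.shape)
            x_prime = x_prime - inner_lr * g + np.sqrt(inner_lr) * noise * eps
            # Exponential moving average of the iterates
            mu = (1.0 - alpha) * mu + alpha * x_prime
        # Outer step along the estimated local-entropy gradient gamma * (x - mu)
        x = x - lr * gamma * (x - mu)
    return x

# Usage on a toy quadratic loss f(x) = 0.5 * ||x||^2, whose gradient is x
x_final = entropy_sgd(lambda x: x, [3.0, -2.0])
print(np.linalg.norm(x_final))
```

In this sketch gamma controls how far the inner loop may wander from the current weights, which sets how wide a neighborhood the entropy estimate covers. The quadratic loss only checks that the iteration contracts toward a minimum; it says nothing about the generalization claims made in the abstract.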

Year: 2019
Date of first online publication: 2019
DOI: https://dx.doi.org/10.1088/1742-5468/ab39d9
Journal: Journal of Statistical Mechanics: Theory and Experiment
URL: http://dx.doi.org/10.1088/1742-5468/ab39d9
All authors: Chaudhari, Pratik; Choromanska, Anna; Soatto, Stefano; LeCun, Yann; Baldassi, Carlo; Borgs, Christian; Chayes, Jennifer; Sagun, Levent; Zecchina, Ricc...
Appears in types: 01 - Article in academic journal

Files in this item:

File	Size	Format
1611.01838.pdf (not available; Description: Main article; Type: Pre-print document; License: NOT PUBLIC - private/restricted access)	728.05 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11565/4023488
