Links for fulltext (may require subscription):
- Publisher website (DOI): 10.1007/s10994-019-05839-6
- Scopus: eid_2-s2.0-85074601535
- Web of Science: WOS:000494074500001
Article: Gradient descent optimizes over-parameterized deep ReLU networks
Title | Gradient descent optimizes over-parameterized deep ReLU networks |
---|---|
Authors | Zou, Difan; Cao, Yuan; Zhou, Dongruo; Gu, Quanquan |
Keywords | Over-parameterization; Deep neural networks; Global convergence; Gradient descent; Random initialization |
Issue Date | 2020 |
Citation | Machine Learning, 2020, v. 109, n. 3, p. 467-492 |
Abstract | We study the problem of training deep fully connected neural networks with the Rectified Linear Unit (ReLU) activation function and the cross-entropy loss function for binary classification using gradient descent. We show that with proper random weight initialization, gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under certain assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU network enjoys nice local curvature properties that ensure the global convergence of gradient descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for gradient descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work along this line (Allen-Zhu et al. in A convergence theory for deep learning via over-parameterization, 2018a; Du et al. in Gradient descent finds global minima of deep neural networks, 2018a), our result relies on a milder over-parameterization condition on the neural network width and enjoys a faster global convergence rate of gradient descent for training deep neural networks. |
Persistent Identifier | http://hdl.handle.net/10722/303629 |
ISSN | 0885-6125 (2023 Impact Factor: 4.3; 2023 SCImago Journal Rankings: 1.720) |
ISI Accession Number ID | WOS:000494074500001 |
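
The abstract above describes full-batch gradient descent on a deep fully connected ReLU network with cross-entropy loss for binary classification, starting from Gaussian random initialization. Below is a minimal, self-contained sketch of that setup, not the authors' code: the toy data, layer widths, He-style initialization scale, step size, and iteration count are all arbitrary illustrative choices.

```python
# Illustrative sketch (not the authors' code): full-batch gradient descent on a
# deep fully connected ReLU network with cross-entropy loss for binary
# classification, starting from Gaussian (He-style) random initialization.
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data: n examples in d dimensions, labels in {0, 1}.
n, d = 200, 10
X = rng.standard_normal((n, d))
y = (X[:, 0] > 0).astype(float)

# Layer widths; the paper's regime takes the hidden width much larger than n,
# the values here are only illustrative.
widths = [d, 256, 256, 1]
Ws = [rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)
      for fan_in, fan_out in zip(widths[:-1], widths[1:])]

def forward(X, Ws):
    """Forward pass; returns the logits and cached post-activations for backprop."""
    acts = [X]
    h = X
    for W in Ws[:-1]:
        h = np.maximum(h @ W, 0.0)          # ReLU
        acts.append(h)
    return (h @ Ws[-1]).ravel(), acts

def bce_with_logits(logits, y):
    """Numerically stable binary cross-entropy loss on raw logits."""
    return np.mean(np.maximum(logits, 0) - logits * y
                   + np.log1p(np.exp(-np.abs(logits))))

lr = 0.1
for step in range(500):
    logits, acts = forward(X, Ws)
    delta = (1.0 / (1.0 + np.exp(-logits)) - y)[:, None] / n   # dLoss/dlogits
    grads = [None] * len(Ws)
    for i in range(len(Ws) - 1, -1, -1):     # backpropagate through the layers
        grads[i] = acts[i].T @ delta
        if i > 0:
            delta = (delta @ Ws[i].T) * (acts[i] > 0)
    Ws = [W - lr * G for W, G in zip(Ws, grads)]   # plain gradient descent step

print("final training loss:", bce_with_logits(forward(X, Ws)[0], y))
```

The sketch only mirrors the training procedure; the over-parameterization condition analyzed in the paper requires hidden widths far larger than is practical to demonstrate here.
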
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zou, Difan | - |
dc.contributor.author | Cao, Yuan | - |
dc.contributor.author | Zhou, Dongruo | - |
dc.contributor.author | Gu, Quanquan | - |
dc.date.accessioned | 2021-09-15T08:25:42Z | - |
dc.date.available | 2021-09-15T08:25:42Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Machine Learning, 2020, v. 109, n. 3, p. 467-492 | - |
dc.identifier.issn | 0885-6125 | - |
dc.identifier.uri | http://hdl.handle.net/10722/303629 | - |
dc.description.abstract | We study the problem of training deep fully connected neural networks with the Rectified Linear Unit (ReLU) activation function and the cross-entropy loss function for binary classification using gradient descent. We show that with proper random weight initialization, gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under certain assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU network enjoys nice local curvature properties that ensure the global convergence of gradient descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for gradient descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work along this line (Allen-Zhu et al. in A convergence theory for deep learning via over-parameterization, 2018a; Du et al. in Gradient descent finds global minima of deep neural networks, 2018a), our result relies on a milder over-parameterization condition on the neural network width and enjoys a faster global convergence rate of gradient descent for training deep neural networks. | -
dc.language | eng | - |
dc.relation.ispartof | Machine Learning | - |
dc.subject | Over-parameterization | - |
dc.subject | Deep neural networks | - |
dc.subject | Global convergence | - |
dc.subject | Gradient descent | - |
dc.subject | Random initialization | - |
dc.title | Gradient descent optimizes over-parameterized deep ReLU networks | - |
dc.type | Article | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1007/s10994-019-05839-6 | - |
dc.identifier.scopus | eid_2-s2.0-85074601535 | - |
dc.identifier.volume | 109 | - |
dc.identifier.issue | 3 | - |
dc.identifier.spage | 467 | - |
dc.identifier.epage | 492 | - |
dc.identifier.eissn | 1573-0565 | - |
dc.identifier.isi | WOS:000494074500001 | - |
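
A further illustrative check, under the same kind of hypothetical choices (width, step size, step count) as the sketch above, of the abstract's observation that the gradient-descent iterates remain in a small perturbation region around the Gaussian initialization: it trains a two-hidden-layer ReLU network and reports the relative Frobenius distance of each weight matrix from its initial value. This is not the authors' experiment; it only illustrates the quantity their analysis tracks.

```python
# Illustrative check (not from the paper): how far does gradient descent move the
# weights from their Gaussian initialization?  The paper's analysis shows that, for
# sufficiently wide networks, the iterates stay inside a small perturbation region
# around the initial weights; width, step size, and step count here are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 100, 10, 512                        # samples, input dim, hidden width
X = rng.standard_normal((n, d))
y = (X[:, 0] > 0).astype(float)

# Two-hidden-layer ReLU network with He-style Gaussian initialization.
W1 = rng.standard_normal((d, m)) * np.sqrt(2.0 / d)
W2 = rng.standard_normal((m, m)) * np.sqrt(2.0 / m)
W3 = rng.standard_normal((m, 1)) * np.sqrt(2.0 / m)
W1_0, W2_0, W3_0 = W1.copy(), W2.copy(), W3.copy()

lr = 0.1
for step in range(300):
    # Forward pass.
    h1 = np.maximum(X @ W1, 0.0)
    h2 = np.maximum(h1 @ W2, 0.0)
    logits = (h2 @ W3).ravel()
    # Backward pass for the cross-entropy (logistic) loss.
    g = (1.0 / (1.0 + np.exp(-logits)) - y)[:, None] / n      # dLoss/dlogits
    gW3 = h2.T @ g
    d2 = (g @ W3.T) * (h2 > 0)
    gW2 = h1.T @ d2
    d1 = (d2 @ W2.T) * (h1 > 0)
    gW1 = X.T @ d1
    W1, W2, W3 = W1 - lr * gW1, W2 - lr * gW2, W3 - lr * gW3

# Relative Frobenius distance of each layer from its initialization; in the wide
# regime these ratios stay small throughout training.
for name, W, W0 in [("W1", W1, W1_0), ("W2", W2, W2_0), ("W3", W3, W3_0)]:
    print(name, np.linalg.norm(W - W0) / np.linalg.norm(W0))
```
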