Links for fulltext (may require subscription):
- Publisher website (DOI): 10.1007/s10994-019-05839-6
- Scopus: eid_2-s2.0-85074601535
- Web of Science: WOS:000494074500001
Article: Gradient descent optimizes over-parameterized deep ReLU networks
Title | Gradient descent optimizes over-parameterized deep ReLU networks |
---|---|
Authors | Zou, Difan; Cao, Yuan; Zhou, Dongruo; Gu, Quanquan |
Keywords | Over-parameterization; Deep neural networks; Global convergence; Gradient descent; Random initialization |
Issue Date | 2020 |
Citation | Machine Learning, 2020, v. 109, n. 3, p. 467-492 |
Abstract | We study the problem of training deep fully connected neural networks with the Rectified Linear Unit (ReLU) activation function and the cross-entropy loss function for binary classification using gradient descent. We show that with proper random weight initialization, gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under certain assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU network enjoys nice local curvature properties that ensure the global convergence of gradient descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for gradient descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work along this line (Allen-Zhu et al. in A convergence theory for deep learning via over-parameterization, 2018a; Du et al. in Gradient descent finds global minima of deep neural networks, 2018a), our result relies on a milder over-parameterization condition on the neural network width and enjoys a faster global convergence rate of gradient descent for training deep neural networks. |
Persistent Identifier | http://hdl.handle.net/10722/303629 |
ISSN | 0885-6125 (2023 Impact Factor: 4.3; 2023 SCImago Journal Rankings: 1.720) |
ISI Accession Number ID | WOS:000494074500001 |
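
The abstract above describes full-batch gradient descent on a deep fully connected ReLU network with cross-entropy loss for binary classification, starting from Gaussian random initialization. Below is a minimal, self-contained sketch of that setup, not the authors' code: the toy data, layer widths, He-style initialization scale, step size, and iteration count are all arbitrary illustrative choices.

```python
# Illustrative sketch (not the authors' code): full-batch gradient descent on a
# deep fully connected ReLU network with cross-entropy loss for binary
# classification, starting from Gaussian (He-style) random initialization.
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data: n examples in d dimensions, labels in {0, 1}.
n, d = 200, 10
X = rng.standard_normal((n, d))
y = (X[:, 0] > 0).astype(float)

# Layer widths; the paper's regime takes the hidden width much larger than n,
# the values here are only illustrative.
widths = [d, 256, 256, 1]
Ws = [rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)
      for fan_in, fan_out in zip(widths[:-1], widths[1:])]

def forward(X, Ws):
    """Forward pass; returns the logits and cached post-activations for backprop."""
    acts = [X]
    h = X
    for W in Ws[:-1]:
        h = np.maximum(h @ W, 0.0)          # ReLU
        acts.append(h)
    return (h @ Ws[-1]).ravel(), acts

def bce_with_logits(logits, y):
    """Numerically stable binary cross-entropy loss on raw logits."""
    return np.mean(np.maximum(logits, 0) - logits * y
                   + np.log1p(np.exp(-np.abs(logits))))

lr = 0.1
for step in range(500):
    logits, acts = forward(X, Ws)
    delta = (1.0 / (1.0 + np.exp(-logits)) - y)[:, None] / n   # dLoss/dlogits
    grads = [None] * len(Ws)
    for i in range(len(Ws) - 1, -1, -1):     # backpropagate through the layers
        grads[i] = acts[i].T @ delta
        if i > 0:
            delta = (delta @ Ws[i].T) * (acts[i] > 0)
    Ws = [W - lr * G for W, G in zip(Ws, grads)]   # plain gradient descent step

print("final training loss:", bce_with_logits(forward(X, Ws)[0], y))
```

The sketch only mirrors the training procedure; the over-parameterization condition analyzed in the paper requires hidden widths far larger than is practical to demonstrate here.
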
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zou, Difan | - |
dc.contributor.author | Cao, Yuan | - |
dc.contributor.author | Zhou, Dongruo | - |
dc.contributor.author | Gu, Quanquan | - |
dc.date.accessioned | 2021-09-15T08:25:42Z | - |
dc.date.available | 2021-09-15T08:25:42Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Machine Learning, 2020, v. 109, n. 3, p. 467-492 | - |
dc.identifier.issn | 0885-6125 | - |
dc.identifier.uri | http://hdl.handle.net/10722/303629 | - |
dc.description.abstract | We study the problem of training deep fully connected neural networks with the Rectified Linear Unit (ReLU) activation function and the cross-entropy loss function for binary classification using gradient descent. We show that with proper random weight initialization, gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under certain assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU network enjoys nice local curvature properties that ensure the global convergence of gradient descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for gradient descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work along this line (Allen-Zhu et al. in A convergence theory for deep learning via over-parameterization, 2018a; Du et al. in Gradient descent finds global minima of deep neural networks, 2018a), our result relies on a milder over-parameterization condition on the neural network width and enjoys a faster global convergence rate of gradient descent for training deep neural networks. | -
dc.language | eng | - |
dc.relation.ispartof | Machine Learning | - |
dc.subject | Over-parameterization | - |
dc.subject | Deep neural networks | - |
dc.subject | Global convergence | - |
dc.subject | Gradient descent | - |
dc.subject | Random initialization | - |
dc.title | Gradient descent optimizes over-parameterized deep ReLU networks | - |
dc.type | Article | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1007/s10994-019-05839-6 | - |
dc.identifier.scopus | eid_2-s2.0-85074601535 | - |
dc.identifier.volume | 109 | - |
dc.identifier.issue | 3 | - |
dc.identifier.spage | 467 | - |
dc.identifier.epage | 492 | - |
dc.identifier.eissn | 1573-0565 | - |
dc.identifier.isi | WOS:000494074500001 | - |
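
A further illustrative check, under the same kind of hypothetical choices (width, step size, step count) as the sketch above, of the abstract's observation that the gradient-descent iterates remain in a small perturbation region around the Gaussian initialization: it trains a two-hidden-layer ReLU network and reports the relative Frobenius distance of each weight matrix from its initial value. This is not the authors' experiment; it only illustrates the quantity their analysis tracks.

```python
# Illustrative check (not from the paper): how far does gradient descent move the
# weights from their Gaussian initialization?  The paper's analysis shows that, for
# sufficiently wide networks, the iterates stay inside a small perturbation region
# around the initial weights; width, step size, and step count here are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 100, 10, 512                        # samples, input dim, hidden width
X = rng.standard_normal((n, d))
y = (X[:, 0] > 0).astype(float)

# Two-hidden-layer ReLU network with He-style Gaussian initialization.
W1 = rng.standard_normal((d, m)) * np.sqrt(2.0 / d)
W2 = rng.standard_normal((m, m)) * np.sqrt(2.0 / m)
W3 = rng.standard_normal((m, 1)) * np.sqrt(2.0 / m)
W1_0, W2_0, W3_0 = W1.copy(), W2.copy(), W3.copy()

lr = 0.1
for step in range(300):
    # Forward pass.
    h1 = np.maximum(X @ W1, 0.0)
    h2 = np.maximum(h1 @ W2, 0.0)
    logits = (h2 @ W3).ravel()
    # Backward pass for the cross-entropy (logistic) loss.
    g = (1.0 / (1.0 + np.exp(-logits)) - y)[:, None] / n      # dLoss/dlogits
    gW3 = h2.T @ g
    d2 = (g @ W3.T) * (h2 > 0)
    gW2 = h1.T @ d2
    d1 = (d2 @ W2.T) * (h1 > 0)
    gW1 = X.T @ d1
    W1, W2, W3 = W1 - lr * gW1, W2 - lr * gW2, W3 - lr * gW3

# Relative Frobenius distance of each layer from its initialization; in the wide
# regime these ratios stay small throughout training.
for name, W, W0 in [("W1", W1, W1_0), ("W2", W2, W2_0), ("W3", W3, W3_0)]:
    print(name, np.linalg.norm(W - W0) / np.linalg.norm(W0))
```
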