Closing the generalization gap of adaptive gradient methods in training deep neural networks

Chen, Jinghui; Zhou, Dongruo; Tang, Yiqi; Yang, Ziyan; Cao, Yuan; Gu, Quanquan

Publisher Website: 10.24963/ijcai.2020/452
Scopus: eid_2-s2.0-85095192038
Appears in Collections:
- Statistics & Actuarial Science: Conference papers

Conference Paper: Closing the generalization gap of adaptive gradient methods in training deep neural networks

Title	Closing the generalization gap of adaptive gradient methods in training deep neural networks
Authors	Chen, Jinghui; Zhou, Dongruo; Tang, Yiqi; Yang, Ziyan; Cao, Yuan; Gu, Quanquan
Issue Date	2020
Citation	Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020, p. 3267-3275. DOI: http://dx.doi.org/10.24963/ijcai.2020/452
Abstract	Adaptive gradient methods, which use historical gradient information to automatically adjust the learning rate, converge quickly but have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks. This leaves open the problem of how to close the generalization gap of adaptive gradient methods. In this work, we show that adaptive gradient methods such as Adam and Amsgrad are sometimes “over adapted”. We design a new algorithm, called the partially adaptive momentum estimation method, which unifies Adam/Amsgrad with SGD by introducing a partial adaptive parameter p, to achieve the best of both worlds. We also prove the convergence rate of our proposed algorithm to a stationary point in the stochastic nonconvex optimization setting. Experiments on standard benchmarks show that our proposed algorithm can maintain a convergence rate as fast as Adam/Amsgrad while generalizing as well as SGD in training deep neural networks. These results suggest that practitioners pick up adaptive gradient methods once again for faster training of deep neural networks.
Persistent Identifier	http://hdl.handle.net/10722/303708
ISSN	1045-0823
SCImago Journal Rankings (2020)	0.649
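
The abstract describes a partial adaptive parameter p that interpolates between Adam/Amsgrad and SGD with momentum, but this record does not give the update rule itself. The sketch below is one plausible reading, not the authors' reference implementation: it assumes an Amsgrad-style running maximum of the second-moment estimate and divides the momentum term by that estimate raised to the power p, so that p = 0.5 gives an Amsgrad-like step and p = 0 gives a plain momentum step. The names `padam_step` and `init_state`, the default hyperparameters, and the small `eps` term are all illustrative assumptions.

```python
def init_state(n):
    """Zero-initialised optimizer state for n parameters (hypothetical helper)."""
    return {"m": [0.0] * n, "v": [0.0] * n, "v_hat": [0.0] * n}


def padam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999,
               p=0.125, eps=1e-8):
    """One partially adaptive update on a list of floats.

    Assumed form (not stated in this record):
        m     = beta1 * m + (1 - beta1) * g
        v     = beta2 * v + (1 - beta2) * g^2
        v_hat = max(v_hat, v)            # Amsgrad-style running maximum
        theta = theta - lr * m / (v_hat + eps)^p
    eps is added here only for numerical safety.
    """
    new_theta = []
    for i, (x, g) in enumerate(zip(theta, grad)):
        m = beta1 * state["m"][i] + (1 - beta1) * g
        v = beta2 * state["v"][i] + (1 - beta2) * g * g
        v_hat = max(state["v_hat"][i], v)
        state["m"][i], state["v"][i], state["v_hat"][i] = m, v, v_hat
        # p controls how strongly the step is rescaled per coordinate.
        new_theta.append(x - lr * m / (v_hat + eps) ** p)
    return new_theta


if __name__ == "__main__":
    # Same gradient, three settings of p, each starting from fresh state.
    for p in (0.0, 0.125, 0.5):
        print(p, padam_step([1.0], [2.0], init_state(1), p=p))
```

With p = 0 the denominator is exactly 1 and the step reduces to `lr * m`, which is the sense in which a single parameter can span both families of methods under this assumed form.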

DC Field	Value	Language
dc.contributor.author	Chen, Jinghui	-
dc.contributor.author	Zhou, Dongruo	-
dc.contributor.author	Tang, Yiqi	-
dc.contributor.author	Yang, Ziyan	-
dc.contributor.author	Cao, Yuan	-
dc.contributor.author	Gu, Quanquan	-
dc.date.accessioned	2021-09-15T08:25:51Z	-
dc.date.available	2021-09-15T08:25:51Z	-
dc.date.issued	2020	-
dc.identifier.citation	Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020, p. 3267-3275	-
dc.identifier.issn	1045-0823	-
dc.identifier.uri	http://hdl.handle.net/10722/303708	-
dc.description.abstract	Adaptive gradient methods, which use historical gradient information to automatically adjust the learning rate, converge quickly but have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks. This leaves open the problem of how to close the generalization gap of adaptive gradient methods. In this work, we show that adaptive gradient methods such as Adam and Amsgrad are sometimes “over adapted”. We design a new algorithm, called the partially adaptive momentum estimation method, which unifies Adam/Amsgrad with SGD by introducing a partial adaptive parameter p, to achieve the best of both worlds. We also prove the convergence rate of our proposed algorithm to a stationary point in the stochastic nonconvex optimization setting. Experiments on standard benchmarks show that our proposed algorithm can maintain a convergence rate as fast as Adam/Amsgrad while generalizing as well as SGD in training deep neural networks. These results suggest that practitioners pick up adaptive gradient methods once again for faster training of deep neural networks.	-
dc.language	eng	-
dc.relation.ispartof	Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence	-
dc.title	Closing the generalization gap of adaptive gradient methods in training deep neural networks	-
dc.type	Conference_Paper	-
dc.description.nature	link_to_OA_fulltext	-
dc.identifier.doi	10.24963/ijcai.2020/452	-
dc.identifier.scopus	eid_2-s2.0-85095192038	-
dc.identifier.spage	3267	-
dc.identifier.epage	3275	-
