Conference Paper: Revisiting Over-smoothing in BERT from the Perspective of Graph
Field | Value |
---|---|
Title | Revisiting Over-smoothing in BERT from the Perspective of Graph |
Authors | SHI, H; GAO, J; XU, H; LIANG, X; LI, Z; Kong, L; Lee, SMS; KWOK, JT |
Keywords | BERT; Over-smoothing; Transformer |
Issue Date | 2022 |
Publisher | International Conference on Learning Representations. |
Citation | The Tenth International Conference on Learning Representations (ICLR 2022) (Virtual), April 25-29, 2022 |
Abstract | Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language fields. However, no existing work has delved deeper to investigate its main cause. In this work, we attempt to analyze the over-smoothing problem from the perspective of graphs, where this problem was first discovered and explored. Intuitively, the self-attention matrix can be seen as the normalized adjacency matrix of a corresponding graph. Based on this connection, we provide some theoretical analysis and find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models. Specifically, if the standard deviation of layer normalization is sufficiently large, the output of the Transformer stack will converge to a specific low-rank subspace and result in over-smoothing. To alleviate the over-smoothing problem, we consider hierarchical fusion strategies, which adaptively combine the representations from different layers to make the output more diverse. Extensive experimental results on various datasets illustrate the effectiveness of our fusion method. |
Persistent Identifier | http://hdl.handle.net/10722/315040 |
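The abstract's observation that the self-attention matrix can be read as a normalized adjacency matrix is easy to check numerically. The snippet below is a minimal sketch (not the authors' code; the shapes, projections, and seed are illustrative) verifying that a row-wise softmax over attention scores coincides with the random-walk normalization D^{-1}A of a dense weighted token graph.

```python
import torch

torch.manual_seed(0)
n, d = 6, 16                                  # illustrative: 6 tokens, head dim 16
X = torch.randn(n, d)                         # token representations entering a layer
Wq = torch.randn(d, d) / d ** 0.5             # toy query/key projections
Wk = torch.randn(d, d) / d ** 0.5

scores = (X @ Wq) @ (X @ Wk).T / d ** 0.5     # scaled dot-product attention scores
P = torch.softmax(scores, dim=-1)             # self-attention matrix

# Read P as D^{-1} A: edge weights A = exp(scores) on a dense token graph,
# D = diagonal matrix of row sums. Row-wise softmax and random-walk
# normalization give the same matrix, so P is row-stochastic like a
# normalized adjacency matrix.
A = torch.exp(scores)
D_inv = torch.diag(1.0 / A.sum(dim=-1))
print(torch.allclose(P, D_inv @ A, atol=1e-5))        # True
print(torch.allclose(P.sum(dim=-1), torch.ones(n)))   # rows sum to 1
```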
DC Field | Value | Language |
---|---|---|
dc.contributor.author | SHI, H | - |
dc.contributor.author | GAO, J | - |
dc.contributor.author | XU, H | - |
dc.contributor.author | LIANG, X | - |
dc.contributor.author | LI, Z | - |
dc.contributor.author | Kong, L | - |
dc.contributor.author | Lee, SMS | - |
dc.contributor.author | KWOK, J T | - |
dc.date.accessioned | 2022-08-05T09:39:07Z | - |
dc.date.available | 2022-08-05T09:39:07Z | - |
dc.date.issued | 2022 | - |
dc.identifier.citation | The Tenth International Conference on Learning Representations (ICLR 2022) (Virtual), April 25-29, 2022 | - |
dc.identifier.uri | http://hdl.handle.net/10722/315040 | - |
dc.description.abstract | Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language fields. However, no existing work has delved deeper to investigate its main cause. In this work, we attempt to analyze the over-smoothing problem from the perspective of graphs, where this problem was first discovered and explored. Intuitively, the self-attention matrix can be seen as the normalized adjacency matrix of a corresponding graph. Based on this connection, we provide some theoretical analysis and find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models. Specifically, if the standard deviation of layer normalization is sufficiently large, the output of the Transformer stack will converge to a specific low-rank subspace and result in over-smoothing. To alleviate the over-smoothing problem, we consider hierarchical fusion strategies, which adaptively combine the representations from different layers to make the output more diverse. Extensive experimental results on various datasets illustrate the effectiveness of our fusion method. | - |
dc.language | eng | - |
dc.publisher | International Conference on Learning Representations. | - |
dc.subject | BERT | - |
dc.subject | Over-smoothing | - |
dc.subject | Transformer | - |
dc.title | Revisiting Over-smoothing in BERT from the Perspective of Graph | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Kong, L: lpk@cs.hku.hk | - |
dc.identifier.email | Lee, SMS: smslee@hku.hk | - |
dc.identifier.authority | Kong, L=rp02775 | - |
dc.identifier.authority | Lee, SMS=rp00726 | - |
dc.identifier.hkuros | 335345 | - |
dc.publisher.place | United States | - |
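The hierarchical fusion strategy described in the abstract adaptively combines representations from different layers. The sketch below shows one generic way such fusion can be realized, a learned softmax-weighted mix over layer outputs; it illustrates the idea only and is not the paper's exact fusion method, and all class names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class LayerFusion(nn.Module):
    """Illustrative layer-fusion module (not the paper's exact method):
    adaptively combine the hidden states of all encoder layers with
    learned softmax weights, so the output mixes shallow and deep
    representations instead of relying on the last layer alone."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers))  # one logit per layer

    def forward(self, layer_outputs):
        # layer_outputs: list of [batch, seq_len, hidden] tensors, one per layer
        stacked = torch.stack(layer_outputs, dim=0)           # [L, B, T, H]
        weights = torch.softmax(self.logits, dim=0)           # adaptive, sums to 1
        return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)

# Usage with dummy hidden states from a hypothetical 12-layer encoder.
fusion = LayerFusion(num_layers=12)
hidden_states = [torch.randn(2, 8, 768) for _ in range(12)]
print(fusion(hidden_states).shape)   # torch.Size([2, 8, 768])
```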