
Conference Paper: Transformer learns optimal variable selection in group-sparse classification

Title: Transformer learns optimal variable selection in group-sparse classification
Authors: ZHANG, Chenyang; Meng, Xuran; Cao, Yuan
Issue Date: 11-Apr-2025
Abstract

Transformers have demonstrated remarkable success across various applications. However, this success has not been well understood in theory. In this work, we give a case study of how transformers can be trained to learn a classic statistical model with "group sparsity", where the input variables form multiple groups and the label depends only on the variables from one of the groups. We theoretically demonstrate that a one-layer transformer trained by gradient descent can correctly leverage the attention mechanism to select variables, disregarding irrelevant ones and focusing on those beneficial for classification. We also demonstrate that a well-pretrained one-layer transformer can be adapted to new downstream tasks to achieve good prediction accuracy with a limited number of samples. Our study sheds light on how transformers effectively learn structured data.
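
To make the setting concrete, the following is a minimal, hypothetical sketch of a group-sparse classification problem and a simplified one-layer attention model; it is not the paper's exact architecture or training regime. All names and hyperparameters (G, d, g_star, the learning rate, the step count) are illustrative assumptions: each input consists of G tokens, each a d-dimensional variable group, the label is a linear rule on group g_star alone, and trainable attention logits over token positions stand in for the attention mechanism.

    import torch

    torch.manual_seed(0)
    G, d, n = 5, 10, 2000      # number of groups, group dimension, sample size (illustrative)
    g_star = 2                 # index of the label-relevant group (hypothetical)
    w_star = torch.randn(d) / d ** 0.5

    # Group-sparse data: each input is G tokens of dimension d, and the label
    # depends only on the token in group g_star.
    X = torch.randn(n, G, d)
    y = torch.sign(X[:, g_star, :] @ w_star)

    # Simplified one-layer "transformer": trainable attention logits p over the
    # G positions select a group; a linear head w classifies the attended token.
    p = torch.zeros(G, requires_grad=True)
    w = torch.zeros(d, requires_grad=True)

    opt = torch.optim.SGD([p, w], lr=0.5)
    for step in range(500):
        attn = torch.softmax(p, dim=0)              # attention weights over groups
        h = (attn[None, :, None] * X).sum(dim=1)    # attended representation, shape (n, d)
        loss = torch.nn.functional.softplus(-y * (h @ w)).mean()  # logistic loss
        opt.zero_grad()
        loss.backward()
        opt.step()

    print("attention over groups:", torch.softmax(p, dim=0).detach())

After training, the attention weight on position g_star should approach 1 while the weights on the irrelevant groups decay, illustrating the variable-selection behavior the abstract describes.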


Persistent Identifier: http://hdl.handle.net/10722/359514

DC Field | Value | Language
dc.contributor.author | ZHANG, Chenyang | -
dc.contributor.author | Meng, Xuran | -
dc.contributor.author | Cao, Yuan | -
dc.date.accessioned | 2025-09-07T00:30:50Z | -
dc.date.available | 2025-09-07T00:30:50Z | -
dc.date.issued | 2025-04-11 | -
dc.identifier.uri | http://hdl.handle.net/10722/359514 | -
dc.description.abstract | Transformers have demonstrated remarkable success across various applications. However, this success has not been well understood in theory. In this work, we give a case study of how transformers can be trained to learn a classic statistical model with "group sparsity", where the input variables form multiple groups and the label depends only on the variables from one of the groups. We theoretically demonstrate that a one-layer transformer trained by gradient descent can correctly leverage the attention mechanism to select variables, disregarding irrelevant ones and focusing on those beneficial for classification. We also demonstrate that a well-pretrained one-layer transformer can be adapted to new downstream tasks to achieve good prediction accuracy with a limited number of samples. Our study sheds light on how transformers effectively learn structured data. | -
dc.language | eng | -
dc.relation.ispartof | The 13th International Conference on Learning Representations (ICLR) (24/04/2025-28/04/2025, Singapore) | -
dc.title | Transformer learns optimal variable selection in group-sparse classification | -
dc.type | Conference_Paper | -

Export: via the OAI-PMH interface in XML formats, or to other non-XML formats.
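
As a rough illustration of what an oai_dc XML export of this record could look like, here is a sketch using Python's standard library. The namespace URIs are the standard OAI/Dublin Core ones, but the field crosswalk (e.g., dc.contributor.author exposed as dc:creator) and the repository's actual output are assumptions.

    import xml.etree.ElementTree as ET

    # Standard OAI / Dublin Core namespace URIs (part of the public specs).
    OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
    DC = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)

    # The record above as (element, value) pairs; mapping dc.contributor.author
    # to dc:creator is an assumption about the repository's crosswalk.
    record = [
        ("creator", "ZHANG, Chenyang"),
        ("creator", "Meng, Xuran"),
        ("creator", "Cao, Yuan"),
        ("date", "2025-04-11"),
        ("identifier", "http://hdl.handle.net/10722/359514"),
        ("language", "eng"),
        ("title", "Transformer learns optimal variable selection "
                  "in group-sparse classification"),
        ("type", "Conference_Paper"),
    ]

    root = ET.Element(f"{{{OAI_DC}}}dc")
    for element, value in record:
        ET.SubElement(root, f"{{{DC}}}{element}").text = value

    print(ET.tostring(root, encoding="unicode"))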