Conference Paper: Transformer learns optimal variable selection in group-sparse classification
| Title | Transformer learns optimal variable selection in group-sparse classification |
|---|---|
| Authors | ZHANG, Chenyang; Meng, Xuran; Cao, Yuan |
| Issue Date | 11-Apr-2025 |
| Abstract | Transformers have demonstrated remarkable success across various applications. However, the success of transformers has not been well understood in theory. In this work, we give a case study of how transformers can be trained to learn a classic statistical model with "group sparsity", where the input variables form multiple groups, and the label only depends on the variables from one of the groups. We theoretically demonstrate that a one-layer transformer trained by gradient descent can correctly leverage the attention mechanism to select variables, disregarding irrelevant ones and focusing on those beneficial for classification. We also demonstrate that a well-pretrained one-layer transformer can be adapted to new downstream tasks to achieve good prediction accuracy with a limited number of samples. Our study sheds light on how transformers effectively learn structured data. |
| Persistent Identifier | http://hdl.handle.net/10722/359514 |
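
The group-sparse classification setup described in the abstract can be illustrated with a small synthetic-data sketch. The paper does not specify this exact construction; the group count, dimensions, and the linear labeling rule below are illustrative assumptions, and the per-group correlation score is only a stand-in for the attention-based variable selection the paper analyzes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper):
# G groups of d variables each; only group g_star determines the label.
G, d, n = 5, 10, 200
g_star = 2

# Ground-truth direction supported entirely on group g_star.
w = np.zeros(G * d)
w[g_star * d:(g_star + 1) * d] = rng.normal(size=d)

# Inputs: all G*d variables drawn i.i.d.; the irrelevant groups
# carry no information about the label.
X = rng.normal(size=(n, G * d))
y = np.sign(X @ w)  # label depends only on variables in group g_star

# A transformer that "selects variables" should, in effect, put its
# attention weight on group g_star. As a proxy, compare each group's
# correlation with the label: only group g_star is informative.
for g in range(G):
    block = X[:, g * d:(g + 1) * d]
    score = np.abs(block.T @ y).mean()
    print(f"group {g}: mean |correlation with label| = {score:.2f}")
```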
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | ZHANG, Chenyang | - |
| dc.contributor.author | Meng, Xuran | - |
| dc.contributor.author | Cao, Yuan | - |
| dc.date.accessioned | 2025-09-07T00:30:50Z | - |
| dc.date.available | 2025-09-07T00:30:50Z | - |
| dc.date.issued | 2025-04-11 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/359514 | - |
| dc.description.abstract | Transformers have demonstrated remarkable success across various applications. However, the success of transformers has not been well understood in theory. In this work, we give a case study of how transformers can be trained to learn a classic statistical model with "group sparsity", where the input variables form multiple groups, and the label only depends on the variables from one of the groups. We theoretically demonstrate that a one-layer transformer trained by gradient descent can correctly leverage the attention mechanism to select variables, disregarding irrelevant ones and focusing on those beneficial for classification. We also demonstrate that a well-pretrained one-layer transformer can be adapted to new downstream tasks to achieve good prediction accuracy with a limited number of samples. Our study sheds light on how transformers effectively learn structured data. | - |
| dc.language | eng | - |
| dc.relation.ispartof | The 13th International Conference on Learning Representations (ICLR) (24/04/2025-28/04/2025, Singapore) | - |
| dc.title | Transformer learns optimal variable selection in group-sparse classification | - |
| dc.type | Conference_Paper | - |
