Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1109/TNNLS.2025.3550979
- Scopus: eid_2-s2.0-105002250309
- PMID: 40178959

Article: TransXNet: Learning Both Global and Local Dynamics With a Dual Dynamic Token Mixer for Visual Recognition
| Title | TransXNet: Learning Both Global and Local Dynamics With a Dual Dynamic Token Mixer for Visual Recognition |
|---|---|
| Authors | Lou, Meng; Zhang, Shu; Zhou, Hong Yu; Yang, Sibei; Wu, Chuan; Yu, Yizhou |
| Keywords | Dual Dynamic Token Mixer; Vision Transformer; Visual recognition |
| Issue Date | 1-Jan-2025 |
| Publisher | Institute of Electrical and Electronics Engineers |
| Citation | IEEE Transactions on Neural Networks and Learning Systems, 2025, v. 36, n. 6, p. 11534-11547 |
| Abstract | Recent studies have integrated convolutions into transformers to introduce inductive bias and improve generalization performance. However, the static nature of conventional convolution prevents it from dynamically adapting to input variations, resulting in a representation discrepancy between convolution and self-attention as self-attention calculates attention matrices dynamically. Furthermore, when stacking token mixers that consist of convolution and self-attention to form a deep network, the static nature of convolution hinders the fusion of features previously generated by self-attention into convolution kernels. These two limitations result in a suboptimal representation capacity of the constructed networks. To find a solution, we propose a lightweight dual dynamic token mixer (D-Mixer) to simultaneously learn global and local dynamics, that is, mechanisms that compute weights for aggregating global contexts and local details in an input-dependent manner. D-Mixer works by applying an efficient global attention module and an input-dependent depthwise convolution separately on evenly split feature segments, endowing the network with strong inductive bias and an enlarged effective receptive field. We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN–transformer vision backbone network that delivers compelling performance. In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost. Furthermore, TransXNet-S and TransXNet-B exhibit excellent model scalability, achieving top-1 accuracy of 83.8% and 84.6%, respectively, with reasonable computational costs. In addition, our proposed network architecture demonstrates strong generalization capabilities in various dense prediction tasks, outperforming other state-of-the-art networks while having lower computational costs. |
| Persistent Identifier | http://hdl.handle.net/10722/361929 |
| ISSN | 2162-237X (2023 Impact Factor: 10.2; 2023 SCImago Journal Rankings: 4.170) |
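
The abstract describes D-Mixer as splitting the feature channels evenly, applying a global attention module to one half and an input-dependent depthwise convolution to the other. The NumPy sketch below illustrates only that split-and-mix idea and is not the authors' implementation. This record does not specify the attention design or how kernels are generated, so plain softmax self-attention stands in for the "efficient global attention module", and the kernel generator (`w_gen`, applied to globally averaged features) is a hypothetical choice. All function names are likewise assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(x):
    """Plain single-head self-attention over all positions (stand-in module).
    x has shape (C, H, W); the attention weights depend on the input."""
    c, h, w = x.shape
    tokens = x.reshape(c, h * w).T              # (N, C) with N = H * W
    scores = tokens @ tokens.T / np.sqrt(c)     # (N, N) pairwise similarities
    out = softmax(scores, axis=-1) @ tokens     # weighted sum of all tokens
    return out.T.reshape(c, h, w)

def dynamic_depthwise_conv(x, w_gen, k=3):
    """Depthwise convolution whose per-channel kernels are computed from x.
    w_gen has shape (C * k * k, C) and is a hypothetical kernel generator."""
    c, h, w = x.shape
    context = x.mean(axis=(1, 2))               # (C,) global summary of the input
    kernels = (w_gen @ context).reshape(c, k * k)
    kernels = softmax(kernels, axis=-1).reshape(c, k, k)
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            # Each channel uses its own k x k kernel (depthwise).
            out += kernels[:, i, j, None, None] * xp[:, i:i + h, j:j + w]
    return out

def d_mixer_sketch(x, w_gen, k=3):
    """Split channels evenly, mix each half separately, then concatenate."""
    c = x.shape[0]
    assert c % 2 == 0, "channel count must be even for an even split"
    x_global, x_local = x[: c // 2], x[c // 2:]
    y_global = global_attention(x_global)
    y_local = dynamic_depthwise_conv(x_local, w_gen, k)
    return np.concatenate([y_global, y_local], axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6, 6))              # 8 channels, 6 x 6 feature map
w_gen = rng.standard_normal((4 * 9, 4))         # generator for the 4 local channels
y = d_mixer_sketch(x, w_gen)
print(y.shape)
```

Because the convolution kernels are recomputed from each input, the local branch is not a fixed linear filter: scaling the input changes the kernels as well as the features, which is the "input-dependent" behaviour the abstract contrasts with static convolution.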
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Lou, Meng | - |
| dc.contributor.author | Zhang, Shu | - |
| dc.contributor.author | Zhou, Hong Yu | - |
| dc.contributor.author | Yang, Sibei | - |
| dc.contributor.author | Wu, Chuan | - |
| dc.contributor.author | Yu, Yizhou | - |
| dc.date.accessioned | 2025-09-17T00:32:07Z | - |
| dc.date.available | 2025-09-17T00:32:07Z | - |
| dc.date.issued | 2025-01-01 | - |
| dc.identifier.citation | IEEE Transactions on Neural Networks and Learning Systems, 2025, v. 36, n. 6, p. 11534-11547 | - |
| dc.identifier.issn | 2162-237X | - |
| dc.identifier.uri | http://hdl.handle.net/10722/361929 | - |
| dc.description.abstract | Recent studies have integrated convolutions into transformers to introduce inductive bias and improve generalization performance. However, the static nature of conventional convolution prevents it from dynamically adapting to input variations, resulting in a representation discrepancy between convolution and self-attention as self-attention calculates attention matrices dynamically. Furthermore, when stacking token mixers that consist of convolution and self-attention to form a deep network, the static nature of convolution hinders the fusion of features previously generated by self-attention into convolution kernels. These two limitations result in a suboptimal representation capacity of the constructed networks. To find a solution, we propose a lightweight dual dynamic token mixer (D-Mixer) to simultaneously learn global and local dynamics, that is, mechanisms that compute weights for aggregating global contexts and local details in an input-dependent manner. D-Mixer works by applying an efficient global attention module and an input-dependent depthwise convolution separately on evenly split feature segments, endowing the network with strong inductive bias and an enlarged effective receptive field. We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN–transformer vision backbone network that delivers compelling performance. In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost. Furthermore, TransXNet-S and TransXNet-B exhibit excellent model scalability, achieving top-1 accuracy of 83.8% and 84.6%, respectively, with reasonable computational costs. In addition, our proposed network architecture demonstrates strong generalization capabilities in various dense prediction tasks, outperforming other state-of-the-art networks while having lower computational costs. | - |
| dc.language | eng | - |
| dc.publisher | Institute of Electrical and Electronics Engineers | - |
| dc.relation.ispartof | IEEE Transactions on Neural Networks and Learning Systems | - |
| dc.subject | Dual Dynamic Token Mixer | - |
| dc.subject | Vision Transformer | - |
| dc.subject | Visual recognition | - |
| dc.title | TransXNet: Learning Both Global and Local Dynamics With a Dual Dynamic Token Mixer for Visual Recognition | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.1109/TNNLS.2025.3550979 | - |
| dc.identifier.pmid | 40178959 | - |
| dc.identifier.scopus | eid_2-s2.0-105002250309 | - |
| dc.identifier.volume | 36 | - |
| dc.identifier.issue | 6 | - |
| dc.identifier.spage | 11534 | - |
| dc.identifier.epage | 11547 | - |
| dc.identifier.eissn | 2162-2388 | - |
| dc.identifier.issnl | 2162-237X | - |
