File Download
There are no files associated with this item.
Links for fulltext (may require subscription)
- Publisher Website: 10.1109/TPAMI.2024.3515454
- Scopus: eid_2-s2.0-85212095793
Citations:
- Scopus: 0
Article: BEVFormer: Learning Bird's-Eye-View Representation From LiDAR-Camera Via Spatiotemporal Transformers
Title | BEVFormer: Learning Bird's-Eye-View Representation From LiDAR-Camera Via Spatiotemporal Transformers |
---|---|
Authors | Li, Zhiqi; Wang, Wenhai; Li, Hongyang; Xie, Enze; Sima, Chonghao; Lu, Tong; Yu, Qiao; Dai, Jifeng |
Keywords | 3D object detection; autonomous driving; bird's-eye-view; LiDAR-camera; map segmentation; transformer |
Issue Date | 11-Dec-2024 |
Publisher | Institute of Electrical and Electronics Engineers |
Citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024 |
Abstract | Multi-modality fusion is currently the de facto most competitive strategy for 3D perception tasks. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations from multi-modality data with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with the spatial and temporal domains through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from both the point cloud and camera inputs, thus completing multi-modality information fusion in BEV space. For temporal information, we propose temporal self-attention to recurrently fuse historical BEV information. Comparisons with other fusion paradigms demonstrate that the fusion method proposed in this work is both succinct and effective. Our approach achieves a new state-of-the-art 74.1% NDS on the nuScenes test set. In addition, we extend BEVFormer to a wide range of autonomous driving tasks, including object tracking, vectorized mapping, occupancy prediction, and end-to-end autonomous driving, achieving outstanding results across these tasks. The code is released at https://github.com/fundamentalvision/BEVFormer. |
Persistent Identifier | http://hdl.handle.net/10722/353662 |
ISSN | 0162-8828 (2023 Impact Factor: 20.8; 2023 SCImago Journal Rankings: 6.158) |
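The abstract describes grid-shaped BEV queries, attention over input features, and recurrent fusion of the history BEV. As a rough illustration only (this is not the authors' implementation; the function names, toy sizes, and the weighted-blend shortcut for temporal self-attention are all assumptions), a pure-Python sketch of these ideas might look like:

```python
import math

GRID_H, GRID_W, DIM = 4, 4, 8  # toy sizes; the paper uses far larger BEV grids

def make_bev_queries(h, w, dim):
    """Predefined grid-shaped BEV queries: one vector per BEV cell
    (learnable in the real model; zero-initialized here)."""
    return [[[0.0] * dim for _ in range(w)] for _ in range(h)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention of a single BEV query over a list of
    features (standing in for fused point-cloud and camera features)."""
    scale = 1.0 / math.sqrt(len(query))
    scores = softmax([sum(q * k for q, k in zip(query, key)) * scale
                      for key in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(scores, values)) for d in range(dim)]

def temporal_fuse(bev_t, bev_prev, alpha=0.5):
    """Simplified recurrent fusion of the history BEV: a per-cell weighted
    blend standing in for temporal self-attention."""
    return [[[alpha * c + (1 - alpha) * p
              for c, p in zip(cell, prev_cell)]
             for cell, prev_cell in zip(row, prev_row)]
            for row, prev_row in zip(bev_t, bev_prev)]
```

The sketch only conveys the data flow: each BEV cell owns a query that gathers spatial features via attention, and the resulting BEV map is blended with the previous timestep's map before the next round.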
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Li, Zhiqi | - |
dc.contributor.author | Wang, Wenhai | - |
dc.contributor.author | Li, Hongyang | - |
dc.contributor.author | Xie, Enze | - |
dc.contributor.author | Sima, Chonghao | - |
dc.contributor.author | Lu, Tong | - |
dc.contributor.author | Yu, Qiao | - |
dc.contributor.author | Dai, Jifeng | - |
dc.date.accessioned | 2025-01-22T00:35:32Z | - |
dc.date.available | 2025-01-22T00:35:32Z | - |
dc.date.issued | 2024-12-11 | - |
dc.identifier.citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024 | - |
dc.identifier.issn | 0162-8828 | - |
dc.identifier.uri | http://hdl.handle.net/10722/353662 | - |
dc.description.abstract | <p>Multi-modality fusion is currently the de facto most competitive strategy for 3D perception tasks. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations from multi-modality data with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with the spatial and temporal domains through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from both the point cloud and camera inputs, thus completing multi-modality information fusion in BEV space. For temporal information, we propose temporal self-attention to recurrently fuse historical BEV information. Comparisons with other fusion paradigms demonstrate that the fusion method proposed in this work is both succinct and effective. Our approach achieves a new state-of-the-art 74.1% NDS on the nuScenes test set. In addition, we extend BEVFormer to a wide range of autonomous driving tasks, including object tracking, vectorized mapping, occupancy prediction, and end-to-end autonomous driving, achieving outstanding results across these tasks. The code is released at https://github.com/fundamentalvision/BEVFormer.</p> | - |
dc.language | eng | - |
dc.publisher | Institute of Electrical and Electronics Engineers | - |
dc.relation.ispartof | IEEE Transactions on Pattern Analysis and Machine Intelligence | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject | 3D object detection | - |
dc.subject | autonomous driving | - |
dc.subject | bird's-eye-view | - |
dc.subject | LiDAR-camera | - |
dc.subject | map segmentation | - |
dc.subject | transformer | - |
dc.title | BEVFormer: Learning Bird's-Eye-View Representation From LiDAR-Camera Via Spatiotemporal Transformers | - |
dc.type | Article | - |
dc.identifier.doi | 10.1109/TPAMI.2024.3515454 | - |
dc.identifier.scopus | eid_2-s2.0-85212095793 | - |
dc.identifier.eissn | 1939-3539 | - |
dc.identifier.issnl | 0162-8828 | - |