File Download
There are no files associated with this item.
Links for fulltext (may require subscription)
- Publisher Website: 10.1109/TPAMI.2024.3515454
- Scopus: eid_2-s2.0-85212095793
Citations:
- Scopus: 0
Article: BEVFormer: Learning Bird's-Eye-View Representation From LiDAR-Camera Via Spatiotemporal Transformers
Title | BEVFormer: Learning Bird's-Eye-View Representation From LiDAR-Camera Via Spatiotemporal Transformers |
---|---|
Authors | Li, Zhiqi; Wang, Wenhai; Li, Hongyang; Xie, Enze; Sima, Chonghao; Lu, Tong; Yu, Qiao; Dai, Jifeng |
Keywords | 3D object detection; autonomous driving; bird's-eye-view; LiDAR-camera; map segmentation; transformer |
Issue Date | 11-Dec-2024 |
Publisher | Institute of Electrical and Electronics Engineers |
Citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024 |
Abstract | Multi-modality fusion is currently the de facto most competitive strategy for 3D perception tasks. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations from multi-modality data with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with the spatial and temporal domains through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from both the point cloud and camera inputs, thus completing multi-modality information fusion in BEV space. For temporal information, we propose temporal self-attention to recurrently fuse historical BEV information. Comparisons with other fusion paradigms demonstrate that the fusion method proposed in this work is both succinct and effective. Our approach achieves a new state-of-the-art 74.1% NDS on the nuScenes test set. In addition, we extend BEVFormer to a wide range of autonomous driving tasks, including object tracking, vectorized mapping, occupancy prediction, and end-to-end autonomous driving, achieving outstanding results across these tasks. The code is released at https://github.com/fundamentalvision/BEVFormer. |
Persistent Identifier | http://hdl.handle.net/10722/353662 |
ISSN | 0162-8828 (2023 Impact Factor: 20.8; 2023 SCImago Journal Rankings: 6.158) |
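The abstract describes grid-shaped BEV queries, attention over input features, and recurrent fusion of the history BEV. As a rough illustration only (this is not the authors' implementation; the function names, toy sizes, and the weighted-blend shortcut for temporal self-attention are all assumptions), a pure-Python sketch of these ideas might look like:

```python
import math

GRID_H, GRID_W, DIM = 4, 4, 8  # toy sizes; the paper uses far larger BEV grids

def make_bev_queries(h, w, dim):
    """Predefined grid-shaped BEV queries: one vector per BEV cell
    (learnable in the real model; zero-initialized here)."""
    return [[[0.0] * dim for _ in range(w)] for _ in range(h)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention of a single BEV query over a list of
    features (standing in for fused point-cloud and camera features)."""
    scale = 1.0 / math.sqrt(len(query))
    scores = softmax([sum(q * k for q, k in zip(query, key)) * scale
                      for key in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(scores, values)) for d in range(dim)]

def temporal_fuse(bev_t, bev_prev, alpha=0.5):
    """Simplified recurrent fusion of the history BEV: a per-cell weighted
    blend standing in for temporal self-attention."""
    return [[[alpha * c + (1 - alpha) * p
              for c, p in zip(cell, prev_cell)]
             for cell, prev_cell in zip(row, prev_row)]
            for row, prev_row in zip(bev_t, bev_prev)]
```

The sketch only conveys the data flow: each BEV cell owns a query that gathers spatial features via attention, and the resulting BEV map is blended with the previous timestep's map before the next round.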
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Li, Zhiqi | - |
dc.contributor.author | Wang, Wenhai | - |
dc.contributor.author | Li, Hongyang | - |
dc.contributor.author | Xie, Enze | - |
dc.contributor.author | Sima, Chonghao | - |
dc.contributor.author | Lu, Tong | - |
dc.contributor.author | Yu, Qiao | - |
dc.contributor.author | Dai, Jifeng | - |
dc.date.accessioned | 2025-01-22T00:35:32Z | - |
dc.date.available | 2025-01-22T00:35:32Z | - |
dc.date.issued | 2024-12-11 | - |
dc.identifier.citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024 | - |
dc.identifier.issn | 0162-8828 | - |
dc.identifier.uri | http://hdl.handle.net/10722/353662 | - |
dc.description.abstract | <p>Multi-modality fusion is currently the de facto most competitive strategy for 3D perception tasks. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations from multi-modality data with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with the spatial and temporal domains through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from both the point cloud and camera inputs, thus completing multi-modality information fusion in BEV space. For temporal information, we propose temporal self-attention to recurrently fuse historical BEV information. Comparisons with other fusion paradigms demonstrate that the fusion method proposed in this work is both succinct and effective. Our approach achieves a new state-of-the-art 74.1% NDS on the nuScenes test set. In addition, we extend BEVFormer to a wide range of autonomous driving tasks, including object tracking, vectorized mapping, occupancy prediction, and end-to-end autonomous driving, achieving outstanding results across these tasks. The code is released at https://github.com/fundamentalvision/BEVFormer.</p> | - |
dc.language | eng | - |
dc.publisher | Institute of Electrical and Electronics Engineers | - |
dc.relation.ispartof | IEEE Transactions on Pattern Analysis and Machine Intelligence | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject | 3D object detection | - |
dc.subject | autonomous driving | - |
dc.subject | bird's-eye-view | - |
dc.subject | LiDAR-camera | - |
dc.subject | map segmentation | - |
dc.subject | transformer | - |
dc.title | BEVFormer: Learning Bird's-Eye-View Representation From LiDAR-Camera Via Spatiotemporal Transformers | - |
dc.type | Article | - |
dc.identifier.doi | 10.1109/TPAMI.2024.3515454 | - |
dc.identifier.scopus | eid_2-s2.0-85212095793 | - |
dc.identifier.eissn | 1939-3539 | - |
dc.identifier.issnl | 0162-8828 | - |