Article: BEVFormer: Learning Bird's-Eye-View Representation From LiDAR-Camera Via Spatiotemporal Transformers

Title: BEVFormer: Learning Bird's-Eye-View Representation From LiDAR-Camera Via Spatiotemporal Transformers
Authors: Li, Zhiqi; Wang, Wenhai; Li, Hongyang; Xie, Enze; Sima, Chonghao; Lu, Tong; Yu, Qiao; Dai, Jifeng
Keywords: 3D object detection; autonomous driving; bird's-eye-view; LiDAR-camera; map segmentation; transformer
Issue Date: 11-Dec-2024
Publisher: Institute of Electrical and Electronics Engineers
Citation: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024
Abstract

Multi-modality fusion is currently the de facto most competitive solution for 3D perception tasks. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations from multi-modality data with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with the spatial and temporal domains through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from both point-cloud and camera input, thus completing multi-modality information fusion in BEV space. For temporal information, we propose temporal self-attention to recurrently fuse historical BEV information. Through comparison with other fusion paradigms, we demonstrate that the fusion method proposed in this work is both succinct and effective. Our approach achieves a new state of the art of 74.1% NDS on the nuScenes test set. In addition, we extend BEVFormer to a wide range of autonomous driving tasks, including object tracking, vectorized mapping, occupancy prediction, and end-to-end autonomous driving, achieving outstanding results across these tasks. The code is released at https://github.com/fundamentalvision/BEVFormer.
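The mechanism described in the abstract can be illustrated with a highly simplified sketch. Note the assumptions: BEVFormer itself uses deformable attention with learnable sampling offsets and camera-geometry reference points, which is replaced here by plain scaled dot-product attention; all tensor shapes, variable names, and random features below are hypothetical and purely illustrative of the data flow (grid-shaped BEV queries, spatial cross-attention over fused sensor features, temporal self-attention over the previous frame's BEV).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # plain scaled dot-product attention (a stand-in for the
    # deformable attention used in the actual model)
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

# hypothetical sizes: a 4x4 BEV grid with 8-dim embeddings
H = W = 4
C = 8
rng = np.random.default_rng(0)
bev_queries = rng.normal(size=(H * W, C))   # predefined grid-shaped BEV queries
cam_feats   = rng.normal(size=(6, C))       # image features (e.g. 6 camera views)
lidar_feats = rng.normal(size=(10, C))      # point-cloud features
prev_bev    = rng.normal(size=(H * W, C))   # BEV representation from the previous frame

# spatial cross-attention: each BEV query attends over both camera and
# point-cloud features, fusing the two modalities in BEV space
sensor_feats = np.concatenate([cam_feats, lidar_feats], axis=0)
bev_spatial = attend(bev_queries, sensor_feats, sensor_feats)

# temporal self-attention: recurrently fuse the history BEV into the
# current representation (residual connection keeps the spatial signal)
bev = bev_spatial + attend(bev_spatial, prev_bev, prev_bev)
print(bev.shape)  # (16, 8)
```

At inference time the output `bev` would become `prev_bev` for the next frame, which is what makes the temporal fusion recurrent.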


Persistent Identifier: http://hdl.handle.net/10722/353662
ISSN: 0162-8828
2023 Impact Factor: 20.8
2023 SCImago Journal Rankings: 6.158

 

DC Field / Value
dc.contributor.author: Li, Zhiqi
dc.contributor.author: Wang, Wenhai
dc.contributor.author: Li, Hongyang
dc.contributor.author: Xie, Enze
dc.contributor.author: Sima, Chonghao
dc.contributor.author: Lu, Tong
dc.contributor.author: Yu, Qiao
dc.contributor.author: Dai, Jifeng
dc.date.accessioned: 2025-01-22T00:35:32Z
dc.date.available: 2025-01-22T00:35:32Z
dc.date.issued: 2024-12-11
dc.identifier.citation: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024
dc.identifier.issn: 0162-8828
dc.identifier.uri: http://hdl.handle.net/10722/353662
dc.description.abstract: <p>Multi-modality fusion is currently the de facto most competitive solution for 3D perception tasks. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations from multi-modality data with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with the spatial and temporal domains through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from both point-cloud and camera input, thus completing multi-modality information fusion in BEV space. For temporal information, we propose temporal self-attention to recurrently fuse historical BEV information. Through comparison with other fusion paradigms, we demonstrate that the fusion method proposed in this work is both succinct and effective. Our approach achieves a new state of the art of 74.1% NDS on the nuScenes test set. In addition, we extend BEVFormer to a wide range of autonomous driving tasks, including object tracking, vectorized mapping, occupancy prediction, and end-to-end autonomous driving, achieving outstanding results across these tasks. The code is released at https://github.com/fundamentalvision/BEVFormer.</p>
dc.language: eng
dc.publisher: Institute of Electrical and Electronics Engineers
dc.relation.ispartof: IEEE Transactions on Pattern Analysis and Machine Intelligence
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject: 3D object detection
dc.subject: autonomous driving
dc.subject: bird's-eye-view
dc.subject: LiDAR-camera
dc.subject: map segmentation
dc.subject: transformer
dc.title: BEVFormer: Learning Bird's-Eye-View Representation From LiDAR-Camera Via Spatiotemporal Transformers
dc.type: Article
dc.identifier.doi: 10.1109/TPAMI.2024.3515454
dc.identifier.scopus: eid_2-s2.0-85212095793
dc.identifier.eissn: 1939-3539
dc.identifier.issnl: 0162-8828
