Article: PonderV2: Improved 3D Representation with A Universal Pre-training Paradigm
| Title | PonderV2: Improved 3D Representation with A Universal Pre-training Paradigm |
|---|---|
| Authors | Zhu, Haoyi; Yang, Honghui; Wu, Xiaoyang; Huang, Di; Zhang, Sha; He, Xianglong; Zhao, Hengshuang; Shen, Chunhua; Qiao, Yu; He, Tong; Ouyang, Wanli |
| Keywords | 3D pre-training; 3D vision; foundation model; LiDAR; multi-view image; neural rendering; point cloud; RGB-D image |
| Issue Date | 18-Apr-2025 |
| Publisher | Institute of Electrical and Electronics Engineers |
| Citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025, v. 47, n. 8, p. 6550-6565 |
| Abstract | In contrast to numerous NLP and 2D vision foundational models, training a 3D foundational model poses considerably greater challenges. This is primarily due to the inherent data variability and diversity of downstream tasks. In this paper, we introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations. Considering that informative 3D features should encode rich geometry and appearance cues that can be utilized to render realistic images, we propose to learn 3D representations by differentiable neural rendering. We train a 3D backbone with a volumetric neural renderer by comparing the rendered with the real images. Notably, our pre-trained encoder can be seamlessly applied to various downstream tasks. These tasks include semantic challenges like 3D detection and segmentation, which involve scene understanding, and non-semantic tasks like 3D reconstruction and image synthesis, which focus on geometry and visuals. They span both indoor and outdoor scenarios. We also illustrate the capability of pre-training a 2D backbone using the proposed methodology, surpassing conventional pre-training methods by a large margin. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness. |
| Persistent Identifier | http://hdl.handle.net/10722/362091 |
| ISSN | 0162-8828 |
| 2023 Impact Factor | 20.8 |
| 2023 SCImago Journal Rankings | 6.158 |
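The abstract above describes the core pre-training recipe: encode 3D input with a backbone, decode the resulting features through a differentiable volumetric neural renderer, and supervise the rendered pixels against real images. Below is a minimal PyTorch sketch of that idea on random data; the module names (`Toy3DBackbone`, `RenderHead`), shapes, and hyperparameters are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of rendering-based 3D pre-training, per the abstract:
# encode a voxel grid, decode sampled features to (density, color),
# volume-render pixels along rays, and apply a photometric loss.
# All names, shapes, and constants are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Toy3DBackbone(nn.Module):        # hypothetical stand-in for the 3D encoder
    def __init__(self, in_ch=4, feat_ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat_ch, feat_ch, 3, padding=1),
        )
    def forward(self, vox):            # vox: (B, C, D, H, W)
        return self.net(vox)           # dense feature volume

class RenderHead(nn.Module):           # maps point features -> (density, rgb)
    def __init__(self, feat_ch=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_ch, 32), nn.ReLU(), nn.Linear(32, 4))
    def forward(self, feats):          # feats: (..., feat_ch)
        out = self.mlp(feats)
        sigma = F.softplus(out[..., :1])    # non-negative density
        rgb = torch.sigmoid(out[..., 1:])   # colors in [0, 1]
        return sigma, rgb

def volume_render(sigma, rgb, deltas):
    # standard alpha compositing along each ray
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * deltas)            # (R, S)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                          # (R, S)
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)                  # (R, 3)

# --- one toy pre-training step on random data ---
backbone, head = Toy3DBackbone(), RenderHead()
opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-4)

vox = torch.randn(1, 4, 32, 32, 32)          # fake voxelized point-cloud input
feat_vol = backbone(vox)

# sample S points along R rays; grid coords must lie in [-1, 1] for grid_sample
R, S = 512, 24
coords = torch.rand(1, R, S, 1, 3) * 2 - 1
pt_feats = F.grid_sample(feat_vol, coords, align_corners=True)   # (1, C, R, S, 1)
pt_feats = pt_feats.squeeze(-1).squeeze(0).permute(1, 2, 0)      # (R, S, C)

sigma, rgb = head(pt_feats)
deltas = torch.full((R, S), 0.05)             # fixed step size along each ray
pred_pixels = volume_render(sigma, rgb, deltas)

real_pixels = torch.rand(R, 3)                # stand-in for real image colors
opt.zero_grad()
loss = F.mse_loss(pred_pixels, real_pixels)   # photometric supervision
loss.backward()
opt.step()
```

In an actual pipeline, the ray origins, sample coordinates, and ground-truth pixel colors would come from posed multi-view or RGB-D images rather than random tensors, and the pre-trained backbone would then be transferred to the downstream tasks listed in the abstract.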
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Zhu, Haoyi | - |
| dc.contributor.author | Yang, Honghui | - |
| dc.contributor.author | Wu, Xiaoyang | - |
| dc.contributor.author | Huang, Di | - |
| dc.contributor.author | Zhang, Sha | - |
| dc.contributor.author | He, Xianglong | - |
| dc.contributor.author | Zhao, Hengshuang | - |
| dc.contributor.author | Shen, Chunhua | - |
| dc.contributor.author | Qiao, Yu | - |
| dc.contributor.author | He, Tong | - |
| dc.contributor.author | Ouyang, Wanli | - |
| dc.date.accessioned | 2025-09-19T00:31:49Z | - |
| dc.date.available | 2025-09-19T00:31:49Z | - |
| dc.date.issued | 2025-04-18 | - |
| dc.identifier.citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025, v. 47, n. 8, p. 6550-6565 | - |
| dc.identifier.issn | 0162-8828 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/362091 | - |
| dc.description.abstract | In contrast to numerous NLP and 2D vision foundational models, training a 3D foundational model poses considerably greater challenges. This is primarily due to the inherent data variability and diversity of downstream tasks. In this paper, we introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations. Considering that informative 3D features should encode rich geometry and appearance cues that can be utilized to render realistic images, we propose to learn 3D representations by differentiable neural rendering. We train a 3D backbone with a volumetric neural renderer by comparing the rendered with the real images. Notably, our pre-trained encoder can be seamlessly applied to various downstream tasks. These tasks include semantic challenges like 3D detection and segmentation, which involve scene understanding, and non-semantic tasks like 3D reconstruction and image synthesis, which focus on geometry and visuals. They span both indoor and outdoor scenarios. We also illustrate the capability of pre-training a 2D backbone using the proposed methodology, surpassing conventional pre-training methods by a large margin. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness. | - |
| dc.language | eng | - |
| dc.publisher | Institute of Electrical and Electronics Engineers | - |
| dc.relation.ispartof | IEEE Transactions on Pattern Analysis and Machine Intelligence | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | 3D pre-training | - |
| dc.subject | 3D vision | - |
| dc.subject | foundation model | - |
| dc.subject | LiDAR | - |
| dc.subject | multi-view image | - |
| dc.subject | neural rendering | - |
| dc.subject | point cloud | - |
| dc.subject | RGB-D image | - |
| dc.title | PonderV2: Improved 3D Representation with A Universal Pre-training Paradigm | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.1109/TPAMI.2025.3561598 | - |
| dc.identifier.scopus | eid_2-s2.0-105002839761 | - |
| dc.identifier.volume | 47 | - |
| dc.identifier.issue | 8 | - |
| dc.identifier.spage | 6550 | - |
| dc.identifier.epage | 6565 | - |
| dc.identifier.eissn | 1939-3539 | - |
| dc.identifier.issnl | 0162-8828 | - |
