File Download
There are no files associated with this item.

Links for fulltext (may require subscription):
- Publisher Website: 10.1109/TPAMI.2024.3410324
- Scopus: eid_2-s2.0-85195375531
Article: Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding
Title | Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding |
---|---|
Authors | Ding, Runyu; Yang, Jihan; Xue, Chuhui; Zhang, Wenqing; Bai, Song; Qi, Xiaojuan |
Keywords | 3D scene understanding; instance segmentation; Location awareness; open vocabulary; open world; panoptic segmentation; point clouds; Semantic segmentation; Semantics; Solid modeling; Task analysis; Three-dimensional displays; Training |
Issue Date | 1-Dec-2024 |
Publisher | Institute of Electrical and Electronics Engineers |
Citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, v. 46, n. 12, p. 8517-8533 |
Abstract | Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor for the recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Internet, which cover a wide range of vocabulary concepts. However, this success is hard to replicate in 3D scenarios due to the scarcity of 3D-text pairs. To address this challenge, we propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for multi-view images of 3D scenes. This allows us to establish explicit associations between 3D shapes and semantic-rich captions. Moreover, to enhance the fine-grained visual-semantic representation learning from captions for object-level categorization, we design hierarchical point-caption association methods to learn semantic-aware embeddings that exploit the 3D geometry between 3D points and multi-view images. In addition, to tackle the localization challenge for novel classes in the open-world setting, we develop debiased instance localization, which involves training object grouping modules on unlabeled data using instance-level pseudo supervision. This significantly improves the generalization capabilities of instance grouping and, thus, the ability to accurately locate novel objects. We conduct extensive experiments on 3D semantic, instance, and panoptic segmentation tasks, covering indoor and outdoor scenes across three datasets. Our method outperforms baseline methods by a significant margin in semantic segmentation (e.g. 34.5% ∼ 65.3%), instance segmentation (e.g. 21.8% ∼ 54.0%), and panoptic segmentation (e.g. 14.7% ∼ 43.3%). Code will be available. |
Persistent Identifier | http://hdl.handle.net/10722/351086 |
ISSN | 0162-8828 (2023 Impact Factor: 20.8; 2023 SCImago Journal Rankings: 6.158) |
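The abstract above centers on associating 3D points with semantic-rich captions produced by a pre-trained vision-language model. As a rough illustration only, not the authors' released implementation, the sketch below shows one common way such point-caption association can be trained: per-point features from a 3D backbone are pooled per caption and aligned to frozen caption text embeddings with a contrastive (InfoNCE) loss. All function and tensor names here are hypothetical.

```python
# Illustrative sketch (hypothetical, not the paper's code): contrastive
# point-caption association, assuming a 3D backbone that yields per-point
# features and a frozen text encoder (e.g. CLIP) for caption embeddings.
import torch
import torch.nn.functional as F


def point_caption_contrastive_loss(point_feats, point_to_caption, caption_embs,
                                   temperature=0.07):
    """
    point_feats:      (N, D) per-point features from the 3D backbone.
    point_to_caption: (N,)   index of the caption each point is associated with
                             (e.g. by projecting the point into the captioned view).
    caption_embs:     (C, D) frozen text embeddings of the C captions.
    """
    C, D = caption_embs.shape

    # Pool point features per caption to get one 3D embedding per caption.
    pooled = torch.zeros(C, D, device=point_feats.device)
    counts = torch.zeros(C, 1, device=point_feats.device)
    pooled.index_add_(0, point_to_caption, point_feats)
    counts.index_add_(0, point_to_caption,
                      torch.ones(point_feats.size(0), 1, device=point_feats.device))
    pooled = pooled / counts.clamp(min=1)

    # InfoNCE: each pooled 3D embedding should match its own caption embedding.
    pooled = F.normalize(pooled, dim=-1)
    caption_embs = F.normalize(caption_embs, dim=-1)
    logits = pooled @ caption_embs.t() / temperature
    targets = torch.arange(C, device=logits.device)
    return F.cross_entropy(logits, targets)
```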
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Ding, Runyu | - |
dc.contributor.author | Yang, Jihan | - |
dc.contributor.author | Xue, Chuhui | - |
dc.contributor.author | Zhang, Wenqing | - |
dc.contributor.author | Bai, Song | - |
dc.contributor.author | Qi, Xiaojuan | - |
dc.date.accessioned | 2024-11-09T00:35:45Z | - |
dc.date.available | 2024-11-09T00:35:45Z | - |
dc.date.issued | 2024-12-01 | - |
dc.identifier.citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, v. 46, n. 12, p. 8517-8533 | - |
dc.identifier.issn | 0162-8828 | - |
dc.identifier.uri | http://hdl.handle.net/10722/351086 | - |
dc.description.abstract | Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor for the recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Internet, which cover a wide range of vocabulary concepts. However, this success is hard to replicate in 3D scenarios due to the scarcity of 3D-text pairs. To address this challenge, we propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for multi-view images of 3D scenes. This allows us to establish explicit associations between 3D shapes and semantic-rich captions. Moreover, to enhance the fine-grained visual-semantic representation learning from captions for object-level categorization, we design hierarchical point-caption association methods to learn semantic-aware embeddings that exploit the 3D geometry between 3D points and multi-view images. In addition, to tackle the localization challenge for novel classes in the open-world setting, we develop debiased instance localization, which involves training object grouping modules on unlabeled data using instance-level pseudo supervision. This significantly improves the generalization capabilities of instance grouping and, thus, the ability to accurately locate novel objects. We conduct extensive experiments on 3D semantic, instance, and panoptic segmentation tasks, covering indoor and outdoor scenes across three datasets. Our method outperforms baseline methods by a significant margin in semantic segmentation (e.g. 34.5% ∼ 65.3%), instance segmentation (e.g. 21.8% ∼ 54.0%), and panoptic segmentation (e.g. 14.7% ∼ 43.3%). Code will be available. | -
dc.language | eng | - |
dc.publisher | Institute of Electrical and Electronics Engineers | - |
dc.relation.ispartof | IEEE Transactions on Pattern Analysis and Machine Intelligence | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject | 3D scene understanding | - |
dc.subject | instance segmentation | - |
dc.subject | Location awareness | - |
dc.subject | open vocabulary | - |
dc.subject | open world | - |
dc.subject | panoptic segmentation | - |
dc.subject | point clouds | - |
dc.subject | Semantic segmentation | - |
dc.subject | Semantics | - |
dc.subject | Solid modeling | - |
dc.subject | Task analysis | - |
dc.subject | Three-dimensional displays | - |
dc.subject | Training | - |
dc.title | Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding | - |
dc.type | Article | - |
dc.identifier.doi | 10.1109/TPAMI.2024.3410324 | - |
dc.identifier.scopus | eid_2-s2.0-85195375531 | - |
dc.identifier.volume | 46 | - |
dc.identifier.issue | 12 | - |
dc.identifier.spage | 8517 | - |
dc.identifier.epage | 8533 | - |
dc.identifier.eissn | 1939-3539 | - |
dc.identifier.issnl | 0162-8828 | - |