Links for fulltext (may require subscription):
- Publisher website (DOI): https://doi.org/10.1109/CVPR52733.2024.01949
- Web of Science: WOS:001342515503092
Citations:
- Web of Science: 0

Appears in Collections: Conference Paper

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
| Title | Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding |
|---|---|
| Authors | Yuan, Zhihao; Ren, Jinke; Feng, Chun-Mei; Zhao, Hengshuang; Cui, Shuguang; Li, Zhen |
| Issue Date | 17-Jun-2024 |
| Abstract | 3D Visual Grounding (3DVG) aims at localizing 3D objects based on textual descriptions. Conventional supervised methods for 3DVG often necessitate extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. These modules, specifically tailored for 3D scenarios, work collaboratively to perform complex reasoning and inference. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors into open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG. Code is available at https://curryyuan.github.io/ZSVG3D. |
| Persistent Identifier | http://hdl.handle.net/10722/350519 |
| ISI Accession Number ID | WOS:001342515503092 |
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Yuan, Zhihao | - |
| dc.contributor.author | Ren, Jinke | - |
| dc.contributor.author | Feng, Chun-Mei | - |
| dc.contributor.author | Zhao, Hengshuang | - |
| dc.contributor.author | Cui, Shuguang | - |
| dc.contributor.author | Li, Zhen | - |
| dc.date.accessioned | 2024-10-29T00:32:02Z | - |
| dc.date.available | 2024-10-29T00:32:02Z | - |
| dc.date.issued | 2024-06-17 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/350519 | - |
| dc.description.abstract | 3D Visual Grounding (3DVG) aims at localizing 3D objects based on textual descriptions. Conventional supervised methods for 3DVG often necessitate extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. These modules, specifically tailored for 3D scenarios, work collaboratively to perform complex reasoning and inference. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors into open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG. Code is available at https://curryyuan.github.io/ZSVG3D. | - |
| dc.language | eng | - |
| dc.relation.ispartof | 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (17/06/2024-21/06/2024, Seattle) | - |
| dc.title | Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding | - |
| dc.type | Conference_Paper | - |
| dc.identifier.doi | 10.1109/CVPR52733.2024.01949 | - |
| dc.identifier.isi | WOS:001342515503092 | - |
