Conference Paper: Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Title: Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
Authors: Yuan, Zhihao; Ren, Jinke; Feng, Chun-Mei; Zhao, Hengshuang; Cui, Shuguang; Li, Zhen
Issue Date: 17-Jun-2024
Abstract

3D Visual Grounding (3DVG) aims to localize 3D objects based on textual descriptions. Conventional supervised methods for 3DVG often necessitate extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. These modules, specifically tailored for 3D scenarios, work collaboratively to perform complex reasoning and inference. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors into open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG. Code is available at https://curryyuan.github.io/ZSVG3D.
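To make the visual-programming idea concrete, the following is a minimal sketch of how an LLM-emitted program might compose a localization module with a view-independent spatial relation to ground a query like "the chair closest to the table". All names (`loc`, `closest`, `Box3D`) are illustrative stand-ins, not the authors' actual API; the real language-object correlation module operates on detector features rather than exact label matching.

```python
# Hypothetical sketch of a 3DVG visual program. An LLM might translate
# "the chair closest to the table" into:
#   BOX0 = LOC(category="chair"); BOX1 = LOC(category="table")
#   TARGET = CLOSEST(anchor=BOX1, candidates=BOX0)
import math
from dataclasses import dataclass

@dataclass
class Box3D:
    label: str
    center: tuple  # (x, y, z) box center in scene coordinates

def loc(boxes, category):
    """Stand-in for the language-object correlation module:
    select detected boxes matching the queried category."""
    return [b for b in boxes if b.label == category]

def closest(anchor, candidates):
    """View-independent relation module: the candidate whose
    center is nearest to the anchor's center."""
    return min(candidates, key=lambda b: math.dist(b.center, anchor.center))

# Toy scene with two chairs and a table
scene = [
    Box3D("chair", (0.0, 0.0, 0.0)),
    Box3D("chair", (5.0, 0.0, 0.0)),
    Box3D("table", (4.0, 0.0, 0.0)),
]
anchor = loc(scene, "table")[0]
target = closest(anchor, loc(scene, "chair"))
print(target.center)  # the chair at (5.0, 0.0, 0.0) is nearest to the table
```

View-dependent relations (e.g., "left of") would additionally take a viewpoint as input, which is why the paper separates them from view-independent modules.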


Persistent Identifier: http://hdl.handle.net/10722/350519

 

DC Field | Value | Language
dc.contributor.author | Yuan, Zhihao | -
dc.contributor.author | Ren, Jinke | -
dc.contributor.author | Feng, Chun-Mei | -
dc.contributor.author | Zhao, Hengshuang | -
dc.contributor.author | Cui, Shuguang | -
dc.contributor.author | Li, Zhen | -
dc.date.accessioned | 2024-10-29T00:32:02Z | -
dc.date.available | 2024-10-29T00:32:02Z | -
dc.date.issued | 2024-06-17 | -
dc.identifier.uri | http://hdl.handle.net/10722/350519 | -
dc.description.abstract | <p>3D Visual Grounding (3DVG) aims to localize 3D objects based on textual descriptions. Conventional supervised methods for 3DVG often necessitate extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. These modules, specifically tailored for 3D scenarios, work collaboratively to perform complex reasoning and inference. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors into open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG. Code is available at https://curryyuan.github.io/ZSVG3D.</p> | -
dc.language | eng | -
dc.relation.ispartof | 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (17/06/2024-21/06/2024, Seattle) | -
dc.title | Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding | -
dc.type | Conference_Paper | -
dc.identifier.doi | 10.1109/CVPR52733.2024.01949 | -
