Links for fulltext (may require subscription):
- Publisher Website (DOI): 10.1109/TCDS.2025.3566649
- Scopus: eid_2-s2.0-105004695236
Article: Exploring Grounding Abilities in Vision-Language Models through Contextual Perception
| Title | Exploring Grounding Abilities in Vision-Language Models through Contextual Perception |
|---|---|
| Authors | Xu, Wei; Zhou, Tianfei; Zhang, Taoyuan; Li, Jie; Chen, Peiyin; Pan, Jia; Liu, Xiaofeng |
| Keywords | human-robot interaction; Large language model; prompt engineering; visual grounding |
| Issue Date | 1-Jan-2025 |
| Publisher | Institute of Electrical and Electronics Engineers |
| Citation | IEEE Transactions on Cognitive and Developmental Systems, 2025 |
| Abstract | Vision language models (VLMs) have demonstrated strong general capabilities and achieved great success in areas such as image understanding and reasoning. Visual prompts enhance the focus of VLMs on designated areas, but their fine-grained grounding has not been fully developed. Recent research has used the Set-of-Mark (SoM) approach to unleash the grounding capabilities of Generative Pre-trained Transformer-4 with Vision (GPT-4V), achieving significant benchmark performance. However, SoM still suffers from label offset and from the hallucination of vision language models, and the grounding ability of VLMs remains limited, making it challenging to handle complex scenarios in human-robot interaction. To address these limitations and provide more accurate and less hallucinatory results, we propose Contextual Set-of-Mark (ConSoM), a new SoM-based prompting mechanism that leverages dual-image inputs and the contextual semantic information of images. Experiments demonstrate that ConSoM has distinct advantages in visual grounding, improving by 11% over the baseline on the RefCOCOg dataset. Furthermore, we evaluated ConSoM's grounding abilities in five indoor scenarios, where it exhibited strong robustness in complex environments and under occlusion conditions. We also introduced a scalable annotation method for pixel-level question-answering datasets. Its accuracy, scalability, and depth of world knowledge make ConSoM a highly effective approach for future human-robot interactions. |
| Persistent Identifier | http://hdl.handle.net/10722/362625 |
| ISSN | 2379-8920 (2023 Impact Factor: 5.0; 2023 SCImago Journal Rankings: 1.302) |
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Xu, Wei | - |
| dc.contributor.author | Zhou, Tianfei | - |
| dc.contributor.author | Zhang, Taoyuan | - |
| dc.contributor.author | Li, Jie | - |
| dc.contributor.author | Chen, Peiyin | - |
| dc.contributor.author | Pan, Jia | - |
| dc.contributor.author | Liu, Xiaofeng | - |
| dc.date.accessioned | 2025-09-26T00:36:31Z | - |
| dc.date.available | 2025-09-26T00:36:31Z | - |
| dc.date.issued | 2025-01-01 | - |
| dc.identifier.citation | IEEE Transactions on Cognitive and Developmental Systems, 2025 | - |
| dc.identifier.issn | 2379-8920 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/362625 | - |
| dc.description.abstract | Vision language models (VLMs) have demonstrated strong general capabilities and achieved great success in areas such as image understanding and reasoning. Visual prompts enhance the focus of VLMs on designated areas, but their fine-grained grounding has not been fully developed. Recent research has used the Set-of-Mark (SoM) approach to unleash the grounding capabilities of Generative Pre-trained Transformer-4 with Vision (GPT-4V), achieving significant benchmark performance. However, SoM still suffers from label offset and from the hallucination of vision language models, and the grounding ability of VLMs remains limited, making it challenging to handle complex scenarios in human-robot interaction. To address these limitations and provide more accurate and less hallucinatory results, we propose Contextual Set-of-Mark (ConSoM), a new SoM-based prompting mechanism that leverages dual-image inputs and the contextual semantic information of images. Experiments demonstrate that ConSoM has distinct advantages in visual grounding, improving by 11% over the baseline on the RefCOCOg dataset. Furthermore, we evaluated ConSoM's grounding abilities in five indoor scenarios, where it exhibited strong robustness in complex environments and under occlusion conditions. We also introduced a scalable annotation method for pixel-level question-answering datasets. Its accuracy, scalability, and depth of world knowledge make ConSoM a highly effective approach for future human-robot interactions. | - |
| dc.language | eng | - |
| dc.publisher | Institute of Electrical and Electronics Engineers | - |
| dc.relation.ispartof | IEEE Transactions on Cognitive and Developmental Systems | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | human-robot interaction | - |
| dc.subject | Large language model | - |
| dc.subject | prompt engineering | - |
| dc.subject | visual grounding | - |
| dc.title | Exploring Grounding Abilities in Vision-Language Models through Contextual Perception | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.1109/TCDS.2025.3566649 | - |
| dc.identifier.scopus | eid_2-s2.0-105004695236 | - |
| dc.identifier.eissn | 2379-8939 | - |
| dc.identifier.issnl | 2379-8920 | - |
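
The abstract above describes ConSoM only at a high level: a Set-of-Mark-style prompt that pairs the original image with a marked copy and adds per-region contextual semantics. The sketch below is purely illustrative and is not the authors' implementation; the helper names (`encode`, `mark_regions`, `build_consom_style_prompt`), the pre-computed region centroids and captions, and the payload shape are all assumptions made for the example, and the actual VLM call is left out.

```python
# Illustrative sketch of a dual-image, Set-of-Mark-style prompt assembly.
# Assumes regions have already been segmented elsewhere; none of this is
# taken from the paper's ConSoM pipeline.
import base64
import io
from PIL import Image, ImageDraw


def encode(img: Image.Image) -> str:
    """Serialize a PIL image to a base64 PNG string for a VLM payload."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()


def mark_regions(img: Image.Image, centroids: list[tuple[int, int]]) -> Image.Image:
    """Overlay numeric Set-of-Mark labels at pre-computed region centroids."""
    marked = img.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x, y) in enumerate(centroids, start=1):
        draw.text((x, y), str(idx), fill="red")
    return marked


def build_consom_style_prompt(img: Image.Image,
                              centroids: list[tuple[int, int]],
                              region_captions: list[str],
                              query: str) -> dict:
    """Pair the clean image with its marked copy plus per-region context text."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(region_captions))
    return {
        "images": [encode(img), encode(mark_regions(img, centroids))],
        "text": (
            "The first image is the original scene; the second adds numeric marks.\n"
            f"Region context:\n{context}\n"
            f"Question: {query}\n"
            "Answer with the mark number of the referenced region."
        ),
    }
```

Under this sketch, the model's answer would be a mark number that maps back to the corresponding region, which is consistent with the grounding-by-marks idea the abstract describes; how ConSoM actually fuses the dual images and contextual semantics is detailed only in the full paper.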
