File Download: There are no files associated with this item.
Links for fulltext (may require subscription).
Article: Exploring Grounding Abilities in Vision-Language Models through Contextual Perception

Title: Exploring Grounding Abilities in Vision-Language Models through Contextual Perception
Authors: Xu, Wei; Zhou, Tianfei; Zhang, Taoyuan; Li, Jie; Chen, Peiyin; Pan, Jia; Liu, Xiaofeng
Keywords: human-robot interaction; Large language model; prompt engineering; visual grounding
Issue Date: 1-Jan-2025
Publisher: Institute of Electrical and Electronics Engineers
Citation: IEEE Transactions on Cognitive and Developmental Systems, 2025
Abstract: Vision-language models (VLMs) have demonstrated strong general capabilities and achieved great success in areas such as image understanding and reasoning. Visual prompts sharpen the focus of VLMs on designated areas, but their fine-grained grounding has not been fully developed. Recent research has used the Set-of-Mark (SoM) approach to unleash the grounding capabilities of Generative Pre-trained Transformer-4 with Vision (GPT-4V), achieving strong benchmark performance. However, SoM still suffers from label offset and from VLM hallucination, and the grounding ability of VLMs remains limited, making it challenging to handle complex scenarios in human-robot interaction. To address these limitations and provide more accurate, less hallucination-prone results, we propose Contextual Set-of-Mark (ConSoM), a new SoM-based prompting mechanism that leverages dual-image inputs and the contextual semantic information of images. Experiments demonstrate that ConSoM has distinct advantages in visual grounding, improving on the baseline by 11% on the RefCOCOg dataset. Furthermore, we evaluated ConSoM’s grounding abilities in five indoor scenarios, where it exhibited strong robustness in complex environments and under occlusion. We also introduce a scalable annotation method for pixel-level question-answering datasets. Its accuracy, scalability, and depth of world knowledge make ConSoM a highly effective approach for future human-robot interaction.
Persistent Identifier: http://hdl.handle.net/10722/362625
ISSN: 2379-8920
2023 Impact Factor: 5.0
2023 SCImago Journal Rankings: 1.302
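The abstract above describes ConSoM as an SoM-based prompting mechanism built on dual-image inputs plus per-region semantic context. The Python snippet below is a minimal, illustrative sketch of how such a dual-image, mark-annotated prompt could be assembled for a generic VLM; it is not the authors' implementation, and the message schema, image paths, mark descriptions, and question are all assumptions made for illustration.

import json

def build_consom_style_prompt(raw_image_path, marked_image_path, mark_descriptions, question):
    """Assemble a dual-image, Set-of-Mark-style prompt: one unmarked image for
    global context and one copy overlaid with numbered region marks, plus a
    short textual description for each mark (hypothetical schema)."""
    context_lines = [f"Mark {i}: {desc}" for i, desc in sorted(mark_descriptions.items())]
    return [
        {"role": "system",
         "content": ("You are shown the same scene twice: first unmarked, then "
                     "overlaid with numbered region marks. Ground your answer "
                     "by referring to a mark number.")},
        {"role": "user",
         "content": [
             {"type": "text", "text": "Unmarked scene (global context):"},
             {"type": "image_path", "path": raw_image_path},
             {"type": "text", "text": "Marked scene (numbered regions):"},
             {"type": "image_path", "path": marked_image_path},
             {"type": "text", "text": "Region context:\n" + "\n".join(context_lines)},
             {"type": "text", "text": question},
         ]},
    ]

if __name__ == "__main__":
    # Hypothetical file names, region descriptions, and question, for illustration only.
    prompt = build_consom_style_prompt(
        raw_image_path="kitchen.jpg",
        marked_image_path="kitchen_marked.jpg",
        mark_descriptions={1: "mug on the counter", 2: "drawer handle", 3: "electric kettle"},
        question="Which mark corresponds to the object the robot should grasp to pour water?",
    )
    print(json.dumps(prompt, indent=2))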

 

DC Field | Value | Language
dc.contributor.author | Xu, Wei | -
dc.contributor.author | Zhou, Tianfei | -
dc.contributor.author | Zhang, Taoyuan | -
dc.contributor.author | Li, Jie | -
dc.contributor.author | Chen, Peiyin | -
dc.contributor.author | Pan, Jia | -
dc.contributor.author | Liu, Xiaofeng | -
dc.date.accessioned | 2025-09-26T00:36:31Z | -
dc.date.available | 2025-09-26T00:36:31Z | -
dc.date.issued | 2025-01-01 | -
dc.identifier.citation | IEEE Transactions on Cognitive and Developmental Systems, 2025 | -
dc.identifier.issn | 2379-8920 | -
dc.identifier.uri | http://hdl.handle.net/10722/362625 | -
dc.description.abstract | Vision-language models (VLMs) have demonstrated strong general capabilities and achieved great success in areas such as image understanding and reasoning. Visual prompts sharpen the focus of VLMs on designated areas, but their fine-grained grounding has not been fully developed. Recent research has used the Set-of-Mark (SoM) approach to unleash the grounding capabilities of Generative Pre-trained Transformer-4 with Vision (GPT-4V), achieving strong benchmark performance. However, SoM still suffers from label offset and from VLM hallucination, and the grounding ability of VLMs remains limited, making it challenging to handle complex scenarios in human-robot interaction. To address these limitations and provide more accurate, less hallucination-prone results, we propose Contextual Set-of-Mark (ConSoM), a new SoM-based prompting mechanism that leverages dual-image inputs and the contextual semantic information of images. Experiments demonstrate that ConSoM has distinct advantages in visual grounding, improving on the baseline by 11% on the RefCOCOg dataset. Furthermore, we evaluated ConSoM’s grounding abilities in five indoor scenarios, where it exhibited strong robustness in complex environments and under occlusion. We also introduce a scalable annotation method for pixel-level question-answering datasets. Its accuracy, scalability, and depth of world knowledge make ConSoM a highly effective approach for future human-robot interaction. | -
dc.language | eng | -
dc.publisher | Institute of Electrical and Electronics Engineers | -
dc.relation.ispartof | IEEE Transactions on Cognitive and Developmental Systems | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject | human-robot interaction | -
dc.subject | Large language model | -
dc.subject | prompt engineering | -
dc.subject | visual grounding | -
dc.title | Exploring Grounding Abilities in Vision-Language Models through Contextual Perception | -
dc.type | Article | -
dc.identifier.doi | 10.1109/TCDS.2025.3566649 | -
dc.identifier.scopus | eid_2-s2.0-105004695236 | -
dc.identifier.eissn | 2379-8939 | -
dc.identifier.issnl | 2379-8920 | -

Export: via the OAI-PMH interface in XML formats, or to other non-XML formats.
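The OAI-PMH export mentioned above uses the protocol's standard query parameters (verb, identifier, metadataPrefix). The snippet below is a minimal sketch of how a GetRecord request for this item in unqualified Dublin Core (oai_dc) might be constructed; the OAI base URL and the exact OAI identifier are assumptions, since only the handle 10722/362625 appears in the record.

from urllib.parse import urlencode

# Both values below are placeholders for illustration; only the handle
# 10722/362625 is taken from the record above.
OAI_BASE_URL = "https://repository.example.org/oai/request"
OAI_IDENTIFIER = "oai:repository.example.org:10722/362625"

params = {
    "verb": "GetRecord",          # standard OAI-PMH verb for retrieving a single item
    "identifier": OAI_IDENTIFIER,
    "metadataPrefix": "oai_dc",   # unqualified Dublin Core, matching the dc.* fields above
}

# Fetching this URL would return an XML envelope whose <metadata> element
# carries dc:title, dc:creator, dc:subject, dc:description, and related fields.
print(f"{OAI_BASE_URL}?{urlencode(params)}")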