Links for fulltext (may require subscription):
- Publisher Website (DOI): 10.1109/TCDS.2025.3566649
- Scopus: eid_2-s2.0-105004695236
Article: Exploring Grounding Abilities in Vision-Language Models through Contextual Perception
| Title | Exploring Grounding Abilities in Vision-Language Models through Contextual Perception |
|---|---|
| Authors | Xu, Wei; Zhou, Tianfei; Zhang, Taoyuan; Li, Jie; Chen, Peiyin; Pan, Jia; Liu, Xiaofeng |
| Keywords | human-robot interaction; Large language model; prompt engineering; visual grounding |
| Issue Date | 1-Jan-2025 |
| Publisher | Institute of Electrical and Electronics Engineers |
| Citation | IEEE Transactions on Cognitive and Developmental Systems, 2025 |
| Abstract | Vision language models (VLMs) have demonstrated strong general capabilities and achieved great success in areas such as image understanding and reasoning. Visual prompts enhance the focus of VLMs on designated areas, but their fine-grained grounding has not been fully developed. Recent research has used the Set-of-Mark (SoM) approach to unleash the grounding capabilities of Generative Pre-trained Transformer-4 with Vision (GPT-4V), achieving significant benchmark performance. However, SoM still suffers from label offset and from the hallucination of vision language models, and the grounding ability of VLMs remains limited, making it challenging to handle complex scenarios in human-robot interaction. To address these limitations and provide more accurate and less hallucinatory results, we propose Contextual Set-of-Mark (ConSoM), a new SoM-based prompting mechanism that leverages dual-image inputs and the contextual semantic information of images. Experiments demonstrate that ConSoM has distinct advantages in visual grounding, improving by 11% over the baseline on the RefCOCOg dataset. Furthermore, we evaluated ConSoM's grounding abilities in five indoor scenarios, where it exhibited strong robustness in complex environments and under occlusion conditions. We also introduced a scalable annotation method for pixel-level question-answering datasets. Its accuracy, scalability, and depth of world knowledge make ConSoM a highly effective approach for future human-robot interactions. |
| Persistent Identifier | http://hdl.handle.net/10722/362625 |
| ISSN | 2379-8920 (2023 Impact Factor: 5.0; 2023 SCImago Journal Rankings: 1.302) |
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Xu, Wei | - |
| dc.contributor.author | Zhou, Tianfei | - |
| dc.contributor.author | Zhang, Taoyuan | - |
| dc.contributor.author | Li, Jie | - |
| dc.contributor.author | Chen, Peiyin | - |
| dc.contributor.author | Pan, Jia | - |
| dc.contributor.author | Liu, Xiaofeng | - |
| dc.date.accessioned | 2025-09-26T00:36:31Z | - |
| dc.date.available | 2025-09-26T00:36:31Z | - |
| dc.date.issued | 2025-01-01 | - |
| dc.identifier.citation | IEEE Transactions on Cognitive and Developmental Systems, 2025 | - |
| dc.identifier.issn | 2379-8920 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/362625 | - |
| dc.description.abstract | Vision language models (VLMs) have demonstrated strong general capabilities and achieved great success in areas such as image understanding and reasoning. Visual prompts enhance the focus of VLMs on designated areas, but their fine-grained grounding has not been fully developed. Recent research has used the Set-of-Mark (SoM) approach to unleash the grounding capabilities of Generative Pre-trained Transformer-4 with Vision (GPT-4V), achieving significant benchmark performance. However, SoM still suffers from label offset and from the hallucination of vision language models, and the grounding ability of VLMs remains limited, making it challenging to handle complex scenarios in human-robot interaction. To address these limitations and provide more accurate and less hallucinatory results, we propose Contextual Set-of-Mark (ConSoM), a new SoM-based prompting mechanism that leverages dual-image inputs and the contextual semantic information of images. Experiments demonstrate that ConSoM has distinct advantages in visual grounding, improving by 11% over the baseline on the RefCOCOg dataset. Furthermore, we evaluated ConSoM's grounding abilities in five indoor scenarios, where it exhibited strong robustness in complex environments and under occlusion conditions. We also introduced a scalable annotation method for pixel-level question-answering datasets. Its accuracy, scalability, and depth of world knowledge make ConSoM a highly effective approach for future human-robot interactions. | - |
| dc.language | eng | - |
| dc.publisher | Institute of Electrical and Electronics Engineers | - |
| dc.relation.ispartof | IEEE Transactions on Cognitive and Developmental Systems | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | human-robot interaction | - |
| dc.subject | Large language model | - |
| dc.subject | prompt engineering | - |
| dc.subject | visual grounding | - |
| dc.title | Exploring Grounding Abilities in Vision-Language Models through Contextual Perception | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.1109/TCDS.2025.3566649 | - |
| dc.identifier.scopus | eid_2-s2.0-105004695236 | - |
| dc.identifier.eissn | 2379-8939 | - |
| dc.identifier.issnl | 2379-8920 | - |
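
The abstract above describes ConSoM only at a high level: a Set-of-Mark-style prompt that pairs the original image with a marked copy and adds per-region contextual semantics. The sketch below is purely illustrative and is not the authors' implementation; the helper names (`encode`, `mark_regions`, `build_consom_style_prompt`), the pre-computed region centroids and captions, and the payload shape are all assumptions made for the example, and the actual VLM call is left out.

```python
# Illustrative sketch of a dual-image, Set-of-Mark-style prompt assembly.
# Assumes regions have already been segmented elsewhere; none of this is
# taken from the paper's ConSoM pipeline.
import base64
import io
from PIL import Image, ImageDraw


def encode(img: Image.Image) -> str:
    """Serialize a PIL image to a base64 PNG string for a VLM payload."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()


def mark_regions(img: Image.Image, centroids: list[tuple[int, int]]) -> Image.Image:
    """Overlay numeric Set-of-Mark labels at pre-computed region centroids."""
    marked = img.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x, y) in enumerate(centroids, start=1):
        draw.text((x, y), str(idx), fill="red")
    return marked


def build_consom_style_prompt(img: Image.Image,
                              centroids: list[tuple[int, int]],
                              region_captions: list[str],
                              query: str) -> dict:
    """Pair the clean image with its marked copy plus per-region context text."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(region_captions))
    return {
        "images": [encode(img), encode(mark_regions(img, centroids))],
        "text": (
            "The first image is the original scene; the second adds numeric marks.\n"
            f"Region context:\n{context}\n"
            f"Question: {query}\n"
            "Answer with the mark number of the referenced region."
        ),
    }
```

Under this sketch, the model's answer would be a mark number that maps back to the corresponding region, which is consistent with the grounding-by-marks idea the abstract describes; how ConSoM actually fuses the dual images and contextual semantics is detailed only in the full paper.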
