
Article: LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Title: LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
Authors: Xu, Peng; Shao, Wenqi; Zhang, Kaipeng; Gao, Peng; Liu, Shuo; Lei, Meng; Meng, Fanqing; Huang, Siyuan; Qiao, Yu; Luo, Ping
Keywords: large language model; multi-turn evaluation; multimodal evaluation; Vision-language model
Issue Date: 1-Mar-2025
Publisher: Institute of Electrical and Electronics Engineers
Citation: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025, v. 47, n. 3, p. 1877-1893
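The interactive "How to Cite?" link from the original page is omitted here. As a hedged, self-contained alternative, a formatted citation can usually be retrieved directly from the DOI registrar via content negotiation; the sketch below assumes network access and that the registrar honors the BibTeX media type for this Crossref-registered DOI (the DOI itself is taken from this record).

```python
import requests

# DOI taken from this record's dc.identifier.doi field.
DOI = "10.1109/TPAMI.2024.3507000"

# doi.org supports content negotiation for Crossref-registered DOIs;
# requesting "application/x-bibtex" returns a BibTeX entry (assumption:
# the service is reachable and supports this media type for this DOI).
resp = requests.get(
    f"https://doi.org/{DOI}",
    headers={"Accept": "application/x-bibtex"},
    timeout=30,
)
resp.raise_for_status()
print(resp.text)  # BibTeX entry for the article
```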
Abstract

Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite this great success, a holistic evaluation of their efficacy has been lacking. This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub). Our LVLM-eHub consists of 13 representative LVLMs, such as InstructBLIP and LLaVA, which are thoroughly evaluated through a quantitative capability evaluation and an online arena platform. The former evaluates five categories of multimodal capabilities of LVLMs, such as visual question answering and object hallucination, on 42 in-domain text-related visual benchmarks, while the latter provides a user-level evaluation of LVLMs in an open-world question-answering scenario. The study investigates how specific features of LVLMs, such as model configurations, modality alignment mechanisms, and training data, affect multimodal understanding. By comprehensively comparing these features across the quantitative and arena evaluations, our study uncovers several findings that establish a fundamental framework for developing and evaluating strategies aimed at enhancing multimodal techniques.
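The online arena platform described in the abstract ranks models from head-to-head user comparisons. As a minimal illustration of how such pairwise votes are commonly aggregated into a ranking, the sketch below implements a standard Elo update; the rating constants, the sample battle log, and the assumption that LVLM-eHub's arena uses exactly this scheme are illustrative and not taken from this record.

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of model A against model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, model_a, model_b, outcome, k=32.0):
    """Update ratings in place; outcome is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - e_a)
    ratings[model_b] += k * ((1.0 - outcome) - (1.0 - e_a))

# Hypothetical battle log: (model_a, model_b, outcome from a user vote).
battles = [("LLaVA", "InstructBLIP", 1.0), ("InstructBLIP", "MiniGPT-4", 0.5)]

ratings = defaultdict(lambda: 1000.0)  # every model starts from the same baseline
for a, b, outcome in battles:
    update_elo(ratings, a, b, outcome)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # models ranked by rating
```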


Persistent Identifier: http://hdl.handle.net/10722/362081
ISSN: 0162-8828
2023 Impact Factor: 20.8
2023 SCImago Journal Rankings: 6.158

 

DC Field | Value
dc.contributor.author | Xu, Peng
dc.contributor.author | Shao, Wenqi
dc.contributor.author | Zhang, Kaipeng
dc.contributor.author | Gao, Peng
dc.contributor.author | Liu, Shuo
dc.contributor.author | Lei, Meng
dc.contributor.author | Meng, Fanqing
dc.contributor.author | Huang, Siyuan
dc.contributor.author | Qiao, Yu
dc.contributor.author | Luo, Ping
dc.date.accessioned | 2025-09-19T00:31:42Z
dc.date.available | 2025-09-19T00:31:42Z
dc.date.issued | 2025-03-01
dc.identifier.citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025, v. 47, n. 3, p. 1877-1893
dc.identifier.issn | 0162-8828
dc.identifier.uri | http://hdl.handle.net/10722/362081
dc.description.abstract | Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite the great success, it lacks a holistic evaluation of their efficacy. This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub). Our LVLM-eHub consists of 13 representative LVLMs such as InstructBLIP and LLaVA, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform. The former evaluates five categories of multimodal capabilities of LVLMs such as visual question answering and object hallucination on 42 in-domain text-related visual benchmarks, while the latter provides the user-level evaluation of LVLMs in an open-world question-answering scenario. The study investigates how specific features of LVLMs such as model configurations, modality alignment mechanisms, and training data affect the multimodal understanding. By conducting a comprehensive comparison of these features on quantitative and arena evaluation, our study uncovers several innovative findings, which establish a fundamental framework for the development and evaluation of innovative strategies aimed at enhancing multimodal techniques.
dc.language | eng
dc.publisher | Institute of Electrical and Electronics Engineers
dc.relation.ispartof | IEEE Transactions on Pattern Analysis and Machine Intelligence
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject | large language model
dc.subject | multi-turn evaluation
dc.subject | multimodal evaluation
dc.subject | Vision-language model
dc.title | LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
dc.type | Article
dc.description.nature | published_or_final_version
dc.identifier.doi | 10.1109/TPAMI.2024.3507000
dc.identifier.scopus | eid_2-s2.0-85210902043
dc.identifier.volume | 47
dc.identifier.issue | 3
dc.identifier.spage | 1877
dc.identifier.epage | 1893
dc.identifier.eissn | 1939-3539
dc.identifier.issnl | 0162-8828

Export: via the OAI-PMH interface in XML formats, or to other non-XML formats.
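For completeness, the sketch below shows how a Dublin Core version of this record could be harvested over OAI-PMH. The verb and parameter names are standard OAI-PMH; the endpoint URL and the OAI identifier format are assumptions about this repository (typical for DSpace installations) and may need adjusting.

```python
import requests
import xml.etree.ElementTree as ET

# Assumed endpoint and identifier layout (common for DSpace-based repositories);
# verify against the repository's actual OAI-PMH base URL before relying on this.
BASE_URL = "https://hub.hku.hk/oai/request"
IDENTIFIER = "oai:hub.hku.hk:10722/362081"  # derived from the handle in this record

resp = requests.get(
    BASE_URL,
    params={
        "verb": "GetRecord",          # standard OAI-PMH verb
        "metadataPrefix": "oai_dc",   # unqualified Dublin Core
        "identifier": IDENTIFIER,
    },
    timeout=30,
)
resp.raise_for_status()

# Print every Dublin Core element (dc:title, dc:creator, dc:identifier, ...).
DC_NS = "{http://purl.org/dc/elements/1.1/}"
root = ET.fromstring(resp.text)
for elem in root.iter():
    if elem.tag.startswith(DC_NS):
        print(elem.tag[len(DC_NS):], ":", (elem.text or "").strip())
```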