File Download
Links for fulltext (May Require Subscription)
- Publisher Website: 10.1109/TPAMI.2024.3507000
- Scopus: eid_2-s2.0-85210902043

Citations:
- Scopus: 0
Article: LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
| Title | LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models |
|---|---|
| Authors | Xu, Peng; Shao, Wenqi; Zhang, Kaipeng; Gao, Peng; Liu, Shuo; Lei, Meng; Meng, Fanqing; Huang, Siyuan; Qiao, Yu; Luo, Ping |
| Keywords | large language model; multi-turn evaluation; multimodal evaluation; vision-language model |
| Issue Date | 1-Mar-2025 |
| Publisher | Institute of Electrical and Electronics Engineers |
| Citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025, v. 47, n. 3, p. 1877-1893 |
| Abstract | Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite this great success, a holistic evaluation of their efficacy has been lacking. This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub). Our LVLM-eHub consists of 13 representative LVLMs, such as InstructBLIP and LLaVA, which are thoroughly evaluated through a quantitative capability evaluation and an online arena platform. The former evaluates five categories of multimodal capabilities of LVLMs, such as visual question answering and object hallucination, on 42 in-domain text-related visual benchmarks, while the latter provides a user-level evaluation of LVLMs in an open-world question-answering scenario. The study investigates how specific features of LVLMs, such as model configurations, modality alignment mechanisms, and training data, affect multimodal understanding. By comprehensively comparing these features across the quantitative and arena evaluations, our study uncovers several novel findings, which establish a fundamental framework for the development and evaluation of innovative strategies aimed at enhancing multimodal techniques. |
| Persistent Identifier | http://hdl.handle.net/10722/362081 |
| ISSN | 0162-8828 (2023 Impact Factor: 20.8; 2023 SCImago Journal Rankings: 6.158) |
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Xu, Peng | - |
| dc.contributor.author | Shao, Wenqi | - |
| dc.contributor.author | Zhang, Kaipeng | - |
| dc.contributor.author | Gao, Peng | - |
| dc.contributor.author | Liu, Shuo | - |
| dc.contributor.author | Lei, Meng | - |
| dc.contributor.author | Meng, Fanqing | - |
| dc.contributor.author | Huang, Siyuan | - |
| dc.contributor.author | Qiao, Yu | - |
| dc.contributor.author | Luo, Ping | - |
| dc.date.accessioned | 2025-09-19T00:31:42Z | - |
| dc.date.available | 2025-09-19T00:31:42Z | - |
| dc.date.issued | 2025-03-01 | - |
| dc.identifier.citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025, v. 47, n. 3, p. 1877-1893 | - |
| dc.identifier.issn | 0162-8828 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/362081 | - |
| dc.description.abstract | <p>Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite this great success, a holistic evaluation of their efficacy has been lacking. This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub). Our LVLM-eHub consists of 13 representative LVLMs, such as InstructBLIP and LLaVA, which are thoroughly evaluated through a quantitative capability evaluation and an online arena platform. The former evaluates five categories of multimodal capabilities of LVLMs, such as visual question answering and object hallucination, on 42 in-domain text-related visual benchmarks, while the latter provides a user-level evaluation of LVLMs in an open-world question-answering scenario. The study investigates how specific features of LVLMs, such as model configurations, modality alignment mechanisms, and training data, affect multimodal understanding. By comprehensively comparing these features across the quantitative and arena evaluations, our study uncovers several novel findings, which establish a fundamental framework for the development and evaluation of innovative strategies aimed at enhancing multimodal techniques.</p> | - |
| dc.language | eng | - |
| dc.publisher | Institute of Electrical and Electronics Engineers | - |
| dc.relation.ispartof | IEEE Transactions on Pattern Analysis and Machine Intelligence | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | large language model | - |
| dc.subject | multi-turn evaluation | - |
| dc.subject | multimodal evaluation | - |
| dc.subject | Vision-language model | - |
| dc.title | LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | - |
| dc.type | Article | - |
| dc.description.nature | published_or_final_version | - |
| dc.identifier.doi | 10.1109/TPAMI.2024.3507000 | - |
| dc.identifier.scopus | eid_2-s2.0-85210902043 | - |
| dc.identifier.volume | 47 | - |
| dc.identifier.issue | 3 | - |
| dc.identifier.spage | 1877 | - |
| dc.identifier.epage | 1893 | - |
| dc.identifier.eissn | 1939-3539 | - |
| dc.identifier.issnl | 0162-8828 | - |
