Article: Video Understanding with Large Language Models: A Survey

Title: Video Understanding with Large Language Models: A Survey
Authors: Tang, Yunlong; Bi, Jing; Xu, Siting; Song, Luchuan; Liang, Susan; Wang, Teng; Zhang, Daoan; An, Jie; Lin, Jingyang; Zhu, Rongyi; Vosoughi, Ali; Huang, Chao; Zhang, Zeliang; Liu, Pinxin; Feng, Mingqian; Zheng, Feng; Zhang, Jianguo; Luo, Ping; Luo, Jiebo; Xu, Chenliang
Keywords: Large Language Model; Multimodality Learning; Video Understanding; Vision-Language Model
Issue Date: 1-Jan-2025
Publisher: Institute of Electrical and Electronics Engineers
Citation: IEEE Transactions on Circuits and Systems for Video Technology, 2025
Abstract: With the rapid growth of online video platforms and the escalating volume of video content, the need for proficient video understanding tools has increased significantly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advances in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (abstract, temporal, and spatiotemporal) reasoning combined with common-sense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer × LLM, Video Embedder × LLM, and (Analyzer + Embedder) × LLM. We identify five subtypes based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. This survey also presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methods for Vid-LLMs. Additionally, it explores the extensive applications of Vid-LLMs in various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are encouraged to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.
Persistent Identifier: http://hdl.handle.net/10722/362643
ISSN: 1051-8215
2023 Impact Factor: 8.3
2023 SCImago Journal Rankings: 2.299

 

DC Field: Value
dc.contributor.author: Tang, Yunlong
dc.contributor.author: Bi, Jing
dc.contributor.author: Xu, Siting
dc.contributor.author: Song, Luchuan
dc.contributor.author: Liang, Susan
dc.contributor.author: Wang, Teng
dc.contributor.author: Zhang, Daoan
dc.contributor.author: An, Jie
dc.contributor.author: Lin, Jingyang
dc.contributor.author: Zhu, Rongyi
dc.contributor.author: Vosoughi, Ali
dc.contributor.author: Huang, Chao
dc.contributor.author: Zhang, Zeliang
dc.contributor.author: Liu, Pinxin
dc.contributor.author: Feng, Mingqian
dc.contributor.author: Zheng, Feng
dc.contributor.author: Zhang, Jianguo
dc.contributor.author: Luo, Ping
dc.contributor.author: Luo, Jiebo
dc.contributor.author: Xu, Chenliang
dc.date.accessioned: 2025-09-26T00:36:40Z
dc.date.available: 2025-09-26T00:36:40Z
dc.date.issued: 2025-01-01
dc.identifier.citation: IEEE Transactions on Circuits and Systems for Video Technology, 2025
dc.identifier.issn: 1051-8215
dc.identifier.uri: http://hdl.handle.net/10722/362643
dc.description.abstract: With the rapid growth of online video platforms and the escalating volume of video content, the need for proficient video understanding tools has increased significantly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advances in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (abstract, temporal, and spatiotemporal) reasoning combined with common-sense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer × LLM, Video Embedder × LLM, and (Analyzer + Embedder) × LLM. We identify five subtypes based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. This survey also presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methods for Vid-LLMs. Additionally, it explores the extensive applications of Vid-LLMs in various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are encouraged to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.
dc.language: eng
dc.publisher: Institute of Electrical and Electronics Engineers
dc.relation.ispartof: IEEE Transactions on Circuits and Systems for Video Technology
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject: Large Language Model
dc.subject: Multimodality Learning
dc.subject: Video Understanding
dc.subject: Vision-Language Model
dc.title: Video Understanding with Large Language Models: A Survey
dc.type: Article
dc.identifier.doi: 10.1109/TCSVT.2025.3566695
dc.identifier.scopus: eid_2-s2.0-105004593993
dc.identifier.eissn: 1558-2205
dc.identifier.issnl: 1051-8215
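Records like the Dublin Core dump above are typically harvested as oai_dc XML. As a minimal sketch of consuming such a record, the fragment below is hand-assembled from a few of the fields above (only two of the twenty authors are included, and the OAI-PMH envelope is omitted), then parsed with Python's standard library:

```python
# Sketch: parsing a Dublin Core (oai_dc) record with xml.etree.
# The XML fragment is a hand-built sample using fields from this item;
# a real harvest would wrap it in an OAI-PMH GetRecord response.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

record_xml = f"""
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="{DC_NS}">
  <dc:title>Video Understanding with Large Language Models: A Survey</dc:title>
  <dc:creator>Tang, Yunlong</dc:creator>
  <dc:creator>Xu, Chenliang</dc:creator>
  <dc:identifier>http://hdl.handle.net/10722/362643</dc:identifier>
  <dc:identifier>10.1109/TCSVT.2025.3566695</dc:identifier>
</oai_dc:dc>
"""

root = ET.fromstring(record_xml)
# Dublin Core elements are namespaced; ElementTree uses {uri}localname keys.
title = root.findtext(f"{{{DC_NS}}}title")
creators = [e.text for e in root.findall(f"{{{DC_NS}}}creator")]
identifiers = [e.text for e in root.findall(f"{{{DC_NS}}}identifier")]

print(title)      # the dc:title field
print(creators)   # repeated dc:creator fields, in document order
```

Repeatable fields (dc:creator, dc:subject, dc:identifier) simply occur multiple times in oai_dc, which is why they are collected into lists rather than read as single values.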
