Article: LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models

Title: LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models
Authors: Wang, Yaohui; Chen, Xinyuan; Ma, Xin; Zhou, Shangchen; Huang, Ziqi; Wang, Yi; Yang, Ceyuan; He, Yinan; Yu, Jiashuo; Yang, Peiqing; Guo, Yuwei; Wu, Tianxing; Si, Chenyang; Jiang, Yuming; Chen, Cunjian; Loy, Chen Change; Dai, Bo; Lin, Dahua; Qiao, Yu; Liu, Ziwei
Keywords: Diffusion models; Generative modeling; Video generation
Issue Date: 1-May-2025
Publisher: Springer
Citation: International Journal of Computer Vision, 2025, v. 133, n. 5, p. 3059-3078
Abstract

This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously (a) accomplish the synthesis of visually realistic and temporally coherent videos while (b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: (1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. (2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications. Project page: https://github.com/Vchitect/LaVie/.
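The first key insight above (plain temporal self-attention combined with rotary positional encoding over the frame axis) can be illustrated with a short sketch. The PyTorch module below is a minimal, illustrative implementation assuming a (batch, frames, height*width, channels) latent layout and a standard RoPE rotation; the module names, tensor layout, and hyperparameters are assumptions and do not reproduce the actual LaVie code.

    # Minimal sketch: temporal self-attention with rotary positional encoding (RoPE)
    # over the frame axis. Illustrative only; not the LaVie implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def rotary_embedding(seq_len: int, dim: int, device=None):
        """Build RoPE cos/sin tables for `seq_len` temporal positions."""
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=device).float() / dim))
        pos = torch.arange(seq_len, device=device).float()
        freqs = torch.outer(pos, inv_freq)          # (seq_len, dim/2)
        emb = torch.cat((freqs, freqs), dim=-1)     # (seq_len, dim)
        return emb.cos(), emb.sin()

    def apply_rope(x, cos, sin):
        """Rotate query/key features according to their temporal position."""
        x1, x2 = x.chunk(2, dim=-1)
        rotated = torch.cat((-x2, x1), dim=-1)
        return x * cos + rotated * sin

    class TemporalSelfAttention(nn.Module):
        """Self-attention across frames at each spatial location of a video latent."""

        def __init__(self, dim: int, heads: int = 8):
            super().__init__()
            self.heads = heads
            self.head_dim = dim // heads
            self.qkv = nn.Linear(dim, dim * 3, bias=False)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):
            # x: (batch, frames, height*width, channels) video latent features
            b, t, hw, c = x.shape
            x = x.permute(0, 2, 1, 3).reshape(b * hw, t, c)  # attend over frames
            q, k, v = self.qkv(x).chunk(3, dim=-1)

            def split_heads(y):
                return y.view(b * hw, t, self.heads, self.head_dim).transpose(1, 2)
            q, k, v = split_heads(q), split_heads(k), split_heads(v)

            # Rotary positional encoding applied along the temporal axis.
            cos, sin = rotary_embedding(t, self.head_dim, device=x.device)
            q = apply_rope(q, cos, sin)
            k = apply_rope(k, cos, sin)

            out = F.scaled_dot_product_attention(q, k, v)    # (b*hw, heads, t, head_dim)
            out = out.transpose(1, 2).reshape(b * hw, t, c)
            out = self.proj(out)
            return out.reshape(b, hw, t, c).permute(0, 2, 1, 3)

In approaches of this kind, such a temporal layer is typically inserted alongside the pre-trained spatial layers of the T2I backbone (often with a residual connection), so the spatial generation prior of the image model is retained while the new layer learns cross-frame correlations.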


Persistent Identifier: http://hdl.handle.net/10722/358674
ISSN: 0920-5691
2023 Impact Factor: 11.6
2023 SCImago Journal Rankings: 6.668

 

DC Field: Value
dc.contributor.author: Wang, Yaohui
dc.contributor.author: Chen, Xinyuan
dc.contributor.author: Ma, Xin
dc.contributor.author: Zhou, Shangchen
dc.contributor.author: Huang, Ziqi
dc.contributor.author: Wang, Yi
dc.contributor.author: Yang, Ceyuan
dc.contributor.author: He, Yinan
dc.contributor.author: Yu, Jiashuo
dc.contributor.author: Yang, Peiqing
dc.contributor.author: Guo, Yuwei
dc.contributor.author: Wu, Tianxing
dc.contributor.author: Si, Chenyang
dc.contributor.author: Jiang, Yuming
dc.contributor.author: Chen, Cunjian
dc.contributor.author: Loy, Chen Change
dc.contributor.author: Dai, Bo
dc.contributor.author: Lin, Dahua
dc.contributor.author: Qiao, Yu
dc.contributor.author: Liu, Ziwei
dc.date.accessioned: 2025-08-13T07:47:21Z
dc.date.available: 2025-08-13T07:47:21Z
dc.date.issued: 2025-05-01
dc.identifier.citation: International Journal of Computer Vision, 2025, v. 133, n. 5, p. 3059-3078
dc.identifier.issn: 0920-5691
dc.identifier.uri: http://hdl.handle.net/10722/358674
dc.description.abstract: This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously (a) accomplish the synthesis of visually realistic and temporally coherent videos while (b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: (1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. (2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications. Project page: https://github.com/Vchitect/LaVie/.
dc.language: eng
dc.publisher: Springer
dc.relation.ispartof: International Journal of Computer Vision
dc.subject: Diffusion models
dc.subject: Generative modeling
dc.subject: Video generation
dc.title: LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models
dc.type: Article
dc.identifier.doi: 10.1007/s11263-024-02295-1
dc.identifier.scopus: eid_2-s2.0-105003250612
dc.identifier.volume: 133
dc.identifier.issue: 5
dc.identifier.spage: 3059
dc.identifier.epage: 3078
dc.identifier.eissn: 1573-1405
dc.identifier.issnl: 0920-5691
