Links for fulltext (may require subscription):
- Publisher Website (DOI): 10.1007/s11263-024-02295-1
- Scopus: eid_2-s2.0-105003250612
Citations:
- Scopus: 0
Article: LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models
| Title | LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models |
|---|---|
| Authors | Wang, Yaohui; Chen, Xinyuan; Ma, Xin; Zhou, Shangchen; Huang, Ziqi; Wang, Yi; Yang, Ceyuan; He, Yinan; Yu, Jiashuo; Yang, Peiqing; Guo, Yuwei; Wu, Tianxing; Si, Chenyang; Jiang, Yuming; Chen, Cunjian; Loy, Chen Change; Dai, Bo; Lin, Dahua; Qiao, Yu; Liu, Ziwei |
| Keywords | Diffusion models; Generative modeling; Video generation |
| Issue Date | 1-May-2025 |
| Publisher | Springer |
| Citation | International Journal of Computer Vision, 2025, v. 133, n. 5, p. 3059-3078 |
| Abstract | This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously (a) accomplish the synthesis of visually realistic and temporally coherent videos while (b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: (1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. (2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications. Project page: https://github.com/Vchitect/LaVie/. |
| Persistent Identifier | http://hdl.handle.net/10722/358674 |
| ISSN | 0920-5691 |
| Impact Factor | 2023 Impact Factor: 11.6; 2023 SCImago Journal Rankings: 6.668 |
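The abstract's first insight — that simple temporal self-attention combined with rotary positional encoding suffices to capture temporal correlations across frames — can be illustrated with a minimal single-head sketch. This is an illustrative NumPy toy, not the paper's implementation; the function names, single-head setup, and per-frame token shapes are assumptions for exposition only.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Rotary positional encoding along the time axis.

    x: (T, d) array with d even. Position t rotates each feature pair
    (x1_i, x2_i) by angle t * theta_i, so relative frame offsets are
    encoded directly in the query-key dot products.
    """
    T, d = x.shape
    half = d // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.outer(np.arange(T), inv_freq)          # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def temporal_self_attention(frames):
    """Single-head self-attention over the frame (time) axis.

    frames: (T, d) — one token per frame at a fixed spatial location.
    Queries and keys carry rotary position information; values do not.
    """
    q = rotary_embed(frames)
    k = rotary_embed(frames)
    v = frames
    d = frames.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (T, T) frame-to-frame
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over time
    return weights @ v                                 # (T, d)
```

Because the rotary encoding is a pure per-pair rotation, it preserves feature norms while making attention scores depend on relative frame distance, which is what lets a pre-trained spatial T2I backbone acquire temporal awareness with few new parameters.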
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Wang, Yaohui | - |
| dc.contributor.author | Chen, Xinyuan | - |
| dc.contributor.author | Ma, Xin | - |
| dc.contributor.author | Zhou, Shangchen | - |
| dc.contributor.author | Huang, Ziqi | - |
| dc.contributor.author | Wang, Yi | - |
| dc.contributor.author | Yang, Ceyuan | - |
| dc.contributor.author | He, Yinan | - |
| dc.contributor.author | Yu, Jiashuo | - |
| dc.contributor.author | Yang, Peiqing | - |
| dc.contributor.author | Guo, Yuwei | - |
| dc.contributor.author | Wu, Tianxing | - |
| dc.contributor.author | Si, Chenyang | - |
| dc.contributor.author | Jiang, Yuming | - |
| dc.contributor.author | Chen, Cunjian | - |
| dc.contributor.author | Loy, Chen Change | - |
| dc.contributor.author | Dai, Bo | - |
| dc.contributor.author | Lin, Dahua | - |
| dc.contributor.author | Qiao, Yu | - |
| dc.contributor.author | Liu, Ziwei | - |
| dc.date.accessioned | 2025-08-13T07:47:21Z | - |
| dc.date.available | 2025-08-13T07:47:21Z | - |
| dc.date.issued | 2025-05-01 | - |
| dc.identifier.citation | International Journal of Computer Vision, 2025, v. 133, n. 5, p. 3059-3078 | - |
| dc.identifier.issn | 0920-5691 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/358674 | - |
| dc.description.abstract | <p>This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously (a) accomplish the synthesis of visually realistic and temporally coherent videos while (b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: (1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. (2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications. Project page: https://github.com/Vchitect/LaVie/.</p> | - |
| dc.language | eng | - |
| dc.publisher | Springer | - |
| dc.relation.ispartof | International Journal of Computer Vision | - |
| dc.subject | Diffusion models | - |
| dc.subject | Generative modeling | - |
| dc.subject | Video generation | - |
| dc.title | LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.1007/s11263-024-02295-1 | - |
| dc.identifier.scopus | eid_2-s2.0-105003250612 | - |
| dc.identifier.volume | 133 | - |
| dc.identifier.issue | 5 | - |
| dc.identifier.spage | 3059 | - |
| dc.identifier.epage | 3078 | - |
| dc.identifier.eissn | 1573-1405 | - |
| dc.identifier.issnl | 0920-5691 | - |
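The cascade described in the abstract — a base T2V model followed by a temporal interpolation model and a video super-resolution model — can be sketched as three chained stage interfaces. All function names, shapes, and the stand-in operations below (random latents, linear frame interpolation, nearest-neighbour upsampling) are illustrative assumptions; in LaVie each stage is a latent diffusion model, not the simple transforms shown here.

```python
import numpy as np

def base_t2v(prompt, num_frames=16, size=32):
    """Stage 1 stand-in: produce a low-fps, low-resolution video tensor.

    A seeded random latent stands in for the base text-to-video
    diffusion model's output, keyed deterministically on the prompt.
    """
    seed = abs(hash(prompt)) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_frames, size, size))

def temporal_interpolation(video, factor=2):
    """Stage 2 stand-in: raise the frame rate by inserting frames.

    Linear blending between neighbouring frames stands in for the
    temporal interpolation diffusion model.
    """
    t, h, w = video.shape
    out = np.empty(((t - 1) * factor + 1, h, w))
    out[::factor] = video                        # keep original keyframes
    for i in range(t - 1):
        for j in range(1, factor):
            a = j / factor
            out[i * factor + j] = (1 - a) * video[i] + a * video[i + 1]
    return out

def video_super_resolution(video, scale=4):
    """Stage 3 stand-in: upsample each frame spatially."""
    return video.repeat(scale, axis=1).repeat(scale, axis=2)

def cascade(prompt):
    """Chain the three stages: T2V -> interpolation -> super-resolution."""
    return video_super_resolution(temporal_interpolation(base_t2v(prompt)))
```

The design point the sketch captures is separation of concerns: each stage solves one sub-problem (content, frame rate, resolution) at a tractable scale, so no single model must generate long, high-resolution video directly.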
