This patch release adds Diffusers support for the upcoming CogVideoX-5B-I2V release (an Image-to-Video generation model)! The model weights will be available by the end of the week on the HF Hub at `THUDM/CogVideoX-5b-I2V` ([Link](https://huggingface.co/THUDM/CogVideoX-5b-I2V)). Stay tuned for the release!
This release features two new pipelines:
- `CogVideoXImageToVideoPipeline`
- `CogVideoXVideoToVideoPipeline`
Additionally, we now have support for tiled encoding in the CogVideoX VAE. This can be enabled by calling the `vae.enable_tiling()` method, and it is used in the new Video-to-Video pipeline to encode sample videos to latents in a memory-efficient manner.
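For reference, here is a minimal sketch of using tiled encoding on the VAE directly (the pipelines do this internally when tiling is enabled); the checkpoint, input shape, and dtype below are illustrative assumptions, not taken from this release:

```python
import torch
from diffusers import AutoencoderKLCogVideoX

# Load just the VAE from a CogVideoX checkpoint (illustrative choice).
vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="vae", torch_dtype=torch.bfloat16
).to("cuda")

# Tiled encoding processes frames in overlapping spatial tiles and blends
# the results, keeping peak memory low at large resolutions.
vae.enable_tiling()

# Dummy input: (batch, channels, num_frames, height, width), values in [-1, 1].
video = torch.randn(1, 3, 49, 480, 720, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()
```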
## CogVideoXImageToVideoPipeline
The code below demonstrates how to use the new image-to-video pipeline:
```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
pipe = CogVideoXImageToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16)
pipe.to("cuda")
# Optionally, enable memory optimizations.
# If enabling CPU offloading, remember to remove `pipe.to("cuda")` above
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
prompt = "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
image = load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)
video = pipe(image, prompt, use_dynamic_cfg=True)
export_to_video(video.frames[0], "output.mp4", fps=8)
```
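If you opt into CPU offloading as the comments above suggest, one possible setup looks like the sketch below (same checkpoint as above; the key point is that `pipe.to("cuda")` is dropped):

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline

# CPU-offload variant: note there is no `pipe.to("cuda")` call here.
# enable_model_cpu_offload() moves each submodule to the GPU only while
# it runs, trading some speed for a much lower peak VRAM footprint.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
```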
<table align=center>
<tr>
<td align=center colspan=1><img src="https://github.com/user-attachments/assets/1c7c1d86-f97e-44dd-9b17-4fec2bbc2b1a" /></td>
<td align=center colspan=1><video src="https://github.com/user-attachments/assets/a115372e-c539-4ca0-b0d0-770d62862257"> Your browser does not support the video tag. </video></td>
</tr>
</table>
## CogVideoXVideoToVideoPipeline
The code below demonstrates how to use the new video-to-video pipeline:
```python
import torch
from diffusers import CogVideoXDPMScheduler, CogVideoXVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video
# Models: "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
pipe = CogVideoXVideoToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")
input_video = load_video(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4"
)
prompt = (
"An astronaut stands triumphantly at the peak of a towering mountain. Panorama of rugged peaks and "
"valleys. Very futuristic vibe and animated aesthetic. Highlights of purple and golden colors in "
"the scene. The sky looks like an animated/cartoonish dream of galaxies, nebulae, stars, planets, "
"moons, but the remainder of the scene is mostly realistic."
)
video = pipe(
video=input_video, prompt=prompt, strength=0.8, guidance_scale=6, num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```
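A note on `strength`: as in other Diffusers image/video-to-video pipelines, it sets how much noise is added to the encoded input, so higher values stray further from the source video. A rough sketch of the usual relationship between `strength` and the number of denoising steps actually run (an assumption based on the common Diffusers pattern, not confirmed for this pipeline) is:

```python
def effective_denoising_steps(num_inference_steps: int, strength: float) -> int:
    # The pipeline skips the earliest timesteps and only denoises the
    # remainder; min() guards against strength values above 1.0.
    return min(int(num_inference_steps * strength), num_inference_steps)

print(effective_denoising_steps(50, 0.8))  # 40 steps actually run above
```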
<table align=center>
<tr>
<td align=center><video src="https://github.com/user-attachments/assets/bc9273ff-e459-42f9-af1e-c9b084b28f4d"> Your browser does not support the video tag. </video></td>
</tr>
</table>
Shoutout to tin2tin for the awesome demonstration!
Refer to our [documentation](https://huggingface.co/docs/diffusers/api/pipelines/cogvideox) to learn more about these pipelines.
## All commits
* [core] Support VideoToVideo with CogVideoX by @a-r-r-o-w in #9333
* [core] CogVideoX memory optimizations in VAE encode by @a-r-r-o-w in #9340
* [CI] Quick fix for Cog Video Test by @DN6 in #9373
* [refactor] move positional embeddings to patch embed layer for CogVideoX by @a-r-r-o-w in #9263
* CogVideoX-5b-I2V support by @zRzRzRzRzRzRzR in #9418