## Latent Consistency Models (LCM)
![Untitled](https://github.com/huggingface/diffusers/assets/22957388/cce05d6b-e5de-4be0-8416-36556b80176e)
LCMs enable significantly faster inference for diffusion models: they require far fewer inference steps to produce high-resolution images without compromising image quality too much. Below is a usage example:
```python
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float32)
# To save GPU memory, torch.float16 can be used, but it may compromise image quality.
pipe.to(torch_device="cuda", torch_dtype=torch.float32)
prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"
# Can be set to 1~50 steps. LCMs support fast inference even at <= 4 steps. Recommended: 1~8 steps.
num_inference_steps = 4
images = pipe(prompt=prompt, num_inference_steps=num_inference_steps, guidance_scale=8.0).images
```
Refer to the [documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/latent_consistency_models) to learn more.
LCM comes with both text-to-image and image-to-image pipelines, contributed by luosiallen, nagolinc, and dg845.
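The image-to-image pipeline works the same way with the same checkpoint. Below is a minimal sketch; the `LatentConsistencyModelImg2ImgPipeline` class name and the input-image path are assumptions here, so double-check them against the documentation linked above.

```python
import torch
from diffusers import LatentConsistencyModelImg2ImgPipeline
from diffusers.utils import load_image

pipe = LatentConsistencyModelImg2ImgPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float32
)
pipe.to("cuda")

# Placeholder path: replace with your own init image.
init_image = load_image("path/to/init_image.png")

prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"

# Very low step counts also work for image-to-image.
image = pipe(
    prompt=prompt, image=init_image, num_inference_steps=4, guidance_scale=8.0, strength=0.5
).images[0]
```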
## PixArt-Alpha
<img width="627" alt="header_collage" src="https://github.com/huggingface/diffusers/assets/22957388/25beb50f-d530-4d19-9430-b1c00ae3b1d9">
PixArt-Alpha is a Transformer-based text-to-image diffusion model that rivals the quality of the existing state-of-the-art ones, such as Stable Diffusion XL, Imagen, and DALL-E 2, while being more efficient.
It was trained with T5 text embeddings and has a maximum sequence length of 120 tokens. This allows for more detailed prompt inputs, unlocking better-quality generations.
Despite the large text encoder, with model offloading it takes a little under 11GB of VRAM to run the `PixArtAlphaPipeline`:
```python
from diffusers import PixArtAlphaPipeline
import torch
pipeline_id = "PixArt-alpha/PixArt-XL-2-1024-MS"
pipeline = PixArtAlphaPipeline.from_pretrained(pipeline_id, torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()
prompt = "A small cactus with a happy face in the Sahara desert."
image = pipeline(prompt).images[0]
image.save("sahara.png")
```
Check out the [docs](https://huggingface.co/docs/diffusers/main/en/api/pipelines/pixart) to learn more.
## AnimateDiff
![animatediff-doc](https://github.com/huggingface/diffusers/assets/22957388/1a5f416b-f272-444c-b8bc-1b470abf38d4)
AnimateDiff is a modelling framework that allows you to create videos using pre-existing Stable Diffusion text-to-image models. It achieves this by inserting motion module layers into a frozen text-to-image model and training it on video clips to extract a motion prior.
These motion modules are applied after the ResNet and Attention blocks in the Stable Diffusion UNet. Their purpose is to introduce coherent motion across image frames. To support these modules, we introduce the concepts of a `MotionAdapter` and a `UNetMotionModel`. These serve as a convenient way to use these motion modules with existing Stable Diffusion models.
The following example demonstrates how you can utilize the motion modules with an existing Stable Diffusion text-to-image model.
```python
import torch
from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_gif
# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
# Load SD 1.5-based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter)
scheduler = DDIMScheduler.from_pretrained(
model_id, subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
)
pipe.scheduler = scheduler
# Enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()
output = pipe(
prompt=(
"masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
"orange sky, warm lighting, fishing boats, ocean waves seagulls, "
"rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
"golden hour, coastal landscape, seaside scenery"
),
negative_prompt="bad quality, worse quality",
num_frames=16,
guidance_scale=7.5,
num_inference_steps=25,
generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")
```
You can convert an existing 2D UNet into a `UNetMotionModel`:
```python
from diffusers import MotionAdapter, UNetMotionModel, UNet2DConditionModel
unet = UNetMotionModel()
# Load from an existing 2D UNet and MotionAdapter
unet2D = UNet2DConditionModel.from_pretrained("SG161222/Realistic_Vision_V5.1_noVAE", subfolder="unet")
motion_adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
# Pass the motion adapter while converting the 2D UNet ...
unet_motion = UNetMotionModel.from_unet2d(unet2D, motion_adapter=motion_adapter)
# ... or load the motion modules after init
unet_motion.load_motion_modules(motion_adapter)
# Freeze all 2D UNet layers except for the motion modules, for finetuning
unet_motion.freeze_unet2d_params()
# Save only the motion modules
unet_motion.save_motion_modules("path/to/save/motion-modules", push_to_hub=True)
```
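If the finetuned motion modules were saved as above, they can in principle be loaded back as a `MotionAdapter` and reused with any compatible SD 1.5 model. This is a sketch under that assumption, reusing the placeholder path from the previous snippet:

```python
from diffusers import MotionAdapter, AnimateDiffPipeline

# Placeholder path: the directory the motion modules were saved to above.
finetuned_adapter = MotionAdapter.from_pretrained("path/to/save/motion-modules")

# Plug the finetuned motion modules into an AnimateDiff pipeline.
pipe = AnimateDiffPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE", motion_adapter=finetuned_adapter
)
```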
AnimateDiff also comes with motion LoRA modules, letting you control subtleties of the generated motion, such as the camera zoom-out used in the example below:
```python
import torch
from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_gif
# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
# Load SD 1.5-based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter)
pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out")
scheduler = DDIMScheduler.from_pretrained(
model_id, subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
)
pipe.scheduler = scheduler
# Enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()
output = pipe(
prompt=(
"masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
"orange sky, warm lighting, fishing boats, ocean waves seagulls, "
"rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
"golden hour, coastal landscape, seaside scenery"
),
negative_prompt="bad quality, worse quality",
num_frames=16,
guidance_scale=7.5,
num_inference_steps=25,
generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")
```
![animatediff-zoom-out-lora](https://github.com/huggingface/diffusers/assets/22957388/edfa27f3-a8fd-4ce7-809a-f8e3b39f4868)
Check out the [documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/animatediff) to learn more.
## PEFT 🤝 Diffusers
There are many adapters (LoRA, for example) trained in different styles to achieve different effects. You can even combine multiple adapters to create new and unique images. With the 🤗 **[PEFT](https://huggingface.co/docs/peft/index)** integration in 🤗 Diffusers, it is really easy to load and manage adapters for inference.
Here is an example of combining multiple LoRAs using this new integration:
```python
from diffusers import DiffusionPipeline
import torch
pipe_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = DiffusionPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda")
# Load LoRA 1.
pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy")
# Load LoRA 2.
pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel")
# Combine the adapters.