SDXL ControlNets
The 🧨 diffusers team has trained two ControlNets on [Stable Diffusion XL (SDXL)](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl):
- Canny ([diffusers/controlnet-canny-sdxl-1.0](https://huggingface.co/diffusers/controlnet-canny-sdxl-1.0))
- Depth ([diffusers/controlnet-depth-sdxl-1.0](https://huggingface.co/diffusers/controlnet-depth-sdxl-1.0))
![image_grid_controlnet_sdxl](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/image_grid_controlnet_sdxl.jpg)
You can find all the SDXL ControlNet checkpoints [here](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet), including some [smaller](https://huggingface.co/diffusers/controlnet-canny-sdxl-1.0-small) [ones](https://huggingface.co/diffusers/controlnet-canny-sdxl-1.0-mid) (5 to 7x smaller).
To learn more about how to use these ControlNets for inference, check out the respective model cards and the [documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet_sdxl). To train custom SDXL ControlNets, you can try out [our training script](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md).
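As an illustration, here is a minimal inference sketch for the Canny checkpoint; the SDXL base model ID, the input image URL, and the prompt are example values you would swap for your own:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

# Load the Canny ControlNet and attach it to the SDXL base pipeline
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Build a Canny edge map from an input image to use as the conditioning signal
image = np.array(
    load_image("https://hf.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png")
)
edges = cv2.Canny(image, 100, 200)
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

image = pipe("aerial view of a futuristic research complex", image=canny_image).images[0]
```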
MultiControlNet for SDXL
This release also introduces support for combining multiple ControlNets trained on SDXL and performing inference with them. Refer to the [documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet_sdxl#multicontrolnet) to learn more.
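For instance, a minimal sketch of combining the Canny and Depth ControlNets in one SDXL pipeline could look like the following; `canny_image` and `depth_image` are assumed to be conditioning images you have already prepared, and the prompt and conditioning scales are placeholders:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

# Load both ControlNets and pass them as a list to the SDXL pipeline
controlnets = [
    ControlNetModel.from_pretrained("diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16),
]
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

# Supply one conditioning image and one conditioning scale per ControlNet
image = pipe(
    "a futuristic living room, cinematic lighting",
    image=[canny_image, depth_image],  # conditioning images prepared beforehand (assumed)
    controlnet_conditioning_scale=[0.5, 0.5],
).images[0]
```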
GLIGEN
The GLIGEN model was developed by researchers and engineers from **[University of Wisconsin-Madison, Columbia University, and Microsoft](https://github.com/gligen/GLIGEN)**. The `StableDiffusionGLIGENPipeline` can generate photorealistic images conditioned on grounding inputs: a caption/prompt plus text phrases and bounding boxes. If an input image is also given, the pipeline inserts the objects described by the text phrases at the regions defined by the bounding boxes; otherwise, it generates an image from the caption/prompt and places the described objects at those regions. It is trained on the COCO2014D and COCO2014CD datasets, and the model uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs.
![gligen_gif](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/same_box.gif)
*(GIF from the [official website](https://gligen.github.io/))*
**Grounded inpainting**
```python
import torch
from diffusers import StableDiffusionGLIGENPipeline
from diffusers.utils import load_image
# Insert objects described by text at the region defined by bounding boxes
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-inpainting-text-box", variant="fp16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
input_image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gligen/livingroom_modern.png"
)
prompt = "a birthday cake"