# :art: Stable Diffusion 2 is here!
## Installation
`pip install diffusers[torch]==0.9 transformers`
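To confirm the expected version is picked up, a quick sanity check:

```python
import diffusers

print(diffusers.__version__)  # should print a 0.9.x version
```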
Stable Diffusion 2.0 is available in several flavors:
Stable Diffusion 2.0-V at `768x768`
New stable diffusion model (Stable Diffusion 2.0-v) at 768x768 resolution. Same number of parameters in the U-Net as `1.5`, but uses [OpenCLIP-ViT/H](https://github.com/mlfoundations/open_clip) as the text encoder and is trained from scratch. SD 2.0-v is a so-called [v-prediction](https://arxiv.org/abs/2202.00512) model.
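For context, instead of predicting the noise $\epsilon$ directly, a v-prediction model is trained to predict the velocity (in the notation of the linked paper, where $x_t = \alpha_t x_0 + \sigma_t \epsilon$):

$$v_t = \alpha_t \epsilon - \sigma_t x_0$$

which is why the scheduler for this model is configured with `prediction_type="v_prediction"`.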

```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

repo_id = "stabilityai/stable-diffusion-2"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")

# The scheduler picks up `prediction_type="v_prediction"` from the model config
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, guidance_scale=9, num_inference_steps=25).images[0]
image.save("astronaut.png")
```
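Generating at 768x768 in fp16 can still exceed the memory of smaller GPUs. This release also adds attention slicing support for SD2 (see #1397 in the changelog below), which trades a bit of speed for a lower peak memory footprint. A minimal sketch, reusing the `pipe` from above:

```python
# Compute attention in slices instead of all at once to lower peak VRAM usage
pipe.enable_attention_slicing()
image = pipe(prompt, guidance_scale=9, num_inference_steps=25).images[0]
```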
### Stable Diffusion 2.0-base at `512x512`
The above model is finetuned from SD 2.0-base, which was trained as a standard noise-prediction model on 512x512 images and is also made available.

```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

repo_id = "stabilityai/stable-diffusion-2-base"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("astronaut.png")
```
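If you need reproducible outputs, e.g. to compare schedulers or prompts, you can pass a seeded `torch.Generator` to the pipeline call; a minimal sketch, reusing the `pipe` from above:

```python
# A fixed seed makes repeated calls return the same image
generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]
```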
### Stable Diffusion 2.0 for Inpainting
This model for text-guided inpainting is finetuned from SD 2.0-base. It follows the mask-generation strategy presented in [LAMA](https://github.com/saic-mdal/lama): the generated masks, combined with the latent VAE representations of the masked image, are used as additional conditioning.

```python
import PIL
import requests
import torch
from io import BytesIO

from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler


def download_image(url):
    response = requests.get(url)
    return PIL.Image.open(BytesIO(response.content)).convert("RGB")


img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

repo_id = "stabilityai/stable-diffusion-2-inpainting"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
# White pixels in the mask are repainted; black pixels are kept from the original image
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25).images[0]
image.save("yellow_cat.png")
```
### Stable Diffusion X4 Upscaler
The model is a text-guided [latent upscaling diffusion model](https://arxiv.org/abs/2112.10752) trained on crops of size `512x512`. In addition to the text prompt, it takes a `noise_level` input parameter, which can be used to add noise to the low-resolution input according to a [predefined diffusion schedule](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler/blob/main/low_res_scheduler/scheduler_config.json).

```python
import requests
import torch
from io import BytesIO
from PIL import Image

from diffusers import StableDiffusionUpscalePipeline

model_id = "stabilityai/stable-diffusion-x4-upscaler"
pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, revision="fp16", torch_dtype=torch.float16)
pipeline = pipeline.to("cuda")

# Download a low-resolution example image to upscale
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
response = requests.get(url)
low_res_img = Image.open(BytesIO(response.content)).convert("RGB")
low_res_img = low_res_img.resize((128, 128))

prompt = "a white cat"
upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0]
upscaled_image.save("upsampled_cat.png")
```
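The `noise_level` argument mentioned above is exposed on the pipeline call; higher values add more noise to the low-resolution input, giving the model more freedom but preserving less of the original. A minimal sketch, reusing the `pipeline` from above (the value below is an illustrative choice, not a tuned setting):

```python
# Add noise to the low-res input before upscaling; higher = more creative freedom
upscaled_image = pipeline(prompt=prompt, image=low_res_img, noise_level=20).images[0]
upscaled_image.save("upsampled_cat_noisier.png")
```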
## Saving & Loading is fixed for Versatile Diffusion
Previously there was a :bug: when saving & loading Versatile Diffusion; this is now fixed so that memory-efficient saving & loading work as expected. See the sketch after the PR reference below.
* [Versatile Diffusion] Fix remaining tests by @patrickvonplaten in #1418
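As a quick illustration of the now-working roundtrip, a minimal sketch (the local path is just an example; `shi-labs/versatile-diffusion` is the public checkpoint):

```python
import torch
from diffusers import VersatileDiffusionPipeline

pipe = VersatileDiffusionPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)

# Save the pipeline to a local folder, then load it back from disk
pipe.save_pretrained("./versatile-diffusion-local")
pipe = VersatileDiffusionPipeline.from_pretrained("./versatile-diffusion-local", torch_dtype=torch.float16)
```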
## :memo: Changelog
* add v prediction by @patil-suraj in #1386
* Adapt UNet2D for super-resolution by @patil-suraj in #1385
* Version 0.9.0.dev0 by @anton-l in #1394
* Make height and width optional by @patrickvonplaten in #1401
* [Config] Add optional arguments by @patrickvonplaten in #1395
* Upscaling fixed by @patrickvonplaten in #1402
* Add the new SD2 attention params to the VD text unet by @anton-l in #1400
* Deprecate sample size by @patrickvonplaten in #1406
* Support SD2 attention slicing by @anton-l in #1397
* Add SD2 inpainting integration tests by @anton-l in #1412
* Fix sample size conversion script by @patrickvonplaten in #1408
* fix clip guided by @patrickvonplaten in #1414
* Fix all stable diffusion by @patrickvonplaten in #1415
* [MPS] call contiguous after permute by @kashif in #1411
* Deprecate `predict_epsilon` by @pcuenca in #1393
* Fix ONNX conversion and inference by @anton-l in #1416
* Allow to set config params directly in init by @patrickvonplaten in #1419
* Add tests for Stable Diffusion 2 V-prediction 768x768 by @anton-l in #1420
* StableDiffusionUpscalePipeline by @patil-suraj in #1396
* added initial v-pred support to DPM-solver by @kashif in #1421
* SD2 docs by @patrickvonplaten in #1424