Improved the following parts based on feedback from the author phymhan (https://github.com/mkshing/svdiff-pytorch/issues/3)!
- Train spectral shifts for 1-D weights such as LayerNorm as well (file size: 935kB, up from 923kB)
- Use a separate learning rate for 1-D weights via `--learning_rate_1d` (see the sketch below)
- Optionally, train spectral shifts of the text encoder via `--train_text_encoder` (file size: 1.17MB)
With these changes, you get better results with fewer training steps than the first release v0.1.1!!
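Here is a minimal sketch of how `--learning_rate` and `--learning_rate_1d` can be wired into a single optimizer. It is illustrative only: `build_param_groups` and `named_spectral_shifts` are hypothetical names, not the actual `train_svdiff.py` internals.

```python
import torch

# Illustrative sketch (hypothetical names, not the actual train_svdiff.py code):
# split the trainable spectral shifts by dimensionality so that --learning_rate
# applies to 2-D+ weights and --learning_rate_1d to 1-D weights such as LayerNorm.
def build_param_groups(named_spectral_shifts, lr=1e-3, lr_1d=1e-6):
    params, params_1d = [], []
    for _, param in named_spectral_shifts:
        (params_1d if param.ndim == 1 else params).append(param)
    return [
        {"params": params, "lr": lr},
        {"params": params_1d, "lr": lr_1d},
    ]

# e.g. optimizer = torch.optim.AdamW(build_param_groups(named_spectral_shifts))
```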
**sample command**
<details>

```bash
accelerate launch svdiff-pytorch-2/train_svdiff.py \
--pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
--instance_data_dir=$INSTANCE_DATA_DIR \
--class_data_dir=$CLASS_DATA_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="photo of sks woman" \
--class_prompt="photo of a woman" \
--with_prior_preservation --prior_loss_weight=1.0 \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--learning_rate=1e-3 \
--learning_rate_1d=1e-6 \
--train_text_encoder \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--checkpointing_steps=200 \
--max_train_steps=1000 \
--use_8bit_adam \
--enable_xformers_memory_efficient_attention \
--seed=42 \
--gradient_checkpointing
```

</details>
"portrait of sks woman wearing kimono" where `sks` indicates Gal Gadot.

Added Single Image Editing
**sample script**
**training**
<details>

```bash
accelerate launch train_svdiff.py \
--pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
--instance_data_dir="pink-chair-dir" \
--output_dir="output-dir" \
--instance_prompt="photo of a pink chair with black legs" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--learning_rate=1e-3 \
--learning_rate_1d=1e-6 \
--train_text_encoder \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=500 \
--use_8bit_adam \
--enable_xformers_memory_efficient_attention \
--seed=42 \
--gradient_checkpointing
```

</details>
**inference**
<details>

```python
import sys
import torch
from PIL import Image
from diffusers import DDIMScheduler
sys.path.append("/content/svdiff-pytorch-2")
from svdiff_pytorch import load_unet_for_svdiff, load_text_encoder_for_svdiff, StableDiffusionPipelineWithDDIMInversion

pretrained_model_name_or_path = "runwayml/stable-diffusion-v1-5"
spectral_shifts_ckpt_dir = "/content/SIE/checkpoint-500"
image = "pink-chair.jpeg"
source_prompt = "photo of a pink chair with black legs"
target_prompt = "photo of a blue chair with black legs"

unet = load_unet_for_svdiff(pretrained_model_name_or_path, spectral_shifts_ckpt=spectral_shifts_ckpt_dir, subfolder="unet")
text_encoder = load_text_encoder_for_svdiff(pretrained_model_name_or_path, spectral_shifts_ckpt=spectral_shifts_ckpt_dir, subfolder="text_encoder")

# load pipe
pipe = StableDiffusionPipelineWithDDIMInversion.from_pretrained(
    pretrained_model_name_or_path,
    unet=unet,
    text_encoder=text_encoder,
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

# in this example, DDIM inversion is not used
inv_latents = None

# (optional) DDIM inversion: uncomment to invert the source image first
# (in SVDiff, they use guidance scale=1 in DDIM inversion)
# image = Image.open(image).convert("RGB").resize((512, 512))
# inv_latents = pipe.invert(source_prompt, image=image, guidance_scale=1.0).latents

image = pipe(target_prompt, latents=inv_latents).images[0]
```

</details>
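Note on the optional DDIM inversion: inverting the source image first and passing the resulting latents to the pipeline keeps the edit close to the input image's layout; as noted in the comments above, SVDiff uses guidance scale 1 for the inversion step.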
"photo of a ~~pink~~ blue chair with black legs"
\* The input image was taken from https://unsplash.com/photos/1JJJIHh7-Mk