
SD 1.x to SDXL refiner #4

Open
holwech opened this issue Oct 27, 2023 · 8 comments

Comments

@holwech

holwech commented Oct 27, 2023

Hey! Very cool that you've made this! I tried to combine your converter with SD 1.x and the SDXL refiner, but so far I haven't had much luck. Is this something you've managed to do successfully?

Here is the code I've used to combine SD 1.x and the SDXL refiner:

https://colab.research.google.com/drive/1lUHih8KsSGuKFTfYBz0I-6FkMEU5GkdP?usp=sharing

Here is an example of what I get out from the refiner atm:

[image: refiner output]

@city96
Owner

city96 commented Oct 28, 2023

Hi! I did a quick test, and using the refiner with 1.5 works with the ComfyUI node, meaning the issue is somewhere else.

Could it be a scaling issue? Just looking at the code, it looks like you're passing the scaled latents directly to the refiner. Try modifying the code like this:

# undo the v1 scaling factor (0.18215) to get raw VAE-space latents
scaled_latents = 1 / 0.18215 * latents
# convert the raw v1 latents to the XL latent space
sdxl_scaled_latents = convert(scaled_latents.to(dtype=torch.float32), "v1", "xl", torch.float32, torch_device)
# re-apply the scaling factor before passing the latents to the refiner
sdxl_latents = 0.18215 * sdxl_scaled_latents

@holwech
Author

holwech commented Nov 4, 2023

I made a simplified notebook to limit the number of potential issues. Still having the same problem as in the previous notebook unfortunately.

I'm not very familiar with ComfyUI and I don't know exactly how you connect the two models, so it's hard for me to pinpoint what the issue is. Could you share some details on how you connected SD1.5 and the refiner, and/or have a look at the simplified notebook to see if there are any obvious issues?

Could it be that the interposer was trained on a specific VAE and the default VAE for SD1.5 is not compatible?

@city96
Owner

city96 commented Nov 4, 2023

Your notebook asks me to log in, which I assume means it's set to private. Could you check the visibility settings?

ComfyUI is just a node-based frontend to the LDM code; internally it uses the same models/etc. as diffusers, so that shouldn't matter in this case.

Here is a quick and dirty example of the refiner being connected to the output of a 1.5 model. (Officially, this isn't quite correct, since you're supposed to return the noisy latent at around 80% denoise, then pass it to the refiner for the final 20%, but it works as an example here.)

[image: example ComfyUI workflow]

I don't think it's a VAE incompatibility issue either; the encoder part is the same for all v1.5 VAEs as far as I know.

I can try to write some example code for how to use this with diffusers if you want. I still suspect it's a scaling issue.

@city96
Owner

city96 commented Nov 4, 2023

Not great, but it works. Oddly enough, the v1 pipe doesn't have a denoising_end option, but you can just use a custom sampler like you were doing in your original notebook to do a partial denoise.

Code below
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionXLImg2ImgPipeline

# Load pipelines
pipe = StableDiffusionPipeline.from_single_file(
    r"D:\Software\AI\sd-models\checkpoints\mix\Silicon29_dark.safetensors",
    load_safety_checker=False, # takes forever to download
    torch_dtype=torch.float16,
)
pipe.enable_xformers_memory_efficient_attention()
refiner = StableDiffusionXLImg2ImgPipeline.from_single_file(
    r"D:\Software\AI\sd-models\checkpoints\sd\sdxl_v1.0_refiner.safetensors",
    torch_dtype=torch.float16,
)
refiner.enable_xformers_memory_efficient_attention()

# Generate image on SDv1
prompt = "a fantasy landscape, detailed matte painting"  # example prompt
pipe.to("cuda")
scaled_latent = pipe(
    prompt,
    height = 1024,
    width  = 1024,
    output_type = "latent",
    # denoising_end = 0.90, # doesn't work on v1
    num_inference_steps=20,
).images[0]
del pipe # free VRAM

# Convert latent
latent = scaled_latent * (1/0.18215)
xl_latent = convert_latent(latent, "v1", "xl") # code for the interposer, from your notebook
xl_scaled_latent = xl_latent * 0.18215

# Finish with refiner
refiner.to("cuda")
image = refiner(
    prompt = prompt,
    image  = xl_scaled_latent,
    denoising_start = 0.90,
    num_inference_steps=20,
).images[0]
del refiner # free VRAM
image.show()

@holwech
Author

holwech commented Nov 5, 2023

Awesome! Thanks for the thorough answer. It definitely seems like the issue was the scaling. With your code I got some more acceptable output.

I made the notebook public, so it should be possible to view it now.

In the notebook I made a simple test and I'm curious to get your opinion on whether this is the expected quality or not.

import requests
import torch
from PIL import Image
from io import BytesIO
import torchvision.transforms as transforms
from diffusers.image_processor import VaeImageProcessor
import gc
from diffusers import AutoencoderKL



generator = torch.manual_seed(0)
response = requests.get("https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg")
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((768, 512))

# Processing
sd_vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae", variant="fp16", torch_dtype=torch.float16).to("cuda")
vaeImageProcessor = VaeImageProcessor(2 ** (len(sd_vae.config.block_out_channels) - 1))
init_pre_image = vaeImageProcessor.preprocess(init_image).to(dtype=torch.float16, device="cuda")

# Encode
sd_latents = sd_vae.encode(init_pre_image).latent_dist.sample(generator)
#sd_latents = sd_latents * (1/0.18215)

# Convert
sdxl_latents = convert(sd_latents, "v1", "xl", torch.float16, "cuda").to(dtype=torch.float32)
sdxl_latents = sdxl_latents * 0.18215

# Decode
sdxl_vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").to("cuda")
image_tensor = sdxl_vae.decode(sdxl_latents / sdxl_vae.config.scaling_factor, return_dict=False)[0]

# Post-processing
image = vaeImageProcessor.postprocess(image=image_tensor.detach())[0]
image

Input image:

[image]

Output image:

[image]

As you can see, it has some artifacts. I could've done something wrong there though, as I inferred a lot of the steps from the diffusers library and it has a lot of stuff going on.

@holwech
Author

holwech commented Nov 5, 2023

Here is an example from the notebook with 80% steps on SD1.x and 20% on the refiner. Not getting great results unfortunately :(

SD1.5 output:
[image]

Refiner output:
[image]

@city96
Owner

city96 commented Nov 5, 2023

That quality looks similar to what I get, maybe a bit worse, but that could be from running it in FP16. It's a tiny model, so I'd recommend keeping the cast you had in the first notebook and using it with FP32, though I'm not sure how much that changes. It could also be a clamping difference on the output, hardware differences, etc.
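
For example, something along these lines (just a sketch, reusing convert_latent from the example above):

# run the interposer itself in FP32, then cast the result back for the FP16 refiner
xl_latent = convert_latent(latent.to(torch.float32), "v1", "xl").to(torch.float16)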

(Also noticed you were using the default XL VAE. I usually use this one since it lets me use FP16, though there's no noticeable difference in terms of visual quality.)

Doing v1=>xl is a lot harder than xl=>v1 because the XL latent contains more information than the v1 latent, so I could never get it 100% perfect; the interposer pretty much has to "make up" details to fit the format.

For the generation example, I think the image degradation you're seeing might be from the fact that you're passing a fully denoised latent into the refiner. As I noted above, there's no denoising_end option for v1 to get the noisy latent, so you'd have to do what you did in your first notebook - a custom sampling loop, except stopped a few steps before the final one.
You could also try euler a for the refiner, which adds noise at every step so it might alleviate it a bit.
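
In diffusers, that could look roughly like this (just a sketch - it assumes a recent diffusers version with encode_prompt and reuses pipe, refiner and convert_latent from the example above; the two schedulers' noise levels won't line up exactly, so treat it as an approximation):

import torch

prompt = "a fantasy landscape, detailed matte painting"  # example prompt
num_steps = 20
switch_at = 0.8  # fraction of steps done on v1 before handing off to the refiner

pipe.to("cuda")
with torch.no_grad():
    # encode the prompt; concatenate negative + positive embeds for classifier-free guidance
    cond, uncond = pipe.encode_prompt(prompt, "cuda", 1, True)
    prompt_embeds = torch.cat([uncond, cond])

    pipe.scheduler.set_timesteps(num_steps, device="cuda")
    latents = torch.randn(
        (1, pipe.unet.config.in_channels, 128, 128),  # 1024x1024 / 8
        device="cuda", dtype=torch.float16,
    ) * pipe.scheduler.init_noise_sigma

    stop_index = int(num_steps * switch_at)
    for i, t in enumerate(pipe.scheduler.timesteps):
        if i >= stop_index:
            break  # stop early and leave the remaining noise for the refiner
        latent_in = pipe.scheduler.scale_model_input(torch.cat([latents] * 2), t)
        noise_uncond, noise_cond = pipe.unet(
            latent_in, t, encoder_hidden_states=prompt_embeds
        ).sample.chunk(2)
        noise_pred = noise_uncond + 7.5 * (noise_cond - noise_uncond)  # guidance scale 7.5
        latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

# convert the still-noisy latent v1 -> xl with the same scaling as before
xl_scaled_latent = convert_latent(latents * (1 / 0.18215), "v1", "xl") * 0.18215

# let the refiner handle the last ~20%
image = refiner(
    prompt=prompt,
    image=xl_scaled_latent,
    denoising_start=switch_at,
    num_inference_steps=num_steps,
).images[0]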

Again, I'm just guessing. You could also do a 3-stage thing where your initial image is v1 512x512, then upscale it and send it through v1 at 1024x1024 before sending it to the refiner. v1 doesn't like generating at resolutions that high natively. (xl=>v1 is simpler since v1 can handle the 1024x1024 image from xl nicely; it's basically img2img at a low denoise there, so no weird hires repetition problems appear.)
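
A rough sketch of that 3-stage idea (again only illustrative - it reuses pipe, refiner and convert_latent from the earlier example and builds a v1 img2img pipeline from the same components):

from diffusers import StableDiffusionImg2ImgPipeline

# reuse the already-loaded v1 weights for a hires img2img pass
img2img = StableDiffusionImg2ImgPipeline(**pipe.components)
img2img.to("cuda")

# stage 1: base 512x512 generation on v1
base_image = pipe(prompt, height=512, width=512, num_inference_steps=20).images[0]

# stage 2: upscale to 1024x1024 and run a low-denoise img2img pass on v1
upscaled = base_image.resize((1024, 1024))
hires_latent = img2img(
    prompt,
    image=upscaled,
    strength=0.5,
    output_type="latent",
    num_inference_steps=20,
).images[0]

# stage 3: convert v1 -> xl and finish with the refiner
xl_scaled_latent = convert_latent(hires_latent * (1 / 0.18215), "v1", "xl") * 0.18215
final_image = refiner(
    prompt=prompt,
    image=xl_scaled_latent,
    denoising_start=0.9,
    num_inference_steps=20,
).images[0]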

@TomLucidor

@city96 thank you for the great work. I hope there will be a new version with fewer artifacts; latent-space expansion is a tough problem indeed. A small question: can a LoRA or embedding be transferred the same way?

@holwech could you interpolate between 70-30, 80-20, and 90-10 to see if the issue is "too much" or "not enough"?
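
A sweep could look roughly like this (run_v1_partial is a hypothetical helper wrapping the partial-denoise loop sketched earlier; refiner and convert_latent as above):

for split in (0.7, 0.8, 0.9):
    noisy_latent = run_v1_partial(prompt, stop_fraction=split)  # hypothetical wrapper
    xl_scaled = convert_latent(noisy_latent * (1 / 0.18215), "v1", "xl") * 0.18215
    img = refiner(prompt=prompt, image=xl_scaled, denoising_start=split,
                  num_inference_steps=20).images[0]
    img.save(f"refined_{int(split * 100)}.png")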
