
How to make image inversion more precise? #20

Open
hmartiro opened this issue Oct 15, 2022 · 8 comments

hmartiro commented Oct 15, 2022

Fantastic work on this project @bloc97!

I'm able to get super impressive results with prompt editing. However, when doing img2img I find that the results degrade greatly. For example, here I'm editing the prompt to change the image to a charcoal drawing, which works well. But if I pass in the initial image generated from the original prompt, I can't find any parameter values that get anywhere close to the quality of the prompt edit without an initial image. I'm seeing the same issues as stock SD: either the macro structure of the initial image is lost, or the prompt edit has little to no effect.

The reason I want this is to edit real images and to build edits on top of each other. I realize this may be unsolved and may depend on how well the network understands the scene content, but I'm very interested in your thoughts and suggestions here, as I think it would be incredibly powerful.

img_original = stablediffusion(
    prompt="a fantasy landscape with a maple forest",
    steps=50,
    seed=42,
)

img_prompt_edit = stablediffusion(
    prompt="a fantasy landscape with a maple forest",
    prompt_edit="a charcoal sketch of a fantasy landscape with a maple forest",
    steps=50,
    seed=42,
)

img_init_image = stablediffusion(
    prompt="a fantasy landscape with a maple forest",
    prompt_edit="a charcoal sketch of a fantasy landscape with a maple forest",
    steps=50,
    seed=42,
    init_image=img_original,
    init_image_strength=0.6,
)

[images: original generation, prompt-edited generation, and degraded img2img result]


bloc97 commented Oct 15, 2022

You can try the method in InverseCrossAttention_Release.ipynb. However, this works very well only on images that were generated by stable diffusion. For real images, inversion with high CFG is currently an unsolved problem. Sometimes you get good results, sometimes you don't: images with uniform content are usually easy to reconstruct (e.g. objects/faces on a white background); otherwise the reconstruction can focus on the background and distort the intended object. Also, images that would never be generated by the model given the input prompt fare quite poorly.
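For context, the guidance_scale in these calls is the standard classifier-free guidance (CFG) weight: the noise prediction is extrapolated from the unconditional prediction toward the conditional one, so high scales magnify any error in an inverted latent. A minimal sketch of the extrapolation (the function name is mine, not the repo's):

```python
def guided_eps(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one. A scale of 1.0 is purely
    conditional; larger values overshoot, which amplifies inversion error."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# With a high scale, a modest conditional/unconditional gap is blown up:
print(guided_eps(0.0, 1.0, 5.0))  # -> 5.0
```

This is why a lower guidance_scale (5.0 or 3.0 here, instead of the usual 7.5) tends to make reconstruction more faithful.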
Here's what I get with the following code:

gen_latents = inversestablediffusion(input_image, "a fantasy landscape with a maple forest", refine_iterations=10, guidance_scale=5.0)
stablediffusion("a fantasy landscape with a maple forest", guidance_scale=5.0, init_latents=gen_latents)
stablediffusion("a fantasy landscape with a maple forest", prompt_edit="a charcoal sketch of a fantasy landscape with a maple forest", guidance_scale=5.0, init_latents=gen_latents)
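My reading of what inversestablediffusion does, for anyone following along (there's no accompanying paper): it runs the deterministic DDIM update backwards, and since the exact backward step depends on the noise prediction at the yet-unknown latent, it appears to refine a naive estimate by fixed-point iteration — which is what refine_iterations would control. A self-contained 1-D toy under those assumptions (eps_model and the alpha schedule are made-up stand-ins, not the repo's code):

```python
import math

# Toy stand-in for the U-Net's noise prediction; the real model depends on
# the latent, the timestep, and the prompt embedding.
def eps_model(x, t):
    return 0.3 * x

# Made-up monotone schedule for cumulative alphas; alpha_bar[0] = 1 (clean).
alpha_bar = [1.0 - 0.02 * t for t in range(11)]

def ddim_step(x_t, a_t, a_prev, eps):
    """One deterministic DDIM denoising step, t -> t-1."""
    x0_pred = (x_t - math.sqrt(1 - a_t) * eps) / math.sqrt(a_t)
    return math.sqrt(a_prev) * x0_pred + math.sqrt(1 - a_prev) * eps

def invert_step(x_prev, a_t, a_prev, t, refine_iterations=5):
    """Recover x_t from x_{t-1}. The exact backward step needs eps(x_t),
    which is unknown, so start from the naive guess eps(x_{t-1}) and
    refine by fixed-point iteration."""
    x_t = x_prev
    for _ in range(refine_iterations):
        eps = eps_model(x_t, t)
        x0_pred = (x_prev - math.sqrt(1 - a_prev) * eps) / math.sqrt(a_prev)
        x_t = math.sqrt(a_t) * x0_pred + math.sqrt(1 - a_t) * eps
    return x_t

# Round trip: invert a "clean" latent up to full noise, then denoise back.
x0 = 0.8
x = x0
for t in range(1, len(alpha_bar)):          # inversion: t = 1..10
    x = invert_step(x, alpha_bar[t], alpha_bar[t - 1], t)
for t in range(len(alpha_bar) - 1, 0, -1):  # sampling: t = 10..1
    x = ddim_step(x, alpha_bar[t], alpha_bar[t - 1], eps_model(x, t))
print(abs(x - x0))  # reconstruction error is tiny
```

With too few refinement iterations the backward step is only approximate, which is one source of the reconstruction drift discussed below.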

Reconstruction: [image]
Edit: [image]


bloc97 commented Oct 15, 2022

Examples where inversion doesn't work well:
Left is the original image; right is the reconstruction from the prompt.

https://www.pexels.com/photo/tray-of-pumpkins-on-a-knitted-sweater-5429788/

gen_latents = inversestablediffusion(input_image, "pumpkins in a tray seen from above", refine_iterations=10, refine_skip=0.4, guidance_scale=3.0)
stablediffusion("pumpkins in a tray seen from above", guidance_scale=3.0, init_latents=gen_latents)

[image: original (left) vs. reconstruction (right)]

https://www.pexels.com/photo/trees-at-the-park-under-clear-sky-3227735/

gen_latents = inversestablediffusion(input_image, "a temperate forest in autumn", refine_iterations=10, refine_skip=0.5, guidance_scale=3.0)
stablediffusion("a temperate forest in autumn", guidance_scale=3.0, init_latents=gen_latents)

[image: original (left) vs. reconstruction (right)]

hmartiro commented:

That's good input @bloc97, thank you. I spent most of today experimenting with the inverse method on real images and got unpredictable results, as you described, although with a couple of great examples.

Have you thought about whether fine-tuning the model or using an embedding (like textual inversion or DreamBooth) could help when applied to a particular domain? For example, if one were to take a video of a particular forest or scene and fine-tune on frames from it, would it then be possible to do precise prompt-to-prompt editing of real images close to that distribution?


bloc97 commented Oct 16, 2022

> Have you thought about whether fine-tuning the model or using an embedding (like textual inversion or DreamBooth)

That might actually work! A better reconstruction usually allows for better editing...


bloc97 commented Oct 18, 2022

@hmartiro There's Google's Imagic paper that just got released. https://arxiv.org/abs/2210.09276

From the paper, it seems that inverting the prompt embeddings as well (not just the latents) yields even better results. They also fine-tune the model on the inverted embeddings so that it reconstructs the input image better.
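As described in the paper, Imagic has three stages: (1) optimize the text embedding so the frozen model reconstructs the input image, (2) fine-tune the model around that embedding, (3) interpolate between the optimized and target embeddings to apply the edit. A deliberately tiny 1-D caricature of stages 1 and 3 — the linear "generator" and every number here are stand-ins, not a diffusion model:

```python
def generator(e):
    """Stand-in for the frozen diffusion model conditioned on embedding e."""
    return 2.0 * e + 1.0

def optimize_embedding(e_init, target, lr=0.1, steps=200):
    """Stage 1: gradient descent on the reconstruction loss (g(e) - target)^2."""
    e = e_init
    for _ in range(steps):
        grad = 2.0 * (generator(e) - target) * 2.0  # chain rule through g
        e -= lr * grad
    return e

target_image = 5.0  # the "image" to reconstruct
e_target = 3.0      # embedding of the edit prompt

# Stage 1: start from the target prompt's embedding and optimize it until
# the frozen generator reproduces the input image.
e_opt = optimize_embedding(e_init=e_target, target=target_image)

# Stage 3: interpolate toward the target embedding to apply the edit while
# staying close to the embedding that reconstructs the original.
eta = 0.7
e_edit = (1 - eta) * e_opt + eta * e_target
```

Stage 2 (fine-tuning the model weights themselves around e_opt) is what lets the real method keep the reconstruction faithful even after interpolation.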

hmartiro commented:

Very spicy @bloc97! I'll take a look at that paper.

I did end up fine-tuning with the DreamBooth approach on a novel object type. I haven't tried this cross-attention method on that model yet, but the similar "img2img alternate" approach in the automatic1111 repo did greatly improve prompt editing once the model understood the class. It's definitely still immature, though, and results are sporadic.

hmartiro commented:

Also see: https://text2live.github.io/

So far it appears very slow to run, since it requires extensive training on a single image, but the results are very exciting. I really like the idea of the separate edit layer that gets composited on top of the original. Do you think such an approach would have value with your method?
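For what it's worth, the edit-layer idea reduces to standard alpha compositing: the network predicts an RGBA layer, and the final frame is that layer blended over the original, so regions where the layer is transparent are preserved exactly. A minimal per-pixel sketch in pure Python (nothing here is Text2LIVE's actual code):

```python
def composite(base_rgb, edit_rgba):
    """Alpha-composite an RGBA edit layer over an RGB base image.
    Where the layer's alpha is 0, the original pixel passes through exactly."""
    out = []
    for (r, g, b), (er, eg, eb, a) in zip(base_rgb, edit_rgba):
        out.append((r * (1 - a) + er * a,
                    g * (1 - a) + eg * a,
                    b * (1 - a) + eb * a))
    return out

base = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
layer = [(0.0, 0.0, 0.0, 0.0),   # alpha 0: pixel unchanged
         (0.0, 0.0, 1.0, 0.5)]   # alpha 0.5: half-blend toward blue
print(composite(base, layer))    # [(1.0, 0.0, 0.0), (0.0, 0.5, 0.5)]
```

The appeal for editing is exactly this pass-through property: untouched regions can't drift, unlike full-image regeneration.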

KevinGoodman commented:

Thanks for the amazing work, but I'm a bit confused about this inversion... I wonder if there is a corresponding paper/article that elaborates on this process.
