How to make image inversion more precise? #20
You can try this method:

```python
gen_latents = inversestablediffusion(input_image, "a fantasy landscape with a maple forest", refine_iterations=10, guidance_scale=5.0)
stablediffusion("a fantasy landscape with a maple forest", guidance_scale=5.0, init_latents=gen_latents)
stablediffusion("a fantasy landscape with a maple forest", prompt_edit="a charcoal sketch of a fantasy landscape with a maple forest", guidance_scale=5.0, init_latents=gen_latents)
```
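For anyone reproducing these calls: `input_image` has to be loaded and normalized first. A minimal sketch, assuming the function wants a Stable-Diffusion-style image tensor in [-1, 1] (the exact format `inversestablediffusion` expects is my assumption, so check against the repo):

```python
# Hypothetical preprocessing for input_image; the exact tensor layout that
# inversestablediffusion() expects is an assumption, not taken from the repo.
from PIL import Image
import numpy as np
import torch

def load_input_image(path, size=512):
    # Resize to the model's native resolution and scale pixels to [-1, 1],
    # the convention used by Stable Diffusion's VAE encoder.
    image = Image.open(path).convert("RGB").resize((size, size), Image.LANCZOS)
    tensor = torch.from_numpy(np.array(image)).float() / 127.5 - 1.0
    return tensor.permute(2, 0, 1).unsqueeze(0)  # shape (1, 3, H, W)

input_image = load_input_image("landscape.jpg")
```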
Examples where inversion doesn't work well:

https://www.pexels.com/photo/tray-of-pumpkins-on-a-knitted-sweater-5429788/

```python
gen_latents = inversestablediffusion(input_image, "pumpkins in a tray seen from above", refine_iterations=10, refine_skip=0.4, guidance_scale=3.0)
stablediffusion("pumpkins in a tray seen from above", guidance_scale=3.0, init_latents=gen_latents)
```

https://www.pexels.com/photo/trees-at-the-park-under-clear-sky-3227735/

```python
gen_latents = inversestablediffusion(input_image, "a temperate forest in autumn", refine_iterations=10, refine_skip=0.5, guidance_scale=3.0)
stablediffusion("a temperate forest in autumn", guidance_scale=3.0, init_latents=gen_latents)
```
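As I understand it, the inversion here is deterministic DDIM inversion: because DDIM sampling is deterministic, its update rule can be run backwards, mapping the clean latent to a noise latent that regenerates the image, with `refine_iterations` presumably re-estimating the noise prediction at each step as a fixed-point correction. A simplified sketch of the plain backward pass, assuming diffusers-style `unet`/`scheduler` objects rather than this repo's internals:

```python
# Simplified sketch of deterministic DDIM inversion (my reading of what
# inversestablediffusion does); diffusers-style unet/scheduler assumed,
# classifier-free guidance and the fixed-point refinement omitted.
import torch

@torch.no_grad()
def ddim_invert(latents, text_emb, unet, scheduler, num_steps=50):
    scheduler.set_timesteps(num_steps)
    timesteps = list(reversed(scheduler.timesteps))  # walk clean -> noisy
    for i, t in enumerate(timesteps):
        eps = unet(latents, t, encoder_hidden_states=text_emb).sample
        a_next = scheduler.alphas_cumprod[t]          # target noise level
        a_cur = (scheduler.alphas_cumprod[timesteps[i - 1]]
                 if i > 0 else scheduler.final_alpha_cumprod)
        # Reverse the DDIM step: predict x0 at the current noise level, then
        # re-noise it to the next (higher) level with the same eps.
        pred_x0 = (latents - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        latents = a_next.sqrt() * pred_x0 + (1 - a_next).sqrt() * eps
    return latents  # approximates the x_T that regenerates the input
```

The approximation error comes from evaluating eps at the current latent rather than the unknown next one; my guess is that `refine_iterations` iterates that estimate until it stabilizes, and `refine_skip` skips the refinement for a fraction of the steps.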
That's good input @bloc97, thank you. I spent most of today experimenting with the inverse method on real images and, as you noted, got unpredictable results, though with a couple of great examples. Have you thought about whether fine-tuning the model or using an embedding (like textual inversion or DreamBooth) could help when applied to a particular domain? For example, if one were to take a video through a particular forest or scene and fine-tune on examples of it, would it then be possible to do precise prompt-to-prompt editing of real images close to that distribution?
That might actually work! A better reconstruction usually allows for better editing...
@hmartiro There's Google's Imagic paper that just got released: https://arxiv.org/abs/2210.09276 From the paper, it seems that inverting the prompt embeddings too (not just the latents) yields even better results. They also fine-tune the model on the inverted embeddings so that it reconstructs the input image better.
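For concreteness, a rough sketch of Imagic's first stage as described in the paper: freeze the diffusion model and optimize the text embedding itself with the ordinary denoising loss until it reconstructs the input image. The `unet`/`vae`/`scheduler`/`initial_prompt_embedding` names are diffusers-style assumptions, not this repo's code:

```python
# Rough sketch of Imagic stage 1: optimize the prompt embedding against a
# frozen model. All object names here are assumed diffusers-style components.
import torch
import torch.nn.functional as F

emb = initial_prompt_embedding.clone().requires_grad_(True)
optimizer = torch.optim.Adam([emb], lr=1e-3)
# Clean image latents from the (frozen) VAE, with SD's scaling factor.
latents0 = vae.encode(input_image).latent_dist.mean * 0.18215

for step in range(500):
    t = torch.randint(0, scheduler.config.num_train_timesteps, (1,),
                      device=latents0.device)
    noise = torch.randn_like(latents0)
    noisy = scheduler.add_noise(latents0, noise, t)
    # Same objective as diffusion training, but the trainable parameter
    # is the embedding, not the model weights.
    pred = unet(noisy, t, encoder_hidden_states=emb).sample
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# The paper's stage 2 then fine-tunes the model on this optimized embedding,
# and stage 3 interpolates between it and the target prompt's embedding.
```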
Very spicy @bloc97! I'll take a look at that paper. I did end up fine-tuning with the DreamBooth approach on a novel object type. I haven't tried this cross-attention method on that model yet, but the similar "img2img alternate" approach in the automatic1111 repo did greatly improve prompt editing once the model understood the class. It's definitely still immature though, and results are sporadic.
Also see: https://text2live.github.io/ So far it appears very slow to run, since it requires extensive training on a single image, but the results are very exciting. I really like the idea of the split edit layer that gets composited on top of the original. Do you think such an approach would have value with your method?
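For readers unfamiliar with Text2LIVE: its generator outputs an RGBA edit layer that is alpha-composited over the untouched input, so unedited pixels keep their exact original values. A minimal sketch of the composite, where `edit_rgb` and `edit_alpha` are hypothetical stand-ins for the generator's outputs:

```python
# Alpha-composite a predicted edit layer over the original image.
# edit_rgb (1, 3, H, W) and edit_alpha (1, 1, H, W) are hypothetical
# generator outputs; original_image is the untouched input, all in [0, 1].
composited = edit_alpha * edit_rgb + (1.0 - edit_alpha) * original_image
```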
Thanks for the amazing work! But I am a bit confused about this inversion. I wonder if there is a corresponding paper or article that elaborates on this process.
Fantastic work on this project @bloc97!
I'm able to get super impressive results with prompt editing. However, when doing img2img I find that the results degrade greatly. For example, here I'm editing the prompt to change the image into a charcoal drawing, which works well. However, if I pass in the initial image generated from the original prompt, there are no parameter values I can find that get anywhere close to the quality of the prompt edit without the initial image. I'm observing similar issues to stock SD, where either the macro structure of the initial image is lost or the prompt edit has little to no effect.
The reason I want this is to edit real images and to build edits on top of each other. I realize this may be unsolved, and may depend on how well the network understands the scene content, but I'm very interested in your thoughts and suggestions here, as I think it would be incredibly powerful.
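For context on why the img2img path degrades: stock img2img (the SDEdit recipe) seeds generation by partially noising the init image's latents and denoising from that intermediate step, so the strength parameter directly trades structure preservation against edit freedom. A minimal sketch of that seeding, assuming diffusers-style `vae`/`scheduler` objects rather than this repo's code:

```python
# Minimal sketch of stock img2img (SDEdit-style) seeding; diffusers-style
# vae/scheduler assumed. High strength erases macro structure, low strength
# leaves the edited prompt little room to act: the trade-off described above.
import torch

@torch.no_grad()
def img2img_start_latents(init_image, strength, vae, scheduler, num_steps=50):
    scheduler.set_timesteps(num_steps)
    latents = vae.encode(init_image).latent_dist.sample() * 0.18215
    # Jump in partway through the schedule: strength=1.0 starts from pure
    # noise, strength=0.0 starts from the (almost) clean latents.
    t_start = scheduler.timesteps[int((1.0 - strength) * (num_steps - 1))]
    noise = torch.randn_like(latents)
    return scheduler.add_noise(latents, noise, t_start)
```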