Add a family of AND_ALIGN_D_S keywords #63
base: main
Conversation
These methods compare average cosine similarity in 2x2 sub-windows of detail windows of size DxD and structure windows of size SxS. For each latent pixel, the difference of these values is computed, forming an alignment map which is positive when new details can be blended without disrupting the overall structure of the composition.
This looks really cool! I'll run some generations later today to try to get a feel for it. From a usability perspective, we might want to add 2 sliders in the prompt formatter (1 for D, 1 for S) that become visible when selecting "Alignment blend". Instead of generating all possible keywords, we could use a regex in the parser and make the conciliation strategy an actual object instead of an enum (so that it can contain extra data like D and S). In Rust, enums can have associated values, but in Python maybe it isn't super practical to keep it an enum after all. Can't wait to try this!
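A minimal sketch of that parser idea; the class and function names here are illustrative, not taken from the extension:

```python
import re

# Match "AND_ALIGN_<D>_<S>" with a regex instead of enumerating
# every (D, S) combination as a separate keyword.
ALIGN_KEYWORD = re.compile(r"AND_ALIGN_(\d+)_(\d+)")

class AlignBlend:
    """Conciliation strategy as an object carrying its own parameters
    (instead of a bare enum member)."""
    def __init__(self, detail_size: int, structure_size: int):
        self.detail_size = detail_size
        self.structure_size = structure_size

def parse_keyword(token: str):
    """Return an AlignBlend for an AND_ALIGN_D_S token, or None otherwise."""
    match = ALIGN_KEYWORD.fullmatch(token)
    if match is None:
        return None
    d, s = int(match.group(1)), int(match.group(2))
    return AlignBlend(detail_size=d, structure_size=s)

strategy = parse_keyword("AND_ALIGN_3_7")
```

Tokens that don't match (e.g. a plain `AND`) fall through to whatever conciliation strategy the parser already selects, so the regex approach stays backward compatible.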
Indeed. For me, a big unresolved question is how to make this a tool that people can use easily and intuitively without a deep understanding of diffusion models, latent space, linear algebra, or differential equations, while ideally still allowing people to tweak parameters if they want to, in order to dial in a specific vision of what they are really trying to express. And since it's your project, I thought it was a better approach to present the general method first and then start a conversation about what the best UX is. One idea that occurs to me is that we could ship a few presets like
Cool, I'm glad. :) Some of the more interesting tests that I did beyond just varying D and S, that helped develop some intuition for how it works:

I. When the two prompts are very similar and have only subtle differences

Some tests I'd like to explore further:

IV. Is there any use for D > S? Interchanging the two prompts is not exactly symmetric (due to the clamping of negative alignment weight, I think)
To facilitate comparing the techniques without repeatedly checking out a different commit and restarting, I pushed an additional commit that adds

This commit is purely for experimentation.
Apologies for the delay here. I generated 2 large grids locally to try to understand the method a bit better, but haven't taken more time yet.
Sure. In my opinion, we should look for the variant that gives the best results and only keep that one. Let me know if this is what you intended to do. Is it okay with you if I update the UX code and refactor the parser directly in your PR?
Yes, we're on the same page there. After some further testing today, I think I like
Yeah, it's totally fine with me. Reading the code in this extension is what inspired the idea in the first place, and I'm happy to work together on it.
Hey @ljleb, I'm looking for feedback on a novel method that I developed for blended guidance.
The idea is to compare guidance conds in two different-sized windows around each latent pixel: a detail window of size DxD and a structure window of size SxS. For each latent pixel, we compute a pair of alignment maps between the two tensors, called the detail_alignment and structure_alignment respectively.
The detail_alignment map considers all 2x2 sub-regions within the DxD window, computes the cosine similarity between the parent and child tensors in each 2x2 sub-region, then averages these values over all such sub-regions. The structure_alignment map is computed the same way, using windows of size SxS instead.
Because the cosine similarity of two random vectors in R^n tends to 0 as n grows (intuitively, random vectors become increasingly orthogonal as the dimension of the space increases), averaging cosine similarity over 2x2 sub-regions, instead of computing it directly over the full DxD and SxS regions, serves as a normalization: it ensures the values computed at the two resolutions remain comparable.
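A quick numerical check of that concentration effect (the channel count of 4 is just the usual SD latent shape, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(n, trials=2000):
    """Mean |cosine similarity| between pairs of random Gaussian vectors in R^n."""
    a = rng.standard_normal((trials, n))
    b = rng.standard_normal((trials, n))
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(np.abs(cos).mean())

# |cos| concentrates near 0 roughly like 1/sqrt(n), so a raw cosine similarity
# over a 33x33 window lives on a very different scale than one over a 2x2 patch.
small = mean_abs_cosine(2 * 2 * 4)    # 2x2 patch, 4 latent channels
large = mean_abs_cosine(33 * 33 * 4)  # 33x33 window, 4 latent channels
```

Averaging many 2x2-patch similarities keeps both maps on the 2x2 scale, which is what makes subtracting them meaningful.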
The detail alignment tends to be high where the details are similar between the two latents, and negative where the details in the child latent contrast with those in the parent latent, indicating that the second prompt contains novel details that can be blended in. The structure alignment is likewise positive where the structure is similar, and negative where the child latent would significantly diverge from the compositional structure of the parent at resolution SxS.
An alignment weight is computed by subtracting the detail alignment from the structure alignment, giving a single alignment map that is positive where the child latent guidance can enhance the details of the parent latent guidance without disrupting its structure. Negative values are clamped to zero, and each latent pixel is blended according to its resulting alignment weight.
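To make the description above concrete, here is a simplified NumPy sketch of the whole pipeline. This is not the PR's implementation, and the final blend formula (linear interpolation toward the child by the weight) is my assumption, since the text only says each pixel is "blended according to its alignment weight":

```python
import numpy as np

def patch_cosine_map(parent, child):
    """Cosine similarity of corresponding 2x2 patches of two (C, H, W) conds."""
    C, H, W = parent.shape
    sims = np.zeros((H - 1, W - 1))
    for y in range(H - 1):
        for x in range(W - 1):
            p = parent[:, y:y + 2, x:x + 2].ravel()
            c = child[:, y:y + 2, x:x + 2].ravel()
            denom = np.linalg.norm(p) * np.linalg.norm(c) + 1e-8
            sims[y, x] = float(p @ c) / denom
    return sims

def window_average(sims, size):
    """Average the 2x2-patch similarities over a size x size window per pixel."""
    H, W = sims.shape
    half = size // 2
    out = np.empty_like(sims)
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - half), min(H, y + half + 1)
            x0, x1 = max(0, x - half), min(W, x + half + 1)
            out[y, x] = sims[y0:y1, x0:x1].mean()
    return out

def align_blend(parent, child, d=3, s=7):
    """Blend child guidance into parent guidance where it adds detail
    (detail alignment low) without breaking structure (structure alignment high)."""
    sims = patch_cosine_map(parent, child)
    detail = window_average(sims, d)
    structure = window_average(sims, s)
    weight = np.clip(structure - detail, 0.0, None)  # clamp negative alignment
    pad = np.pad(weight, ((0, 1), (0, 1)), mode="edge")  # back to (H, W)
    # Assumed blend: per-pixel linear interpolation toward the child.
    return parent + pad[None] * (child - parent)
```

The loops are written for clarity; a real implementation would vectorize the patch similarities and window means (e.g. with unfold/pooling ops in torch).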
I currently have this implemented for a range of values of D and S, from 2 to 33 latent pixels each, for experimentation. Decreasing D typically makes it easier for the child prompt to influence the details of the resulting image, while increasing S relaxes the preservation of higher-level compositional structure. For most prompts,
structure prompt AND_ALIGN_3_7 detail prompt
feels like a good starting point, but I recommend trying different combinations with a range of different prompts to get a feel for how they behave.

I've tested this method rather extensively and found it to be very useful, and I'm excited to share it. Please let me know if you have any questions, comments, suggestions, or other feedback.