
Approx tests #130

Open · wants to merge 3 commits into main

Conversation

DanaCase
Contributor

Hi there,

I love your project! It has been a great tool for me to train LoRAs locally on my Mac. I was playing around and trying to run the tests locally.

I noticed that the image generation tests failed on a pixel-by-pixel comparison, even though the images looked perceptually identical.

For example:
[Images: dev-output with devdiff, lora_output with devloradiff, schnell-out with schnell-diff]

Is there a reason to expect outputs to be pixel-identical on different setups? It makes sense to me that there would be some small differences from floating-point arithmetic, etc.

For reference I'm using an M4 Max.

Anyway, I made an MSE comparison that is still quite strict but allows for some small differences in output.
What do you think?
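For reference, the proposed comparison could look something like the sketch below. The function names and the threshold value are illustrative assumptions, not the exact code in this PR:

```python
import numpy as np

def mse(a, b) -> float:
    """Mean squared error between two images given as array-likes."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    assert a.shape == b.shape, "images must have the same dimensions"
    return float(np.mean((a - b) ** 2))

def images_match(a, b, threshold: float = 1.0) -> bool:
    """Accept images whose MSE is below a small threshold.

    A threshold of 1.0 on 0-255 pixel values is strict: it only tolerates
    tiny per-pixel deviations like those from floating-point differences.
    """
    return mse(a, b) <= threshold
```

A strict threshold keeps the test meaningful while absorbing hardware-dependent rounding noise.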

@filipstrand
Owner

@DanaCase Hi, glad you enjoy the project!

You raise a valid concern. I do not remember anyone having reported different test results on different hardware, but if that is the case then we should fix it, perhaps with your suggested approach. Out of curiosity, which MLX version are you using (it should be mlx>=0.22.0,<0.24.0)?

For reference, I have tested the latest main on my M1 Pro and M2 Max running two versions of macOS (Sonoma and Sequoia) with identical test results.

One thing I have noted earlier, but not bothered to fix yet: since the image generation tests save a temporary file to disk, it can sometimes fail to be removed properly between tests, causing the test code to read the wrong image. (This is because the existing save_image functionality will not overwrite a previously saved image, so if one was there before and not removed, we get unexpected test failures.) I really should fix this soon because it has caused me some headaches before :) . So, when running the image-generation tests, make sure the resources folder in the test catalog does not contain any image file other than the specified reference images (more specifically, there should not be a leftover image named output.png from previous runs).
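One way to sidestep the leftover-file issue entirely is to write each generated image into a fresh temporary directory rather than a shared resources folder. This is only a sketch of the idea; the helper name and filename are assumptions, not the project's actual API:

```python
import os
import tempfile

def unique_output_path(filename: str = "output.png") -> str:
    """Return a path for the generated image inside a fresh temp directory.

    Because every call creates a new directory, a stale image from a
    previous run can never be picked up by a later test by accident.
    """
    tmp_dir = tempfile.mkdtemp(prefix="image_test_")
    return os.path.join(tmp_dir, filename)
```

The test would then save to this path, compare against the reference image, and let the OS (or a teardown step) clean up the temp directory.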

Regarding pixel-level specific tests, I can think of some pros/cons with this approach:

Pros:

  • Makes it easier to guarantee the exact same output when refactoring: it gives great confidence that nothing is broken or changed, even if only in a minor way.
  • For curiosity: it can sometimes be interesting/informative to know that something (code, a dependency, etc.) has been updated in a way that causes the output to differ.
  • Naive and easy to implement.

Cons:

  • Updates in MLX sometimes result in minor changes to the output image (but typically very small differences, similar to the ones you show here).
  • Every time there is a small change in the output, all the reference images must be updated (the git catalog increases in size with no real benefit).
  • Tests might then require a certain MLX version to work properly.
  • Requires defining a reasonable similarity criterion that only lets through similar-enough-looking images (MSE might be enough to do this successfully, given a good threshold).

The most important thing we want to capture with the tests is the perceptual similarity to the reference images, and if you get test failures on a clean install of the project (i.e. with mlx>=0.22.0,<0.24.0 and a resources folder without any leftover images) on your M4 Max, then I think we could go with this approach.

@filipstrand
Owner

> I really should fix this soon because it has caused me some headaches before :) . So, when running the image-generation tests, make sure the resources folder in the test catalog does not contain any image file other than the specified reference images (more specifically, there should not be a leftover image named output.png from previous runs).

I have fixed this unintended behaviour for the tests in the latest main. It might not resolve your problem, but it was something that was overdue anyway.

@DanaCase
Contributor Author

DanaCase commented Mar 4, 2025

Thanks for the response and the update to main.

I am running MLX 0.23.1, and I tried running the tests on a clean install (including clearing my Hugging Face cache), but I am still seeing slight differences in the output images.

It makes sense to aim for pixel-perfect comparisons when possible. A conditional test might be useful: require a perfect match when testing changes during local refactoring, or in other cases where pixel precision is expected, but relax the requirement for something like a pre-commit hook.
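The conditional mode could be toggled with an environment variable, so the same test runs strictly during refactoring and tolerantly elsewhere. This is a hypothetical sketch: the variable name, function name, and threshold are assumptions, not project conventions:

```python
import os

import numpy as np

def outputs_match(a, b, mse_threshold: float = 1.0) -> bool:
    """Compare a generated image against its reference.

    If STRICT_IMAGE_TESTS=1 is set, require exact pixel equality
    (useful when refactoring on the same machine). Otherwise, accept
    images whose MSE falls below a strict threshold.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.shape != b.shape:
        return False
    if os.environ.get("STRICT_IMAGE_TESTS") == "1":
        return bool(np.array_equal(a, b))  # exact match for refactoring work
    return float(np.mean((a - b) ** 2)) <= mse_threshold
```

Developers who want byte-exact guarantees would export STRICT_IMAGE_TESTS=1 locally, while CI and pre-commit hooks would use the tolerant default.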
