
Approx tests #130

Open · wants to merge 3 commits into main

Conversation

DanaCase
Contributor

Hi there,

I love your project! It has been a great tool for me to train LoRAs locally on my Mac. I was playing around and trying to run the tests locally.

I noticed that the image generation tests failed on a pixel-by-pixel comparison, even though the images looked perceptually identical.

For example:
[Images: dev-output with devdiff, lora_output with devloradiff, schnell-out with schnell-diff]

Is there a reason to expect outputs to be pixel-identical on different setups? It makes sense to me that there would be some small differences from floating-point arithmetic, etc.

For reference I'm using an M4 Max.

Anyway, I made an MSE comparison that is still quite strict but allows for some small differences in output.
What do you think?
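For reference, the proposed comparison could look something like the sketch below. The function names and the threshold value are illustrative assumptions, not the exact code in this PR:

```python
import numpy as np

def mse(a, b) -> float:
    """Mean squared error between two images given as array-likes."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    assert a.shape == b.shape, "images must have the same dimensions"
    return float(np.mean((a - b) ** 2))

def images_match(a, b, threshold: float = 1.0) -> bool:
    """Accept images whose MSE is below a small threshold.

    A threshold of 1.0 on 0-255 pixel values is strict: it only tolerates
    tiny per-pixel deviations like those from floating-point differences.
    """
    return mse(a, b) <= threshold
```

A strict threshold keeps the test meaningful while absorbing hardware-dependent rounding noise.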

@filipstrand
Owner

@DanaCase Hi, glad you enjoy the project!

You raise a valid concern. I do not remember anyone having reported different test results on different hardware, but if that is the case then we should fix it, perhaps with your suggested approach. Out of curiosity, which MLX version are you using (it should be mlx>=0.22.0,<0.24.0)?

For reference, I have tested the latest main on my M1 Pro and M2 Max running two versions of macOS (Sonoma and Sequoia) with identical test results.

One thing I have noted earlier, but not bothered to fix yet: since the image generation tests save a temporary file to disk, it can sometimes fail to be removed properly between tests, causing the test code to read the wrong image. (This is because the existing save_image functionality will not overwrite a previously saved image, so if one was there before and not removed, we get unexpected test failures.) I really should fix this soon because it has caused me some headaches before :) . So, when running the image-generation tests, make sure the resources folder in the test catalog does not contain any image file other than the specified reference images (more specifically, there should not be a leftover image named output.png from previous runs).
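One way to sidestep the leftover-file issue entirely is to write each generated image into a fresh temporary directory rather than a shared resources folder. This is only a sketch of the idea; the helper name and filename are assumptions, not the project's actual API:

```python
import os
import tempfile

def unique_output_path(filename: str = "output.png") -> str:
    """Return a path for the generated image inside a fresh temp directory.

    Because every call creates a new directory, a stale image from a
    previous run can never be picked up by a later test by accident.
    """
    tmp_dir = tempfile.mkdtemp(prefix="image_test_")
    return os.path.join(tmp_dir, filename)
```

The test would then save to this path, compare against the reference image, and let the OS (or a teardown step) clean up the temp directory.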

Regarding pixel-level specific tests, I can think of some pros/cons with this approach:

Pros:

  • Makes it easier to guarantee the exact same output when refactoring: it gives great confidence that nothing is broken or changed, even if only in a minor way.
  • For curiosity: it can sometimes be interesting/informative to know that something (code, a dependency, etc.) has been updated in a way that causes the output to differ.
  • Naive and easy to implement.

Cons:

  • Updates in MLX sometimes result in minor changes to the output image (but typically very small differences, similar to the ones you show here).
  • Every time there is a small change in the output, all the reference images must be updated (the git catalog increases in size with no real benefit).
  • Tests might then require a certain MLX version to work properly.
  • Requires defining a reasonable similarity criterion that only lets through similar-enough-looking images (MSE might be enough to do this successfully, given a good threshold).

The most important thing we want to capture with the tests is the perceptual similarity to the reference images, and if you get test failures on a clean install of the project (i.e. with mlx>=0.22.0,<0.24.0 and a resources folder without any leftover images) on your M4 Max, then I think we could go with this approach.

@filipstrand
Owner

> I really should fix this soon because it has caused me some headaches before :) . So, when running the image-generation tests, make sure the resources folder in the test catalog does not contain any image file other than the specified reference images (more specifically, there should not be a leftover image named output.png from previous runs).

I have fixed this unintended behaviour for the tests in the latest main. It might not resolve your problem, but it was something that was overdue anyway.

@DanaCase
Contributor Author

DanaCase commented Mar 4, 2025

Thanks for the response and the update to main.

I am running MLX 0.23.1, and I tried running the tests on a clean install (including clearing my Hugging Face cache), but I am still seeing slight differences in the output images.

It makes sense to aim for pixel-perfect comparisons when possible. A conditional test might be useful: require a perfect match when testing changes during local refactoring, or in other cases where pixel precision is expected, but relax the requirement for something like a pre-commit hook.
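The conditional mode could be toggled with an environment variable, so the same test runs strictly during refactoring and tolerantly elsewhere. This is a hypothetical sketch: the variable name, function name, and threshold are assumptions, not project conventions:

```python
import os

import numpy as np

def outputs_match(a, b, mse_threshold: float = 1.0) -> bool:
    """Compare a generated image against its reference.

    If STRICT_IMAGE_TESTS=1 is set, require exact pixel equality
    (useful when refactoring on the same machine). Otherwise, accept
    images whose MSE falls below a strict threshold.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.shape != b.shape:
        return False
    if os.environ.get("STRICT_IMAGE_TESTS") == "1":
        return bool(np.array_equal(a, b))  # exact match for refactoring work
    return float(np.mean((a - b) ** 2)) <= mse_threshold
```

Developers who want byte-exact guarantees would export STRICT_IMAGE_TESTS=1 locally, while CI and pre-commit hooks would use the tolerant default.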
