Add `FromLinear` and `IntoLinear` lookup table creation to build script #416
base: master
CodSpeed Performance Report: Merging #416 will degrade performances by 19.73%.
I'm not sure I understand what's causing the tests to fail at this point.
Interesting. The longer build times are worrying, but there may be optimization opportunities. I'm still having that cold I mentioned, so I'll have a proper look when I'm feeling better. A few quick things for now:
Alright, thanks for the direction. I'll fix the issue with |
I changed around the features as you mentioned and I think I was also able to fix the issue with the build script running more often than it needs to. It should now, hopefully, only rerun if any of the files in the build folder are changed or if any features are changed. I also made it so that the code for

Currently, the conditional compilation for the 16-bit lookup table methods/structs is controlled by repeated uses of

Additionally, even though the build script runs less often now, it is still quite slow and seems to drastically increase the time the PR tests take. As mentioned before, do you have a suggestion on how I can reconcile including the float to uint lookup table generation code somewhere in the codebase while not fully rerunning it each time?
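For reference, the "only rerun when the build folder changes" behavior can be sketched with Cargo's `rerun-if-changed` directive. This is an illustration rather than the code in this PR: the helper names and the `build` folder path are assumptions, and it relies on Cargo noticing feature changes on its own.

```rust
use std::path::Path;

/// Builds the directives a build script would print.
/// `build_dir` is a hypothetical path to the folder holding the build script's sources.
fn rerun_directives(build_dir: &Path) -> Vec<String> {
    vec![
        // Pointing at a directory makes Cargo scan everything inside it for changes.
        // Emitting any `rerun-if-changed` line also turns off the default behavior
        // of rerunning whenever any file in the package changes.
        format!("cargo:rerun-if-changed={}", build_dir.display()),
    ]
}

/// Call this from `main` in the build script, before generating any code.
fn emit_rerun_directives() {
    for directive in rerun_directives(Path::new("build")) {
        println!("{}", directive);
    }
}
```

Even with this in place, every fresh checkout (including CI) still pays for one full run of the script, which is the cost the question above is about.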
I may be starting to realize a way to optimize the build script, but it's going to take some further research. I'll hopefully be pushing a new commit within the week that will at least somewhat improve the build time.
I went ahead and optimized the table building to use integration instead of summation since these are all exponential or linear functions. I also removed the tests for errors and just put in the values for the number of mantissa bits used in the index since they do not seem to change (3 for |
Would it also be a good idea to generate the |
Alright, I'm feeling much better now and the workweek is over. Thanks for your patience in the meantime. I have also had some time to think about this, and the slower build times from the code generation make me think we should move away from `build.rs` for this. Considering none of this really depends on anything other than cargo features, I think it would be possible to generate the code once and make sure the larger tables are opt-in.
I'll try some things with a separate codegen crate...
It's also going to take me a moment to digest what everything does here. Some of the generated code becomes a bit hard to follow. Bear with me.
Would it also be a good idea to generate the u8 to f32 tables separately? I'm not sure how much of a performance cost it is to convert from f64 to f32
You can try and see if it affects the benchmarks. Something seems to have made it a bit slower.
palette/build/lut.rs
Outdated
\n\tfn from_linear(linear: f32) -> u8 {{\
\n\t\tconst MAX_FLOAT_BITS: u32 = 0x3f7fffff; // 1.0 - f32::EPSILON\
\n\t\tconst MIN_FLOAT_BITS: u32 = {min_float_string}; // 2^(-{exp_table_size})\
\n\t\tlet max_float = f32::from_bits(MAX_FLOAT_BITS);\
\n\t\tlet min_float = f32::from_bits(MIN_FLOAT_BITS);\
\n\n\t\tlet mut input = linear;\
\n\t\tif input.partial_cmp(&min_float) != Some(core::cmp::Ordering::Greater) {{\
\n\t\t\tinput = min_float;\
\n\t\t}} else if input > max_float {{\
\n\t\t\tinput = max_float;\
\n\t\t}}
\n\t\tlet input_bits = input.to_bits();\
\n\t\t#[cfg(test)]\
\n\t\t{{\
\n\t\t\tdebug_assert!((MIN_FLOAT_BITS..=MAX_FLOAT_BITS).contains(&input_bits));\
\n\t\t}}\
\n\n\t\tlet entry = {{\
\n\t\t\tlet i = ((input_bits - MIN_FLOAT_BITS) >> {entry_shift}) as usize;\
\n\t\t\t#[cfg(test)]\
\n\t\t\t{{\
\n\t\t\t\tdebug_assert!({table_name}.get(i).is_some());\
\n\t\t\t}}\
\n\t\t\tunsafe {{ *{table_name}.get_unchecked(i) }}\
\n\t\t}};\
\n\t\tlet bias = (entry >> 16) << 9;\
\n\t\tlet scale = entry & 0xffff;\
\n\n\t\tlet t = (input_bits >> {man_shift}) & 0xff;\
\n\t\tlet res = (bias + scale * t) >> 16;\
\n\t\t#[cfg(test)]\
\n\t\t{{\
\n\t\t\tdebug_assert!(res < 256, \"{{}}\", res);\
\n\t\t}}\
\n\t\tres as u8\
\n\t}}\
\n}}"
This is the part I think we don't need to generate every time. It looks to me like it could be a library function that takes a table reference and input value.
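As a rough illustration of that idea, here is a safe sketch of a single function that takes the table by reference. It mirrors the generated code above, with two assumptions: the shift amounts are fixed to the values used in the fast-srgb8 layout (20 and 12) instead of being substituted by the generator, and the table access is bounds-checked rather than using `get_unchecked`.

```rust
/// Bit pattern of the largest `f32` below 1.0 (same constant as in the generated code).
const MAX_FLOAT_BITS: u32 = 0x3f7f_ffff;
/// Assumed shift that maps the clamped bit pattern to a table index.
const ENTRY_SHIFT: u32 = 20;
/// Assumed shift that extracts the 8 interpolation bits from the mantissa.
const MAN_SHIFT: u32 = 12;

pub fn linear_f32_to_encoded_u8(linear: f32, min_float_bits: u32, table: &[u32]) -> u8 {
    let min_float = f32::from_bits(min_float_bits);
    let max_float = f32::from_bits(MAX_FLOAT_BITS);

    // Written so that NaN fails the `>` test and is clamped to `min_float`.
    let input = if !(linear > min_float) {
        min_float
    } else if linear > max_float {
        max_float
    } else {
        linear
    };
    let input_bits = input.to_bits();

    // The bits above `min_float_bits` select the table entry.
    let i = ((input_bits - min_float_bits) >> ENTRY_SHIFT) as usize;
    let entry = table[i]; // bounds-checked in this sketch

    // Each entry packs a 16-bit bias (high half) and a 16-bit scale (low half).
    let bias = (entry >> 16) << 9;
    let scale = entry & 0xffff;

    // The next 8 mantissa bits interpolate linearly within the entry.
    let t = (input_bits >> MAN_SHIFT) & 0xff;
    ((bias + scale * t) >> 16) as u8
}
```

With `min_float_bits = 0x3900_0000` (2^-13) the index ranges over 0..=103, so a 104-entry table covers every clamped input. The generated code would then only need to provide the table and the `min_float_bits` constant.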
palette/build/lut.rs
Outdated
\n\t\tconst MAX_FLOAT_BITS: u32 = 0x3f7fffff; // 1.0 - f32::EPSILON\
\n\t\tconst MIN_FLOAT_BITS: u32 = {min_float_string}; // 2^(-{exp_table_size})\
\n\t\tlet max_float = f32::from_bits(MAX_FLOAT_BITS);\
\n\t\tlet min_float = f32::from_bits(MIN_FLOAT_BITS);\
\n\n\t\tlet mut input = linear;"
)
.unwrap();
writeln!(
writer,
"\
\t\tif input.partial_cmp(&{0}) != Some(core::cmp::Ordering::Greater) {{\
\n\t\t\tinput = {0};",
if linear_scale.is_some() {
"0.0"
} else {
"min_float"
}
)
.unwrap();
writeln!(
writer,
"\
\t\t}} else if input > max_float {{\
\n\t\t\tinput = max_float;\
\n\t\t}}"
)
.unwrap();
if let Some(scale) = linear_scale {
let adj_scale = scale * 65535.0;
let magic_value = f32::from_bits((127 + 23) << 23);
writeln!(
writer,
"\
\t\tif input < min_float {{\
\n\t\t\treturn (({adj_scale}f32 * input + {magic_value}f32).to_bits() & 65535) as u16;\
\n\t\t}}"
).unwrap();
}
writeln!(
writer,
"\
\n\t\tlet input_bits = input.to_bits();\
\n\t\t#[cfg(test)]\
\n\t\t{{\
\n\t\t\tdebug_assert!((MIN_FLOAT_BITS..=MAX_FLOAT_BITS).contains(&input_bits));\
\n\t\t}}\
\n\n\t\tlet entry = {{\
\n\t\t\tlet i = ((input_bits - MIN_FLOAT_BITS) >> {entry_shift}) as usize;\
\n\t\t\t#[cfg(test)]\
\n\t\t\t{{\
\n\t\t\t\tdebug_assert!({table_name}.get(i).is_some());\
\n\t\t\t}}\
\n\t\t\tunsafe {{ *{table_name}.get_unchecked(i) }}\
\n\t\t}};\
\n\t\tlet bias = (entry >> 32) << 17;\
\n\t\tlet scale = entry & 0xffff_ffff;"
).unwrap();
if man_shift == 0 {
writeln!(writer, "\n\t\tlet t = input_bits as u64 & 0xffff;").unwrap();
} else {
writeln!(
writer,
"\n\t\tlet t = (input_bits as u64 >> {man_shift}) & 0xffff;"
)
.unwrap();
}
writeln!(
writer,
"\
\t\tlet res = (bias + scale * t) >> 32;\
\n\t\t#[cfg(test)]\
\n\t\t{{\
\n\t\t\tdebug_assert!(res < 65536, \"{{}}\", res);\
\n\t\t}}\
\n\t\tres as u16\
\n\t}}\
\n}}"
)
Could this also be unified as one function that we don't generate each time?
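For readers following along, the `magic_value` branch in the hunk above is the usual "add 2^23" rounding trick. A small standalone sketch of it is below; the function name is hypothetical, and it assumes the scaled value lies in `[0, 65536)`.

```rust
/// Rounds `scale * x` to the nearest integer and returns it as a `u16`.
/// Hypothetical helper; assumes `0.0 <= scale * x < 65536.0`.
fn scaled_to_u16(x: f32, scale: f32) -> u16 {
    // 2^23 as an `f32`. At this magnitude adjacent floats are exactly 1.0 apart,
    // so the addition rounds the sum to an integer (ties go to the even value)
    // and leaves that integer in the low mantissa bits.
    let magic = f32::from_bits((127 + 23) << 23);
    ((scale * x + magic).to_bits() & 0xffff) as u16
}
```

For example, `scaled_to_u16(1.0, 65535.0)` gives `65535` and `scaled_to_u16(100.4, 1.0)` gives `100`. Because this part does not depend on the table at all, it looks like a natural candidate for living in a shared, non-generated function.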
I'm porting over the named colors in #417. I have also modernized it and made use of |
I'll start working on consolidating those into singular library functions and moving things to the codegen crate. I'll also try to document the code a bit more so it's clearer what's going on.
Thank you and sorry for the extra work. I do think this direction will be better. Now that the cost of generating the code is paid for in advance, we could change the cargo feature setup to something simpler (sorry again). I would suggest something like this:

Simple as that. This avoids inflating the test set too much and relies on the compiler's ability to disregard constants it doesn't use. The feature is more there so it's possible to keep the binary size slim. A

As for documentation, anything that explains the reasoning (especially where it diverges from the original) will help future us or other people who haven't been part of the process. Like with other things, I don't want this to turn into a black box later. I would also like to have references to the original implementation(s). That's key for tracing them back to the original reasoning and having something to compare to if refactoring is required or if there are any issues. I know it's perhaps not the most fun thing to do, but it's even less fun to not have them later. So thanks in advance for taking some time to explain it. 🙏

A side note with the new codegen crate; let me know if you run into any rough edges. One thing I have noticed with |
Nice, thank you! This is already easier to follow. The split with generated tables and handwritten functions seems to work well too. I will go into more details later.
const MAX_FLOAT_BITS: u32 = 0x3f7fffff; // 1.0 - f32::EPSILON

// SAFETY: Only use this macro if `input` is clamped between `min_float` and `max_float`.
macro_rules! linear_float_to_encoded_uint {
Let's prefix this with `unsafe` to make it more obvious at the call site.
I'm not sure how to define a macro as unsafe. Just adding `unsafe` before `macro_rules!` gets removed when I save the file (likely due to rustfmt).
Ah, no, I mean make the name say `unsafe_...`. It makes it look more dangerous.
Alternatively, we remove the unsafe blocks from inside it. That may be better, so we are forced to mark the call site. 🤔
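A small sketch of that second option, using a hypothetical `lookup_unchecked!` macro rather than the real `linear_float_to_encoded_uint!`: since the expansion contains no `unsafe` block of its own, the code only compiles when the caller wraps the invocation in `unsafe { .. }`.

```rust
/// Expands to an unchecked slice read with no `unsafe` block inside,
/// so the call site has to supply one. Hypothetical name.
macro_rules! lookup_unchecked {
    ($table:expr, $index:expr) => {
        *$table.get_unchecked($index)
    };
}

/// Example caller: returns the first or the last element of a non-empty slice.
fn first_or_last(table: &[u32], want_last: bool) -> u32 {
    assert!(!table.is_empty());
    let i = if want_last { table.len() - 1 } else { 0 };
    // SAFETY: `table` is non-empty, so both `0` and `len - 1` are in bounds.
    unsafe { lookup_unchecked!(table, i) }
}
```

This keeps the `SAFETY` comment next to the code that actually upholds the invariant, and the `unsafe_` name prefix can still be added on top if the extra warning is wanted.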
/// `float_lut` feature (enabled by default) is being used.
/// * When converting from `f32` or `f64` to `u8`, while converting from linear
/// space. This uses [fast_srgb8::f32_to_srgb8].
/// space if the `fast_uint_lut` feature (enabled by default) is being used.
This mentions earlier feature names.
I hope all is well and sorry for the absence. I think I lost more momentum than expected when I got sick, but it's better now after some time away from programming related things during my free time.
I have added a few comments where I would like to have some documentation. I think this will be good to go once these and the other comments have been resolved. How does that sound?
let (linear_scale, alpha, beta) =
    if let Some((linear_scale, linear_end)) = is_linear_as_until {
        (
            Some(linear_scale),
            (linear_scale * linear_end - 1.0) / (linear_end.powf(gamma.recip()) - 1.0),
            linear_end,
        )
    } else {
        (None, 1.0, 0.0)
    };
A comment that describes what this calculates and how it relates to the transfer function would be nice. The `u16` version can refer to this.
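One possible reading of that expression, offered as a sketch and not as the PR author's explanation: if the transfer function is assumed to have the sRGB-style piecewise form `linear_scale * x` below `linear_end` and `alpha * x^(1/gamma) - (alpha - 1)` above it, then `alpha` is exactly the value that makes the two pieces meet at `linear_end`. The function names below are hypothetical.

```rust
/// Scale of the power segment, chosen so both segments agree at `linear_end`.
/// Same expression as in the hunk above.
fn power_segment_scale(linear_scale: f64, linear_end: f64, gamma: f64) -> f64 {
    (linear_scale * linear_end - 1.0) / (linear_end.powf(gamma.recip()) - 1.0)
}

/// Assumed piecewise transfer function (sRGB-style form).
fn encode(x: f64, linear_scale: f64, linear_end: f64, gamma: f64) -> f64 {
    let alpha = power_segment_scale(linear_scale, linear_end, gamma);
    if x < linear_end {
        // Linear segment near zero.
        linear_scale * x
    } else {
        // Power segment; the offset `alpha - 1` makes `encode(1.0)` equal 1.0.
        alpha * x.powf(gamma.recip()) - (alpha - 1.0)
    }
}
```

With the sRGB parameters (`linear_scale = 12.92`, `linear_end = 0.0031308`, `gamma = 2.4`) this gives an `alpha` of about 1.055, which matches the familiar sRGB constant. In the `else` branch of the hunk (`alpha = 1.0`, `beta = 0.0`) the function reduces to a pure power curve.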
impl IntoLinear<f64, u8> for #fn_type {
    #[inline]
    fn into_linear(encoded: u8) -> f64 {
        #table_ident[encoded as usize]
    }
}

impl IntoLinear<f32, u8> for #fn_type {
    #[inline]
    fn into_linear(encoded: u8) -> f32 {
        #table_ident[encoded as usize] as f32
    }
}
I think these can be "hand written", to keep them closer to the type's definition. They are trivial enough.
impl IntoLinear<f32, u8> for #fn_type {
    #[inline]
    fn into_linear(encoded: u8) -> f32 {
        #table_ident[encoded as usize] as f32
Let's try adding an `f32` table to be used here, and see if that has any effect on the benchmark that got a slow-down.
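To make the suggestion concrete, here is a self-contained sketch of hand-written impls backed by one table per float type. Everything in it is a stand-in: the trait is a reduced copy of the shape used in the hunk, `Identity` is a made-up transfer function with a gamma of 1.0, and the tables are filled in a `static` initializer instead of being generated.

```rust
/// Reduced stand-in for the `IntoLinear` trait used in the hunk above.
trait IntoLinear<F, E> {
    fn into_linear(encoded: E) -> F;
}

/// Hypothetical transfer function with gamma 1.0, so no `powf` is needed.
struct Identity;

/// `u8` to `f64` table.
static TABLE_F64: [f64; 256] = {
    let mut table = [0.0; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = i as f64 / 255.0;
        i += 1;
    }
    table
};

/// Dedicated `u8` to `f32` table: the narrowing happens once, when the table
/// is built, instead of on every lookup.
static TABLE_F32: [f32; 256] = {
    let mut table = [0.0; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = (i as f64 / 255.0) as f32;
        i += 1;
    }
    table
};

impl IntoLinear<f64, u8> for Identity {
    #[inline]
    fn into_linear(encoded: u8) -> f64 {
        TABLE_F64[encoded as usize]
    }
}

impl IntoLinear<f32, u8> for Identity {
    #[inline]
    fn into_linear(encoded: u8) -> f32 {
        TABLE_F32[encoded as usize]
    }
}
```

Both impls return the same values as the cast-based version; the `f32` table costs an extra 1 KiB per transfer function. Whether it actually recovers the benchmark regression would need to be measured.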
#[cfg(feature = "gamma_lut_u16")]
impl IntoLinear<f64, u16> for #fn_type {
    #[inline]
    fn into_linear(encoded: u16) -> f64 {
        #table_ident[encoded as usize]
    }
}

#[cfg(feature = "gamma_lut_u16")]
impl IntoLinear<f32, u16> for #fn_type {
    #[inline]
    fn into_linear(encoded: u16) -> f32 {
        #table_ident[encoded as usize] as f32
    }
The same thing here as for `u8`. I wonder if it's also worth having a `u16` to `f32` table here. Let's see if it helps the `u8` case.
    )
}

fn build_f32_to_u8_lut(entries: &[LutEntryU8]) -> TokenStream {
I would appreciate a comment that links to https://gist.github.com/2203834 for credit and explains any differences from it. Particularly the integration. It should be possible for someone who hasn't been part of the discussion to understand and bug fix it, with aid from the linked reference.
impl FromLinear<f64, u8> for #fn_type {
    #[inline]
    fn from_linear(linear: f64) -> u8 {
        <#fn_type>::from_linear(linear as f32)
    }
}

impl FromLinear<f32, u8> for #fn_type {
    #[inline]
    fn from_linear(linear: f32) -> u8 {
        lut::linear_f32_to_encoded_u8(linear, #min_float_bits, &#table_ident)
    }
}
Can we hand write these as well? The same for `u16` below.
Sorry that I've been less active. The school semester started and is kicking my butt. Thank you for the extra comments and I'll try to address them soon.
No worries, I know how it can be. 😬 Thanks for the update, and let me know if it turns out to be too stressful to find time. Good luck with the studies and don't forget to sleep!
Hi, how's it going? I just wanted to check in and also let you know that there are some fixes in

How does it look regarding getting a moment for this PR? Would it help if we scope it down a bit? The most important change from my perspective is to have the rationale for the algorithm changes in your words. The rest are things I can take care of.
Yeah, scoping it down would help. I also am trying to type up a short document describing how I arrived at the formulae that I did, since it's a bit too in-depth to be written in comments in the code. I'll still include a basic explanation of what's going on in the documentation, though. Does that sound alright?
That sounds great, thank you! I don't want you to feel like you are stuck with this, so the basic explanation could be good enough on its own if you want to keep it short. The point is to answer why this change was made (performance?) and what makes it equivalent to the original (referring to or showing an equation, algorithm, etc.). That usually works as reference for refactoring and bug fixing. Details for extra clarity are still appreciated, but don't sweat it.

When you feel like this is done, I would be grateful if you could also squash the commits into a single commit, like before. That should be it unless you want to address any of the other comments.
This PR adds the creation of float ↔ uint conversion lookup tables for `FromLinear` and `IntoLinear` to the crate's build script. As a result, this crate no longer has a dependency on `fast-srgb8` (I have confirmed that the lookup table used by `fast-srgb8` is identical to the one generated by the build script).

Along with this, I have added the features: `float_lut` (for `u8` to float conversions), `float_lut16` (for `u8` to float and `u16` to float conversions), `fast_uint_lut` (for fast `f32` to `u8` conversions), and `fast_uint_lut16` (for fast `f32` to `u8` and `f32` to `u16` conversions). Of these, I have added `float_lut` and `fast_uint_lut16` as the default features. `float_lut` must be default to not cause a breaking change for `Srgb` and `RecOetf`, whose lookup tables were replaced by the build script. I included `fast_uint_lut16` as a default feature since its largest generated table contains only 1152 `u64`s, although I would understand wanting to replace it with `fast_uint_lut` as the default since that only generates tables of less than 200 `u32`s.

Building the crate seems to take considerably longer now due to the new code. I added some statements to the main.rs file in the build script that might prevent the crate being built unnecessarily often (although I doubt it since every time I run a test it seems to rerun the build script even though I haven't changed anything). If there are improvements to be made, please let me know and I'll implement them.

Also, does removing a dependency qualify as a breaking change? If so, I can add back `fast-srgb8` so that it can be removed at the next major update.