Fix conversion of unnormalized BF16->BF16 weights #7843
Conversation
Doing this would make the result differ from `$ ./quantize ggml-model-f32.gguf ggml-model-bf16.gguf bf16` for these models which have subnormal bf16 weights, no? The goal of #7158 was to make the two paths result in exactly the same model files. I'd be curious to find a model affected by this. I think I'll try and compare. (EDIT: oh, right, in #7825 you mention …)

@jart, you might want to chime in, since you likely have more background on why subnormals are flushed to zero in lines 98 to 101 (at 172c825).
Ah, didn't realize that was an option. Yeah, if the F32 originates from BF16 they would be different.
You would have to do this if the lower 16 bits are non-zero at least, but in the case of BF16->F32->BF16 they never are.
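To illustrate why truncation is safe in that case (a minimal numpy sketch of my own, not code from the PR): upcasting BF16 bits to F32 just places them in the upper 16 bits, so the lower 16 bits of such an F32 are always zero, and truncating or rounding back to BF16 gives the same bits.

```python
import numpy as np

# Arbitrary bf16 bit patterns stored as uint16: 1.0, the smallest positive subnormal, -123.5.
bf16_bits = np.array([0x3F80, 0x0001, 0xC2F7], dtype=np.uint16)

# BF16 -> F32: the bf16 bits become the upper 16 bits of the f32.
f32 = (bf16_bits.astype(np.uint32) << 16).view(np.float32)

# The lower 16 bits of the resulting f32 are all zero...
assert np.all((f32.view(np.uint32) & 0xFFFF) == 0)

# ...so truncating and rounding to nearest even agree on the way back to BF16.
u = f32.view(np.uint32)
truncated = (u >> 16).astype(np.uint16)
rounded = ((u + (0x7FFF + ((u >> 16) & 1))) >> 16).astype(np.uint16)
assert np.array_equal(truncated, rounded) and np.array_equal(truncated, bf16_bits)
```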
I found this online tool very handy while diagnosing F32 bits, btw.
The lower 16 bits don't matter for this, only for rounding (which indeed changes nothing in the BF16->F32->BF16 case).
Nice! There's also https://float.exposed/, which supports bfloat16.
Oh, errr, I think I see the problem now when looking more closely at lines 87 to 104 in 172c825.
Yep, lots of returns. And the Numpy implementation still behaves in the same way despite it not early-returning, because NaNs are not subnormals and their lower bits are cleared to avoid rounding (gguf-py/gguf/quants.py, lines 27 to 35 in 172c825).
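As an aside, a small sketch (my own, not from the repo) of why NaNs need to be kept out of the round-to-nearest-even step: adding the rounding increment to a NaN's bits can carry into the exponent field and turn it into an infinity, which is why the C code truncates NaNs and sets the quiet bit instead.

```python
import numpy as np

snan_bits = np.uint32(0x7F800001)  # a signalling NaN in float32

# Round-to-nearest-even applied blindly to the NaN bits:
naive = np.uint16((snan_bits + (0x7FFF + ((snan_bits >> 16) & 1))) >> 16)
print(hex(naive))  # 0x7f80 -> +inf in bf16, the NaN-ness was rounded away

# Truncate and force the quiet bit instead (what the C implementation does for NaN):
quiet = np.uint16((snan_bits >> 16) | 64)
print(hex(quiet))  # 0x7fc0 -> a quiet NaN in bf16
```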
If it was not working exactly the same way, then the checksums in #7158 and #7234 would not match (but they all do). At least going the other way (from …; lines 71 to 78 in 172c825).
Only problem with that is that you are right-shifting a possibly negative signed integer here: gguf-py/gguf/quants.py, line 34 in 172c825.
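For context, a tiny numpy sketch (mine, not from quants.py) of the difference: right shifts on signed integers in numpy are arithmetic, so the sign bit is smeared into the upper bits, while the unsigned view shifts in zeros; the low 16 bits end up the same either way.

```python
import numpy as np

bits = np.array([-0.0, 1.0], dtype=np.float32).view(np.int32)
print(bits >> 16)                  # [-32768, 16256]: arithmetic (sign-extending) shift
print(bits.view(np.uint32) >> 16)  # [32768, 16256]: logical shift on the unsigned view
# Both carry the same low 16 bits (0x8000 and 0x3f80), which is why the signed
# version still happened to produce the right bf16 bit patterns.
```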
Actually, just tested, and it seems … Sigh, it does the same if … The same with …
@compilade So, an alternate fix would be to use … Any opinion on whether I should do that and remove the simpler truncation, or if it's worth having both?
@CISC Huh, you're right about the …

```python
import numpy as np

rand = np.random.Generator(np.random.PCG64(seed=42))
n = rand.standard_normal((3, 4), dtype=np.float32)
print("n:")
print(n, n.dtype)
n = n.view(np.int32)
print("n:")
print(n, n.dtype)
sign = n & 0x80000000
print("sign:")
print(sign, sign.dtype)
# round to nearest even
rounded = ((n + (0x7fff + ((n >> 16) & 1))) >> 16).astype(np.int16)
rounded_sign = ((sign + (0x7fff + ((sign >> 16) & 1))) >> 16).astype(np.int16)
print("rounded:")
print(rounded, rounded.dtype)
print("rounded_sign:")
print(rounded_sign, rounded_sign.dtype)
```

Output: …
I agree that the C version is using … Your suggestion does make the implementation more sane, but it won't change anything about the resulting bits in the model files (if truncation isn't used), so it's not really "an alternate fix", it's more like a refactor. If the previous implementation didn't result in the exact same bits as the C version, I would definitely have noticed in #7234.
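For reference, here is roughly what the unsigned variant could look like (my own sketch, not the code that was merged), working on a np.uint32 view so no signed shifts or implicit upcasts are involved:

```python
import numpy as np

def fp32_to_bf16_bits(f32: np.ndarray) -> np.ndarray:
    """Round float32 values to bf16 bit patterns (round to nearest even, quiet NaNs)."""
    u = np.asarray(f32, dtype=np.float32).view(np.uint32)
    nan_mask = (u & np.uint32(0x7FFFFFFF)) > np.uint32(0x7F800000)
    # NaN: truncate and force the quiet bit, like the C implementation.
    nan_bits = (u >> 16) | np.uint32(64)
    # Everything else: round to nearest even by adding 0x7fff, plus 1 if the low bf16 bit is set.
    rounded = (u + (np.uint32(0x7FFF) + ((u >> 16) & np.uint32(1)))) >> 16
    return np.where(nan_mask, nan_bits, rounded).astype(np.uint16)

print(hex(fp32_to_bf16_bits(np.array([-0.0], dtype=np.float32))[0]))  # 0x8000
```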
Yeah, it seems to work out correctly in the end except when doing the flushing, so the question then is just whether we should flush to zero. If it's from an actual F32 it might make sense to do so, but it seems wrong to do it for BF16->BF16...
(emphasis mine) What is it doing incorrectly? From my testing, flushing to zero keeps the sign in the Numpy implementation (even with wrong implicit conversions), as with the original C code it was based on. Or is it that you consider flushing to be incorrect in itself? (to be clear I agree with fixing the types in the Numpy implementation of bfloat16 conversion)
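For what it's worth, a quick check (my own snippet) that the flush-to-zero branch, mirroring the C code's `(u.i & 0x80000000) >> 16`, does preserve the sign:

```python
import numpy as np

sub = np.float32(-1e-40)                       # subnormal in float32
bits = np.array([sub]).view(np.uint32)[0]
flushed = np.uint16((bits & np.uint32(0x80000000)) >> 16)
print(hex(flushed))                            # 0x8000, i.e. -0.0 in bf16
```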
@CISC I've searched deeper, since you made me curious. From https://en.wikipedia.org/wiki/Subnormal_number#Performance_issues, calculations with subnormals can be slower, and from https://cloud.google.com/tpu/docs/bfloat16#details_about_format_conversion, Google Cloud TPUs don't support subnormals in …

From https://developer.arm.com/documentation/ddi0602/2023-06/Shared-Pseudocode/shared-functions-float?lang=en#impl-shared.FPConvertBF.2, it seems AArch64 processors do flush subnormals to zero when in …:

```
boolean altfp = HaveAltFP() && !UsingAArch32() && fpcr.AH == '1';
boolean fpexc = !altfp;                       // Generate no floating-point exceptions
if altfp then fpcr.<FIZ,FZ> = '11';           // Flush denormal input and output to zero
if altfp then rounding = FPRounding_TIEEVEN;  // Use RNE rounding mode

// Unpack floating-point operand, with always flush-to-zero if fpcr.AH == '1'.
(fptype,sign,value) = FPUnpack(op, fpcr, fpexc);
```

And regarding AMD and Intel, https://www.felixcloutier.com/x86/vcvtne2ps2bf16 is the instruction referred to by the comment above (lines 83 to 85 in 172c825).
Yeah, I don't really know either. On the one hand, it makes all … From #7825 (comment): …

Maybe subnormal values are problematic anyway, so it's fine to flush them to zero?
Yeah, as it changes the original values.
Ok, that's interesting.
They are probably insignificant, however see below...
They are indeed problematic, however it's not just the actual subnormals but the ones that turn subnormal as F16; these are the ones that mess up certain quants and inference when doing KQ multiplication in F16.
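To make that concrete (an illustrative snippet of my own, not from the PR): a weight can be perfectly normal in BF16/F32 and still land in the subnormal range once cast to F16, because F16's smallest normal value is about 6.1e-5 while BF16 covers roughly the same exponent range as F32.

```python
import numpy as np

w = np.float32(1e-7)                 # normal as float32 (and as bf16, same exponent range)
h = np.float16(w)                    # cast to float16, as on the F16 KQ multiplication path
print(h, np.finfo(np.float16).tiny)  # ~1.192e-07 vs 6.104e-05
print(0 < abs(h) < np.finfo(np.float16).tiny)  # True -> h is a subnormal float16
```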
@compilade So, looks like the implicit upcasting and signed shifting fails with latest numpy. BTW, I'm leaving for my annual European tour tomorrow, so I will likely not be available to do any changes for a while, but I'll grant you access to the branch just in case.
@compilade (1) it makes things faster, and (2) I wasn't able to determine a flawless elegant performant way to handle them in the short amount of time I had to focus on this. That function actually fully vectorizes from vanilla C. I know it's possible to make it fast and support subnormals too. I think it's something that should be pursued and is worthwhile to pursue. It's even more important to do in the conversion / quantization tools. I'll put some time aside today to see what I can do to help.
OK so I looked into this. When I wrote … So I asked Claude what to do and I got a very silly answer involving …

```cpp
// Returns the increment to add to the bits of a finite F32 value to round a
// finite F32 to the nearest BF16 value
static HWY_INLINE HWY_MAYBE_UNUSED constexpr uint32_t F32BitsToBF16RoundIncr(
    const uint32_t f32_bits) {
  return static_cast<uint32_t>(((f32_bits & 0x7FFFFFFFu) < 0x7F800000u)
                                   ? (0x7FFFu + ((f32_bits >> 16) & 1u))
                                   : 0u);
}

// Converts f32_bits (which is the bits of a F32 value) to BF16 bits,
// rounded to the nearest F16 value
static HWY_INLINE HWY_MAYBE_UNUSED constexpr uint16_t F32BitsToBF16Bits(
    const uint32_t f32_bits) {
  // Round f32_bits to the nearest BF16 by first adding
  // F32BitsToBF16RoundIncr(f32_bits) to f32_bits and then right shifting
  // f32_bits + F32BitsToBF16RoundIncr(f32_bits) by 16

  // If f32_bits is the bit representation of a NaN F32 value, make sure that
  // bit 6 of the BF16 result is set to convert SNaN F32 values to QNaN BF16
  // values and to prevent NaN F32 values from being converted to an infinite
  // BF16 value
  return static_cast<uint16_t>(
      ((f32_bits + F32BitsToBF16RoundIncr(f32_bits)) >> 16) |
      (static_cast<uint32_t>((f32_bits & 0x7FFFFFFFu) > 0x7F800000u) << 6));
}

HWY_API HWY_BF16_CONSTEXPR bfloat16_t BF16FromF32(float f) {
#if HWY_HAVE_SCALAR_BF16_OPERATORS
  return static_cast<bfloat16_t>(f);
#else
  return bfloat16_t::FromBits(
      detail::F32BitsToBF16Bits(BitCastScalar<uint32_t>(f)));
#endif
}
```

And that's binary equivalent to what I wrote with the FTZ clause removed. So you're right. The smartest thing here would have probably been to do nothing. Plus our implementation actually goes slightly faster than Highway's.

```c
static ggml_bf16_t ggml_compute_fp32_to_bf16(float s) {
    union {
        float f;
        unsigned i;
    } u = {s};
    ggml_bf16_t h;
    if ((u.i & 0x7fffffff) > 0x7f800000) { /* nan */
        h.x = u.i >> 16; /* unrounded conversion */
        h.x |= 64;       /* maintain nan and have it be quiet */
        return h;
    }
    h.x = (u.i + (0x7fff + ((u.i >> 16) & 1))) >> 16;
    return h;
}
```

The question is probably better asked of Intel and AMD. It's probably just because x86 is famous for having very slow handling of subnormals. But it should be dependent on the MXCSR register FTZ bit being set. So we should probably just remove that if statement.
Fun fact: …
@jart If I understood you correctly, you're suggesting that the flush-to-zero should be removed from both conversion and quantizing (in ggml)?
Yes. It makes the f32->bf16 algorithm simpler, faster, and more consistent with brain codebases. With normal floating point, flush to zero is only enabled in … Please note functions like … In any case, it's likely of little consequence, since subnormals are rare in the models whose floats I've analyzed, e.g. Mistral.
* ggml : add reference fp32 to bf16 conversion
  The fast version is no longer equivalent for all platforms because of the handling of subnormal values.
* gguf-py : remove flush to zero for bf16 subnormals
* gguf-py : remove float32 truncation to bf16
  Rounding achieves the same thing in the cases where this was used.
@jart Doing so would require changing that masterfully aligned comment. I might have made it bad though.

```diff
diff --git a/ggml-impl.h b/ggml-impl.h
index 5e77471f..397e22a6 100644
--- a/ggml-impl.h
+++ b/ggml-impl.h
@@ -80,8 +80,9 @@ static inline float ggml_compute_bf16_to_fp32(ggml_bf16_t h) {
 /**
  * Converts float32 to brain16.
  *
- * This function is binary identical to AMD Zen4 VCVTNEPS2BF16.
- * Subnormals shall be flushed to zero, and NANs will be quiet.
+ * This is binary identical with Google Brain float conversion.
+ * Floats shall round to nearest even, and NANs shall be quiet.
+ * Subnormals aren't flushed to zero, except perhaps when used.
  * This code should vectorize nicely if using modern compilers.
  */
 static inline ggml_bf16_t ggml_compute_fp32_to_bf16(float s) {
@@ -95,10 +96,6 @@ static inline ggml_bf16_t ggml_compute_fp32_to_bf16(float s) {
         h.bits = (u.i >> 16) | 64; /* force to quiet */
         return h;
     }
-    if (!(u.i & 0x7f800000)) { /* subnormal */
-        h.bits = (u.i & 0x80000000) >> 16; /* flush to zero */
-        return h;
-    }
     h.bits = (u.i + (0x7fff + ((u.i >> 16) & 1))) >> 16;
     return h;
 }
```
This seems to be used by … A workaround is to add … Although I don't know if …
I ran some tests with some small models I had (less than 3B), and I can't find one with subnormals. It seems like they are usually already flushed to zero upstream. (EDIT: …)
I recommend having the quantize example program set a global variable that tells the ggml function to not use the AVX512 version.
No need, there's already a clean way to handle reference implementations: lines 894 to 895 in 675a741.

And during inference: lines 12141 to 12142 in 675a741.

Which is then used shortly after: lines 12203 to 12205 in 675a741.

So I think simply using the reference implementation in …
After closely analyzing Google Brain codebases, we decided that flushing to zero was the wrong thing to do. Intel and AMD probably designed their microprocessors to always flush to zero for the wrong reasons. It should have been made conditional on FTZ being set in MXCSR like other opcodes. See ggerganov/llama.cpp#7843
I'm back. @compilade Anything you need me to do?
* add truncate_bf16
* truncate intermediate fp32 if converting bf16 to bf16
* fix masking in __compute_fp32_to_bf16
* np.int16 no longer used
* missing cast and additional numpy 2.x fix
* ggml-impl : do not flush bf16 subnormals to zero
* ggml : add reference fp32 to bf16 conversion
  The fast version is no longer equivalent for all platforms because of the handling of subnormal values.
* gguf-py : remove flush to zero for bf16 subnormals
* gguf-py : remove float32 truncation to bf16
  Rounding achieves the same thing in the cases where this was used.
* missed prototype update in merge
* merge cleanup

---------

Co-authored-by: Francis Couture-Harpin <[email protected]>
If source model tensors were unnormalized BF16 weights and you converted with outtype BF16, the target GGUF weights would deviate from source weights due to intermediate transition to FP32 (for numpy) and flushing/rounding to BF16.
This PR simply truncates the FP32 if the source is BF16. Also fixes implicit upcasting and signed shifting in `__compute_fp32_to_bf16`, which fails in latest numpy.

Fixes #8147
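Roughly, the BF16 -> BF16 path described above amounts to the following sketch (my own illustration with hypothetical helper names, not the PR's actual truncate_bf16 code): when the source tensor is already BF16, the intermediate F32's lower 16 bits are zero, so the target BF16 bits can be taken by truncation instead of flushing/rounding, and the source bits survive unchanged.

```python
import numpy as np

def bf16_to_fp32(bf16_bits: np.ndarray) -> np.ndarray:
    """Upcast bf16 bit patterns (uint16) to float32 by placing them in the high 16 bits."""
    return (bf16_bits.astype(np.uint32) << 16).view(np.float32)

def truncate_fp32_to_bf16(f32: np.ndarray) -> np.ndarray:
    """Keep only the top 16 bits; exact whenever the float32 came from bf16."""
    return (f32.view(np.uint32) >> 16).astype(np.uint16)

src = np.array([0x3F80, 0x0001, 0x8001, 0x7FC1], dtype=np.uint16)  # includes subnormals and a NaN
roundtrip = truncate_fp32_to_bf16(bf16_to_fp32(src))
print(np.array_equal(src, roundtrip))  # True: the source bf16 bits are preserved exactly
```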