Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Add support for f128::f128 and half::{ f16, bf16 } #46

Closed
Evrey opened this issue Aug 2, 2020 · 21 comments
Closed

[FEATURE] Add support for f128::f128 and half::{ f16, bf16 } #46

Evrey opened this issue Aug 2, 2020 · 21 comments
Assignees
Labels
enhancement New feature or request normal priority Normal Priority
Milestone

Comments

@Evrey
Copy link

Evrey commented Aug 2, 2020

Problem

Currently, lexical-core can only parse f32 and f64, but especially for designers of programming languages supporting more number formats than Rust does would be nice.

Solution

Offer a feature-gated default impl for f128 using the f128 crate and f16, bf16 from the half crate.

Prerequisites

  • lexical version : 0.7.*
  • lexical compilation features used: format, correct, radix

Alternatives

Don't see any beyond »let's not«.

Additional Context

Rust has u128, as having it for e.g. crypto is convenient, despite no mainstream CPU having 128-bit integer arithmetic and registers. f16 is very often used, e.g. in GPU code, bf16 specifically in neural network code, and f128 also finds some use here and there.

There's also 8-bit floats, though not IEEE-standardised, and there's IEEE 754 binary256. However, I know of no handy softfloat crates for these.

As lexical-core aims to be a proglang-agnostic number parser, i.e. not tied to Rust formats and types, I see no reason to completely restrict oneself to just the built-in Rust machine types.

@Evrey Evrey added the enhancement New feature or request label Aug 2, 2020
@myrrlyn
Copy link
Collaborator

myrrlyn commented Apr 18, 2021

I don't know nearly enough about either the parse algorithms or the bit layout of these types to begin to take this on myself. I'll reach out to Alex and see if he has the time or capability to weigh in.

@Alexhuszagh
Copy link
Owner

@myrrlyn This would be doable for sure. The only issue is that the number of significant digits would have to increase dramatically for the arbitrary-precision arithmetic. Currently, in base 10, we need 768 digits to accurately represent a 64-bit float for rounding,

The formula, which you can find in comments in my source code, is as follows:

///
/// `−emin + p2 + ⌊(emin + 1) log(2, b) − log(1 − 2^(−p2), b)⌋`
///
/// For f32, this follows as:
///     emin = -126
///     p2 = 24
///
/// For f64, this follows as:
///     emin = -1022
///     p2 = 53

For f128, the exponent size is changed from 11 bits to 15 bits, so we now have the following numbers:

/// For f128, this follows as:
///    emin = −16382
///    p2 = 113

Which, would then require 11563 digits for a 128-bit float, rather than 769 for a 64-digit float. Which is why I didn't implement it originally.

@Alexhuszagh
Copy link
Owner

Alexhuszagh commented Apr 19, 2021

For the actual links to the documentation in the code, the comments documenting all this can be found here.

The formula so you can test it on your own is here:

def digits(emin, p2, b):
    return -emin + p2 + math.floor((emin+ 1)*math.log(2, b)-math.log(1-2**(-p2), b))

Where in the above case, emin = 16382, p = 113, and b = 10.

@Alexhuszagh
Copy link
Owner

Alexhuszagh commented Apr 19, 2021

Support for f16 and bf16 should be trivial, however. For f128, I"m very worried about the following however, so it would never be the default (maybe on option like radix?).

pub(super) fn parse_mantissa<'a, Data>(data: Data, radix: u32, max_digits: usize)
-> Bigint
where Data: SlowDataInterface<'a>
{
let small_powers = Bigint::small_powers(radix);
let count = data.mantissa_digits();
let bits = count / integral_binary_factor(radix).as_usize();
let bytes = bits / Limb::BITS;
// Main loop
let step = small_powers.len() - 2;
let base = as_limb(radix);
let max_digits = max_digits - 1;
let mut counter = 0;
let mut value: Limb = 0;
let mut i: usize = 0;
let mut result = Bigint::default();
result.data.reserve(bytes);
// Iteratively process all the data in the mantissa.
let mut integer_iter = data.integer_iter();
let mut fraction_iter = data.significant_fraction_iter();
add_digits!(integer_iter, result, value, i, counter, step, small_powers, base, radix, max_digits);
if integer_iter.consumed() {
// Continue if we haven't already processed the max digits.
add_digits!(fraction_iter, result, value, i, counter, step, small_powers, base, radix, max_digits);
}
// We will always have a remainder, as long as we entered the loop
// once, or counter % step is 0.
if counter != 0 {
result.imul_small(small_powers[counter]);
result.iadd_small(value);
}
// If we have any remaining digits after the last value, we need
// to add a 1 after the rest of the array, it doesn't matter where,
// just move it up. This is good for the worst-possible float
// representation. We also need to return an index.
// Since we already trimmed trailing zeros, we know there has
// to be a non-zero digit if there are any left.
let is_consumed = integer_iter.consumed() && fraction_iter.consumed();
if !is_consumed {
result.imul_small(base);
result.iadd_small(1);
}
result
}

This is because it's fairly trivial with 767 mantissa digits + 324 exponent digits to stack-allocate everything which is ~1091 digits or ~3600 bits, meanwhile, with an f128, we'd need 11563 mantissa digits + 4932 exponent digits, or ~55,000 bits. We're not stack-allocating 7KB just for a single operation, which is totally doable using a Vec, but would require an allocator.

EDIT:

This might be mitigated a bit if we change to use associated type in the Float trait for the size. We currently just overestimate because it's really not a big deal, since at max we're dealing with like 800 bytes. This might have a few benefits outside of just the specific f128 case.

Currently, our two storage options (for different correctness algorithms, which have speed tradeoffs) are as follows:

https://github.com/Alexhuszagh/rust-lexical/blob/master/lexical-core/src/atof/algorithm/bignum.rs#L9-L21
https://github.com/Alexhuszagh/rust-lexical/blob/master/lexical-core/src/atof/algorithm/bignum.rs#L118-L124

If we make it an associated type for the Float trait, we could easily lower these values for f32, keep them the same for F64, and use a Vec for f128. This should avoid any performance penalties for simpler cases and allow an f128 case without defaulting to a dynamically-allocated Vec.

@Evrey
Copy link
Author

Evrey commented Apr 19, 2021

Given how often f16 and rarely f128 is used out in the wild, at least i would be totally fine with just going f16/bf16 for now. That stack memory usage for f128 does sound really expensive. That said, i do like the associated type solution.

@Alexhuszagh
Copy link
Owner

@Evrey What implementations do you want for f16, bf16, and f128. I see the following:

  • half for f16, bf16

I don't see a Rust version of f128 that doesn't rely on GCC intrinsics, which I'd rather avoid. It might definitely be possible to write our own library based off of the LLVM intrinsics, but that's rather low-level and I'd rather avoid that for now (and that would be beyond the scope of lexical, and be an external dependency).

In the meantime, I will be feature-gating the code with features that don't (yet) exist for the associated constants, and better documentation for the required storage for each type in the comments, and add associated types for the storage required.

Alexhuszagh added a commit that referenced this issue Apr 20, 2021
Adds the associated constants `BIGINT_LIMBS`, `BIGFLOAT_LIMBS`, and `EXPONENT_SIZE` to `Float`.
Adds the associated types `BigintStorage` and `BigfloatStorage` to `Float`.
Implements `Bigfloat` and `Bigint` in terms of `FloatType`, and the associated storage can differ depending on the float size.
@Alexhuszagh
Copy link
Owner

Alexhuszagh commented Apr 20, 2021

@RazrFalcon The preliminary logic for all of this is implemented, however, I need concrete types for anything more significant.

The following additions have been added, with appropriate types, should make everything work out-of-the-box.

In Float:

  • BIGINT_LIMBS
  • BIGFLOAT_LIMBS
  • BigintStorage
  • BigfloatStorage
  • EXPONENT_SIZE

I then re-implemented Bigint and Bigfloat in terms of these associated types (right now, the *_LIMBS constants are just decorative until later versions, when we can use them in ArrayVec directly).

I also added extensive comments to ensure the choices for these sizes were clear:

/// The minimum, denormal exponent can be calculated as follows: given
/// the number of exponent bits `exp_bits`, and the number of bits
/// in the mantissa `mantissa_bits`, we have an exponent bias
/// `exp_bias` equal to `2^(exp_bits-1) - 1 + mantissa_bits`. We
/// therefore have a denormal exponent `denormal_exp` equal to
/// `1 - exp_bias` and the minimum, denormal float `min_float` is
/// therefore `2^denormal_exp`.
///
/// For f16, this follows as:
///     exp_bits = 5
///     mantissa_bits = 10
///     exp_bias = 25
///     denormal_exp = -24
///     min_float = 5.96 * 10^−8
///
/// For bfloat16, this follows as:
///     exp_bits = 8
///     mantissa_bits = 7
///     exp_bias = 134
///     denormal_exp = -133
///     min_float = 9.18 * 10^−41
///
/// For f32, this follows as:
///     exp_bits = 8
///     mantissa_bits = 23
///     exp_bias = 150
///     denormal_exp = -149
///     min_float = 1.40 * 10^−45
///
/// For f64, this follows as:
///     exp_bits = 11
///     mantissa_bits = 52
///     exp_bias = 1075
///     denormal_exp = -1074
///     min_float = 5.00 * 10^−324
///
/// For f128, this follows as:
///     exp_bits = 15
///     mantissa_bits = 112
///     exp_bias = 16495
///     denormal_exp = -16494
///     min_float = 6.48 * 10^−4966
pub(super) fn max_digits<F>(radix: u32)
    -> Option<usize>
    where F: Float
{
    match F::BITS {
        32 => max_digits_f32(radix),
        64 => max_digits_f64(radix),
        _  => unreachable!(),
    }
}}
/// Storage for a big integer type.
///
/// This is used for the bhcomp::large_atof and bhcomp::small_atof
/// algorithms. Specifically, it stores all the significant digits
/// scaled to the proper exponent, as an integral type,
/// and then directly compares these digits.
///
/// This requires us to store the number of significant bits, plus the
/// number of exponent bits (required) since we scale everything
/// to the same exponent.
/// This therefore only needs the following number of digits to
/// determine the correct representation (the algorithm can be found in
/// `max_digits` in `bhcomp.rs`):
///  * `bfloat16` - 138
///  * `f16`      - 29
///  * `f32`      - 158
///  * `f64`      - 1092
///  * `f128`     - 16530
#[derive(Clone, PartialEq, Eq)]
#[cfg_attr(test, derive(Debug))]
pub(crate) struct Bigint<F: Float> {
    /// Internal storage for the Bigint, in little-endian order.
    pub(crate) data: F::BigintStorage,
}
/// Storage for a big floating-point type.
///
/// This is used for the bigcomp::atof algorithm, which crates a
/// representation of `b+h` and the float scaled into the range `[1, 10)`.
/// This therefore only needs the following number of digits to
/// determine the correct representation (the algorithm can be found in
/// `max_digits` in `bhcomp.rs`):
///  * `bfloat16` - 97
///  * `f16`      - 22
///  * `f32`      - 113
///  * `f64`      - 768
///  * `f128`     - 11564
#[derive(Clone, PartialEq, Eq)]
#[cfg_attr(test, derive(Debug))]
pub struct Bigfloat<F: Float> {
    /// Internal storage for the Bigfloat, in little-endian order.
    ///
    /// Enough storage for up to 10^345, which is 2^1146, or more than
    /// the max for f64.
    pub(crate) data: F::BigfloatStorage,
    /// It also makes sense to store an exponent, since this simplifies
    /// normalizing and powers of 2.
    pub(crate) exp: i32,
}

This should simplify maintainability while also adding support for these types later down the road. If all checks pass, I'll merge this branch into master and keep this branch until I get concrete implementation details on the floats we want to use.

@Evrey
Copy link
Author

Evrey commented Apr 21, 2021

@Alexhuszagh I'm fine with not having f128 for now. I'd be nice for symmetry, but alas it is rarely used/needed. Unlike f16/bf16. half is really the only impl for that i know of, so it's the one i suggested. But if you happen to know a better one (according to whichever criteria), then have at it. Maybe multiple impls could be added if demand is there. I just need it for compiler design stuff, so i mostly treat f16 as a storage format.

@Alexhuszagh
Copy link
Owner

Alexhuszagh commented Apr 21, 2021

@Evrey I've just looked at the half crate, and there is no support for any mathematical operations, which prevents the fast path from being implemented. In order to implement this then, we would need to:

  1. Create an f16, bf16, and f128 type using LLVM intrinsics (if possible).
  2. Ensure there are conversion operations to and from half in this crate if the half feature is enabled.
  3. Implement parsing in terms of these types.

I'm not exactly the world's best maintainer, so I would likely ask a few, well-trusted maintainers to ensure the crate is well-maintained (like with lexical) in the case my mental health deteriorates.

@Evrey
Copy link
Author

Evrey commented Apr 21, 2021

Uff, okay, that's a lot of work. D: Sounds like effort is instead best put in preparation for more types, then, and pushing whatever f16 RFC may exist in that RFC ocean?

E: Btw., while GPUs do f16 math in hardware, CPUs rarely implement actual f16 math, and instead just offer hardware type conversion to/from f32. Perhaps these »fast paths« are doable with explicit f32 casts?

@Alexhuszagh Alexhuszagh added this to the 1.0 milestone Apr 21, 2021
@Alexhuszagh Alexhuszagh added the normal priority Normal Priority label Apr 21, 2021
@Alexhuszagh
Copy link
Owner

Alexhuszagh commented Apr 21, 2021

@RazrFalcon I believe LLVM supports via it's intrinsics all of the ones we want. Not everything, so it would have to be feature-gated and implemented on supported hardware.

Here's documentation from LLVM on f16 and bf16 and f128. This would likely have to be unstable, since Rustc inherently checks for any LLVM intrinsics. The other, fallback solution would simply be software emulation, which could be decently fast and not too difficult. The only reason this would be doable without too much work is there's a pretty big body of permissively licensed work I believe that has robust implementations of this.

I would likely be using llvmint-like wrappers for this (since llvmint doesn't seem to be maintained).

@Alexhuszagh
Copy link
Owner

Alexhuszagh commented Apr 29, 2021

Ok I've taken a closer look at this, and it seems we have the following cases:

  • __fp16, __bf16 in Clang's own specifications seem to be storage-only formats. After a careful look at x86 and ARM ISAs, the only instructions seem to be conversion to and from these binary formats. For x86, AVX512 extensions seem to only support packing single-precision floats to bf16, and then an intrinsic dot-product. This does not affect anything we do here, and it seems all other operations are lossy.
    • Operations using these types will therefore intrinsically be truncated, so accuracy is not exactly needed (round up to f32 and down would be perfectly valid).
  • _Float16 is a C-language extension that does near-native 16-bit floating point arithmetic if supported by the architecture. If it is not, then the arithmetic is performed as if it were a single-precision float, and then truncated back
  • f128 seems to be hardware-specific, as does ppcf128 (2x 64-bit floats for a larger dynamic range).

References for f16, bf16, and _Float16. References for f128. References for ARM half-precision types. References for x86 F16C extensions. References for x86 AVX-512 extensions for BF16.

So, after a careful look, this seems a lot more trivial than initially imagined:

  1. Support for f16, bf16 using float promotion and truncation. This is extremely trivial, since all mathematical operations are pretty much lossy, without any need for guard digits. The fast-path would then be pretty simple, and the slower paths would likewise be trivial.
  2. Support for _Float16 using either software-emulated floats or the LLVM intrinsics. This would lead to correct results (when supported by LLVM) for 16-bit float conversions
  3. Support for f128 using LLVM intrinsics or software-emulated floats. This should have perfect accuracy in all cases.

In terms of accuracy... f16 and bf16 would only have rounding issues with the following part of the fast-path algorithm, due to the lack of guard digits in this operation. Since it would be promoted to an f32, processed, and then truncated, it could lead to errors of a few ULP (not sure, I'd assume at max 1 but I'd have to test):

float.pow(radix, exponent)

I'm creating an initial, proof-of-concept crate now for support for these types. It's extremely preliminary, and currently private, but I'm working on it now.

Alexhuszagh added a commit that referenced this issue May 4, 2021
- Readded trivial implementation of algorithm_m.
    - Not compiled, just for future algorithm developments.
- Removed hard-coded logic for u64 in correct and bigint parsing.
- Should allow initial support for #46.
@Alexhuszagh
Copy link
Owner

Ok, I've thought about it a bit more carefully, and as long as we implement the logic to and from f32 properly, no loss of accuracy is possible since the fast path only operates if, and only if, the mantissa is capable of completely storing the digits (including exponents being shifted from the exponent to the mantissa).

The means, for the fast path, with:

  • f16: we have (-4, 4) for the max exponent range, along with 3 for the mantissa limit.
  • bf16, we have (-3, 3) for the max exponent range, along with 2 for the mantissa limit.

This means the fast path will only be valid for exponents [-4, 7] for f16, and [-3, 5] for bf16 where the mantissa is not truncated during parsing. This is trivial to test accuracy for: we should not have any issues here.

The code to generate these limits, in case you're wondering, can be found in the documentation:

// ```python
// import math
//
// def is_pow2(value):
// '''Calculate if a value is a power of 2.'''
//
// floor = int(math.log2(value))
// return value == 2**floor
//
//
// def remove_pow2(value):
// '''Remove a power of 2 from the value.'''
//
// while math.floor(value / 2) == value / 2:
// value //= 2
// return value
//
//
// def exponent_limit(radix, mantissa_size, min_exp, max_exp):
// '''
// Calculate the exponent limit for a float, for a given
// float type, where `radix` is the numerical base
// for the float type, and mantissa size is the length
// of the mantissa in bits. min_exp is the minimum,
// denormal binary exponent, and max_exp is the maximum
// binary exponent.
// '''
//
// if is_pow2(radix):
// # Can always be exactly represented, calculate relative
// # to min and max exp.
// scaled_min_exp = int(min_exp / math.log2(radix))
// scaled_max_exp = int(max_exp / math.log2(radix))
// return (scaled_min_exp, scaled_max_exp)
// else:
// # Positive and negative should be the same,
// # since we need to find the maximum digit
// # representable with mantissa digits.
// # We first need to remove the highest power-of-
// # two from the radix, since these will be represented
// # with exponent digits.
// base = remove_pow2(radix)
// precision = mantissa_size + 1
// exp_limit = int(precision / math.log2(base))
// return (-exp_limit, exp_limit)
//
//
// def mantissa_limit(radix, mantissa_size):
// '''
// Calculate mantissa limit for a float type, given
// the radix and the length of the mantissa in bits.
// '''
//
// precision = mantissa_size + 1
// return int(precision / math.log2(radix))
//
//
// def all_limits(mantissa_size, min_exp, max_exp):
// '''Print limits for all radixes.'''
//
// print('match radix.as_i32() {')
// for radix in range(2, 37):
// exp_limit = exponent_limit(radix, mantissa_size, min_exp, max_exp)
// print(f' {radix} => {exp_limit},')
// print('}')
//
// print('match radix.as_i32() {')
// for radix in range(2, 37):
// mant_limit = mantissa_limit(radix, mantissa_size)
// print(f' {radix} => {mant_limit},')
// print('}')
// ```

@Alexhuszagh
Copy link
Owner

On a separate note, I believe I've removed all the hard-coded logic internally to support f128, removing all the assumptions about an f64 mantissa used to simplify some trait impls. I needed to refactor this anyway, but it means we have most of the initial barriers removed for support for f16, bf16, and f128 (both will be feature-gated, f16 and bf16 under the feature = "f16" gate, and f128 under the feature = "f128" gate.

@Evrey
Copy link
Author

Evrey commented May 4, 2021

This is exciting news!

@Alexhuszagh
Copy link
Owner

OK, I've been looking at this carefully and currently am implementing bf16 and f16 support. However, I don't see f128 support coming any time soon, just because of the complexity of the algorithms required: it breaks most of the existing logic (for parsing, it would require large pre-computed Bellerophon tables, for the slow path, it would require a massive big integer of which would require a system allocator)... Likewise, I've removed a lot of the more complicated algorithms like Karatsuba multiplication because they scale poorly for f64 and smaller values, but they're asymptotically better for f128 and it would likely make a difference.

The current number of bits required to parse a f128 correctly would be:

let max_digits = 11564;
let min_denormal_exp = 4966;

So, we'd need enough bits to store 10^16530, or ~55KB. This absolutely requires a system allocator, since we're in danger of overflowing the stack just by that on musl libc. The pre-computed Bellerophon powers wouldn't be too bad for decimal strings (~76 u128), however, the slow path would absolutely be a pain currently.

The float-writing algorithms would be a lot easier, but would require support for 256-bit integer math (not too difficult to do, since have native 128-bit math).

I'll be implementing f16 and bf16 logic, crafting a release, and then consider implementing f128 logic. This will likely first only support float writing, then later, float parsing, but the latter is unlikely to occur in the near future.

@Alexhuszagh
Copy link
Owner

Alexhuszagh commented Sep 3, 2021

Ok full support for f16 and bf16 has been implemented, with minimal types bf16 and f16 being publicly exported in lexical-core and lexical. These have to_bits() and from_bits() which take a u16 type, for easy conversion to-and-from other half-precision floating point libraries, without having to rely on any external dependencies.

This can be enabled using the f16 feature as of lexical v6.0.0 and lexical-core v0.8.0.

@Evrey
Copy link
Author

Evrey commented Sep 4, 2021

What a lovely bulk-update!

@Alexhuszagh
Copy link
Owner

Linking this to #93.

@Alexhuszagh
Copy link
Owner

I'm going to drop considering support for f128 because there's a lot of optimizations that are effectively impossible, so using a more naive library would be ideal. It would only really make sense if using an antiquated algorithm like Grisu for floating point printing and even slow algorithms that have been removed from general support would which I have no plan on supporting right now and dealing with the preconditions and code changes to implement this.

Also, 128-bit floats aren't commonly used like the bf16 and f16 variants and I know of practically no support for them outside one that requires linking to glibc.

@Evrey
Copy link
Author

Evrey commented Sep 18, 2024

Good reasoning, and still a big win on half floats. Thanks for all the effort!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request normal priority Normal Priority
Projects
None yet
Development

No branches or pull requests

3 participants