Pre-compile data as binary blobs #23

Closed
overlookmotel opened this issue May 31, 2024 · 9 comments

Comments

@overlookmotel

overlookmotel commented May 31, 2024

Just to expand on discussion we had earlier...

In my opinion, it is never going to be possible to make this crate compile as fast as we want, no matter what we throw at it, without taking a different approach. It's just tons of data, so using codegen to generate tons of code and then asking Rust to parse and compile it all is always going to take a long time.

A possible solution could be to pre-compile it as binary data. Something like this:

At build time:

  • Convert JSON to HashMap/Vec/etc data structures.
  • Serialize those data structures with rkyv.
  • Save the serialized binary data to disk.

At runtime:

  • Load the binary data from disk.
  • Deserialize (which with rkyv is zero cost - just casting a pointer).
  • Bonus points: Don't even load the data - just mmap it from disk.
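To make the build-time/runtime split above concrete, here is a minimal std-only sketch. It is a hand-rolled stand-in and not rkyv: the record layout (two little-endian `u16` values) and the names `encode`, `get`, and `RECORD_SIZE` are assumptions made up for illustration. The point is only that a fixed binary layout can be read in place, without a parse step up front.

```rust
/// Hypothetical fixed-layout record: (feature_id: u16, first_version: u16),
/// both stored little-endian. Not the crate's real data model.
const RECORD_SIZE: usize = 4;

/// "Build time": flatten the records into one contiguous byte blob.
fn encode(records: &[(u16, u16)]) -> Vec<u8> {
    let mut out = Vec::with_capacity(records.len() * RECORD_SIZE);
    for &(id, version) in records {
        out.extend_from_slice(&id.to_le_bytes());
        out.extend_from_slice(&version.to_le_bytes());
    }
    out
}

/// "Runtime": read record `index` straight out of the blob.
/// Returns `None` if the index is out of range.
fn get(blob: &[u8], index: usize) -> Option<(u16, u16)> {
    let start = index.checked_mul(RECORD_SIZE)?;
    let end = start.checked_add(RECORD_SIZE)?;
    let rec = blob.get(start..end)?;
    let id = u16::from_le_bytes([rec[0], rec[1]]);
    let version = u16::from_le_bytes([rec[2], rec[3]]);
    Some((id, version))
}
```

In a real setup the codegen step would presumably write the blob to disk (e.g. with `std::fs::write`) and the library would embed it with `include_bytes!` or map the file; rkyv would replace the manual byte handling.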

Background: rkyv's "special sauce" is relative pointers: https://rkyv.org/architecture/relative-pointers.html
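As a rough illustration of the relative-pointer idea, the sketch below stores an offset measured from the position of the pointer itself, so the data still resolves after the whole buffer is moved. The layout (an `i32` offset followed by an `i32` length) and the helper names are assumptions for this example only and do not reflect rkyv's actual encoding.

```rust
use std::convert::TryFrom;

/// Read a little-endian i32 at byte position `pos`.
fn read_i32(buf: &[u8], pos: usize) -> Option<i32> {
    let b = buf.get(pos..pos.checked_add(4)?)?;
    Some(i32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Resolve a hypothetical "relative string" header located at `pos`:
/// an i32 offset (relative to `pos`) followed by an i32 length.
fn resolve_str(buf: &[u8], pos: usize) -> Option<&str> {
    let offset = read_i32(buf, pos)?;
    let len = read_i32(buf, pos.checked_add(4)?)?;
    // Target = header position + signed offset.
    let start = i64::try_from(pos).ok()?.checked_add(i64::from(offset))?;
    let start = usize::try_from(start).ok()?;
    let len = usize::try_from(len).ok()?;
    let bytes = buf.get(start..start.checked_add(len)?)?;
    std::str::from_utf8(bytes).ok()
}

/// Build a 16-byte example: payload at 0..6, padding, header at 8.
fn build_example() -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(b"chrome"); // payload
    buf.extend_from_slice(&[0, 0]); // padding so the header starts at 8
    buf.extend_from_slice(&(-8i32).to_le_bytes()); // 8 + (-8) = 0
    buf.extend_from_slice(&6i32.to_le_bytes()); // payload length
    buf
}
```

Because the offset is relative, prepending 16 bytes to the buffer and resolving the header at position 24 yields the same string. No pointer fix-up is needed after loading, which is what makes the "just cast it" approach possible.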

@Boshen
Member

Boshen commented Jun 1, 2024

I managed to shrink the code's "surface area" in https://github.com/oxc-project/oxc-browserslist/pull/32/files; compile time is halved, from 8s to 4s.

I'll stop optimizing as I need to do some real work ...

@Boshen Boshen closed this as not planned Jun 1, 2024
@Boshen
Member

Boshen commented Jun 1, 2024

While cleaning up criterion2, I discovered https://crates.io/crates/ciborium which may help us here.

@Boshen Boshen reopened this Jun 1, 2024
@Boshen
Member

Boshen commented Jun 1, 2024

I couldn't find it, but we can cut a lot of characters if the codegen emits a raw string, which removes all the escaped double quotes: r#"huge string without escaped quotes"#
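A codegen helper for this could look roughly like the sketch below. The function name `raw_string_literal` is hypothetical and not part of the crate's codegen. It picks the smallest number of `#` characters that keeps the literal unambiguous, and it assumes the input contains no bare carriage returns, which Rust raw string literals do not accept.

```rust
/// Render `s` as the source text of a Rust raw string literal, e.g. r#"..."#.
/// Uses just enough `#`s that no `"` followed by `#`s inside `s` can
/// terminate the literal early.
fn raw_string_literal(s: &str) -> String {
    let mut has_quote = false;
    let mut max_hash_run = 0usize;
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '"' {
            has_quote = true;
            // Count the '#'s that directly follow this quote.
            let mut run = 0usize;
            while chars.peek() == Some(&'#') {
                chars.next();
                run += 1;
            }
            max_hash_run = max_hash_run.max(run);
        }
    }
    // No quotes: plain r"..." is enough. Otherwise use one more '#' than
    // the longest quote-plus-hashes sequence found in the input.
    let n = if has_quote { max_hash_run + 1 } else { 0 };
    let hashes = "#".repeat(n);
    format!("r{0}\"{1}\"{0}", hashes, s)
}
```

For JSON-like input such as `{"a":1}` this produces `r#"{"a":1}"#`, so the generated file carries no backslash escapes for the quotes.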

@overlookmotel
Author

overlookmotel commented Jun 1, 2024

ciborium looks good for compact data representation. However, it has a deserialization step which might be quite costly at runtime.

rkyv's advantage is no deserialization - it stores the data in a form where it can just be loaded into memory and is ready to go. You can load it statically with a zero-cost transmute:

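static DATA: &Data = {
    // `Data` stands for the type the blob is read as (with rkyv, the archived type).
    #[repr(C)] // Guarantee 'bytes' comes after '_align'
    struct Aligned<Bytes: ?Sized> {
        _align: [Data; 0], // zero-sized, but forces the struct to `Data`'s alignment
        bytes: Bytes,
    }

    static ALIGNED: &Aligned<[u8]> =
        &Aligned { _align: [], bytes: *include_bytes!("./data.bin") };
    // Only sound if the blob really holds a valid `Data` at offset 0.
    unsafe { &*(ALIGNED as *const _ as *const Data) }
};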

(filched this code from https://users.rust-lang.org/t/can-i-conveniently-compile-bytes-into-a-rust-program-with-a-specific-alignment/24049/2)
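The alignment trick can be tried in isolation with the std-only sketch below. It is an assumption-laden variant of the forum snippet: an inline byte string stands in for `include_bytes!("./data.bin")`, `u64` stands in for the real data type's alignment, and the helper names are made up. It stays in safe code and only checks that the embedded bytes end up suitably aligned.

```rust
#[repr(C)] // keep `bytes` after `_align`, at offset 0 of the struct
struct AlignedTo<Align, Bytes: ?Sized> {
    _align: [Align; 0], // zero-sized; only contributes `Align`'s alignment
    bytes: Bytes,
}

// Inline bytes stand in for `*include_bytes!("./data.bin")`.
static ALIGNED: &AlignedTo<u64, [u8]> = &AlignedTo {
    _align: [],
    bytes: *b"\x01\x00\x00\x00\x02\x00\x00\x00",
};

/// The embedded bytes, guaranteed to start at a `u64`-aligned address.
fn data_bytes() -> &'static [u8] {
    &ALIGNED.bytes
}

/// True if the slice starts at an address that is a multiple of `align`.
fn is_aligned_to(bytes: &[u8], align: usize) -> bool {
    (bytes.as_ptr() as usize) % align == 0
}
```

A plain `include_bytes!` gives a `&[u8; N]` with alignment 1, which is why the wrapper is needed before reinterpreting the bytes as a typed structure.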

@Boshen Boshen added the "good first issue" label Jun 2, 2024
@Boshen
Member

Boshen commented Jun 2, 2024

I've labeled this "good first issue" if anyone wants to try to reduce the compilation time of this crate.

The current bottleneck comes from these two files, where the data is huge: https://github.com/oxc-project/oxc-browserslist/blob/main/src/generated/caniuse_feature_matching.rs and https://github.com/oxc-project/oxc-browserslist/blob/main/src/generated/caniuse_region_matching.rs

The data is generated by cargo codegen.

@barvirm

barvirm commented Nov 10, 2024

Hi, I don't think this is necessary. I tried to rewrite all generated files with rkyv v8 (https://github.com/barvirm/oxc-browserslist/tree/rkyv_v8).

// master
compilation time after clean: +/- 23s
library size: 11MB


// rkyv_v8: 
compilation time after clean: +/- 26s
library size: 8.9MB
benchmark: 0-15% regression from master in some cases due to iteration over ArchivedTypes.

@Boshen
Member

Boshen commented Nov 10, 2024

Hi, I don't think this is necessary. I tried to rewrite all generated files with rkyv v8 (https://github.com/barvirm/oxc-browserslist/tree/rkyv_v8).

// master
compilation time after clean: +/- 23s
library size: 11MB


// rkyv_v8: 
compilation time after clean: +/- 26s
library size: 8.9MB
benchmark: 0-15% regression from master in some cases due to iteration over ArchivedTypes.

This looks promising, can you pr?

@overlookmotel
Author

Just to say, I think there are other changes we could make to make the data structures in this crate more efficient (e.g. reducing the size of data structures, replacing hash maps keyed by browser name as &str with arrays indexed by browser ID) which would make those structures more suitable for serialization with rkyv, without so much of the cost of rkyv's Archived types.
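One way to picture the "arrays indexed by browser ID" idea is the sketch below. Everything in it is hypothetical: the `BrowserId` enum, the three-browser list, the `FeatureSupport` struct, and the convention that `0` means "unsupported" are assumptions for illustration and do not describe the crate's current data model.

```rust
/// Hypothetical dense browser ID; the discriminant doubles as an array index.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
enum BrowserId {
    Chrome = 0,
    Firefox = 1,
    Safari = 2,
}

const BROWSER_COUNT: usize = 3;

/// Map a browser name to its ID once, at the API boundary.
fn browser_id(name: &str) -> Option<BrowserId> {
    match name {
        "chrome" => Some(BrowserId::Chrome),
        "firefox" => Some(BrowserId::Firefox),
        "safari" => Some(BrowserId::Safari),
        _ => None,
    }
}

/// Hypothetical per-feature table: first supporting major version per
/// browser, with 0 meaning "not supported".
struct FeatureSupport {
    first_version: [u16; BROWSER_COUNT],
}

impl FeatureSupport {
    /// Plain array index instead of hashing a `&str` key.
    fn first_version(&self, id: BrowserId) -> Option<u16> {
        match self.first_version[id as usize] {
            0 => None,
            v => Some(v),
        }
    }
}
```

A fixed-size array of integers like this has no pointers and no hashing, so it would serialize to a flat, compact layout; that is presumably the property that would make it cheaper to handle through rkyv's archived types.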

Personally, I suspect that the rkyv approach would yield faster compile times, smaller binary, and possibly faster runtime too (no serde deserialization at runtime), but we should attempt to optimize the data structures first. Trying to implement rkyv before that's done is unlikely to produce improvements.

I removed the "good first issue" label because I think this is a bit of a hornet's nest! Sorry if I've wasted everyone's time by raising it prematurely.

@Boshen Boshen closed this as not planned Dec 14, 2024