-
Yes, let's make this more explicit. When we had the first meeting I had a completely twisted memory of how the bitstream is implemented, and I now believe that it is essential to change the bit endianness of the current format, which was conceived around Java's inherently big-endian design. The modifications are as follows:
This has the consequence that we will not be able to convert between the two formats just by reversing the bits in each byte, and in particular that no reasonably fast on-the-fly conversion will be possible (but I might be wrong on this). More precisely: if we write the ɣ code for 4, namely 00101, in a bit stream, presently we get the byte 00101000. We would like to switch to a format where the same code is written as 00001100. Note that the unary part is reversed, but the "payload" of the ɣ code is always written in the same direction, which is essential for quick insertion and extraction. @zommiommy, please correct me if I'm wrong.
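To make the byte layouts above concrete, here is a tiny sketch (my own illustration, not webgraph code) that packs the ɣ code of 4 into a byte under both bit orders:

```rust
fn main() {
    // MSB-to-LSB (current, Java-style) order: bits fill the byte from the
    // most significant position down. Writing 0,0,1,0,1 and padding with
    // zeros yields 0b00101000.
    let mut be: u8 = 0;
    for (i, bit) in [0u8, 0, 1, 0, 1].into_iter().enumerate() {
        be |= bit << (7 - i);
    }
    assert_eq!(be, 0b00101000);

    // LSB-to-MSB (proposed) order: the unary part 0,0,1 fills bits 0..3 in
    // writing order, while the 2-bit payload 01 is stored as a value in the
    // next positions, so it can be extracted with a single mask and shift.
    let mut le: u8 = 0;
    le |= 1 << 2;    // unary part: two zeros, then a one in bit 2
    le |= 0b01 << 3; // payload 0b01 in bits 3..5, same direction as before
    assert_eq!(le, 0b00001100);
}
```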
-
It's great that you have designed a new/better format. YAY for fresh looks at old problems!
We can still have a batch and slow conversion tool from the current webgraph-java format to the new upcoming webgraph-rs format, correct?
-
Yes. I was thinking about this issue this morning. I think that the correct thing to do is to work on readers and writers in parallel, for both formats. It is a bit more work initially, but we have the advantage of being able to unit-test reading and writing in both formats from the start, and in particular converting graphs. (I'm not saying to develop compression now, just bit-writing code, even not particularly efficient code.) An alternative is to write the conversion software in Java. But I think that it would also be useful to have a big-endian Rust version, as people would be able to use the current graphs (more slowly, if my intuition is correct). So I would suggest that @zommiommy starts in parallel a reader and writer for the big- and little-endian formats.

For the WordReader, I guess a const template parameter is sufficient. In the end, the only difference is a single call to the byte-rearrangement function when the native endianness and the format of the file disagree. We will have to add an endianness field to the graph.

For the bit reader, the code is slightly different, as sketched below. In the big-endian case we keep the bit buffer filled "at the top" (i.e., the most significant bits are correct), whereas in the little-endian case we keep the bit buffer filled "at the bottom" (i.e., the least significant bits are correct). In the big-endian case, to read a unary code we count the number of leading zeros, whereas in the little-endian case we count the number of trailing zeros. In the big-endian case, to extract k bits we take the k most significant bits of the bit buffer, whereas in the little-endian case we take the k least significant bits. But this is where the differences end: all other codes can be read using unary codes and bit blocks, so the rest of the code would be common (including minimal binary codes, albeit in the end the sequence of written bits would be different in the two cases; Tommaso, we can talk about this).

Note that the weird lack of symmetry between little and big endianness in my previous message is due to the same lack of symmetry at the hardware level: big-endian and little-endian architectures order bits in a byte in the same way, and that is essentially a little-endian way, as bit 0 is the least significant bit.
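As a rough illustration of the two disciplines, here is a minimal sketch; the struct names, the fixed 64-bit buffer, and the omission of refilling and bounds checks are all my simplifications, not the actual webgraph-rs design:

```rust
/// Big-endian discipline: the buffer is filled "at the top", i.e. the next
/// bits to be read are the most significant ones.
struct BigEndianReader {
    buffer: u64,
}

impl BigEndianReader {
    /// Unary code: count zeros from the top, then consume them plus the one.
    fn read_unary(&mut self) -> u32 {
        let zeros = self.buffer.leading_zeros();
        self.buffer <<= zeros + 1;
        zeros
    }

    /// Extract the k most significant bits of the buffer (0 < k < 64).
    fn read_bits(&mut self, k: u32) -> u64 {
        let bits = self.buffer >> (64 - k);
        self.buffer <<= k;
        bits
    }
}

/// Little-endian discipline: the buffer is filled "at the bottom", i.e. the
/// next bits to be read are the least significant ones.
struct LittleEndianReader {
    buffer: u64,
}

impl LittleEndianReader {
    /// Unary code: count zeros from the bottom, then consume them plus the one.
    fn read_unary(&mut self) -> u32 {
        let zeros = self.buffer.trailing_zeros();
        self.buffer >>= zeros + 1;
        zeros
    }

    /// Extract the k least significant bits of the buffer (0 < k < 64).
    fn read_bits(&mut self, k: u32) -> u64 {
        let bits = self.buffer & ((1 << k) - 1);
        self.buffer >>= k;
        bits
    }
}

fn main() {
    // The unary code 001 (two zeros, then a one) decodes to 2 in both cases.
    let mut be = BigEndianReader { buffer: 0b001 << 61 };
    let mut le = LittleEndianReader { buffer: 0b100 };
    assert_eq!(be.read_unary(), 2);
    assert_eq!(le.read_unary(), 2);
}
```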
-
Ok, implemented both readers; note that I explicitly use . Something I'm not sure about yet is whether we should have a single struct with a const generic to set the endianness, or two different structs. My concerns are the following:
-
Maybe having a macro for the time being, hoping that .

I'd say that a reasonable immediate target is to arrive at a point where we have unit tests writing and reading ɣ codes in both formats; a sketch of such a round-trip test is below. Then the rest should be relatively easy, and at that point we can also do some benchmarking.
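For concreteness, here is a self-contained sketch of the kind of round-trip unit test meant above, using a toy LSB-to-MSB bit stream of my own (the `BitStream` type and its methods are hypothetical, not the webgraph-rs API):

```rust
/// A toy LSB-to-MSB bit stream backed by a plain bit vector: good enough
/// for round-trip tests, not meant to be efficient.
#[derive(Default)]
struct BitStream {
    bits: Vec<bool>,
    pos: usize,
}

impl BitStream {
    fn write_unary(&mut self, n: u64) {
        for _ in 0..n {
            self.bits.push(false);
        }
        self.bits.push(true);
    }

    /// Write the k low bits of `value`, least significant bit first.
    fn write_bits(&mut self, value: u64, k: u32) {
        for i in 0..k {
            self.bits.push((value >> i) & 1 == 1);
        }
    }

    /// ɣ code of x >= 0, written as the code of v = x + 1.
    fn write_gamma(&mut self, x: u64) {
        let v = x + 1;
        let len = 63 - v.leading_zeros(); // floor(log2 v)
        self.write_unary(len as u64);
        self.write_bits(v, len); // the len low bits of v
    }

    fn read_unary(&mut self) -> u64 {
        let mut n = 0;
        while !self.bits[self.pos] {
            n += 1;
            self.pos += 1;
        }
        self.pos += 1;
        n
    }

    fn read_bits(&mut self, k: u32) -> u64 {
        let mut value = 0;
        for i in 0..k {
            value |= (self.bits[self.pos] as u64) << i;
            self.pos += 1;
        }
        value
    }

    fn read_gamma(&mut self) -> u64 {
        let len = self.read_unary() as u32;
        ((1 << len) | self.read_bits(len)) - 1
    }
}

fn main() {
    let mut s = BitStream::default();
    for x in 0..1000u64 {
        s.write_gamma(x);
    }
    for x in 0..1000u64 {
        assert_eq!(s.read_gamma(), x);
    }
    println!("ɣ round-trip OK");
}
```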
-
In code, the alternatives I see are:

Generic const enum (sorry, the feature I mentioned is the wrong one; the right one is `#![feature(adt_const_params)]`):

```rust
#![feature(adt_const_params)] // nightly only

#[derive(PartialEq, Eq)]
enum Endianess {
    Big,
    Little,
}

struct Reader<const E: Endianess>;

fn main() {
    let r = Reader::<{Endianess::Big}>;
}
```

Generic const bool:

```rust
struct Reader<const LittleEndian: bool>;
```

Two structs:

```rust
struct ReaderLittleEndian;
struct ReaderBigEndian;
```

Define two types and use them as generics:

```rust
trait Endianess {}
struct LittleEndian;
struct BigEndian;
impl Endianess for LittleEndian {}
impl Endianess for BigEndian {}

struct Reader<E: Endianess>(std::marker::PhantomData<E>);

fn main() {
    let r = Reader::<LittleEndian>(std::marker::PhantomData);
}
```
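For what it's worth, `adt_const_params` is a nightly-only feature, while the last, marker-type alternative works on stable Rust; it is the same pattern used, for example, by the `byteorder` crate's `BigEndian`/`LittleEndian` types.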
-
Sure, let's start with any one of them, and then, once the proof of concept works, migrate the code to the format we hope is the best.
-
I'd follow the
-
Status update: currently both readers and writers for . The current performance is as follows on my laptop (Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz):

I will now proceed to add the use of tables to speed up the Gamma code.
-
@vigna @zacchiro I've implemented the tables for unary and gamma codes. I used . I ran these benchmarks on my desktop with a Ryzen 3900x:

I'm executing a longer benchmark with 100_000 measurement iterations to validate these results.
-
Now they look better! We decided to use 8-bit tables for gamma and no tables for unary.
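For the record, here is a minimal sketch of what such an 8-bit ɣ table could look like in the current MSB-to-LSB order (hypothetical code, not the actual webgraph-rs implementation): each entry holds a (decoded value, code length) pair, and a length of 0 marks codes longer than 8 bits that must take the slow unary-plus-bit-block path.

```rust
/// Build a 256-entry table mapping the next 8 bits of the stream to the
/// (decoded value, code length) of the ɣ code starting there, if it fits.
/// Entries left at (0, 0) mark codes longer than 8 bits (slow path).
fn build_gamma_table() -> Vec<(u8, u8)> {
    let mut table = vec![(0u8, 0u8); 256];
    // In MSB-to-LSB order the ɣ code of v >= 1 is just v written in
    // 2 * floor(log2 v) + 1 bits, so the codes of v = 1..=15 fit in a byte.
    for v in 1u32..16 {
        let len = 31 - v.leading_zeros(); // floor(log2 v)
        let code_len = 2 * len + 1;
        let hi = (v << (8 - code_len)) as usize; // code aligned at the top
        // Every byte starting with this code decodes the same way.
        for low in 0..(1usize << (8 - code_len)) {
            table[hi | low] = ((v - 1) as u8, code_len as u8); // ɣ(x + 1) = v
        }
    }
    table
}

fn main() {
    let table = build_gamma_table();
    // The byte from the example at the top of the thread: 0b00101000 is the
    // ɣ code of 4 (i.e., 00101) plus padding; it decodes in a single lookup.
    assert_eq!(table[0b00101000], (4, 5));
}
```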
-
I just noticed that I didn't notice that the . This is just a methodological issue: I do not expect the results of our tests to change.
-
During today's review of the code we noticed that we read codes using an MSB-to-LSB order, which implies the need to use big-endian words. Since most modern computers are little-endian, we might want to add backends supporting an LSB-to-MSB (thus little-endian) order, which might result in better performance.
We will discuss this further in the future, as it might lead only to marginal improvements and needs to be accurately benchmarked.