-
Yes, let's make this more explicit. When we had the first meeting I had a completely twisted memory of how the bitstream is implemented, and I now believe that it is essential to change the bit endianness of the current format, which was conceived around Java's inherently big-endian design. The modifications are as follows:
This has the consequence that we will not be able to convert between the two formats just by reversing the bits in each byte, and in particular that no reasonably fast on-the-fly conversion will be possible (but I might be wrong on this). More precisely: if we write the ɣ code for 4, namely 00101, in a bit stream, presently we get the byte 00101000. We would like to switch to a format where the same code is written as 00001100. Note that the unary part is reversed, but the "payload" of the ɣ code is always written in the same direction, which is essential for quick insertion and extraction. @zommiommy, please correct me if I'm wrong.
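To make the byte layouts above concrete, here is a tiny sketch (my own illustration, not webgraph code) that packs the ɣ code of 4 into a byte under both bit orders:

```rust
fn main() {
    // MSB-to-LSB (current, Java-style) order: bits fill the byte from the
    // most significant position down. Writing 0,0,1,0,1 and padding with
    // zeros yields 0b00101000.
    let mut be: u8 = 0;
    for (i, bit) in [0u8, 0, 1, 0, 1].into_iter().enumerate() {
        be |= bit << (7 - i);
    }
    assert_eq!(be, 0b00101000);

    // LSB-to-MSB (proposed) order: the unary part 0,0,1 fills bits 0..3 in
    // writing order, while the 2-bit payload 01 is stored as a value in the
    // next positions, so it can be extracted with a single mask and shift.
    let mut le: u8 = 0;
    le |= 1 << 2;    // unary part: two zeros, then a one in bit 2
    le |= 0b01 << 3; // payload 0b01 in bits 3..5, same direction as before
    assert_eq!(le, 0b00001100);
}
```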
-
It's great that you have designed a new/better format. YAY for fresh looks at old problems!
We can still have a batch and slow conversion tool from the current webgraph-java format to the new upcoming webgraph-rs format, correct?
-
Yes. I was thinking about this issue this morning. I think that the correct thing to do is to work on readers and writers in parallel, for both formats. It is a bit more work initially, but we have the advantage of being able to unit-test reading and writing in both formats from the start, and in particular converting graphs. (I'm not saying to develop compression now, just bit-writing code, even not particularly efficient code.) An alternative is to write the conversion software in Java. But I think that it would also be useful to have a big-endian Rust version, as people would be able to use the current graphs (more slowly, if my intuition is correct). So I would suggest that @zommiommy starts in parallel a reader and writer for the big- and little-endian formats.

For the WordReader, I guess a const template parameter is sufficient. In the end, the only difference is a single call to the byte-rearrangement function when the native endianness and the format of the file disagree. We will have to add an endianness field to the graph.

For the bit reader, the code is slightly different, as sketched below. In the big-endian case we keep the bit buffer filled "at the top" (i.e., the most significant bits are correct), whereas in the little-endian case we keep the bit buffer filled "at the bottom" (i.e., the least significant bits are correct). In the big-endian case, to read a unary code we count the number of leading zeros, whereas in the little-endian case we count the number of trailing zeros. In the big-endian case, to extract k bits we take the k most significant bits of the bit buffer, whereas in the little-endian case we take the k least significant bits. But this is where the differences end: all other codes can be read using unary codes and bit blocks, so the rest of the code would be common (including minimal binary codes, albeit in the end the sequence of written bits would be different in the two cases; Tommaso, we can talk about this).

Note that the weird lack of symmetry between little and big endianness in my previous message is due to the same lack of symmetry at the hardware level: big-endian and little-endian architectures order bits in a byte in the same way, and that is essentially a little-endian way, as bit 0 is the least significant bit.
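As a rough illustration of the two disciplines, here is a minimal sketch; the struct names, the fixed 64-bit buffer, and the omission of refilling and bounds checks are all my simplifications, not the actual webgraph-rs design:

```rust
/// Big-endian discipline: the buffer is filled "at the top", i.e. the next
/// bits to be read are the most significant ones.
struct BigEndianReader {
    buffer: u64,
}

impl BigEndianReader {
    /// Unary code: count zeros from the top, then consume them plus the one.
    fn read_unary(&mut self) -> u32 {
        let zeros = self.buffer.leading_zeros();
        self.buffer <<= zeros + 1;
        zeros
    }

    /// Extract the k most significant bits of the buffer (0 < k < 64).
    fn read_bits(&mut self, k: u32) -> u64 {
        let bits = self.buffer >> (64 - k);
        self.buffer <<= k;
        bits
    }
}

/// Little-endian discipline: the buffer is filled "at the bottom", i.e. the
/// next bits to be read are the least significant ones.
struct LittleEndianReader {
    buffer: u64,
}

impl LittleEndianReader {
    /// Unary code: count zeros from the bottom, then consume them plus the one.
    fn read_unary(&mut self) -> u32 {
        let zeros = self.buffer.trailing_zeros();
        self.buffer >>= zeros + 1;
        zeros
    }

    /// Extract the k least significant bits of the buffer (0 < k < 64).
    fn read_bits(&mut self, k: u32) -> u64 {
        let bits = self.buffer & ((1 << k) - 1);
        self.buffer >>= k;
        bits
    }
}

fn main() {
    // The unary code 001 (two zeros, then a one) decodes to 2 in both cases.
    let mut be = BigEndianReader { buffer: 0b001 << 61 };
    let mut le = LittleEndianReader { buffer: 0b100 };
    assert_eq!(be.read_unary(), 2);
    assert_eq!(le.read_unary(), 2);
}
```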
-
Ok, implemented both readers; note that I explicitly use . Something I'm not sure about yet is whether we should have a single struct with a const generic to set the endianness, or two different structs. My concerns are the following:
-
Maybe having a macro for the time being, hoping that .

I'd say that a reasonable immediate target is to arrive at a point where we have unit tests writing and reading ɣ codes in both formats; a sketch of such a round-trip test is below. Then the rest should be relatively easy, and at that point we can also do some benchmarking.
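For concreteness, here is a self-contained sketch of the kind of round-trip unit test meant above, using a toy LSB-to-MSB bit stream of my own (the `BitStream` type and its methods are hypothetical, not the webgraph-rs API):

```rust
/// A toy LSB-to-MSB bit stream backed by a plain bit vector: good enough
/// for round-trip tests, not meant to be efficient.
#[derive(Default)]
struct BitStream {
    bits: Vec<bool>,
    pos: usize,
}

impl BitStream {
    fn write_unary(&mut self, n: u64) {
        for _ in 0..n {
            self.bits.push(false);
        }
        self.bits.push(true);
    }

    /// Write the k low bits of `value`, least significant bit first.
    fn write_bits(&mut self, value: u64, k: u32) {
        for i in 0..k {
            self.bits.push((value >> i) & 1 == 1);
        }
    }

    /// ɣ code of x >= 0, written as the code of v = x + 1.
    fn write_gamma(&mut self, x: u64) {
        let v = x + 1;
        let len = 63 - v.leading_zeros(); // floor(log2 v)
        self.write_unary(len as u64);
        self.write_bits(v, len); // the len low bits of v
    }

    fn read_unary(&mut self) -> u64 {
        let mut n = 0;
        while !self.bits[self.pos] {
            n += 1;
            self.pos += 1;
        }
        self.pos += 1;
        n
    }

    fn read_bits(&mut self, k: u32) -> u64 {
        let mut value = 0;
        for i in 0..k {
            value |= (self.bits[self.pos] as u64) << i;
            self.pos += 1;
        }
        value
    }

    fn read_gamma(&mut self) -> u64 {
        let len = self.read_unary() as u32;
        ((1 << len) | self.read_bits(len)) - 1
    }
}

fn main() {
    let mut s = BitStream::default();
    for x in 0..1000u64 {
        s.write_gamma(x);
    }
    for x in 0..1000u64 {
        assert_eq!(s.read_gamma(), x);
    }
    println!("ɣ round-trip OK");
}
```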
-
In code, the alternatives I see are:

Generic const enum (sorry, the feature I mentioned is the wrong one; the right one is `#![feature(adt_const_params)]`):

```rust
#![feature(adt_const_params)] // nightly only

#[derive(PartialEq, Eq)]
enum Endianess {
    Big,
    Little,
}

struct Reader<const E: Endianess>;

fn main() {
    let r = Reader::<{Endianess::Big}>;
}
```

Generic const bool:

```rust
struct Reader<const LittleEndian: bool>;
```

Two structs:

```rust
struct ReaderLittleEndian;
struct ReaderBigEndian;
```

Define two types and use them as generics:

```rust
trait Endianess {}
struct LittleEndian;
struct BigEndian;
impl Endianess for LittleEndian {}
impl Endianess for BigEndian {}

struct Reader<E: Endianess>(std::marker::PhantomData<E>);

fn main() {
    let r = Reader::<LittleEndian>(std::marker::PhantomData);
}
```
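For what it's worth, `adt_const_params` is a nightly-only feature, while the last, marker-type alternative works on stable Rust; it is the same pattern used, for example, by the `byteorder` crate's `BigEndian`/`LittleEndian` types.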
-
Sure, let's start with any one of them, and then, once the proof of concept works, migrate the code to the format we hope is the best.
-
I'd follow the
-
Status update: currently both readers and writers for . The current performance is as follows on my laptop (Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz):

I will now proceed to add the use of tables to speed up the Gamma code.
-
@vigna @zacchiro I've implemented the tables for unary and gamma codes. I used . I ran these benchmarks on my desktop with a Ryzen 3900x:

I'm executing a longer benchmark with 100_000 measurement iterations to validate these results.
-
Now they look better! We decided to use 8-bit tables for gamma and no tables for unary.
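For the record, here is a minimal sketch of what such an 8-bit ɣ table could look like in the current MSB-to-LSB order (hypothetical code, not the actual webgraph-rs implementation): each entry holds a (decoded value, code length) pair, and a length of 0 marks codes longer than 8 bits that must take the slow unary-plus-bit-block path.

```rust
/// Build a 256-entry table mapping the next 8 bits of the stream to the
/// (decoded value, code length) of the ɣ code starting there, if it fits.
/// Entries left at (0, 0) mark codes longer than 8 bits (slow path).
fn build_gamma_table() -> Vec<(u8, u8)> {
    let mut table = vec![(0u8, 0u8); 256];
    // In MSB-to-LSB order the ɣ code of v >= 1 is just v written in
    // 2 * floor(log2 v) + 1 bits, so the codes of v = 1..=15 fit in a byte.
    for v in 1u32..16 {
        let len = 31 - v.leading_zeros(); // floor(log2 v)
        let code_len = 2 * len + 1;
        let hi = (v << (8 - code_len)) as usize; // code aligned at the top
        // Every byte starting with this code decodes the same way.
        for low in 0..(1usize << (8 - code_len)) {
            table[hi | low] = ((v - 1) as u8, code_len as u8); // ɣ(x + 1) = v
        }
    }
    table
}

fn main() {
    let table = build_gamma_table();
    // The byte from the example at the top of the thread: 0b00101000 is the
    // ɣ code of 4 (i.e., 00101) plus padding; it decodes in a single lookup.
    assert_eq!(table[0b00101000], (4, 5));
}
```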
-
I just noticed that I didn't notice that the . This is just a methodological issue: I do not expect the results of our tests to change.
-
During today's review of the code we noticed that we read codes using an MSB-to-LSB order, which implies the need to use big-endian words. Since most modern computers are little-endian, we might want to add backends supporting an LSB-to-MSB (thus little-endian) order, which might result in better performance.
We will discuss this further in the future, as it might lead only to marginal improvements and needs to be accurately benchmarked.