-
Hi! I don’t see an issue with binrw itself in this report, so I’m not sure I understand what is being requested here. Could you clarify what outcome you’re looking for? Was this meant to be opened as a Q&A discussion instead of a ticket? Let me know. Thanks!
-
This might be a question pointing at a documentation gap, or it could be a limitation of the existing API. I'm not sure whether it's fixable without some buffering or an API change, or whether it's just a limitation a developer needs to keep in mind.

The cookbook examples for validating checksums on read and for calculating checksums on write both use the same `map_stream` pattern, and my implementation of a checksum (based on that cookbook) in my first comment follows it: whenever there's a read, feed the bytes that were read into the hash. That pattern makes an assumption: reads are always sequential, and the stream is never seeked back over (or forward past) bytes that are supposed to be hashed.
I've got a highly-contrived example here. In it, I've put some seeks and backtracking to simulate the kind of non-sequential access that a real format can force the parser into. One other way around it in my highly-contrived example would be to buffer the checksummed region and hash it as raw bytes before parsing it.
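To make the failure mode concrete, here is a minimal, self-contained sketch of that cookbook-style pattern using only std traits. `Checksum` and its toy additive sum are illustrative stand-ins I've made up, not binrw API; a real implementation would wrap a proper digest:

```rust
use std::io::{Cursor, Read, Result, Seek, SeekFrom};

// Cookbook-style wrapper: every byte that passes through read() is fed to
// the hash, with no awareness of the stream position.
struct Checksum<R> {
    inner: R,
    sum: u64, // toy additive checksum, standing in for a real digest
}

impl<R> Checksum<R> {
    fn new(inner: R) -> Self {
        Self { inner, sum: 0 }
    }
}

impl<R: Read> Read for Checksum<R> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        let n = self.inner.read(buf)?;
        for &b in &buf[..n] {
            self.sum = self.sum.wrapping_add(u64::from(b));
        }
        Ok(n)
    }
}

impl<R: Seek> Seek for Checksum<R> {
    fn seek(&mut self, pos: SeekFrom) -> Result<u64> {
        // Seeks pass straight through: the hash state is NOT rewound,
        // so backtracking double-counts whatever gets re-read.
        self.inner.seek(pos)
    }
}

fn main() -> Result<()> {
    let data = [1u8, 2, 3, 4];

    // Sequential read: every byte is hashed exactly once.
    let mut seq = Checksum::new(Cursor::new(data));
    let mut buf = [0u8; 4];
    seq.read_exact(&mut buf)?;
    assert_eq!(seq.sum, 10);

    // Backtracking read: the re-read bytes are hashed twice.
    let mut back = Checksum::new(Cursor::new(data));
    back.read_exact(&mut buf[..2])?; // bytes 1, 2
    back.seek(SeekFrom::Start(0))?;
    back.read_exact(&mut buf)?; // bytes 1, 2, 3, 4 again
    assert_eq!(back.sum, 13); // (1+2) + (1+2+3+4), not 10
    Ok(())
}
```

The second assertion is the whole problem in miniature: one backwards seek and the computed checksum no longer matches the bytes on disk.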
-
Thank you for your thoughts. I think there is a misunderstanding about the documentation that might be causing some confusion here. https://docs.rs/binrw is a reference guide, not a cookbook. Examples aren’t recipes; they are demonstrations to help authors understand how a feature might be used. They aren’t intended to cover every possible situation. The implementation of the checksumming stream in the documentation is deliberately elided because the goal is to show how `map_stream` is used, not to provide a complete hashing implementation.

Depending on what is covered by the MAC—i.e. whether or not whatever you’re seeking to during parsing is supposed to be included or not—you could just track the last hashed byte position in your stream implementation and only add bytes to the hash when the position being read matches, so non-sequential reads are ignored when hashing. You could also use wrapper types like:

```rust
#[derive(BinRead)]
#[br(stream = s, map_stream = HashStream::new)]
struct Hashed<T> where T: BinRead {
    inner: T,
    #[br(calc(s.hash()))]
    computed_hash: [u8; 32]
}

#[derive(BinRead)]
struct StoredHash<T> where T: BinRead {
    stored_hash: [u8; 32],
    #[br(assert(stored_hash == value.computed_hash))]
    value: Hashed<T>,
}
```

And then avoid non-sequential reads by storing the offsets and lazy-loading, or by using some of the other helpers the library provides. Or some combination of these approaches.

However, if you are trying to do a verify-then-parse, you will never be able to avoid two passes, since you can’t start parsing until you have verified the message. In this case, all of this is moot and you have no choice but to read the raw data once to calculate the hash, and then read it a second time when you are parsing.

In any case, I don’t know of anything binrw can really do to make this easier, since it doesn’t seem like you’re describing something that can be solved more easily than it already is in the generic case. Even if there were a reasonable way of mapping a stream only for some fields and then consuming the stream to retrieve a value from it (I can’t think of a way to do this that wouldn’t be shitty, since whatever grammar is used needs to be compatible with a normal Rust struct grammar), that still won’t get you what you want when it comes to sequentially hashing bytes that aren’t being read sequentially.

Let me know if this makes sense or if you have any other questions. I’ll convert this to a discussion since it isn’t reporting a specific defect in binrw but is more a question about how to parse a particular data format, which is what discussions are best for. Thanks!
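The "track the last hashed byte position" suggestion can be sketched like this. It is a hand-written illustration using plain std traits and a toy FNV-1a-style hash, not binrw's actual stream wrapper types; `SequentialHasher` and its field names are invented for the example:

```rust
use std::io::{Cursor, Read, Result, Seek, SeekFrom};

// Sketch: only bytes at the hashing "frontier" are fed to the hash, so
// bytes re-read after a backwards seek are ignored.
struct SequentialHasher<R> {
    inner: R,
    pos: u64,          // current stream position
    hashed_up_to: u64, // everything before this offset is already hashed
    hash: u64,         // toy FNV-1a state, standing in for a real digest
}

impl<R> SequentialHasher<R> {
    fn new(inner: R) -> Self {
        Self { inner, pos: 0, hashed_up_to: 0, hash: 0xcbf2_9ce4_8422_2325 }
    }
    fn hash(&self) -> u64 {
        self.hash
    }
}

impl<R: Read> Read for SequentialHasher<R> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        let n = self.inner.read(buf)?;
        let start = self.pos;
        self.pos += n as u64;
        // Hash only the part of this read that advances the frontier; reads
        // entirely behind it (or beyond it after a forward seek) are skipped.
        if start <= self.hashed_up_to && self.pos > self.hashed_up_to {
            let skip = (self.hashed_up_to - start) as usize;
            for &b in &buf[skip..n] {
                self.hash = (self.hash ^ u64::from(b)).wrapping_mul(0x0000_0100_0000_01b3);
            }
            self.hashed_up_to = self.pos;
        }
        Ok(n)
    }
}

impl<R: Seek> Seek for SequentialHasher<R> {
    fn seek(&mut self, pos: SeekFrom) -> Result<u64> {
        self.pos = self.inner.seek(pos)?;
        Ok(self.pos)
    }
}

fn main() {
    let data = *b"abcdefgh";

    // Straight sequential read.
    let mut a = SequentialHasher::new(Cursor::new(data));
    let mut buf = [0u8; 8];
    a.read_exact(&mut buf).unwrap();

    // Same data, but with a backwards seek and a re-read in the middle.
    let mut b = SequentialHasher::new(Cursor::new(data));
    b.read_exact(&mut buf[..4]).unwrap();
    b.seek(SeekFrom::Start(0)).unwrap();
    b.read_exact(&mut buf[..4]).unwrap(); // re-read: ignored by the hash
    b.read_exact(&mut buf[4..]).unwrap();

    assert_eq!(a.hash(), b.hash());
}
```

With binrw, the same bookkeeping would live inside the stream type you pass to `map_stream`; whether skipped regions should instead be treated as an error depends on whether the MAC is supposed to cover them.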
-
I tried to base my checksum handling on this approach, but ran into a problem with enums: when the first variant fails partway through, binrw seeks back and tries the next one, yet the bytes already read were fed to the hash. So the checksum ends up calculated over a subset of bytes from the variant that was tried first, in addition to the bytes of the variant that eventually parses.

I think it would be worth adding a cautionary note to the documentation example for the checksum, because in all but the most trivial protocols binrw is going to have to seek, and that makes checksum calculation with this method impossible. Moreover, it has taken me hours to figure out why...
-
I'm implementing a reader and writer for an existing file format which is laid out something like this: a 32-byte checksum at the start of the file, followed by the payload from byte 32 onwards.
Based on the `map_stream` examples (which have a hash at the end of the file), I've ended up with a wrapper following the same pattern. This works, but there are edge cases with this approach:

- `read()` assumes that reads are always sequential from byte 32 onwards, and that the stream is never `seek()`ed to a position after byte 32.
- `write()` also assumes that writes are always sequential from byte 32 onwards, and that the stream is never `seek()`ed to a position after byte 32.

I think the `reader_var` test case also makes the assumption of sequential reads, though it's got the checksum field at the end.

Tangential to this, it'd be nice to be able to use this to verify a MAC, and do that verification before deserialising other data structures. However, this would require two read passes on a file, or buffering the entire file (which could be large) in memory.
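The two-pass MAC verification mentioned in the last paragraph might look like the following sketch. The 8-byte additive checksum, the layout, and `verify_then_parse` are all illustrative stand-ins, not this file format or binrw API; a real implementation would use a proper MAC and hand the buffered payload to binrw for the second pass:

```rust
use std::io::{Cursor, Read};

// Assumed toy layout: an 8-byte little-endian checksum of the payload,
// followed by the payload itself.
fn verify_then_parse(raw: &[u8]) -> Option<Vec<u8>> {
    if raw.len() < 8 {
        return None;
    }
    let (header, payload) = raw.split_at(8);
    let stored = u64::from_le_bytes(header.try_into().ok()?);

    // Pass 1: hash the raw payload bytes before deserialising anything.
    let computed: u64 = payload.iter().map(|&b| u64::from(b)).sum();
    if stored != computed {
        return None; // reject before parsing starts
    }

    // Pass 2: parse from the already-buffered bytes. With binrw this would
    // be a read from Cursor::new(payload) instead of read_to_end.
    let mut parsed = Vec::new();
    Cursor::new(payload).read_to_end(&mut parsed).ok()?;
    Some(parsed)
}

fn main() {
    let payload = [10u8, 20, 30];
    let mut file = 60u64.to_le_bytes().to_vec(); // 10 + 20 + 30
    file.extend_from_slice(&payload);

    assert_eq!(verify_then_parse(&file).as_deref(), Some(&payload[..]));

    file[8] ^= 0xff; // corrupt the payload: verification now fails
    assert_eq!(verify_then_parse(&file), None);
}
```

The cost is the buffering described above: the payload has to be held in memory (or read twice from disk) because parsing cannot begin until the whole region has been verified.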