Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creation of FramesManager with Ghost Frame Optimization #68

Closed
wants to merge 2 commits into from

Conversation

Daniil-Golikov
Copy link
Contributor

Overview
As outlined in our project roadmap, this pull request introduces the fully implemented FramesManager. This module incorporates our anticipated Ghost Frames, which are designed to enhance memory efficiency by avoiding the storage of duplicate frame data.

Features

  • Ghost Frame Optimization: Implements a smart storage mechanism to ensure that only unique frames are stored and managed in memory.
  • HashMap Storage: Utilizes Rust's HashMap to keep track of frames by their hashes
  • Reference Counting with Arc: Leverages the atomic reference counting features of Arc to allow multiple references to the same frame without duplicating the underlying data.

Implementation

The FramesManager is built around two core data structures:

  • A Vec<Arc> to maintain the insertion order of frames.
  • A HashMap<usize, Arc> to reference frames by their hash values.

When a new frame is added, we compute its hash and utilize the entry API for efficient handling of existing or new frames. If a frame with the same hash is found, we reuse the Arc pointing to that frame; otherwise, we create a new Arc and store it.

Test Details
The added tests validate various scenarios, encompassing both the compression and decompression processes, as well as the reference handling of frames:

  • test_constant_decompression_many_different_chunks: This test ensures that the decompression yields the correct output from multiple different compressed chunks.

  • test_many_same_chunks: Validates that the FramesManager correctly handles the addition of frames, expecting two separate frames when identical chunks are compressed.

  • test_constant_decompression_many_same_chunks: Checks that decompression is accurate when multiple identical chunks are compressed, ensuring the output is as expected.

  • ref_check_same_chunks: Uses Arc::ptr_eq to confirm that identical chunks result in the same Arc reference within the FramesManager, proving that the Ghost Frame logic is correctly identifying and managing duplicate frames.

  • ref_check_different_chunks: Confirms that different chunks do not share the same Arc reference, which would validate the FramesManager's ability to distinguish and store unique frames separately.

@Daniil-Golikov
Copy link
Contributor Author

Daniil-Golikov commented Nov 6, 2023

Thoughts on the topic.

Should to_bytes and from_bytes dump/read the entire object into/from bytes? Wouldn't it be better to take individual fields from it? Our results differ from the Rust implementation of this class now, even if compressor and data are the same, is this the desired behavior?

@Daniil-Golikov
Copy link
Contributor Author

Also, I am contemplating whether ghost frames actually enhance performance. I believe we need to conduct tests on this matter because I implemented this feature since it was in our weekly plan. However, from the project's perspective, how sensible is such a mechanic?

Copy link
Collaborator

@cjrolo cjrolo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going over your implementation (which is very solid) made me do some math and think if we can do better.

ConstantFrame without residuals (best case we have) will have 9 Bytes at worst case (can be as small as 2 Bytes).
The Frame manager will store, not only the Frame but also the Hashmap. The hashmap has always 16 bytes (!!). 8 for the hash, 8 for the pointer.

So, on the absolute best case, the Ghost Frame structure is between 2x and 8x the size of our best case. In other words, we would need to replace 3 to 9 frames per file for this to make sense.

My suggestion since we have a bench data done already. Run this branch against it, see what we get out of that, and do a data guided approach?

Let's keep this discussion going. I love the implementation simplicity.

@@ -94,18 +95,18 @@ mod tests {
let mut cs = CompressedStream::new();
cs.compress_chunk_with(&vector1, Compressor::Constant);
let b = cs.to_bytes();
assert_eq!(b, [66, 82, 82, 79, 0, 1, 41, 251, 0, 4, 3, 3, 0, 2, 0]);
assert_eq!(b, [66, 82, 82, 79, 0, 1, 41, 251, 0, 4, 3, 3, 0, 2, 0, 1, 253, 184, 80, 252, 43, 206, 253, 150, 101, 41, 251, 0, 4, 3, 3, 0, 2, 0]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it more than doubles our best data usage. Although our best case is EXCEPTIONALLY good, we must make sure this doesn't propagate (as, in, this is the cost of the static structure). Otherwise, the ghost frames might not bring a ton (or any decent benefit). If they do bring (I do still believe they do), and this space usage is creating issues, we might have to look into other structures.

@cjrolo
Copy link
Collaborator

cjrolo commented Nov 6, 2023

Another option is we throw RKYV at this, it does support pointers well: https://docs.rs/rkyv/latest/rkyv/ (We would have to replace bincode that does a better job regarding data size).

@cjrolo
Copy link
Collaborator

cjrolo commented Nov 28, 2024

Closing since this was not improving performance but was an excelent step into pursuing other ways.

@cjrolo cjrolo closed this Nov 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants