tantivy document memory experiment #2371

PSeitz commented Apr 23, 2024

Some tests regarding the memory consumption of TantivyDocument.

Experiment

Parse data set lines and store all Documents in a Vec.
hdfs: 3 fields (timestamp, body, severity), raw dataset 22MB
gh: all fields in a single JSON field (dynamic mode), raw dataset 2.3MB

Note: For hdfs, the root-level fields are stored as a Field id instead of as a string.
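How the "Peak Memory" figures are collected is not shown in this description. One common way to get such a number is a counting global allocator, and the following is a minimal sketch under that assumption. `PeakAlloc` and `peak_bytes_during` are hypothetical names for illustration, not part of tantivy or of this PR.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical allocator wrapper that tracks live and peak heap bytes.
struct PeakAlloc;

static CURRENT: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for PeakAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            // Record the new live total and raise the high-water mark if needed.
            let now = CURRENT.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        CURRENT.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: PeakAlloc = PeakAlloc;

/// Runs `f` and returns its result plus the peak number of heap bytes
/// allocated above the level that was live when `f` started.
fn peak_bytes_during<T>(f: impl FnOnce() -> T) -> (T, usize) {
    let base = CURRENT.load(Ordering::Relaxed);
    PEAK.store(base, Ordering::Relaxed);
    let out = f();
    let peak = PEAK.load(Ordering::Relaxed).saturating_sub(base);
    (out, peak)
}
```

Wrapping the "parse every line and push the document into a Vec" loop in `peak_bytes_during` would yield one number per document representation. The counters are process-wide, so the measurement is only meaningful when no other thread allocates heavily at the same time.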

Variant1: TantivyDocumentMedVec

Replace the Vec in OwnedValue with 32-bit versions of the Vec, and drop the Facet and PreTokStr variants.
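To make the size difference concrete, here is a small sketch of why a vector with 32-bit length and capacity has a smaller header on a 64-bit target. `Vec32Layout` is a hypothetical stand-in that only mirrors the general shape (pointer plus two `u32` fields); it is not the actual definition of `mediumvec::Vec32`.

```rust
use std::mem::size_of;
use std::ptr::NonNull;

/// Hypothetical layout of a vector header with 32-bit length and capacity.
/// Only the field sizes matter here; no allocation logic is implemented.
#[allow(dead_code)]
struct Vec32Layout<T> {
    ptr: NonNull<T>, // 8 bytes on a 64-bit target
    len: u32,        // 4 bytes
    cap: u32,        // 4 bytes
}

/// Returns (size of std Vec header, size of the 32-bit header) in bytes.
fn header_sizes() -> (usize, usize) {
    (size_of::<Vec<u8>>(), size_of::<Vec32Layout<u8>>())
}

/// Header bytes saved when `n` vectors switch to the 32-bit layout.
fn header_savings(n: usize) -> usize {
    let (std_size, small_size) = header_sizes();
    n * (std_size - small_size)
}
```

On a 64-bit target `header_sizes()` returns 24 and 16 bytes. The price is that each vector can hold at most `u32::MAX` elements, which is an acceptable limit for the values of a single document.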

Variant2: DocContainerRef

All nodes store their data in two vecs and only reference positions within them.

#[derive(Default)]
struct OwnedValueRefContainer {
    nodes: mediumvec::Vec32<ValueContainerRef>,
    node_data: mediumvec::Vec32<u8>,
}
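The idea behind the container can be sketched with plain `Vec`s. The node type below is an assumption made for illustration: the PR does not show the layout of `ValueContainerRef`, so `NodeRef`, `Kind`, `DocArena`, and `ValueView` are hypothetical names, and only two value kinds are covered.

```rust
use std::convert::{TryFrom, TryInto};

/// Hypothetical value kinds; the real container supports more.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Str,
    U64,
}

/// A node owns no heap data; it points into the shared byte buffer.
#[derive(Clone, Copy, Debug)]
struct NodeRef {
    kind: Kind,
    offset: u32,
    len: u32,
}

/// All values of one document live in two vectors.
#[derive(Default)]
struct DocArena {
    nodes: Vec<NodeRef>,
    node_data: Vec<u8>,
}

/// Borrowed view of a stored value.
#[derive(Debug, PartialEq)]
enum ValueView<'a> {
    Str(&'a str),
    U64(u64),
}

impl DocArena {
    /// Appends raw bytes to the buffer and records a node that references them.
    fn push_bytes(&mut self, kind: Kind, bytes: &[u8]) -> usize {
        let offset = u32::try_from(self.node_data.len()).expect("buffer exceeds u32 range");
        let len = u32::try_from(bytes.len()).expect("value exceeds u32 range");
        self.node_data.extend_from_slice(bytes);
        self.nodes.push(NodeRef { kind, offset, len });
        self.nodes.len() - 1
    }

    fn push_str(&mut self, s: &str) -> usize {
        self.push_bytes(Kind::Str, s.as_bytes())
    }

    fn push_u64(&mut self, v: u64) -> usize {
        self.push_bytes(Kind::U64, &v.to_le_bytes())
    }

    /// Decodes the value behind node `idx`, or `None` if it is missing or invalid.
    fn get(&self, idx: usize) -> Option<ValueView<'_>> {
        let node = self.nodes.get(idx)?;
        let start = node.offset as usize;
        let bytes = self.node_data.get(start..start + node.len as usize)?;
        match node.kind {
            Kind::Str => std::str::from_utf8(bytes).ok().map(ValueView::Str),
            Kind::U64 => Some(ValueView::U64(u64::from_le_bytes(bytes.try_into().ok()?))),
        }
    }
}
```

With this shape a document costs two heap allocations regardless of how many values it holds, instead of one allocation per string or nested vector. The trade-off is the extra complexity of encoding and decoding values through offsets.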

Results

cargo run --example doc_mem
[examples/doc_mem.rs:21:5] std::mem::size_of::<TantivyDocument>() = 24
[examples/doc_mem.rs:22:5] std::mem::size_of::<DocContainerRef>() = 48
[examples/doc_mem.rs:23:5] std::mem::size_of::<OwnedValue>() = 48
[examples/doc_mem.rs:24:5] std::mem::size_of::<OwnedValueMedVec>() = 24
[examples/doc_mem.rs:25:5] std::mem::size_of::<ValueContainerRef>() = 12
[examples/doc_mem.rs:26:5] std::mem::size_of::<mediumvec::vec32::Vec32<u8>>() = 16
Peak Memory 42308307 : "hdfs TantivyDocument"
Peak Memory 28708435 : "hdfs TantivyDocumentMedVec "
Peak Memory 27555817 : "hdfs DocContainerRef "
Peak Memory 6555583 : "gh TantivyDocument"
Peak Memory 4668215 : "gh TantivyDocumentMedVec "
Peak Memory 3533176 : "gh DocContainerRef "
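For easier comparison, the peak figures above can be turned into relative reductions against the TantivyDocument baseline. The helper below is a small sketch that only uses the numbers from this output; `reduction_percent` and `reductions` are names chosen for illustration.

```rust
/// Peak memory in bytes, copied from the output above:
/// (label, TantivyDocument baseline, variant).
const RESULTS: [(&str, u64, u64); 4] = [
    ("hdfs TantivyDocumentMedVec", 42_308_307, 28_708_435),
    ("hdfs DocContainerRef", 42_308_307, 27_555_817),
    ("gh TantivyDocumentMedVec", 6_555_583, 4_668_215),
    ("gh DocContainerRef", 6_555_583, 3_533_176),
];

/// Percentage by which `variant` is smaller than `baseline`.
fn reduction_percent(baseline: u64, variant: u64) -> f64 {
    (1.0 - variant as f64 / baseline as f64) * 100.0
}

/// Relative reduction for every variant in `RESULTS`.
fn reductions() -> Vec<(&'static str, f64)> {
    RESULTS
        .iter()
        .map(|&(label, baseline, variant)| (label, reduction_percent(baseline, variant)))
        .collect()
}
```

This works out to roughly 32% and 35% for hdfs, and roughly 29% and 46% for gh.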

Conclusion

There should be some easy gains from using 32-bit vecs, which use only 16 bytes instead of 24 bytes.
DocContainerRef could provide additional gains, but adds some complexity.

quickwit-oss/quickwit#4890

PSeitz changed the title from "tantivy document memory test" to "tantivy document memory experiment" on Apr 23, 2024

PSeitz commented May 20, 2024

Peak Memory 42308307 : "hdfs TantivyDocument"
Peak Memory 28708435 : "hdfs TantivyDocumentMedVec"
Peak Memory 25155841 : "hdfs DocContainerRef"
Peak Memory 25456237 : "hdfs CompactDoc" // Current version in PR https://github.com/quickwit-oss/tantivy/pull/2402
Peak Memory 27857662 : "hdfs RkyvDoc"         // zero deserialization rkyv
Peak Memory 21055858 : "hdfs PostcardDoc" // postcard serialized
Peak Memory 20106059 : "hdfs ZstdDoc"         // postcard + Zstd
Peak Memory 22555843 : "hdfs BinarySerializable"
Peak Memory 25309370 : "hdfs JsonSerialized"
Peak Memory 6555583 : "gh TantivyDocument"
Peak Memory 4668215 : "gh TantivyDocumentMedVec"
Peak Memory 2735326 : "gh DocContainerRef"
Peak Memory 2543967 : "gh CompactDoc"
Peak Memory 3274042 : "gh RkyvDoc"
Peak Memory 2197615 : "gh PostcardDoc"
Peak Memory 862839 : "gh ZstdDoc"
Peak Memory 2325673 : "gh BinarySerialized"
Peak Memory 2508695 : "gh JsonSerialized"
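The serialized variants keep each document as one byte blob and decode it on access, which removes per-value pointers and spare capacity. The sketch below illustrates that idea with a hand-rolled, varint length-prefixed encoding. It is an assumption for illustration only: it is not the postcard wire format, not tantivy's BinarySerializable, and the `(field id, text)` document shape is hypothetical.

```rust
use std::convert::TryFrom;

/// Writes `v` as a little-endian base-128 varint.
fn write_varint(out: &mut Vec<u8>, mut v: u64) {
    while v >= 0x80 {
        out.push((v as u8 & 0x7f) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

/// Reads a varint starting at `*pos` and advances `*pos` past it.
fn read_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        if shift >= 64 {
            return None; // more groups than a u64 can hold
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
}

/// Encodes (field id, text) pairs into a single exactly-sized allocation.
fn encode(fields: &[(u32, String)]) -> Box<[u8]> {
    let mut out = Vec::new();
    write_varint(&mut out, fields.len() as u64);
    for (id, text) in fields {
        write_varint(&mut out, u64::from(*id));
        write_varint(&mut out, text.len() as u64);
        out.extend_from_slice(text.as_bytes());
    }
    out.into_boxed_slice()
}

/// Decodes a blob produced by `encode`; returns `None` on malformed input.
fn decode(buf: &[u8]) -> Option<Vec<(u32, String)>> {
    let mut pos = 0;
    let count = read_varint(buf, &mut pos)?;
    let mut fields = Vec::new();
    for _ in 0..count {
        let id = u32::try_from(read_varint(buf, &mut pos)?).ok()?;
        let len = usize::try_from(read_varint(buf, &mut pos)?).ok()?;
        let end = pos.checked_add(len)?;
        let text = std::str::from_utf8(buf.get(pos..end)?).ok()?;
        fields.push((id, text.to_owned()));
        pos = end;
    }
    Some(fields)
}
```

A document with the two fields `(0, "INFO")` and `(1, "block served")` encodes to 21 bytes here, compared with two 24-byte `String` headers plus their heap buffers in the owned form. The cost moves to access time, since every read has to decode the blob.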
