Slides: PowerPoint, pdf
Poster: VLDB 2024
Code: see below
Bf-Tree is a modern read-write-optimized concurrent larger-than-memory range index.
- Modern: designed for modern SSDs and implemented in a modern programming language (Rust).
- Read-write-optimized: for small records (e.g., ~100 bytes), 2.5× faster than RocksDB for scan operations, 6× faster than a B-Tree for write operations, and 2× faster than both B-Trees and LSM-Trees for point lookups.
- Concurrent: scales to high thread counts.
- Larger-than-memory: scales to large data sets.
- Range index: records are sorted.
The core of Bf-Tree is the mini-page abstraction and the buffer pool that supports it.
Thank you for your time on Bf-Tree; I'm all ears for feedback!
Mini-pages are finer-grained than a page-level cache, which makes them more efficient at identifying individual hot records.
Mini-pages absorb record writes and batch-flush them to disk.
Mini-pages grow and shrink in size so that memory usage is more precise.
Mini-pages are flushed to disk when they are too large or too cold.
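The mini-page behavior described above can be sketched as a small sorted write buffer sitting in front of one disk page. This is only an illustration, not Bf-Tree's actual code: `MiniPage`, `DiskPage`, and the `max_bytes` threshold are hypothetical names, the "too cold" trigger is left out, and a real mini-page would use a compact byte layout rather than a `BTreeMap`.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for an on-disk leaf page (the real layout is not shown here).
#[derive(Default)]
pub struct DiskPage {
    pub records: BTreeMap<Vec<u8>, Vec<u8>>,
    pub flushes: usize, // counts batch writes, for illustration only
}

/// A toy mini-page: a small sorted buffer that absorbs writes for one disk page.
pub struct MiniPage {
    records: BTreeMap<Vec<u8>, Vec<u8>>,
    used_bytes: usize,
    max_bytes: usize, // assumed threshold for "too large"
}

impl MiniPage {
    pub fn new(max_bytes: usize) -> Self {
        Self { records: BTreeMap::new(), used_bytes: 0, max_bytes }
    }

    /// Absorb a write in memory. Returns true when the mini-page has
    /// outgrown its budget and the caller should flush it.
    pub fn insert(&mut self, key: &[u8], value: &[u8]) -> bool {
        if let Some(old) = self.records.insert(key.to_vec(), value.to_vec()) {
            // Overwrite: give back the bytes of the replaced record.
            self.used_bytes -= key.len() + old.len();
        }
        self.used_bytes += key.len() + value.len();
        self.used_bytes > self.max_bytes
    }

    /// Point lookup served from memory if the record is cached here.
    pub fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.records.get(key).map(|v| v.as_slice())
    }

    /// Merge every buffered record into the disk page in a single batch.
    pub fn flush(&mut self, page: &mut DiskPage) {
        page.records.append(&mut self.records); // leaves the mini-page empty
        page.flushes += 1;
        self.used_bytes = 0;
    }
}
```

In this sketch, many small writes cost one page write: the caller keeps inserting until `insert` returns `true`, then calls `flush` once.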
The buffer pool is a circular buffer; the allocated space is delimited by the head and tail addresses.
A naive circular buffer is a FIFO queue; we make it an LRU approximation using a second-chance region.
Mini-pages in the second-chance region are:
- Copied to the tail address when accessed
- Evicted to disk if not accessed while in the region
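The second-chance behavior above can be modeled with a toy pool. This is a sketch under simplifying assumptions, not the actual implementation: `CircularPool`, `Access`, and the `second_chance` length are hypothetical names, addresses grow monotonically instead of wrapping modulo the capacity, each mini-page is assumed to fit within the capacity, and "evicting to disk" is just recording the page id.

```rust
use std::collections::VecDeque;

/// Outcome of touching a mini-page in the toy pool.
#[derive(Debug, PartialEq)]
pub enum Access {
    Hit,
    CopiedToTail,
    Miss,
}

struct Slot {
    id: u32,
    addr: u64, // logical address; a real buffer would wrap this
    size: u64,
}

pub struct CircularPool {
    slots: VecDeque<Slot>, // ordered by address: front is the head, back is the tail
    tail: u64,             // next allocation address
    capacity: u64,
    second_chance: u64,    // assumed length of the region right after the head
    pub evicted: Vec<u32>, // ids that were "flushed to disk"
}

impl CircularPool {
    pub fn new(capacity: u64, second_chance: u64) -> Self {
        Self { slots: VecDeque::new(), tail: 0, capacity, second_chance, evicted: Vec::new() }
    }

    /// The head is the address of the oldest live mini-page.
    fn head(&self) -> u64 {
        self.slots.front().map_or(self.tail, |s| s.addr)
    }

    /// Allocate at the tail, evicting from the head until the request fits.
    pub fn allocate(&mut self, id: u32, size: u64) {
        while self.tail + size - self.head() > self.capacity {
            match self.slots.pop_front() {
                Some(victim) => self.evicted.push(victim.id),
                None => break,
            }
        }
        self.slots.push_back(Slot { id, addr: self.tail, size });
        self.tail += size;
    }

    /// Touch a mini-page; one inside the second-chance region is copied to the tail.
    pub fn access(&mut self, id: u32) -> Access {
        let pos = match self.slots.iter().position(|s| s.id == id) {
            Some(pos) => pos,
            None => return Access::Miss,
        };
        if self.slots[pos].addr < self.head() + self.second_chance {
            let slot = self.slots.remove(pos).expect("position is in bounds");
            self.allocate(slot.id, slot.size);
            Access::CopiedToTail
        } else {
            Access::Hit
        }
    }
}
```

Pages that are touched while near the head move away from the eviction point, while untouched ones are reclaimed in FIFO order, which is what makes the queue approximate LRU.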
Mini-pages are copied to a larger mini-page when they need to grow. The old space is added to a free list for future allocations.
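A minimal sketch of the grow path, assuming one free list per block size and a doubling growth step. Both are illustrative choices, and `MiniPageAllocator` and `Block` are hypothetical names rather than Bf-Tree's API.

```rust
use std::collections::HashMap;

/// A region of buffer-pool memory backing one mini-page.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Block {
    pub addr: usize,
    pub size: usize,
}

/// Hypothetical allocator: bump allocation plus one free list per block size.
pub struct MiniPageAllocator {
    next_addr: usize,
    free_lists: HashMap<usize, Vec<usize>>, // block size -> free addresses
}

impl MiniPageAllocator {
    pub fn new() -> Self {
        Self { next_addr: 0, free_lists: HashMap::new() }
    }

    /// Reuse a freed block of the same size if one exists, else bump-allocate.
    pub fn allocate(&mut self, size: usize) -> Block {
        if let Some(addr) = self.free_lists.get_mut(&size).and_then(|list| list.pop()) {
            return Block { addr, size };
        }
        let addr = self.next_addr;
        self.next_addr += size;
        Block { addr, size }
    }

    /// Return a block to the free list for future allocations.
    pub fn free(&mut self, block: Block) {
        self.free_lists.entry(block.size).or_default().push(block.addr);
    }

    /// Grow a mini-page: take a larger block (doubling is an assumption),
    /// then recycle the old space.
    pub fn grow(&mut self, old: Block) -> Block {
        let new = self.allocate(old.size * 2);
        // A real implementation would copy the records from `old` to `new` here.
        self.free(old);
        new
    }
}
```

After a 64-byte mini-page grows to 128 bytes, the next 64-byte allocation reuses the old address instead of consuming fresh space.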
Bf-Tree is currently internal to Microsoft. I'll update here once we have figured out the next steps.
No. The prototype I implemented is not deployed in production yet.
However, the circular buffer pool design is very similar to FASTER's hybrid log, which is deployed in production at Microsoft.
- Bf-Tree only works on modern SSDs where parallel random 4KB writes have throughput similar to sequential writes. Not all SSDs have this property, but it is not uncommon among modern SSDs.
- Bf-Tree is heavily optimized for small records (e.g., 100 bytes, a common size when used as a secondary index). Large records will perform similarly to B-Trees or LSM-Trees.
- Bf-Tree's buffer pool is more complex than those of B-Trees and LSM-Trees, as it needs to handle variable-length mini-pages. But in my opinion it is simpler than Bw-Tree, which is implemented and deployed in production.
Add your opinions here!
- Bf-Tree writes to disk pages in place, which can burden SSD garbage collection. If this is indeed a problem, we should consider log-structured writes to disk pages.
- Bf-Tree's mini-page eviction/promotion policy is dead simple. More advanced policies can be used to ensure fairness, improve hit rate, and reduce copying. Our current paper focuses on mini-page/buffer pool mechanisms; exploring advanced policies is left as future work.
- Better async. Bf-Tree relies on OS threads to interleave I/O operations, which many people believe is not ideal. Implementing Bf-Tree with user-space async I/O (e.g., tokio) might be a way to publish a paper.
- Go lock/latch free. Bf-Tree is lock-based and is carefully designed so that no deadlock is possible. Adding "lock-free" to the paper title is cool -- if you are a lock-free believer.
- Future hardware. It's not too difficult to imagine 10 more papers on applying Bf-Tree to future hardware, like CXL, RDMA, PM, GPU, SmartNIC, etc.
If you encounter any problems or have questions about implementation details, I'm more than happy to help you out and give you some hints! Feel free to reach out to me at [email protected] or open an issue here.
Some notable changes:
- We have fixed a legend typo in Figure 1.
@article{bf-tree,
title={Bf-Tree: A Modern Read-Write-Optimized Concurrent Larger-Than-Memory Range Index},
author={Hao, Xiangpeng and Chandramouli, Badrish},
journal={Proceedings of the VLDB Endowment},
volume={17},
number={11},
pages={3442--3455},
year={2024},
publisher={VLDB Endowment}
}
- https://x.com/badrishc/status/1828290910431703365
- https://x.com/MarkCallaghanDB/status/1827906983619649562
- https://x.com/MarkCallaghanDB/status/1828466545347252694
- https://discord.com/channels/824628143205384202/1278432084070891523/1278432617858990235
- https://x.com/FilasienoF/status/1830986520808833175
Add yours here!