Sarthak | Updates README
SarthakMakhija authored Sep 22, 2024
1 parent cffcf18 commit d7d73bd
Showing 1 changed file with 17 additions and 24 deletions.
41 changes: 17 additions & 24 deletions README.md
@@ -13,29 +13,26 @@ Inspired by [LSM in a Week](https://skyzh.github.io/mini-lsm/00-preface.html)
Memtable uses [Skiplist](https://tech-lessons.in/en/blog/serializable_snapshot_isolation/#skiplist-and-mvcc) as its storage data structure.
The [Skiplist](https://github.com/SarthakMakhija/go-lsm/blob/main/memory/external/skiplist.go) implementation in this repository is shamelessly taken from [Badger](https://github.com/dgraph-io/badger).
It is a lock-free implementation of Skiplist. A lock-free implementation is important: otherwise, a scan operation would take locks (read-locks), which would interfere with write operations.
Check [Memtable](https://github.com/SarthakMakhija/go-lsm/blob/main/memory/memtable.go).

2. **WAL** is a write-ahead log. Every transactional write is stored in a memtable which is backed by a WAL. Every write to the memtable (typically a [TimestampedBatch](https://github.com/SarthakMakhija/go-lsm/blob/main/kv/timestamped_batch.go)) involves writing every key/value pair from the batch to the WAL.
This implementation writes every key/value pair from the batch to the WAL individually. An alternative would be to serialize the entire [TimestampedBatch](https://github.com/SarthakMakhija/go-lsm/blob/main/kv/timestamped_batch.go) and write it to the WAL.
3. **WAL** is a write-ahead log. Every transactional write is stored in a memtable which is backed by a WAL. Every write to the memtable (typically a [TimestampedBatch](https://github.com/SarthakMakhija/go-lsm/blob/main/kv/timestamped_batch.go)) involves writing every key/value pair from the batch to the WAL.
This implementation writes every key/value pair from the batch to the WAL individually. An alternative would be to serialize the entire [TimestampedBatch](https://github.com/SarthakMakhija/go-lsm/blob/main/kv/timestamped_batch.go) and write it to the WAL. Check [WAL](https://github.com/SarthakMakhija/go-lsm/blob/main/log/wal.go).

3. **Recovery of Memtable from WAL** involves the following:
4. **Recovery of Memtable from WAL** involves the following:
1) Reading the file in READONLY mode.
2) Reading the whole file in one go.
3) Iterating through the file buffer (bytes) and decoding the bytes to get [key](https://github.com/SarthakMakhija/go-lsm/blob/main/kv/key.go) and [value](https://github.com/SarthakMakhija/go-lsm/blob/main/kv/value.go) pairs.
4) Storing the key/value pairs in the Memtable.

Check [recovery of Memtable from WAL](https://github.com/SarthakMakhija/go-lsm/blob/main/log/wal.go#L41).

There are a few approaches for reading the WAL:

- Read the whole file.
    - Implement a page-aligned WAL, which means the data in the WAL will be aligned to the page (say, a 4KB application page). Read page by page. This approach will, however, result in fragmentation in the WAL (during writes).
    - Read as per the encoding of data. Instead of reading the whole file, multiple file reads will be issued: to read the key size, key, value size and value. [Cassandra](https://github.com/apache/cassandra) implements WAL using this approach.
    - Implement WAL as a memory-mapped file. [Badger](https://github.com/dgraph-io/badger) implements WAL as a memory-mapped file.
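The recovery steps and the "read the whole file" approach above can be sketched roughly as follows. This is a minimal illustration, not the code in `log/wal.go`: the entry layout (a 2-byte key size and a 4-byte value size, little-endian), the names `encodeEntry` and `decodeEntries`, the file name `1.wal` and the map standing in for the Memtable are all assumptions made for the example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

var errTruncated = errors.New("wal: truncated entry")

// encodeEntry lays out one key/value pair as
// [key size: 2 bytes][key][value size: 4 bytes][value] (assumed layout).
func encodeEntry(key, value []byte) []byte {
	buffer := make([]byte, 0, 2+len(key)+4+len(value))
	buffer = binary.LittleEndian.AppendUint16(buffer, uint16(len(key)))
	buffer = append(buffer, key...)
	buffer = binary.LittleEndian.AppendUint32(buffer, uint32(len(value)))
	return append(buffer, value...)
}

// decodeEntries walks the file buffer and hands every decoded pair to store.
func decodeEntries(content []byte, store func(key, value []byte)) error {
	for len(content) > 0 {
		if len(content) < 2 {
			return errTruncated
		}
		keySize := int(binary.LittleEndian.Uint16(content))
		content = content[2:]
		if len(content) < keySize+4 {
			return errTruncated
		}
		key := content[:keySize]
		valueSize := int(binary.LittleEndian.Uint32(content[keySize:]))
		content = content[keySize+4:]
		if len(content) < valueSize {
			return errTruncated
		}
		store(key, content[:valueSize])
		content = content[valueSize:]
	}
	return nil
}

func main() {
	directory, err := os.MkdirTemp("", "wal-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(directory)
	path := filepath.Join(directory, "1.wal")

	// Write path: every key/value pair of a batch is appended individually.
	file, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		panic(err)
	}
	for _, pair := range [][2]string{{"consensus", "raft"}, {"storage", "lsm"}} {
		if _, err := file.Write(encodeEntry([]byte(pair[0]), []byte(pair[1]))); err != nil {
			panic(err)
		}
	}
	if err := file.Close(); err != nil {
		panic(err)
	}

	// Recovery path: os.ReadFile opens the file read-only and reads it in one go.
	content, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	memtable := map[string]string{} // stand-in for the real Memtable
	err = decodeEntries(content, func(key, value []byte) {
		memtable[string(key)] = string(value)
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(memtable), memtable["consensus"], memtable["storage"])
}
```

Reading the whole file keeps decoding simple, at the cost of holding the entire WAL in memory during recovery. The "read as per the encoding" approach would instead issue one read per size field and payload.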

4. **Manifest** records different events in the system. This implementation supports `MemtableCreatedEventType`, `SSTableFlushedEventType` and `CompactionDoneEventType` event types. The manifest is used to recover the state of the
LSM ([StorageState](https://github.com/SarthakMakhija/go-lsm/blob/main/state/storage_state.go)) when it restarts.
5. **Manifest** records different events in the system. This implementation supports `MemtableCreatedEventType`, `SSTableFlushedEventType` and `CompactionDoneEventType` event types. The manifest is used to recover the state of the
LSM ([StorageState](https://github.com/SarthakMakhija/go-lsm/blob/main/state/storage_state.go)) when it restarts. Check [Manifest](https://github.com/SarthakMakhija/go-lsm/blob/main/manifest/manifest.go).

5. **Bloom filter** is a probabilistic data structure used to test whether an element may be present in a dataset. A bloom filter can be queried against large amounts of data and returns either “possibly in the set” or “definitely not in the set”. It relies on an M-sized bit vector and K hash functions. It is used to check if the application should read an [SSTable](https://github.com/SarthakMakhija/go-lsm/blob/main/table/table.go#L173) during a get operation.
6. **Bloom filter** is a probabilistic data structure used to test whether an element may be present in a dataset. A bloom filter can be queried against large amounts of data and returns either “possibly in the set” or “definitely not in the set”. It relies on an M-sized bit vector and K hash functions. It is used to check if the application should read an [SSTable](https://github.com/SarthakMakhija/go-lsm/blob/main/table/table.go#L173) during a get operation.
Check [Bloom filter](https://github.com/SarthakMakhija/go-lsm/blob/main/table/bloom/filter.go).
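The M-sized bit vector and the K hash functions can be illustrated with a small sketch. It is not the code in `table/bloom/filter.go`: the type `bloomFilter` and its methods are hypothetical, a `[]bool` stands in for a packed bit vector, and the K positions are derived from one FNV-1a hash via double hashing, which is an assumption chosen to keep the example short.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloomFilter is a hypothetical, unoptimized filter for illustration only.
type bloomFilter struct {
	bits          []bool // M-sized bit vector ([]bool keeps the sketch short)
	hashFunctions int    // K
}

func newBloomFilter(m, k int) *bloomFilter {
	return &bloomFilter{bits: make([]bool, m), hashFunctions: k}
}

// positions derives K bit positions from a single 64-bit FNV-1a hash
// using double hashing: position(i) = (first + i*second) mod M.
func (filter *bloomFilter) positions(key []byte) []int {
	hasher := fnv.New64a()
	hasher.Write(key)
	sum := hasher.Sum64()
	first, second := sum&0xffffffff, sum>>32

	positions := make([]int, filter.hashFunctions)
	for i := range positions {
		positions[i] = int((first + uint64(i)*second) % uint64(len(filter.bits)))
	}
	return positions
}

// add sets the K bits that correspond to the key.
func (filter *bloomFilter) add(key []byte) {
	for _, position := range filter.positions(key) {
		filter.bits[position] = true
	}
}

// mayContain returns false only if the key was definitely never added;
// true means the key is possibly present (false positives can happen).
func (filter *bloomFilter) mayContain(key []byte) bool {
	for _, position := range filter.positions(key) {
		if !filter.bits[position] {
			return false
		}
	}
	return true
}

func main() {
	filter := newBloomFilter(1024, 4)
	filter.add([]byte("consensus"))
	filter.add([]byte("storage"))

	// A get operation would skip reading the SSTable when mayContain is false.
	fmt.Println(filter.mayContain([]byte("consensus")))
	fmt.Println(filter.mayContain([]byte("paxos")))
}
```

A key that was added always reports "possibly in the set", so a negative answer is safe to act on. A key that was never added usually reports "definitely not in the set", but can occasionally collide with bits set by other keys.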

6. **Transaction** represents an atomic unit of work. This repository implements the following concepts to provide ACID properties:
8. **Transaction** represents an atomic unit of work. This repository implements the following concepts to provide ACID properties:
- [Batch](https://github.com/SarthakMakhija/go-lsm/blob/main/kv/timestamped_batch.go) and [TimestampedBatch](https://github.com/SarthakMakhija/go-lsm/blob/main/kv/timestamped_batch.go) for atomicity.
  - [Serializable-snapshot-isolation](https://github.com/SarthakMakhija/go-lsm/blob/main/txn/transaction.go) for isolation.
- [WAL](https://github.com/SarthakMakhija/go-lsm/blob/main/log/wal.go) for durability.
@@ -52,22 +49,18 @@ A brief overview of serializable-snapshot-isolation:
7) Readonly transactions never abort.
8) Serializable-snapshot-isolation prevents dirty-read, fuzzy-read, phantom-read, write-skew and lost-update.

More details are available [here](https://tech-lessons.in/en/blog/serializable_snapshot_isolation/).
More details are available [here](https://tech-lessons.in/en/blog/serializable_snapshot_isolation/). To understand the implementation, start with [Transaction](https://github.com/SarthakMakhija/go-lsm/blob/main/txn/transaction.go).

7. The **Compaction** implementation in this repository is simple-leveled compaction. Simple-leveled compaction considers two options for deciding if compaction needs to run.

- **Option1**: `Level0FilesCompactionTrigger`. Consider `Level0FilesCompactionTrigger` = 2, and number of SSTable files at level0 = 3. This means all SSTable files present at level0 are eligible for undergoing compaction with all the SSTable files at level1.

- **Option2:** `NumberOfSSTablesRatioPercentage`. This defines the ratio between the number of SSTable files present in two adjacent levels: number of files at lower level / number of files at upper level.
  Consider `NumberOfSSTablesRatioPercentage` = 200, with 2 SSTable files at level1 and 1 at level2. Ratio = (1/2)*100 = 50%, which is less than the configured `NumberOfSSTablesRatioPercentage`. Hence, the SSTable files of level1 and level2 will undergo compaction.
- **Option1**: `Level0FilesCompactionTrigger`. Consider `Level0FilesCompactionTrigger` = 2, and number of SSTable files at level0 = 3. This means all SSTable files present at level0 are eligible for undergoing compaction with all the SSTable files at level1.
- **Option2:** `NumberOfSSTablesRatioPercentage`. This defines the ratio between the number of SSTable files present in two adjacent levels: number of files at lower level / number of files at upper level.
      Consider `NumberOfSSTablesRatioPercentage` = 200, with 2 SSTable files at level1 and 1 at level2. Ratio = (1/2)*100 = 50%, which is less than the configured `NumberOfSSTablesRatioPercentage`. Hence, the SSTable files of level1 and level2 will undergo compaction.

In the actual simple-leveled compaction, file sizes are considered instead of the number of files.
    In the actual simple-leveled compaction, file sizes are considered instead of the number of files. Check [Compaction](https://github.com/SarthakMakhija/go-lsm/blob/main/compact/compaction.go).
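The two options above can be expressed as a small decision function. This is a hedged sketch rather than the code in `compact/compaction.go`: the names `simpleLeveledOptions` and `pickCompaction` are hypothetical, the trigger is assumed to fire when level0 holds at least `Level0FilesCompactionTrigger` files, and the levels are assumed to be checked from top to bottom.

```go
package main

import "fmt"

// simpleLeveledOptions mirrors the two configuration values described above
// (hypothetical struct, named for this example only).
type simpleLeveledOptions struct {
	level0FilesCompactionTrigger    int
	numberOfSSTablesRatioPercentage int
}

// pickCompaction receives the number of SSTable files per level (index 0 is level0)
// and returns the upper and lower level that should be compacted together, if any.
func pickCompaction(filesPerLevel []int, options simpleLeveledOptions) (int, int, bool) {
	if len(filesPerLevel) < 2 {
		return 0, 0, false
	}
	// Option1: enough files at level0 (assumed to mean "at least the trigger").
	if filesPerLevel[0] >= options.level0FilesCompactionTrigger {
		return 0, 1, true
	}
	// Option2: (files at lower level / files at upper level) * 100 is below the
	// configured percentage. Cross-multiplying avoids a division by zero.
	for upperLevel := 1; upperLevel+1 < len(filesPerLevel); upperLevel++ {
		lowerLevel := upperLevel + 1
		ratioTimesUpper := filesPerLevel[lowerLevel] * 100
		if ratioTimesUpper < options.numberOfSSTablesRatioPercentage*filesPerLevel[upperLevel] {
			return upperLevel, lowerLevel, true
		}
	}
	return 0, 0, false
}

func main() {
	options := simpleLeveledOptions{
		level0FilesCompactionTrigger:    2,
		numberOfSSTablesRatioPercentage: 200,
	}

	// Option1 example from the text: 3 files at level0 with a trigger of 2.
	fmt.Println(pickCompaction([]int{3, 0}, options))

	// Option2 example from the text: 2 files at level1, 1 file at level2.
	fmt.Println(pickCompaction([]int{0, 2, 1}, options))
}
```

Comparing `lower*100` against `percentage*upper` is the same check as `(lower/upper)*100 < percentage`, without integer-division rounding and without failing on an empty upper level.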

8. **Iterators** form one of the core building blocks of LSM-based key/value storage operations. Iterators are used in operations like [Scan](https://github.com/SarthakMakhija/go-lsm/blob/main/state/storage_state.go#L184) and [Compaction](https://github.com/SarthakMakhija/go-lsm/blob/main/compact/compaction.go#L75). This repository provides various iterators, including [MergeIterator](https://github.com/SarthakMakhija/go-lsm/blob/main/iterator/merge_iterator.go), [SSTableIterator](https://github.com/SarthakMakhija/go-lsm/blob/main/table/iterator.go) and [InclusiveBoundedIterator](https://github.com/SarthakMakhija/go-lsm/blob/main/iterator/iterator.go).

### Approach to go through the repository

### Development plan
![LSM development items](https://github.com/user-attachments/assets/47731c33-a642-432e-8a02-1d3146d88e8d)

