This is how scdb works under the hood. Keep in mind that scdb is basically a hashtable with the memory-mapped database file acting as the underlying array.
There are six main operations
- This creates the database file if it does not exist
- adds the 100-byte header basing on user configuration and default values
- adds the placeholders for the index blocks, each item pre-initialized with a zero.
- And loads the derived and non-derived properties like 'max_keys', 'block_size', 'redundant_blocks',
'number_of_index_blocks' (
(max_keys / number_of_items_per_index_block).ceil() + redundant_blocks
), 'number_of_items_per_index_block' ((block_size / 8).floor()
), 'key_values_start_point' (100 + (net_block_size * number_of_index_blocks)
), 'net_block_size' (number_of_items_per_index_block * 8
)
- The key supplied is run through a hash function with modulo
number_of_items_per_index_block
and answer multiplied by 8 to get the byte offset. Let the hashed value behash
- Set
index_block_offset
to zero to start from the first block. - The
index_address
is set toindex_block_offset + 100 + hash
. - The 8-byte offset at the
index_address
offset is read. This is the first possible pointer to the key-value entry. Let's call itkey_value_offset
. - If this
key_value_offset
is zero, this means that no value has been set for that key yet.- So the key-value entry (with all its data including
key_size
,expiry
(got from ttl from user),value_size
,value
,deleted
) is appended to the end of the file at offsetlast_offset
- the
last_offset
is then inserted atindex_address
in place of the zero - the
last_offset
header is then updated tolast_offset + get_size_of_kv(kv)
[get_size_of_kv gets the total size of the entry in bytes]
- So the key-value entry (with all its data including
- If this
key_value_offset
is non-zero, it is possible that the value for that key has already been set.- retrieve the key at the given
key_value_offset
. (Do note that there is a 4-byte numbersize
before the key. That number gives the size of the key-value entry). - if this key is the same as the key passed, we have to update it appending it to the bottom of file and overwriting
its index value:
- The key-value entry (with all its data including
key_size
,expiry
(got from ttl from user),value_size
,value
,deleted
) is appended to the end of the file at offsetlast_offset
- the
last_offset
is then inserted atindex_address
in place of the former offset - the
last_offset
header is then updated tolast_offset + get_size_of_kv(kv)
- The key-value entry (with all its data including
- else increment the
index_block_offset
bynet_block_size
- if the new
index_block_offset
is equal to or greater than thekey_values_start_point
, raise theCollisionSaturatedError
error. We have run out of blocks without getting a free slot to add the key-value entry. - else go back to step 3.
- if the new
- retrieve the key at the given
Note: this uses a form of separate chaining to handle hash collisions. Having multiple index blocks is a form of separate chaining
- Time complexity: This operation is O(k) where k is the
number_of_index_blocks
. - Auxiliary Space: This operation is O(kn+m) where n = key length, m = value length and k =
number_of_index_blocks
. The key-value entry is copied into a contiguous byte array before insertion and if the hash for the given key already has keys associated with it, each will be allocated in memory thuskn
.
- There is a possibility that when one sets a given key, we may find all index blocks for the given hash already filled.
We thus have to throw a
CollisionSaturatedError
and abort inserting the key. This means that the occurrence of such errors will increase in frequency as the number of keys comes closer to themax_keys
value. One possible remedy to this is to add a more redundant index block(s) i.e. increaseredundant_blocks
. Keep in mind that this consumes extra disk and memory space.
- The key supplied is run through a hash function with modulo
number_of_items_per_index_block
and answer multiplied by 8 to get the byte offset. Let the hashed value behash
. - Set
index_block_offset
to zero to start from the first block. - The
index_address
is set toindex_block_offset + 100 + hash
. - The 8-byte offset at the
index_address
offset is read. This is the first possible pointer to the key-value entry. Let's call itkey_value_offset
. - If this
key_value_offset
is non-zero, it is possible that the value for that key exists.- retrieve the key at the given
key_value_offset
.- if this key is the same as the key passed, we delete it:
- update the
IS_DELETED
value of the key-value entry and exit
- update the
- if this key is the same as the key passed, we delete it:
- else increment the
index_block_offset
bynet_block_size
- if the new
index_block_offset
is equal to or greater than thekey_values_start_point
, stop and return. The key does not exist. - else go back to step 3.
- if the new
- retrieve the key at the given
- Time complexity: This operation is O(k) where k is the
number_of_index_blocks
. - Auxiliary space: This operation is O(kn) where n = buffer size (default is virtual memory page size), and k
=
number_of_index_blocks
. The hash for the key is checked for each index block, until a block is found that contains the key. If the hash for the given key is not already buffered, we read a new block from file into memory. The worst case is when the key is non-existent and there were no buffers already in memory.
- The key supplied is run through a hash function with modulo
number_of_items_per_index_block
and answer multiplied by 8 to get the byte offset. Let the hashed value behash
. - Set
index_block_offset
to zero to start from the first block. - The
index_address
is set toindex_block_offset + 100 + hash
. - The 8-byte offset at the
index_address
offset is read. This is the first possible pointer to the key-value entry. Let's call itkey_value_offset
. - If this
key_value_offset
is non-zero, it is possible that the value for that key exists.- retrieve the key at the given
key_value_offset
.- if this key is the same as the key passed:
- if the
expiry
is greater than thecurrent_timestamp
, returnNone
- else return
value
for this value
- if the
- if this key is the same as the key passed:
- else increment the
index_block_offset
bynet_block_size
- if the new
index_block_offset
is equal to or greater than thekey_values_start_point
, stop and returnNone
. The key does not exist. - else go back to step 3.
- if the new
- retrieve the key at the given
- Time complexity: This operation is O(k) where k is the
number_of_index_blocks
. - Auxiliary space: This operation is O(kn) where n = buffer size (default is virtual memory page size), and k
=
number_of_index_blocks
. The hash for the key is checked for each index block, until a block is found that contains the key. If the hash for the given key is not already buffered, we read a new block from file into memory. The worst case is when the key is non-existent and there were no buffers already in memory.
Compaction can run automatically every few hours. During that time, the database would be locked. No read, nor write would be allowed. Compaction can also be started by the user.
- Create new file
- Copy header into the new file
- Copy index into new file. This done index block by block.
- In each index block, find any non-zero index entries. For each of these:
- if the entry has not yet expired and is not deleted
- append that key-value entry to the new file,
- update the index for that entry in the new file. The index being the offset where it was appended (i.e. bottom of file)
- else
- update the index for that entry in the new file to be 0 (zero)
- if the entry has not yet expired and is not deleted
- Clear the buffers
- Update file_size to the new file's file size
- Point the buffer pool's file to that new file
- Delete the old file
- Rename the new file to the old file's name
- Time complexity: This operation is O(N) where N is the number of keys in the file before compaction.
- Auxiliary space: This operation is O(2m) where m is the
file_size
of the original file because we copy one file to another. The worst case is when there is no key that has been deleted or expired.
Clear the entire database.
- Initialize a new header basing on the old settings
- Write the new header into the file
- Shrink the file to the size of the header
- Expand the file to the expected size of file if it had headers and index blocks. This fills any gaps with 0s.
- Time complexity: This operation is O(1).
- Auxiliary space: This operation is O(1).
-
The BufferPool's buffers can be split up into two kinds; index buffers and key-value buffers. This is to deal with the multiple cache misses we keep having (very evident in the delete-benchmarks). We need to have at least the index first block always cached. This has the following consequences.
- The index buffers are to be stored in a btree map so as to be able to expel any buffers that have a higher left-offset first
- The total capacity of the buffer pool will be split up at a 2:3 ratio for index:key-value buffers.
- The index buffer capacity will be capped to the number of index blocks such that
if the 2:3 ratio produces an index buffer capacity higher than the total number of index blocks, we increase the
key-value buffer capacity by the difference between
total number of index blocks
and the computedindex buffer capacity
. This avoids waste. - methods like
read_at
which could use either the index or the key-value buffers will specify what buffer type they are interested in i.e. index or key-value.
- This improved the reads (i.e.
get
) but had a very minimal effect on thedelete
operation, probably because the cache-misses were a result of having very large buffers being operated on in one operation. - Remedy: Changed the way deletion is done. Instead of updating the index, we update the actual key-value
entry's
IS_DELETED
attribute
-
The index can take up a lot of space in memory - so much so that updating it has to be done incrementally directly on disk. We need to use different lengths of byte arrays depending on the entry offset they hold. This is hard considering the fact that hashing requires that the number of elements per block be the same always.
- To be investigated further