Documentation for the algorithm implemented by this crate.
There is no code or public API surface in this file, just documentation for the underlying algorithm.
The purpose behind this concurrent map implementation is to minimize the overhead for reading as much as possible. This presents obvious challenges since readers cannot access the map while it is being written to. We solve this problem by keeping two copies of the underlying map - one for the writer to modify, and one for the readers to read from. When the writer wants to publish its modifications, we atomically swap the two maps such that new readers see the writer's changes. We then re-apply those changes to the old map when it is safe to do so.
This approach is already implemented by the crate `evmap`; however, its performance suffers from high-latency writes and a shared atomic pointer to the readable map. `flashmap`'s implementation of this approach ensures that readers access no shared state with each other (other than the actual map) in their critical path, and moreover that sufficiently infrequent writes are effectively wait-free.
The trade-offs are that this map is eventually consistent, so existing readers can still access stale data, and there is only a single writer, so concurrent modification is not supported.
This section briefly describes where state lives in this data structure.
The two maps are stored in a length-2 array in memory. Currently the backing map implementation is provided by `hashbrown`, but this might be made generic in the future. The pointer to this array is shared with all readers and the writer.

Readers - or more specifically `ReadGuard`s - are tracked via atomic reference counting. Each read handle has its own individual refcount, but these refcounts are shared with the writer.
Because of the design of this map, we are able to have the writer block when it cannot make progress, and be unparked when the last reader gets out of its way. Thus the shared state of the map includes the necessary bookkeeping to accomplish this, which will be explained in greater detail later.
Currently, readers store no additional state other than what is described above. Each read handle does hold onto a `refcount_key`, which is used to deallocate that handle's refcount on drop, but this state is immutable.

Writers, in addition to an `Arc` of the aforementioned shared state, must store an operation log of changes. These changes are re-applied to the second map at the beginning of the next write.
There are multiple components to this algorithm, each of which is explained individually below.
A core component of this algorithm is reference counting. Each read handle has its own reference count (refcount) which it uses to track the number of outstanding guards that need to be dropped. In addition to the number of outstanding guards, the refcount also stores the current readable map (the map which new read guards should point to). Since there are only two maps, this information is stored in the high bit of the refcount, and the actual guard count is stored in the lower `usize::BITS - 2` bits (one bit is left unused so that overflow can be detected before it reaches the map-index bit).

A refcount has three principal operations: increment, decrement, and swap maps. Part of the implementation is shown below:
```rust
pub struct RefCount {
    value: CachePadded<AtomicUsize>,
}

impl RefCount {
    /* The implementation of some helpers has been omitted for brevity */

    pub fn increment(&self) -> MapIndex {
        let old_value = self.value.fetch_add(1, Ordering::Acquire);
        Self::check_overflow(old_value);
        Self::to_map_index(old_value)
    }

    pub fn decrement(&self) -> MapIndex {
        let old_value = self.value.fetch_sub(1, Ordering::Release);
        Self::to_map_index(old_value)
    }

    pub fn swap_maps(&self) -> usize {
        let old_value = self
            .value
            .fetch_add(Self::MAP_INDEX_FLAG, Ordering::Relaxed);
        Self::to_refcount(old_value)
    }
}
```
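The omitted helpers are small bit-manipulation routines. A minimal sketch of what they might look like, given the layout described above, is shown here; the constant value and function bodies are assumptions, and `MapIndex` is reduced to a plain two-variant enum.

```rust
/// Stand-in for the crate's `MapIndex`: which of the two maps to read from.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum MapIndex {
    First,
    Second,
}

/// The high bit of the refcount word holds the map index.
const MAP_INDEX_FLAG: usize = 1 << (usize::BITS - 1);

/// Extract the map index from a raw refcount value.
fn to_map_index(value: usize) -> MapIndex {
    if value & MAP_INDEX_FLAG == 0 {
        MapIndex::First
    } else {
        MapIndex::Second
    }
}

/// Strip the map-index bit, leaving only the guard count.
fn to_refcount(value: usize) -> usize {
    value & !MAP_INDEX_FLAG
}

/// Abort if the guard count approaches the map-index bit; keeping a spare bit
/// below the flag means overflow is caught before it can corrupt the index.
fn check_overflow(value: usize) {
    if to_refcount(value) >= (MAP_INDEX_FLAG >> 1) {
        std::process::abort();
    }
}
```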
`increment` and `decrement` both change the refcount as expected, but they also make use of the return value of the `fetch_*` operations. The old value of the refcount (before the RMW) tells us, in its high bit, which map the writer expects us to be reading from. This is especially useful in `increment`, since the reader can guard its access to the map and learn which map to read from in a single atomic operation. Why we also need the map index from `decrement` will be explained later.
The `fetch_add` in `swap_maps` has the same effect as toggling the high bit. This could also be done with `fetch_xor`, but I've seen that compile to a CAS loop on most architectures. The return value is the number of active guards at the time the maps were swapped. This quantity will be referred to as "residual", since it's the number of residual readers still accessing the old map.
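To see why adding `MAP_INDEX_FLAG` toggles the high bit, note that the carry out of the top bit simply wraps away (atomic `fetch_add` wraps on overflow), leaving the guard count in the low bits untouched:

```rust
const MAP_INDEX_FLAG: usize = 1 << (usize::BITS - 1);

fn main() {
    let count = 3_usize; // low bits: number of active guards

    // Flag currently clear: adding it sets the high bit, count untouched.
    assert_eq!(count.wrapping_add(MAP_INDEX_FLAG), count | MAP_INDEX_FLAG);

    // Flag currently set: adding it again carries out of the top bit, which
    // wraps away, clearing the flag and again leaving the count untouched.
    assert_eq!((count | MAP_INDEX_FLAG).wrapping_add(MAP_INDEX_FLAG), count);
}
```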
Each read handle has its own refcount, and the refcounts for all read handles are stored in a single contiguous array (currently backed by `slab`). This array is wrapped in a `Mutex`, so any access to the entire array involves acquiring a lock. Specifically, when a read handle is constructed, a new refcount is allocated on the heap, the refcount array is locked, and a pointer to that refcount is added to the array. Similarly, when a read handle is dropped, the array is locked, the pointer is removed, and the refcount is deallocated.
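A sketch of this registration and deregistration might look as follows. The real crate stores raw pointers to heap-allocated refcounts; `Arc` and a bare `AtomicUsize` are used here to keep the example safe and self-contained, and all names are assumptions.

```rust
use std::sync::atomic::AtomicUsize;
use std::sync::{Arc, Mutex};

use slab::Slab;

struct Shared {
    // Every read handle's refcount, behind a lock. The lock is only taken
    // when handles are created or dropped (and by the writer during publish),
    // never on the readers' critical path.
    refcounts: Mutex<Slab<Arc<AtomicUsize>>>,
}

struct ReadHandle {
    shared: Arc<Shared>,
    refcount: Arc<AtomicUsize>,
    // Where this handle's refcount lives in the slab, so it can be removed.
    refcount_key: usize,
}

impl ReadHandle {
    fn new(shared: Arc<Shared>) -> Self {
        let refcount = Arc::new(AtomicUsize::new(0));
        let refcount_key = shared
            .refcounts
            .lock()
            .unwrap()
            .insert(Arc::clone(&refcount));
        Self { shared, refcount, refcount_key }
    }
}

impl Drop for ReadHandle {
    fn drop(&mut self) {
        // Deregister so the writer no longer visits this refcount.
        self.shared.refcounts.lock().unwrap().remove(self.refcount_key);
    }
}
```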
The read algorithm is described below; a sketch of these steps follows the list.

1. `increment` the refcount.
2. Take the returned `MapIndex`, convert it to an offset into the length-2 map array, and get a reference to the map to read from. Store the map index in the read guard.
3. When the read guard is dropped, `decrement` the refcount. If the returned map index matches the one stored on the guard, we are done. If they do not match, proceed to 4.
4. Since the map indexes do not match, this is a residual guard. Call `release_residual` to signal that we are done reading.
5. If we see from the atomic decrement in `release_residual` that we were the last residual reader, wake the writer (this portion of the algorithm will be explained later).
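Here is a self-contained sketch of steps 1 through 4, with `RefCount` and `MapIndex` reduced to simplified stand-ins, the reference into the chosen map omitted, and the residual signalling of step 5 left as a stub (it is sketched in the parking section below).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const MAP_INDEX_FLAG: usize = 1 << (usize::BITS - 1);

#[derive(Clone, Copy, PartialEq, Eq)]
enum MapIndex {
    First,
    Second,
}

fn to_map_index(value: usize) -> MapIndex {
    if value & MAP_INDEX_FLAG == 0 {
        MapIndex::First
    } else {
        MapIndex::Second
    }
}

struct RefCount(AtomicUsize);

struct ReadGuard<'a> {
    refcount: &'a RefCount,
    // Step 2: remember which map this guard is reading from. (The real guard
    // also holds a reference into that map.)
    map_index: MapIndex,
}

fn begin_read(refcount: &RefCount) -> ReadGuard<'_> {
    // Step 1: a single atomic increment both guards the map and reports,
    // via the high bit of the old value, which map to read from.
    let old = refcount.0.fetch_add(1, Ordering::Acquire);
    ReadGuard {
        refcount,
        map_index: to_map_index(old),
    }
}

impl Drop for ReadGuard<'_> {
    fn drop(&mut self) {
        // Step 3: decrement and compare the current map index against ours.
        let old = self.refcount.0.fetch_sub(1, Ordering::Release);
        if to_map_index(old) != self.map_index {
            // Steps 4-5: the writer swapped maps while we were reading, so we
            // are a residual reader and must signal that we have left.
            release_residual();
        }
    }
}

fn release_residual() {
    // Stub: decrement the shared residual counter and, if we were the last
    // residual reader, wake the writer (sketched in the parking section).
}
```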
`release_residual` is a function that operates on an atomic counter called `residual`, which tracks the number of read guards blocking the writer from making progress. This quantity will be explained with the write algorithm.
Parking and unparking the writer is also managed through the `residual` quantity. The residual count is stored in the lower `usize::BITS - 1` bits (63 on a 64-bit architecture), and the highest bit signals whether or not the writer is waiting to be unparked.

When the writer prepares to park, it sets the high bit of the residual count to 1. If the return value of this RMW operation shows that no residual readers were left, then the count is reset to zero and the writer does not park. Otherwise, it parks.

When the last residual reader decides to unpark the writer, it atomically sets the residual count to 0 and then `unpark`s the writing thread.
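A sketch of this parking scheme is shown below. The field layout, names, and memory orderings are illustrative assumptions rather than the crate's exact code.

```rust
use std::sync::atomic::{AtomicIsize, Ordering};
use std::sync::Mutex;
use std::thread::{self, Thread};

// High bit: "the writer is waiting to be unparked"; low bits: residual count.
const WAITING_BIT: isize = isize::MIN;

struct Residual {
    value: AtomicIsize,
    // The writer's thread handle, stored so a reader can unpark it.
    writer: Mutex<Option<Thread>>,
}

impl Residual {
    // Called by the writer (during `synchronize`) when residual readers may
    // still be present.
    fn park_writer(&self) {
        *self.writer.lock().unwrap() = Some(thread::current());
        // Announce the intent to park by setting the high bit.
        let old = self.value.fetch_or(WAITING_BIT, Ordering::AcqRel);
        if old == 0 {
            // All residual readers finished before we set the bit: reset the
            // counter and carry on without parking.
            self.value.store(0, Ordering::Relaxed);
            *self.writer.lock().unwrap() = None;
            return;
        }
        // Park until the last residual reader resets the counter to zero.
        // `park` can wake spuriously, so re-check in a loop.
        while self.value.load(Ordering::Acquire) != 0 {
            thread::park();
        }
    }

    // Called by a residual reader when its guard is dropped.
    fn release_residual(&self) {
        let old = self.value.fetch_sub(1, Ordering::AcqRel);
        if (old & isize::MAX) == 1 && (old & WAITING_BIT) != 0 {
            // We were the last residual reader and the writer is waiting:
            // reset the count to zero and wake the writer.
            self.value.store(0, Ordering::Release);
            if let Some(writer) = self.writer.lock().unwrap().take() {
                writer.unpark();
            }
        }
    }
}
```

Note that in this sketch a reader only touches the `writer` lock when it is the one responsible for waking the writer; the common read path never takes it.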
The write algorithm is split into two parts: `synchronize` + start write, and `publish`. When a new write guard is created, `start_write` is called, and when that guard is dropped, `publish` is called.

`synchronize` + start write:
- If there are no residual readers, we are done.
- If there are residual readers, then park as described above.
- Once unblocked, obtain a reference to the writable map.
- Apply the changes from the previous write (a sketch of this flush is shown below).
After these steps but before `publish`, the changes to the writable map are made and stored in the operation log.
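A simplified sketch of this flush step is shown below; the operation type and function names are assumptions, with `hashbrown::HashMap` as the backing map.

```rust
use std::hash::Hash;

use hashbrown::HashMap;

// A logged change from the previous write.
enum Operation<K, V> {
    Insert(K, V),
    Remove(K),
}

// Replay the previous write's operations onto the map the writer now owns,
// draining the log so it can collect this write's new operations.
fn flush_operations<K: Eq + Hash, V>(
    map: &mut HashMap<K, V>,
    log: &mut Vec<Operation<K, V>>,
) {
    for operation in log.drain(..) {
        match operation {
            Operation::Insert(key, value) => {
                map.insert(key, value);
            }
            Operation::Remove(key) => {
                map.remove(&key);
            }
        }
    }
}
```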
`publish`:
1. Acquire the lock on the refcount array.
2. Swap out the value of the field storing the writable map with the old map.
3. Call `swap_maps` on each refcount in the array, and (non-atomically) accumulate a sum of the returned residual counts. Call this sum `initial_residual`.
4. Release the lock on the refcount array.
5. Atomically add `initial_residual` to the actual `residual` counter the readers will be decrementing. If there were no residual readers, or they finished reading while we were performing step 3, then the residual count is now 0, signaling that we don't need to park on the next call to `synchronize`.
Note that read handles are not all swapped to the new map at the same time; this is done one-by-one. A sketch of `publish` is shown below.
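The following sketch follows the steps above. The `Writer` type, field names, and memory orderings are assumptions; the refcount flip reuses the `MAP_INDEX_FLAG` addition shown earlier, and the real crate stores pointers to the per-handle refcounts rather than owning them in the slab.

```rust
use std::sync::atomic::{AtomicIsize, AtomicUsize, Ordering};
use std::sync::Mutex;

use slab::Slab;

const MAP_INDEX_FLAG: usize = 1 << (usize::BITS - 1);

struct Writer {
    // One refcount per read handle.
    refcounts: Mutex<Slab<AtomicUsize>>,
    // The counter the residual readers decrement.
    residual: AtomicIsize,
    // Which of the two maps the writer currently owns.
    writable_map_index: usize,
}

impl Writer {
    fn publish(&mut self) {
        // 1. Lock the refcount array so no handles are added or removed.
        let refcounts = self.refcounts.lock().unwrap();

        // 2. The map that was readable becomes the writable one.
        self.writable_map_index ^= 1;

        // 3. Flip each refcount's map-index bit (the `swap_maps` shown
        //    earlier) and sum the guards still reading the old map at the
        //    moment of each swap.
        let mut initial_residual: isize = 0;
        for (_, refcount) in refcounts.iter() {
            let old = refcount.fetch_add(MAP_INDEX_FLAG, Ordering::Relaxed);
            initial_residual += (old & !MAP_INDEX_FLAG) as isize;
        }

        // 4. Release the lock on the refcount array.
        drop(refcounts);

        // 5. Tell the residual counter how many readers still have to leave
        //    the old map. Readers that already left have pre-decremented the
        //    counter, so the result may already be zero.
        self.residual.fetch_add(initial_residual, Ordering::AcqRel);
    }
}
```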
An important invariant of this algorithm is that `residual == 0` whenever `publish` is called. That way, either the writing thread will see `residual == 0` after swapping all the maps, or one of the residual readers will see `residual & isize::MAX == 1` as the old value when it performs its atomic decrement. In either case, this provides a definite signal that there are no more readers lingering on the old map.
Creating a read guard corresponds to an atomic increment, dropping a read guard corresponds to an atomic decrement, and at most one reader will execute the logic to unpark the writer. So overall, the readers' critical path is wait-free when this crate is compiled for an architecture that supports wait-free atomic fetch-adds; even when it isn't, or when the unparking logic is engaged, those operations are still fairly cheap compared to hash map operations.
Creating a write guard involves executing the `synchronize` + start write algorithm and then flushing the operations from the previous write. When the write guard is dropped, the maps are swapped, but no modifications to either map are made until the next time a write guard is created. This means that (after the first write) the maps are technically always out of sync, but this is by design. If we assume that there is some non-negligible amount of time between writes, and that readers don't have large critical sections, then the writer will almost never have to wait. The time between writes gives the residual readers a chance to move to the new map, so when the writer goes to write again, it won't have to wait for them.