Dynamic Batching #261
base: branch-24.12
Conversation
Current progress note

At the moment, all PRs upon which this PR depends are merged; there are no pending fixes/blockers in cuVS or raft. There's one known bug that leads to a deadlock in rare cases.
For the record, the list of identified benchmark problems that require further study:
  return index.runner->search(res, params, queries, neighbors, distances); \
}

CUVS_INST_DYNAMIC_BATCHING_INDEX(float, uint32_t, cuvs::neighbors::cagra, index<float, uint32_t>);
It's really unfortunate that we'll need to instantiate these individually for each index type. For example, Vamana is not included here. Is there any way we can remove this constraint? Can we just tie this to the search_params super class?
I agree, and I see a couple of solutions to this.
One is to go with class-based polymorphism.
Then we'd have to make the search parameters neighbors::search_params and the index type neighbors::index virtual by adding a virtual destructor. We would also need a virtual clone() method, so we can copy implementation search parameters via the base class. This goes slightly against our initial design of keeping the search parameters a POD. It also means it would be dangerous to pass the search parameters struct to kernels (but I think we haven't been doing this so far).
Then we would also need to add a virtual search method to the index (and also dim(), which is currently used by the dynamic batching), which goes slightly against our initial design of having search/build functions as plain functions.
Then there would be only one, non-templated dynamic_batching constructor taking the abstract upstream index and search parameters.
Another solution is to go with template-based polymorphism.
I could define a template constructor in the public header file (similar to what I have in the dynamic_batching_test at the moment). It would take the index, search params, and the search function as the template parameters, so that users can instantiate it on their side. I think I would have to slightly rework the detail::batch_runner and expose at least one of its constructors in the public header for that.
This obviously goes against our design of not instantiating anything on the user side (but it doesn't involve any CUDA-specific code and should be fast).
I think the template-based solution would be slightly less disruptive to cuVS in general, but I also think we should probably take this to a follow-on PR and a separate discussion for 25.02.
std::shared_ptr<detail::batch_runner<T, IdxT>> runner;

/**
 * @brief Construct a dynamic batching index by wrapping the upstream index.
Please include a usage example and add this to the API docs.
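Something along these lines could work for the docs. This is only a sketch: the exact constructor and function signatures are inferred from this PR, and `dataset`, `queries`, `neighbors`, and `distances` are assumed to be preallocated device mdspans.

```cpp
#include <cuvs/neighbors/cagra.hpp>
#include <cuvs/neighbors/dynamic_batching.hpp>

raft::resources res;

// Build the upstream index as usual.
cuvs::neighbors::cagra::index_params upstream_build_params;
auto upstream = cuvs::neighbors::cagra::build(res, upstream_build_params, dataset);

// Wrap it into a dynamic batching index; the upstream search params are
// fixed at wrapping time.
cuvs::neighbors::cagra::search_params upstream_search_params;
cuvs::neighbors::dynamic_batching::index_params dynb_params;
cuvs::neighbors::dynamic_batching::index<float, uint32_t> batched{
  res, dynb_params, upstream, upstream_search_params};

// Search as with any other index; the call is stream-ordered, so the
// results are ready only after syncing the stream in `res`.
cuvs::neighbors::dynamic_batching::search_params dynb_search_params;
cuvs::neighbors::dynamic_batching::search(
  res, dynb_search_params, batched, queries, neighbors, distances);
raft::resource::sync_stream(res);
```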
Please also explain how we know when the batch is finished. Is it just a sync with the stream in res?
const cuvs::neighbors::filtering::base_filter* sample_filter = nullptr);
};

void search(raft::resources const& res,
Please document these functions. I know it seems redundant, but it's important that users can look these functions up in the docs. Also, please add usage examples.
@@ -138,6 +138,10 @@ using list_data = ivf::list<list_spec, SizeT, ValueT, IdxT>;
 */
template <typename T, typename IdxT>
struct index : cuvs::neighbors::index {
  using index_params_type = ivf_flat::index_params;
Please also include the other index types (e.g. bfknn, vamana, etc...).
NAME
  NEIGHBORS_DYNAMIC_BATCHING_TEST
PATH
  neighbors/dynamic_batching/test_cagra.cu
Please include the other index types here as well (e.g. bfknn, vamana, etc...).
I have reviewed the implementation details. Thanks Artem for the additional documentation, overall the code looks great.
To achieve high throughput and low latency, one has to watch out for intricate details of queuing and synchronization, which makes the implementation complex. I have left a few comments that request additional explanation and suggest potential refactoring to make the logic easier to follow.
  param.dynamic_batching_n_queues = conf.at("dynamic_batching_n_queues");
}
param.dynamic_batching_k =
  uint32_t(uint32_t(conf.at("k")) * float(conf.value("refine_ratio", 1.0f)));
What happens if there is an inconsistency between this build param and the regular search param k? Do we throw a reasonable error message?
if (batch_sizes_.has_value()) { batch_sizes_.value()(i).store(0, kMemOrder); }
dispatch_sequence_id_[i].store(uint32_t(-1), kMemOrder);
tokens_(i).store(
  batch_token{static_cast<uint32_t>(slot_state::kEmpty) * kSize + kCounterLocMask},
Could you explain what the token values represent? Do we have any hidden assumptions on how kSize of the batch queue relates to these state values like kEmpty?
private:
  cuda::atomic<int32_t, cuda::thread_scope_system>* cpu_provided_remaining_time_us_;
  uint64_t timestamp_ns_ = 0;
  int32_t local_remaining_time_us_ = std::numeric_limits<int32_t>::max();
Does this initialization practically mean that the time is counted from the first call to has_time(), because only then is the local_remaining_time_us_ variable set to the CPU-provided value?
}

private:
  raft::resources res_;  // Sic! Store by value to copy the resource.
This is unusual. Why do we need to copy?
 *
 * The CPU threads atomically increment this counter until its size reaches `max_batch_size`.
 *
 * Any (CPU or GPU thread) my atomically write to the highest byte of this value, which indicates
May?
Suggested change:
- * Any (CPU or GPU thread) my atomically write to the highest byte of this value, which indicates
+ * Any (CPU or GPU thread) may atomically write to the highest byte of this value, which indicates
const auto seq_id = batch_queue_.head();
const auto commit_result = try_commit(seq_id, n_queries);
// The bool (busy or not) returned if no queries were committed:
if (std::holds_alternative<bool>(commit_result)) {
  // Pause if the system is busy
  // (otherwise the progress is guaranteed due to update of the head counter)
  if (std::get<bool>(commit_result)) { to_commit.wait(); }
  continue;  // Try to get a new batch token
}
I am unaware of the intricacies of how the work is queued, but it seems that we are doing queue state management at multiple levels: head() is checking the tail position and potentially waits, try_commit() is checking batch_status and maybe commits, maybe not, and here in the loop we are checking the status, potentially waiting, and trying again.
To keep the code simple, it would be great if try_commit not just tried, but actually committed, by moving this logic there.
But if there is a good reason to organize the logic this way, that could also be fine; after all, this is an implementation detail.
// The interpretation of the token status depends on the current seq_order_id and a similar
// counter in the token. This is to prevent conflicts when too many parallel requests wrap
// over the whole ring buffer (batch_queue_t).
token_status = batch_queue::batch_status(batch_token_observed, seq_id);
// Busy status means the current thread is a whole ring buffer ahead of the token.
// The thread should wait for the rest of the system.
if (token_status == slot_state::kFullBusy || token_status == slot_state::kEmptyBusy) {
  return true;
}
// This branch checks if the token was recently filled or dispatched.
// This means the head counter of the ring buffer is slightly outdated.
if (token_status == slot_state::kEmptyPast || token_status == slot_state::kFullPast ||
    batch_token_observed.size_committed() >= max_batch_size_) {
  batch_queue_.pop(seq_id);
  return false;
}
batch_token_updated = batch_token_observed;
batch_token_updated.size_committed() =
  std::min(batch_token_observed.size_committed() + n_queries, max_batch_size_);
Does the user of the queue have to be aware of all the possible states? Can't we hide this as an implementation detail of the queue? In other words, could we have a head() function which simply returns a valid slot, and move these state-comparison details into the queue?
local_waiter till_full{std::chrono::nanoseconds(size_t(params.dispatch_timeout_ms * 1e5)),
                       batch_queue_.niceness(seq_id)};
while (batch_queue::batch_status(batch_token_observed, seq_id) != slot_state::kFull) {
  /* Note: waiting for batch IO buffers
  The CPU threads can commit to the incoming batches in the queue in advance (this happens in
  try_commit).
  In this loop, a thread waits for the batch IO buffer to be released by a running search on
  the GPU side (scatter_outputs kernel). Hence, this loop is engaged only if all buffers are
  currently used, which suggests that the GPU is busy (or there's not enough IO buffers).
  This also means the current search is not likely to meet the deadline set by the user.

  The scatter kernel returns its buffer id into an acquired slot in the batch queue; in this
  loop we wait for that id to arrive.

  Generally, we want to waste as little as possible CPU cycles here to let other threads wait
  on dispatch_sequence_id_ref below more efficiently. At the same time, we shouldn't use
  `.wait()` here, because `.notify_all()` would have to come from GPU.
  */
  till_full.wait();
  batch_token_observed = batch_token_ref.load(cuda::std::memory_order_acquire);
}
Can this be moved to a helper function of batch_queue, to keep this state checking an internal detail of the queue?
/* The remaining time may be updated on the host side: a thread with a tighter deadline may reduce
   it (but not increase). */
cuda::atomic<int32_t, cuda::thread_scope_system>* remaining_time_us,
/* The token contains the current number of queries committed and is cleared in this kernel. */
What does committed mean here? And what is fully_committed?
&params = search_params_dynb,
index = index_dynb.value(),
query_view = raft::make_device_matrix_view<data_type, int64_t>(
  queries->data_handle() + i * ps.dim, 1, ps.dim),
Here we submit one query at a time, right? Could we test with more than one query as well?
Non-blocking / stream-ordered dynamic batching as a new index type.
API
This PR implements dynamic batching as a new index type, mirroring the API of other indices.
Feature: stream-ordered dynamic batching
Non-blocking / stream-ordered dynamic batching means the batching does not involve synchronizing with a GPU stream. The control is returned to the user as soon as the necessary work is submitted to the GPU. This entails a few good-to-know features:
Overall, stream-ordered dynamic batching makes it easy to modify existing cuVS indexes, because the wrapped index has the same execution behavior as the upstream index.
Work-in-progress TODO
- cpp/include/cuvs/neighbors/dynamic_batching.hpp [ready for review CC @cjnolet]
- cpp/src/neighbors/detail/dynamic_batching.cuh [ready for preliminary review: requests for algorithm docstring/clarifications are especially welcome]