bug fix: variable number of max decode tokens within batch #73
Conversation
Signed-off-by: Yannick Schnider <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
LGTM (Only two extremely minor comments).
Thanks for (a) finding this bug, (b) the clean + elegant fix and (c) writing the tests so we don't accidentally introduce this again in the future.
ignore_eos=False)
vllm_sampling_params = [vllm_sampling_params_normal] * 3
max_new_tokens = [max_new_tokens_warmup] * 3
minor, but is it really necessary to construct max_new_tokens separately? Couldn't we just access it from the sampling params (e.g. sampling_params.max_new_tokens)?
this is only for the hf model evaluation. We don't pass any sampling parameters to generate_hf_output(), just max_new_tokens... I could rename max_new_tokens to hf_max_new_tokens to make this more clear?
# number of added padding sequences to fill
# batch to warmed up batch size
self.num_padded_sequences = 0
# indices: True unfinished, False for finished or padded sequence
the comment suggests on first reading that indices is a boolean flag, but I guess it is a list of booleans or a tensor?
Yes, it is a boolean tensor with True for unfinished and False for finished or padded sequences. I will update the comment to make this clearer. Thanks
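A minimal sketch of the semantics described above, using plain Python lists for illustration (the actual implementation stores a boolean tensor; the names and shapes here are assumptions, not the project's code):

```python
# Hypothetical sketch: "indices" as a per-sequence boolean mask over the
# warmed-up (padded) batch. True = still decoding, False = finished or a
# padding slot. All names below are illustrative.
indices = [True, False, True, False]  # batch of 4, 2 sequences still decoding

# Per-sequence position ids for the whole padded batch; keep only the rows
# belonging to unfinished sequences:
positions = [[0, 1, 2], [0, 1], [0, 1, 2, 3], [0]]
active_positions = [p for p, keep in zip(positions, indices) if keep]

assert len(active_positions) == sum(indices)  # -> 2 active sequences
```

Selecting with the mask, rather than compacting the batch, keeps every surviving sequence at its original batch index.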
bug fix: variable number of max decode tokens within batch (IBM#73)

This PR fixes a previously unidentified bug and adds pytests for validation.

Changes:
- introduced SpyreCausalLM.indices containing a mask indicating the unfinished sequences in the current batch -> commit
- adapted hf and vllm test utilities to accept a different number of max decoding tokens for sequences within the same batch -> commit

Bug description:
Having a different number of requested output tokens within the same batch will lead to some sequences being removed from the batch while others are still decoding. Previously the code did not take into account the offset a removed sequence introduces in the positions (ids) and (attention) masks. This error remains undetected if all prompts are of the same length (they will have the same position ids and attention masks) or if the last sequence in a batch always finishes early (the offset at the end will not affect sequences with smaller indices within the same batch).

bug example: [screenshot in original PR]