
bug fix: variable number of max decode tokens within batch #73

Merged: 5 commits merged into main on Feb 4, 2025

Conversation

@yannicks1 yannicks1 (Contributor) commented Jan 31, 2025

This PR fixes a previously unidentified bug and adds pytests for validation.

Changes:

  • addressing the logic error described below by introducing SpyreCausalLM.indices, a mask marking the unfinished sequences in the current batch -> commit
  • adapting the generation functions in tests/spyre/spyre_util.py for HF and vLLM to accept a different number of max decode tokens per sequence within the same batch -> commit
  • adding tests/spyre/test_spyre_max_new_tokens.py to validate functionality when some sequences in a batch finish decoding before others -> commit

Bug description:

Requesting a different number of output tokens for sequences within the same batch leads to some sequences being removed from the batch while others are still decoding. Previously, the code did not account for the offset that a removed sequence introduces into the position ids and attention masks. The error remains undetected if all prompts have the same length (they then share the same position ids and attention masks) or if it is always the last sequence in the batch that finishes early (an offset at the end does not affect sequences with smaller indices within the same batch).

Bug example: (screenshot attached to the PR, 2025-01-31)
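
To illustrate the idea behind the fix, here is a minimal, hypothetical sketch (shapes, values and variable names are illustrative and not taken from the PR): finished sequences keep their slot in the warmed-up batch and are tracked with a boolean mask, so position ids and attention masks stay aligned with their requests instead of being shifted by removals.

import torch

batch_size, seq_len = 4, 8
position_ids = torch.arange(seq_len).repeat(batch_size, 1)            # [4, 8]
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.bool)    # [4, 8]

# True = still decoding, False = finished or padded sequence
indices = torch.tensor([True, False, True, True])

# One decode step: every row is extended, so row i always belongs to request i,
# but finished rows contribute no new attention and their logits are ignored.
next_positions = (position_ids[:, -1] + 1).unsqueeze(1)               # [4, 1]
position_ids = torch.cat([position_ids, next_positions], dim=1)       # [4, 9]
attention_mask = torch.cat([attention_mask, indices.unsqueeze(1)], dim=1)

# Sampling then only considers the unfinished sequences, e.g.: logits[indices]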


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI will run, covering a small and essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@tdoublep tdoublep (Member) left a comment

LGTM (Only two extremely minor comments).

Thanks for (a) finding this bug, (b) the clean + elegant fix and (c) writing the tests so we don't accidentally introduce this again in the future.

ignore_eos=False)

vllm_sampling_params = [vllm_sampling_params_normal] * 3
max_new_tokens = [max_new_tokens_warmup] * 3
tdoublep (Member) commented:
Minor, but is it really necessary to construct max_new_tokens separately? Couldn't we just access it from the sampling params (e.g. sampling_params.max_new_tokens)?

yannicks1 (Contributor, Author) replied:
This is only for the HF model evaluation. We don't pass any sampling parameters to generate_hf_output(), just max_new_tokens... I could rename max_new_tokens to hf_max_new_tokens to make this clearer?
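
For context, a hypothetical sketch of how the two paths could be driven with per-sequence token budgets (the values and the vLLM helper name are placeholders; only generate_hf_output is named in the discussion above):

from vllm import SamplingParams

max_new_tokens = [20, 5, 20]   # the middle sequence finishes decoding early
vllm_sampling_params = [
    SamplingParams(max_tokens=n, temperature=0.0, ignore_eos=False)
    for n in max_new_tokens
]
# vLLM path: the limits travel inside the sampling params
# vllm_results = generate_vllm_output(..., sampling_params=vllm_sampling_params)
# HF reference path: only the raw per-sequence budgets are passed
# hf_results = generate_hf_output(..., max_new_tokens=max_new_tokens)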

# number of added padding sequences to fill
# batch to warmed up batch size
self.num_padded_sequences = 0
# indices: True unfinished, False for finished or padded sequence
tdoublep (Member) commented:
The comment suggests on first reading that indices is a single boolean flag, but I guess it is actually a list of booleans or a tensor?

yannicks1 (Contributor, Author) replied:
Yes, it is a boolean tensor with True for unfinished and False for finished or padded sequences. I will update the comment to make this clearer. Thanks
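
An illustrative example (not the PR's code) of what such a mask might look like for a warmed-up batch of size 4 holding three real requests, one of which has already finished, plus one padding sequence:

import torch

# slot 0: decoding, slot 1: finished, slot 2: decoding, slot 3: padding
indices = torch.tensor([True, False, True, False])

num_unfinished = int(indices.sum())   # 2 sequences still produce tokens
# only the unfinished rows take part in sampling, e.g.: logits[indices]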

@yannicks1 yannicks1 merged commit 938fea3 into main Feb 4, 2025
10 checks passed
dpatel-ops added a commit to dpatel-ops/vllm that referenced this pull request Feb 6, 2025
bug fix: variable number of max decode tokens within batch (IBM#73)