bug fix: variable number of max decode tokens within batch #73
Conversation
Signed-off-by: Yannick Schnider <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
LGTM (Only two extremely minor comments).
Thanks for (a) finding this bug, (b) the clean + elegant fix and (c) writing the tests so we don't accidentally introduce this again in the future.
ignore_eos=False)
vllm_sampling_params = [vllm_sampling_params_normal] * 3
max_new_tokens = [max_new_tokens_warmup] * 3
minor, but is it really necessary to construct max_new_tokens separately? Couldn't we just access it from the sampling params (e.g. sampling_params.max_new_tokens)?
this is only for the hf model evaluation. We don't pass any sampling parameters to generate_hf_output(), just max_new_tokens... I could rename max_new_tokens to hf_max_new_tokens to make this more clear?
# number of added padding sequences to fill
# batch to warmed up batch size
self.num_padded_sequences = 0
# indices: True unfinished, False for finished or padded sequence
the comment suggests on first reading that indices is a boolean flag, but I guess it is a list of booleans or a tensor?
Yes, it is a boolean tensor with True for unfinished and False for finished or padded sequences. I will update the comment to make this clearer. Thanks
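A minimal sketch of the semantics described above, using plain Python lists for illustration (the actual implementation stores a boolean tensor; the names and shapes here are assumptions, not the project's code):

```python
# Hypothetical sketch: "indices" as a per-sequence boolean mask over the
# warmed-up (padded) batch. True = still decoding, False = finished or a
# padding slot. All names below are illustrative.
indices = [True, False, True, False]  # batch of 4, 2 sequences still decoding

# Per-sequence position ids for the whole padded batch; keep only the rows
# belonging to unfinished sequences:
positions = [[0, 1, 2], [0, 1], [0, 1, 2, 3], [0]]
active_positions = [p for p, keep in zip(positions, indices) if keep]

assert len(active_positions) == sum(indices)  # -> 2 active sequences
```

Selecting with the mask, rather than compacting the batch, keeps every surviving sequence at its original batch index.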
bug fix: variable number of max decode tokens within batch (IBM#73)

This PR fixes a previously unidentified bug and adds pytests for validation.

Changes:
- introduced SpyreCausalLM.indices containing a mask indicating the unfinished sequences in the current batch -> commit
- adapted hf and vllm test utilities to accept a different number of max decoding tokens for sequences within the same batch -> commit

Bug description:
Having a different number of requested output tokens within the same batch will lead to some sequences being removed from the batch while others are still decoding. Previously the code did not take into account the offset a removed sequence introduces in the positions (ids) and (attention) masks. This error remains undetected if all prompts are of the same length (they will have the same position ids and attention masks) or if the last sequence in a batch always finishes early (the offset at the end will not affect sequences with smaller indices within the same batch).

bug example: [screenshot in original PR]