
[Bugfix] Command-R Max Model Length #3727

Merged · 4 commits into vllm-project:main · Mar 29, 2024

Conversation

ywang96 (Member) commented Mar 29, 2024

Currently, the max context window for CohereForAI/c4ai-command-r-v01 is not defined by max_position_embeddings but by a special model_max_length key instead. This has been discussed in these two threads: 1, 2.

We still use max_position_embeddings as the default max_model_len due to memory concerns, but when the user specifies a value higher than max_position_embeddings and lower than or equal to model_max_length, we now allow it to go through.
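
Roughly, the intended resolution logic is the sketch below. This is a standalone approximation for illustration only; the function name, argument names, and structure are made up and are not the actual diff in this PR.

```python
from types import SimpleNamespace


def resolve_max_model_len(hf_config, user_specified_len=None):
    """Rough sketch of the intended behavior; not vLLM's actual implementation."""
    # Default context window: max_position_embeddings (8192 for Command-R).
    default_len = getattr(hf_config, "max_position_embeddings", None)
    # True context window, when the config provides it (131072 for Command-R).
    hard_limit = getattr(hf_config, "model_max_length", None)

    if user_specified_len is None:
        # Keep the conservative default so we don't allocate KV cache for 128k tokens.
        return default_len

    upper_bound = hard_limit if hard_limit is not None else default_len
    if user_specified_len > upper_bound:
        raise ValueError(
            f"User-specified max_model_len ({user_specified_len}) is greater "
            f"than the derived max_model_len ({upper_bound}).")

    # Values above max_position_embeddings but within model_max_length are
    # allowed, at the user's own risk.
    return user_specified_len


# Command-R-style config values, for illustration:
cfg = SimpleNamespace(max_position_embeddings=8192, model_max_length=131072)
print(resolve_max_model_len(cfg))          # 8192 (default)
print(resolve_max_model_len(cfg, 131072))  # 131072 (allowed)
```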

This PR fixes #3676

cc @saurabhdash I'm not sure if there's a cleaner/better way to do this but please take a look.

@pseudotensor

cool thanks!

ywang96 requested a review from youkaichao on Mar 29, 2024 at 08:39
@saurabhdash

> Currently, the max context window for CohereForAI/c4ai-command-r-v01 is not defined by max_position_embeddings but by a special model_max_length key instead. […]

Just so that I understand correctly, this allows people to increase the context length up to 128k and throws a warning if it's larger?

esmeetu (Collaborator) commented Mar 29, 2024

A quick question: why not just change the model's max_position_embeddings value in the config.json file?

@pseudotensor

@saurabhdash It's not a warning, it's a fatal raise.

pseudotensor commented Mar 29, 2024

@esmeetu Because that's not normally done for any other models, and it isn't maintainable when pulling weights into a cached location that may be updated. It's also not correct, since RoPE scaling is not the same as just changing the embedding size, AFAIK.

simon-mo added the `release-blocker` label on Mar 29, 2024
ywang96 (Member, Author) commented Mar 29, 2024

> @pseudotensor That makes sense. But this PR is a little bit tricky, and model_max_length is not a common parameter across the open models. Furthermore, if we apply this, it will very likely trigger OOM errors in most users' environments, because a 128k context takes too much memory and seems unnecessary for them. For model cache convenience, I think you could fork that repo and tune the parameter as you need. @simon-mo WDYT?

Just to clarify - we still use 8192 as the default max_model_len if the user doesn't specify one. This PR really just allows users to go above that, up to the true context length, at their own risk.

simon-mo (Collaborator) left a comment


can you put up a quick manual test for this?

ywang96 (Member, Author) commented Mar 29, 2024

> can you put up a quick manual test for this?

@simon-mo Sure, here's a quick test script:

```python
from vllm import LLM
import sys

try:
    # Optionally take a user-specified max_model_len from the command line.
    if len(sys.argv) == 2:
        user_specified = int(sys.argv[-1])
    else:
        user_specified = None
    llm = LLM(model="CohereForAI/c4ai-command-r-v01",
              tensor_parallel_size=4,
              max_model_len=user_specified)
    # Report the max_model_len the engine actually derived.
    print(f"Max Length: {llm.llm_engine.model_config.max_model_len}")
except Exception as e:
    print(e)
```

On A100-80G:

- Default:

  ```
  INFO 03-29 17:41:43 model_runner.py:867] Graph capturing finished in 14 secs.
  Max Length: 8192
  ```

- Specifying max_model_len=131072 with TP4:

  ```
  INFO 03-29 17:51:17 ray_gpu_executor.py:240] # GPU blocks: 6158, # CPU blocks: 819
  The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (98528). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
  ```

- Specifying max_model_len=131072 with TP8:

  ```
  INFO 03-29 17:53:38 model_runner.py:867] Graph capturing finished in 14 secs.
  Max Length: 131072
  ```

- Specifying max_model_len=131073:

  ```
  User-specified max_model_len (131073) is greater than the derived max_model_len (max_position_embeddings=8192 or model_max_length=131072 in model's config.json). This may lead to incorrect model outputs or CUDA errors. Make sure the value is correct and within the model context size.
  ```
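
(Side note on the TP4 failure above: assuming vLLM's default KV-cache block size of 16 tokens, the 6158 GPU blocks reported in the log hold 6158 × 16 = 98528 tokens, which is below the requested 131072, hence the error; with TP8 there is enough GPU memory for the full 128k context.)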

There's one small "visual" bug I fixed: the variable max_len_key was used incorrectly.

simon-mo merged commit 97356f3 into vllm-project:main on Mar 29, 2024
34 checks passed
xjpang pushed a commit to xjpang/vllm that referenced this pull request Mar 31, 2024
Labels
release-blocker: This PR/issue blocks the next release, therefore deserves highest priority
5 participants