fix: Fix deprecated max_tokens param in openai ChatCompletionRequest #3122
base: main
Conversation
Force-pushed from 403eb6c to eed4b36.
Can you only update the backend part?
Many thanks for addressing #3098 so quickly. FYI, I've tested this branch and it does resolve the issue!
Force-pushed from eed4b36 to 91f7dba.
Updated. Removed the frontend part.
Force-pushed from 91f7dba to b48b900.
@mickqian I will take a look at this. Thanks!
# OpenAI does not support top_k, so we drop it here
if self.regex is not None:
    warnings.warn("Regular expression is not supported in the OpenAI backend.")
return {
    "max_tokens": self.max_new_tokens,
Wondering, what's the definition of `max_tokens` in non-chat models? And why do we not keep `max_tokens` for non-chat models in the backend?
- The definition of `max_tokens` in completion models is the number of tokens that can be generated in the completion, pretty much the same as `max_completion_tokens`.
- Yes, we should keep both params; updated.
Hey. For embedding models, `max_tokens` means the max sequence length that can be processed. What if the input exceeds that length? Should the sequence be truncated, or should an error be thrown? The same question applies to the chat model. Also, I personally think we should call them generation models and embedding models; that's what we typically call these models.
- For embedding models, I think both ways will do, depending on the situation. Providing an option sounds good too (a rough sketch of such an option follows below this comment).
- Yes, `completion model` and `chat completion model` fall into the category of `generation model`; I thought you were referring to `completion model`. `is_chat_model` is used to distinguish different generation models, if I'm correct. For sglang.lang, does it involve embedding models (or did I miss something)? If not, probably `is_chat_model` would suffice for generation models in the backend.
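For illustration only, a minimal sketch of the "provide an option" idea for over-long embedding inputs; the `truncate_long_input` helper and its `truncate` flag are hypothetical and not part of sglang:

```python
from typing import List


def truncate_long_input(tokens: List[int], max_tokens: int, truncate: bool = True) -> List[int]:
    """Hypothetical helper: clip over-long embedding inputs, or raise if truncation is disabled."""
    if len(tokens) <= max_tokens:
        return tokens
    if truncate:
        # Silently keep only the first max_tokens tokens.
        return tokens[:max_tokens]
    raise ValueError(
        f"Input has {len(tokens)} tokens, exceeding the embedding model limit of {max_tokens}."
    )
```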
Okay. This sounds good to me. But, ideally, I would want:
def to_openai_kwargs(self, is_chat_model):
You mean `is_chat_model` is an attribute of `class OpenAI(BaseBackend)` in `python/sglang/lang/backend/openai.py`, rather than an attribute of `class SglSamplingParams` in `python/sglang/lang/ir.py`? So this function would not be:
def to_openai_kwargs(self):
Right? That makes sense. But I would prefer the latter if we can.
Yes, it's supposed to be an internal property of a `BaseBackend` (as it directly describes the model), and passed to `SglSamplingParams` for generating the OpenAI-compatible request. While adding this field to `SglSamplingParams` does sound good in some cases, I personally reckon `SglSamplingParams` is meant to be model-unaware data, which can be sent to different backends, and the actual backend decides the final OpenAI request. Feel free to correct me.
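For readers following the thread, a minimal sketch of the design described above, with the backend owning `is_chat_model` and injecting it only when the OpenAI-compatible kwargs are built; the class names are simplified stand-ins, not the actual sglang code:

```python
class BackendSketch:
    """Simplified stand-in for OpenAI(BaseBackend); the backend knows the model type."""

    def __init__(self, is_chat_model: bool):
        self.is_chat_model = is_chat_model

    def build_request(self, params: "SamplingParamsSketch") -> dict:
        # The sampling params stay model-unaware; the backend decides the final field name.
        return params.to_openai_kwargs(self.is_chat_model)


class SamplingParamsSketch:
    """Simplified stand-in for SglSamplingParams."""

    def __init__(self, max_new_tokens: int):
        self.max_new_tokens = max_new_tokens

    def to_openai_kwargs(self, is_chat_model: bool) -> dict:
        key = "max_completion_tokens" if is_chat_model else "max_tokens"
        return {key: self.max_new_tokens}


# A chat backend produces max_completion_tokens, a completion backend max_tokens.
print(BackendSketch(is_chat_model=True).build_request(SamplingParamsSketch(64)))
print(BackendSketch(is_chat_model=False).build_request(SamplingParamsSketch(64)))
```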
I agree! Nice work.
Force-pushed from b48b900 to b80b526.
- I think we should add a comment in `protocol.py` (or anywhere appropriate) regarding the definition of `max_completion_tokens` and its difference from the previous `max_tokens`.
- Could you run the docs CI locally? Just `make compile` is enough. The docs CI is currently disabled due to the long queue time to compile on CI, but we should still run it locally.
$ jupyter nbconvert --to notebook --execute --inplace ./backend/openai_api_completions.ipynb \
--ExecutePreprocessor.timeout=600 \
--ExecutePreprocessor.kernel_name=python3 || exit 1;
...
AttributeError Traceback (most recent call last)
Cell In[9], line 58
52 batch_details = client.batches.retrieve(batch_id=batch_job.id)
54 print_highlight(
55 f"Batch job details (check {i+1} / {max_checks}) // ID: {batch_details.id} // Status: {batch_details.status} // Created at: {batch_details.created_at} // Input file ID: {batch_details.input_file_id} // Output file ID: {batch_details.output_file_id}"
56 )
57 print_highlight(
---> 58 f"<strong>Request counts: Total: {batch_details.request_counts.total} // Completed: {batch_details.request_counts.completed} // Failed: {batch_details.request_counts.failed}</strong>"
59 )
61 time.sleep(3)
AttributeError: 'NoneType' object has no attribute 'total'

Is it a known issue, or is the error from my side?
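For reference, a guard along these lines would avoid the traceback while the batch is still being validated; this is a sketch assuming the notebook's existing `client`, `batch_job`, and `print_highlight`, not necessarily the fix that landed:

```python
batch_details = client.batches.retrieve(batch_id=batch_job.id)

# request_counts may not be populated before the batch job starts running,
# so only format the counts once they are available.
counts = batch_details.request_counts
if counts is not None:
    print_highlight(
        f"<strong>Request counts: Total: {counts.total} // "
        f"Completed: {counts.completed} // Failed: {counts.failed}</strong>"
    )
else:
    print_highlight("Request counts not available yet.")
```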
Force-pushed from b80b526 to 82c817b.
Fixed, and …
Force-pushed from 82c817b to e9aa94a.
Will review it today. Stay tuned!
Good improvement. I am wondering, does OpenAI still have `max_tokens` for the chat API right now? For the chat completion API, there should only be `max_completion_tokens`. But I don't know what the chat API uses.
docs/Makefile (Outdated)
@@ -14,7 +14,7 @@ help:

 # New target to compile Markdown and Jupyter Notebook files
 compile:
-	find $(SOURCEDIR) -path "*/_build/*" -prune -o -name "*.ipynb" -print | while read nb; do \
+	find $(SOURCEDIR) -path "*/$(BUILDDIR)/*" -prune -o -name "*.ipynb" -print | while read nb; do \
Why do we change this? Do we need to set `$(BUILDDIR)`?
It has already been declared in line 9:
BUILDDIR = _build
@@ -295,7 +295,12 @@ class ChatCompletionRequest(BaseModel):
    logit_bias: Optional[Dict[str, float]] = None
    logprobs: bool = False
    top_logprobs: Optional[int] = None
    # The maximum number of tokens that can be generated in the chat completion.
    # non-chat-completion models only
- Make it a full sentence: "Non-chat-completion models only have max tokens."
- So chat-completion models count `max_completion_tokens`, and chat models (not completion models) count `max_tokens`, right?
- There's a nuanced difference between "non-chat-completion models only" and "Non-chat-completion models only have max tokens", I'm afraid. Changed to "Only available for non-chat-completion models", is that ok?
- Yes. To be more specific, in OpenAI's legacy completion API (non-chat-completion models only), they only have `max_tokens`. In their chat-completion API, they have both params (see the sketch below), but:

[screenshot of the OpenAI API docs omitted]
    # An upper bound for the number of tokens that can be generated for a completion, including visible output tokens and reasoning tokens.
    # Almost the same as `max_tokens`, but for chat-completion models only
    max_completion_tokens: Optional[int] = None
These two definitions look strange. To me, I would say:
- # The maximum number of total tokens in a chat request. Note that input tokens are included.
- # The maximum number of completion tokens for a chat completion request, including visible output tokens and reasoning tokens. But input tokens are not included.
Fixed.
Replied here.
Force-pushed from 10ce5fd to db732ae.
@@ -325,6 +330,14 @@ class ChatCompletionRequest(BaseModel):
    lora_path: Optional[Union[List[Optional[str]], Optional[str]]] = None
    session_params: Optional[Dict] = None

    def get_max_output_tokens(self) -> int:
Should it be `def get_max_output_tokens(self) -> int | None:`? And from the code change, this function feels unnecessary. Does `request.max_completion_tokens or request.max_tokens` not work?
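For context, a sketch of what the helper with an `Optional` return could look like; simplified, not the exact code in the PR:

```python
from typing import Optional

from pydantic import BaseModel


class ChatCompletionRequestSketch(BaseModel):
    """Trimmed-down stand-in for the real ChatCompletionRequest."""

    max_tokens: Optional[int] = None
    max_completion_tokens: Optional[int] = None

    def get_max_output_tokens(self) -> Optional[int]:
        # Prefer the newer chat-completion field and fall back to the legacy one;
        # returns None when the client specified neither.
        return self.max_completion_tokens or self.max_tokens
```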
@mickqian This should be rebased. Thanks 😂
Force-pushed from 5d645ed to 473fa26.
Replace it with a newer one: max_completion_tokens
@shuaills Could you review this? Thanks!
Motivation
Address #3098
Modifications
Replace the deprecated `max_tokens` param with a newer one: `max_completion_tokens`, according to here.
Checklist