
fix: Fix deprecated max_tokens param in openai ChatCompletionRequest #3122

Open

mickqian wants to merge 2 commits into main from fix-max-tokens

Conversation

mickqian
Contributor

@mickqian mickqian commented Jan 25, 2025

Motivation

Address #3098

Modifications

  1. Replace the deprecated max_tokens param with the newer max_completion_tokens, per the OpenAI API reference (screenshot below; a field-level sketch follows).
[screenshot: OpenAI API reference marking max_tokens as deprecated in favor of max_completion_tokens]
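
For context, a minimal sketch of the two fields involved on the ChatCompletionRequest model (a Pydantic model, per the diff further down; the defaults and surrounding code here are simplified assumptions, not the exact patch):

from typing import Optional

from pydantic import BaseModel


class ChatCompletionRequest(BaseModel):
    # Deprecated by OpenAI for the chat-completions API; kept for backward compatibility.
    max_tokens: Optional[int] = None
    # Newer parameter: an upper bound on the tokens generated for a chat completion,
    # including visible output tokens and reasoning tokens.
    max_completion_tokens: Optional[int] = None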


@merrymercy
Contributor

Can you only update the adapter and not change other parts?

@ywang96
Contributor

ywang96 commented Jan 26, 2025

Many thanks for addressing #3098 so quickly. FYI, I've tested this branch and it does resolve the issue!

@mickqian
Contributor Author

Can you only update the adapter and not change other parts?

Updated. Removed the frontend part.

@zhaochenyang20
Collaborator

@mickqian I will take a look at this. Thanks!

# OpenAI does not support top_k, so we drop it here
if self.regex is not None:
    warnings.warn("Regular expression is not supported in the OpenAI backend.")
return {
    "max_tokens": self.max_new_tokens,

Collaborator

Wondering what the definition of max_tokens is for a non-chat model? And why don't we keep max_tokens for non-chat models in the backend?

Contributor Author

@mickqian mickqian Feb 3, 2025

  1. The definition of max_tokens for completion models is the maximum number of tokens that can be generated in the completion, pretty much the same as max_completion_tokens.
  2. Yes, we should keep both params; updated.

Collaborator

Hey. For embedding models, max_tokens means the max sequence length that can be processed. What if the input exceeds that length? Should the sequence be truncated, or should we throw an error? The same question applies to the chat model. Also, I personally think we should call them generation models and embedding models; that's what we typically call these models.

Contributor Author

@mickqian mickqian Feb 3, 2025

  1. For embedding models, I think either way will do, depending on the situation. Providing an option sounds good too.
  2. Yes, completion models and chat-completion models both fall into the category of generation models; I thought you were referring to completion models. The is_chat_model flag is used to distinguish between different generation models, if I'm correct. For sglang.lang, does it involve embedding models (or did I miss something)? If not, is_chat_model would probably suffice for generation models in the backend.

Collaborator

Okay, this sounds fine to me. But ideally I would write:

def to_openai_kwargs(self, is_chat_model):

You mean is_chat_model is an attribute of class OpenAI(BaseBackend) in python/sglang/lang/backend/openai.py rather than an attribute of class SglSamplingParams in python/sglang/lang/ir.py?

So this function is not:

def to_openai_kwargs(self):

Right? That makes sense, but I prefer the latter if we can.

Contributor Author

Yes, it's supposed to be an internal property of a BaseBackend (as it directly describes the model), and it's passed to SglSamplingParams when generating the OpenAI-compatible request.
While adding this field to SglSamplingParams does sound good in some cases, I personally reckon SglSamplingParams is meant to be model-unaware data that can be sent to different backends, letting the actual backend decide the final OpenAI request. Feel free to correct me.
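
A minimal sketch of that split of responsibilities, with the backend owning the flag and SglSamplingParams staying model-unaware (the field names and defaults below are simplified assumptions, not the PR's actual code):

from dataclasses import dataclass


@dataclass
class SglSamplingParams:
    max_new_tokens: int = 128
    temperature: float = 1.0
    top_p: float = 1.0

    def to_openai_kwargs(self, is_chat_model: bool) -> dict:
        # The backend passes in whether it serves a chat model; the params object
        # itself does not need to know which backend will consume it.
        kwargs = {"temperature": self.temperature, "top_p": self.top_p}
        if is_chat_model:
            kwargs["max_completion_tokens"] = self.max_new_tokens
        else:
            kwargs["max_tokens"] = self.max_new_tokens
        return kwargs


# The OpenAI backend would then call something like:
# sampling_params.to_openai_kwargs(self.is_chat_model)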

Collaborator

I agree! Nice work.

Collaborator

@zhaochenyang20 zhaochenyang20 left a comment

  1. I think we should add a comment in protocol.py (or wherever appropriate) describing the definition of max_completion_tokens and how it differs from the previous max_tokens.

  2. Could you run the docs CI locally? Just make compile is enough. The docs CI is currently disabled due to the long queue time to compile on CI, but we should run it locally.

@mickqian
Contributor Author

mickqian commented Feb 4, 2025

  1. I think we should add a comment in protocol.py (or wherever appropriate) describing the definition of max_completion_tokens and how it differs from the previous max_tokens.
  2. Could you run the docs CI locally? Just make compile is enough. The docs CI is currently disabled due to the long queue time to compile on CI, but we should run it locally.

make compile failed even on the main branch:

$ jupyter nbconvert --to notebook --execute --inplace ./backend/openai_api_completions.ipynb \
                                --ExecutePreprocessor.timeout=600 \
                                --ExecutePreprocessor.kernel_name=python3 || exit 1; 
                                
...
AttributeError                            Traceback (most recent call last)
Cell In[9], line 58
     52 batch_details = client.batches.retrieve(batch_id=batch_job.id)
     54 print_highlight(
     55     f"Batch job details (check {i+1} / {max_checks}) // ID: {batch_details.id} // Status: {batch_details.status} // Created at: {batch_details.created_at} // Input file ID: {batch_details.input_file_id} // Output file ID: {batch_details.output_file_id}"
     56 )
     57 print_highlight(
---> 58     f"<strong>Request counts: Total: {batch_details.request_counts.total} // Completed: {batch_details.request_counts.completed} // Failed: {batch_details.request_counts.failed}</strong>"
     59 )
     61 time.sleep(3)

AttributeError: 'NoneType' object has no attribute 'total'

Is it a known issue, or is the error from my side?
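
For what it's worth, a simple guard around request_counts would sidestep that traceback in the notebook cell (a sketch only, reusing the cell's existing batch_details and print_highlight; not necessarily how it was fixed):

counts = batch_details.request_counts
if counts is not None:
    print_highlight(
        f"<strong>Request counts: Total: {counts.total} // "
        f"Completed: {counts.completed} // Failed: {counts.failed}</strong>"
    )
else:
    print_highlight("<strong>Request counts not reported yet for this batch</strong>")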

@mickqian
Contributor Author

mickqian commented Feb 4, 2025

make compile failed even on the main branch (AttributeError: 'NoneType' object has no attribute 'total'). Is it a known issue, or is the error from my side?

Fixed, and make compile test passed locally.

@zhaochenyang20
Collaborator

Will review it today. Stay tuned!

Collaborator

@zhaochenyang20 zhaochenyang20 left a comment

Good improvement. I am wondering: does OpenAI still have max_tokens for the chat API right now? For the chat-completion API there should only be max_completion_tokens, but I don't know what the chat API uses.

docs/Makefile Outdated
@@ -14,7 +14,7 @@ help:

# New target to compile Markdown and Jupyter Notebook files
compile:
find $(SOURCEDIR) -path "*/_build/*" -prune -o -name "*.ipynb" -print | while read nb; do \
find $(SOURCEDIR) -path "*/$(BUILDDIR)/*" -prune -o -name "*.ipynb" -print | while read nb; do \

Collaborator

Why do we change this? Do we need to set $(BUILDDIR)?

Contributor Author

It has already been declared on line 9:

BUILDDIR = _build

@@ -295,7 +295,12 @@ class ChatCompletionRequest(BaseModel):
logit_bias: Optional[Dict[str, float]] = None
logprobs: bool = False
top_logprobs: Optional[int] = None
# The maximum number of tokens that can be generated in the chat completion.
# non-chat-completion models only

Collaborator

  1. Make it a full sentence: Non-chat-completion models only have max tokens.

  2. So chat-completion models count max_completion_tokens and chat models (not completion models) count max_tokens, right?

Contributor Author

@mickqian mickqian Feb 5, 2025

  1. There's a nuanced difference between non-chat-completion models only and Non-chat-completion models only have max tokens, I'm afraid. Changed it to Only available for non-chat-completion models; is that ok?
  2. Yes. To be more specific, OpenAI's legacy completions API (non-chat-completion models only) has only max_tokens. Their chat-completions API has both params, but:
[screenshot: OpenAI API reference showing max_tokens marked as deprecated for the chat-completions API]
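
To illustrate the API-level difference with the official Python client (the model names below are placeholders):

from openai import OpenAI

client = OpenAI()

# Chat-completions API: max_completion_tokens is the current parameter;
# max_tokens still works here but is marked deprecated.
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
    max_completion_tokens=64,
)

# Legacy completions API: only max_tokens exists.
client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Hello",
    max_tokens=64,
)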

Comment on lines 301 to 315
# An upper bound for the number of tokens that can be generated for a completion, including visible output tokens and reasoning tokens.
# Almost the same as `max_tokens`, but for chat-completion models only
max_completion_tokens: Optional[int] = None

Collaborator

These two definitions look strange to me. I would say:

  1. # The maximum number of total tokens in a chat request. Note that input tokens are included.

  2. # The maximum number of completion tokens for a chat completion request, including visible output tokens and reasoning tokens. But input tokens are not included.

Contributor Author

Fixed.

@mickqian
Contributor Author

mickqian commented Feb 5, 2025

Good improvement. I am wondering: does OpenAI still have max_tokens for the chat API right now? For the chat-completion API there should only be max_completion_tokens, but I don't know what the chat API uses.

Replied here.

@mickqian mickqian force-pushed the fix-max-tokens branch 3 times, most recently from 10ce5fd to db732ae on February 5, 2025 at 02:39
@@ -325,6 +330,14 @@ class ChatCompletionRequest(BaseModel):
lora_path: Optional[Union[List[Optional[str]], Optional[str]]] = None
session_params: Optional[Dict] = None

def get_max_output_tokens(self) -> int:

Contributor

@CatherineSue CatherineSue Mar 4, 2025

Should it be def get_max_output_tokens(self) -> int | None:? And from the code change, this function feels unnecessary. Does request.max_completion_tokens or request.max_tokens not work?
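
For reference, a minimal way such a helper could behave (a hypothetical sketch; the PR's actual implementation is not shown in this thread):

def get_max_output_tokens(self) -> Optional[int]:
    # Prefer the newer parameter and fall back to the deprecated one;
    # returns None when neither is set (Optional imported from typing).
    if self.max_completion_tokens is not None:
        return self.max_completion_tokens
    return self.max_tokens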

@zhaochenyang20
Collaborator

@mickqian This should be rebased. Thanks 😂

@mickqian mickqian force-pushed the fix-max-tokens branch 3 times, most recently from 5d645ed to 473fa26 on March 4, 2025 at 09:51
Replace it with a newer one: max_completion_tokens
@zhaochenyang20
Collaborator

@shuaills Could you review this? Thanks!
