test: add vlm to token in & out example #3941

Open · wants to merge 4 commits into main from skip-tokenizer-vlm

Conversation

@mickqian (Contributor) commented Feb 28, 2025

Motivation

  1. Support VLMs with skip_tokenizer_init, ref [Feature] Support token-in-token-out for Vision LM #3871
  2. Add docs for offline token-in token-out, ref [Feature] Add docs for Offline Engine token-in token-out #2968 (a sketch of the flow follows below)
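
For readers unfamiliar with the feature, the token-in token-out flow that the example targets looks roughly like the sketch below. It is modeled on the existing examples/runtime/engine/token_in_token_out_llm.py; the model name and the "output_ids" field used for decoding are assumptions for illustration, not necessarily what this PR ships.

```python
# Hedged sketch of token-in token-out with skip_tokenizer_init (text-only case).
# Assumptions: the model path below and the "output_ids" key in the result dict.
from transformers import AutoTokenizer

import sglang as sgl

MODEL_PATH = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical model choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
llm = sgl.Engine(model_path=MODEL_PATH, skip_tokenizer_init=True)

# Tokens in: the caller tokenizes outside the engine.
prompts = ["The capital of France is", "The capital of Japan is"]
input_ids = tokenizer(prompts).input_ids

outputs = llm.generate(
    input_ids=input_ids,
    sampling_params={"temperature": 0.8, "top_p": 0.95},
)

# Tokens out: the engine returns token ids and the caller decodes them.
for out in outputs:
    print(tokenizer.decode(out["output_ids"], skip_special_tokens=True))

llm.shutdown()
```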

Modifications

Checklist

@mickqian changed the title from "Skip tokenizer vlm" to "test: add vlm to skip_tokenizer_init test" on Feb 28, 2025
@mickqian changed the title from "test: add vlm to skip_tokenizer_init test" to "test: add vlm to token in & out example" on Feb 28, 2025
@mickqian force-pushed the skip-tokenizer-vlm branch from 9406e5b to f1692ca on March 1, 2025 03:31
@zhaochenyang20 (Collaborator) commented:

(sglang) chayenne@lmsys:~/token-in-token-out/sglang/examples/runtime/engine$ python token_in_token_out_vlm.py 
INFO 03-03 05:48:20 __init__.py:190] Automatically detected platform cuda.
parser
usage: token_in_token_out_vlm.py [-h] --model-path MODEL_PATH [--tokenizer-path TOKENIZER_PATH] [--host HOST] [--port PORT] [--tokenizer-mode {auto,slow}]
                                 [--skip-tokenizer-init] [--load-format {auto,pt,safetensors,npcache,dummy,gguf,bitsandbytes,layered}] [--trust-remote-code]
                                 [--dtype {auto,half,float16,bfloat16,float,float32}] [--kv-cache-dtype {auto,fp8_e5m2,fp8_e4m3}]
                                 [--quantization-param-path QUANTIZATION_PARAM_PATH]
                                 [--quantization {awq,fp8,gptq,marlin,gptq_marlin,awq_marlin,bitsandbytes,gguf,modelopt,w8a8_int8}]
                                 [--context-length CONTEXT_LENGTH] [--device {cuda,xpu,hpu,cpu}] [--served-model-name SERVED_MODEL_NAME]
                                 [--chat-template CHAT_TEMPLATE] [--is-embedding] [--revision REVISION] [--mem-fraction-static MEM_FRACTION_STATIC]
                                 [--max-running-requests MAX_RUNNING_REQUESTS] [--max-total-tokens MAX_TOTAL_TOKENS]
                                 [--chunked-prefill-size CHUNKED_PREFILL_SIZE] [--max-prefill-tokens MAX_PREFILL_TOKENS]
                                 [--schedule-policy {lpm,random,fcfs,dfs-weight}] [--schedule-conservativeness SCHEDULE_CONSERVATIVENESS]
                                 [--cpu-offload-gb CPU_OFFLOAD_GB] [--prefill-only-one-req PREFILL_ONLY_ONE_REQ]
                                 [--tensor-parallel-size TENSOR_PARALLEL_SIZE] [--stream-interval STREAM_INTERVAL] [--stream-output]
                                 [--random-seed RANDOM_SEED] [--constrained-json-whitespace-pattern CONSTRAINED_JSON_WHITESPACE_PATTERN]
                                 [--watchdog-timeout WATCHDOG_TIMEOUT] [--dist-timeout DIST_TIMEOUT] [--download-dir DOWNLOAD_DIR]
                                 [--base-gpu-id BASE_GPU_ID] [--gpu-id-step GPU_ID_STEP] [--log-level LOG_LEVEL] [--log-level-http LOG_LEVEL_HTTP]
                                 [--log-requests] [--show-time-cost] [--enable-metrics] [--decode-log-interval DECODE_LOG_INTERVAL] [--api-key API_KEY]
                                 [--file-storage-pth FILE_STORAGE_PTH] [--enable-cache-report] [--data-parallel-size DATA_PARALLEL_SIZE]
                                 [--load-balance-method {round_robin,shortest_queue}] [--expert-parallel-size EXPERT_PARALLEL_SIZE]
                                 [--dist-init-addr DIST_INIT_ADDR] [--nnodes NNODES] [--node-rank NODE_RANK]
                                 [--json-model-override-args JSON_MODEL_OVERRIDE_ARGS] [--lora-paths [LORA_PATHS ...]]
                                 [--max-loras-per-batch MAX_LORAS_PER_BATCH] [--lora-backend LORA_BACKEND]
                                 [--attention-backend {flashinfer,triton,torch_native}] [--sampling-backend {flashinfer,pytorch}]
                                 [--grammar-backend {xgrammar,outlines,llguidance}] [--enable-flashinfer-mla] [--flashinfer-mla-disable-ragged]
                                 [--speculative-algorithm {EAGLE,NEXTN}] [--speculative-draft-model-path SPECULATIVE_DRAFT_MODEL_PATH]
                                 [--speculative-num-steps SPECULATIVE_NUM_STEPS] [--speculative-eagle-topk {1,2,4,8}]
                                 [--speculative-num-draft-tokens SPECULATIVE_NUM_DRAFT_TOKENS] [--speculative-token-map SPECULATIVE_TOKEN_MAP]
                                 [--enable-double-sparsity] [--ds-channel-config-path DS_CHANNEL_CONFIG_PATH] [--ds-heavy-channel-num DS_HEAVY_CHANNEL_NUM]
                                 [--ds-heavy-token-num DS_HEAVY_TOKEN_NUM] [--ds-heavy-channel-type DS_HEAVY_CHANNEL_TYPE]
                                 [--ds-sparse-decode-threshold DS_SPARSE_DECODE_THRESHOLD] [--disable-radix-cache] [--disable-jump-forward]
                                 [--disable-cuda-graph] [--disable-cuda-graph-padding] [--enable-nccl-nvls] [--disable-outlines-disk-cache]
                                 [--disable-custom-all-reduce] [--disable-mla] [--disable-overlap-schedule] [--enable-mixed-chunk] [--enable-dp-attention]
                                 [--enable-ep-moe] [--enable-torch-compile] [--torch-compile-max-bs TORCH_COMPILE_MAX_BS]
                                 [--cuda-graph-max-bs CUDA_GRAPH_MAX_BS] [--cuda-graph-bs CUDA_GRAPH_BS [CUDA_GRAPH_BS ...]]
                                 [--torchao-config TORCHAO_CONFIG] [--enable-nan-detection] [--enable-p2p-check] [--triton-attention-reduce-in-fp32]
                                 [--triton-attention-num-kv-splits TRITON_ATTENTION_NUM_KV_SPLITS]
                                 [--num-continuous-decode-steps NUM_CONTINUOUS_DECODE_STEPS] [--delete-ckpt-after-loading] [--enable-memory-saver]
                                 [--allow-auto-truncate] [--enable-custom-logit-processor] [--tool-call-parser {qwen25,mistral,llama3}]
                                 [--enable-hierarchical-cache]
token_in_token_out_vlm.py: error: the following arguments are required: --model-path

Hey Mick, could you fix the bugs here?
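
The failure above is an argparse error rather than an engine problem: the example declares --model-path as required, so running it with no flags exits before the engine even starts. One way to make the example runnable as-is is to give that argument a default; the model name below is a hypothetical placeholder, not necessarily the value this PR uses.

```python
# Hypothetical fix sketch: give --model-path a default so the example runs with no flags.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model-path",
    type=str,
    default="Qwen/Qwen2-VL-7B-Instruct",  # hypothetical default VLM
    help="Vision-language model to load when no path is given on the CLI.",
)
args = parser.parse_args()
```

With a default in place, `python token_in_token_out_vlm.py` works out of the box while `--model-path` still overrides it.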

@mickqian force-pushed the skip-tokenizer-vlm branch from 332c96d to 4883cf0 on March 3, 2025 08:15
@mickqian (Contributor, Author) commented Mar 3, 2025

@zhaochenyang20 Fixed.

@zhaochenyang20 (Collaborator) commented:

@mickqian Please fix the conflicts and rebase.

@mickqian force-pushed the skip-tokenizer-vlm branch from 4883cf0 to d7ec05a on March 3, 2025 10:08
@mickqian (Contributor, Author) commented Mar 3, 2025

After rebasing, the stream test of skip_tokenizer_init needs some additional work, while the non-stream version is ready.
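
For context, the streaming path mentioned here would look roughly like the sketch below. It assumes Engine.generate accepts stream=True and yields chunk dicts; the exact per-chunk fields (cumulative vs. incremental "output_ids") are precisely the detail that still needs work, so treat them as assumptions.

```python
# Hedged sketch of the streaming token-in token-out path.
# Assumptions: stream=True yields chunk dicts, each carrying an "output_ids" field.
from transformers import AutoTokenizer

import sglang as sgl

MODEL_PATH = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical model choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
llm = sgl.Engine(model_path=MODEL_PATH, skip_tokenizer_init=True)

input_ids = tokenizer("The capital of France is").input_ids

# Iterate over chunks as decoding progresses instead of waiting for one final dict.
for chunk in llm.generate(input_ids=input_ids, sampling_params={"temperature": 0}, stream=True):
    print(chunk.get("output_ids"))

llm.shutdown()
```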

@zhaochenyang20 (Collaborator) commented:

Hey Mick, could you try to simplify your example by following

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/token_in_token_out_llm.py

and name it token_in_token_out_vlm.py?
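
For reference, a simplified VLM variant that mirrors the LLM example might look like the sketch below. The model name, the image URL, the processor-based chat templating, and the "output_ids" field are assumptions made for illustration; the token_in_token_out_vlm.py added in this PR is the source of truth.

```python
# Hedged sketch of a token_in_token_out_vlm.py mirroring the LLM example.
# Assumptions: the model/image below, processor-based templating, and the "output_ids" key.
from transformers import AutoProcessor

import sglang as sgl

MODEL_PATH = "Qwen/Qwen2-VL-7B-Instruct"      # hypothetical VLM
IMAGE_URL = "https://example.com/sample.png"  # hypothetical image

processor = AutoProcessor.from_pretrained(MODEL_PATH)
vlm = sgl.Engine(model_path=MODEL_PATH, skip_tokenizer_init=True)

# Build the prompt, including the image placeholder tokens, outside the engine.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
input_ids = processor.tokenizer(prompt).input_ids

# Tokens in (plus the raw image), tokens out; batch of one, so index the result list.
output = vlm.generate(
    input_ids=[input_ids],
    image_data=[IMAGE_URL],
    sampling_params={"temperature": 0.8, "max_new_tokens": 64},
)[0]

print(processor.tokenizer.decode(output["output_ids"], skip_special_tokens=True))

vlm.shutdown()
```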

@zhaochenyang20 (Collaborator) commented:

@mickqian Thanks!

@mickqian force-pushed the skip-tokenizer-vlm branch from f1d39fa to a7a6665 on March 4, 2025 14:57
@mickqian force-pushed the skip-tokenizer-vlm branch from a7a6665 to f2aa687 on March 4, 2025 15:16
@zhaochenyang20 mentioned this pull request on Mar 4, 2025.