test: add vlm to token in & out example #3941

Open · wants to merge 4 commits into main from skip-tokenizer-vlm

Conversation

@mickqian (Contributor) commented Feb 28, 2025

Motivation

  1. Support VLMs with skip_tokenizer_init, ref [Feature] Support token-in-token-out for Vision LM #3871
  2. Add docs for offline token-in token-out, ref [Feature] Add docs for Offline Engine token-in token-out #2968 (a sketch of the flow follows below)
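
For readers unfamiliar with the feature, the token-in token-out flow that the example targets looks roughly like the sketch below. It is modeled on the existing examples/runtime/engine/token_in_token_out_llm.py; the model name and the "output_ids" field used for decoding are assumptions for illustration, not necessarily what this PR ships.

```python
# Hedged sketch of token-in token-out with skip_tokenizer_init (text-only case).
# Assumptions: the model path below and the "output_ids" key in the result dict.
from transformers import AutoTokenizer

import sglang as sgl

MODEL_PATH = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical model choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
llm = sgl.Engine(model_path=MODEL_PATH, skip_tokenizer_init=True)

# Tokens in: the caller tokenizes outside the engine.
prompts = ["The capital of France is", "The capital of Japan is"]
input_ids = tokenizer(prompts).input_ids

outputs = llm.generate(
    input_ids=input_ids,
    sampling_params={"temperature": 0.8, "top_p": 0.95},
)

# Tokens out: the engine returns token ids and the caller decodes them.
for out in outputs:
    print(tokenizer.decode(out["output_ids"], skip_special_tokens=True))

llm.shutdown()
```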

Modifications

Checklist

@mickqian changed the title from "Skip tokenizer vlm" to "test: add vlm to skip_tokenizer_init test" on Feb 28, 2025
@mickqian changed the title from "test: add vlm to skip_tokenizer_init test" to "test: add vlm to token in & out example" on Feb 28, 2025
@mickqian force-pushed the skip-tokenizer-vlm branch from 9406e5b to f1692ca on March 1, 2025 03:31
@zhaochenyang20 (Collaborator) commented:

(sglang) chayenne@lmsys:~/token-in-token-out/sglang/examples/runtime/engine$ python token_in_token_out_vlm.py 
INFO 03-03 05:48:20 __init__.py:190] Automatically detected platform cuda.
parser
usage: token_in_token_out_vlm.py [-h] --model-path MODEL_PATH [--tokenizer-path TOKENIZER_PATH] [--host HOST] [--port PORT] [--tokenizer-mode {auto,slow}]
                                 [--skip-tokenizer-init] [--load-format {auto,pt,safetensors,npcache,dummy,gguf,bitsandbytes,layered}] [--trust-remote-code]
                                 [--dtype {auto,half,float16,bfloat16,float,float32}] [--kv-cache-dtype {auto,fp8_e5m2,fp8_e4m3}]
                                 [--quantization-param-path QUANTIZATION_PARAM_PATH]
                                 [--quantization {awq,fp8,gptq,marlin,gptq_marlin,awq_marlin,bitsandbytes,gguf,modelopt,w8a8_int8}]
                                 [--context-length CONTEXT_LENGTH] [--device {cuda,xpu,hpu,cpu}] [--served-model-name SERVED_MODEL_NAME]
                                 [--chat-template CHAT_TEMPLATE] [--is-embedding] [--revision REVISION] [--mem-fraction-static MEM_FRACTION_STATIC]
                                 [--max-running-requests MAX_RUNNING_REQUESTS] [--max-total-tokens MAX_TOTAL_TOKENS]
                                 [--chunked-prefill-size CHUNKED_PREFILL_SIZE] [--max-prefill-tokens MAX_PREFILL_TOKENS]
                                 [--schedule-policy {lpm,random,fcfs,dfs-weight}] [--schedule-conservativeness SCHEDULE_CONSERVATIVENESS]
                                 [--cpu-offload-gb CPU_OFFLOAD_GB] [--prefill-only-one-req PREFILL_ONLY_ONE_REQ]
                                 [--tensor-parallel-size TENSOR_PARALLEL_SIZE] [--stream-interval STREAM_INTERVAL] [--stream-output]
                                 [--random-seed RANDOM_SEED] [--constrained-json-whitespace-pattern CONSTRAINED_JSON_WHITESPACE_PATTERN]
                                 [--watchdog-timeout WATCHDOG_TIMEOUT] [--dist-timeout DIST_TIMEOUT] [--download-dir DOWNLOAD_DIR]
                                 [--base-gpu-id BASE_GPU_ID] [--gpu-id-step GPU_ID_STEP] [--log-level LOG_LEVEL] [--log-level-http LOG_LEVEL_HTTP]
                                 [--log-requests] [--show-time-cost] [--enable-metrics] [--decode-log-interval DECODE_LOG_INTERVAL] [--api-key API_KEY]
                                 [--file-storage-pth FILE_STORAGE_PTH] [--enable-cache-report] [--data-parallel-size DATA_PARALLEL_SIZE]
                                 [--load-balance-method {round_robin,shortest_queue}] [--expert-parallel-size EXPERT_PARALLEL_SIZE]
                                 [--dist-init-addr DIST_INIT_ADDR] [--nnodes NNODES] [--node-rank NODE_RANK]
                                 [--json-model-override-args JSON_MODEL_OVERRIDE_ARGS] [--lora-paths [LORA_PATHS ...]]
                                 [--max-loras-per-batch MAX_LORAS_PER_BATCH] [--lora-backend LORA_BACKEND]
                                 [--attention-backend {flashinfer,triton,torch_native}] [--sampling-backend {flashinfer,pytorch}]
                                 [--grammar-backend {xgrammar,outlines,llguidance}] [--enable-flashinfer-mla] [--flashinfer-mla-disable-ragged]
                                 [--speculative-algorithm {EAGLE,NEXTN}] [--speculative-draft-model-path SPECULATIVE_DRAFT_MODEL_PATH]
                                 [--speculative-num-steps SPECULATIVE_NUM_STEPS] [--speculative-eagle-topk {1,2,4,8}]
                                 [--speculative-num-draft-tokens SPECULATIVE_NUM_DRAFT_TOKENS] [--speculative-token-map SPECULATIVE_TOKEN_MAP]
                                 [--enable-double-sparsity] [--ds-channel-config-path DS_CHANNEL_CONFIG_PATH] [--ds-heavy-channel-num DS_HEAVY_CHANNEL_NUM]
                                 [--ds-heavy-token-num DS_HEAVY_TOKEN_NUM] [--ds-heavy-channel-type DS_HEAVY_CHANNEL_TYPE]
                                 [--ds-sparse-decode-threshold DS_SPARSE_DECODE_THRESHOLD] [--disable-radix-cache] [--disable-jump-forward]
                                 [--disable-cuda-graph] [--disable-cuda-graph-padding] [--enable-nccl-nvls] [--disable-outlines-disk-cache]
                                 [--disable-custom-all-reduce] [--disable-mla] [--disable-overlap-schedule] [--enable-mixed-chunk] [--enable-dp-attention]
                                 [--enable-ep-moe] [--enable-torch-compile] [--torch-compile-max-bs TORCH_COMPILE_MAX_BS]
                                 [--cuda-graph-max-bs CUDA_GRAPH_MAX_BS] [--cuda-graph-bs CUDA_GRAPH_BS [CUDA_GRAPH_BS ...]]
                                 [--torchao-config TORCHAO_CONFIG] [--enable-nan-detection] [--enable-p2p-check] [--triton-attention-reduce-in-fp32]
                                 [--triton-attention-num-kv-splits TRITON_ATTENTION_NUM_KV_SPLITS]
                                 [--num-continuous-decode-steps NUM_CONTINUOUS_DECODE_STEPS] [--delete-ckpt-after-loading] [--enable-memory-saver]
                                 [--allow-auto-truncate] [--enable-custom-logit-processor] [--tool-call-parser {qwen25,mistral,llama3}]
                                 [--enable-hierarchical-cache]
token_in_token_out_vlm.py: error: the following arguments are required: --model-path

Hey Mick, could you fix the bugs here?
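
The failure above is an argparse error rather than an engine problem: the example declares --model-path as required, so running it with no flags exits before the engine even starts. One way to make the example runnable as-is is to give that argument a default; the model name below is a hypothetical placeholder, not necessarily the value this PR uses.

```python
# Hypothetical fix sketch: give --model-path a default so the example runs with no flags.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model-path",
    type=str,
    default="Qwen/Qwen2-VL-7B-Instruct",  # hypothetical default VLM
    help="Vision-language model to load when no path is given on the CLI.",
)
args = parser.parse_args()
```

With a default in place, `python token_in_token_out_vlm.py` works out of the box while `--model-path` still overrides it.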

@mickqian force-pushed the skip-tokenizer-vlm branch from 332c96d to 4883cf0 on March 3, 2025 08:15
@mickqian (Contributor, Author) commented Mar 3, 2025

@zhaochenyang20 Fixed.

@zhaochenyang20 (Collaborator) commented:

@mickqian Please fix the conflicts and rebase.

@mickqian force-pushed the skip-tokenizer-vlm branch from 4883cf0 to d7ec05a on March 3, 2025 10:08
@mickqian (Contributor, Author) commented Mar 3, 2025

After rebasing, the stream test of skip_tokenizer_init needs some additional work, while the non-stream version is ready.
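
For context, the streaming path mentioned here would look roughly like the sketch below. It assumes Engine.generate accepts stream=True and yields chunk dicts; the exact per-chunk fields (cumulative vs. incremental "output_ids") are precisely the detail that still needs work, so treat them as assumptions.

```python
# Hedged sketch of the streaming token-in token-out path.
# Assumptions: stream=True yields chunk dicts, each carrying an "output_ids" field.
from transformers import AutoTokenizer

import sglang as sgl

MODEL_PATH = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical model choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
llm = sgl.Engine(model_path=MODEL_PATH, skip_tokenizer_init=True)

input_ids = tokenizer("The capital of France is").input_ids

# Iterate over chunks as decoding progresses instead of waiting for one final dict.
for chunk in llm.generate(input_ids=input_ids, sampling_params={"temperature": 0}, stream=True):
    print(chunk.get("output_ids"))

llm.shutdown()
```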

@zhaochenyang20 (Collaborator) commented:

Hey Mick, could you try to simplify your example by following

https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/token_in_token_out_llm.py

and name it token_in_token_out_vlm.py?
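
For reference, a simplified VLM variant that mirrors the LLM example might look like the sketch below. The model name, the image URL, the processor-based chat templating, and the "output_ids" field are assumptions made for illustration; the token_in_token_out_vlm.py added in this PR is the source of truth.

```python
# Hedged sketch of a token_in_token_out_vlm.py mirroring the LLM example.
# Assumptions: the model/image below, processor-based templating, and the "output_ids" key.
from transformers import AutoProcessor

import sglang as sgl

MODEL_PATH = "Qwen/Qwen2-VL-7B-Instruct"      # hypothetical VLM
IMAGE_URL = "https://example.com/sample.png"  # hypothetical image

processor = AutoProcessor.from_pretrained(MODEL_PATH)
vlm = sgl.Engine(model_path=MODEL_PATH, skip_tokenizer_init=True)

# Build the prompt, including the image placeholder tokens, outside the engine.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
input_ids = processor.tokenizer(prompt).input_ids

# Tokens in (plus the raw image), tokens out; batch of one, so index the result list.
output = vlm.generate(
    input_ids=[input_ids],
    image_data=[IMAGE_URL],
    sampling_params={"temperature": 0.8, "max_new_tokens": 64},
)[0]

print(processor.tokenizer.decode(output["output_ids"], skip_special_tokens=True))

vlm.shutdown()
```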

@zhaochenyang20 (Collaborator) commented:

@mickqian Thanks!

@mickqian force-pushed the skip-tokenizer-vlm branch from f1d39fa to a7a6665 on March 4, 2025 14:57
@mickqian force-pushed the skip-tokenizer-vlm branch from a7a6665 to f2aa687 on March 4, 2025 15:16
@zhaochenyang20 mentioned this pull request on Mar 4, 2025.