test: add vlm to token in & out example #3941
base: main
Conversation
(force-pushed: 9a280a3 → 5b4a899 → 9406e5b → f1692ca)
```
(sglang) chayenne@lmsys:~/token-in-token-out/sglang/examples/runtime/engine$ python token_in_token_out_vlm.py
INFO 03-03 05:48:20 __init__.py:190] Automatically detected platform cuda.
usage: token_in_token_out_vlm.py [-h] --model-path MODEL_PATH [--tokenizer-path TOKENIZER_PATH] [--host HOST] [--port PORT] [--tokenizer-mode {auto,slow}]
[--skip-tokenizer-init] [--load-format {auto,pt,safetensors,npcache,dummy,gguf,bitsandbytes,layered}] [--trust-remote-code]
[--dtype {auto,half,float16,bfloat16,float,float32}] [--kv-cache-dtype {auto,fp8_e5m2,fp8_e4m3}]
[--quantization-param-path QUANTIZATION_PARAM_PATH]
[--quantization {awq,fp8,gptq,marlin,gptq_marlin,awq_marlin,bitsandbytes,gguf,modelopt,w8a8_int8}]
[--context-length CONTEXT_LENGTH] [--device {cuda,xpu,hpu,cpu}] [--served-model-name SERVED_MODEL_NAME]
[--chat-template CHAT_TEMPLATE] [--is-embedding] [--revision REVISION] [--mem-fraction-static MEM_FRACTION_STATIC]
[--max-running-requests MAX_RUNNING_REQUESTS] [--max-total-tokens MAX_TOTAL_TOKENS]
[--chunked-prefill-size CHUNKED_PREFILL_SIZE] [--max-prefill-tokens MAX_PREFILL_TOKENS]
[--schedule-policy {lpm,random,fcfs,dfs-weight}] [--schedule-conservativeness SCHEDULE_CONSERVATIVENESS]
[--cpu-offload-gb CPU_OFFLOAD_GB] [--prefill-only-one-req PREFILL_ONLY_ONE_REQ]
[--tensor-parallel-size TENSOR_PARALLEL_SIZE] [--stream-interval STREAM_INTERVAL] [--stream-output]
[--random-seed RANDOM_SEED] [--constrained-json-whitespace-pattern CONSTRAINED_JSON_WHITESPACE_PATTERN]
[--watchdog-timeout WATCHDOG_TIMEOUT] [--dist-timeout DIST_TIMEOUT] [--download-dir DOWNLOAD_DIR]
[--base-gpu-id BASE_GPU_ID] [--gpu-id-step GPU_ID_STEP] [--log-level LOG_LEVEL] [--log-level-http LOG_LEVEL_HTTP]
[--log-requests] [--show-time-cost] [--enable-metrics] [--decode-log-interval DECODE_LOG_INTERVAL] [--api-key API_KEY]
[--file-storage-pth FILE_STORAGE_PTH] [--enable-cache-report] [--data-parallel-size DATA_PARALLEL_SIZE]
[--load-balance-method {round_robin,shortest_queue}] [--expert-parallel-size EXPERT_PARALLEL_SIZE]
[--dist-init-addr DIST_INIT_ADDR] [--nnodes NNODES] [--node-rank NODE_RANK]
[--json-model-override-args JSON_MODEL_OVERRIDE_ARGS] [--lora-paths [LORA_PATHS ...]]
[--max-loras-per-batch MAX_LORAS_PER_BATCH] [--lora-backend LORA_BACKEND]
[--attention-backend {flashinfer,triton,torch_native}] [--sampling-backend {flashinfer,pytorch}]
[--grammar-backend {xgrammar,outlines,llguidance}] [--enable-flashinfer-mla] [--flashinfer-mla-disable-ragged]
[--speculative-algorithm {EAGLE,NEXTN}] [--speculative-draft-model-path SPECULATIVE_DRAFT_MODEL_PATH]
[--speculative-num-steps SPECULATIVE_NUM_STEPS] [--speculative-eagle-topk {1,2,4,8}]
[--speculative-num-draft-tokens SPECULATIVE_NUM_DRAFT_TOKENS] [--speculative-token-map SPECULATIVE_TOKEN_MAP]
[--enable-double-sparsity] [--ds-channel-config-path DS_CHANNEL_CONFIG_PATH] [--ds-heavy-channel-num DS_HEAVY_CHANNEL_NUM]
[--ds-heavy-token-num DS_HEAVY_TOKEN_NUM] [--ds-heavy-channel-type DS_HEAVY_CHANNEL_TYPE]
[--ds-sparse-decode-threshold DS_SPARSE_DECODE_THRESHOLD] [--disable-radix-cache] [--disable-jump-forward]
[--disable-cuda-graph] [--disable-cuda-graph-padding] [--enable-nccl-nvls] [--disable-outlines-disk-cache]
[--disable-custom-all-reduce] [--disable-mla] [--disable-overlap-schedule] [--enable-mixed-chunk] [--enable-dp-attention]
[--enable-ep-moe] [--enable-torch-compile] [--torch-compile-max-bs TORCH_COMPILE_MAX_BS]
[--cuda-graph-max-bs CUDA_GRAPH_MAX_BS] [--cuda-graph-bs CUDA_GRAPH_BS [CUDA_GRAPH_BS ...]]
[--torchao-config TORCHAO_CONFIG] [--enable-nan-detection] [--enable-p2p-check] [--triton-attention-reduce-in-fp32]
[--triton-attention-num-kv-splits TRITON_ATTENTION_NUM_KV_SPLITS]
[--num-continuous-decode-steps NUM_CONTINUOUS_DECODE_STEPS] [--delete-ckpt-after-loading] [--enable-memory-saver]
[--allow-auto-truncate] [--enable-custom-logit-processor] [--tool-call-parser {qwen25,mistral,llama3}]
[--enable-hierarchical-cache]
token_in_token_out_vlm.py: error: the following arguments are required: --model-path
```
Hey Mick, could you fix the bugs here?
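(Side note for readers: the failure above is just argparse reporting that the required `--model-path` flag was omitted, so the script never actually started; rerunning as `python token_in_token_out_vlm.py --model-path <model>` with whatever VLM checkpoint you intend to test gets past this particular error.)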
(force-pushed: 332c96d → 4883cf0)
@zhaochenyang20 fixed
@mickqian Please fix the conflicts and rebase.
(force-pushed: 4883cf0 → d7ec05a)
After rebasing, the streaming test for skip_tokenizer_init needs some additional work, while the non-streaming version is ready.
Hey Mick, could you try to simplify your example following this one: https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/token_in_token_out_llm.py, and name it accordingly?
@mickqian Thanks!
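For reference, here is a plausible shape for the simplified non-streaming example, sketched from the linked token_in_token_out_llm.py. This is an assumption-laden sketch, not this PR's code: the model path, image URL, the image-placeholder handling in the chat template, and the `image_data` parameter of `Engine.generate` are all assumed here rather than taken from the PR.

```python
# Sketch: token-in/token-out for a VLM with skip_tokenizer_init=True.
# Assumptions (not verified against this PR): Engine.generate() accepts
# input_ids and image_data, and returns "output_ids" per request; the
# model path and image URL below are placeholders.
import sglang as sgl
from transformers import AutoTokenizer

MODEL_PATH = "Qwen/Qwen2-VL-7B-Instruct"   # placeholder VLM checkpoint
IMAGE_URL = "https://example.com/cat.jpg"  # placeholder image

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

    # Tokenize on the caller side. How the image placeholder tokens are
    # injected is model-specific; the list-style content below relies on
    # the model's chat template supporting it.
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ]}],
        tokenize=False,
        add_generation_prompt=True,
    )
    input_ids = tokenizer(prompt)["input_ids"]

    # With skip_tokenizer_init, the engine consumes and emits token ids only.
    llm = sgl.Engine(model_path=MODEL_PATH, skip_tokenizer_init=True)
    outputs = llm.generate(
        input_ids=[input_ids],
        image_data=[IMAGE_URL],
        sampling_params={"temperature": 0.8, "top_p": 0.95},
    )
    for out in outputs:
        # Decoding is now the caller's responsibility.
        print(tokenizer.decode(out["output_ids"], skip_special_tokens=True))
    llm.shutdown()
```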
(force-pushed: f1d39fa → a7a6665 → f2aa687)
Motivation
Modifications
Checklist