docker pull srjung/ll3.1_release:latest
# When the HF checkpoint files are saved under $PWD as Meta_Llama-3.1-405B-MP16_HF
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined -v $PWD:/model -e HUGGINGFACE_HUB_CACHE=/model -e VLLM_USE_TRITON_FLASH_ATTN=0 -p 8000:8000 srjung/ll3.1_release:latest python -m vllm.entrypoints.api_server --tensor-parallel-size 8 --enforce-eager --worker-use-ray --max-model-len=8192 --model /model/Meta_Llama-3.1-405B-MP16_HF
# When meta-llama/Meta-Llama-3.1-405B is saved under $PWD, you can also pass meta-llama/Meta-Llama-3.1-405B as the --model argument
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined -v $PWD:/model -e HUGGINGFACE_HUB_CACHE=/model -e VLLM_USE_TRITON_FLASH_ATTN=0 -p 8000:8000 srjung/ll3.1_release:latest python -m vllm.entrypoints.api_server --tensor-parallel-size 8 --enforce-eager --worker-use-ray --max-model-len=8192 --model meta-llama/Meta-Llama-3.1-405B
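Loading the 405B weights across all eight GPUs takes a few minutes (see the model_runner.py timestamps in the server log below), so the endpoint is not available right away. A minimal polling sketch, assuming the same localhost:8000 port mapping used in the docker run commands above:

```python
import time
import requests

URL = "http://localhost:8000/generate"  # port published by the docker run command above

def wait_for_server(timeout_s: int = 3600, interval_s: int = 15) -> None:
    """Poll the vLLM api_server until it answers a tiny /generate request."""
    payload = {"prompt": "ping", "max_tokens": 1}
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.post(URL, json=payload, timeout=10).status_code == 200:
                print("server is ready")
                return
        except requests.ConnectionError:
            pass  # port not listening yet; weights are still loading
        time.sleep(interval_s)
    raise TimeoutError("vLLM server did not become ready in time")

if __name__ == "__main__":
    wait_for_server()
```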
# Query the server from a separate terminal:
curl http://localhost:8000/generate -H "Content-Type: application/json" -d '{ "prompt": "San Francisco City is ", "max_tokens": 200, "temperature": 0.9}'
INFO 07-25 03:59:46 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/model/Meta_Llama-3.1-405B-MP16_HF', speculative_config=None, tokenizer='/model/Meta_Llama-3.1-405B-MP16_HF', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/model/Meta_Llama-3.1-405B-MP16_HF)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-25 04:00:10 selector.py:56] Using ROCmFlashAttention backend.
(RayWorkerWrapper pid=21448) INFO 07-25 04:00:10 selector.py:56] Using ROCmFlashAttention backend.
INFO 07-25 04:00:12 selector.py:56] Using ROCmFlashAttention backend.
INFO 07-25 04:01:16 model_runner.py:146] Loading model weights took 95.5194 GB
(RayWorkerWrapper pid=22543) INFO 07-25 04:01:44 model_runner.py:146] Loading model weights took 95.5194 GB
(RayWorkerWrapper pid=22697) INFO 07-25 04:00:12 selector.py:56] Using ROCmFlashAttention backend. [repeated 13x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=22697) INFO 07-25 04:01:50 model_runner.py:146] Loading model weights took 95.5194 GB [repeated 6x across cluster]
INFO 07-25 04:02:33 distributed_gpu_executor.py:56] # GPU blocks: 34501, # CPU blocks: 2080
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 07-25 04:03:59 async_llm_engine.py:553] Received request ecc91bf242964262a0cc05ca4d2b6c48: prompt: 'San Francisco City is ', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.9, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: None, lora_request: None.
INFO 07-25 04:04:04 async_llm_engine.py:124] Finished request ecc91bf242964262a0cc05ca4d2b6c48.
INFO: 172.17.0.1:36956 - "POST /generate HTTP/1.1" 200 OK
curl http://localhost:8000/generate -H "Content-Type: application/json" -d '{ "prompt": "San Francisco City is ", "max_tokens": 200, "temperature": 0.9}'
{"text":["San Francisco City is 1 hour behind Santa Rosa City.\nSanta Rosa City, California, USA is located in PST (Pacific Standard Time) time zone and San Francisco City, California, USA is located in PDT (Pacific Daylight Time) time zone. There is an 1:0 hours time difference between Santa Rosa City and San Francisco City right now.\nPlease note that Daylight Saving Time (DST) / Summer Time is taken into account for the calculation of hour difference between Santa Rosa City and San Francisco City."]}