[Refactor] Enable help and code refactor on several scripts (#43)
* enable --help of several scripts

* remove

* remove iter

* nit

* nit

* change to lowercase
Deegue authored Jan 11, 2024
1 parent a816ab1 commit 98edae1
Showing 5 changed files with 41 additions and 141 deletions.
16 changes: 8 additions & 8 deletions .github/workflows/workflow_inference.yml
@@ -109,16 +109,16 @@ jobs:
else
docker exec "${TARGET}" bash -c "MODEL_TO_SERVE=\"${{ matrix.model }}\" python inference/serve.py --serve_simple"
fi
-docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --num_iter 1 --model_endpoint http://127.0.0.1:8000/${{ matrix.model }}"
-docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --num_iter 1 --model_endpoint http://127.0.0.1:8000/${{ matrix.model }} --streaming_response"
+docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --model_endpoint http://127.0.0.1:8000/${{ matrix.model }}"
+docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --model_endpoint http://127.0.0.1:8000/${{ matrix.model }} --streaming_response"
- name: Run Inference Test with Deltatuner
if: ${{ matrix.dtuner_model }}
run: |
TARGET=${{steps.target.outputs.target}}
docker exec "${TARGET}" bash -c "python inference/serve.py --config_file .github/workflows/config/mpt_deltatuner.yaml --serve_simple"
-docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --num_iter 1 --model_endpoint http://127.0.0.1:8000/${{ matrix.model }}"
-docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --num_iter 1 --model_endpoint http://127.0.0.1:8000/${{ matrix.model }} --streaming_response"
+docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --model_endpoint http://127.0.0.1:8000/${{ matrix.model }}"
+docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --model_endpoint http://127.0.0.1:8000/${{ matrix.model }} --streaming_response"
- name: Run Inference Test with DeepSpeed
run: |
@@ -128,8 +128,8 @@ jobs:
else
docker exec "${TARGET}" bash -c "python .github/workflows/config/update_inference_config.py --config_file inference/models/\"${{ matrix.model }}\".yaml --output_file \"${{ matrix.model }}\".yaml.deepspeed --deepspeed"
docker exec "${TARGET}" bash -c "python inference/serve.py --config_file \"${{ matrix.model }}\".yaml.deepspeed --serve_simple"
-docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --num_iter 1 --model_endpoint http://127.0.0.1:8000/${{ matrix.model }}"
-docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --num_iter 1 --model_endpoint http://127.0.0.1:8000/${{ matrix.model }} --streaming_response"
+docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --model_endpoint http://127.0.0.1:8000/${{ matrix.model }}"
+docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --model_endpoint http://127.0.0.1:8000/${{ matrix.model }} --streaming_response"
fi
- name: Run Inference Test with DeepSpeed and Deltatuner
@@ -140,8 +140,8 @@
echo ${{ matrix.model }} is not supported!
else
docker exec "${TARGET}" bash -c "python inference/serve.py --config_file .github/workflows/config/mpt_deltatuner_deepspeed.yaml --serve_simple"
-docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --num_iter 1 --model_endpoint http://127.0.0.1:8000/${{ matrix.model }}"
-docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --num_iter 1 --model_endpoint http://127.0.0.1:8000/${{ matrix.model }} --streaming_response"
+docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --model_endpoint http://127.0.0.1:8000/${{ matrix.model }}"
+docker exec "${TARGET}" bash -c "python examples/inference/api_server_simple/query_single.py --model_endpoint http://127.0.0.1:8000/${{ matrix.model }} --streaming_response"
fi
- name: Run Inference Test with REST API
88 changes: 0 additions & 88 deletions examples/inference/api_server_simple/query_batch.py

This file was deleted.

44 changes: 16 additions & 28 deletions examples/inference/api_server_simple/query_single.py
@@ -18,14 +18,13 @@
import time
import argparse

-parser = argparse.ArgumentParser(description="Model Inference Script", add_help=False)
-parser.add_argument("--model_endpoint", default="http://127.0.0.1:8000", type=str, help="deployed model endpoint")
-parser.add_argument("--streaming_response", default=False, action="store_true", help="whether to enable streaming response")
-parser.add_argument("--max_new_tokens", default=None, help="The maximum numbers of tokens to generate")
-parser.add_argument("--temperature", default=None, help="The value used to modulate the next token probabilities")
-parser.add_argument("--top_p", default=None, help="If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to`Top p` or higher are kept for generation")
-parser.add_argument("--top_k", default=None, help="The number of highest probability vocabulary tokens to keep for top-k-filtering")
-parser.add_argument("--num_iter", default=10, type=int, help="The Number of inference iterations")
+parser = argparse.ArgumentParser(description="Example script to query with single request", add_help=True)
+parser.add_argument("--model_endpoint", default="http://127.0.0.1:8000", type=str, help="Deployed model endpoint.")
+parser.add_argument("--streaming_response", default=False, action="store_true", help="Whether to enable streaming response.")
+parser.add_argument("--max_new_tokens", default=None, help="The maximum numbers of tokens to generate.")
+parser.add_argument("--temperature", default=None, help="The value used to modulate the next token probabilities.")
+parser.add_argument("--top_p", default=None, help="If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to`Top p` or higher are kept for generation.")
+parser.add_argument("--top_k", default=None, help="The number of highest probability vocabulary tokens to keep for top-k-filtering.")

args = parser.parse_args()
prompt = "Once upon a time,"
Expand All @@ -41,23 +40,12 @@

sample_input = {"text": prompt, "config": config, "stream": args.streaming_response}

-total_time = 0.0
-num_iter = args.num_iter
-num_warmup = 3
-for i in range(num_iter):
-    print("iter: ", i)
-    tic = time.time()
-    proxies = { "http": None, "https": None}
-    outputs = requests.post(args.model_endpoint, proxies=proxies, json=sample_input, stream=args.streaming_response)
-    if args.streaming_response:
-        outputs.raise_for_status()
-        for output in outputs.iter_content(chunk_size=None, decode_unicode=True):
-            print(output, end='', flush=True)
-        print()
-    else:
-        print(outputs.text, flush=True)
-    toc = time.time()
-    if i >= num_warmup:
-        total_time += (toc - tic)
-
-print("Inference latency: %.3f ms." % (total_time / (num_iter - num_warmup) * 1000))
+proxies = { "http": None, "https": None}
+outputs = requests.post(args.model_endpoint, proxies=proxies, json=sample_input, stream=args.streaming_response)
+if args.streaming_response:
+    outputs.raise_for_status()
+    for output in outputs.iter_content(chunk_size=None, decode_unicode=True):
+        print(output, end='', flush=True)
+    print()
+else:
+    print(outputs.text, flush=True)
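The recurring change in this commit is switching argparse's add_help from False to True so that each script answers --help. A minimal standalone sketch of the difference (illustrative only, not part of the commit; it reuses one argument from query_single.py):

import argparse

def build_parser(add_help):
    # Mirrors the parsers touched by this PR, reduced to a single argument.
    parser = argparse.ArgumentParser(description="Example script to query with single request", add_help=add_help)
    parser.add_argument("--model_endpoint", default="http://127.0.0.1:8000", type=str, help="Deployed model endpoint.")
    return parser

for flag in (False, True):
    try:
        build_parser(flag).parse_args(["--help"])
    except SystemExit as exit_info:
        # add_help=False: "--help" is an unrecognized argument and argparse exits with code 2.
        # add_help=True: argparse prints the usage text and exits with code 0.
        print(f"add_help={flag} -> exit code {exit_info.code}")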
32 changes: 16 additions & 16 deletions inference/serve.py
@@ -58,22 +58,22 @@ def get_deployed_models(args):
def main(argv=None):
# args
import argparse
-parser = argparse.ArgumentParser(description="Model Serve Script", add_help=False)
-parser.add_argument("--config_file", type=str, help="inference configuration file in YAML. If specified, all other arguments are ignored")
-parser.add_argument("--model", default=None, type=str, help="model name or path")
-parser.add_argument("--tokenizer", default=None, type=str, help="tokenizer name or path")
-parser.add_argument("--port", default=8000, type=int, help="the port of deployment address")
-parser.add_argument("--route_prefix", default=None, type=str, help="the route prefix for HTTP requests.")
-parser.add_argument("--cpus_per_worker", default="24", type=int, help="cpus per worker")
-parser.add_argument("--gpus_per_worker", default=0, type=float, help="gpus per worker, used when --device is cuda")
-parser.add_argument("--hpus_per_worker", default=0, type=float, help="hpus per worker, used when --device is hpu")
-parser.add_argument("--deepspeed", action='store_true', help="enable deepspeed inference")
-parser.add_argument("--workers_per_group", default="2", type=int, help="workers per group, used with --deepspeed")
-parser.add_argument("--ipex", action='store_true', help="enable ipex optimization")
-parser.add_argument("--device", default="cpu", type=str, help="cpu, xpu, hpu or cuda")
-parser.add_argument("--serve_local_only", action="store_true", help="only support local access to url")
-parser.add_argument("--serve_simple", action="store_true", help="whether to serve OpenAI-compatible API for all models or serve simple endpoint based on model conf files.")
-parser.add_argument("--keep_serve_terminal", action="store_true", help="whether to keep serve terminal.")
+parser = argparse.ArgumentParser(description="Model Serve Script", add_help=True)
+parser.add_argument("--config_file", type=str, help="Inference configuration file in YAML. If specified, all other arguments will be ignored.")
+parser.add_argument("--model", default=None, type=str, help="Model name or path.")
+parser.add_argument("--tokenizer", default=None, type=str, help="Tokenizer name or path.")
+parser.add_argument("--port", default=8000, type=int, help="The port of deployment address.")
+parser.add_argument("--route_prefix", default=None, type=str, help="The route prefix for HTTP requests.")
+parser.add_argument("--cpus_per_worker", default="24", type=int, help="CPUs per worker.")
+parser.add_argument("--gpus_per_worker", default=0, type=float, help="GPUs per worker, used when --device is cuda.")
+parser.add_argument("--hpus_per_worker", default=0, type=float, help="HPUs per worker, used when --device is hpu.")
+parser.add_argument("--deepspeed", action='store_true', help="Enable deepspeed inference.")
+parser.add_argument("--workers_per_group", default="2", type=int, help="Workers per group, used with --deepspeed.")
+parser.add_argument("--ipex", action='store_true', help="Enable ipex optimization.")
+parser.add_argument("--device", default="cpu", type=str, help="cpu, xpu, hpu or cuda.")
+parser.add_argument("--serve_local_only", action="store_true", help="Only support local access to url.")
+parser.add_argument("--serve_simple", action="store_true", help="Whether to serve OpenAI-compatible API for all models or serve simple endpoint based on model conf files.")
+parser.add_argument("--keep_serve_terminal", action="store_true", help="Whether to keep serve terminal.")

args = parser.parse_args(argv)

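serve.py keeps its main(argv=None) signature and hands argv to parser.parse_args(argv), which falls back to sys.argv[1:] when argv is None. A standalone sketch of that pattern (the argument subset below is illustrative, not serve.py itself):

import argparse

def main(argv=None):
    parser = argparse.ArgumentParser(description="Model Serve Script", add_help=True)
    parser.add_argument("--port", default=8000, type=int, help="The port of deployment address.")
    parser.add_argument("--serve_simple", action="store_true", help="Whether to serve OpenAI-compatible API for all models or serve simple endpoint based on model conf files.")
    # argv=None -> argparse reads sys.argv[1:]; a list -> parsed as given, handy for tests.
    return parser.parse_args(argv)

if __name__ == "__main__":
    print(main(["--port", "9000", "--serve_simple"]))  # Namespace(port=9000, serve_simple=True)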
2 changes: 1 addition & 1 deletion ui/start_ui.py
@@ -864,7 +864,7 @@ def _init_ui(self):
self.gr_chat = gr_chat

if __name__ == "__main__":
-parser = argparse.ArgumentParser(description="Start UI", add_help=False)
+parser = argparse.ArgumentParser(description="Web UI for LLMs", add_help=True)
parser.add_argument("--finetune_model_path", default="./", type=str, help="Where to save the finetune model.")
parser.add_argument("--finetune_checkpoint_path", default="", type=str, help="Where to save checkpoints.")
parser.add_argument("--default_rag_path", default="./vector_store/", type=str, help="The path of vectorstore used by RAG.")
