Updating index.md to fix 8 broken links (pytorch#2329)
* Update index.md

Update to fix a broken link in index.md where the trailing .md was being cut off from the management_api.md link. Added an anchor link to force the .md to be preserved.

* Update to index.md

Update to index.md to fix several links ending in .md that Sphinx is breaking. Added an anchor to each link and a corresponding anchor to the heading of each affected doc. Tested locally and appears to be working.
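
The pattern applied throughout is roughly the following (shown here with the first index.md link that the diff below touches):

```
<!-- Before: Sphinx cuts the trailing ".md" off the URL, breaking the link -->
[Model Management API](https://github.com/pytorch/serve/blob/master/docs/management_api.md)

<!-- After: an explicit anchor is appended to the link... -->
[Model Management API](https://github.com/pytorch/serve/blob/master/docs/management_api.md#management-api)

<!-- ...and the target doc's title is turned into a matching anchor -->
# [Management API](#management-api)
```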

* Update inference_api.md

* Updated typos

Fixed typos and updated wordlist.txt

* Update wordlist.txt

---------

Co-authored-by: sekyonda <[email protected]>
Co-authored-by: lxning <[email protected]>
3 people authored May 15, 2023
1 parent 35fb574 commit f01868f
Showing 8 changed files with 28 additions and 26 deletions.
16 changes: 8 additions & 8 deletions docs/index.md
@@ -4,19 +4,19 @@ TorchServe is a performant, flexible and easy to use tool for serving PyTorch mo


## ⚡ Why TorchServe
* [Model Management API](https://github.com/pytorch/serve/blob/master/docs/management_api.md): multi model management with optimized worker to model allocation
* [Inference API](https://github.com/pytorch/serve/blob/master/docs/inference_api.md): REST and gRPC support for batched inference
* [TorchServe Workflows](https://github.com/pytorch/serve/blob/master/examples/Workflows/README.md): deploy complex DAGs with multiple interdependent models
* [Model Management API](https://github.com/pytorch/serve/blob/master/docs/management_api.md#management-api): multi model management with optimized worker to model allocation
* [Inference API](https://github.com/pytorch/serve/blob/master/docs/inference_api.md#inference-api): REST and gRPC support for batched inference
* [TorchServe Workflows](https://github.com/pytorch/serve/blob/master/examples/Workflows/README.md#workflow-examples): deploy complex DAGs with multiple interdependent models
* Default way to serve PyTorch models in
* [Kubeflow](https://v0-5.kubeflow.org/docs/components/pytorchserving/)
* [MLflow](https://github.com/mlflow/mlflow-torchserve)
* [Sagemaker](https://aws.amazon.com/blogs/machine-learning/serving-pytorch-models-in-production-with-the-amazon-sagemaker-native-torchserve-integration/)
* [Kserve](https://kserve.github.io/website/0.8/modelserving/v1beta1/torchserve/): Supports both v1 and v2 API
* [Vertex AI](https://cloud.google.com/blog/topics/developers-practitioners/pytorch-google-cloud-how-deploy-pytorch-models-vertex-ai)
* Export your model for optimized inference. Torchscript out of the box, [ORT and ONNX](https://github.com/pytorch/serve/blob/master/docs/performance_guide.md), [IPEX](https://github.com/pytorch/serve/tree/master/examples/intel_extension_for_pytorch), [TensorRT](https://github.com/pytorch/serve/blob/master/docs/performance_guide.md), [FasterTransformer](https://github.com/pytorch/serve/tree/master/examples/FasterTransformer_HuggingFace_Bert)
* [Performance Guide](https://github.com/pytorch/serve/blob/master/docs/performance_guide.md): builtin support to optimize, benchmark and profile PyTorch and TorchServe performance
* [Expressive handlers](https://github.com/pytorch/serve/blob/master/CONTRIBUTING.md): An expressive handler architecture that makes it trivial to support inferencing for your usecase with [many supported out of the box](https://github.com/pytorch/serve/tree/master/ts/torch_handler)
* [Metrics API](https://github.com/pytorch/serve/blob/master/docs/metrics.md): out of box support for system level metrics with [Prometheus exports](https://github.com/pytorch/serve/tree/master/examples/custom_metrics), custom metrics and PyTorch profiler support
* Export your model for optimized inference. Torchscript out of the box, [ORT and ONNX](https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#performance-guide), [IPEX](https://github.com/pytorch/serve/tree/master/examples/intel_extension_for_pytorch), [TensorRT](https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#performance-guide), [FasterTransformer](https://github.com/pytorch/serve/tree/master/examples/FasterTransformer_HuggingFace_Bert)
* [Performance Guide](https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#performance-guide): builtin support to optimize, benchmark and profile PyTorch and TorchServe performance
* [Expressive handlers](https://github.com/pytorch/serve/blob/master/CONTRIBUTING.md#contributing-to-torchServe): An expressive handler architecture that makes it trivial to support inferencing for your usecase with [many supported out of the box](https://github.com/pytorch/serve/tree/master/ts/torch_handler)
* [Metrics API](https://github.com/pytorch/serve/blob/master/docs/metrics.md#torchserve-metrics): out of box support for system level metrics with [Prometheus exports](https://github.com/pytorch/serve/tree/master/examples/custom_metrics), custom metrics and PyTorch profiler support

## 🤔 How does TorchServe work

@@ -56,7 +56,7 @@ TorchServe is a performant, flexible and easy to use tool for serving PyTorch mo
* [TorchServe UseCases](https://github.com/pytorch/serve/blob/master/examples/README.md#usecases)
* [Model Zoo](https://github.com/pytorch/serve/blob/master/docs/model_zoo.md) - List of pre-trained model archives ready to be served for inference with TorchServe.

For [more examples](https://github.com/pytorch/serve/blob/master/examples/README.md)
For [more examples](https://github.com/pytorch/serve/blob/master/examples/README.md#torchserve-internals)


## Advanced Features
4 changes: 2 additions & 2 deletions docs/inference_api.md
@@ -1,4 +1,4 @@
# Inference API
# [Inference API](#inference-api)

Inference API is listening on port 8080 and only accessible from localhost by default. To change the default setting, see [TorchServe Configuration](configuration.md).

@@ -41,7 +41,7 @@ If the server is running, the response is:
}
```

"maxRetryTimeoutInSec" (default: 5MIN) can be defined in a model's config yaml file(eg. model-config.yaml). It is the maximum time window of recovering a dead backend worker. A healthy worker can be in the state: WORKER_STARTED, WORKER_MODEL_LOADED, or WORKER_STOPPED within maxRetryTimeoutInSec window. "Ping" endpont"
"maxRetryTimeoutInSec" (default: 5MIN) can be defined in a model's config yaml file(e.g model-config.yaml). It is the maximum time window of recovering a dead backend worker. A healthy worker can be in the state: WORKER_STARTED, WORKER_MODEL_LOADED, or WORKER_STOPPED within maxRetryTimeoutInSec window. "Ping" endpoint"
* return 200 + json message "healthy": for any model, the number of active workers is equal or larger than the configured minWorkers.
* return 500 + json message "unhealthy": for any model, the number of active workers is less than the configured minWorkers.
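
As a rough sketch of how this fits together (the value below is illustrative; the documented default for the timeout is 5 minutes, and 8080 is the default inference port noted at the top of this doc):

```
# Illustrative entry in the model's config yaml (model-config.yaml):
#   maxRetryTimeoutInSec: 600    # default is 300 seconds (5 minutes)
#
# Health check against the default inference port:
curl http://localhost:8080/ping
```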

20 changes: 10 additions & 10 deletions docs/management_api.md
@@ -1,4 +1,4 @@
# Management API
# [Management API](#management-api)

TorchServe provides the following APIs that allows you to manage models at runtime:

@@ -41,13 +41,13 @@ curl -X POST "http://localhost:8081/models?url=https://torchserve.pytorch.org/m
}
```

### Encrypted model serving
### Encrypted model serving
If you'd like to serve an encrypted model then you need to setup [S3 SSE-KMS](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html) with the following environment variables:
* AWS_ACCESS_KEY_ID
* AWS_SECRET_ACCESS_KEY
* AWS_DEFAULT_REGION

And set "s3_sse_kms=true" in HTTP request.
And set "s3_sse_kms=true" in HTTP request.
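
A rough sketch of the resulting registration call (the management port and model URL are the ones referenced on this page):

```
# Register an SSE-KMS encrypted model archive; the AWS credentials are picked
# up from the environment variables listed above.
curl -X POST "http://localhost:8081/models?url=https://torchserve.pytorch.org/sse-test/squeezenet1_1.mar&s3_sse_kms=true"
```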

For example: model squeezenet1_1 is [encrypted on S3 under your own private account](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html). The model http url on S3 is `https://torchserve.pytorch.org/sse-test/squeezenet1_1.mar`.
- if torchserve will run on EC2 instance (e.g. OS: ubuntu)
@@ -86,7 +86,7 @@ curl -v -X POST "http://localhost:8081/models?initial_workers=1&synchronous=fals
< x-request-id: 4dc54158-c6de-42aa-b5dd-ebcb5f721043
< content-length: 47
< connection: keep-alive
<
<
{
"status": "Processing worker updates..."
}
@@ -102,7 +102,7 @@ curl -v -X POST "http://localhost:8081/models?initial_workers=1&synchronous=true
< x-request-id: ecd2e502-382f-4c3b-b425-519fbf6d3b85
< content-length: 89
< connection: keep-alive
<
<
{
"status": "Model \"squeezenet1_1\" Version: 1.0 registered with 1 initial workers"
}
@@ -118,7 +118,7 @@ This API follows the [ManagementAPIsService.ScaleWorker](https://github.com/pyto
* `min_worker` - (optional) the minimum number of worker processes. TorchServe will try to maintain this minimum for specified model. The default value is `1`.
* `max_worker` - (optional) the maximum number of worker processes. TorchServe will make no more that this number of workers for the specified model. The default is the same as the setting for `min_worker`.
* `synchronous` - whether or not the call is synchronous. The default value is `false`.
* `timeout` - the specified wait time for a worker to complete all pending requests. If exceeded, the work process will be terminated. Use `0` to terminate the backend worker process immediately. Use `-1` to wait infinitely. The default value is `-1`.
* `timeout` - the specified wait time for a worker to complete all pending requests. If exceeded, the work process will be terminated. Use `0` to terminate the backend worker process immediately. Use `-1` to wait infinitely. The default value is `-1`.

Use the Scale Worker API to dynamically adjust the number of workers for any version of a model to better serve different inference request loads.

@@ -134,7 +134,7 @@ curl -v -X PUT "http://localhost:8081/models/noop?min_worker=3"
< x-request-id: 42adc58e-6956-4198-ad07-db6c620c4c1e
< content-length: 47
< connection: keep-alive
<
<
{
"status": "Processing worker updates..."
}
@@ -150,7 +150,7 @@ curl -v -X PUT "http://localhost:8081/models/noop?min_worker=3&synchronous=true"
< x-request-id: b72b1ea0-81c6-4cce-92c4-530d3cfe5d4a
< content-length: 63
< connection: keep-alive
<
<
{
"status": "Workers scaled to 3 for model: noop"
}
@@ -169,7 +169,7 @@ curl -v -X PUT "http://localhost:8081/models/noop/2.0?min_worker=3&synchronous=t
< x-request-id: 3997ccd4-ae44-4570-b249-e361b08d3d47
< content-length: 77
< connection: keep-alive
<
<
{
"status": "Workers scaled to 3 for model: noop, version: 2.0"
}
@@ -290,7 +290,7 @@ curl http://localhost:8081/models/noop/all
```

`GET /models/{model_name}/{model_version}?customized=true`
or
or
`GET /models/{model_name}?customized=true`
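
A minimal sketch of these two calls (the model name and version are illustrative, reusing the noop model from the examples above):

```
curl http://localhost:8081/models/noop/2.0?customized=true
curl http://localhost:8081/models/noop?customized=true
```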

Use the Describe Model API to get detail runtime status and customized metadata of a version of a model:
2 changes: 1 addition & 1 deletion docs/metrics.md
@@ -1,4 +1,4 @@
# TorchServe Metrics
# [TorchServe Metrics](#torchserve-metrics)

## Contents of this document

4 changes: 2 additions & 2 deletions docs/performance_guide.md
@@ -1,4 +1,4 @@
# Performance Guide
# [Performance Guide](#performance-guide)
In case you're interested in optimizing the memory usage, latency or throughput of a PyTorch model served with TorchServe, this is the guide for you.
## Optimizing PyTorch
There are many tricks to optimize PyTorch models for production including but not limited to distillation, quantization, fusion, pruning, setting environment variables and we encourage you to benchmark and see what works best for you. An experimental tool that may make this process easier is https://pypi.org/project/torchprep.
@@ -9,7 +9,7 @@ In general it's hard to optimize models and the easiest approach can be exportin

`pip install torchserve[onnx]`

In particular TorchServe has native support for ONNX models which can be loaded via ORT for both accelerated CPU and GPU inference. ONNX operates a bit differentyl from a regular PyTorch model in that when you're running the conversion you need to explicity set and name your input and output dimensions. See https://github.com/pytorch/serve/blob/master/test/pytest/test_onnx.py for an example. So at a high level what TorchServe allows you to do is
In particular TorchServe has native support for ONNX models which can be loaded via ORT for both accelerated CPU and GPU inference. ONNX operates a bit differently from a regular PyTorch model in that when you're running the conversion you need to explicitly set and name your input and output dimensions. See https://github.com/pytorch/serve/blob/master/test/pytest/test_onnx.py for an example. So at a high level what TorchServe allows you to do is
1. Package serialized ONNX weights `torch-model-archiver --serialized-file model.onnx ...`
2. Load those weights from `base_handler.py` using `ort_session = ort.InferenceSession(self.model_pt_path, providers=providers, sess_options=sess_options)` which supports reasonable defaults for both CPU and GPU inference
3. Allow you define custom pre and post processing functions to pass in data in the format your onnx model expects with a custom handler
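
A minimal sketch of the conversion step described above, assuming a torchvision model and a fixed 224x224 image input (the model choice, names, and shapes here are illustrative, not taken from the linked test):

```
import torch
import torchvision

# Illustrative model and dummy input; substitute your own module and shape.
model = torchvision.models.resnet18(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# The conversion requires explicitly naming inputs and outputs, and marking any
# dynamic dimensions (here the batch axis) so ORT knows what shapes to expect.
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```

The resulting `model.onnx` is then what step 1 packages with `torch-model-archiver` and step 2 loads through `ort.InferenceSession` in `base_handler.py`.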
2 changes: 1 addition & 1 deletion examples/README.md
@@ -1,4 +1,4 @@
# Examples showcasing TorchServe Features and Integrations
# [Examples showcasing TorchServe Features and Integrations](#torchserve-internals)

## TorchServe Internals

4 changes: 2 additions & 2 deletions examples/Workflows/README.md
@@ -1,4 +1,4 @@
# Workflow examples
# [Workflow examples](#workflow-examples)

Workflows can be used to compose an ensemble of Pytorch models and Python functions and package them in a `war` file. A workflow is executed as a DAG where the nodes can be either Pytorch models packaged as `mar` files or function nodes specified in the workflow handler file. The DAG can be used to define both sequential or parallel pipelines.

@@ -8,7 +8,7 @@ As an example a sequential pipeline may look something like
input -> function1 -> model1 -> model2 -> function2 -> output
```

And a parallel pipeline may look something like
And a parallel pipeline may look something like

```
model1
2 changes: 2 additions & 0 deletions ts_scripts/spellcheck_conf/wordlist.txt
@@ -1049,3 +1049,5 @@ torchrun
nproc
largemodels
torchpippy
InferenceSession
maxRetryTimeoutInSec
