Update README and versions for 19.07 release
David Goodwin committed Jul 22, 2019
1 parent b30036f commit 582287e
Showing 3 changed files with 217 additions and 8 deletions.
8 changes: 4 additions & 4 deletions Dockerfile
@@ -163,8 +163,8 @@ RUN python3 /workspace/onnxruntime/tools/ci_build/build.py --build_dir /workspac
############################################################################
FROM ${BASE_IMAGE} AS trtserver_build

ARG TRTIS_VERSION=1.4.0dev
ARG TRTIS_CONTAINER_VERSION=19.07dev
ARG TRTIS_VERSION=1.4.0
ARG TRTIS_CONTAINER_VERSION=19.07

# libgoogle-glog0v5 is needed by caffe2 libraries.
RUN apt-get update && \
@@ -301,8 +301,8 @@ ENTRYPOINT ["/opt/tensorrtserver/nvidia_entrypoint.sh"]
############################################################################
FROM ${BASE_IMAGE}

ARG TRTIS_VERSION=1.4.0dev
ARG TRTIS_CONTAINER_VERSION=19.07dev
ARG TRTIS_VERSION=1.4.0
ARG TRTIS_CONTAINER_VERSION=19.07

ENV TENSORRT_SERVER_VERSION ${TRTIS_VERSION}
ENV NVIDIA_TENSORRT_SERVER_VERSION ${TRTIS_CONTAINER_VERSION}
215 changes: 212 additions & 3 deletions README.rst
@@ -30,13 +30,222 @@
NVIDIA TensorRT Inference Server
================================

**NOTE: You are currently on the r19.07 branch which tracks
stabilization towards the next release. This branch is not usable
during stabilization.**
**NOTICE: The r19.07 branch has been converted to use CMake
to build the server, clients and other artifacts. Read the new
documentation carefully to understand the new** `build process
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/build.html>`_.

.. overview-begin-marker-do-not-remove

The NVIDIA TensorRT Inference Server provides a cloud inferencing
solution optimized for NVIDIA GPUs. The server provides an inference
service via an HTTP or GRPC endpoint, allowing remote clients to
request inferencing for any model being managed by the server.
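
As a quick illustration of the HTTP endpoint, here is a minimal Python
sketch that queries the server status API. It assumes the server's
default HTTP port (8000) and the version-1 ``/api/status`` path; both
are assumptions about a typical deployment rather than details taken
from this commit::

  import requests

  # Assumption: server running locally with the default HTTP port.
  url = "http://localhost:8000/api/status"

  response = requests.get(url)
  response.raise_for_status()
  # The status report describes the server and the models it is serving.
  print(response.text)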

What's New In 1.4.0
-------------------

* Added libtorch as a new backend. PyTorch models manually decorated
or automatically traced to produce TorchScript can now be run
  directly by the inference server (a tracing sketch follows this
  list).

* Build system converted from bazel to CMake. The new CMake-based
build system is more transparent, portable and modular.

* To simplify the creation of custom backends, a Custom Backend SDK
  and improved documentation are now available.

* Improved AsyncRun API in C++ and Python client libraries.

* perf_client can now use user-supplied input data (previously
perf_client could only use random or zero input data).

* perf_client now reports latency at multiple confidence percentiles
(p50, p90, p95, p99) as well as a user-supplied percentile that is
also used to stabilize latency results.

* Improvements to automatic model configuration creation
(-\\-strict-model-config=false).

* C++ and Python client libraries now allow additional HTTP headers to
be specified when using the HTTP protocol.
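
For the new libtorch backend mentioned above, a TorchScript model is
typically produced by tracing a PyTorch module. The sketch below is
illustrative only: the ``tiny_model`` name, the ``1/`` version
directory, and the ``model.pt`` filename are assumptions about a
typical model-repository layout, not details taken from this commit::

  import torch

  class TinyModel(torch.nn.Module):
      def __init__(self):
          super().__init__()
          self.linear = torch.nn.Linear(4, 2)

      def forward(self, x):
          return torch.softmax(self.linear(x), dim=-1)

  # Trace the model with a representative input to produce TorchScript.
  example = torch.randn(1, 4)
  traced = torch.jit.trace(TinyModel().eval(), example)

  # Save the TorchScript module. For serving, the file would be copied
  # into a version subdirectory of the model repository, for example
  # <model-repository>/tiny_model/1/model.pt (assumed layout).
  traced.save("model.pt")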

Features
--------

* `Multiple framework support
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_repository.html#framework-model-definition>`_. The
server can manage any number and mix of models (limited by system
disk and memory resources). Supports TensorRT, TensorFlow GraphDef,
TensorFlow SavedModel, ONNX, PyTorch, and Caffe2 NetDef model
formats. Also supports TensorFlow-TensorRT integrated
models. Variable-size input and output tensors are allowed if
supported by the framework. See `Capabilities
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/capabilities.html#capabilities>`_
for detailed support information for each framework.

* `Concurrent model execution support
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#instance-groups>`_. Multiple
models (or multiple instances of the same model) can run
simultaneously on the same GPU.

* Batching support. For models that support batching, the server can
accept requests for a batch of inputs and respond with the
corresponding batch of outputs. The inference server also supports
multiple `scheduling and batching
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#scheduling-and-batching>`_
algorithms that combine individual inference requests together to
  improve inference throughput. These scheduling and batching
  decisions are transparent to the client requesting inference (a
  conceptual sketch of dynamic batching follows this feature list).

* `Custom backend support
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_repository.html#custom-backends>`_. The inference server
allows individual models to be implemented with custom backends
instead of by a deep-learning framework. With a custom backend a
model can implement any logic desired, while still benefiting from
the GPU support, concurrent execution, dynamic batching and other
features provided by the server.

* `Ensemble support
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/models_and_schedulers.html#ensemble-models>`_. An
ensemble represents a pipeline of one or more models and the
connection of input and output tensors between those models. A
single inference request to an ensemble will trigger the execution
of the entire pipeline.

* Multi-GPU support. The server can distribute inferencing across all
system GPUs.

* The inference server `monitors the model repository
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_repository.html#modifying-the-model-repository>`_
for any change and dynamically reloads the model(s) when necessary,
without requiring a server restart. Models and model versions can be
added and removed, and model configurations can be modified while
the server is running.

* `Model repositories
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_repository.html#>`_
may reside on a locally accessible file system (e.g. NFS) or in
Google Cloud Storage.

* Readiness and liveness `health endpoints
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/http_grpc_api.html#health>`_
suitable for any orchestration or deployment framework, such as
  Kubernetes (see the example following this feature list).

* `Metrics
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/metrics.html>`_
indicating GPU utilization, server throughput, and server latency.
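
To make the dynamic batching behavior described above more concrete,
the following conceptual Python sketch shows the idea of gathering
individual requests into a batch, waiting a bounded time for a
preferred batch size to fill. This is an illustration only; the real
scheduler runs inside the server and is driven by the model
configuration, and the function and parameter names here are invented
for the example::

  import time
  from queue import Empty, Queue

  def gather_batch(request_queue, preferred_batch_size=8,
                   max_queue_delay_s=0.0001):
      """Collect up to preferred_batch_size requests, waiting at most
      max_queue_delay_s after the first request arrives."""
      batch = [request_queue.get()]   # block until one request arrives
      deadline = time.monotonic() + max_queue_delay_s
      while len(batch) < preferred_batch_size:
          remaining = deadline - time.monotonic()
          if remaining <= 0:
              break
          try:
              batch.append(request_queue.get(timeout=remaining))
          except Empty:
              break
      return batch                    # executed as one framework call

  # Three requests arriving close together come out as a single batch.
  q = Queue()
  for i in range(3):
      q.put({"request_id": i, "input": [i]})
  print(gather_batch(q, preferred_batch_size=4, max_queue_delay_s=0.01))

The health and metrics endpoints listed above can be exercised with
plain HTTP requests. The ports (8000 for the HTTP service, 8002 for
metrics) and the version-1 paths are assumed defaults; adjust them for
your deployment::

  import requests

  SERVER = "http://localhost:8000"   # assumed default HTTP service port
  METRICS = "http://localhost:8002"  # assumed default metrics port

  # Liveness and readiness checks return HTTP 200 when healthy.
  for path in ("/api/health/live", "/api/health/ready"):
      print(path, requests.get(SERVER + path).status_code)

  # Prometheus-format metrics: GPU utilization, throughput, latency, ...
  print(requests.get(METRICS + "/metrics").text[:500])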

.. overview-end-marker-do-not-remove

The current release of the TensorRT Inference Server is 1.4.0 and
corresponds to the 19.07 release of the tensorrtserver container on
`NVIDIA GPU Cloud (NGC) <https://ngc.nvidia.com>`_. The branch for
this release is `r19.07
<https://github.com/NVIDIA/tensorrt-inference-server/tree/r19.07>`_.

Backwards Compatibility
-----------------------

Continuing in version 1.4.0, the following interfaces maintain
backwards compatibility with the 1.0.0 release. If you have model
configuration files, custom backends, or clients that use the
inference server HTTP or GRPC APIs (either directly or through the
client libraries) from releases prior to 1.0.0 (19.03) you should edit
and rebuild those as necessary to match the version 1.0.0 APIs.

These interfaces will maintain backwards compatibility for all future
1.x.y releases (see below for exceptions):

* Model configuration as defined in `model_config.proto
<https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/core/model_config.proto>`_.

* The inference server HTTP and GRPC APIs as defined in `api.proto
<https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/core/api.proto>`_
and `grpc_service.proto
<https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/core/grpc_service.proto>`_.

* The custom backend interface as defined in `custom.h
<https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/backends/custom/custom.h>`_.

As new features are introduced, they may temporarily have beta status
where they are subject to change in non-backwards-compatible
ways. When they exit beta they will conform to the
backwards-compatibility guarantees described above. Currently the
following features are in beta:

* In the model configuration defined in `model_config.proto
<https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/core/model_config.proto>`_
the sections related to model ensembling are currently in beta. In
particular, the ModelEnsembling message will potentially undergo
non-backwards-compatible changes.


Documentation
-------------

The User Guide, Developer Guide, and API Reference `documentation
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/index.html>`_
provide guidance on installing, building and running the latest
release of the TensorRT Inference Server.

You can also view the documentation for the `master branch
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/index.html>`_
and for `earlier releases
<https://docs.nvidia.com/deeplearning/sdk/inference-server-archived/index.html>`_.

READMEs for deployment examples can be found in subdirectories of
deploy/, for example, `deploy/single_server/README.rst
<https://github.com/NVIDIA/tensorrt-inference-server/tree/master/deploy/single_server/README.rst>`_.

The `Release Notes
<https://docs.nvidia.com/deeplearning/sdk/inference-release-notes/index.html>`_
and `Support Matrix
<https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html>`_
indicate the required versions of the NVIDIA Driver and CUDA, and also
describe which GPUs are supported by the inference server.

Other Documentation
^^^^^^^^^^^^^^^^^^^

* `Maximizing Utilization for Data Center Inference with TensorRT
Inference Server
<https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server>`_.

* `NVIDIA TensorRT Inference Server Boosts Deep Learning Inference
<https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/>`_.

* `GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT
Inference Server and Kubeflow
<https://www.kubeflow.org/blog/nvidia_tensorrt/>`_.

Contributing
------------

Contributions to TensorRT Inference Server are more than welcome. To
contribute, make a pull request and follow the guidelines outlined in
the `Contributing <CONTRIBUTING.md>`_ document.

Reporting problems, asking questions
------------------------------------

We appreciate any feedback, questions or bug reports regarding this
project. When help with code is needed, follow the process outlined in
the Stack Overflow document on minimal, complete, verifiable examples
(https://stackoverflow.com/help/mcve). Ensure posted examples are:

* minimal – use as little code as possible that still produces the
same problem

* complete – provide all parts needed to reproduce the problem. Check
  whether you can strip external dependencies and still show the
  problem. The less time we spend reproducing problems, the more time
  we have to fix them

* verifiable – test the code you're about to provide to make sure it
reproduces the problem. Remove all other problems that are not
related to your request/question.

.. |License| image:: https://img.shields.io/badge/License-BSD3-lightgrey.svg
:target: https://opensource.org/licenses/BSD-3-Clause
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
1.4.0dev
1.4.0
