
Releases: triton-inference-server/server

Release 1.7.0, corresponding to NGC container 19.10

30 Oct 00:03

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.7.0

  • A Client SDK container is now provided on NGC in addition to the inference server container. The client SDK container includes the client libraries and examples.

  • TensorRT optimization may now be enabled for any TensorFlow model by enabling the feature in the optimization section of the model configuration (see the configuration sketch after this list).

  • The ONNX Runtime backend now includes the TensorRT and OpenVINO execution providers. These providers are enabled in the optimization section of the model configuration.

  • Automatic configuration generation (--strict-model-config=false) now works correctly for TensorRT models with variable-sized inputs and/or outputs.

  • Multiple model repositories may now be specified on the command line. Optional command-line options can be used to explicitly load specific models from each repository.

  • Ensemble models are now pruned dynamically so that only models needed to calculate the requested outputs are executed.

  • The example clients now include a simple Go example that uses the GRPC API.
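
The TensorRT optimization item above relies on the optimization section of the model configuration. As a rough sketch, assuming the model-configuration grammar described in the server documentation and a hypothetical TensorFlow graphdef model, the relevant part of config.pbtxt might look like the following; the exact field names should be verified against the documentation for this release.

    # config.pbtxt sketch for a hypothetical TensorFlow model.
    # The optimization section requests the TensorRT execution accelerator;
    # the ONNX Runtime TensorRT/OpenVINO providers noted above are reportedly
    # enabled through this same section.
    name: "my_tf_model"
    platform: "tensorflow_graphdef"
    max_batch_size: 8
    input [
      { name: "INPUT0", data_type: TYPE_FP32, dims: [ 224, 224, 3 ] }
    ]
    output [
      { name: "OUTPUT0", data_type: TYPE_FP32, dims: [ 1000 ] }
    ]
    optimization {
      execution_accelerators {
        gpu_execution_accelerator : [ { name : "tensorrt" } ]
      }
    }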

Known Issues

  • In TensorRT 6.0.1, reformat-free I/O is not supported.

  • Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.7.0_ubuntu1604.clients.tar.gz and v1.7.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.7.0_ubuntu1604.custombackend.tar.gz and v1.7.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

Release 1.6.0, corresponding to NGC container 19.09

27 Sep 21:50

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.6.0

  • Added TensorRT 6 support, which includes support for TensorRT dynamic
    shapes.

  • Shared memory support is added as an alpha feature in this release. This
    support allows input and output tensors to be communicated via shared
    memory instead of over the network. Currently only system (CPU) shared
    memory is supported.

  • Amazon S3 is now supported as a remote file system for model repositories.
    Use the s3:// prefix on model repository paths to reference S3 locations
    (see the example after this list).

  • The inference server library API is available as a beta in this release.
    The library API allows you to link against libtrtserver.so so that you can
    include all the inference server functionality directly in your application.

  • GRPC endpoint performance improvement. The inference server’s GRPC endpoint
    now uses significantly less memory while delivering higher performance.

  • The ensemble scheduler is now more flexible in allowing batching and
    non-batching models to be composed together in an ensemble.

  • The ensemble scheduler will now keep tensors in GPU memory between models
    when possible. Doing so significantly increases performance of some ensembles
    by avoiding copies to and from system memory.

  • The performance client, perf_client, now supports models with variable-sized
    input tensors.
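
A minimal sketch of pointing the server at an S3-hosted model repository, assuming the trtserver binary name and the --model-repository flag used by this release; the bucket path is a placeholder.

    # Point the server at a model repository stored in an S3 bucket
    # (placeholder bucket name). AWS credentials are assumed to come from
    # the usual AWS environment variables or configuration.
    trtserver --model-repository=s3://my-bucket/model_repository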

Known Issues

  • The ONNX Runtime backend could not be updated to the 0.5.0 release due to multiple performance and correctness issues with that release.

  • In TensorRT 6:

    • Reformat-free I/O is not supported.
    • Only models that have a single optimization profile are currently supported.

  • Google Kubernetes Engine (GKE) version 1.14 contains a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.6.0_ubuntu1604.clients.tar.gz and v1.6.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.6.0_ubuntu1604.custombackend.tar.gz and v1.6.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

Release 1.5.0, corresponding to NGC container 19.08

03 Sep 23:53

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.5.0

  • Added a new execution mode that allows the inference server to start without
    loading any models from the model repository. Model loading and unloading
    are then controlled by a new GRPC/HTTP model control API.

  • Added a new instance-group mode that allows TensorFlow models that
    explicitly distribute inferencing across multiple GPUs to run in that
    manner in the inference server (see the configuration sketch after this
    list).

  • Improved input/output tensor reshape to allow variable-sized dimensions in
    tensors being reshaped.

  • Added a C++ wrapper around the custom backend C API to simplify the creation
    of custom backends. This wrapper is included in the custom backend SDK.

  • Improved the accuracy of the compute statistic reported for inference
    requests. Previously the compute statistic included some additional time
    beyond the actual compute time.

  • The performance client, perf_client, now reports more information for ensemble
    models, including statistics for all contained models and the entire ensemble.
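
The new instance-group mode above concerns TensorFlow models that manage their own multi-GPU placement. One plausible configuration sketch, assuming the instance_group settings documented for the model configuration (the KIND_MODEL kind shown here is an assumption, not quoted from this release):

    # config.pbtxt fragment: a single model-managed instance, so the
    # TensorFlow graph's own device placement (possibly spanning multiple
    # GPUs) is preserved. KIND_MODEL is an assumption; check the
    # instance-group documentation for the exact kind.
    instance_group [
      {
        count: 1
        kind: KIND_MODEL
      }
    ]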

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.5.0_ubuntu1604.clients.tar.gz and v1.5.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.5.0_ubuntu1604.custombackend.tar.gz and v1.5.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

Release 1.4.0, corresponding to NGC container 19.07

30 Jul 23:06

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.4.0

  • Added libtorch as a new backend. PyTorch models manually decorated or automatically traced to produce TorchScript can now be run directly by the inference server.

  • Build system converted from Bazel to CMake. The new CMake-based build system is more transparent, portable, and modular.

  • To simplify the creation of custom backends, a Custom Backend SDK and improved documentation are now available.

  • Improved AsyncRun API in C++ and Python client libraries.

  • perf_client can now use user-supplied input data (previously perf_client could only use random or zero input data).

  • perf_client now reports latency at multiple confidence percentiles (p50, p90, p95, p99) as well as a user-supplied percentile that is also used to stabilize latency results.

  • Improvements to automatic model configuration creation (--strict-model-config=false); see the example after this list.

  • C++ and Python client libraries now allow additional HTTP headers to be specified when using the HTTP protocol.
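
A minimal sketch of running the server with automatic model-configuration generation, assuming the trtserver binary name; the repository path is a placeholder.

    # Let the server derive model configurations automatically instead of
    # requiring a complete config.pbtxt for every model.
    trtserver --model-repository=/models --strict-model-config=false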

Known Issues

  • Google Cloud Storage (GCS) support, which was not available in the 19.06 release, has been restored in this release.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.4.0_ubuntu1604.clients.tar.gz and v1.4.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.4.0_ubuntu1604.custombackend.tar.gz and v1.4.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

Release 1.3.0, corresponding to NGC container 19.06

28 Jun 16:36

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.3.0

  • The ONNX Runtime (github.com/Microsoft/onnxruntime) is now integrated into the inference server. ONNX models can now be used directly in a model repository.

  • The HTTP health port may now be specified independently of the inference and status HTTP port with the --http-health-port flag (see the example after this list).

  • Fixed a bug in perf_client that caused high CPU usage and could lower the measured inferences/sec in some cases.
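
A minimal sketch of the new health-port flag, assuming the trtserver binary name; the port and repository path are placeholders.

    # Serve the health endpoint on its own port (8080 here is arbitrary)
    # while inference and status remain on the default HTTP port.
    trtserver --model-repository=/models --http-health-port=8080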

Known Issues

  • Google Cloud Storage (GCS) support is not available in the 19.06 release. Support for GCS is available on the master branch and will be re-enabled in the 19.07 release.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.3.0_ubuntu1604.clients.tar.gz and v1.3.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

Release 1.2.0, corresponding to NGC container 19.05

24 May 16:20

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.2.0

  • Ensembling is now available. An ensemble represents a pipeline of one or more models and the connection of input and output tensors between those models. A single inference request to an ensemble will trigger the execution of the entire pipeline.

  • Added Helm chart that deploys a single TensorRT Inference Server into a Kubernetes cluster.

  • The client Makefile now supports building for both Ubuntu 16.04 and Ubuntu 18.04. The Python wheel produced from the build is now compatible with both Python2 and Python3.

  • The perf_client application now has a --percentile flag that can be used to report latencies instead of reporting average latency (which remains the default). For example, using --percentile=99 causes perf_client to report the 99th percentile latency (see the example after this list).

  • The perf_client application now has a -z option to use zero-valued input tensors instead of random values.

  • Improved error reporting of incorrect input/output tensor names for TensorRT models.

  • Added --allow-gpu-metrics option to enable/disable reporting of GPU metrics.
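
A minimal perf_client sketch combining the two new flags above; "my_model" is a placeholder model name, and the -m flag for selecting the model is assumed from the perf_client documentation.

    # Report 99th-percentile latency and send zero-valued input tensors.
    perf_client -m my_model --percentile=99 -z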

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.2.0_ubuntu1604.clients.tar.gz and v1.2.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

Release 1.1.0, corresponding to NGC container 19.04

24 Apr 00:07

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.1.0

  • Client libraries and examples now build with a separate Makefile (a Dockerfile is also included for convenience).

  • Input or output tensors with variable-size dimensions (indicated by -1 in the model configuration) can now represent tensors where the variable dimension has value 0 (zero).

  • Zero-sized input and output tensors are now supported for batching models. This enables the inference server to support models that require inputs and outputs that have shape [ batch-size ].

  • TensorFlow custom operations (C++) can now be built into the inference server. An example and documentation are included in this release.

Client Libraries and Examples

An Ubuntu 16.04 build of the client libraries and examples is included in this release in the attached v1.1.0.clients.tar.gz. See the documentation section 'Building the Client Libraries and Examples' for more information on using this file.

Release 1.0.0, corresponding to NGC container 19.03

18 Mar 20:11

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.0.0

  • 1.0.0 is the first GA, non-beta, release of TensorRT Inference Server. See the README for information on backwards-compatibility guarantees for this and future releases.

  • Added support for stateful models and backends that require multiple inference requests to be routed to the same model instance/batch slot. The new sequence batcher provides scheduling and batching capabilities for this class of models (see the configuration sketch after this list).

  • Added GRPC streaming protocol support for inference requests.

  • The HTTP front-end is now asynchronous to enable lower-latency and higher-throughput handling of inference requests.

  • Enhanced perf_client to support stateful models and backends.
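
For the sequence batcher item above, a sketch of what a stateful model's configuration could look like, assuming the sequence_batching section and the sequence start/ready control inputs described in the server documentation; the tensor names, control kinds, and idle timeout shown here are illustrative only.

    # config.pbtxt fragment for a stateful model using the sequence batcher.
    # Control tensor names and the idle timeout are placeholders; the control
    # kinds are assumptions based on the sequence batcher documentation.
    sequence_batching {
      max_sequence_idle_microseconds: 5000000
      control_input [
        {
          name: "START"
          control [ { kind: CONTROL_SEQUENCE_START, fp32_false_true: [ 0, 1 ] } ]
        },
        {
          name: "READY"
          control [ { kind: CONTROL_SEQUENCE_READY, fp32_false_true: [ 0, 1 ] } ]
        }
      ]
    }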

Client Libraries and Examples

An Ubuntu 16.04 build of the client libraries and examples is included in this release in the attached v1.0.0.clients.tar.gz. See the documentation section 'Building the Client Libraries and Examples' for more information on using this file.

Release 0.11.0 beta, corresponding to NGC container 19.02

28 Feb 02:32

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 0.11.0 Beta

  • Variable-size input and output tensor support. Models that accept variable-size input tensors and produce variable-size output tensors are now supported in the model configuration by using a dimension size of -1 for those dimensions that can take on any size (see the configuration sketch after this list).

  • String datatype support. For TensorFlow models and custom backends, input and output tensors can contain strings.

  • Improved support for non-GPU systems. The inference server will run correctly on systems that do not contain GPUs and that do not have nvidia-docker or CUDA installed.
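
A configuration sketch of the variable-size dimension and string datatype items above, with placeholder tensor names; the exact type and field names should be checked against the model-configuration documentation.

    # config.pbtxt fragment: -1 marks a dimension whose size can vary per
    # request; TYPE_STRING is used for a string-valued input.
    input [
      {
        name: "TEXT_INPUT"
        data_type: TYPE_STRING
        dims: [ -1 ]
      }
    ]
    output [
      {
        name: "SCORES"
        data_type: TYPE_FP32
        dims: [ -1 ]
      }
    ]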

Client Libraries and Examples

An Ubuntu 16.04 build of the client libraries and examples is included in this release in the attached v0.11.0.clients.tar.gz. See the documentation section 'Building the Client Libraries and Examples' for more information on using this file.

Release 0.10.0 beta, corresponding to NGC container 19.01

28 Jan 21:03

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server (TRTIS) provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 0.10.0 Beta

  • Custom backend support. TRTIS allows individual models to be implemented with custom backends instead of by a deep-learning framework. With a custom backend a model can implement any logic desired, while still benefiting from the GPU support, concurrent execution, dynamic batching and other features provided by TRTIS.