TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. It deals with the inference aspect of machine learning, taking models after training and managing their lifetimes, providing clients with versioned access via a high-performance, reference-counted lookup table. TensorFlow Serving provides out-of-the-box integration with TensorFlow models, but can be easily extended to serve other types of models and data.
To note a few features:
- Can serve multiple models, or multiple versions of the same model simultaneously
- Exposes both gRPC as well as HTTP inference endpoints
- Allows deployment of new model versions without changing any client code
- Supports canarying new versions and A/B testing experimental models
- Adds minimal latency to inference time due to efficient, low-overhead implementation
- Features a scheduler that groups individual inference requests into batches for joint execution on GPU, with configurable latency controls
- Supports many servables: Tensorflow models, embeddings, vocabularies, feature transformations and even non-Tensorflow-based machine learning models
# Download the TensorFlow Serving Docker image and repo
docker pull tensorflow/serving
git clone https://github.com/tensorflow/serving
# Location of demo models
TESTDATA="$(pwd)/serving/tensorflow_serving/servables/tensorflow/testdata"
# Start TensorFlow Serving container and open the REST API port
docker run -t --rm -p 8501:8501 \
-v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two" \
-e MODEL_NAME=half_plus_two \
tensorflow/serving &
# Query the model using the predict API
curl -d '{"instances": [1.0, 2.0, 5.0]}' \
-X POST http://localhost:8501/v1/models/half_plus_two:predict
# Returns => { "predictions": [2.5, 3.0, 4.5] }
Refer to the official Tensorflow documentations site for a complete tutorial to train and serve a Tensorflow Model.
The easiest and most straight-forward way of using TensorFlow Serving is with Docker images. We highly recommend this route unless you have specific needs that are not addressed by running in a container.
- Install Tensorflow Serving using Docker (Recommended)
- Install Tensorflow Serving without Docker (Not Recommended)
- Build Tensorflow Serving from Source with Docker
- Deploy Tensorflow Serving on Kubernetes
In order to serve a Tensorflow model, simply export a SavedModel from your Tensorflow program. SavedModel is a language-neutral, recoverable, hermetic serialization format that enables higher-level systems and tools to produce, consume, and transform TensorFlow models.
Please refer to Tensorflow documentation for detailed instructions on how to export SavedModels.
- Follow a tutorial on Serving Tensorflow models
- Read the REST API Guide or gRPC API definition
- Use SavedModel Warmup if initial inference requests are slow due to lazy initialization of graph
- Configure models, version and version policy via Serving Config
- If encountering issues regarding model signatures, please read the SignatureDef documentation
- If using a model with custom ops, learn how to serve models with custom ops
Tensorflow Serving's architecture is highly modular. You can use some parts individually (e.g. batch scheduling) and/or extend it to serve new use cases.
- Ensure you are familiar with building Tensorflow Serving
- Learn about Tensorflow Serving's architecture
- Explore the Tensorflow Serving C++ API reference
- Create a new type of Servable
- Create a custom Source of Servable versions
If you'd like to contribute to TensorFlow Serving, be sure to review the contribution guidelines.
Please refer to the official TensorFlow website for more information.