TensorFlow Serving

The Basics and a Quick Tutorial

What Is TensorFlow Serving?

TensorFlow is an open-source machine learning platform developed by Google. It provides a framework for developing and training machine learning models, as well as tools for deploying those models in a production environment.

TensorFlow Serving is a library for serving machine learning models developed with TensorFlow. It allows users to deploy their TensorFlow models in a production environment, where they can be served via an HTTP REST API or a gRPC interface. TensorFlow Serving makes it easy to deploy and manage machine learning models, as it includes features such as model versioning, automatic batching of requests, and support for canary deployments.

With TensorFlow Serving, users can serve their TensorFlow models in a production environment without having to worry about the underlying infrastructure or the serving details.

This is part of an extensive series of guides about machine learning.

The Need for TensorFlow Serving

TensorFlow Serving is useful because it allows users to separate the code for training machine learning models from the code for serving those models in a production environment. This separation of concerns can make it easier to develop and maintain machine learning models, as it allows users to focus on the training process without having to worry about the details of how the model will be served in production.

TensorFlow Serving also provides built-in model versioning, which is important because it allows users to deploy multiple versions of a model and easily switch between them. This is useful when working with machine learning models, as it allows users to test new versions of a model without affecting the production version. It also makes it easier to roll back to a previous version of a model if needed.
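
For illustration, here is a minimal sketch of the convention TensorFlow Serving relies on for versioning: each version of a model is exported to its own numbered subdirectory under a common base path. The base directory name models/my_model and the trivial Keras model are hypothetical placeholders:

import tensorflow as tf

# A trivial stand-in model; in practice this would be your trained model.
def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(1, input_shape=(4,), activation='sigmoid')
    ])

# TensorFlow Serving watches a base directory and treats each numbered
# subdirectory as one version of the model. Exporting version 2 next to
# version 1 lets the server pick it up and, by default, serve the latest.
base_path = 'models/my_model'  # hypothetical base directory
tf.saved_model.save(build_model(), base_path + '/1')  # version 1
tf.saved_model.save(build_model(), base_path + '/2')  # version 2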

In addition, TensorFlow Serving provides efficient model inference, which is the process of using a trained machine learning model to make predictions. TensorFlow Serving is optimized for efficient model inference, allowing users to serve their models at low latencies and high throughput. This is important when serving machine learning models in a production environment, as it allows the models to handle a large number of requests in real-time.

How Does TensorFlow Serving Work?

TensorFlow Serving works by providing a library for serving TensorFlow models in a production environment. When you want to serve a TensorFlow model with TensorFlow Serving, you first export the model from TensorFlow in a format that TensorFlow Serving can understand, typically the SavedModel format.

Once the model is exported, you can start the TensorFlow Serving server and point it to the directory containing the exported model. The TensorFlow Serving server will then load the model and start serving it via an HTTP REST API or a gRPC interface.

Users can then send requests containing input data for the model to the TensorFlow Serving server. The server uses the loaded model to process the input data and generate a prediction, which is returned to the user via the HTTP REST API or gRPC interface.
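
The tutorial later in this guide uses the REST API. As a sketch of the gRPC path, a prediction request from Python might look like the following. It assumes the tensorflow-serving-api package is installed, the server's gRPC port 8500 is exposed, and that the model name my_model and the input tensor name dense_input (both placeholders) match your deployment:

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Connect to TensorFlow Serving's gRPC endpoint (default port 8500).
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Build a PredictRequest addressed to a specific model and signature.
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'                  # placeholder model name
request.model_spec.signature_name = 'serving_default'
request.inputs['dense_input'].CopyFrom(               # input name is an assumption
    tf.make_tensor_proto([[0.1] * 8], dtype=tf.float32))

# Send the request and read the prediction from the response.
response = stub.Predict(request, 10.0)                # 10-second timeout
print(response.outputs)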

TensorFlow Serving also includes features such as model versioning, automatic batching of requests, and support for canary deployments, which make it easier to deploy and manage machine learning models in a production environment.

The diagram below shows the TensorFlow Serving architecture and its various abstractions.

Image Source: TensorFlow

Tutorial: Serving a TensorFlow Model in Docker

Here is a tutorial on how to serve a TensorFlow model in Docker using TensorFlow Serving components. You will need to have Docker installed. It is also useful to have the Python file mnist_saved_model.py, a script that demonstrates how to train and export a simple machine learning model using TensorFlow and the MNIST dataset. The script trains a convolutional neural network (CNN) to classify handwritten digits from the MNIST dataset, and then exports the trained model in a format that can be served by TensorFlow Serving.

Note: You can download mnist_saved_model.py here.

Step 1: Export the TensorFlow Model

First, you will need to export the TensorFlow model that you want to serve. You can do this using the tf.saved_model.save function, which saves the model in the SavedModel format that TensorFlow Serving can load. Note that TensorFlow Serving expects each model version to live in its own numbered subdirectory under the model's base path.

Here is a code snippet demonstrating how to save a TensorFlow model to disk using the tf.saved_model.save function:

import tensorflow as tf

# Define the model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, input_shape=(8,), activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Save the model to a numbered version subdirectory, as TensorFlow Serving expects
tf.saved_model.save(model, 'saved_model/1/')
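
To confirm the export worked, and to see the input and output names the serving signature expects (you will need these when constructing prediction requests), you can reload the SavedModel as a quick sanity check. This step is optional and not part of serving itself:

import tensorflow as tf

# Reload the exported model and inspect its default serving signature.
loaded = tf.saved_model.load('saved_model/1')
signature = loaded.signatures['serving_default']
print(signature.structured_input_signature)  # expected input tensor names and shapes
print(signature.structured_outputs)          # output tensor names and shapes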

Step 2: Download the TensorFlow Serving Image

TensorFlow Serving is distributed as a pre-built Docker image that wraps your saved model with a serving layer, exposing it through a REST API (and a gRPC interface).

You can download the pre-built image using the following command:

docker pull tensorflow/serving

Step 3: Run the TensorFlow Serving Container

Once you have downloaded the TensorFlow Serving image, you can run it using the following command. Make sure to replace the placeholders in angle brackets (<>) with appropriate values.

docker run -p 8501:8501 --name <YOUR_MODEL_NAME>_serving \
  --mount type=bind,source=/<PATH_TO_YOUR_DIRECTORY>/TF-Docker-Serving/<YOUR_MODEL_NAME>/,target=/models/<YOUR_MODEL_NAME> \
  -e MODEL_NAME=<YOUR_MODEL_NAME> -t tensorflow/serving
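
Once the container is running, it can help to check that the model loaded correctly before sending predictions. The sketch below queries TensorFlow Serving's model status endpoint over REST from Python, using the hypothetical model name my_model; a healthy server reports the loaded version with state "AVAILABLE":

import requests

# Query the model status endpoint exposed by TensorFlow Serving's REST API.
resp = requests.get('http://localhost:8501/v1/models/my_model')  # placeholder model name
resp.raise_for_status()
print(resp.json())  # e.g. {"model_version_status": [{"version": "1", "state": "AVAILABLE", ...}]}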

Step 4: Send Prediction Requests

You can send prediction requests to the TensorFlow Serving server using an HTTP REST API or a gRPC interface; the example below uses the REST API with curl.

You will need to provide input data for the model in the request, and the TensorFlow Serving server will use the loaded model to generate a prediction, which will be returned to you in the response. Make sure to replace the data in brackets (<>) with appropriate values.

curl -d '{"instances": [<TEST_DATA>]}' \
    -X POST http://localhost:8501/v1/models/<YOUR_MODEL_NAME>:predict
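
If you prefer to send the same request from Python, here is a minimal sketch using the requests library. It assumes the model from Step 1, which expects 8 input features per instance, and the hypothetical model name my_model; the model name must match the MODEL_NAME value you set in Step 3:

import requests

# One instance with 8 features, matching the Dense(10, input_shape=(8,)) model from Step 1.
payload = {"instances": [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]]}

# POST to the predict endpoint of the REST API (port 8501).
resp = requests.post(
    'http://localhost:8501/v1/models/my_model:predict',  # placeholder model name
    json=payload)
resp.raise_for_status()
print(resp.json()['predictions'])  # e.g. [[0.47]], one sigmoid output per instance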

Deploying TensorFlow Serving instances with Run:ai

Run:ai Atlas allows you to deploy your TensorFlow Serving instances in just a few clicks.

  • Auto-scaling ensures SLAs are met and offers scale-to-zero for cold models
  • Extensive performance metrics give better insights into model performance and scaling

Learn more about Run:ai Atlas and how it simplifies deployment for AI inference workloads