Keras is a deep learning API you can use to perform fast distributed training with multiple GPUs. Distributed training with GPUs enables you to perform training tasks in parallel, distributing your model training workload over multiple resources. You can do that via model parallelism or via data parallelism. This article explains how Keras multi-GPU training works and examines tips for managing the limitations of multi-GPU training with Keras.
If you are working with other deep learning frameworks, check out our articles about PyTorch multi GPU and TensorFlow multiple GPU.
In this article, you will learn:
- What Is Keras?
- What is Distributed Training with GPUs?
- Keras Multi GPU and Distributed Training
- Tutorial: How to Set Up Keras on Multiple GPUs
- Tips for Managing the Limitations of Multi GPU Training with Keras
- Keras Multi GPU With Run:AI
What Is Keras?
Keras is a deep learning API that runs on top of the TensorFlow platform. It was designed to allow fast experimentation and easy model building, including on multiple graphical processing units (GPUs). Keras is broadly supported, and older multi-backend versions could also run on backends such as CNTK, Theano, MXNet and PlaidML.
What is Distributed Training with GPUs?
Keras enables you to distribute your model training tasks over multiple resources, performing training tasks in parallel. Distributed training is an essential part of deep learning. It enables you to leverage multiple CPUs or GPUs and drastically reduces the amount of time needed to train models.
When using distributed training, there are two implementation methods you can choose from—model parallelism and data parallelism. These implementations can be used individually or in combination, depending on your model requirements.
Model parallelism
Model parallelism segments your model into parts that can then be run in parallel. Parts are trained individually and the results of each part are rejoined with the whole.
This method enables you to run each segment on a different resource using the same data. This limits the amount of communication that is needed between workers to only that required for synchronization of shared parameters. You can also use this method with multiple GPUs in a single server.
Data parallelism
Data parallelism segments your training data into parts that can be run in parallel. Using copies of your model, you run each subset on a different resource. This is the most commonly used type of distributed training.
This method requires that you synchronize model parameters during subset training. If you do not, your prediction errors will not align between subsets. Because of this, data parallelism implementations require communications between workers so changes can be synced.
Keras Multi GPU and Distributed Training
First we should note that distributed training, as it is called in the Keras framework, may refer to two types of scalability:
- Single worker—distributed training across multiple GPUs in the same physical server
- Multi worker—distributed training across multiple GPUs on multiple physical servers
The discussion below refers to distributed training with single worker scalability – distributing workloads across multiple servers is more complex and is beyond the scope of this article. Below you can learn about Run:AI, which can help you automatically distribute workloads on any number of physical machines.
When using multiple GPUs in Keras, there are a few aspects that are helpful to know to get you started. The following section covers the basics of Keras multi-GPU training and provides some tips you can apply to improve your performance.
How it works
Keras offers several workload distribution strategies, including tf.distribute.Strategy, tf.distribute.MirroredStrategy, and tf.distribute.experimental.TPUStrategy.
Below we describe how to work with MirroredStrategy, which lets you perform synchronous distributed training on multiple GPUs on a single machine.
When using multi-GPU training, you run your model through the same series of steps for each segment. Below is an overview of the steps that are performed when using data parallelism with MirroredStrategy:
- A global batch is split into local batches, with the data distributed evenly among them.
- One model replica is created per GPU, and each replica processes its assigned local batch. This involves a forward pass and a backward pass, producing the gradient of the weights with respect to the loss of the model on that local batch.
- The gradients are then merged (all-reduced) across the replicas and applied to each copy of the weights, ensuring the replicas stay in sync.
How to use it
Performing this sort of multi-GPU training with Keras requires the tf.distribute.MirroredStrategy API. Using this API, you must:
- Instantiate a MirroredStrategy. During this process you have the option of configuring specific devices or using the default, which uses all available GPUs.
- With your strategy object, open a scope and create any Keras objects and variables needed. Generally, this requires you to create and compile your model within the distribution scope.
- Train your model using fit() as usual, as in the sketch below. It is recommended to use tf.data.Dataset objects to load your data.
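A minimal sketch of these steps is shown below (the two-layer model and the randomly generated data are purely illustrative assumptions):
import numpy as np
import tensorflow as tf

# 1. Instantiate a MirroredStrategy; by default it uses all visible GPUs.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# 2. Create and compile the model within the distribution scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')

# 3. Load the data as a tf.data.Dataset and train with fit() as usual.
x = np.random.random((1024, 20)).astype('float32')
y = np.random.randint(2, size=(1024, 1)).astype('float32')
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(64)
model.fit(dataset, epochs=2)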
Using Keras callbacks to ensure fault tolerance
Fault tolerance is very important in distributed training since there are more operations that can experience errors. Having a strategy to recover in the event of failures can help ensure the accuracy of your model and prevent time spent redoing computations.
With Keras, the easiest way to build in fault tolerance is to pass a ModelCheckpoint callback to fit(). This method allows you to save your model at regular intervals; you can then use these savepoints to restore your model if something goes wrong.
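Continuing the sketch above, the callback can be added to fit() like this (the checkpoint filename pattern is just an example):
from tensorflow.keras.callbacks import ModelCheckpoint

# Save the model at the end of every epoch; reload the latest file if training is interrupted.
checkpoint = ModelCheckpoint(filepath='checkpoint-{epoch:02d}.keras', verbose=1)
model.fit(dataset, epochs=20, callbacks=[checkpoint])
A fuller, runnable version of this pattern appears in the tutorial section further below.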
tf.data Performance Tips
As previously mentioned, loading your data with a tf.data pipeline is the recommended method. When using this method, there are also a few tips that can help you increase your efficiency.
Calling dataset.cache()
Calling .cache() on a dataset enables you to cache data after the first iteration over it. Each subsequent iteration can then use this cache, eliminating loading time. Caching can be a valuable time saver when your data remains the same between iterations. It is also useful if you are reading data from a remote filesystem or your workflow is IO-bound.
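For example (the in-memory dataset here is just a placeholder; .cache() also accepts a file path for datasets that do not fit in memory):
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.random.uniform([1024, 20]))
# The first epoch reads from the source; later epochs read from the in-memory cache.
dataset = dataset.cache()
dataset = dataset.batch(64)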
Calling dataset.prefetch(buffer_size)
Calling .prefetch(buffer_size) enables you to prepare the data for the next step while the current one is still running. This lets your input pipeline work asynchronously, preprocessing new samples while your model trains on the current batch. Prefetching reduces the amount of time your resources sit idle and lets training move on to the next iteration as soon as the current one finishes.
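A typical pattern looks like this (the random dataset is just a placeholder; tf.data.AUTOTUNE lets TensorFlow choose the buffer size):
import tensorflow as tf

# Build a small placeholder dataset
dataset = tf.data.Dataset.from_tensor_slices(tf.random.uniform([1024, 20]))
dataset = dataset.batch(64)
# Overlap preparation of upcoming batches with training on the current one
dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)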
Tutorial: How to Set Up Keras on Multiple GPUs
Setting up Keras to run on multiple GPUs can significantly enhance the speed and efficiency of deep learning models. Below is a step-by-step guide on how to achieve this:
Step 1: Install NVIDIA CUDA and cuDNN
First, ensure that NVIDIA CUDA and cuDNN are installed on your system. These are necessary for GPU-accelerated computing.
You can verify that the NVIDIA driver is installed and that your GPUs are detected with the following command:
nvidia-smi
Step 2: Check for Available GPUs
Verify that multiple GPUs are available on your system. Use the following code snippet to check for GPU availability:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
Step 3: Set the CUDA_VISIBLE_DEVICES Environment Variable
Control which GPUs Keras can use by setting the CUDA_VISIBLE_DEVICES environment variable. This variable specifies which GPUs should be visible to the program; set it before TensorFlow initializes the GPUs.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3" # Replace with the IDs of your available GPUs
Step 4: Create and Parallelize Your Keras Model
Define your Keras model as usual. Then, use MirroredStrategy to parallelize your model across the specified GPUs.
from keras.models import Sequential
from keras.layers import Dense
from keras.datasets import mnist
from keras.utils import to_categorical
import tensorflow as tf

# Create a MirroredStrategy.
strategy = tf.distribute.MirroredStrategy()
num_gpus = 4  # Replace with the number of available GPUs

# Load a sample dataset
(x_train, y_train), (x_val, y_val) = mnist.load_data()

# Preprocess the data
x_train = x_train.reshape(-1, 784).astype('float32') / 255
x_val = x_val.reshape(-1, 784).astype('float32') / 255

# Convert class vectors to binary class matrices (one-hot encoding)
y_train = to_categorical(y_train, 10)
y_val = to_categorical(y_val, 10)

with strategy.scope():
    # Define your Keras model
    model = Sequential()
    model.add(Dense(64, input_dim=784, activation='relu'))
    model.add(Dense(10, activation='softmax'))

    # Compile the model
    model.compile(loss='categorical_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])

# Train the model on your data
model.fit(x_train, y_train,
          epochs=20,
          batch_size=128 * num_gpus,
          validation_data=(x_val, y_val))
Note: when created without an explicit device list, tf.distribute.MirroredStrategy automatically detects and uses all available GPUs.
Store the above code in a file called p3.py. You can execute it using the following command:
python3 p3.py
While the above code is executing, you can run the nvidia-smi command to check how the GPUs and GPU memory are being utilized, as shown below:
nvidia-smi
Data Parallelism
Data parallelism involves splitting the training data into segments that are processed in parallel. Each GPU processes a portion of the data, and the gradients are then aggregated to update the model weights.
import tensorflow as tf
import numpy as np

# Define the strategy for data parallelism across three GPUs
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1", "/gpu:2"])

# Set the batch size and other parameters
batch_size = 64
epochs = 10

# Load your training data (random placeholder data is used here)
x_train = np.random.random((6400, 20)).astype('float32')
y_train = np.random.randint(2, size=(6400, 1)).astype('float32')
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size)

# Define your model (a small illustrative model)
def build_model():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

# Create the model inside the strategy scope so it is replicated on each GPU
with strategy.scope():
    model = build_model()

# Fit the model; each global batch is split across the GPU replicas
model.fit(train_dataset, epochs=epochs)
Model Parallelism
Model parallelism splits the model into segments, each processed by a different GPU. This is useful for models with many layers or high memory requirements. Note that MirroredStrategy implements data parallelism rather than model parallelism; Keras does not ship a turnkey strategy for single-machine model parallelism, so a simple manual approach is to place individual layers on specific GPUs with tf.device, as in the sketch below (the layer sizes are illustrative).
import tensorflow as tf

# Manual model parallelism: place different layers on different GPUs
inputs = tf.keras.Input(shape=(784,))
with tf.device('/gpu:0'):
    x = tf.keras.layers.Dense(256, activation='relu')(inputs)
with tf.device('/gpu:1'):
    outputs = tf.keras.layers.Dense(10, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs)
model.compile(loss='categorical_crossentropy', optimizer='sgd')
model.fit(train_dataset, epochs=num_epochs)
Using Keras Callbacks for Fault Tolerance
Fault tolerance is crucial in distributed training. Using a Keras callback like ModelCheckpoint allows you to save your model at regular intervals and restore it in case of failure.
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.callbacks import ModelCheckpoint

# Set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Define MirroredStrategy
strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices: {strategy.num_replicas_in_sync}")

# Generate some dummy data
x_train = np.random.random((10000, 20))
y_train = np.random.randint(2, size=(10000, 1))
x_val = np.random.random((2000, 20))
y_val = np.random.randint(2, size=(2000, 1))

# Define a simple model within the strategy scope
with strategy.scope():
    model = Sequential([
        Input(shape=(20,)),  # Define the input shape here
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(1, activation='sigmoid')
    ])

    # Compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Define the ModelCheckpoint callback
checkpoint = ModelCheckpoint(filepath='model.{epoch:02d}-{val_loss:.2f}.keras',
                             save_best_only=True,
                             verbose=1)

# Train the model
model.fit(x_train, y_train,
          epochs=20,
          batch_size=128 * strategy.num_replicas_in_sync,
          validation_data=(x_val, y_val),
          callbacks=[checkpoint])
Let’s store the above code in a file called p6.py and execute it using the following command:
python3 p6.py
Tips for Managing the Limitations of Multi GPU Training with Keras
When using Keras, there are advantages and limitations to your ability to perform multi-GPU training. Below are a few limitations to be aware of and how to handle them.
Keras Multi GPU training is not automatic
Using single-GPU configurations with Keras and TensorFlow is straightforward. Provided you are using NVIDIA GPUs and have the CUDA libraries installed, use of the GPU is automatic. However, this isn’t the case for scenarios with multiple GPUs.
In older versions of Keras, multi-GPU training used the multi_gpu_model utility, which copied your model across GPUs and automatically split your input between them for aggregation later. This utility has been deprecated and removed in recent releases in favor of tf.distribute.MirroredStrategy, described above. Either way, keep in mind that multi-GPU training does not scale linearly with the number of GPUs, due to the synchronization required.
Saving your parallel models
Once your training is finished, you may want to persist your trained weights. With the legacy multi_gpu_model API, you cannot simply call save() on the parallel model, because Keras did not support saving the parallel wrapper directly. To get around this, you can either call save() on the original (template) model reference, whose weights are kept in sync automatically, or serialize the weights yourself and load them back into a fresh copy of the model. Models trained with tf.distribute.MirroredStrategy do not have this limitation and can be saved with save() as usual.
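With the legacy API, the recommended pattern looked roughly like this (a sketch for illustration only; the utility has been removed from recent releases, and build_model() and the training data are assumed to be defined elsewhere):
from keras.utils import multi_gpu_model  # legacy Keras 2.x API, removed in recent releases

original_model = build_model()  # the template model
parallel_model = multi_gpu_model(original_model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
parallel_model.fit(x_train, y_train, epochs=20, batch_size=512)

# Save the original model reference; its weights are shared with the parallel copy.
original_model.save('model.h5')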
GPU data bottlenecks
Often, preprocessing calculations are the most expensive aspect of training deep learning models. These calculations require data to be preprocessed in your CPUs and then fed to the GPUs. This goes smoothly as long as preprocessing is relatively simple and data isn’t bottlenecked in the CPU. If it is, your GPUs are left sitting idle while waiting for data to process.
While Keras can perform your preprocessing calculations in parallel, this is bottlenecked by Python’s Global Interpreter Lock (GIL), which prevents true multithreading. The easiest way to manage this is to simplify your preprocessing as much as possible.
You can typically do this using standard generators. However, if you need to use custom generators, try to offload some of the work to other libraries, like NumPy. These libraries can release the GIL and enable you to access a greater degree of parallelism.
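For instance, a custom generator that does its heavy preprocessing with vectorized NumPy operations rather than per-sample Python loops might look like this sketch (the normalization step is purely illustrative):
import numpy as np

def batch_generator(x, y, batch_size=128):
    # Yield preprocessed batches using vectorized NumPy operations instead of Python loops
    num_samples = x.shape[0]
    while True:
        idx = np.random.randint(0, num_samples, size=batch_size)
        batch_x = x[idx].astype('float32') / 255.0  # vectorized normalization
        batch_y = y[idx]
        yield batch_x, batch_y
You can then pass the generator to fit() along with an appropriate steps_per_epoch value.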
Keras Multi GPU With Run:AI
Run:AI automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed in Keras and other deep learning frameworks.
Here are some of the capabilities you gain when using Run:AI:
- Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
- No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
- A higher level of control—Run:AI enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.
Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:AI GPU virtualization platform.