Inference & Training

Gradient Accumulation

by Fara Hain – January 23, 2020

Today, Run:AI published our own gradient accumulation mechanism for Keras. It is a generic implementation that can wrap any Keras optimizer, whether built-in or custom, and enables gradient accumulation by adding a single line to your code, with no other modifications required.

We published three blog posts that explain the concept and show how to use the code.

We hope the tool will help both veteran data science teams and beginners train on large batch sizes even when GPU memory is limited, improving both performance and accuracy of models.

When building a deep learning model, one of the critical hyperparameters that data scientists consider is how many training examples (e.g. images) the neural network should process in each training iteration: the deep learning batch size. However, the batch size is often limited by the available memory of the GPUs running the model. Deep learning models themselves are becoming bigger and more complex, taking up more GPU memory and further reducing the maximum possible batch size, and with it the achievable accuracy.

One solution to this problem is gradient accumulation. The idea is to split the batch into smaller mini-batches that are run sequentially, while accumulating the resulting gradients. The accumulated gradients are used to update the model parameters only at the end of the last mini-batch. Gradient accumulation is a particularly good option when only a single GPU is available, because the mini-batches can be run sequentially on that one resource.
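The sketch below shows the idea in plain TensorFlow code. It is a minimal illustration of the technique, not Run:AI's published implementation, and the model, batch sizes, and function names are arbitrary choices for the example. Gradients from each mini-batch are summed into accumulator variables, and the optimizer is stepped once per effective batch.

```python
import tensorflow as tf

ACCUM_STEPS = 8        # mini-batches accumulated per weight update
                       # (effective batch = ACCUM_STEPS * mini-batch size)

# A small example model; any Keras model would work the same way.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
optimizer = tf.keras.optimizers.Adam(1e-3)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# One non-trainable accumulator per trainable weight, initialized to zero.
accumulators = [tf.Variable(tf.zeros_like(v), trainable=False)
                for v in model.trainable_variables]

def train_on_effective_batch(mini_batches):
    """Run ACCUM_STEPS (x, y) mini-batches, then apply a single update."""
    for x, y in mini_batches:
        with tf.GradientTape() as tape:
            logits = model(x, training=True)
            # Scale so the accumulated sum equals the mean loss over the
            # whole effective batch.
            loss = loss_fn(y, logits) / ACCUM_STEPS
        grads = tape.gradient(loss, model.trainable_variables)
        for acc, g in zip(accumulators, grads):
            acc.assign_add(g)
    # The parameters are updated only once, at the end of the last mini-batch.
    optimizer.apply_gradients(zip(accumulators, model.trainable_variables))
    # Reset the accumulators for the next effective batch.
    for acc in accumulators:
        acc.assign(tf.zeros_like(acc))
```

Calling `train_on_effective_batch` with eight mini-batches of 32 samples produces the same parameter update as a single step on a batch of 256, while only 32 samples ever reside in GPU memory at once.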

Although the concept is simple, the mathematics and code required to implement gradient accumulation can be complicated.
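To make the underlying arithmetic explicit (the notation here is ours, added for clarity): splitting a batch of N examples into K equal mini-batches and averaging their gradients recovers exactly the full-batch gradient, which is why a single accumulated update matches a genuine large-batch step.

```latex
% Batch of N examples split into K equal mini-batches B_1, ..., B_K.
% Each mini-batch loss L_k is the mean per-example loss \ell_i over B_k.
L_k(\theta) = \frac{K}{N} \sum_{i \in B_k} \ell_i(\theta)
\quad\Longrightarrow\quad
\nabla_\theta L_{\text{batch}}(\theta)
  = \nabla_\theta \frac{1}{N} \sum_{i=1}^{N} \ell_i(\theta)
  = \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta L_k(\theta).
```

The identity assumes the loss is a per-example average; in the code sketch above, this is why each mini-batch loss is divided by the number of accumulation steps.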

Run:AI developer lead Raz Haleva, CTO Dr. Ronen Dar, and Micha Anholt worked on the tool and the accompanying posts.

We hope you’ll try it out and let us know what you think!