Breaking Static GPU Allocations with Guaranteed Quotas

by
Ronen Dar
March 15, 2020

In a previous article, we discussed the problem of machine scheduling and the complications that arise from inefficient GPU utilization and allocation. In this post, we introduce a way to break static GPU allocations: Guaranteed Quotas.

The teams we work with buy expensive GPUs to run complex computational workflows, yet they have no efficient way to share those GPUs among users.

Early on in the development of Run:AI, my co-founder and I saw that we could combine principles from virtualization and containers with others from the world of high performance computing. By abstracting workloads from the underlying compute, we could pool resources, move and shift workloads, and manage AI infrastructure at scale. Data science teams would be more productive, and IT teams would gain control of and visibility into infrastructure utilization.

But deep learning workloads are not the same as traditional workloads running on enterprise VMs. To build this kind of AI infrastructure, we'd need to support the unique computational characteristics of data science workloads. How are these different? When experimenting, data scientists are mainly engaged in one of two workflows:

  1. Build sessions where data scientists interactively develop and debug their models. These sessions require GPUs that are instantly and always available.
  2. Training sessions for tuning model parameters. These sessions are typically long and require considerable GPU compute power, with performance and training speed being critical.

GPU Utilization

Typically, data scientists build a model while interacting with one or two build sessions, making static allocations a good fit for such workflows. Training sessions, however, are used much more erratically: sometimes data scientists run several training workloads in parallel (for example, during hyperparameter optimization), while at other times they do not run any training jobs. Static GPU allocations don't fit these fluctuating demands, because data scientists either become the exclusive "owners" of too many expensive GPUs or are left in need of more compute power.

Limiting data scientists in their training phase essentially limits the full scope of a deep learning project. Data scientists need the freedom to experiment without being constrained by static GPU allocations. Even simple resource sharing, like using another data scientist's GPUs while they are idle, is not possible with static allocations.

We envisaged a scheduler that implements a concept which we refer to as “Guaranteed Quotas” to solve this challenge. Guaranteed quotas essentially free data scientists from the limitations of static allocations and allow users to go over quota if idle GPUs are available.
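To make the difference concrete, here is a toy comparison of the two policies on a single snapshot of demand. It is a sketch under stated assumptions, not the Run:AI scheduler: the project names, the numbers, and the rule for handing out idle GPUs (one at a time, in name order) are all invented for illustration, and the cluster size is assumed to equal the sum of the quotas.

```python
def static_allocation(demand, quota):
    """Each project may use at most its own quota, even if other GPUs sit idle."""
    return {p: min(demand[p], quota[p]) for p in quota}


def guaranteed_quota_allocation(demand, quota):
    """Serve every project up to its quota first, then hand idle GPUs to
    projects that still want more (assumed rule: one GPU at a time, in name order)."""
    alloc = static_allocation(demand, quota)
    # Assumption: the cluster has exactly as many GPUs as the quotas add up to.
    idle = sum(quota.values()) - sum(alloc.values())
    while idle > 0:
        hungry = sorted(p for p in quota if alloc[p] < demand[p])
        if not hungry:
            break  # nobody wants more; the remaining GPUs stay idle
        for p in hungry:
            if idle == 0:
                break
            alloc[p] += 1
            idle -= 1
    return alloc


# Hypothetical snapshot: an 8-GPU cluster split evenly between two projects.
quota = {"team-a": 4, "team-b": 4}
demand = {"team-a": 7, "team-b": 1}  # team-a is running a hyperparameter sweep

static = static_allocation(demand, quota)
shared = guaranteed_quota_allocation(demand, quota)
print("static:          ", static, "GPUs in use:", sum(static.values()))
print("guaranteed quota:", shared, "GPUs in use:", sum(shared.values()))
```

In this snapshot the static policy leaves three of team-b's GPUs idle while team-a waits, whereas the over-quota policy puts all eight GPUs to work.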

The advantages?

  • Data scientists can run more experiments.
  • They can use multi-GPU training more frequently and with more GPUs.

Ultimately this improves their productivity and the quality and speed of their data science initiatives, while increasing GPU utilization of the overall cluster.

How does a guaranteed quota system work?

Each new job is submitted to the system with a reference to a project. The Run:AI scheduler is built on Kubernetes and is given a set of parameters to consider:

  • The first is project priority. The organization or its IT admin defines projects and assigns a priority to each one.
  • The second is quota, which defines either a static or a guaranteed allocation of GPUs for each project.
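As a rough picture of what an admin would be defining, the two parameters could be modeled like this. This is a hypothetical data model, not Run:AI's configuration format: the `Project` and `QuotaType` names, the "larger number means higher priority" convention, and the `validate` check (quotas must fit in the cluster to be honoured) are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class QuotaType(Enum):
    STATIC = "static"          # hard cap: the project never exceeds its quota
    GUARANTEED = "guaranteed"  # floor: the project may borrow idle GPUs


@dataclass(frozen=True)
class Project:
    name: str
    priority: int          # assumption: a larger number means higher priority
    quota: int             # number of GPUs
    quota_type: QuotaType

    def may_exceed_quota(self) -> bool:
        return self.quota_type is QuotaType.GUARANTEED


def validate(projects, cluster_gpus):
    """Assumed sanity check: quotas can only be honoured if they fit the cluster."""
    total = sum(p.quota for p in projects)
    if total > cluster_gpus:
        raise ValueError(
            f"quotas total {total} GPUs but the cluster has only {cluster_gpus}"
        )


# Hypothetical projects on an 8-GPU cluster.
projects = [
    Project("vision-team", priority=2, quota=4, quota_type=QuotaType.GUARANTEED),
    Project("nlp-team", priority=1, quota=2, quota_type=QuotaType.GUARANTEED),
    Project("demo-server", priority=3, quota=2, quota_type=QuotaType.STATIC),
]
validate(projects, cluster_gpus=8)
```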

Projects translate into queues in our scheduler and, depending on the organization's preference, can model individuals, teams of people, or actual business activities. Projects with a guaranteed quota of GPUs, as opposed to projects with a static allocation, can use more GPUs than their quota specifies. The system therefore allocates available resources to a job submitted to a queue even if that queue is over quota. When a job is submitted to an under-quota queue and there are not enough available resources to launch it, the scheduler gets smarter: it pauses a job from a queue that is over quota, taking priorities and fairness into account.
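The behavior described above can be sketched in a few dozen lines. This is a simplified illustration, not the Run:AI scheduler: the `Queue` and `Scheduler` classes are hypothetical, quotas are assumed to sum to no more than the cluster size, and the choice of which job to pause (lowest-priority over-quota queue first, most recently started job first) is an assumption. Fairness between queues of equal priority is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    name: str
    priority: int            # assumption: larger number = higher priority
    quota: int               # number of GPUs the project is entitled to
    guaranteed: bool = True  # False models a static allocation (hard cap)
    running: dict = field(default_factory=dict)  # job name -> GPUs held

    @property
    def in_use(self):
        return sum(self.running.values())


class Scheduler:
    def __init__(self, total_gpus, queues):
        self.total_gpus = total_gpus
        self.queues = {q.name: q for q in queues}
        self.paused = []  # (queue name, job name, GPUs) of paused jobs

    @property
    def free(self):
        return self.total_gpus - sum(q.in_use for q in self.queues.values())

    def submit(self, queue_name, job, gpus):
        """Try to start a job; return 'running' or 'pending'."""
        q = self.queues[queue_name]
        within_quota = q.in_use + gpus <= q.quota
        if not within_quota and not q.guaranteed:
            return "pending"       # static projects never exceed their quota
        if gpus > self.free and within_quota:
            self._reclaim(gpus - self.free)
        if gpus <= self.free:
            q.running[job] = gpus  # may take a guaranteed queue over quota
            return "running"
        return "pending"

    def _reclaim(self, needed):
        """Pause jobs from over-quota queues, lowest priority first."""
        for victim in sorted(self.queues.values(), key=lambda v: v.priority):
            while needed > 0 and victim.in_use > victim.quota:
                job, held = victim.running.popitem()  # most recently started
                self.paused.append((victim.name, job, held))
                needed -= held


# Hypothetical 4-GPU cluster shared by two projects with 2 GPUs each.
sched = Scheduler(total_gpus=4, queues=[
    Queue("team-a", priority=1, quota=2),
    Queue("team-b", priority=1, quota=2),
])
results = [
    sched.submit("team-a", "sweep-1", 2),  # within quota
    sched.submit("team-a", "sweep-2", 2),  # over quota, but GPUs are idle
    sched.submit("team-b", "train-1", 2),  # under quota: an over-quota job is paused
]
print(results)
print("paused:", sched.paused)
```

Here team-a borrows team-b's idle GPUs for a second sweep, and gives them back (by having that sweep paused) as soon as team-b claims its guaranteed share. A real scheduler would also resume paused jobs once GPUs free up, which this sketch leaves out.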

Guaranteed quotas essentially break the boundaries of fixed allocations and make data scientists more productive, freeing them from limits on the number of concurrent experiments they can run and on the number of GPUs they can use for multi-GPU training sessions. This, in turn, greatly increases utilization of the overall GPU cluster.