
Challenges in GPU Machine Scheduling for AI and ML Workloads

by
Ronen Dar
December 19, 2019

Like a huge safari animal swatting all day at a pesky fly, sophisticated projects are often hindered by simple things. With the rapid growth of deep learning in the enterprise, one issue comes up again and again with IT leaders, MLOps teams, and engineers running deep learning projects: GPU machine scheduling. Enterprises can’t effectively manage their GPUs. They buy expensive GPUs to run complex computational workloads, but they have no efficient way to share them between users.

Most of the companies that we at Run:AI talk to start small with their on-premises investments in GPUs. They have a few data scientists and they buy them a few GPUs. The allocation is often done physically, such that each data scientist receives a workstation with one or two GPUs that belong to them exclusively. Typically, after a few months, the IT department realizes that this setup does not scale, as the data scientists quickly need more compute power.

Some companies try to solve the GPU sharing problem by buying GPU servers and allocating them to a single team of data scientists. The team members try to share the resources among themselves, but typically end up fighting over GPUs and “managing” allocations manually, with tools like Excel sheets or Slack channels…

So, it’s not really an efficient pool in any way and certainly not one that is scalable.

One of the reasons GPU resources are hard to share lies in the way data scientists consume computing power: in some weeks they run many short experiments in parallel, in others they run a small number of long experiments, and in others they run none at all. It is therefore inefficient for data scientists to be the exclusive “owners” of GPUs, because at any given time they either own too many resources or are short on compute power.

We at Run:AI started looking for a better way to optimize GPU sharing and found that the two related alternatives for job orchestration, Kubernetes and High-Performance Computing (HPC) schedulers, both fall short. While Kubernetes fits into the ecosystem of IT organizations, its default scheduler was built primarily for services, not for batch workloads such as data science experiments. For example, it has no efficient queuing mechanism that supports guaranteed quotas with multiple queues and priorities, or advanced automatic job preemption and resume. It also lacks a simple and efficient way to launch and manage multi-node distributed training, as well as a simple way to express advanced placement options and affinity configurations such as CPU-GPU affinity, GPU-GPU affinity and more (all of this will be covered in a different blog post).

The second alternative for GPU machine scheduling is to use tools from the HPC world, like Slurm. That approach has its own challenges and inefficiencies. For a start, these tools were built for HPC experts and are highly complex. Using them is cumbersome and involves managing APIs with dozens of parameters and flags. Their configuration and ongoing management are also highly complicated, typically involving reading lengthy documentation and experimenting with multiple parameters. There’s an in-depth comparison of Kubernetes vs Slurm here.

At Run:AI we worked on the problem. We built an AI-dedicated batch scheduler for Kubernetes that enables crucial features for the management of AI/ML workloads like advanced queuing and quotas, managing priorities and policies, automatic pause/resume, multi-node training, and more. In subsequent articles we’ll share more details about some of these features.
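To make the idea of quotas with automatic pausing more concrete, here is a toy sketch in Python. It is not Run:AI's scheduler and does not use any Kubernetes API; the `Job` and `Scheduler` classes, the team names, and the policy itself are assumptions for illustration. The assumed policy is: each team is guaranteed a number of GPUs, a team may borrow idle GPUs beyond its quota, and borrowed GPUs are reclaimed by pausing over-quota jobs when another team asks for its guaranteed share. The sketch also assumes the quotas sum to no more than the total number of GPUs, and it leaves out resuming paused jobs.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    team: str
    gpus: int
    over_quota: bool = False  # True if started on borrowed (idle) GPUs


@dataclass
class Scheduler:
    total_gpus: int
    quotas: dict  # team -> guaranteed GPUs (assumed to sum to <= total_gpus)
    running: list = field(default_factory=list)
    paused: list = field(default_factory=list)

    def used(self, team=None):
        """GPUs in use, overall or by one team."""
        return sum(j.gpus for j in self.running
                   if team is None or j.team == team)

    def submit(self, job):
        """Try to start a job; return True if it is now running."""
        within_quota = self.used(job.team) + job.gpus <= self.quotas[job.team]
        free = self.total_gpus - self.used()

        if within_quota and free < job.gpus:
            # The team is entitled to these GPUs: pause other teams'
            # over-quota jobs until enough GPUs are free.
            victims = [j for j in self.running
                       if j.over_quota and j.team != job.team]
            for victim in victims:
                self.running.remove(victim)
                self.paused.append(victim)
                free += victim.gpus
                if free >= job.gpus:
                    break

        if free >= job.gpus:
            job.over_quota = not within_quota
            self.running.append(job)
            return True

        self.paused.append(job)  # no room: wait in the queue
        return False


# Usage: two hypothetical teams share 4 GPUs, each guaranteed 2.
s = Scheduler(total_gpus=4, quotas={"vision": 2, "nlp": 2})
s.submit(Job("v1", "vision", 4))  # vision borrows idle GPUs (over quota)
s.submit(Job("n1", "nlp", 2))     # nlp claims its guaranteed GPUs; v1 is paused

print("running:", [j.name for j in s.running])
print("paused: ", [j.name for j in s.paused])
```

The design choice worth noting is that idle GPUs are never left unused just because they "belong" to someone else, while the guarantee still holds because over-quota work can be paused on demand.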

We are also tackling additional challenges of GPU utilization and optimization, many of which are much more complex. But an elegant solution for scheduling, using a system of guaranteed quotas, seemed like a good place to start.

Coming soon…

Machine Scheduling for Data Science part 2 – How can guaranteed quotas improve the management of Data Science workloads?