Understanding Slurm GPU Management

What is Slurm and How Does it Work with GPUs?

Slurm is an open-source workload and resource manager. To extend the functionality of Slurm, you can use plugins that offer diverse job types, workflows, and policies. Plugins can add a wide range of features, including resource limit management and accounting, as well as support for advanced scheduling algorithms.

Slurm is used for workload management on six of the ten most powerful computing systems in the world, including Tianhe-2 with 3,120,000 computing cores, and Piz Daint, a system that utilizes over 5,000 NVIDIA GPGPUs.

Slurm supports the use of GPUs via the concept of Generic Resources (GRES)—these are computing resources associated with a Slurm node, which can be used to perform jobs. Slurm provides GRES plugins for many types of GPUs.

Here are several notable features of Slurm:

  • Scales to tens of thousands of GPGPUs and millions of cores.
  • Offers military-grade security and no single point of failure.
  • Supports a heterogeneous configuration that allows users to leverage GPGPUs.
  • Provides topology-aware job scheduling, which can enable maximum system utilization.
  • Enables advanced reservations, backfill, suspend/resume, fair-share, and preemptive scheduling for critical jobs.

Related content: Read our guide to Slurm for machine learning

What are Slurm Generic Resources (GRES)?

Generic Resources (GRES) are computing resources available on a specific Slurm node, which can be allocated to jobs. The most common use of GRES is to make GPUs on Slurm nodes available for use by Slurm jobs.

GRES are managed via the following Slurm components:

  • slurmctld—the centralized manager that monitors resources and jobs. It holds a data structure called gres_list, which provides information about GRES types.
  • slurmd daemon—deployed on every node, responsible for receiving jobs and executing work. It holds a data structure called gres, which specifies a GRES available on a node that can be used by a Slurm job or step.
  • Plugins—a GRES can optionally define a plugin to support specific device features.

Data Structures

Every node that needs to expose a Generic Resource to Slurm jobs must have a string variable called gres. Its value typically looks like this, indicating the number of GPUs and network interfaces on the machine: gpu:2,nic:1

The slurmctld manager maintains a gres_list that has the following characteristics:

  • Each list element provides information about a GRES type (e.g. gpu and nic).
  • List elements will have a different structure for nodes, Slurm jobs, and steps, and so there are different functions to access GRES types for each of these.
  • If a node, job or step does not have any associated GRES, the item in gres_list will be NULL.

Mode of Operation

Here is how Slurm nodes declare their available resources, and how those resources are utilized by jobs:

  1. The slurmd daemon on a Slurm node reads its configuration.
  2. slurmd calls the function node_config_load() for each GRES plugin, and verifies that the required devices actually exist on the node. If there is no plugin, it proceeds based on the configuration file and assumes the devices exist and are working.
  3. slurmd reports the node’s GRES information to the slurmctld daemon when it registers the node with the cluster.
  4. The slurmctld daemon keeps a record of GRES information for all registered nodes, including the number of available resources (for example, the number of GPUs), and the location of each node in a job allocation sequence.
  5. When a job or step starts, it specifies the GRES allocated to it. The functions job_set_env() and step_set_env() set environment variables that direct the job running on Slurm to the relevant GRES, as shown in the example after this list. This method is compatible with CUDA, as long as CUDA uses its default environment variables.
  6. slurmctld then allocates jobs to nodes based on their available GRES and their sequence in the job or step.
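
For example, on a node that exposes GPUs as GRES, a minimal check of what job_set_env()/step_set_env() export might look like this (assuming the cluster's default GPU partition and Slurm's standard CUDA integration):

# Request one GPU for a single task and print the variable Slurm exports
srun --gres=gpu:1 bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES'
# Typical output when the first GPU on the node is allocated:
# CUDA_VISIBLE_DEVICES=0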

Managing GPUs in Slurm

The main Slurm cluster configuration file, slurm.conf, must explicitly specify which GRES are available in the cluster. Here is an example slurm.conf fragment that configures eight nodes (tux0 through tux7), each with four GPUs (two Tesla and two Kepler), Multi-Process Service (MPS), and 4G of Lustre bandwidth defined as a non-consumable resource.

GresTypes=gpu,mps,bandwidth
NodeName=tux[0-7]
Gres=gpu:tesla:2,gpu:kepler:2,mps:400,bandwidth:lustre:no_consume:4G

In addition, Slurm nodes that need to expose GRES to jobs should have a gres.conf file. This file describes which types of Generic Resources are available on the node, how many of them, and which files and processing cores should be used by each GRES.
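
For the tux nodes defined in the slurm.conf example above, a matching gres.conf might look roughly like the following sketch. The device files and core bindings are illustrative and depend on the actual hardware, and the exact keywords can vary between Slurm releases:

# gres.conf on each tux node (illustrative device paths and core bindings)
Name=gpu Type=tesla  File=/dev/nvidia0 Cores=0,1
Name=gpu Type=tesla  File=/dev/nvidia1 Cores=2,3
Name=gpu Type=kepler File=/dev/nvidia2 Cores=4,5
Name=gpu Type=kepler File=/dev/nvidia3 Cores=6,7
# MPS count for the whole node; without per-File entries it is spread evenly across the GPUs
Name=mps Count=400
# Bandwidth is tracked as a count only, with no device file behind it
Name=bandwidth Type=lustre Count=4G Flags=CountOnly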

Running Jobs with GPUs

To use a GPU in a Slurm job, you need to explicitly specify this when running the job using the --gres or --gpus flag. The following flags are available:

  • --gres specifies the generic resources required per node
  • --gpus specifies the number of GPUs required for the entire job
  • --gpus-per-node same as --gres, but specific to GPUs
  • --gpus-per-socket specifies how many GPUs are required per socket (this requires that the job also specifies a socket count)
  • --gpus-per-task specifies how many GPUs are required for each task (this requires that the job specifies a number of tasks)
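
Here are a few illustrative ways these flags might be combined on the command line (the script names are placeholders):

# Two Tesla GPUs on every allocated node, via the generic GRES syntax
sbatch --gres=gpu:tesla:2 train.sh

# Four GPUs in total for the whole job, spread across two nodes
sbatch --nodes=2 --gpus=4 train.sh

# Eight tasks, each bound to its own GPU
srun --ntasks=8 --gpus-per-task=1 ./worker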

Working with CUDA Environment Variables

Slurm uses environment variables to interact with CUDA on nodes that provide GPU resources. There are two main environment variables it uses:

  • CUDA_VISIBLE_DEVICES—lets Slurm control which of a node's GPUs are visible to a given job or step. In CUDA 3.1 and higher, this can be used to run multiple jobs or steps on a node, ensuring unique resources are allocated to each. Note that this variable may differ between a job (which is constrained to a specific cgroup in the Linux kernel) and a Prolog or Epilog program (which runs outside a cgroup).
  • CUDA_DEVICE_ORDER—Slurm gathers data about a node's GPUs using the NVIDIA Management Library (NVML), which identifies GPUs by their PCI bus ID. For CUDA's device numbering to match, you need to set the CUDA_DEVICE_ORDER environment variable to PCI_BUS_ID.

Because GPU detection is based on environment variables and is not foolproof, make sure to check nodes after bootup to ensure that GPU devices are assigned to the relevant device files.
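
A quick sanity check after bootup, assuming the NVIDIA driver and nvidia-smi are installed on the node, is to compare the device files referenced in gres.conf with what the driver actually reports:

# List the GPU device files that gres.conf points at
ls -l /dev/nvidia[0-9]*

# Ask the driver for each GPU's index and PCI bus ID (NVML ordering)
nvidia-smi --query-gpu=index,pci.bus_id --format=csv

# Make CUDA's device numbering match NVML's PCI bus ordering
export CUDA_DEVICE_ORDER=PCI_BUS_ID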

MPS Management

CUDA provides Multi-Process Service (MPS), a system that allows GPUs to be shared by multiple jobs. Each job receives a fraction of the GPU’s computing resources.

If the gres.conf file does not map MPS resources to individual GPUs, the MPS count defined in the slurm.conf file is distributed equally across all GPUs on the node.

To specify the fraction of a GPU's resources available as MPS, set three parameters in the MPS entries of gres.conf: Name, File, and Count. Note that jobs requesting MPS-configured resources may only use one GPU per node.
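
As a sketch, with the mps:400 count from the slurm.conf example spread evenly across four GPUs (100 per GPU), a job could request roughly half of one GPU's compute capacity like this (the script name is a placeholder):

# Request 50 of the 100 MPS shares available on a single GPU
sbatch --gres=mps:50 inference.sh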

GPU Scheduling with Run:AI

Run:AI is a Slurm alternative, based on Kubernetes, which automates resource management and orchestration for AI workloads on distributed GPU infrastructure. With Run:AI, you can automatically run as many compute-intensive workloads as needed on GPUs in your AI and HPC infrastructure.

Here are some of the capabilities you gain when using Run:AI:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of resources, to avoid bottlenecks and optimize billing in cloud environments.
  • A higher level of control—Run:AI enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:AI accelerates deep learning and other compute-intensive workloads by helping teams optimize expensive compute resources.

Learn more about the Run:AI Kubernetes Scheduler.