Infrastructure

Why Run:AI - Pool GPU Resources to Simplify Workflows

by Run:ai Team – January 13, 2022


Run:AI pools heterogeneous GPU resources so they can be used within two logical environments, build and train, natively supporting data scientists’ differing compute needs and increasing utilization. The GPU virtualization pool lives inside a Kubernetes cluster, and both environments submit their build and training workloads through the Run:AI scheduler; a minimal submission sketch appears after the list below.

  • Build environment – dedicated to building models interactively, typically in Jupyter notebooks or PyCharm, or simply by SSH-ing into a container. Performance is usually less critical in the build phase, so build workloads can run on workstations or low-end servers.
  • Training environment – dedicated to long-running training workloads. Because performance matters here, these workloads should run on high-end GPU servers. Training containers can be supplemented with a checkpointing mechanism that allows automatic preemption and resumption without losing training state; a sketch of this pattern follows below.

Run:AI creates a virtual pool of GPUs that can easily be shared among all users. With Run:AI, users can even go over their guaranteed quota and use more GPUs than they were assigned when idle capacity is available in the pool.
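To make the scheduler interaction concrete, here is a minimal sketch of submitting a one-GPU training job to the pool with the official Kubernetes Python client. The scheduler name runai-scheduler, the project label, the team-a namespace, and the image and training script are assumptions standing in for a typical Run:AI deployment; check your cluster’s configuration for the actual values.

```python
# Minimal sketch: submit a one-GPU training Job through the Run:AI scheduler.
# ASSUMPTIONS: the scheduler name "runai-scheduler", the "project" label, and
# the "team-a" namespace reflect a typical Run:AI setup; yours may differ.
from kubernetes import client, config

config.load_kube_config()  # use the current kubeconfig context

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(
        name="resnet-train",
        labels={"project": "team-a"},  # hypothetical project name
    ),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                scheduler_name="runai-scheduler",  # route the pod to the Run:AI scheduler
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="pytorch/pytorch:latest",  # placeholder image
                        command=["python", "train.py"],  # placeholder script
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "1"},  # one GPU from the pool
                        ),
                    )
                ],
            )
        )
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="team-a", body=job)
```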
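And here is a minimal sketch, in plain PyTorch, of the checkpoint-and-resume pattern that makes a training container safe to preempt. The checkpoint path, model, and data are placeholders; Run:AI itself only preempts and restarts the container, so any mechanism that persists and restores training state will do.

```python
# Minimal sketch: checkpoint every epoch so a preempted job can resume.
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"  # in practice, a path on a shared volume

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_epoch = 0

# Resume from the last checkpoint if one exists (e.g. after preemption).
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, 100):
    inputs = torch.randn(32, 10)  # stand-in for a real data loader
    loss = model(inputs).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Persist the full training state so a restarted job picks up here.
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CKPT_PATH,
    )
```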

By pooling the resources and managing them with the Run:AI scheduler, administrators gain control: they can easily onboard new users, maintain the pool and add new hardware to it, and gain visibility, including a holistic view of GPU usage and utilization.

In addition, data scientists can provision resources automatically, without depending on IT administrators.