
Why Run:AI - Virtualization for AI

by Run:ai Team – November 8, 2021

Why Should AI Infrastructure be Virtualized?

AI workloads are often run on bare metal, with resources allocated statically to individual data scientists. Static allocation limits experiment size and speed, leaves GPUs underutilized, and deprives IT of control and visibility.

Today, enterprises expect the simplicity and resource availability of the virtualized data center. But as they adopt deep learning (DL) initiatives, they find that the hallmarks of virtualized infrastructure, such as management at scale, simplified maintenance, and visibility, are not necessarily available in the AI stack. Virtualization of AI needs to support the unique nature of data science workloads while remaining easy for IT to manage and maintain. Learn more about pooling resources with GPU virtualization.

From Static to Dynamic Resource Allocation

At each stage of the deep learning process, data scientists have specific compute needs. The build stage requires CPU or GPU in interactive sessions. Training is highly compute-intensive and requires considerable GPU power; performance and speed are critical, but demand is erratic: sometimes many workloads run concurrently, and at other times none run at all while data scientists refine their models. The inference stage typically requires lower GPU utilization. Dynamic resource allocation that takes this process into account is critical for AI development, as the sketch below illustrates. Learn more about GPU scheduling in multi-GPU environments.
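To make the contrast with static allocation concrete, here is a minimal sketch in Python of a quota-based pool that loans idle GPUs out opportunistically. The class, quota policy, and numbers are illustrative assumptions, not Run:AI's actual scheduler:

```python
# Illustrative sketch of dynamic GPU allocation, not Run:AI's implementation.
# Each team has a guaranteed quota; idle GPUs are granted opportunistically
# and counted as reclaimable once their guaranteed owner needs them back.

class GpuPool:
    def __init__(self, total_gpus, quotas):
        self.total = total_gpus              # GPUs in the cluster
        self.quotas = quotas                 # guaranteed GPUs per team
        self.allocated = {t: 0 for t in quotas}

    def free_gpus(self):
        return self.total - sum(self.allocated.values())

    def request(self, team, n):
        """Grant up to n GPUs: guaranteed quota first, then idle surplus."""
        granted = min(n, self.free_gpus())
        self.allocated[team] += granted
        return granted

    def reclaimable(self, team):
        """GPUs a team holds beyond its quota (candidates for preemption)."""
        return max(0, self.allocated[team] - self.quotas[team])


pool = GpuPool(total_gpus=8, quotas={"team_a": 4, "team_b": 4})
print(pool.request("team_a", 8))   # team_a trains on all 8 idle GPUs -> 8
print(pool.reclaimable("team_a"))  # 4 of them are over-quota -> 4
print(pool.request("team_b", 4))   # -> 0 until over-quota GPUs are reclaimed
```

A static setup would cap team_a at its 4 assigned GPUs even while team_b's sit idle; a dynamic pool lets a single experiment use the whole cluster and gives the GPUs back when their owner returns.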

Run:AI's Kubernetes plugin simplifies workflows

Data scientists use containers for the agility and portability they need. Kubernetes is currently the de facto standard for orchestrating containerized applications in enterprise IT environments. For this reason, Run:AI was built as a Kubernetes plugin, enhancing Kubernetes' scheduling capabilities to support data scientists' existing workflows; the sketch below shows how a workload is routed to such a scheduler. See our guides to Kubernetes architecture for AI workloads.
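As a sketch of what this looks like in practice, the Python snippet below uses the official kubernetes client to submit a GPU pod that names an alternative scheduler. The scheduler name "runai-scheduler", the namespace, and the image are assumptions for illustration; check your cluster's configuration for the real values:

```python
# Sketch: submit a GPU training pod routed to a plugin scheduler.
# "runai-scheduler", the namespace, and the image are assumed values.

from kubernetes import client, config

config.load_kube_config()  # authenticate using the local kubeconfig

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job", labels={"project": "team-a"}),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",  # route the pod to the plugin scheduler
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="pytorch/pytorch:latest",
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "2"}  # ask for two GPUs
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="team-a", body=pod)
```

Because the pod is an ordinary Kubernetes object, data scientists keep their container-based workflow; only the scheduler name changes.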

Deep Learning Requires a Different Paradigm

Traditional computing uses virtualization to share a single physical resource among multiple workloads. Deep learning (DL), however, requires a different virtualization paradigm, one that incorporates elements of distributed computing: for AI, virtualization should accelerate a single workload by letting it take as many resources as it needs. Run:AI's "greedy" approach to virtualization better suits DL workloads, which are highly compute-intensive and often fully utilize hardware accelerators in parallel for days or even weeks, as in the sketch below. Learn best practices for machine learning infrastructure.
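For a concrete picture of that consumption pattern, here is a minimal PyTorch sketch, with a placeholder model and batch, of a single job greedily scaling across every GPU it has been granted rather than running on one fixed device:

```python
# Sketch of the "greedy" consumption pattern: one training job scales
# across however many GPUs it has been granted. The model and data are
# placeholders; the pattern, not the model, is the point.

import torch
import torch.nn as nn

n_gpus = torch.cuda.device_count()  # every GPU the scheduler allocated
model = nn.Linear(1024, 10)

if n_gpus > 1:
    # replicate the model and split each batch across all visible GPUs
    model = nn.DataParallel(model, device_ids=list(range(n_gpus)))

device = torch.device("cuda" if n_gpus > 0 else "cpu")
model.to(device)

batch = torch.randn(256, 1024, device=device)
output = model(batch)  # one workload, as many accelerators as available
print(f"forward pass ran on {max(n_gpus, 1)} device(s)")
```

Under static, one-GPU-per-user virtualization the same job would be pinned to a single device; with a greedy scheduler, the same script runs unchanged whether it is granted one GPU or the whole node.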