Infrastructure

Fractional GPUs: Run:ai’s New GPU Compute Sharing

by
Raz Rotenberg
Eli Ginot
January 4, 2024

As the demand for GPU resources in AI continues to grow, effective resource management becomes crucial to ensuring optimal performance and efficient allocation of these valuable resources. Fractional GPUs play a key role in optimizing GPU utilization and letting users right-size their GPU workloads. In this blog we describe the recent extension of Run:ai's Fractional GPU technology, from supporting configurable GPU memory per workload to supporting configurable GPU compute as well.

The Problem with Shared GPU Workloads

When a workload executes on a GPU, it uses both the GPU's memory and compute subsystems. The memory subsystem handles data storage and retrieval, while the compute subsystem performs the actual computations the workload requires. A single workload running alone can use the full capacity of both subsystems, but when multiple workloads share a GPU they inevitably compete for the available resources. Without effective management, these workloads may not receive the memory and compute resources they need, resulting in diminished performance and user dissatisfaction.

In shared GPU clusters, Run:ai's fractional GPU technology has let users configure the memory subsystem per workload, ensuring efficient allocation of GPU memory among multiple workloads sharing a single GPU. Until now, however, the compute subsystem could not be configured per workload: GPU compute was distributed uniformly among concurrent workloads. For example, a workload allocated 50% of the GPU memory would get the full GPU compute when running alone, but only 1/6 of it when five additional workloads used the GPU concurrently. As a result, throughput and latency were unpredictable, since they depended on the number of concurrent workloads.
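A quick back-of-envelope illustration (plain Python, not Run:ai code) makes the problem visible: the configured memory fraction stays fixed, but under uniform sharing the compute share shrinks with every workload that joins the GPU.

```python
# Back-of-envelope illustration: with uniform time slicing, a workload's
# effective compute share depends only on how many workloads share the GPU,
# while its memory fraction stays at the configured value.

memory_fraction = 0.5  # configured per-workload GPU memory (50%)

for concurrent_workloads in [1, 2, 6]:
    compute_share = 1.0 / concurrent_workloads  # uniform distribution of time slices
    print(
        f"{concurrent_workloads} workload(s) on the GPU: "
        f"memory = {memory_fraction:.0%}, compute share = {compute_share:.0%}"
    )

# Output:
# 1 workload(s) on the GPU: memory = 50%, compute share = 100%
# 2 workload(s) on the GPU: memory = 50%, compute share = 50%
# 6 workload(s) on the GPU: memory = 50%, compute share = 17%
```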


Run:ai's Solution: Prioritization and Customizable Compute Allocation

To address these challenges, we introduce new configurations that enable finer control over GPU compute sharing on a per-workload basis. A conceptual sketch of how these modes interact follows the list below.

  • Priority-Based Compute Sharing (priority-based mode):
    Workloads can now be configured with specific priorities, ensuring that higher-priority workloads receive full access to the compute resources until they are done. This enhancement empowers users to guarantee consistent performance for critical tasks.
  • Configurable Ratio of Time Slices (fair mode):
    Users can define the ratios of time slices that each workload receives on the GPU. This feature provides fine-grained control over resource allocation, ensuring that workloads receive the appropriate share of compute resources based on their importance.
  • Configurable Upper Limits for Compute Consumption (strict mode):
    Users can configure upper limits on compute resource utilization for each workload. This prevents any single workload from monopolizing resources and negatively impacting the performance of others, ensuring fair resource sharing.
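To make the three modes concrete, here is a minimal conceptual sketch in Python of how compute time slices could be divided under these policies. This is purely an illustration of the general ideas (priority preemption, ratio-based sharing, and hard caps), not Run:ai's scheduler; the workload names and numbers are made up, and the real system enforces these modes at GPU time-slicing level rather than as a simple function.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Workload:
    name: str
    priority: int = 0              # priority-based mode: higher priority runs first
    ratio: float = 1.0             # fair mode: relative share of time slices
    limit: Optional[float] = None  # strict mode: hard cap on compute fraction (0..1)

def compute_shares(workloads):
    """Estimate the fraction of GPU compute each workload would receive.

    Priority-based: only the highest-priority workloads run; the rest wait.
    Fair: active workloads split time slices in proportion to their ratios.
    Strict: a workload's share never exceeds its configured limit.
    """
    top = max(w.priority for w in workloads)
    active = [w for w in workloads if w.priority == top]  # lower priorities are paused
    total_ratio = sum(w.ratio for w in active)

    shares = {w.name: 0.0 for w in workloads}
    for w in active:
        share = w.ratio / total_ratio      # fair mode: configured ratio of time slices
        if w.limit is not None:
            share = min(share, w.limit)    # strict mode: upper limit on consumption
        shares[w.name] = share
    return shares

# A high-priority real-time server preempts lower-priority workloads entirely:
print(compute_shares([
    Workload("realtime", priority=1),
    Workload("batch", ratio=1.0),
    Workload("notebook", ratio=3.0, limit=0.5),
]))
# -> {'realtime': 1.0, 'batch': 0.0, 'notebook': 0.0}

# With only equal-priority workloads, time slices follow the 1:3 ratio,
# and the notebook is capped at its 50% limit:
print(compute_shares([
    Workload("batch", ratio=1.0),
    Workload("notebook", ratio=3.0, limit=0.5),
]))
# -> {'batch': 0.25, 'notebook': 0.5}
```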

These capabilities integrate with Kubernetes through request and limit values: workloads can be assigned a request and a limit for GPU compute, just as they can for CPU and memory. This simplifies resource allocation and management within Kubernetes clusters and mirrors the existing integration for GPU memory sharing.
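A rough sketch of what this could look like is shown below (plain Python building a pod manifest). The GPU-related annotation keys are illustrative placeholders only, not the documented Run:ai API; the point is that GPU compute receives request and limit values alongside the usual CPU and memory ones.

```python
# Illustrative only: builds a pod manifest with per-workload GPU compute
# request/limit fields. The annotation keys below are placeholders; consult
# the Run:ai documentation for the actual field names in your version.
import yaml  # PyYAML

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "inference-server",
        "annotations": {
            # Placeholder keys -- NOT the documented Run:ai API:
            "example.com/gpu-memory-fraction": "0.5",  # fraction of GPU memory
            "example.com/gpu-compute-request": "0.3",  # guaranteed share of GPU compute
            "example.com/gpu-compute-limit": "0.6",    # upper bound on GPU compute
        },
    },
    "spec": {
        "containers": [
            {
                "name": "server",
                "image": "my-inference-image:latest",
                # CPU and memory use the standard request/limit mechanism,
                # which the GPU compute request/limit is modeled after.
                "resources": {
                    "requests": {"cpu": "2", "memory": "4Gi"},
                    "limits": {"cpu": "4", "memory": "8Gi"},
                },
            }
        ]
    },
}

print(yaml.safe_dump(pod, sort_keys=False))
```

Mirroring the familiar CPU and memory request/limit semantics means existing Kubernetes tooling and mental models carry over directly to GPU compute.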

Use Cases for Configurable GPU Compute Sharing

The versatility of Run:ai's GPU compute management solution opens the door to a wide range of use cases:

  • Model Inference Servers with Different Priorities:
    Consider multiple inference servers with varying priorities: some handle real-time requests that require immediate responses, while others handle background or offline requests with non-strict Service Level Agreements (SLAs). Run:ai's solution can pause low-priority tasks so that critical real-time servers receive the resources they need exactly when they need them, avoiding performance bottlenecks and keeping operations smooth.
  • Model Inference Servers with Different SLAs:
    When some inference servers require rapid response times while others can tolerate longer ones, different time-sharing ratios can be configured per server, ensuring GPU compute is distributed according to need and SLA requirements.
  • Different Users Training Models on a Shared GPU Cluster:
    In research and development environments, multiple users might share a GPU cluster for training AI models. By allowing users to set their own priorities and resource allocations, Run:ai's solution ensures fair access to resources and consistent model training performance.

Conclusions

Run:ai extends its Fractional GPU capabilities with granular control over GPU compute allocation, empowering users to maximize the efficiency of their GPU clusters and meet the diverse needs of different workloads. Whether training models or running inference servers, Run:ai enables right-sizing of both GPU memory and GPU compute, ensuring optimal resource utilization and reliable workload performance.