
A look back on our Customer Journeys

by Ekin Karabulut – February 26, 2024

The inefficiency patterns in enterprises and how Run:ai helps them get more from their AI infrastructure investments and increase productivity

Like data scientists, we love finding patterns in data - so let’s look at the common patterns in our customers' stories together! As enterprises begin integrating artificial intelligence (AI) into their business and operations, the same challenges appear again and again, regardless of the vertical they are in.

In this blog, we have compiled some of our customers’ stories, analyzed the pitfalls they experienced in their AI infrastructure, and looked at how Run:ai came to their rescue, changing their compute visibility and control for good.

Our platform plays a vital role in addressing these common patterns, offering novel scheduling, GPU over-provisioning, and other capabilities to optimize resource utilization, enhance visibility, and streamline workflows.

Common Pitfalls for Enterprises

Low GPU Utilization:

Across all customer stories, a striking commonality was the underutilization of GPU resources, despite demand from researchers, leading to inefficiencies and wasted capacity. Although the utilization of existing hardware was extremely low, visibility issues and bottlenecks made it seem like additional hardware was necessary.

Resource Management Challenges:

The complexity of managing resources, especially in environments with a large number of GPUs and varied workloads, was a recurring theme. Distributed experiments requiring many GPUs were sometimes unable to start because smaller jobs, each allocating only a few GPUs without fully utilizing them, kept the larger jobs from ever reaching their resource requirements. GPU resources were statically allocated, which created bottlenecks at some times and idle-but-inaccessible infrastructure at others (the sketch below illustrates the difference between static quotas and a shared pool).
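To make the static-versus-pooled contrast concrete, here is a minimal, self-contained sketch. The team names, quota split, and job sizes are hypothetical, and the code illustrates the general idea rather than Run:ai's scheduler.

```python
# Illustrative only: a toy comparison of static per-team GPU quotas vs. a single shared pool.
# Team names, the quota split, and job sizes are hypothetical; this is not Run:ai's scheduler.
from dataclasses import dataclass


@dataclass
class Job:
    team: str
    gpus: int  # GPUs the job holds for its full duration


TOTAL_GPUS = 16
STATIC_QUOTA = {"team-a": 8, "team-b": 8}  # rigid split of the cluster

jobs = [
    Job("team-a", 2), Job("team-a", 2),  # team-a barely touches its quota...
    Job("team-b", 8), Job("team-b", 4),  # ...while team-b needs more than its share
]


def admitted_static(jobs):
    """Admit a job only if it fits inside its own team's fixed quota."""
    used = {team: 0 for team in STATIC_QUOTA}
    admitted = []
    for job in jobs:
        if used[job.team] + job.gpus <= STATIC_QUOTA[job.team]:
            used[job.team] += job.gpus
            admitted.append(job)
    return admitted


def admitted_pooled(jobs):
    """Admit jobs against one shared pool; idle GPUs serve whichever team needs them."""
    used, admitted = 0, []
    for job in jobs:
        if used + job.gpus <= TOTAL_GPUS:
            used += job.gpus
            admitted.append(job)
    return admitted


for name, admit in [("static quotas", admitted_static), ("shared pool", admitted_pooled)]:
    busy = sum(job.gpus for job in admit(jobs))
    print(f"{name}: {busy}/{TOTAL_GPUS} GPUs allocated ({busy / TOTAL_GPUS:.0%})")
```

With rigid quotas, team-b's 4-GPU job is rejected even though 4 GPUs sit idle on team-a's side; pooling the very same hardware lets every job run.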

Ineffective Prioritization:

Allocating resources according to business priorities and project significance was a struggle for some organizations, leading to suboptimal utilization. These companies had rigorous ROI estimation processes that weighed investment, risks, and returns and prioritized projects accordingly -- but the resource scheduling of the AI/ML project teams did not follow those priorities. Instead, researchers were using GPUs inefficiently, consuming whole GPUs instead of fractions, “hogging” GPUs and preventing other projects -- sometimes more important ones -- from using them.

Visibility and Monitoring Issues:

Poor visibility into GPU clusters and inefficient monitoring tools keep organizations from making data-driven decisions and from identifying underutilized resources within specific teams. This lack of visibility often leads to purchasing additional hardware, which only adds to the pool of idle, expensive resources.

Challenges in Training and Inference Workload Coexistence:

Effectively handling training and inference workloads in a shared cluster is a technical challenge. Training workloads are throughput-oriented batch jobs that can tolerate waiting or preemption, while inference workloads are latency-sensitive and need capacity the moment requests arrive; without proper management, these distinct requirements lead to inefficiencies. Scheduling them properly is key to optimizing resource utilization for both (one way to think about the split is sketched below).
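As a minimal sketch of that idea, the toy admission rule below gives latency-sensitive inference workloads the right to preempt training jobs, while training backfills whatever capacity is left over. The workload names, GPU counts, and preemption rule are hypothetical illustrations, not Run:ai's actual API or scheduler.

```python
# Illustrative only: a toy admission rule for a cluster shared by latency-sensitive inference
# and preemptible training. Names, GPU counts, and the rule itself are hypothetical,
# not Run:ai's API or scheduler.
from dataclasses import dataclass

TOTAL_GPUS = 8


@dataclass
class Workload:
    name: str
    gpus: int
    kind: str  # "inference" (must start now) or "training" (can be preempted)


running: list[Workload] = []


def free_gpus() -> int:
    return TOTAL_GPUS - sum(w.gpus for w in running)


def submit(w: Workload) -> None:
    """Inference evicts training jobs until it fits; training only backfills spare capacity."""
    if w.kind == "inference":
        while free_gpus() < w.gpus:
            victim = next((r for r in running if r.kind == "training"), None)
            if victim is None:
                break  # nothing left to preempt
            running.remove(victim)
            print(f"preempted training job {victim.name}")
    if free_gpus() >= w.gpus:
        running.append(w)
        print(f"started {w.kind} job {w.name} ({w.gpus} GPUs, {free_gpus()} still free)")
    else:
        print(f"queued {w.kind} job {w.name}")


submit(Workload("train-1", 6, "training"))
submit(Workload("train-2", 2, "training"))
submit(Workload("chat-endpoint", 4, "inference"))  # preempts training instead of queueing
```

Running the script, the 4-GPU inference endpoint preempts a training job instead of queueing behind it - the kind of behavior a shared training/inference cluster needs.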

Managing Tool Flexibility for Data Scientists:

Data scientists often have their preferred tools and frameworks (e.g., VS Code, Jupyter Notebook, PyCharm, TensorBoard, WandB). However, when the IT setup lacks tool flexibility or easy access to resources without time-consuming tool setups, the data science team faces delays. This hampers productivity and extends the time needed to start experiments, and the problem recurs every time data scientists want to switch tools, creating a repetitive cycle of productivity setbacks.

Regulatory Compliance and Data Governance:

Ensuring compliance with regulatory requirements and maintaining robust data governance and access-permission processes across teams and branches of the same company on multiple continents were critical concerns for certain industries.

Now let’s have a look at individual customer stories from various industries and how the Run:ai platform helps them get better value from their investment in resources while accelerating time to market.

Three Success Stories from Three Different Industries

King’s College London

Setup: Heterogeneous environment including different DGX types; a diverse research team with backgrounds in clinical research, data science, and AI, and with varying usage patterns.

Challenges before Run:ai:

  • Low Overall Utilization of AI Hardware: total GPU utilization was below 30%, with significant idle periods for some GPUs despite demand from researchers.
  • Overloaded System with Jobs Requiring More Resources: the system was overloaded on multiple occasions, with jobs needing more GPUs than were available.
  • Poor visibility and scheduling led to delays and waste: bigger experiments requiring a large number of GPUs were sometimes unable to begin because smaller jobs using only a few GPUs kept them from reaching their resource requirements.

Results after Run:ai:

  • Increased GPU Utilization: GPU utilization rose by 110%, resulting in faster experiments. Researchers ran more than 300 experiments in a 40-day period, compared to just 162 experiments run in a simulation of the same environment without Run:ai. By dynamically allocating pooled GPUs to workloads, hardware resources were shared more efficiently.
  • Improved Visibility: with advanced monitoring and cluster management tools, data scientists are able to see which GPU resources are not being used and dynamically adjust the size of their job to run on available capacity.
  • Fair Scheduling and Guaranteed Resources: allowing large ongoing workloads to use the optimal amount of GPU during low-demand times, and automatically allowing shorter, higher-priority workloads to run alongside. In one instance, a single data scientist was able to submit more than 50 concurrent jobs, which were smoothly run as resources became available.
  • More Completed Experiments: allowing the AI Centre to iterate faster and develop critical diagnostic tools and therapeutic pathways that will save lives.

A Global Bank (anonymized)

Setup: Multiple clusters, each with tens of high-end GPUs. Hundreds of researchers/data scientists across business units (e.g. HR, real estate, risk)

Challenges before Run:ai:

  • Concurrent usage of GPUs by almost 400 data scientists & low GPU utilization: Out of the bank’s almost 400 data scientists, only a few dozen were able to use the bank’s GPUs concurrently. This was a tremendous waste, especially as many projects were in an initial phase of building, with small data sets and no need for a whole GPU.
  • Prioritization of resources in teams: There were 15-25 active AI/ML projects vying for compute resources in the same GPU cluster. The bank had a rigorous ROI estimation process taking into consideration investment, risks, and returns and prioritizing projects accordingly -- but GPU allocation was not done according to these priorities. Instead, researchers were using GPUs inefficiently, consuming whole GPUs instead of fractions, “hogging” GPUs and preventing other projects -- sometimes more important ones -- from using them.
  • Maintaining tool flexibility - while ensuring coherence: Different data science groups across regions and teams used different tools. The bank wanted to keep this flexibility and freedom without creating a ‘Wild Wild West’ of siloed and inconsistent technologies sprinkled across the enterprise.

Results after Run:ai:

  • Fair scheduling and guaranteed resources: Using Run:ai, admins now easily control GPU fraction allocation according to business priorities and other factors like seasonality.
  • Increased GPU utilization, leading to faster model development, training, and deployment: Run:ai’s unique ability to use fractional GPUs instead of a whole GPU enables the bank to use its GPU cluster in the most efficient way possible. Now, the entire data science / research force of ~400 employees can share the cluster and access only the GPU fractions they need, when they need them (see the sketch after this list). This led to huge improvements in time efficiency and productivity for the data scientists, as well as faster model iterations and shorter time to production - getting business value from the models faster.
  • Regulatory compliance: All projects using the Run:ai platform adhere to the bank’s regulatory, risk, and compliance requirements. Data governance processes and robust access permissions are also an integral part of the environment.
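To illustrate why fractions matter, here is a minimal first-fit packing sketch. The quarter-GPU request size and the packing rule are hypothetical and only meant to show the general idea; they are not Run:ai's implementation.

```python
# Illustrative only: first-fit packing of fractional GPU requests onto whole GPUs.
# The quarter-GPU request size is hypothetical; this sketches the idea, not Run:ai's implementation.

def pack(requests: list[float], gpus: int) -> list[list[float]]:
    """Place each fractional request on the first GPU that still has enough spare capacity."""
    capacity = [1.0] * gpus                # remaining fraction per physical GPU
    placement = [[] for _ in range(gpus)]  # which fractions ended up on which GPU
    for fraction in requests:
        for i, free in enumerate(capacity):
            if fraction <= free + 1e-9:    # small tolerance for float arithmetic
                capacity[i] -= fraction
                placement[i].append(fraction)
                break
        else:
            print(f"request for {fraction} of a GPU queued: no capacity left")
    return placement


# Twelve early-stage notebooks asking for a quarter of a GPU each:
for gpu, fractions in enumerate(pack([0.25] * 12, gpus=4)):
    print(f"GPU {gpu}: {fractions} ({sum(fractions):.2f} used)")
```

Twelve quarter-GPU notebooks land on three physical GPUs instead of occupying twelve whole ones - which is how a cluster with tens of GPUs can serve hundreds of data scientists working on early-stage projects.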


A Defense Company (anonymized)

Setup: An air-gapped, private cloud that can run training as well as low-latency, high-throughput inference. The cluster would be used by over 200 researchers and multiple teams.

Challenges before Run:ai:

  • Training & inference in the same cluster: Managing both training and inference on hundreds of on-prem NVIDIA DGX Systems and NVIDIA GPUs for hundreds of researchers and AI teams was complex.
  • Lack of Resource Management: The absence of resource management compounded the difficulties in handling the extensive training and inference demands.
  • Inferior Inference Performance: Achieving low latency and maximum throughput for inference workloads was a struggle, given that the resources at hand were limited. In this case, better inference performance was feasible only when models moved beyond the memory-bound regime, which required advanced scheduling to harmonize inference with training workloads at the cluster level.
  • Setup challenges: Operating in a fully air-gapped environment added an extra layer of complexity to the setup.

Results after Run:ai:

  • Pool them all with node pools, rule them all with policies: By pooling all GPUs and applying pre-set priorities and policies, the customer was able to dynamically allocate resources to hundreds of users and multiple teams without management overhead. This resulted in greater availability of GPUs for training, speeding up research time. When NVIDIA Triton and Run:ai were used together to run inference workloads, a 4.7x increase in throughput was observed.
  • Improved utilization: Orchestration of GPUs in the cluster was optimized, pushing utilization above 80%.
  • Faster training times: Speed of model training increased while utilization of the overall cluster was maximized.
  • On-demand access for data scientists: The customer successfully managed resource allocation of a shared pool of GPUs for the entire research team (hundreds of researchers), enabling on-demand access for all users and teams – essentially creating a private managed GPU cloud.
  • Higher ROI, more data-driven decisions: Based on actual usage patterns, the customer was able to see when additional server investments were necessary, for better planning and ROI.

For more customer stories, check out our case studies.

The diverse success stories from King's College London, a Global Bank, and a Defense Company underscore the transformative impact Run:ai has on overcoming common pitfalls in AI infrastructure - no matter which vertical your organization is in. Whether it's enhancing GPU utilization, streamlining resource management, ensuring fair prioritization, or addressing the complex challenges of running training and inference workloads together, the Run:ai platform makes the most of your resources while letting each data scientist get what they need, whenever they need it.

Have questions or want to see the Run:ai platform in action? Get in touch with us.