
Announcing Amazon SageMaker HyperPod & Run:ai Integration

by Neir Benyamin and Rob Magno – November 12, 2024

We’re thrilled to announce a new integration between AWS and Run:ai, bringing together the power of Amazon SageMaker HyperPod and our innovative AI workload and GPU orchestration platform. This collaboration is set to enhance the way data scientists and ML engineers deploy and manage their workloads across AWS.

Amazon SageMaker HyperPod offers several key benefits for AI/ML practitioners. It provides a fully resilient, persistent cluster purpose-built for large-scale distributed training and inference, removing the undifferentiated heavy lifting involved in managing ML infrastructure and optimizing resource utilization across multiple GPUs, which significantly reduces the time to train models. HyperPod supports any model architecture, allowing teams to scale their training jobs efficiently, and it enhances resiliency by automatically detecting and handling infrastructure failures so that training jobs recover seamlessly without significant downtime. Overall, it improves productivity and accelerates the ML lifecycle.

The Run:ai platform streamlines AI workload and GPU orchestration across hybrid environments—on-premises and public/private clouds—all from a single interface. This centralized approach greatly benefits IT administrators overseeing GPU resources across different geographic locations and teams, enabling efficient use of on-prem, AWS cloud, and hybrid GPU resources while allowing for seamless cloud bursts when demand increases.

Both AWS and Run:ai technical teams have successfully tested and validated the technical integration between Amazon SageMaker HyperPod and Run:ai. This integration allows users to leverage the flexibility of Amazon SageMaker HyperPod's capabilities while benefiting from Run:ai's GPU optimization, orchestration, and resource management features.

Benefits of Run:ai and SageMaker HyperPod

With the integration of Run:ai and Amazon SageMaker HyperPod, organizations can now seamlessly extend their AI infrastructure across both on-premises and public/private cloud environments, taking advantage of the following key benefits:

1. Unified GPU Resource Management Across Hybrid Environments

Run:ai provides a single control plane that allows enterprises to efficiently manage GPU resources across enterprise infrastructure and Amazon SageMaker HyperPod. It also provides a simplified way (through the GUI or CLI) for scientists to submit their jobs to either their on-prem or HyperPod nodes. This centralized approach streamlines the orchestration of workloads, enabling admins to allocate GPU resources based on demand while ensuring optimal utilization across both environments. Whether on-premises or in the cloud, workloads can be prioritized, queued, and monitored from a single interface, simplifying operations for IT teams.
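
To make this concrete, here is a minimal sketch of submitting a one-GPU training workload programmatically with the Kubernetes Python client on a Run:ai-managed cluster. The scheduler name (`runai-scheduler`), the `project` label, the namespace, and the image are illustrative assumptions; actual values depend on your Run:ai deployment, and the Run:ai GUI and CLI provide equivalent, simpler submission paths.

```python
# Minimal sketch: submitting a one-GPU training job to a Run:ai-managed
# cluster via the Kubernetes Python client (pip install kubernetes).
# The scheduler name, label key, namespace, and image below are
# illustrative assumptions; consult your Run:ai setup for actual values.
from kubernetes import client, config

config.load_kube_config()  # uses your current kubeconfig context

job = client.V1Job(
    metadata=client.V1ObjectMeta(
        name="llama-finetune",
        labels={"project": "team-a"},  # assumed Run:ai project label
    ),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                scheduler_name="runai-scheduler",  # assumed scheduler name
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="my-registry/trainer:latest",  # placeholder
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "1"}
                        ),
                    )
                ],
            )
        )
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="runai-team-a", body=job)
```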

2. Enhanced Scalability and Flexibility

With Run:ai, organizations can easily scale their AI workloads by bursting to SageMaker HyperPod when additional GPU resources are needed. This hybrid cloud strategy allows businesses to scale dynamically without over-provisioning hardware, reducing costs while maintaining high performance. SageMaker HyperPod's flexible infrastructure further supports large-scale model training and inference, making it ideal for enterprises looking to train or fine-tune foundation models such as Llama or Stable Diffusion.

3. Resilient Distributed Training

Run:ai's integration with SageMaker HyperPod enables efficient management of distributed training jobs across clusters. HyperPod continuously monitors the health of GPU, CPU, and network resources, automatically replacing faulty nodes to maintain system integrity. In parallel, Run:ai minimizes downtime by automatically resuming interrupted jobs from the last saved checkpoint, reducing the need for manual intervention and minimizing engineering overhead. This combination helps keep AI projects on track, even in the face of hardware or network issues.
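
As an illustration of the checkpoint-resume pattern that makes this possible, here is a minimal PyTorch sketch. It assumes the training script writes periodic checkpoints to shared storage; the path, model, and save frequency are illustrative. Run:ai's role is to restart the interrupted job, after which logic like this picks up from the last checkpoint rather than starting over.

```python
# Minimal sketch of checkpoint-based resumption in PyTorch. Paths and
# intervals are illustrative assumptions; in practice the checkpoint
# directory would live on shared storage visible to all nodes.
import os
import torch
import torch.nn as nn

CKPT_PATH = "/shared/checkpoints/latest.pt"  # assumed shared location

model = nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

start_step = 0
if os.path.exists(CKPT_PATH):
    # A restarted job loads the last saved state instead of starting
    # from scratch.
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_step = ckpt["step"] + 1

for step in range(start_step, 10_000):
    optimizer.zero_grad()
    loss = model(torch.randn(32, 1024)).pow(2).mean()  # dummy objective
    loss.backward()
    optimizer.step()

    if step % 500 == 0:  # checkpoint frequency is a tunable assumption
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step},
            CKPT_PATH,
        )
```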

4. Optimized Resource Utilization

Run:ai’s AI workload and GPU orchestration capabilities ensure that AI infrastructure is used efficiently. Whether running on on-premises GPUs or SageMaker HyperPod clusters, Run:ai’s advanced scheduling and GPU fractioning capabilities help optimize resource allocation, allowing organizations to run more workloads on fewer GPUs. This flexibility is particularly valuable for enterprises managing fluctuating demands, as compute needs often vary by time of day or season. Run:ai adapts to these shifts, prioritizing resources for inference during peak demand while balancing training requirements, ultimately reducing idle time and maximizing GPU return on investment.
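
To give a feel for the idea behind GPU fractioning, here is a small PyTorch sketch that caps a process's share of a GPU's memory so that several lightweight workloads can coexist on one device. This is not Run:ai's actual mechanism, which enforces fractions at the scheduling layer across the cluster; the 0.5 fraction here is an arbitrary example of the underlying concept.

```python
# Conceptual illustration of sharing a single GPU between processes by
# capping per-process memory. This is NOT Run:ai's implementation (Run:ai
# enforces fractions at the scheduler level); it only demonstrates the
# idea that one GPU can safely host multiple small workloads.
import torch

if torch.cuda.is_available():
    # Limit this process to roughly half of GPU 0's memory, leaving
    # headroom for a second workload on the same device.
    torch.cuda.set_per_process_memory_fraction(0.5, device=0)

    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x  # runs normally as long as usage stays under the cap
    print(y.norm())
```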

As part of our validation process, we tested several key capabilities, including hybrid and multi-cluster management, automatic job resumption after hardware failures, FSDP elastic PyTorch preemption, inference serving, and Jupyter integration, as well as overall resiliency.

In the coming weeks, we will be collaborating with AWS to release a joint reference architecture. This resource will outline best practices and configurations for managing hybrid environments that integrate both on-premises and Amazon SageMaker HyperPod GPUs, enabling more efficient and scalable AI deployments.

Stay tuned for more updates as we continue to help you elevate your AI projects with the combined power of Run:ai and Amazon SageMaker HyperPod.

For more information, please reach out to Run:ai at [email protected].