We are pleased to announce the release of Run:ai 2.19, which includes several new features aimed at improving GPU utilization, workload management, and inference optimization. This release brings enhancements for researchers, developers, and platform administrators, offering greater control and performance across AI infrastructure. We are also introducing Run:ai Streamer, an open-source tool designed to reduce model cold start times, especially during autoscaling. It is now available under the Apache license.
Enhanced Visibility and Control for Researchers
Researchers now have improved capabilities to monitor and manage workloads with features like:
- Pending Workload Insights: Get detailed explanations on why workloads are pending, allowing faster troubleshooting and adjustments.
- Expanded GPU Resource Optimization Messages: Users can now view 'GPU resource optimization' events to better understand scheduling and GPU usage decisions.
- Bulk Deletion of Workloads: Manage multiple workloads simultaneously, reducing clutter and improving the efficiency of your workflow.
- Topology-Aware Scheduling: Ensure optimal placement of distributed training workloads across regions or availability zones, improving performance and resource utilization.
Optimizing Inference Workloads
Our focus on inference workloads continues with several critical updates, including:
- Expanded Data Sources: Inference workloads now support data sources of type NFS and hostPath, giving users more flexibility in data management.
- Improved Hugging Face Integration: Additional validation ensures smoother integration with Hugging Face models, minimizing submission errors.
- Rolling Inference Updates: ML engineers can now perform rolling updates on inference workloads, minimizing downtime and ensuring seamless transitions during model updates.
- Inference Endpoint Authorization: Securely control access to inference endpoints by specifying authorized users or groups, crucial for managing sensitive workloads.
Metrics, Telemetry, and CLI Improvements
For developers, Run:ai 2.19 brings enhanced visibility and ease of use with:
- Expanded Metrics and Telemetry: Gain deeper insights into GPU and CPU usage with new metrics at the cluster, node, and workload levels, accessible via the API.
- Improved CLI Autocompletion: CLI V2 now autocompletes project and workload names, streamlining the command-line experience and ensuring data consistency with the UI.
Platform Administrators: Enhanced Scheduling and Policy Management
Run:ai 2.19 offers platform administrators more control over AI infrastructure with:
- Department-Level Scheduling Rules: Assign priority and set scheduling rules at the department level, ensuring that resources are allocated in line with organizational priorities.
- Enhanced Audit Logs: A fully functional audit log lets admins track changes and actions across the platform for compliance and analysis.
Integration with Cloud Tools: Karpenter and COS Support
Run:ai continues to enhance its integration with leading cloud technologies, including:
- Karpenter Interworking: Optimize cloud costs and improve resource utilization with Karpenter, a Kubernetes cluster autoscaler that moves workloads between nodes and scales resources on demand.
- COS Support for GKE: With the addition of Container-Optimized OS (COS) support for GKE, enterprises using Google Cloud can now leverage the latest NVIDIA GPU Operator for better performance and cost efficiency.
Introducing Run:ai Streamer: Open-Source Model Deployment Simplified
In addition to the new features in Run:ai 2.19, we are excited to announce the availability of Run:ai Streamer as an open-source project. Run:ai Model Streamer was developed to tackle the cold start problem head-on by significantly reducing model loading times during inference.
Why Streamer?
As enterprises scale their AI infrastructure, they run into the cold start problem: loading model weights onto GPUs during startup takes time, which delays responses for users and drives up operational costs, especially during autoscaling. Streamer addresses this by:
- Faster Model Loading: Shorter model load times mean new replicas scale up and begin serving inference sooner.
- Apache Licensed: Streamer is available under the permissive Apache license, giving the community the flexibility to adapt and enhance the solution to meet their needs.
- Easy Integration with Inference Engines: The native C++ code offers high performance, while the Python API is user-friendly, similar to Hugging Face's safetensors library. It integrates with various inference engines (e.g., vLLM, TGI) by replacing the safetensors iterator with the streamer's iterator, without bypassing the engine's native model-loading code.
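The iterator-swap idea in the last point can be illustrated with a minimal sketch in plain Python, using only the standard library. It does not use the Streamer package or any engine's real loader: `load_weights`, `sequential_iterator`, and `streaming_iterator` are hypothetical names invented for this illustration, and raw bytes stand in for tensors. The point it shows is that a loader which consumes (name, data) pairs can stay unchanged while the iterator feeding it is replaced by one that reads concurrently.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Dict, Iterator, List, Tuple


def load_weights(weights: Iterator[Tuple[str, bytes]]) -> Dict[str, int]:
    """Hypothetical stand-in for an engine's loader.

    It only depends on receiving (name, data) pairs, so any iterator
    with that shape can feed it. Here it records each item's size.
    """
    return {name: len(data) for name, data in weights}


def sequential_iterator(paths: List[Path]) -> Iterator[Tuple[str, bytes]]:
    """Baseline: read one file at a time, in order."""
    for path in paths:
        yield path.name, path.read_bytes()


def streaming_iterator(
    paths: List[Path], concurrency: int = 4
) -> Iterator[Tuple[str, bytes]]:
    """Drop-in replacement: read files in worker threads and yield each
    one as soon as it is ready (completion order, not input order).

    Assumption for this sketch: all reads are submitted up front, so
    there is no cap on how much data is buffered in memory.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(path.read_bytes): path for path in paths}
        for future in as_completed(futures):
            yield futures[future].name, future.result()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Create a few small files that play the role of weight shards.
        shard_paths = []
        for i in range(4):
            shard = Path(tmp) / f"shard-{i}.bin"
            shard.write_bytes(b"\x00" * 1024 * (i + 1))
            shard_paths.append(shard)

        baseline = load_weights(sequential_iterator(shard_paths))
        streamed = load_weights(streaming_iterator(shard_paths))
        # The loader code is identical; only the iterator changed.
        print(baseline == streamed)
```

Because the loader never learns how the pairs were produced, the swap leaves the engine's own loading logic in place, which is the property the integration point above describes.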
How to Get Started with Streamer
We’ve made it easy to begin using Run:ai Streamer. Full documentation, including usage instructions, is available on our GitHub repository.
Looking Ahead
The 2.19 release, combined with the availability of Run:ai Streamer, reflects our continued commitment to pushing the boundaries of AI infrastructure management. Whether you’re managing large-scale distributed workloads, running inference at scale, or experimenting with cutting-edge models, Run:ai 2.19 and Streamer provide the tools you need to accelerate innovation.