
LLM Serving Improvements by Run:ai

by Lior Hillel, July 22, 2024

At Run:ai, we've dedicated countless hours to researching and enhancing LLM serving. With the release of Run:ai version 2.18, we're excited to share a series of significant improvements and innovations. Here's an overview of our latest advancements:

1. Inference Workloads as First-Class Citizens in Run:ai

Unified and Asset-based Experience

The user experience for submitting inference workloads is now asset-based, just as it already is for other workload types. This means you can submit an inference workload via the Run:ai UI or API by specifying your project's assets: environment, compute resources, auto-scaling configuration, and data sources. Pre-configured assets can be reused across different workloads, simplifying management and boosting efficiency. With a couple of clicks, your model is deployed in your cluster.
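To give a sense of what an asset-based submission can look like programmatically, here is a minimal Python sketch against a REST endpoint. The endpoint path, payload fields, and asset names are illustrative assumptions, not the exact Run:ai API schema; the API reference has the authoritative payload.

```python
# Illustrative sketch only: the endpoint path, payload fields, and asset
# names below are assumptions, not the exact Run:ai API schema.
import requests

RUNAI_URL = "https://my-runai-cluster.example.com"  # hypothetical control-plane URL
TOKEN = "..."  # API token for your project

payload = {
    "name": "llama-chat-serving",
    "projectId": "team-nlp",                       # project the workload belongs to
    "assets": {                                    # reuse pre-configured assets
        "environment": "vllm-env",
        "compute": "single-a100",
        "dataSources": ["models-pvc"],
    },
    "autoscaling": {"metric": "concurrency", "target": 10, "minReplicas": 0},
}

resp = requests.post(
    f"{RUNAI_URL}/api/v1/workloads/inferences",    # hypothetical endpoint
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print("Submitted workload:", resp.json())
```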

2. Advanced Auto-Scaling

Knative, but Better

When requests to your application increase, scaling up replicas to meet performance requirements, in other words auto-scaling, becomes crucial. That's why we have enhanced Knative's complex auto-scaling capabilities with an intuitive interface: you can easily set the most relevant parameters without dealing with raw configuration. We've also stabilized auto-scaling with Knative and added latency as a metric, so you can now choose between concurrency, throughput, and latency to optimize the auto-scaling of your applications when submitting inference workloads.
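Under the hood, these choices map onto Knative's pod autoscaler settings. The sketch below shows the kind of per-revision annotations involved; note that stock Knative natively understands concurrency and requests-per-second, while latency-based scaling is the extension described above.

```python
# Rough sketch: Knative pod-autoscaler annotations behind a metric choice.
# Stock Knative's KPA understands "concurrency" and "rps" (throughput);
# latency-based scaling is the Run:ai extension described above.
def autoscaling_annotations(metric: str = "concurrency", target: int = 10) -> dict:
    """Return per-revision Knative autoscaling annotations for the chosen metric."""
    return {
        "autoscaling.knative.dev/class": "kpa.autoscaling.knative.dev",
        "autoscaling.knative.dev/metric": metric,       # e.g. "concurrency" or "rps"
        "autoscaling.knative.dev/target": str(target),  # desired per-replica value
    }

print(autoscaling_annotations("rps", 50))
```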

Custom Metrics

Finding the optimal metric for auto-scaling is challenging. Together with our design partners, we plan to introduce more custom metrics tailored to diverse use cases, ensuring your applications scale efficiently.

Scale-to-Zero

During periods of low demand, the number of replicas can also be scaled down to zero, minimizing resource use and cutting costs. The freed resources can then be used by your teams' other workloads.
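In Knative terms, scale-to-zero is simply a replica lower bound of zero. A minimal sketch of the corresponding per-revision annotations:

```python
# Minimal sketch: replica bounds that enable scale-to-zero in Knative.
def replica_bounds(min_replicas: int = 0, max_replicas: int = 4) -> dict:
    return {
        "autoscaling.knative.dev/min-scale": str(min_replicas),  # 0 allows scaling to zero
        "autoscaling.knative.dev/max-scale": str(max_replicas),  # cap for traffic spikes
    }
```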

3. Inference Metrics Visualization

Comprehensive Metrics Overview

We provide a detailed overview of inference metrics directly in the Run:ai UI. Monitor and analyze key performance indicators such as latency and throughput in real time. These insights help you fine-tune your deployments for optimal performance and resource utilization.

4. Easily Expose Your Application to Authorized Users

Automatic private and public URLs

When a model is deployed via tools like the ChatBot UI, we automatically generate an internal URL and, optionally, an external URL for it, so it can be reached easily from within your cluster or from outside without complex ingress configuration. Additionally, you can whitelist access to these URLs and restrict them to specific teams or users (e.g., a URL can be made accessible only to the finance team). This dual capability of creating the ingress and managing access control simplifies the deployment process and enhances security.
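As an example of consuming one of these URLs, the sketch below calls a vLLM-backed deployment through its OpenAI-compatible endpoint; the URL and token are placeholders for your own generated URL and whitelisted credentials.

```python
# Placeholder URL and token; vLLM-backed deployments expose an
# OpenAI-compatible API, so the request body follows that spec.
import requests

INTERNAL_URL = "http://llama-chat.team-nlp.svc.cluster.local"  # hypothetical internal URL
TOKEN = "..."  # credential of a whitelisted user or team

resp = requests.post(
    f"{INTERNAL_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```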

5. Deploying State-of-the-Art Models

Faster Experimentation with Hugging Face

Our latest release of the LLM Catalog extends model deployment support to all Hugging Face models supported by vLLM, such as Llama and Falcon. Deploy large language models from Hugging Face with minimal effort using our integration: simply provide your Hugging Face token and the model name, and we handle downloading the model, loading it onto the GPU, and setting up vLLM automatically.
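Conceptually, what the catalog automates is roughly the following vLLM flow; the model name is just an example, and the token is picked up by the Hugging Face client libraries.

```python
# Conceptual sketch of the automated steps: authenticate to Hugging Face,
# download the weights, load them onto the GPU with vLLM, and generate.
import os
from vllm import LLM, SamplingParams

os.environ["HF_TOKEN"] = "hf_..."  # your Hugging Face access token

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # any vLLM-supported Hugging Face model
outputs = llm.generate(["What does auto-scaling do?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```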

Deploy from File System/S3

Bring your own fine-tuned model without setting up an inference engine. Connect your file system/S3 data source and deploy the model, leveraging Run:ai's auto-scaling capabilities.
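The same engine can serve a checkpoint from a mounted path instead of the Hub; the path below is a placeholder for wherever your file system or S3 data source is mounted inside the workload.

```python
# Placeholder path: point vLLM at a local, Hugging Face-format checkpoint
# exposed by your file-system or S3 data source.
from vllm import LLM

llm = LLM(model="/mnt/models/my-finetuned-llama")
print(llm.generate(["Give me a one-line status update."])[0].outputs[0].text)
```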

Curious about the entire release? Check out the announcement here.

Ready to get started? Book your demo today and see how Run:ai can help you accelerate AI development and increase efficiency.