Deploying large language models (LLMs) at scale presents a dual challenge: delivering fast responses during periods of high demand while keeping GPU costs under control. Organizations often face a trade-off between provisioning additional GPUs for peak demand and risking SLA violations during traffic spikes:
- Deploy many replicas with GPUs to handle the worst-case traffic scenarios and pay for hardware that spends most of its time idling.
- Scale up aggressively from zero, and your users suffer through latency spikes.
Neither approach is ideal. The first drains your budget; the second risks frustrating your users.
In earlier posts, we introduced Run:ai Model Streamer, which addressed startup time bottlenecks to enhance auto-scaling efficiency. Today, we’re introducing another innovation designed to push the boundaries of GPU utilization for inference workloads even further: Run:ai GPU memory swap - in other words, Model Hot Swapping.
Why Model Hot Swapping?
Hot swapping introduces a more dynamic approach to resource management for model serving. It allows multiple models to share the same GPUs, even when their combined memory requirements exceed the available GPU capacity. Here's how it works (a minimal sketch follows this list):
- Dynamic Memory Offloading: Models that receive no requests for a given period no longer occupy GPU memory; their weights are swapped out to CPU memory while they are idle.
- Rapid Activation: When a request arrives, the model is swapped back into GPU memory with minimal latency.
- More Model Replicas, Less Hardware: Multiple models can share the same hardware, significantly reducing the number of always-on machines without compromising responsiveness. In addition, because the server (i.e., the CPU process) remains active even while the model's GPU memory is swapped out, the replica can be re-enabled quickly, since the server is already initialized.
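To make the mechanism concrete, here is a minimal, self-contained sketch of the underlying idea: keep a model's weights in CPU memory while it is idle and move them to the GPU only when a request arrives. This is not Run:ai's implementation, which operates at the platform and scheduler level; the `SwappableModel` wrapper, the model names, and the swap-on-every-request policy are illustrative assumptions only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class SwappableModel:
    """Illustrative wrapper: weights live in CPU RAM until a request arrives."""

    def __init__(self, model_name: str):
        # The serving process stays initialized; only the weights move later.
        # Requires enough CPU RAM to hold the idle model.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name).to("cpu")

    def swap_in(self) -> None:
        # CPU -> GPU copy; the cost is bounded mainly by PCIe bandwidth.
        self.model.to("cuda")

    def swap_out(self) -> None:
        # GPU -> CPU copy, then release the freed GPU memory for other models.
        self.model.to("cpu")
        torch.cuda.empty_cache()

    def generate(self, prompt: str, max_new_tokens: int = 32) -> str:
        self.swap_in()
        try:
            inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
            output = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
            return self.tokenizer.decode(output[0], skip_special_tokens=True)
        finally:
            self.swap_out()


# Two models can share one GPU even though both would not fit at once.
llama = SwappableModel("meta-llama/Meta-Llama-3.1-8B-Instruct")
mistral = SwappableModel("mistralai/Mistral-7B-Instruct-v0.3")
print(llama.generate("What is GPU memory swap?"))
print(mistral.generate("What is GPU memory swap?"))
```

In a real deployment, the decision to swap is made by the platform based on traffic rather than around every single call, and weights can be staged in pinned CPU memory to get closer to peak PCIe transfer rates.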
With hot swapping, organizations can efficiently handle unpredictable workloads while avoiding costly over-provisioning.
Benchmarking GPU Memory Swap: Validating Performance
To demonstrate the performance of GPU memory swap, we simulated real-world LLM deployment scenarios.
Benchmarking Setup
Models Tested
- Meta-Llama-3.1-8B-Instruct (FP32: ~32 GB)
- Mistral-7B (FP32: 27.5 GB)
- Falcon-11B (FP32: 41.83 GB)
Hardware and Software Environment
- GPU: NVIDIA L40S (48 GB) connected via PCIe Gen4 x8, i.e., limited to half of the GPU's maximum theoretical PCIe throughput
- Instance Type: AWS g6e.4xlarge
- Scheduler: Run:ai Scheduler (v2.19)
- Inference Engine: vLLM version 0.6.4 with default configurations
- The server image is preloaded onto the node and the model weights are cached on EBS storage, eliminating network transfer overhead in all scenarios.
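As a quick sanity check using the figures above, neither pair of models evaluated in the swap scenario below fits on the 48 GB L40S at the same time (considering the quoted weight footprints alone, before any KV-cache allocation), which is precisely the situation GPU memory swap targets:

```python
# Weight footprints (FP32) and GPU capacity as listed above, in GB.
weights_gb = {"Llama-3.1-8B-Instruct": 32.0, "Mistral-7B": 27.5, "Falcon-11B": 41.83}
l40s_capacity_gb = 48

pairs = [("Llama-3.1-8B-Instruct", "Mistral-7B"),
         ("Llama-3.1-8B-Instruct", "Falcon-11B")]
for a, b in pairs:
    total = weights_gb[a] + weights_gb[b]
    print(f"{a} + {b}: {total:.2f} GB -> fits in {l40s_capacity_gb} GB: {total <= l40s_capacity_gb}")
```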
Metrics
- Time-to-First-Token (TTFT): Measured from the moment the first request hits the server to when the model generates its first token. We used vLLM's official benchmarking script, simulating a production environment by disabling the warm-up phase.
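The essence of the measurement is simple: time the gap between submitting a streaming request and receiving the first generated token. The sketch below is not the vLLM benchmarking script itself; it assumes an OpenAI-compatible vLLM server already listening on localhost:8000 and uses an illustrative model name.

```python
import time
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible API server is already running on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


def measure_ttft(model: str, prompt: str, max_tokens: int = 64) -> float:
    """Seconds from request submission to the first streamed token."""
    start = time.perf_counter()
    stream = client.completions.create(
        model=model, prompt=prompt, max_tokens=max_tokens, stream=True
    )
    for chunk in stream:
        # The first chunk carrying text marks the first generated token.
        if chunk.choices and chunk.choices[0].text:
            return time.perf_counter() - start
    raise RuntimeError("Stream ended before any token was produced")


print(measure_ttft("meta-llama/Meta-Llama-3.1-8B-Instruct", "Hello, world. " * 16))
```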
Input Conditions
- Prompt Lengths: 128 tokens and 2048 tokens, with models stopping at the EOS token.
Results: Three Scenarios, Three Outcomes
We evaluated three distinct scenarios:
- Scale from Zero: Measuring time-to-first-token (TTFT) when loading a model from scratch.
- GPU Memory Swap Between Models on a Single GPU: Evaluating TTFT when a model is swapped from CPU memory back into GPU memory.
- Baseline (Warm Models): Establishing a baseline TTFT when the model is already resident in GPU memory.
1. Scaling from Zero—Long Delays
Scaling from zero involves initializing the pod, loading the model onto the GPU, and processing the first request. As expected, this approach resulted in the highest TTFT due to initialization overhead.
TTFT consistently exceeded 140 seconds for the smaller models and stretched beyond 200 seconds for the slightly larger ones. Delays of up to 208 seconds are often impractical for real-time applications, underscoring the inefficiency of scaling from zero in production.
2. GPU Memory Swap—Optimal Efficiency
For this test, models started in CPU memory and were dynamically swapped into GPU memory upon request. Two model groups were tested: the first consisted of Llama-3.1-8B and Mistral-7B, and the second of Llama-3.1-8B and Falcon-11B.
Each test followed this sequence (a small client-side driver sketch follows the note below):
- A request was sent to the first model, prompting the system to swap it from CPU into GPU memory; its TTFT was recorded.
- Once that model completed its task and was automatically swapped back to CPU memory, a request was sent to the second model, which was swapped into GPU memory in turn; its TTFT was recorded.
Note: With GPU Memory Swap, TTFT is bounded by PCIe bandwidth and the time it takes to swap models in and out between CPU and GPU memory. As a rough estimate, moving 30–40 GB of FP32 weights over an effective ~12–16 GB/s PCIe Gen4 x8 link takes on the order of 2–3 seconds, which is consistent with the results below.
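To illustrate how this sequence can be driven from the client side, the sketch below alternates streaming requests between the two models of a group and records TTFT for each request; the swaps in and out of GPU memory happen transparently on the platform side, so the client simply talks to two model endpoints. The endpoint, model identifiers, idle wait, and timing helper (the same approach as in the metrics section) are assumptions for illustration.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Illustrative identifiers for the two models sharing a single L40S.
GROUP = ["meta-llama/Meta-Llama-3.1-8B-Instruct", "mistralai/Mistral-7B-Instruct-v0.3"]


def ttft(model: str, prompt: str) -> float:
    """Seconds until the first streamed token arrives."""
    start = time.perf_counter()
    stream = client.completions.create(model=model, prompt=prompt, max_tokens=64, stream=True)
    for chunk in stream:
        if chunk.choices and chunk.choices[0].text:
            return time.perf_counter() - start
    raise RuntimeError("no token produced")


prompt = "Summarize GPU memory swap in one sentence. " * 4
for model in GROUP * 2:  # alternate between the two models
    print(f"{model}: TTFT = {ttft(model, prompt):.2f} s")
    time.sleep(60)  # stay idle long enough for the platform to swap the model back out
```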
Both groups (Llama-3.1-8B-Instruct paired with Mistral-7B, and Llama-3.1-8B-Instruct paired with Falcon-11B) produced consistent results across models and input sizes. Falcon-11B exhibited slightly longer TTFT than Mistral-7B, as expected given its larger memory footprint. However, this variation (~0.5 seconds) is minimal and well within acceptable performance ranges for real-world scenarios.
These results, just 2–3 seconds depending on the model and input sequence length, represent a ~50–66x improvement over scaling from zero.
3. Baseline Performance—Warm Models, High Costs
To establish a baseline, we measured TTFT for models already fully loaded into GPU memory. This represents the theoretical best-case scenario in terms of latency.
Warm models deliver near-instant responses but require the GPU to be fully dedicated to the model at all times. This leads to significant costs when handling multiple models or replicas, as GPUs remain underutilized during periods of low demand.
The Takeaway: Cost Efficiency Without Compromise
GPU memory swap achieves the ideal balance between performance and cost, reducing TTFT to just a few seconds. This approach enables organizations to consolidate workloads onto fewer GPUs while maintaining stringent SLAs, ensuring both efficiency and reliability. Compared to always-on warm models, this approach delivers significant cost savings with only a minor latency trade-off.
While Run:ai Model Streamer helps reduce TTFT in scale-from-zero scenarios by a few tens of seconds, GPU memory swap pushes TTFT below 10 seconds, making it a fit for applications that require such SLAs.
Deploy smarter, not harder. With GPU memory swap, you can maximize GPU efficiency, minimize idle costs, and maintain the responsiveness your users expect. Contact us to see GPU memory swap live and learn more about how Run:ai can transform your AI infrastructure.