Inference & Training

From Storage to GPU—but Faster: Accelerating Model Loading with Run:ai Model Streamer

by
Ekin Karabulut
Noa Neria
Omer Dayan
October 29, 2024

When it comes to deploying large machine learning models, speed is everything. Whether you're serving recommendations in real time or running complex NLP tasks in production, there's a critical window from model load to inference. But as models balloon in size, from a couple of billion parameters to hundreds of billions, so does the time it takes to move them from storage to memory and get them ready to serve. This becomes an especially pressing issue when you are trying to scale your application up to match user demand.

This challenge led us to build the Run:ai Model Streamer, a purpose-built solution that makes large model loading faster, more efficient, and easier to manage, especially at scale. In this blog, we'll give you an inside look at why we created the Streamer, how it works, and a sneak peek at the benchmarking results that demonstrate its performance.

For those already looking to dive into the nitty-gritty technical details, be sure to check out the full benchmarking whitepaper for more insights into the tool and additional experiment results.

Why We Built Run:ai Model Streamer

When we first set out to improve the model-loading process, we were dealing with the same challenges many AI practitioners face: large models, slow load times, and therefore slow scale-up times. While various tools help with loading models into memory, they still leave gaps, particularly around speed, flexibility across different storage types (local file systems such as SSDs, cloud-based object stores such as S3, and other high-end storage solutions), and integration with the widely adopted safetensors model format.

Run:ai Model Streamer

How it works

At a high level, the Run:ai Model Streamer is built around two characteristics:

Reading Data in Parallel: Rather than loading model tensors sequentially, the streamer pulls model weights in parallel, using a user-specified level of concurrency to issue multiple concurrent requests to the storage system. This parallelization significantly reduces the time it takes to get the model into memory by saturating the available storage throughput, particularly when working with large files.
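
To illustrate the idea, here is a minimal sketch (not the streamer's actual implementation, which is native C++) of reading a file as byte ranges with a configurable level of concurrency; the function name and the `concurrency` parameter are purely illustrative.

```python
# Illustrative only: split a weights file into byte ranges and read them
# with a thread pool, so multiple requests hit the storage backend at once.
import os
from concurrent.futures import ThreadPoolExecutor

def read_ranges(path: str, concurrency: int = 16) -> bytes:
    size = os.path.getsize(path)
    chunk = max(1, -(-size // concurrency))  # ceiling division: one range per worker

    def read_range(offset: int) -> bytes:
        # Each worker opens its own handle and reads a single byte range.
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(min(chunk, size - offset))

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        parts = list(pool.map(read_range, range(0, size, chunk)))
    return b"".join(parts)
```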

Streaming Tensors to the GPU in Parallel: While reading model tensors in parallel, the streamer simultaneously transfers them to the GPU as soon as they are ready. This overlap means the model not only loads into memory faster but also reaches the GPU sooner, cutting down the wait before the replica is ready for inference.
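
Conceptually, the overlap looks something like the sketch below, expressed here with PyTorch tensors and Python threads. The `read_tensor` callable and `tensor_names` list are placeholders for whatever enumerates and reads the checkpoint's tensors; this is an approximation of the pattern, not the streamer's API.

```python
# Overlap storage reads with host-to-GPU copies: as soon as one tensor's
# read completes, copy it to the GPU while the remaining reads continue.
from concurrent.futures import ThreadPoolExecutor, as_completed

def stream_to_gpu(tensor_names, read_tensor, concurrency=8, device="cuda"):
    weights = {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Issue all reads up front, bounded by the concurrency level.
        futures = {pool.submit(read_tensor, name): name for name in tensor_names}
        for future in as_completed(futures):
            name = futures[future]
            cpu_tensor = future.result()  # a finished read, still in CPU memory
            # Pinned memory lets the asynchronous copy overlap with the
            # reads that are still in flight.
            weights[name] = cpu_tensor.pin_memory().to(device, non_blocking=True)
    return weights
```

The real streamer implements this pipeline in native C++, but the shape is the same: no tensor waits for the whole file before moving to the GPU.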

Key Design Principles

  • Accelerating Loading Times: At its core, the streamer dramatically reduces the time it takes to load large models into memory. Whether you're loading from high-speed SSDs, cloud-based storage like S3, or any other storage, the streamer optimizes the process to minimize latency and make models available for inference much faster than traditional loaders.
  • Support Across Multiple Storage Types: It doesn’t matter where your model weights live. One of the key features of the Model Streamer is its ability to work effectively across a variety of storage media: SSDs, NVMe, NFS, and cloud storage (like S3), adapting to different infrastructure setups without requiring additional configuration. This flexibility is essential for modern AI systems that often span both local and cloud-based infrastructure.
  • No Tensor Format Conversion Needed: The tool works seamlessly with the widely adopted safetensors format, eliminating the need for serialization and deserialization processes. This ensures ease of use across different deployment environments.
  • Easy Integration with Inference Engines: Native C++ code ensures high performance, while the safetensors iterator is implemented in much the same way as Hugging Face's safetensors library for ease of use. The streamer can be integrated into different inference engines (e.g., vLLM, TGI) by swapping the engine's safetensors iterator for the streamer's iterator, so it does not bypass the engine's native model-loading code; a rough sketch of this swap follows the list below.

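To make that integration point concrete, here is a hedged sketch of what such a swap could look like in an engine's weight-loading loop. The `safetensors_iterator` baseline uses Hugging Face's safetensors API; `runai_streamer_iterator` in the comment is a hypothetical stand-in name, not the actual Run:ai API.

```python
# The engine's loading loop stays the same; only the tensor iterator changes.
from safetensors import safe_open  # Hugging Face safetensors

def load_weights(path, tensor_iterator, device="cuda"):
    weights = {}
    for name, tensor in tensor_iterator(path):
        weights[name] = tensor.to(device, non_blocking=True)
    return weights

def safetensors_iterator(path):
    # Baseline: read tensors sequentially on the CPU with safetensors.
    with safe_open(path, framework="pt", device="cpu") as f:
        for name in f.keys():
            yield name, f.get_tensor(name)

# Swapping in a streamer-backed iterator (hypothetical name) would then be
# a one-line change at the call site, leaving the engine's code path intact:
#   weights = load_weights("model.safetensors", runai_streamer_iterator)
weights = load_weights("model.safetensors", safetensors_iterator)
```
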
Side note for Run:ai customers: In the latest releases of our product, we offer a ready-made vLLM container with the Run:ai Model Streamer already integrated. Please get in touch with your administrator for more information!

Sneak Peek: Benchmarking Results

In our benchmarking whitepaper, we put the Run:ai Model Streamer through its paces alongside other popular model loaders. The results were clear: the Run:ai Model Streamer came out ahead on every storage type we tested:

  • GP3 SSD: While the Safetensors Loader took 47.99 seconds to load the model from a GP3 SSD, the Run:ai Model Streamer cut that time to just 14.34 seconds.
  • IO2 SSD: Here too, the streamer outperformed the Safetensors Loader, loading the model in just 7.53 seconds, compared to 47 seconds with Safetensors.
  • Amazon S3: This is where the streamer truly shines. While Tensorizer required 37.36 seconds to load the model, the Run:ai Model Streamer loaded it in just 4.88 seconds, nearly an order of magnitude faster, thanks to its optimized concurrency handling for cloud storage.

To get started or learn more, check out our full benchmarking study and our GitHub repository.