Model loading is a critical aspect of machine learning deployment, yet it remains a significant bottleneck that many organizations strive to overcome. Given its complexity, various companies and researchers have developed their own methods for loading weights. In this blog, we will explore these techniques and provide an overview of the different tools available in this space.
Loading model weights quickly for inference in auto-scaling scenarios is particularly challenging. State-of-the-art models such as LLaMA-2-7B or LLaMA-3-8B are around 12-15 GB in 16-bit precision, while 70-billion-parameter variants require even more memory (around 129 GB in 16-bit precision). Loading these weights from local (SSD, NVMe), cluster-wide (NFS, Lustre), or cloud storage (S3) into CPU memory and then into GPU memory is time-consuming and complex. The duration depends on the techniques your model loader employs as well as the specifics of your storage and connection, so model loading performance is shaped by multiple interacting factors.
For those eager for a quick summary, scroll down to the comparison section at the end for a brief overview of these tools and key considerations for choosing between them.
For those new to this topic, let’s begin unpacking the tools!
Model Loading: How Does It Work?
Loading a machine learning model to a GPU for inference involves two main steps:
- Read Weights from Storage to CPU Memory: Load the model's weights (which can be in various formats such as .pt, .h5, .safetensors, or custom formats) from storage (local, cluster-wide, or cloud) into CPU memory.
- Move Model to GPU: Transfer the model's parameters and relevant tensors to GPU memory.
Notably, when loading models from cloud-based storage like S3, the process involves an additional step of loading the model to local disk as an intermediate stage before transferring it to CPU and GPU memory.
Traditionally, these steps take place sequentially, which makes model loading one of the biggest bottlenecks when scaling up.
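To make the sequential flow concrete, here is a minimal sketch of the traditional path using PyTorch and Hugging Face Transformers: the checkpoint is first read into CPU memory, and only then copied to the GPU. The model name is just an example.

```python
# Minimal sketch of the traditional, sequential loading path
# (the model name is only an example and may require access approval).
import torch
from transformers import AutoModelForCausalLM

# Step 1: read all weights from storage into CPU memory.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

# Step 2: copy the parameters from CPU memory to GPU memory.
model = model.to("cuda")
model.eval()
```

Because step 2 only begins after step 1 has finished, the GPU sits idle while the weights are read from storage, which is exactly the gap the tools below try to close.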
Model Loading Offerings
With an understanding of model loading basics, let's check out the various libraries that offer model loading solutions, as outlined in Table 1.
HuggingFace (HF) Safetensors
The HuggingFace (HF) Safetensors Loader is an open-source tool developed to streamline and secure the loading and saving of tensors in a variety of machine learning workflows. It utilizes a memory-mapped file system to enable zero-copy loading, which eliminates unnecessary data copying and speeds up read operations, especially for large models.
For CPU operations, the loader directly maps tensor data from storage into memory, reducing memory usage. On the GPU, it creates an empty tensor in PyTorch and uses cudaMemcpy to transfer the tensor data directly, ensuring efficient memory utilization. Additionally, the loader supports shared tensors, preventing the duplication of tensor data referenced by multiple layers, which helps optimize memory usage.
By offering zero-copy loading and support for shared tensors, HF Safetensors Loader provides a faster and more reliable alternative to traditional loading methods, such as those based on pickle. With its integration of the safetensors format and support for functions like save_pretrained() with the safe_serialization flag, the loader simplifies model saving and loading, making it a practical solution for efficient and secure tensor management in various deployment scenarios.
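For illustration, here is a minimal sketch (model name and paths are only examples) that saves a model with the safe_serialization flag and then memory-maps tensors back with safe_open:

```python
# Example: save a model in the safetensors format, then read individual
# tensors back via memory mapping (model name and paths are illustrative).
from transformers import AutoModelForCausalLM
from safetensors import safe_open

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.save_pretrained("./gpt2-safetensors", safe_serialization=True)

# Zero-copy read: tensors are memory-mapped from disk rather than
# fully deserialized up front.
with safe_open("./gpt2-safetensors/model.safetensors",
               framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)  # materializes only this tensor
```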
Tensorizer by CoreWeave
Tensorizer is an open-source tool by CoreWeave designed to optimize the process of loading large machine learning models. Instead of loading entire models into RAM before moving them to the GPU, Tensorizer streams model data one tensor at a time from sources like HTTP or S3. This approach reduces memory usage and speeds up the loading process. The tool packages all model weights and metadata into a single file, enabling efficient access and on-demand loading.
Tensorizer’s threaded design allows multiple Python readers to handle different tensors independently, providing parallelism during model loading. Each tensor is assigned to a specific thread, though each thread can manage multiple tensors and can fetch them in arbitrary order as needed. Additionally, Tensorizer supports security features like encryption-at-rest and cryptographic verification, ensuring that models are safely handled whether they’re deployed locally or in the cloud.
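For a sense of the workflow, here is a minimal sketch of serializing a model into Tensorizer’s single-file format and streaming it back into a freshly constructed module. The model and file path are only examples, options such as encryption are omitted, and the exact API may differ between releases.

```python
# Sketch of Tensorizer usage (model and paths are placeholders;
# encryption and streaming from S3/HTTP are omitted for brevity).
from transformers import AutoConfig, AutoModelForCausalLM
from tensorizer import TensorSerializer, TensorDeserializer

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Serialize all weights and metadata into a single .tensors file.
serializer = TensorSerializer("gpt2.tensors")
serializer.write_module(model)
serializer.close()

# Later: build the module skeleton without reading a checkpoint,
# then stream the serialized tensors straight onto the GPU.
config = AutoConfig.from_pretrained("gpt2")
empty_model = AutoModelForCausalLM.from_config(config)
deserializer = TensorDeserializer("gpt2.tensors", device="cuda")
deserializer.load_into_module(empty_model)
deserializer.close()
```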
Run:ai Model Streamer
Run:ai Model Streamer is a Python SDK with a high-performance C++ backend, designed to speed up the loading of machine learning models onto GPUs from various storage solutions like local disk, network file systems, or cloud-based object stores (e.g., S3). It achieves this by using multiple threads to concurrently read model tensors from storage into CPU memory, while simultaneously transferring them to GPU memory. This parallel process optimizes the use of both the CPU and GPU subsystems, enabling real-time loading without bottlenecks.
The tool efficiently balances the workload, allowing tensors of different sizes to be read concurrently, maximizing storage read bandwidth. It also supports the safetensors format, which avoids conversion overhead, ensuring fast and direct model loading. Its easy-to-use Python API integrates seamlessly with inference engines such as vLLM, while leveraging the speed of its C++ layer for optimal performance.
What sets Run:ai Model Streamer apart is its concurrency model, allowing multiple threads to read from the same tensor in parallel. It also streams tensors from CPU to GPU while still reading from storage, enhancing efficiency. The tool is highly customizable, letting users control concurrency levels, memory usage, and data chunk sizes to suit different hardware setups, making it a versatile solution for accelerating model loading in resource-constrained environments. Further details on setup and usage can be found in the documentation. For performance benchmarks, check out our whitepaper.
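To give a feel for the Python API, here is a short sketch based on the Model Streamer documentation; the file path is a placeholder, and the exact interface may vary between releases.

```python
# Sketch of streaming a safetensors file with Run:ai Model Streamer
# (the path is a placeholder; see the documentation for the full API).
from runai_model_streamer import SafetensorsStreamer

file_path = "/models/llama-3-8b/model.safetensors"

with SafetensorsStreamer() as streamer:
    # The C++ backend reads tensors from storage concurrently in the
    # background while this loop consumes them.
    streamer.stream_file(file_path)
    for name, tensor in streamer.get_tensors():
        tensor = tensor.to("cuda", non_blocking=True)  # copy to GPU as it arrives
```

In recent vLLM versions, the streamer can also be selected through the runai_streamer load format (for example, vllm serve <model> --load-format runai_streamer).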
Anyscale Model Loader
Anyscale Model Loader, although not open source, provides additional improvements in loading models. It offers direct loading from S3 to the GPU. The model loader builds the GPU tensor and copies data buffers directly into the GPU, thus improving the loading process by:
- Loading from S3 to GPU without loading the entire tensor data to the CPU.
- Streaming tensor data in chunks from S3 to GPU, maintaining low CPU memory usage proportional to the chunk size and number of loading threads.
Note: The Anyscale solution is available to users of Anyscale Endpoints.
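Since the loader itself is closed source, the following is only a hypothetical sketch of the chunked object-store-to-GPU pattern described above, not Anyscale’s actual implementation; the bucket, key, chunk size, and helper function are all placeholders.

```python
# Hypothetical illustration of chunked S3-to-GPU streaming (NOT Anyscale's
# implementation): each ranged GET holds only one chunk in CPU memory,
# which is copied into a preallocated GPU buffer before the next chunk
# is fetched. Assumes the object stores raw little-endian tensor bytes
# matching the given shape and dtype.
import boto3
import torch

CHUNK_BYTES = 64 * 1024 * 1024  # 64 MiB per ranged GET (arbitrary choice)

def stream_tensor_from_s3(bucket: str, key: str, shape,
                          dtype=torch.float16,
                          device: str = "cuda") -> torch.Tensor:
    s3 = boto3.client("s3")
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

    # Allocate the destination tensor on the GPU up front and expose it
    # as a flat byte buffer.
    gpu_tensor = torch.empty(shape, dtype=dtype, device=device)
    gpu_bytes = gpu_tensor.view(torch.uint8).reshape(-1)

    offset = 0
    while offset < size:
        end = min(offset + CHUNK_BYTES, size) - 1
        # Ranged GET: only this chunk lives in CPU memory at a time.
        chunk = s3.get_object(Bucket=bucket, Key=key,
                              Range=f"bytes={offset}-{end}")["Body"].read()
        cpu_chunk = torch.frombuffer(bytearray(chunk), dtype=torch.uint8)
        gpu_bytes[offset:offset + len(chunk)].copy_(cpu_chunk)
        offset += len(chunk)
    return gpu_tensor
```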
Comparison
Selecting a model loading tool for efficient inference depends on your specific requirements for speed, resource management, and security. Each tool we discussed has its own strengths, so let’s distill some insights to help you pick the right one for your environment without reiterating the technical details.
Key Considerations for Choosing a Model Loader
- Performance Requirements and Scalability
If you’re working in a dynamic, auto-scaling environment, you’ll want a tool that efficiently manages concurrent reads and minimizes loading time. Run:ai Model Streamer, with its concurrent, multi-threaded approach, excels at streaming models directly to GPU memory while reading tensors in parallel from various storage options. It’s particularly strong in scenarios where you need real-time loading across large or varied storage solutions. If immediate scaling and fast loading are critical, especially with object storage options like S3, Run:ai’s approach may be the best fit.
- Open Source Flexibility vs. Ecosystem Integration
If flexibility and open-source adaptability are important to your setup, Tensorizer and Run:ai Model Streamer both offer transparency and customization options, making them ideal for users who prefer control over every aspect of their deployment. Meanwhile, Anyscale Model Loader integrates smoothly within the Anyscale ecosystem, which can be beneficial if you’re already committed to that stack, though it may lack the customization open-source options provide.
- Format and Security Preferences
Most tools support safetensors, which has become the standard for secure, efficient model handling. Run:ai Model Streamer’s native support for safetensors makes it convenient for users who prioritize security and simplicity without additional format conversion. Tensorizer, on the other hand, requires users to convert model weights into its own format, but it goes further with optional encryption-at-rest and cryptographic verification, which can be advantageous if compliance or enhanced security measures are a focus.
- Storage Considerations
The storage type in your environment will significantly impact your choice. If your models are stored on slower or remote storage, the efficiency of tools like Run:ai Model Streamer, with its highly optimized C++ backend for reading directly from various storage types, can mitigate potential delays. All loaders will benefit from faster storage, so consider your current setup and any planned upgrades when choosing between them.
Bottom Line
Each tool brings unique strengths. Balancing these factors with your specific performance, security, and resource requirements will guide you to the right solution for your inference needs. For performance benchmark comparisons of the HF Safetensors loader, Run:ai Model Streamer, and Tensorizer, check out our detailed performance insights here.