What Was NVIDIA Apex—Now Torch.amp?
Originally developed by NVIDIA, Apex was an open source PyTorch extension aimed at simplifying mixed-precision and distributed training to improve computational efficiency and reduce memory usage. Apex offered optimized tools and utilities for these tasks, drawing on NVIDIA's deep learning expertise.
The core mixed-precision functionality of Apex was later integrated into PyTorch itself as the torch.cuda.amp module (since generalized as torch.amp), commonly referred to as Torch.amp. Because Torch.amp ships with PyTorch, it remains compatible across PyTorch releases and lets developers perform mixed-precision training without installing a separate extension.
This is part of a series of articles about AI open source projects
What Is Mixed-Precision Training?
Mixed-precision training involves the use of both 16-bit and 32-bit floating-point data types during model training to reduce memory consumption and speed up computation. Traditionally, deep learning models were trained using 32-bit floating-point (FP32) arithmetic to maintain the necessary precision for complex calculations. However, such precision is not always required throughout the entire training process.
By strategically employing 16-bit floating-point (FP16) arithmetic where the reduced precision does not significantly impact model accuracy, mixed-precision training achieves faster computation and lower memory usage. This approach allows for larger models or higher batch sizes under the same hardware constraints, enabling more efficient and scalable training processes.
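To make the memory argument concrete, the short sketch below (not from the original article; the tensor shape is illustrative) compares the storage footprint of the same tensor in FP32 and FP16:
import torch

# A hypothetical activation tensor: batch of 64, 1024 features
fp32_tensor = torch.randn(64, 1024, dtype=torch.float32)
fp16_tensor = fp32_tensor.to(torch.float16)

# element_size() returns bytes per element: 4 for FP32, 2 for FP16
fp32_bytes = fp32_tensor.element_size() * fp32_tensor.nelement()
fp16_bytes = fp16_tensor.element_size() * fp16_tensor.nelement()

print(f"FP32: {fp32_bytes} bytes, FP16: {fp16_bytes} bytes")  # FP16 uses half the memory
Halving the memory used by activations and gradients is what makes room for larger models or bigger batch sizes on the same GPU.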
How Does Mixed-Precision Training Work in PyTorch?
Mixed-precision training in PyTorch leverages the torch.cuda.amp module to dynamically adjust the precision during model training. This module provides a context manager and a set of utility functions to automatically manage the precision of tensors and operations. By doing so, it ensures that certain computations are performed in FP16 for efficiency, while others are kept in FP32 to preserve accuracy.
Mixed-precision training is enabled in PyTorch by the Autocast and GradScaler functionalities. Autocast automatically casts tensor operations to the appropriate precision, and GradScaler adjusts the gradient scaling to manage the reduced precision's impact on gradient underflow and overflow. Together, they streamline the implementation of mixed-precision training, making it a practical option for optimizing deep learning models.
Understanding Torch.amp Key Components with Code Examples
The code examples below are adapted from the Torch.amp documentation.
Autocasting
In Torch.amp, autocast can automatically cast selected operations down to FP16 to save memory and accelerate training.
Torch.amp provides context managers and decorators that allow regions of your script to run in mixed precision. In these regions, operations run in a dtype automatically chosen by autocast to improve performance while maintaining accuracy.
Autocast should only be used for the forward pass of a neural network, including the loss computation. PyTorch does not recommend wrapping the backward pass in autocast; instead, when the forward pass runs under autocast, PyTorch automatically uses the same dtypes for the corresponding backward operations.
Here is a code example illustrating the use of autocast on CUDA devices:
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

for input, target in data:
    optimizer.zero_grad()

    # Enables autocasting for the forward pass (model + loss)
    with torch.autocast(device_type="cuda"):
        output = model(input)
        loss = loss_fn(output, target)

    # Exits the context manager before the backward pass
    loss.backward()
    optimizer.step()
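As noted above, autocast can also be applied as a decorator rather than a context manager. Here is a minimal sketch of that usage; the module definition itself is illustrative and not from the article:
import torch
from torch import nn

class AutocastModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1024, 10)

    # The decorated forward method runs in mixed precision on CUDA devices
    @torch.autocast(device_type="cuda")
    def forward(self, input):
        return self.linear(input)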
Gradient Scaling
16-bit precision may not be enough for some computations. In particular, gradient values computed during backpropagation are often very small, and representing them in FP16 can cause numerical underflow: the values flush to zero. This can stall or destabilize the training of a neural network.
The Torch.amp GradScaler class addresses this issue. It multiplies the loss by a large scale factor before the backward pass, inflating the resulting gradients so they can be represented in FP16 without underflowing. Before the optimizer updates the parameters, GradScaler unscales the gradients so the update uses their true values, and it skips the update and reduces the scale factor if the scaled gradients overflow to infinity or NaN.
Here is a code example illustrating the use of GradScaler:
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)

        # Performs the backward pass on the scaled loss to create scaled gradients
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients;
        # if they do not contain infs or NaNs, optimizer.step() is called,
        # otherwise the step is skipped
        scaler.step(optimizer)

        # Updates the scale factor for the next iteration
        scaler.update()
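Putting the two components together, a typical mixed-precision training loop combines autocast for the forward pass with GradScaler for the backward pass. The sketch below assumes the same model, optimizer, loss_fn, and data as in the earlier examples:
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Forward pass and loss computation run under autocast
        with torch.autocast(device_type="cuda"):
            output = model(input)
            loss = loss_fn(output, target)

        # Backward pass and optimizer step use gradient scaling
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()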
Best Practices for Mixed-Precision Training with Torch.amp
Here are a few best practices that can help you make more effective use of Torch.amp:
- Be mindful of hardware capabilities: Ensure that your hardware supports mixed-precision training efficiently. NVIDIA GPUs with Tensor Cores, such as those from the Volta, Turing, and Ampere architectures, are designed to accelerate FP16 computations significantly.
- Start with the default settings: Begin your mixed-precision training using the default settings of autocast and GradScaler. These settings are optimized for a wide range of models and tasks, and they often provide a good balance between speedup and accuracy.
- Monitor model accuracy: Keep an eye on your model's accuracy when using mixed-precision training. Ensure that the use of FP16 does not significantly impact the overall accuracy of your model. Small adjustments to the training process or model architecture may be needed to maintain accuracy.
- Use autocast for the forward pass: Apply autocast primarily to the forward pass of your neural network, including loss computations. Autocast optimizes performance by automatically determining the appropriate precision (FP16 or FP32) for different operations.
- Avoid manual casts within autocast regions: When using autocast, avoid manually casting tensors to specific data types within the autocast block. Letting autocast manage these conversions ensures optimal performance and reduces the risk of errors.
- Use GradScaler for stable backpropagation: Employ GradScaler to handle gradient scaling. This is crucial for stabilizing the training process, as it adjusts the scale of loss values to prevent underflows and overflows during backpropagation in FP16.
- Fine-tune the scaling factor: Although the default scaling factor in GradScaler works well for most cases, you might need to fine-tune it for your specific model or data. Adjusting the scaling parameters can help avoid underflow or overflow issues with gradients (see the sketch after this list).
- Stay updated with PyTorch and CUDA: Regularly update your PyTorch and CUDA versions to benefit from the latest optimizations and bug fixes for mixed-precision training. Newer versions often include improvements that can enhance performance and stability.
- Test with different batch sizes: Experiment with different batch sizes when using mixed-precision training. Sometimes, increasing the batch size can be more effective due to the reduced memory footprint of FP16, leading to faster training without compromising model accuracy.
- Profile your training: Use profiling tools to identify bottlenecks in your training process. Understanding where most of the computation time is spent can guide you in optimizing model architecture or training procedure for better performance with mixed precision.
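For the scaling-factor tuning mentioned above, GradScaler exposes its scaling behavior through constructor arguments. The values below are PyTorch's documented defaults, shown only to illustrate which knobs can be adjusted:
from torch.cuda.amp import GradScaler

scaler = GradScaler(
    init_scale=2.**16,     # initial loss scale
    growth_factor=2.0,     # multiply the scale by this after enough finite-gradient steps
    backoff_factor=0.5,    # multiply the scale by this when infs/NaNs are found
    growth_interval=2000,  # consecutive finite-gradient steps before growing the scale
)
In most cases the defaults work well; if you see frequent skipped optimizer steps, lowering init_scale or growth_interval is a reasonable first adjustment.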
By adhering to these best practices, developers and researchers can maximize the benefits of mixed-precision training with Torch.amp, achieving faster training times and reduced memory usage without sacrificing the accuracy of their deep learning models.
Related content: Read our guide to Nvidia container toolkit
Optimizing Deep Learning Training with Run:ai
Run:ai automates resource management and orchestration for AI infrastructure, letting you train deep learning models faster and more efficiently. With Run:ai, you can automatically run as many compute intensive experiments as needed.
Here are some of the capabilities you gain when using Run:ai:
- Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
- No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
- A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.
Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:ai GPU virtualization platform.