When training large models or working with large datasets, a single node is often not enough and can make training painfully slow. This is where multi-node training comes to the rescue. There are several reasons why teams move from single-node to multi-node training, including:
- Faster Experimentation: Speeding Up Training
In research and development, time is of the essence. Teams often need to accelerate the training process to obtain experimental results quickly. Employing multi-node training techniques, such as data parallelism, helps distribute the workload and leverage the collective processing power of multiple nodes, leading to faster training times.
- Large Batch Sizes: Data Parallelism
When the batch size required by your model is too large to fit on a single machine, data parallelism becomes crucial. It involves duplicating the model across multiple GPUs, with each GPU processing a subset of the data simultaneously. This technique helps distribute the workload and accelerate the training process.
- Large Models: Model Parallelism
In scenarios where the model itself is too large to fit on a single machine's memory, model parallelism is utilized. This approach involves splitting the model across multiple GPUs, with each GPU responsible for computing a portion of the model's operations. By dividing the model's parameters and computations, model parallelism enables training on machines with limited memory capacity.
In this tutorial, we will focus on data parallelism using PyTorch. If you are unsure about which parallelism strategy fits your workload, please refer to the ‘Parallelism Strategies for Distributed Training’ blog post.
PyTorch
PyTorch is a popular open-source machine learning framework that provides a flexible and efficient platform for developing deep learning models. It offers extensive support for distributed training, allowing you to leverage multiple nodes and GPUs to accelerate the training process.
Data Parallelism in PyTorch
Data parallelism is a key feature in PyTorch that enables efficient distributed training. It involves duplicating the model across multiple GPUs, where each GPU processes a subset of the data simultaneously. After each backward pass, the gradients computed by the replicas are combined so that every copy of the model stays in sync, and training continues as normal. This approach is particularly useful when the batch size used by your model is too large to fit on a single machine, or when you aim to speed up the training process.
PyTorch provides two options for implementing data parallelism: DataParallel (DP) and DistributedDataParallel (DDP). While both options are available, the official PyTorch documentation recommends DDP for its superior performance. For a detailed overview of both options, refer to the PyTorch documentation.
Step-by-Step Guide: Running Distributed Training on Run:ai
Prerequisites
Before diving into the distributed training setup, ensure that you have the following prerequisites in place:
- Run:ai environment (v2.10 and later): Make sure you have access to a Run:ai environment with version 2.10 or a later release. Run:ai provides a comprehensive platform for managing and scaling deep learning workloads on Kubernetes.
- Two nodes with one GPU each: For this tutorial, we will use a setup consisting of two nodes, each equipped with one GPU. However, you can scale up by adding more nodes or GPUs to suit your specific requirements.
- Image Registry (e.g., Docker Hub Account): Prepare an image registry, such as a Docker Hub account, where you can store your custom Docker images for distributed training.
Modifying the Training Script for DDP Training
To prepare your code for DDP training, you need to make some modifications to the training script. In this tutorial, we will use a slightly adapted version of the example script provided in the official PyTorch documentation. You can find the modified training script, named distributed.py, along with all other files used in this guide, in this GitHub repository.
Here are some essential points to consider when modifying the training script for DDP:
- Set Up a Process Group: Register the current worker with the group of workers by calling torch.distributed.init_process_group().
- Choose a Communication Backend: Select the appropriate communication backend that handles the communication between workers. For GPU-accelerated jobs, it is recommended to use "nccl." For CPU jobs, "gloo" is the recommended backend. In this tutorial, we will use "nccl."
- Use DistributedSampler for Data Loading: Wrap your dataset with the DistributedSampler to ensure that each GPU receives a different portion of the dataset automatically. This sampler helps distribute the data across the nodes efficiently.
- Wrap Your Model with DDP: Wrap your model with DistributedDataParallel (DDP) to enable data parallelism. DDP synchronizes gradients across all model replicas; by default, synchronization spans the entire world, i.e. every participating process. A minimal sketch putting these points together follows below.
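To make these points concrete, here is a minimal sketch of a DDP training script. It is not the exact distributed.py from the repository; the model, dataset, and hyperparameters are placeholders chosen purely for illustration:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # 1. Set up the process group; torchrun injects RANK, WORLD_SIZE and LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 2. Wrap the dataset with DistributedSampler so each worker gets its own shard.
    dataset = TensorDataset(torch.randn(1024, 20), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # 3. Wrap the model with DDP so gradients are synchronized across replicas.
    model = nn.Linear(20, 1).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # reshuffle the shards differently each epoch
        for inputs, targets in loader:
            inputs, targets = inputs.to(local_rank), targets.to(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(inputs), targets)
            loss.backward()  # DDP averages the gradients across all workers here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```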
After making the required changes to the training script, we will create a bash script, launch.sh, which will launch the training job. Here is an overview of the launch.sh script:
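The exact script is available in the repository linked above; a minimal sketch of what it could look like is shown below. The NNODES, MASTER_ADDR, and MASTER_PORT environment variables are assumptions here, standing in for however the rendezvous endpoint is made available to each node in your setup:

```bash
#!/bin/bash
# launch.sh - starts torchrun on the current node.
# NNODES, MASTER_ADDR and MASTER_PORT are assumed to be provided as environment
# variables at runtime; adjust this to your environment.

torchrun \
  --nproc_per_node=1 \
  --nnodes="${NNODES:-2}" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT:-29500}" \
  distributed.py
```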
We will execute distributed.py using torchrun on every node, as explained in the PyTorch documentation. The script includes various system-related arguments passed to the torchrun command. Here is an overview of what each variable does:
- ‘nproc_per_node’: The number of workers on each node. In our case, this value is set to 1.
- ‘nnodes’: The total number of nodes participating in the distributed training.
- ‘rdzv_endpoint’: The IP address and port on which the C10d rendezvous backend should be instantiated and hosted. It is recommended to select a node with high bandwidth for optimal performance.
- ‘rdzv_backend’: The backend of the rendezvous (e.g. c10d).
Because the rendezvous endpoint is passed in at runtime rather than hard-coded into the image, you won't need to recreate your Docker image if the master node changes.
Side Note: Saving Checkpoints
During training, it is a best practice to save checkpoints frequently to mitigate the impact of network challenges. In this tutorial, we save checkpoints every 2 epochs. However, feel free to adjust this frequency based on your specific use case.
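As a sketch of what that could look like, here is a small helper you could call at the end of every epoch. The function name and checkpoint layout are illustrative, not the exact code from the repository:

```python
import torch
import torch.distributed as dist


def maybe_save_checkpoint(ddp_model, optimizer, epoch, path="checkpoint.pt", every=2):
    """Save a checkpoint every `every` epochs, from rank 0 only, so training
    can be resumed if a node or the network fails."""
    if epoch % every == 0 and dist.get_rank() == 0:
        torch.save(
            {
                "epoch": epoch,
                "model_state": ddp_model.module.state_dict(),  # unwrap the DDP wrapper
                "optimizer_state": optimizer.state_dict(),
            },
            path,
        )
    # Keep all workers in step so nobody reads a half-written checkpoint.
    dist.barrier()
```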
Creating the Docker Image
Now, let's create a Docker image to encapsulate our training environment. In the Dockerfile, we will pull the PyTorch base image, install PyTorch using pip, copy all the required files to the container's working directory, make the launch.sh script executable, and run it. Here is an overview of the Dockerfile:
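A sketch of such a Dockerfile is shown below. The base image tag and file layout are assumptions; see the repository for the exact version used in this guide:

```dockerfile
# Assumed base image; the repository may pin a specific PyTorch/CUDA tag.
FROM pytorch/pytorch:latest

WORKDIR /workspace

# Ensure PyTorch (and any other Python dependencies) are installed via pip.
RUN pip install --no-cache-dir torch

# Copy the training script and the launch script into the working directory.
COPY distributed.py launch.sh ./

# Make the launch script executable and use it to start the training job.
RUN chmod +x launch.sh
ENTRYPOINT ["./launch.sh"]
```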
Side Note: In future versions of Run:ai, the step of creating and building Docker images will be automated.
Building and Pushing the Docker Image
Next, log in to your Docker Hub account and build the Docker image using the following commands:
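The commands could look like the following; the image name distributed-training-pytorch is an assumption, so use whatever repository name you prefer:

```bash
docker login

# Build the image from the Dockerfile in the current directory.
docker build -t YOUR-USER-NAME/distributed-training-pytorch:latest .

# Push the image to your Docker Hub repository.
docker push YOUR-USER-NAME/distributed-training-pytorch:latest
```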
Make sure to replace YOUR-USER-NAME with your actual Docker Hub username. These commands will build the image and push it to your Docker Hub repository. You can find the image that we created for this guide here.
Launching the Distributed Training on Run:ai
To start the distributed training on Run:ai, ensure that you have the correct version of the CLI (v2.10 or later). To launch a distributed PyTorch training job, use the runai submit-pytorch or runai submit-dist pytorch command depending on your CLI version. Here is the command to submit our job:
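With a recent CLI, the submission could look like the sketch below. Flag names can differ slightly between CLI versions, so check runai submit-dist pytorch --help; the image name matches the one built above, and project_name stands in for your Run:ai project:

```bash
runai submit-dist pytorch \
  --name distributed-training-pytorch \
  --workers=1 \
  -g 1 \
  -i YOUR-USER-NAME/distributed-training-pytorch:latest \
  -p project_name
```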
In this example, we selected 1 worker and attached 1 GPU, as we have a setup with 2 nodes (1 worker and 1 master) for training. After submitting the job, you can monitor the logs of the worker nodes through the Run:ai user interface or using the CLI command ‘runai describe job distributed-training-pytorch -p project_name’.
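For reference, the CLI side of that monitoring could look like this; the runai logs form is an assumption based on the standard CLI rather than a command taken from this guide:

```bash
# Show the status of the distributed job and its pods.
runai describe job distributed-training-pytorch -p project_name

# Stream the logs of the job.
runai logs distributed-training-pytorch -p project_name
```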
Conclusion
In this tutorial, we explored the process of running multi-node distributed training on Kubernetes with Run:ai using PyTorch. By leveraging data parallelism and PyTorch's distributed training capabilities, you can handle large datasets faster and more efficiently. We covered the necessary modifications to the training script, creating the Docker image, and launching the distributed training job on Run:ai. With these steps, you can accelerate your training process and obtain experimental results faster.