How Can You Access GPUs in the Azure Cloud?
The Microsoft Azure cloud offers specialized virtual machines (VMs) equipped with graphics processing units (GPUs) for intensive graphics processing, parallel computation, and machine learning tasks. These VMs feature powerful NVIDIA and AMD GPUs, providing the computational power to accelerate a variety of workloads, from data analysis to complex simulations.
Azure’s GPU offering allows anyone, from individual developers to large organizations, to provision GPU resources on demand without an upfront hardware investment. Azure GPU-powered VMs support a range of frameworks and tools, making them versatile for various computing needs.
This is part of our series of articles about cloud deep learning.
Use Cases for Azure GPUs
Here are some common use cases for GPUs on Azure:
Training and Inference of AI Models
Azure GPUs provide the necessary computational power to process large datasets quickly. They reduce the time it takes to train complex models, making iterative development and testing more feasible.
For model inference, Azure GPUs offer low-latency responses crucial for real-time applications. Their ability to handle multiple requests simultaneously ensures AI-driven applications are responsive and reliable.
High-Performance Computing (HPC)
Azure GPUs deliver high computational speeds for HPC use cases, enabling researchers and engineers to solve complex problems faster. Examples include weather simulation, genomic analysis, or quantum mechanics calculations.
Azure's scalable architecture allows users to deploy HPC workloads on-demand, scaling to any number of compute VMs, while efficiently managing computing resources.
Graphics and Visualization
Graphics and visualization tasks, such as 3D rendering, virtual reality (VR), and visual effects (VFX) production, benefit significantly from Azure GPUs. These GPUs deliver the graphical processing power needed for high-quality, real-time rendering, enhancing creativity and productivity.
By utilizing Azure GPUs, professionals can access high-performance rendering capabilities without investing in expensive local hardware, enabling more flexible and scalable production workflows.
Real-Time Data Analysis
Azure GPUs are capable of handling real-time data analysis, enabling businesses to derive insights from large volumes of data quickly. High-speed processing allows for the immediate identification of trends, anomalies, and patterns, supporting informed decision-making.
Applications like financial modeling, risk analysis, and fraud detection leverage Azure GPUs for their ability to process complex computations and large datasets efficiently, leading to more accurate and timely results.
Related content: Read our guide to Azure deep learning
Azure GPU-Enabled VM Pricing
Azure's pricing for GPU-enabled virtual machines varies based on the specific VM series, GPU type, and configuration. Here are some key considerations for pricing:
- NCsv3-series: These VMs, equipped with NVIDIA Tesla V100 GPUs, are priced to support high-performance computing and machine learning workloads. Pricing typically starts around $3 per hour, but this can vary based on the region and specific VM configuration.
- NDv2-series: Designed for deep learning training and inference, these VMs use NVIDIA Tesla V100 GPUs. Prices start at approximately $6 per hour. The ND A100 v4-series, which uses NVIDIA A100 Tensor Core GPUs, is more expensive due to its enhanced capabilities, starting at about $11 per hour.
- NV-series: Ideal for visualization and graphics-intensive tasks, these VMs feature NVIDIA Tesla M60 GPUs. Pricing begins at around $1.5 per hour. The newer NVv3-series VMs offer improved performance at a similar price point.
- NVv4-series: Utilizing AMD Radeon Instinct MI25 GPUs, these VMs provide a cost-effective option for moderate graphics processing needs. Pricing starts at roughly $1 per hour.
Azure also offers reserved instance pricing and spot pricing for significant cost savings. Reserved instances require a commitment of one to three years, offering up to 72% savings compared to pay-as-you-go prices. Spot VMs can provide discounts of up to 90%, but they can be evicted at short notice, so they are best suited to interruptible workloads.
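To make the trade-off concrete, here is a quick back-of-the-envelope comparison in shell, using the approximate $3/hour NCsv3 rate mentioned above and the maximum advertised discounts. This is illustrative only; actual rates vary by region, configuration, and current spot prices.

```shell
HOURLY=3.00   # approximate NCsv3 pay-as-you-go rate, $/hour
HOURS=730     # hours in an average month

# awk handles the floating-point arithmetic; the discounts are the
# advertised maximums (72% reserved, 90% spot), not guaranteed rates.
SUMMARY=$(awk -v r="$HOURLY" -v h="$HOURS" 'BEGIN {
    payg = r * h
    printf "Pay-as-you-go:   $%.2f/month\n", payg
    printf "3-year reserved: $%.2f/month (up to 72%% off)\n", payg * 0.28
    printf "Spot:            $%.2f/month (up to 90%% off)\n", payg * 0.10
}')
echo "$SUMMARY"
```

At these assumed rates, pay-as-you-go comes to about $2,190 per month, a three-year reservation to about $613, and spot capacity to about $219.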
To get the exact pricing for specific configurations, use the Azure Pricing Calculator or refer to the pricing pages for Windows VMs and Linux VMs.
Best GPU-Optimized Azure VMs
Here are some examples of Azure virtual machines equipped with GPUs.
NCv3-series and NCasT4_v3-series
The NCv3-series and NCasT4_v3-series VMs are optimized for computational tasks like AI and deep learning. Equipped with NVIDIA Tesla V100 and T4 GPUs respectively, these VMs offer a balance of performance and cost, suitable for a variety of GPU-accelerated applications.
NC A100 v4-series
NC A100 v4-series VMs are designed for large-scale AI and machine learning workloads. Powered by NVIDIA A100 Tensor Core GPUs, these VMs provide significant computational power, accelerating model training and inference.
These VMs support GPU partitioning, allowing multiple users to share GPU resources, maximizing utilization and reducing costs for AI development projects.
ND A100 v4-series
The ND A100 v4-series VMs are tailored for demanding deep learning and HPC applications. With NVIDIA A100 Tensor Core GPUs, they offer exceptional computational power, necessary for training complex models and performing sophisticated simulations.
NDm A100 v4-series
Designed for the most demanding AI and HPC tasks, NDm A100 v4-series VMs feature NVIDIA A100 Tensor Core GPUs. They support massive workloads, offering scalable performance that caters to needs ranging from deep learning model training to complex scientific simulations.
These VMs provide high bandwidth and low-latency networking, essential for workloads that require extensive communication between nodes.
NGads V620-series
NGads V620-series VMs are specialized for graphics-intensive applications, including game streaming, professional visualization, and virtual workstations. Powered by AMD Radeon PRO V620 GPUs, they deliver high-performance graphics capabilities, ensuring smooth and detailed visualizations.
NV-series and NVv3-series
NV-series and NVv3-series VMs cater to applications requiring powerful graphics processing, such as video editing, design, and visualization. Equipped with NVIDIA Tesla M60 GPUs, they deliver robust performance for demanding graphics workloads.
These VMs offer a cost-effective solution for graphics-intensive applications, balancing performance and affordability.
NVv4-series
NVv4-series VMs focus on delivering efficient, scalable graphics performance for applications like remote visualization, CAD, and gaming. With AMD Radeon Instinct MI25 GPUs, they provide a flexible and cost-effective option for users needing moderate graphics processing capabilities.
Partitionable GPUs in NVv4-series VMs allow for more granular resource allocation, optimizing utilization for various use cases.
Tutorial: Getting Started with Azure GPUs
Creating a Microsoft Account
To begin, you need a Microsoft account. If you have received an invitation email from your Azure Administrator, follow the instructions provided in the email to join the Azure subscription.
Logging into the Azure Portal
- Go to portal.azure.com.
- Log in using your Microsoft account credentials.
- Once logged in, you will see the dashboard page.
- If you have multiple subscriptions, select the one corresponding to your institution by clicking on your account name in the top right corner. If this option is not available, contact your Azure Admin.
Creating a VM
- On the Azure portal dashboard, click on the + Create a resource button on the left sidebar.
- Select Ubuntu Server 16.04 LTS from the list of available VM images.
- Click Create to start the setup process.
Configuring the VM
- Fill in the details for your VM, such as the name and username.
- Change the storage type from SSD to HDD.
- Select the region you have been allotted.
- Click View all to see all VM options and select NV6 from the list. If NV6 does not appear, make sure you selected the correct region and storage type. NV6 is the GPU-enabled size this tutorial uses.
Finalizing the VM Setup
- Choose the appropriate VM size and click OK.
- Wait for the configuration to validate and click OK again.
- Once the VM deployment is complete, your VM should be running.
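As an alternative to the portal, the same deployment can be sketched with the Azure CLI. The snippet below only prints the commands, since running them requires an authenticated az session; the resource group, VM name, region, and username are placeholders to replace with your own.

```shell
RG=gpu-tutorial-rg    # placeholder resource group
LOCATION=eastus       # use the region you have been allotted
VM_SIZE=Standard_NV6  # the NV6 size selected in the portal steps above

# Build the commands as a string and print them, so this sketch can be
# reviewed before being run against a real subscription.
DEPLOY_CMDS="az group create --name $RG --location $LOCATION
az vm create --resource-group $RG --name gpu-vm --image Canonical:UbuntuServer:16.04-LTS:latest --size $VM_SIZE --admin-username azureuser --generate-ssh-keys"
echo "$DEPLOY_CMDS"
```

After an `az login`, these two commands create the resource group and deploy the Ubuntu VM with SSH keys generated for you.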
Using the VM
Finding Your VM
- Navigate to portal.azure.com.
- Click on All resources and select your VM from the list.
Connecting to Your VM
- Starting your VM may take a few minutes. Once it is running, click Connect and follow the instructions to establish an SSH connection.
Stopping Your VM
- When you are done working, ensure you stop your VM to avoid incurring charges.
- Make sure the VM status shows Stopped (deallocated), not just Stopped, which still incurs compute charges. If necessary, click Stop again in the portal.
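The billing distinction exists because Azure differentiates stopping a VM from deallocating it. From the CLI the two states correspond to different commands; the snippet below prints them rather than running them, and the resource names are placeholders.

```shell
RG=gpu-tutorial-rg   # placeholder resource group
VM=gpu-vm            # placeholder VM name

# 'stop' powers down the OS but keeps the hardware allocated, so
# compute charges continue; 'deallocate' releases the hardware and
# halts compute billing (attached disks are still billed).
STOP_CMD="az vm stop --resource-group $RG --name $VM"
DEALLOC_CMD="az vm deallocate --resource-group $RG --name $VM"
echo "Still billed:  $STOP_CMD"
echo "Billing stops: $DEALLOC_CMD"
```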
Installing CUDA and TensorFlow
Step 1: Running the Initial Setup Script
1. SSH into your VM.
2. Run the following commands:
git clone https://github.com/leestott/Azure-GPU-Setup.git
cd Azure-GPU-Setup
3. Verify the contents of the setup directory using ls -al.
4. Execute the first setup script:
./gpu-setup-part1.sh
5. This script will install necessary libraries, fetch and install NVIDIA drivers, and trigger a VM reboot. This step may take some time.
Step 2: Completing the Setup
1. SSH into the VM again after it restarts.
2. Navigate back to the Azure-GPU-Setup directory.
3. Run the second setup script:
./gpu-setup-part2.sh
4. This script will install the CUDA toolkit, CUDNN, and TensorFlow, and set the required environment variables. After the script completes, refresh your shell environment:
source ~/.bashrc
5. To verify that TensorFlow and the GPU are configured correctly, run the test script:
python gpu-test.py
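If the test script reports errors, two quick checks help isolate whether the driver or TensorFlow is at fault. The snippet below is defensive and only runs each check if the relevant tool is present; note that tf.config.list_physical_devices is the TensorFlow 2.x API, and older releases the setup script may install expose tf.test.is_gpu_available() instead.

```shell
# Check 1: the NVIDIA driver. If part 1 succeeded, nvidia-smi should
# list the VM's GPU (a Tesla M60 on an NV6).
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version --format=csv
else
    echo "nvidia-smi not found - driver install incomplete"
fi

# Check 2: TensorFlow's view of the GPU. The call below is the
# TensorFlow 2.x form; older releases use tf.test.is_gpu_available().
if python -c 'import tensorflow' >/dev/null 2>&1; then
    python -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
else
    echo "TensorFlow not importable - rerun gpu-setup-part2.sh"
fi
CHECKS_DONE=yes
```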
Following these steps will set up a GPU-enabled virtual machine on Azure, ready for deep learning and other computational tasks.
Best Practices for Azure GPU Optimization
Here are some best practices that can help you make effective use of Azure GPUs.
Choose an Azure GPU Series and Instance Size That Best Fits Your Computational Needs
Selecting the right Azure GPU series and instance size is crucial for optimizing performance and minimizing costs. Users should evaluate their computational needs, considering factors like processing power, memory requirements, and network bandwidth. The appropriate GPU instance ensures efficient resource utilization, delivering optimal performance for specific workloads.
Use Azure Managed Disks for High I/O Performance
Azure Managed Disks offer high I/O performance, crucial for GPU-intensive workloads that require rapid data access and storage. Leveraging these disks can improve the performance of applications, ensuring fast data retrieval and processing. Managed Disks also provide reliability and scalability, supporting the dynamic needs of GPU-accelerated applications.
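For example, a Premium SSD managed disk can be created and attached to an existing VM in a single CLI call. The snippet prints the command rather than running it; the resource names and the 512 GB size are placeholders to adapt to your workload's I/O profile.

```shell
RG=gpu-tutorial-rg   # placeholder resource group
VM=gpu-vm            # placeholder VM name

# Create and attach a new 512 GB Premium SSD in one step.
ATTACH_CMD="az vm disk attach --resource-group $RG --vm-name $VM --name training-data --new --size-gb 512 --sku Premium_LRS"
echo "$ATTACH_CMD"
```

Note that Premium SSDs require a premium-storage-capable VM size (indicated by an "s" in names like NC6s_v3); the NV6 size used in the tutorial above does not support them.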
Leverage Azure Batch for Large-Scale GPU Workloads
Azure Batch simplifies processing of large-scale GPU workloads, automating the allocation and management of computational resources. This service enables efficient execution of parallel tasks, reducing the time and effort needed to process large datasets or perform complex simulations.
Use Azure Proximity Placement Groups to Colocate Your GPU Instances
Azure Proximity Placement Groups (PPGs) ensure low-latency communication between GPU instances by colocating them in the same datacenter. This is particularly useful for applications requiring rapid interaction between nodes, such as HPC and multiplayer gaming. Colocation reduces network latency, enhancing performance and responsiveness of GPU-accelerated applications.
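In practice, you create the placement group once and then reference it when creating each VM. The snippet below prints the commands rather than running them; the resource names, region, and VM size are placeholders.

```shell
RG=hpc-rg         # placeholder resource group
LOCATION=eastus   # placeholder region

# Create the placement group once, then pass it to each VM so the
# instances land physically close together in the same datacenter.
PPG_CMDS="az ppg create --name gpu-ppg --resource-group $RG --location $LOCATION
az vm create --resource-group $RG --name gpu-node-1 --size Standard_NC6s_v3 --image Canonical:UbuntuServer:18.04-LTS:latest --ppg gpu-ppg --admin-username azureuser --generate-ssh-keys"
echo "$PPG_CMDS"
```

Repeating the second command for gpu-node-2, gpu-node-3, and so on keeps all nodes of the cluster within the same placement group.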
Use Azure Reservations and Spot VMs for Relevant GPU Workloads
Azure reservations and Spot VMs offer cost-saving opportunities for GPU workloads. Reservations provide discounted rates for committed usage, suitable for predictable workloads. Spot VMs allow users to bid for unused Azure capacity at discounts of up to 90%, but can be interrupted at short notice, making them useful for flexible workloads not requiring 24/7 availability.
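Requesting spot capacity is a matter of a few extra flags on az vm create. As above, the command is printed rather than executed, and the resource names and VM size are placeholders.

```shell
RG=batch-rg   # placeholder resource group

# --priority Spot requests spot capacity; --eviction-policy Deallocate
# keeps the disk so the VM can be restarted after eviction; a
# --max-price of -1 means "pay up to the pay-as-you-go rate".
SPOT_CMD="az vm create --resource-group $RG --name spot-gpu --size Standard_NC6s_v3 --image Canonical:UbuntuServer:18.04-LTS:latest --priority Spot --eviction-policy Deallocate --max-price -1 --admin-username azureuser --generate-ssh-keys"
echo "$SPOT_CMD"
```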
Automated Deep Learning GPU Management With Run:ai
Run:ai automates resource management and workload orchestration for machine learning infrastructure. With Run:ai, you can automatically run as many compute-intensive experiments as needed.
Here are some of the capabilities you gain when using Run:ai:
- Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
- No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
- A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.
Run:ai accelerates deep learning on GPU, helping data scientists optimize expensive compute resources and improve the quality of their models.
Learn more about the Run:ai GPU virtualization platform.