What Is a Graphics Processing Unit (GPU)?
A graphics processing unit (GPU) is a computer processor that performs rapid calculations to render images and graphics. GPUs use parallel processing to speed up their operations: they divide tasks into smaller parts and distribute them among the many processor cores (hundreds or even thousands) running within the same GPU.
GPUs were traditionally responsible for rendering 2D and 3D images, videos, and animations, but today extend to a wider range of uses, including deep learning and big data analytics.
Before the emergence of GPUs, central processing units (CPUs) performed the calculations necessary to render graphics. However, CPUs are inefficient at highly parallel workloads like these. GPUs offload graphics processing and massively parallel tasks from CPUs to provide better performance for specialized computing tasks.
What Is a Central Processing Unit (CPU)?
A CPU is a processor consisting of logic gates that handle the low-level instructions in a computer system. The CPU is considered the brain of a personal computer’s integrated circuitry. CPUs perform basic logic, arithmetic, and I/O operations, and allocate commands to other components and subsystems running in a computer.
Today, CPUs are typically multi-core, meaning that there are two or more processors in the integrated circuit.
The use of multiple cores in a single processor reduces power consumption, enhances performance, and enables efficient parallel processing of multiple tasks.
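To make this concrete, here is a minimal C++ sketch (the thread count and workload are arbitrary choices, not taken from any specific system) that splits a summation across several CPU threads, which the operating system can schedule onto different cores:

```cpp
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const int numThreads = 4;        // arbitrary; real code would query the hardware
    const int n = 1000000;
    std::vector<long long> data(n, 1), partial(numThreads, 0);

    // Each thread sums its own slice of the data, running in parallel on separate cores.
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            int begin = t * (n / numThreads);
            int end = (t == numThreads - 1) ? n : begin + n / numThreads;
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
        });
    }
    for (auto &w : workers) w.join();  // wait for all threads to finish

    long long total = std::accumulate(partial.begin(), partial.end(), 0LL);
    printf("sum = %lld\n", total);     // expect 1000000
    return 0;
}
```

Each thread works on an independent slice of the data, so there is no contention; this is the kind of coarse-grained parallelism that multi-core CPUs handle well.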
This is part of our series of articles about multi-GPU.
How Does a CPU Work?
A CPU works by executing a sequence of stored instructions called a program. The process involves several stages:
- Fetching: The CPU retrieves an instruction from the memory address stored in the program counter (PC). This instruction is loaded into the instruction register (IR).
- Decoding: The control unit (CU) interprets the instruction in the IR. The instruction is broken down into its opcode (operation code) and operand (the data to be operated on).
- Executing: The decoded instruction is sent to the appropriate part of the CPU, such as the arithmetic logic unit (ALU) for calculations or the control unit for directing data flow. The ALU performs the necessary computation or logical operation.
- Writing back: The results of the execution phase are written back to the CPU registers or memory. This might involve updating the PC, writing to an accumulator, or storing the result in RAM.
The CPU repeats this cycle, processing instructions sequentially and handling interrupts and multitasking through scheduling algorithms and control signals.
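To illustrate the cycle, here is a deliberately simplified C++ sketch of a fetch-decode-execute loop for an imaginary accumulator machine; the opcodes, program, and memory layout are invented for illustration and do not correspond to any real instruction set:

```cpp
#include <cstdint>
#include <cstdio>

// Toy machine: each instruction is an opcode plus a memory address operand.
enum Opcode { LOAD = 0, ADD = 1, STORE = 2, HALT = 3 };
struct Instruction { Opcode op; uint8_t operand; };

int main() {
    // Hypothetical program: load mem[0], add mem[1], store the sum into mem[2].
    Instruction program[] = {{LOAD, 0}, {ADD, 1}, {STORE, 2}, {HALT, 0}};
    int memory[3] = {40, 2, 0};

    int pc = 0;   // program counter: address of the next instruction
    int acc = 0;  // accumulator: holds intermediate results

    while (true) {
        Instruction ir = program[pc++];  // fetch into the "instruction register" and advance the PC
        switch (ir.op) {                 // decode the opcode, then execute
            case LOAD:  acc = memory[ir.operand]; break;   // execute and write back to the accumulator
            case ADD:   acc += memory[ir.operand]; break;
            case STORE: memory[ir.operand] = acc; break;   // write back to memory
            case HALT:  printf("mem[2] = %d\n", memory[2]); return 0;  // expect 42
        }
    }
}
```

A real CPU pipelines these stages and adds caches, branch prediction, and interrupt handling, but the basic loop is the same.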
How Does a GPU Work?
GPUs operate on a similar fetch-decode-execute cycle but are designed to handle thousands of threads simultaneously, making them suitable for parallel processing tasks:
- Fetching: The GPU fetches instructions from its own memory, often from a dedicated high-speed video memory (VRAM).
- Decoding: Instructions are decoded by the control units within the streaming multiprocessors (SMs). Each SM can handle multiple instructions simultaneously, assigning them to available cores.
- Executing: The decoded instructions are executed by numerous small cores in the SM. These cores handle simple operations but in massive parallel quantities, suitable for tasks like matrix multiplications and vector operations common in graphics rendering and machine learning.
- Writing back: The results are written back to the GPU memory or sent to the display output.
GPUs are optimized for high arithmetic intensity and throughput, making them suitable for tasks that can be broken down into parallel processes.
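As a sketch of what this parallelism looks like in code, the following CUDA program adds two vectors; every one of the roughly one million elements is handled by its own GPU thread executing the same instruction stream. The sizes and launch configuration are illustrative choices, not requirements of any particular GPU:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes a single element of c = a + b.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;              // about one million elements
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);       // unified memory, accessible from CPU and GPU
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;          // threads are grouped into blocks...
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ...and blocks into a grid
    vectorAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();            // wait for the GPU to finish

    printf("c[0] = %f\n", c[0]);        // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The CPU (host) code sets up the data and launches the kernel, while the GPU (device) executes thousands of threads concurrently, which is exactly the division of labor described above.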
Similarities Between GPUs and CPUs
While GPUs and CPUs have different primary functions, they share some fundamental architectural components that enable their processing capabilities.
Core
Both CPUs and GPUs have multiple cores that execute instructions. CPU cores are designed for complex, single-threaded tasks, while GPU cores handle many simpler, parallel tasks. This difference reflects their use cases: CPUs are suited to diverse computing tasks, whereas GPUs are optimized for parallelizable workloads.
Memory
Both CPUs and GPUs use a hierarchical memory structure. CPUs have a small, fast cache memory close to the cores and larger, slower RAM farther away. GPUs use on-chip memory like registers and cache for quick access and global memory for larger data storage.
Control unit
The control unit (CU) in both CPUs and GPUs fetches and decodes instructions, coordinating the flow of data and operations. In CPUs, the CU manages complex tasks with high precision and low latency. In GPUs, the CU handles many simple tasks simultaneously, focusing on maximizing throughput and parallel efficiency.
CPU vs GPU: Architectural Differences
The following describes the components of a CPU and GPU, respectively.
CPU Architecture
CPUs can process data quickly in sequence, thanks to their multiple heavyweight cores and high clock speeds. They are suited to running diverse tasks and can switch between different tasks with minimal latency. The speed of CPUs can create the impression of parallelism, but each core can still only execute one task at a time.
The following conceptual diagram outlines the basic components of a CPU.
While each CPU design may differ, the main components of a CPU are:
- Control Unit (CU)—the control unit retrieves and decodes instructions and coordinates their execution. It also sends control signals to manage hardware and directs data around the processor system.
- Clock—this helps coordinate the components of the computer. The clock produces regular electrical pulses to synchronize the components. The pulse frequency, measured in hertz (Hz), is called the clock speed. Higher frequencies allow the processing of more instructions within a given timeframe.
- Arithmetic Logic Unit (ALU)—this unit handles calculations. It performs the arithmetic operations and logical comparisons that instructions require.
- Registers—a register holds a small amount of high-speed memory within the processor. Registers store data required for processing, including the instruction to be decoded, the address of the next instruction, and the results of computations. Different CPUs contain different numbers and types of registers. However, most processors are likely to include a program counter, a memory address register (MAR), a memory data register (MDR), a current instruction register (CIR), and an accumulator (ACC).
- Caches—a cache is a small, built-in RAM in the processor that can store data and reusable instructions temporarily. Caches enable high-speed processing because the data is more readily available to the processor than the external RAM.
- Buses—these are fast internal connections that send data and signals from the processor to other components. There are three types of buses: an address bus, which carries memory addresses to components such as input/output devices and primary memory; a data bus, which carries the actual data; and a control bus, which carries control signals and clock pulses.
GPU Architecture
At a high level, GPU architecture focuses on keeping as many cores as possible busy with work, and less on fast, low-latency access to a cache, which is a priority in CPU design. Below is a diagram showing a typical NVIDIA GPU architecture. We’ll discuss it as a common example of modern GPU architecture.
Source: ResearchGate
An NVIDIA GPU has four primary components:
- Processor Clusters (PC) - the GPU consists of several clusters of Streaming Multiprocessors.
- Streaming Multiprocessors (SM) - each SM has multiple processor cores and a Level 1 (L1) cache that allows it to distribute instructions to its cores.
- L2 cache - this is a shared cache that connects the SMs. Each SM uses the L2 cache to retrieve data from global memory.
- DRAM - this is the GPU’s global memory, typically based on a technology like GDDR5 or GDDR6. It holds the data and instructions that need to be processed by the SMs.
Memory latency is not a critical consideration in this type of GPU design. The main concern is making sure the GPU has enough computations to keep all the cores busy.
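If you want to see these components on your own hardware, the CUDA runtime can report them. The following sketch queries the SM count, L2 cache size, and global memory size for each installed GPU using the standard cudaDeviceProp structure:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);

        printf("Device %d: %s\n", d, prop.name);
        printf("  Streaming Multiprocessors : %d\n", prop.multiProcessorCount);
        printf("  L2 cache size             : %d bytes\n", prop.l2CacheSize);
        printf("  Global memory (DRAM)      : %zu bytes\n", prop.totalGlobalMem);
        printf("  Shared memory per block   : %zu bytes\n", prop.sharedMemPerBlock);
        printf("  Registers per block       : %d\n", prop.regsPerBlock);
    }
    return 0;
}
```

The exact numbers vary widely between GPU models, but the structure (many SMs sharing an L2 cache in front of a large DRAM) is common to all of them.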
CUDA Architecture
NVIDIA developed the widely used Compute Unified Device Architecture (CUDA). CUDA provides an API that lets developers leverage GPU resources without requiring specialized knowledge about GPU hardware.
An NVIDIA GPU allocates memory according to a specific CUDA hierarchy. CUDA developers can optimize memory use directly via the API. Here are the main GPU components that map to CUDA constructs, which developers can control directly from their programs:
- Registers - memory allocated to individual CUDA threads, which run on the CUDA cores. Data in registers can be processed faster than data at any other level of the architecture; every level below it is progressively slower to access.
- Read-only memory - on-chip memory available to SMs.
- L1 Cache - on-chip memory shared between the cores of an SM and used within CUDA at the scope of a “CUDA block” (as shared memory). Because it sits on the chip, close to the cores, it enables fast data transfer.
- L2 Cache - memory shared by all CUDA blocks across multiple SMs. It caches accesses to both global and local memory.
- Global memory - enables access to the device’s DRAM. This is the slowest element to access for a CUDA program.
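A short kernel sketch (with an assumed block size) shows how these levels appear in CUDA code: ordinary local variables are held in registers, __shared__ arrays live in on-chip memory scoped to a CUDA block, and pointers passed in from the host refer to global memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size for this sketch

// Each block stages a tile of the input in shared memory, then each thread
// doubles its element using a register-resident temporary.
__global__ void scaleByTwo(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK_SIZE];      // on-chip memory, shared by the CUDA block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];   // global memory -> shared memory
    __syncthreads();                        // wait until the whole tile is loaded

    if (i < n) {
        float value = tile[threadIdx.x];    // register-resident temporary
        out[i] = 2.0f * value;              // write the result back to global memory
    }
}

int main() {
    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    scaleByTwo<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[3] = %f\n", out[3]);        // expect 6.0

    cudaFree(in); cudaFree(out);
    return 0;
}
```

Staging data in shared memory is unnecessary for an operation this simple, but the pattern shows where each level of the hierarchy is touched.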
CPU vs. GPU: Pros and Cons
CPU Advantages and Limitations
CPUs have several distinct advantages for modern computing tasks:
- Flexibility—a CPU is a general-purpose processor that can handle many tasks, and multitask between multiple activities.
- Faster in many contexts—CPUs are faster than GPUs when handling operations like data processing in RAM, I/O operations, and operating system administration.
- Precision—CPUs provide fast, full double-precision arithmetic, whereas many GPUs (particularly consumer models) offer reduced double-precision throughput, which is important for some numerical use cases.
- Cache memory—CPUs have a large local cache memory, which lets them handle long sequences of instructions efficiently.
- Hardware compatibility—CPUs are compatible with all types of motherboards and system designs, whereas GPUs require specialized hardware support.
CPUs have the following disadvantages compared to GPUs:
- Parallel processing—CPUs are less adept at tasks that require millions of identical operations because of their limited parallelism.
- Slower development—CPUs are a very mature technology that is already reaching the limits of its development, while GPUs have much more potential to improve.
- Compatibility—there are several CPU instruction set architectures, such as x86 and ARM, and software built for one may not run on the others.
GPU Advantages and Limitations
The unique advantages of GPUs include:
- High data throughput—a GPU can perform the same operation on many data points in parallel, so that it can process large data volumes at speeds unmatched by CPUs.
- Massive parallelism—a GPU has hundreds or thousands of cores, allowing it to perform massively parallel calculations, such as matrix multiplications.
- Suitable for specialized use cases—GPUs can provide massive acceleration for specialized tasks like deep learning, big data analytics, genomic sequencing, and more.
Disadvantages of GPUs compared to CPUs include:
- Multitasking—GPUs can perform one task at massive scale, but are poorly suited to general purpose computing tasks.
- Cost—Individual GPUs are currently much more expensive than CPUs. Specialized large-scale GPU systems can reach costs of hundreds of thousands of dollars.
- Difficulty handling complexity—a GPU can struggle with processing tasks that are not well structured. They cannot efficiently process branching logic, sequential operations, or other complex programming patterns.
CPU vs. GPU for Machine and Deep Learning
CPUs and GPUs offer distinct advantages for artificial intelligence (AI) projects and are more suited to specific use cases.
Use Cases for CPUs
The CPU is the central coordinator of a computer system, scheduling work across its cores and managing other system components. CPUs can perform complex mathematical calculations quickly, as long as they process one problem at a time. CPU performance begins to degrade when many tasks must be performed simultaneously.
CPUs have narrow, specialized use cases for AI workloads. CPUs may be well-suited to process algorithm-intensive tasks that don’t support parallel processing. Examples include:
- Real-time inference and machine learning (ML) algorithms that don’t parallelize easily
- Recurrent neural networks relying on sequential data
- Inference and training for recommender systems with high memory requirements for embedding layers
- Models using large-scale data samples, such as 3D data for inference and training
CPUs are useful for tasks that require sequential algorithms or complex statistical computations. Such tasks are not common in modern AI applications, given that most companies choose the efficiency and speed of GPUs over the specialization of CPUs. Still, some data scientists prefer to develop AI algorithms that rely on serial processing or symbolic logic rather than statistical computation.
Use Cases for GPUs
GPUs are best suited for parallel processing and are the preferred option for training AI models in most cases. AI training typically involves processing mostly identical, simultaneous operations on multiple data samples. Training data sets continue to grow larger, requiring increasingly massive parallelism to enable the efficient performance of tasks.
Enterprises generally prefer GPUs because most AI applications require parallel processing of multiple calculations. Examples include:
- Neural networks
- Accelerated AI and deep learning operations with massive parallel inputs of data
- Traditional AI inference and training algorithms
GPUs provide the raw computational power required for processing largely identical or unstructured data. Over the last 30 years, GPUs have evolved and moved from personal computers to workstations, servers, and data centers. GPUs will likely continue to dominate applications running in data centers or the cloud.
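As an illustration (not a production implementation) of why these workloads map so well to GPUs, the following naive CUDA kernel computes a matrix multiplication, the core operation inside most neural-network layers; every element of the output matrix is produced by an independent thread. The matrix size and launch configuration are arbitrary choices for the sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Naive matrix multiply: C = A * B, where all matrices are N x N.
// Each thread computes one element of C, independently of all the others.
__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

int main() {
    const int N = 512;                   // arbitrary size for the sketch
    size_t bytes = N * N * sizeof(float);
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(16, 16);                  // 256 threads per block
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matmul<<<grid, block>>>(A, B, C, N);
    cudaDeviceSynchronize();

    printf("C[0] = %f\n", C[0]);         // expect 2 * N = 1024
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Production deep learning frameworks use highly tuned libraries rather than naive kernels like this one, but the structure of the work (the same arithmetic repeated across huge arrays) is the same.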
Related content: Read our guide to deep learning GPU
Using CPUs and GPUs Together for HPC
High Performance Computing (HPC) is a set of technologies that enable large-scale, massively parallel computing. Traditionally, HPC systems were based on CPUs, but modern HPC systems increasingly make use of GPUs. It is common for HPC servers to combine multiple CPUs and GPUs in one system.
HPC systems that combine CPUs and GPUs use a specially-optimized PCIe bus. A design pattern called “dual root configuration” enables efficient memory access to a large number of processors. In a dual root server, there are two main processors with a separate memory zone for each processor. The PCIe bus is split between the two processors, and roughly half of the PCIe slots, which are commonly used for GPUs, are assigned to each processor.
There are three types of fast data links in this architecture:
- Inter-GPU connection—an NVLink connection enables fast communication between GPUs at data rates up to 300 GB/s. This allows programmers to work with multiple GPUs as if they were one large GPU.
- Inter-root connection—a fast link, such as Ultra Path Interconnect (UPI) in Intel systems, lets devices attached to one processor communicate with devices on the other half of the PCIe bus.
- Network connection—a fast network interface, typically using InfiniBand.
A dual-root PCIe design enables optimal use of CPU memory and GPU memory, supporting applications that require both massively parallel and sequential computing operations.
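To show how software sees multiple GPUs in such a server, the following sketch uses standard CUDA runtime calls (device numbering and buffer size are assumptions) to check whether two GPUs can access each other's memory directly, which is the mechanism that NVLink or PCIe peer-to-peer transfers rely on, and then copies a buffer between them:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 2) {
        printf("This sketch assumes at least two GPUs.\n");
        return 0;
    }

    // Check whether GPU 0 can directly access GPU 1's memory (via NVLink or PCIe P2P).
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    printf("GPU 0 -> GPU 1 peer access: %s\n", canAccess ? "yes" : "no");

    const size_t bytes = 1 << 20;  // 1 MB buffer, an arbitrary size for the sketch
    float *buf0, *buf1;

    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);  // let GPU 0 read/write GPU 1's memory

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Copy between the two GPUs; this uses the direct link when peer access is available.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```

Whether the transfer travels over NVLink, over a PCIe switch under one root, or across the inter-root link depends on where the two GPUs sit in the topology, which is exactly why the dual-root layout matters for performance.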
GPU Virtualization with Run:AI
Run:AI automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed.
Here are some of the capabilities you gain when using Run:AI:
- Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
- No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
- A higher level of control—Run:AI enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.
Run:AI accelerates deep learning on GPU by helping data scientists optimize expensive compute resources and improve the quality of their models.
Learn more about the Run:AI GPU virtualization platform.