SIMT vs SIMD: Second Edition
This is an update to my previous post on the similarities and differences between the SIMT and SIMD programming models, based on a conversation with an ex-colleague (and former NVIDIA SM architect) friend of mine.
At first glance, SIMT (Single Instruction, Multiple Threads) and SIMD (Single Instruction, Multiple Data) can look deceptively similar. Both approaches exploit parallelism by executing the same operation across multiple data elements. But the differences have profound consequences for how processors are built and how programmers achieve performance.
SIMD: Parallelism in the Instruction
SIMD encodes parallelism directly in the ISA. A vector instruction specifies an operation on multiple lanes of data at once. All lanes act in lockstep: one program counter, one instruction stream, one register file containing vectors of values.
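To make that concrete, here is a minimal host-side sketch using x86 AVX intrinsics as one example SIMD ISA (the function name `add_vectors_simd` is mine, and the sketch assumes the array length is a multiple of the 8-lane vector width):

```
#include <immintrin.h>

// One AVX instruction operates on 8 float lanes at once, in lockstep.
// Assumes n is a multiple of 8; the ragged "tail" case is discussed below.
void add_vectors_simd(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  // load 8 floats into one vector register
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_add_ps(va, vb);   // a single instruction, 8 lanes at once
        _mm256_storeu_ps(out + i, vc);       // store all 8 results
    }
}
```

The vector width (8) is baked into the instructions themselves, which is exactly the rigidity discussed next.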
Because SIMD builds less flexibility into the hardware (no per-thread scheduling, no divergence tracking, and often little or no predication), it’s area- and power-efficient. For a computation that needs a given amount of logic active for a given number of cycles, if the programmer (or compiler) can align and schedule the data correctly, SIMD does that work with fewer transistors switching. The efficiency advantage comes from the rigidity itself: the hardware carries no extra infrastructure for divergence or fine-grained scheduling.
If the same perfectly regular program were expressed in SIMT, it would run essentially the same way: no divergence, all threads executing in lockstep. But in that case, some of the silicon in the SIMT control path (warp schedulers, per-thread context, reconvergence logic) would sit idle, consuming area and leakage power without adding value.
However, workloads aren’t always perfectly regular. When branching, uneven loop trip counts, or “tail” elements crop up (e.g., when an array size isn’t divisible by the vector width, forcing a mostly empty SIMD vector to run), SIMT becomes more efficient. Instead of rerunning whole vector-width batches at minimal utilization, SIMT simply masks or predicates away the threads that have no work, avoiding wasted instructions and coarse-grained restarts.
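Here is a minimal CUDA sketch of the same vector add (the kernel name is mine): the tail falls out of a per-thread bounds check, and the hardware masks off the threads past the end of the array rather than requiring a repacked vector or a scalar cleanup loop.

```
// SIMT sketch: each thread handles one element; threads whose index falls
// past the end of the array are simply masked off by the bounds check.
__global__ void add_vectors_simt(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // predication covers any n, no special tail code
        out[i] = a[i] + b[i];
}
```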
In short:
SIMD is more efficient when data lines up exactly, because rigidity minimizes hardware overhead, especially for control infrastructure.
SIMT is more efficient when data or control flow is irregular, because its flexibility avoids wasted work and vector-packing inefficiencies.
SIMT: Parallelism in the Instruction Stream
SIMT takes a different approach. The ISA doesn’t define a vector width. Instead, parallelism emerges because many threads execute the same instruction sequence simultaneously.
Threads are organized into groups (warps or wavefronts) that share instruction decode and control, but each thread maintains its own program counter and registers. This allows the programming model to scale to thousands of threads, while the hardware overlaps their execution to hide latency. Because the thread count isn’t baked into the ISA, the thread grid can be sized to any vector length or irregular data structure; in fact, the number of threads to launch a program with can be decided at runtime.
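For example, a host-side launch sketch (reusing the hypothetical add_vectors_simt kernel from the earlier sketch): the grid geometry is derived from the problem size at launch time, so nothing about the parallel width is fixed in the compiled instruction stream.

```
// Forward declaration of the kernel sketched earlier.
__global__ void add_vectors_simt(const float* a, const float* b, float* out, int n);

void launch_add(const float* d_a, const float* d_b, float* d_out, int n) {
    int threads_per_block = 256;                                   // a tuning choice, not an ISA property
    int blocks = (n + threads_per_block - 1) / threads_per_block;  // ceiling division covers any n
    add_vectors_simt<<<blocks, threads_per_block>>>(d_a, d_b, d_out, n);
}
```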
Importantly, SIMT and SIMD aren’t mutually exclusive. Modern SIMT processors often use SIMD execution units within a SIMT framework. Warp-level operations, shuffle instructions, and cooperative primitives all map onto SIMD-like structures under the hood. The boundary is much blurrier than the simplified picture I painted in the previous post.
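A concrete example: a warp-wide sum built on __shfl_down_sync (a real CUDA primitive; the helper name is mine) is written per-thread, but it executes as lane-to-lane data movement across a SIMD-like datapath.

```
// Warp-level reduction: each step pulls a value from a lane `offset` away.
// After the loop, lane 0 holds the sum of all 32 lanes in the warp.
__device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}
```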
Control, Divergence, and Synchronization
A common misconception is that SIMT threads behave like fully independent CPUs. That’s not quite true.
In SIMD, lanes are not individually addressable: they always execute the same instruction.
In SIMT, each thread has its own context, but threads in a warp share control flow.
When threads diverge on a branch, the warp serializes execution: one path executes while threads on the other are disabled, then the alternate path runs until all reconverge. This allows flexibility but introduces inefficiency if divergence is high.
Critically, threads in the same warp cannot progress completely independently. If one thread skips an operation while others take it, the “skipping” thread sits idle until the rest of the warp catches up. SIMT tolerates divergence, but it does not eliminate its cost.
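A small sketch of what that means in practice (a hypothetical kernel that branches on the data): within a warp, the threads holding even values run one path while the others are masked off, then the roles flip, and only afterwards does the warp reconverge.

```
__global__ void divergent_kernel(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] % 2 == 0) {
        out[i] = in[i] * 2;   // threads holding odd values sit masked off here
    } else {
        out[i] = in[i] + 1;   // then the even-value threads wait for this path
    }
    // Whenever threads in a warp disagree on the branch, the warp issues both paths.
}
```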
Area and Power Tradeoffs
Whether SIMD or SIMT, the same number of execution units (gates) is needed for a given amount of computation. SIMT doesn’t “do more with less.”
The real tradeoff lies in control infrastructure:
SIMD: single register file, one decoder, fewer control paths. More compact and power-efficient.
SIMT: multiple thread contexts, larger register files, warp schedulers, divergence handling logic. More flexible, but higher area and power.
If SIMD can achieve high utilization, it will almost always be more efficient per unit area. SIMT’s strength lies in hiding memory latency and scaling across thousands of threads.
Why SIMT Feels Easier for Programmers
From the programmer’s perspective, SIMT is simpler:
You write scalar code for one thread, and the runtime scales it across thousands (see the sketch below).
No manual packing or alignment of data into vectors.
Irregular array sizes and workloads are handled more gracefully.
This is why SIMT underpins CUDA and other GPU programming models: it exposes massive parallelism in a way that matches how we already think about scalar threads, rather than forcing explicit vectorization.
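As a small illustration of that last point (function names are mine), here is the same saxpy written as a scalar CPU loop and as a CUDA kernel: the loop over elements disappears, and the loop body becomes the per-thread program.

```
// Scalar CPU version: an explicit loop over every element.
void saxpy_scalar(float a, const float* x, float* y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// SIMT version: the loop body becomes the per-thread program; the runtime
// supplies one index per thread and scales across however many are launched.
__global__ void saxpy_simt(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```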
Summary: Complementary Approaches
SIMD: Compact, power-efficient, best when workloads map cleanly to vectors.
SIMT: Flexible, scalable, latency-tolerant, best when thousands of threads can run with mostly similar control flow.
Modern processors blur the lines: SIMT architectures often execute SIMD instructions inside warps, and vector CPUs increasingly support thread-like abstractions.
The key is to match the model to the workload’s parallelism structure and performance bottlenecks. SIMD extracts maximum efficiency when data is regular; SIMT enables massive throughput by tolerating latency and irregularity at scale.