GPUs are throughput machines, built for simple, massively parallel workloads.

Compute

  • CUDA Core: operates on individual scalars
  • Tensor Core: operates on vectors and matrices; can be dense or sparse, depending on whether every element of the tensor is used
  • Special Function Unit (SFU): accelerates certain transcendental operations such as sin, cos, and log
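The dense/sparse distinction above maps to the 2:4 structured-sparsity pattern that sparse tensor cores accelerate: at most 2 nonzero values in every group of 4. A minimal sketch of pruning a weight row into that pattern (`prune_2_to_4` is a hypothetical helper for illustration, not a hardware or library API):

```python
def prune_2_to_4(row):
    """Keep the 2 largest-magnitude values in each group of 4; zero the rest.

    Mimics the 2:4 structured-sparsity pattern sparse tensor cores can
    exploit (illustrative sketch only).
    """
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        # indices of the two largest-magnitude entries in this group
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0 for j, v in enumerate(group))
    return out

print(prune_2_to_4([0.1, -0.9, 0.3, 0.05, 2.0, 0.0, -0.4, 0.2]))
# → [0, -0.9, 0.3, 0, 2.0, 0, -0.4, 0]
```

Because the pattern guarantees half the values are zero, the hardware can skip those multiplications and roughly double matrix-math throughput.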

Memory and Caches

  • VRAM: video RAM; the name comes from the GPU's original purpose as a graphics device
  • DRAM: dynamic RAM, general purpose; off-chip
  • SRAM: static RAM; faster and more expensive, and located on-chip
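Whether off-chip memory speed or raw compute limits a kernel can be estimated with the roofline model: compare the kernel's arithmetic intensity (FLOPs per byte of DRAM traffic) to the machine balance (peak FLOPs divided by peak bandwidth). A sketch with hypothetical peak numbers, not specs of any real GPU:

```python
def bound(flops, bytes_moved, peak_flops, peak_bw):
    """Classify a kernel as compute- or memory-bound via the roofline model.

    Memory-bound when arithmetic intensity (FLOPs per byte moved) is
    below the machine balance (peak FLOPs / peak bandwidth).
    """
    intensity = flops / bytes_moved
    balance = peak_flops / peak_bw
    return "memory-bound" if intensity < balance else "compute-bound"

# Hypothetical GPU: 100 TFLOP/s peak compute, 1 TB/s DRAM bandwidth
# -> machine balance = 100 FLOPs per byte.
# A matrix-vector product does ~2 FLOPs per weight byte read:
print(bound(flops=2e9, bytes_moved=1e9, peak_flops=100e12, peak_bw=1e12))
# → memory-bound
```

This is why caching hot data in on-chip SRAM matters: it raises the effective FLOPs per DRAM byte and moves a kernel toward the compute-bound regime.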

The VRAM on a GPU limits the size of the model you can run on it: it must hold the model weights plus the KV cache. During decoding, memory bandwidth, not compute, is typically the bottleneck.
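Both claims can be checked with back-of-envelope arithmetic: the weights plus KV cache must fit in VRAM, and since decoding one token streams every weight from VRAM once, bandwidth divided by weight bytes gives a ceiling on tokens per second. All figures below are illustrative assumptions, not measured numbers:

```python
def vram_and_decode(params_b, bytes_per_param, kv_gb, bw_gbps):
    """Back-of-envelope: VRAM needed and a bandwidth-bound decode ceiling.

    Decoding one token reads every weight from VRAM once, so tokens/sec
    is at most bandwidth / weight bytes. Inputs are illustrative.
    """
    weight_gb = params_b * bytes_per_param   # model weights in GB
    total_gb = weight_gb + kv_gb             # weights + KV cache
    tok_per_s = bw_gbps / weight_gb          # bandwidth-bound ceiling
    return total_gb, tok_per_s

# Hypothetical 7B-parameter model in fp16 (2 bytes/param),
# 2 GB KV cache, 1000 GB/s VRAM bandwidth:
total, tps = vram_and_decode(params_b=7, bytes_per_param=2, kv_gb=2, bw_gbps=1000)
print(f"~{total:.0f} GB VRAM, decode ceiling ~{tps:.0f} tok/s")
# → ~16 GB VRAM, decode ceiling ~71 tok/s
```

Note the ceiling ignores the KV-cache reads themselves, which grow with context length and push real decode speed further below this bound.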
