GPUs are throughput machines, built for simple, massively parallel workloads.
Compute
- CUDA Core: operates on individual scalars
- Tensor Core: operates on vectors/matrices. Can be dense or sparse depending on whether every element of the tensor is used.
- Special Function Unit (SFU): accelerates certain mathematical operations like sin, cos, and log
Memory and Caches
- VRAM: Video RAM; the name comes from the GPU's original purpose of rendering video
- DRAM: dynamic RAM; general-purpose, off-chip memory
- SRAM: static RAM; faster, more expensive, and on-chip
VRAM on a GPU limits the size of the model you can serve: it must hold the model weights plus the KV cache. Memory bandwidth is the bottleneck for decoding, since generating each token requires reading all the weights from VRAM.
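The capacity and bandwidth points above can be sketched with a back-of-envelope estimate. This is a rough model only: the parameter counts, model shapes, and bandwidth figure below are illustrative assumptions, not measurements, and real serving adds overhead (activations, fragmentation, framework buffers).

```python
# Back-of-envelope VRAM and decode-speed estimate for an LLM.
# All model shapes and hardware numbers are illustrative assumptions.

def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights (fp16/bf16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_el: int = 2) -> float:
    """KV cache: two tensors (K and V) per layer, per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_el

# Hypothetical 7B-parameter model with Llama-like shapes.
weights = weight_bytes(7e9)                       # ~14 GB in fp16
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                    seq_len=4096, batch=1)        # ~0.5 GB
total_gb = (weights + kv) / 1e9

# Decoding is memory-bound: each generated token re-reads all the weights,
# so an upper bound on tokens/sec is bandwidth divided by bytes read per token.
bandwidth = 2e12                                  # assume ~2 TB/s of HBM bandwidth
tokens_per_sec = bandwidth / weights

print(f"VRAM needed: ~{total_gb:.1f} GB")
print(f"Upper-bound decode speed: ~{tokens_per_sec:.0f} tokens/s")
```

For this hypothetical setup the estimate comes out to roughly 14.5 GB of VRAM and an upper bound of about 143 tokens/s, which is why quantizing weights to fewer bytes per parameter both shrinks the memory footprint and speeds up decoding.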
Reference
- Philip Kiely. Inference Engineering, Baseten (April 2026).