What’s the simplest neural network I can build with an RTX 5090?
I use uv for all my Python projects, so I started there: https://docs.astral.sh/uv/guides/integration/pytorch/#installing-pytorch
The instructions are long but I went with the simplest option:
To start, consider the following (default) configuration, which would be generated by running `uv init --python 3.14` followed by `uv add torch torchvision`.
Now a quick check to confirm we are using the GPU:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
```

The above prints out cuda. Beautiful.
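If you want a bit more than the string cuda, you can also ask PyTorch which GPU it sees. A small optional check using torch.cuda.get_device_name:

```python
import torch

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should name the RTX 5090 on this box
```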
I asked Claude for the simplest neural network I could build and it gave me this code:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# MNIST training set, served in shuffled mini-batches of 128
train = DataLoader(
    datasets.MNIST(".", train=True, download=True, transform=transforms.ToTensor()),
    batch_size=128,
    shuffle=True,
)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.c1 = nn.Conv2d(1, 32, 3)          # 1 -> 32 channels, 3x3 kernel
        self.c2 = nn.Conv2d(32, 64, 3)         # 32 -> 64 channels, 3x3 kernel
        self.fc1 = nn.Linear(64 * 5 * 5, 128)  # flattened conv output -> 128
        self.fc2 = nn.Linear(128, 10)          # 128 -> 10 digit classes

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.c1(x)), 2)
        x = F.max_pool2d(F.relu(self.c2(x)), 2)
        return self.fc2(F.relu(self.fc1(x.flatten(1))))

model = Net().to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(3):
    for x, y in train:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    print(epoch, loss.item())
```

Explanation:
- Import several dependencies
- Sanity check to confirm we are using the GPU
- Load the MNIST training set
- The DataLoader wraps that Dataset to yield mini-batches: 128 examples at a time, reshuffled each epoch. When you iterate over it, each step gives you x of shape [128, 1, 28, 28] and y of shape [128] (see the first sketch after this list)
- We define the network by subclassing nn.Module
- The four layers:
- c1: convolution, 1 input channel (grayscale) → 32 output channels (feature maps), 3×3 kernel
- c2: 32 → 64 channels, 3×3 kernel
- fc1: fully-connected layer, 1600 → 128. The 64 × 5 × 5 is the flattened size after the convolutions and pooling; I’ll explain where 5×5 comes from in a sec.
- fc2: 128 → 10. Ten outputs, one per digit class.
- Perform the forward pass:
- c1(x) → [128, 32, 26, 26] (a 3×3 conv with no padding shrinks each spatial dim by 2). relu zeros out negatives. max_pool2d(…, 2) takes the max over each 2×2 block, halving spatial dims → [128, 32, 13, 13].
- c2 → [128, 64, 11, 11]. Pool → [128, 64, 5, 5]. That’s where the 64 × 5 × 5 = 1600 in fc1 came from (traced step by step in the second sketch after this list).
- flatten(1) collapses everything from dim 1 onward, so [128, 64, 5, 5] → [128, 1600]. Through fc1, ReLU, fc2 → final shape [128, 10]. Ten raw scores (logits) per image.
- Build the model and move all its parameters onto the GPU. Use Adam as the optimiser: it updates the weights using gradients. model.parameters() hands it every learnable tensor, and lr=1e-3 is the learning rate, a sensible default for Adam.
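A quick way to see those batch shapes for yourself, assuming the train loader from the script above is in scope:

```python
x, y = next(iter(train))  # pull one mini-batch
print(x.shape)  # torch.Size([128, 1, 28, 28])
print(y.shape)  # torch.Size([128])
```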
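And here’s where the 5×5 comes from, traced with a dummy batch through the same ops as forward (assumes the Net class from the script above):

```python
import torch
import torch.nn.functional as F

net = Net()
x = torch.zeros(128, 1, 28, 28)          # dummy MNIST-shaped batch
x = F.max_pool2d(F.relu(net.c1(x)), 2)
print(x.shape)  # torch.Size([128, 32, 13, 13]): 28 -> 26 (conv) -> 13 (pool)
x = F.max_pool2d(F.relu(net.c2(x)), 2)
print(x.shape)  # torch.Size([128, 64, 5, 5]): 13 -> 11 (conv) -> 5 (pool)
print(x.flatten(1).shape)  # torch.Size([128, 1600]), i.e. 64 * 5 * 5
```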
The four-step heartbeat of basically every PyTorch training loop (sketched standalone after this list):
- opt.zero_grad() — clear out gradients from the previous batch. PyTorch accumulates gradients by default, so without this they’d pile up across batches and ruin training.
- model(x) — forward pass, calls your forward method, returns logits [128, 10]. F.cross_entropy(logits, y) computes the loss: how wrong the predictions are vs. the true labels. Returns a single scalar.
- loss.backward() — autograd walks backward through every operation that produced loss and computes the gradient of the loss with respect to each parameter. Stores those gradients on the parameters themselves.
- opt.step() — Adam reads those gradients and nudges each parameter in the direction that reduces loss.
- Once per epoch we print the epoch number and the loss from the last batch of that epoch: `print(epoch, loss.item())`
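To watch the heartbeat in isolation, here’s a throwaway sketch on a tiny made-up linear model (the names tiny, xb, yb are mine, not part of the script above). It shows gradients appearing after backward() and the weights moving after step():

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

tiny = nn.Linear(4, 2)                  # toy model: 4 features -> 2 classes
opt = torch.optim.Adam(tiny.parameters(), lr=1e-3)
xb = torch.randn(8, 4)                  # fake batch of 8 examples
yb = torch.randint(0, 2, (8,))          # fake labels

opt.zero_grad()
print(tiny.weight.grad)                 # None: nothing computed yet
loss = F.cross_entropy(tiny(xb), yb)    # forward pass -> single scalar
loss.backward()                         # autograd fills in .grad on each parameter
print(tiny.weight.grad.shape)           # torch.Size([2, 4]), matches the weight
before = tiny.weight.detach().clone()
opt.step()                              # Adam nudges every parameter
print((tiny.weight.detach() - before).abs().max() > 0)  # tensor(True)
```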
That’s it.
Excited for what’s to come.