What’s the simplest neural network I can build with an RTX 5090?
I use uv for all my Python projects, so I started there: https://docs.astral.sh/uv/guides/integration/pytorch/#installing-pytorch
The instructions are long but I went with the simplest option:
To start, consider the following (default) configuration, which would be generated by running `uv init --python 3.14` followed by `uv add torch torchvision`.
Now a quick check to confirm we are using the GPU:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
```

The above prints out cuda. Beautiful.
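If you want a bit more than the string cuda, you can also ask PyTorch which GPU it sees. A small optional check using torch.cuda.get_device_name:

```python
import torch

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should name the RTX 5090 on this box
```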
I asked Claude for the simplest neural network I could build and it gave me this code:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# MNIST training set, served in shuffled mini-batches of 128
train = DataLoader(
    datasets.MNIST(".", train=True, download=True, transform=transforms.ToTensor()),
    batch_size=128,
    shuffle=True,
)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.c1 = nn.Conv2d(1, 32, 3)          # 1 -> 32 channels, 3x3 kernel
        self.c2 = nn.Conv2d(32, 64, 3)         # 32 -> 64 channels, 3x3 kernel
        self.fc1 = nn.Linear(64 * 5 * 5, 128)  # flattened conv output -> 128
        self.fc2 = nn.Linear(128, 10)          # 128 -> 10 digit classes

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.c1(x)), 2)
        x = F.max_pool2d(F.relu(self.c2(x)), 2)
        return self.fc2(F.relu(self.fc1(x.flatten(1))))

model = Net().to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(3):
    for x, y in train:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    print(epoch, loss.item())
```

Explanation:
- Import several dependencies
- Sanity check to confirm we are using the GPU
- Load the MNIST training set
- The DataLoader wraps that Dataset to yield mini-batches: 128 examples at a time, reshuffled each epoch. When you iterate over it, each step gives you x of shape [128, 1, 28, 28] and y of shape [128] (see the first sketch after this list)
- We define the network by subclassing nn.Module
- The four layers:
- c1: convolution, 1 input channel (grayscale) → 32 output channels (feature maps), 3×3 kernel
- c2: 32 → 64 channels, 3×3 kernel
- fc1: fully-connected layer, 1600 → 128. The 64 × 5 × 5 is the flattened size after the convolutions and pooling; I’ll explain where 5×5 comes from in a sec.
- fc2: 128 → 10. Ten outputs, one per digit class.
- Perform the forward pass:
- c1(x) → [128, 32, 26, 26] (a 3×3 conv with no padding shrinks each spatial dim by 2). relu zeros out negatives. max_pool2d(…, 2) takes the max over each 2×2 block, halving spatial dims → [128, 32, 13, 13].
- c2 → [128, 64, 11, 11]. Pool → [128, 64, 5, 5]. That’s where the 64 × 5 × 5 = 1600 in fc1 came from (traced step by step in the second sketch after this list).
- flatten(1) collapses everything from dim 1 onward, so [128, 64, 5, 5] → [128, 1600]. Through fc1, ReLU, fc2 → final shape [128, 10]. Ten raw scores (logits) per image.
- Build the model and move all its parameters onto the GPU. Use Adam as the optimiser: it updates the weights using gradients. model.parameters() hands it every learnable tensor, and lr=1e-3 is the learning rate, a sensible default for Adam.
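A quick way to see those batch shapes for yourself, assuming the train loader from the script above is in scope:

```python
x, y = next(iter(train))  # pull one mini-batch
print(x.shape)  # torch.Size([128, 1, 28, 28])
print(y.shape)  # torch.Size([128])
```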
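And here’s where the 5×5 comes from, traced with a dummy batch through the same ops as forward (assumes the Net class from the script above):

```python
import torch
import torch.nn.functional as F

net = Net()
x = torch.zeros(128, 1, 28, 28)          # dummy MNIST-shaped batch
x = F.max_pool2d(F.relu(net.c1(x)), 2)
print(x.shape)  # torch.Size([128, 32, 13, 13]): 28 -> 26 (conv) -> 13 (pool)
x = F.max_pool2d(F.relu(net.c2(x)), 2)
print(x.shape)  # torch.Size([128, 64, 5, 5]): 13 -> 11 (conv) -> 5 (pool)
print(x.flatten(1).shape)  # torch.Size([128, 1600]), i.e. 64 * 5 * 5
```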
The four-step heartbeat of basically every PyTorch training loop (sketched standalone after this list):
- opt.zero_grad() — clear out gradients from the previous batch. PyTorch accumulates gradients by default, so without this they’d pile up across batches and ruin training.
- model(x) — forward pass, calls your forward method, returns logits [128, 10]. F.cross_entropy(logits, y) computes the loss: how wrong the predictions are vs. the true labels. Returns a single scalar.
- loss.backward() — autograd walks backward through every operation that produced loss and computes the gradient of the loss with respect to each parameter. Stores those gradients on the parameters themselves.
- opt.step() — Adam reads those gradients and nudges each parameter in the direction that reduces loss.
- Once per epoch we print the epoch number and the loss from the last batch of that epoch: `print(epoch, loss.item())`
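To watch the heartbeat in isolation, here’s a throwaway sketch on a tiny made-up linear model (the names tiny, xb, yb are mine, not part of the script above). It shows gradients appearing after backward() and the weights moving after step():

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

tiny = nn.Linear(4, 2)                  # toy model: 4 features -> 2 classes
opt = torch.optim.Adam(tiny.parameters(), lr=1e-3)
xb = torch.randn(8, 4)                  # fake batch of 8 examples
yb = torch.randint(0, 2, (8,))          # fake labels

opt.zero_grad()
print(tiny.weight.grad)                 # None: nothing computed yet
loss = F.cross_entropy(tiny(xb), yb)    # forward pass -> single scalar
loss.backward()                         # autograd fills in .grad on each parameter
print(tiny.weight.grad.shape)           # torch.Size([2, 4]), matches the weight
before = tiny.weight.detach().clone()
opt.step()                              # Adam nudges every parameter
print((tiny.weight.detach() - before).abs().max() > 0)  # tensor(True)
```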
That’s it.
Excited for what’s to come.