Understanding PyTorch is one of the fundamental skills on the path to mastering modern machine learning. There are plenty of great books for deep study, but here is the tasting-menu version: we'll quickly build something practical and explain the core concepts as we go.

We’re going to build something real — a program that looks at images of handwritten digits (0–9) and tells you which digit it is. By the end, you’ll understand how PyTorch works and have a working neural network.


What is PyTorch?

PyTorch is a Python library for building and training neural networks. It gives you two superpowers:

  1. Tensors — a data structure for holding numbers in grids (we’ll explain this properly below)
  2. Automatic gradient computation — the math engine that makes neural networks learn (we’ll explain this too)

Let’s install it and get going. Before installing PyTorch, we’ll install the excellent uv, a fast and efficient tool that manages Python versions, virtual environments, and packages; it’s arguably the best way to use Python these days. We’ll also keep the project in a virtual environment, so we don’t muddy up the global Python installation.

curl -LsSf https://astral.sh/uv/install.sh | sh
mkdir -p ~/code/pytorch-101
cd ~/code/pytorch-101
uv venv --python 3.12   # create a virtual environment using Python 3.12

Now we can install PyTorch and torchvision with uv:

uv pip install torch torchvision

With that out of the way, let’s build the code.

Step 1: Tensors — Just Multi-Dimensional Arrays

You know arrays. A tensor is just an array that can have any number of dimensions:

import torch

# A 1D tensor — like a regular list of numbers
temperatures = torch.tensor([72.0, 68.5, 75.3])

# A 2D tensor — like a spreadsheet or a table with rows and columns
sales_table = torch.tensor([
    [100, 200, 150],   # row 0
    [300, 250, 175],   # row 1
])

# A 3D tensor — like a stack of spreadsheets
# (this is what a batch of grayscale images looks like!)

Why not just use Python lists? Because tensors can:

  • Run math on millions of numbers at once (vectorized operations)
  • Run on a GPU for 10–100x speedup
  • Track their own math history so the network can learn (more on this soon)

# Basic operations — they work on entire tensors at once
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

print(a + b)       # tensor([5., 7., 9.])       — adds element by element
print(a * b)       # tensor([ 4., 10., 18.])    — multiplies element by element
print(a.sum())     # tensor(6.)                 — adds up everything
print(a.mean())    # tensor(2.)                 — average

The shape of a tensor tells you its dimensions. This matters a lot:

img = torch.randn(28, 28)       # a 28×28 grid of random numbers (like a grayscale image)
print(img.shape)                 # torch.Size([28, 28])

batch = torch.randn(64, 28, 28) # 64 images, each 28×28
print(batch.shape)               # torch.Size([64, 28, 28])

That’s tensors. Arrays with superpowers. Let’s move on.


Step 2: Our Data — 70,000 Handwritten Digits

There’s a famous dataset called MNIST — 70,000 images of handwritten digits (0–9), each 28×28 pixels in grayscale. It’s the “Hello World” of machine learning.

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Download the dataset (only happens once, ~50MB)
train_data = datasets.MNIST(
    root='./data',
    train=True,              # 60,000 training images
    download=True,
    transform=transforms.ToTensor()  # converts images to tensors with values 0.0–1.0
)

test_data = datasets.MNIST(
    root='./data',
    train=False,             # 10,000 test images
    download=True,
    transform=transforms.ToTensor()
)

Each item in the dataset is a pair: (image_tensor, label). The image is a 28×28 tensor of pixel values (0.0 = black, 1.0 = white), and the label is the correct digit (0–9).

DataLoader: Feeding Data in Batches

You don’t feed images one at a time — you feed them in batches (groups). This is faster because the GPU can process 64 images almost as fast as 1.

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=64, shuffle=False)

DataLoader is like a conveyor belt: each time you ask for the next item, it hands you a batch of 64 images and their 64 labels.
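You can see the conveyor belt in action without waiting for the MNIST download. Here's a sketch using a small synthetic dataset (the fake images and labels are stand-ins with the same shapes MNIST produces):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST: 200 fake "images" of shape 1×28×28
# and 200 fake labels in the range 0–9.
fake_images = torch.randn(200, 1, 28, 28)
fake_labels = torch.randint(0, 10, (200,))
dataset = TensorDataset(fake_images, fake_labels)

loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Ask the conveyor belt for one batch
images, labels = next(iter(loader))
print(images.shape)  # torch.Size([64, 1, 28, 28]) — 64 images per batch
print(labels.shape)  # torch.Size([64])            — one label per image
```

The real train_loader behaves exactly the same way, just with actual digit images.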


Step 3: Building the Neural Network

You already know the basic idea: input goes in, passes through layers of parameters, output comes out. In PyTorch, you define a network as a Python class:

import torch.nn as nn

class DigitRecognizer(nn.Module):
    def __init__(self):
        super().__init__()
        # Define the layers
        self.flatten = nn.Flatten()          # converts 28×28 image → flat list of 784 numbers
        self.layer1 = nn.Linear(784, 128)    # 784 inputs → 128 outputs
        self.layer2 = nn.Linear(128, 64)     # 128 → 64
        self.layer3 = nn.Linear(64, 10)      # 64 → 10 (one output per digit 0–9)

    def forward(self, x):
        # Define how data flows through the layers
        x = self.flatten(x)                  # [64, 1, 28, 28] → [64, 784]
        x = torch.relu(self.layer1(x))       # layer 1 + activation
        x = torch.relu(self.layer2(x))       # layer 2 + activation
        x = self.layer3(x)                   # layer 3 (raw scores, no activation)
        return x

model = DigitRecognizer()

Let’s unpack the new concepts:

What is nn.Linear?

A linear layer does this: output = input × weights + bias. It’s the fundamental building block — a grid of learnable parameters that transforms input numbers into output numbers.

nn.Linear(784, 128) means: “Take 784 input numbers, multiply by a 784×128 table of weights, add a bias, and produce 128 output numbers.” Those weights start as random numbers and get improved during training.
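You can poke at a layer directly to see those weights. One detail worth knowing: PyTorch stores the weight matrix as (outputs, inputs), i.e. 128×784, even though mathematically you can think of it as a 784×128 transformation:

```python
import torch
import torch.nn as nn

layer = nn.Linear(784, 128)

print(layer.weight.shape)  # torch.Size([128, 784]) — stored as (out_features, in_features)
print(layer.bias.shape)    # torch.Size([128])      — one bias per output

# Feed one flattened "image" of 784 random numbers through it
x = torch.randn(1, 784)
y = layer(x)
print(y.shape)             # torch.Size([1, 128])   — 784 inputs became 128 outputs
```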

What is relu?

After each linear layer, we apply an activation function. Without it, stacking linear layers would just collapse into one big linear layer (multiplying matrices together just gives you another matrix). The network couldn’t learn anything complex.

relu is the simplest activation: it just replaces negative numbers with zero.

relu(-3) = 0
relu(0)  = 0
relu(5)  = 5

That’s it. This tiny nonlinearity is enough to let the network learn complex patterns.
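You can verify this in one line — relu works element by element on a whole tensor at once:

```python
import torch

x = torch.tensor([-3.0, 0.0, 5.0])
out = torch.relu(x)   # negatives become zero, everything else passes through
print(out)            # tensor([0., 0., 5.])
```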

What is forward?

The forward method defines the path data takes through your network. PyTorch calls this automatically when you do model(some_input). You never call forward directly.

Why 10 outputs?

The final layer outputs 10 numbers — one “score” for each digit (0–9). The highest score is the network’s prediction. So if the output is [0.1, 0.2, 8.5, 0.3, ...], the network is guessing “2” (index 2 has the highest score).
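Picking the highest score is one call: argmax returns the index of the largest value, and for our network that index is the predicted digit. The scores below are made up for illustration:

```python
import torch

# Ten made-up scores, one per digit 0–9
scores = torch.tensor([0.1, 0.2, 8.5, 0.3, -1.2, 0.0, 0.7, 0.4, 0.9, 2.1])

predicted_digit = scores.argmax().item()
print(predicted_digit)  # 2 — index of the highest score (8.5)
```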


Step 4: How Neural Networks Learn (The Key Insight)

This is the part most tutorials rush through. Let’s slow down.

A neural network starts with random weights. It’s going to be wrong about everything. Training is the process of adjusting those weights to make it less wrong. Here’s how:

The Loss Function: “How wrong are we?”

After the network makes a prediction, we need a single number that says “how bad was that prediction.” This is the loss.

criterion = nn.CrossEntropyLoss()

CrossEntropyLoss is the standard loss function for classification. It takes the network’s 10 raw output scores and the correct label, and returns a number: high if the prediction was wrong, low if it was right. You don’t need to understand the math — just know it measures wrongness.
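A quick sketch of "measuring wrongness" with made-up scores: the same confident prediction yields a small loss when it matches the label and a large loss when it doesn't.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# One prediction: the network is very confident the digit is 0
logits = torch.tensor([[8.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]])

loss_if_right = criterion(logits, torch.tensor([0]))  # label agrees → small loss
loss_if_wrong = criterion(logits, torch.tensor([3]))  # label disagrees → large loss

print(loss_if_right.item())  # close to zero
print(loss_if_wrong.item())  # much larger
```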

Gradients: “Which direction should we adjust?”

Here’s the key question: we have thousands of weight parameters, and we need to know how to change each one to reduce the loss. Should weight #4,732 go up or down? By how much?

A gradient answers exactly this. For each weight, the gradient tells you:

  • Sign: should this weight increase (+) or decrease (−) to reduce the loss?
  • Magnitude: how sensitive is the loss to this weight? (Big gradient = big impact)

Think of it like this: you’re blindfolded on a hilly landscape, and you want to walk downhill. The gradient is someone telling you “the ground slopes down to your left” — it points you toward lower ground (lower loss).

Backpropagation: Computing Gradients Automatically

Computing gradients by hand for thousands of weights would be impossible. PyTorch does it automatically with one line:

loss.backward()

This works because PyTorch secretly records every math operation you do on tensors (that’s the “tracks their own math history” superpower from earlier). When you call .backward(), it replays that history in reverse and uses the chain rule from calculus to compute every gradient. You never have to do the math yourself.
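You can watch this happen with a single number instead of thousands of weights. Here y = x², so calculus says the gradient dy/dx is 2x, and at x = 3 that's 6 — PyTorch computes exactly that:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)  # "track operations on this tensor"
y = x ** 2                                 # PyTorch records: y came from squaring x
y.backward()                               # replay in reverse, compute dy/dx

print(x.grad)  # tensor(6.) — the derivative 2x evaluated at x = 3
```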

The Optimizer: “Apply the adjustments”

Once we have the gradients, we need to actually update the weights:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Adam is an optimizer — an algorithm that takes the gradients and uses them to update each weight in a smart way. lr=0.001 is the learning rate: how big each adjustment step is. Too big and you overshoot; too small and training takes forever. 0.001 is a safe default.

model.parameters() tells the optimizer “here are all the weights you’re responsible for updating.”
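To demystify what an update looks like, here's the core idea of optimizer.step() stripped down to plain gradient descent (Adam layers per-weight momentum and scaling on top of this, which we won't reproduce here). The loss function and numbers are made up for illustration:

```python
import torch

w = torch.tensor(5.0, requires_grad=True)  # one "weight", starting at 5.0
lr = 0.1                                   # learning rate

loss = (w - 3.0) ** 2  # a toy loss, smallest when w = 3
loss.backward()        # w.grad is now 2 * (w - 3) = 4.0

with torch.no_grad():
    w -= lr * w.grad   # step downhill: 5.0 - 0.1 * 4.0 = 4.6

print(w.item())  # 4.6 — the weight moved toward the loss minimum at 3.0
```

An optimizer does this for every parameter in model.parameters(), every batch.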


Step 5: The Training Loop

Now we put it all together. This is the pattern you’ll use for every PyTorch project:

model = DigitRecognizer()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Train for 5 passes through the entire dataset
for epoch in range(5):
    running_loss = 0.0

    for images, labels in train_loader:
        # Step 1: Reset gradients from the previous batch
        # (otherwise they accumulate, which we don't want)
        optimizer.zero_grad()

        # Step 2: Feed images through the network
        predictions = model(images)

        # Step 3: Measure how wrong we are
        loss = criterion(predictions, labels)

        # Step 4: Compute gradients for every weight
        # ("how should each weight change to reduce this loss?")
        loss.backward()

        # Step 5: Update all weights using those gradients
        optimizer.step()

        running_loss += loss.item()

    avg_loss = running_loss / len(train_loader)
    print(f"Epoch {epoch + 1}/5, Average Loss: {avg_loss:.4f}")

That 5-step inner loop is the heartbeat of all deep learning. Here it repeats a few thousand times (about 940 batches per epoch, times 5 epochs) — each batch of 64 images nudges the weights a tiny bit in the right direction. Over time, the network goes from random guessing (~10% accuracy) to high accuracy (~98%).

An epoch means one full pass through all 60,000 training images. We do 5 epochs, so the network sees each image roughly 5 times.


Step 6: Testing — Does It Actually Work?

We test on the 10,000 images the network has never seen during training:

correct = 0
total = 0

model.eval()  # switch to evaluation mode (disables some training-only behaviors)

with torch.no_grad():  # don't track gradients — we're not training
    for images, labels in test_loader:
        predictions = model(images)
        # .argmax(1) picks the index of the highest score in each prediction
        # (index = the predicted digit)
        predicted_digits = predictions.argmax(dim=1)
        total += labels.size(0)
        correct += (predicted_digits == labels).sum().item()

accuracy = 100 * correct / total
print(f"Test Accuracy: {accuracy:.1f}%")
# Typically prints: Test Accuracy: 97.5%+

torch.no_grad() is an optimization — since we’re not training, there’s no need for PyTorch to record math operations for gradient computation. Saves memory and time.
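A small sketch makes the difference visible: the same operation produces a tensor with recorded history outside no_grad, and a plain history-free tensor inside it.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

y = x * 3                   # outside no_grad: operation is recorded
print(y.requires_grad)      # True

with torch.no_grad():
    z = x * 3               # inside no_grad: nothing is recorded
print(z.requires_grad)      # False — no history, so less memory and work
```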


Step 7: Using It on a Single Image

# Grab one image from the test set
image, true_label = test_data[0]

# Feed it through the model
model.eval()
with torch.no_grad():
    output = model(image.unsqueeze(0))  # unsqueeze adds a batch dimension: [1,28,28] → [1,1,28,28]
    predicted = output.argmax(dim=1).item()

print(f"Model says: {predicted}, Actual: {true_label}")

The Complete Program

Here’s the whole thing, copy-paste ready:

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# --- Data ---
transform = transforms.ToTensor()
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=64, shuffle=False)

# --- Model ---
class DigitRecognizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layer1 = nn.Linear(784, 128)
        self.layer2 = nn.Linear(128, 64)
        self.layer3 = nn.Linear(64, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        x = self.layer3(x)
        return x

model = DigitRecognizer()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# --- Train ---
for epoch in range(5):
    running_loss = 0.0
    for images, labels in train_loader:
        optimizer.zero_grad()
        predictions = model(images)
        loss = criterion(predictions, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}/5, Loss: {running_loss/len(train_loader):.4f}")

# --- Test ---
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for images, labels in test_loader:
        predicted = model(images).argmax(dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"\nTest Accuracy: {100 * correct / total:.1f}%")

Quick Reference: Concepts We Covered

  • Tensor: a multi-dimensional array of numbers that can run on GPU and track its math history
  • Linear layer: a grid of learnable weights that transforms N inputs into M outputs
  • Activation (ReLU): replaces negatives with zero, letting the network learn non-obvious patterns
  • Loss function: a single number measuring “how wrong was the prediction”
  • Gradient: for each weight, “should it go up or down, and by how much, to reduce loss?”
  • Backpropagation: PyTorch’s automatic system for computing all gradients at once
  • Optimizer (Adam): takes gradients and updates weights accordingly
  • Epoch: one full pass through the training data
  • Batch: a group of samples processed together (e.g., 64 images at once)
  • DataLoader: conveyor belt that feeds batches to your training loop

Where to Go Next

You now understand the full PyTorch workflow. Everything else builds on this:

  • Convolutional Neural Networks (CNNs) — instead of nn.Linear, use nn.Conv2d layers that look at local patches of pixels. Gets you from ~97% to 99%+ on MNIST.
  • GPU acceleration — move your model and data tensors to the GPU with .to("cuda") (NVIDIA GPUs) or .to("mps") (Apple Silicon Macs).
  • Saving/loading models — torch.save(model.state_dict(), 'model.pth') to save, then model.load_state_dict(torch.load('model.pth')) to load later.
  • Transfer learning — start with a pre-trained model (like ResNet) and fine-tune it on your data.
  • Transformers — the architecture behind ChatGPT/Claude, built with the same PyTorch primitives.