How Models Learn

Most modern machine learning models, from logistic regression to a trillion-parameter LLM, are produced by the same loop. Internalize it once and nothing later in this guide will feel mysterious.

The five ingredients

Data — examples of the task. For supervised learning, each example is an input paired with the correct answer (the label).
A model — a function with adjustable internal numbers called parameters (or weights). Untrained, those numbers are random.
A loss function — a single number measuring how wrong the model’s predictions are on the data. Lower is better.
An optimizer — a procedure that nudges the parameters to reduce the loss.
Compute — the hardware to run this loop millions of times.

“Training” is just: repeatedly adjust the parameters until the loss is low.

The training loop

# Pseudocode — the heart of essentially all ML training.
model = initialize_random_parameters()

for epoch in range(num_epochs):          # one pass over the data
    for batch in training_data:          # a chunk of examples
        predictions = model(batch.inputs)
        loss = loss_function(predictions, batch.labels)  # how wrong?
        gradients = compute_gradients(loss, model.parameters)
        model.parameters -= learning_rate * gradients     # nudge toward "less wrong"

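That last line is the whole game. The gradient points in the direction in which the loss increases fastest, so we step the opposite way. The learning rate controls the step size. This is gradient descent (strictly, its mini-batch or stochastic form, since each step uses one batch rather than the whole dataset), and it, or a close variant of it, is how virtually every modern model learns.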
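To make the loop concrete, here is a minimal runnable sketch of it in NumPy. It is an illustration under assumptions, not a reference implementation: the model is a single line y = w * x + b, the loss is mean squared error, the toy data is made up, and the function name train_linear_model and all hyperparameter values are hypothetical choices.

```python
import numpy as np

def train_linear_model(inputs, labels, learning_rate=0.1, num_epochs=200,
                       batch_size=20, seed=0):
    """Fit y ~ w * x + b with mini-batch gradient descent on mean squared error."""
    rng = np.random.default_rng(seed)
    w, b = rng.normal(size=2)                    # untrained parameters start random
    for epoch in range(num_epochs):              # one pass over the data
        order = rng.permutation(len(inputs))     # visit examples in a fresh order
        for start in range(0, len(inputs), batch_size):
            idx = order[start:start + batch_size]
            x, y = inputs[idx], labels[idx]      # a chunk of examples
            error = (w * x + b) - y              # predictions minus labels
            # Gradients of mean(error**2) with respect to w and b.
            grad_w = 2.0 * np.mean(error * x)
            grad_b = 2.0 * np.mean(error)
            w -= learning_rate * grad_w          # step against the gradient
            b -= learning_rate * grad_b
    return w, b

# Toy data, invented for illustration: y = 3x + 0.5 plus a little noise.
data_rng = np.random.default_rng(42)
inputs = data_rng.uniform(-1.0, 1.0, size=200)
labels = 3.0 * inputs + 0.5 + data_rng.normal(0.0, 0.1, size=200)

w, b = train_linear_model(inputs, labels)
print(f"learned w={w:.2f}, b={b:.2f}")  # should land close to w=3, b=0.5
```

Real frameworks compute the gradients automatically and usually apply a more elaborate update rule, but the shape of the loop is the same.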

Loss functions: defining “wrong”

The loss function encodes what you want. Pick the wrong one and the model optimizes the wrong thing.

Task	Common loss	Intuition
Regression (predict a number)	Mean Squared Error	Penalize squared distance from the truth
Classification (predict a class)	Cross-entropy	Penalize confident wrong answers heavily
Language modeling	Cross-entropy over next token	Reward predicting the actual next word
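The first two rows of the table can be sketched in a few lines of NumPy. The helper names and the example numbers below are illustrative assumptions; the point is only to show how cross-entropy grows sharply when the model is confident and wrong.

```python
import numpy as np

def mean_squared_error(predictions, targets):
    """Average squared distance between predictions and the true values."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean((predictions - targets) ** 2))

def cross_entropy(probabilities, true_class):
    """Negative log of the probability the model gave to the correct class."""
    return float(-np.log(probabilities[true_class]))

print(mean_squared_error([2.0, 4.0], [1.0, 6.0]))  # (1 + 4) / 2 = 2.5

# Three-class example in which class 0 is the correct answer.
cases = {
    "confident and right": np.array([0.90, 0.05, 0.05]),
    "unsure":              np.array([0.40, 0.30, 0.30]),
    "confident and wrong": np.array([0.01, 0.98, 0.01]),
}
for name, probabilities in cases.items():
    print(name, round(cross_entropy(probabilities, 0), 3))
# confident and right 0.105
# unsure 0.916
# confident and wrong 4.605
```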

An LLM is pretrained with an objective as simple as “predict the next token.” Run that over trillions of tokens, and grammar, facts, and reasoning patterns emerge as side effects of getting good at the prediction.
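As a rough sketch of what "predict the next token" means numerically, the function below computes the average cross-entropy over a short sequence. It assumes the model has already produced one row of scores (logits) per position; the function name next_token_loss, the four-token vocabulary, and the token ids are all made up for illustration.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits: shape (sequence_length, vocab_size), one row of scores per position.
    token_ids: shape (sequence_length,), the tokens that actually occurred.
    """
    logits = np.asarray(logits, dtype=float)
    token_ids = np.asarray(token_ids)
    # Position t predicts token t+1, so drop the last row and the first token.
    scores, targets = logits[:-1], token_ids[1:]
    # Numerically stable log-softmax over the vocabulary.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

vocab_size = 4
token_ids = np.array([0, 2, 1, 3])

# A model with no preference spreads probability evenly over the vocabulary.
uniform_logits = np.zeros((len(token_ids), vocab_size))
print(next_token_loss(uniform_logits, token_ids))  # log(4), about 1.386

# A model that scores the actual next token highly gets a much lower loss.
good_logits = uniform_logits.copy()
good_logits[np.arange(len(token_ids) - 1), token_ids[1:]] = 5.0
print(next_token_loss(good_logits, token_ids))
```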

The real goal: generalization

A model that memorizes its training data perfectly is worthless — you already had that data. The point is performance on new, unseen inputs. That’s generalization, and to measure it you must split your data:

Training set (~70–80%) — the model learns from this.
Validation set (~10–15%) — used to tune settings and pick the best model.
Test set (~10–15%) — touched once, at the very end, to estimate real-world performance.
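One simple way to produce such a split is to shuffle the example indices once and cut them into three pieces. The sketch below assumes the examples are independent, so a random shuffle is appropriate; time-ordered or grouped data usually needs a different scheme. The function name split_indices and the 80/10/10 fractions are illustrative choices.

```python
import numpy as np

def split_indices(num_examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle example indices once, then cut them into train/validation/test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_examples)
    n_train = int(round(train_frac * num_examples))
    n_val = int(round(val_frac * num_examples))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]          # whatever remains
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```

Fixing the seed keeps the split reproducible, which matters because a test set that changes between runs can quietly leak into tuning decisions.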

Why models fail: the two failure modes

Underfitting — the model is too simple (or undertrained) to capture the pattern. High error on both training and validation data. Fix: a bigger model, better features, or more training.
Overfitting — the model is powerful enough to memorize training data, including its noise. Low training error, high validation error. Fix: more data, regularization, simpler model, or early stopping.
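Both failure modes can be seen on a toy problem by fitting polynomials of different degrees to a handful of noisy points. Everything here is an assumption made for illustration: the sine-shaped target, the noise level, the sample sizes, and the three degrees. Training error can only go down as the degree rises; validation error typically improves at first and then gets worse, though the exact numbers depend on the random draw.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_samples(n):
    """Draw n points from a made-up target: sin(3x) plus Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, size=n)
    return x, y

x_train, y_train = noisy_samples(15)    # small training set, easy to memorize
x_val, y_val = noisy_samples(200)       # held-out data from the same source

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = mse(coeffs, x_train, y_train)
    val_mse = mse(coeffs, x_val, y_val)
    results[degree] = (train_mse, val_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, validation MSE {val_mse:.3f}")
```

When reading the output, high error on both sets points to underfitting, while a low training error paired with a much higher validation error points to overfitting.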

The art of ML is landing in the middle: a model expressive enough to learn the signal, disciplined enough not to memorize the noise.
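Early stopping is one practical way to land in that middle: watch the validation loss each epoch and stop once it has stopped improving. The sketch below works on a plain list of per-epoch validation losses; the function name, the patience value, and the loss numbers are hypothetical.

```python
def best_epoch_with_early_stopping(validation_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; the parameters from best_epoch are kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch              # stopped early
    return best_epoch, len(validation_losses) - 1  # ran out of epochs

# Validation loss falls, bottoms out at epoch 3, then rises as overfitting sets in.
val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.70]
print(best_epoch_with_early_stopping(val_losses))  # (3, 6)
```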

Where this shows up in LLM work

Even though you won’t train foundation models, this framing is everywhere:

Fine-tuning is the same loop, started from a pretrained model.
Overfitting explains why a model fine-tuned on 50 examples parrots them back instead of generalizing.
Evaluation discipline — a held-out test set — is exactly how you should measure a prompt or RAG system, covered in LLM Engineering.
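The same held-out discipline can be sketched for a prompt or RAG system. This is an illustration under assumptions, not a recommended harness: toy_answer_fn is a hypothetical stand-in for a real pipeline, the four question/answer pairs are invented, and exact-match accuracy is just one possible metric.

```python
import random

def exact_match_accuracy(answer_fn, examples):
    """Fraction of (question, expected) pairs that answer_fn gets exactly right."""
    correct = sum(
        answer_fn(question).strip().lower() == expected.strip().lower()
        for question, expected in examples
    )
    return correct / len(examples)

def split_dev_test(examples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the examples and hold out a test portion."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(test_fraction * len(shuffled))))
    return shuffled[n_test:], shuffled[:n_test]   # (dev, test)

def toy_answer_fn(question):
    """Hypothetical stand-in for a real prompt or RAG pipeline."""
    lookup = {"capital of france?": "Paris", "2 + 2?": "4"}
    return lookup.get(question.lower(), "unknown")

examples = [
    ("Capital of France?", "paris"),
    ("2 + 2?", "4"),
    ("Largest planet?", "Jupiter"),
    ("Boiling point of water in C?", "100"),
]
dev, test = split_dev_test(examples)

# Iterate on prompts against the dev set as often as needed.
print("dev accuracy:", exact_match_accuracy(toy_answer_fn, dev))
# Look at the test set once, at the very end.
print("test accuracy:", exact_match_accuracy(toy_answer_fn, test))
```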

Key takeaways

All ML training is one loop: predict, measure loss, compute gradients, nudge parameters. Gradient descent does the nudging.
The loss function defines what “good” means, so choose it carefully.
The real objective is generalization to unseen data, which is why you must hold out a test set and never let it leak.
Models fail by underfitting (too simple) or overfitting (memorizing); good ML lives in the gap between them.