How Models Learn

Nearly every modern machine learning model, from logistic regression to a trillion-parameter LLM, is produced by the same loop. Internalize it once and nothing later in this guide will feel mysterious.

  1. Data — examples of the task. For supervised learning, each example is an input paired with the correct answer (the label).
  2. A model — a function with adjustable internal numbers called parameters (or weights). Untrained, those numbers are random.
  3. A loss function — a single number measuring how wrong the model’s predictions are on the data. Lower is better.
  4. An optimizer — a procedure that nudges the parameters to reduce the loss.
  5. Compute — the hardware to run this loop millions of times.

“Training” is just: repeatedly adjust the parameters until the loss is low.

    # Pseudocode: the heart of essentially all ML training.
    model = initialize_random_parameters()
    for epoch in range(num_epochs):              # one pass over the data
        for batch in training_data:              # a chunk of examples
            predictions = model(batch.inputs)
            loss = loss_function(predictions, batch.labels)   # how wrong?
            gradients = compute_gradients(loss, model.parameters)
            model.parameters -= learning_rate * gradients     # nudge toward "less wrong"

That last line is the whole game. The gradient points in the direction in which the loss increases fastest, so we step the opposite way. The learning rate controls step size. This is gradient descent (in its mini-batch form here, since each step uses one batch), and it is how virtually every modern model learns.
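To make the pseudocode above concrete, here is a minimal sketch of the same loop fitting a straight line with plain Python. The data points, the learning rate of 0.05, the step count, and the function name `train_linear` are all illustrative assumptions, and the whole dataset is used as a single batch to keep the example short.

```python
# Fit y = w * x + b to a few points with gradient descent on mean squared error.
# Data, learning rate, and step count are illustrative assumptions.

def train_linear(xs, ys, learning_rate=0.05, num_steps=2000):
    # Start from zeros rather than random values so the run is reproducible.
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(num_steps):
        # Predict, then measure how wrong the predictions are.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        losses.append(loss)
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, losses

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b, losses = train_linear(xs, ys)
print(f"w={w:.3f}, b={b:.3f}, first loss={losses[0]:.3f}, last loss={losses[-1]:.6f}")
```

The learned `w` and `b` should approach the slope and intercept that generated the data, and the recorded losses should fall as the steps accumulate. A much larger learning rate would make the updates overshoot and diverge.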

The loss function encodes what you want. Pick the wrong one and the model optimizes the wrong thing.

| Task | Common loss | Intuition |
| --- | --- | --- |
| Regression (predict a number) | Mean Squared Error | Penalize squared distance from the truth |
| Classification (predict a class) | Cross-entropy | Penalize confident wrong answers heavily |
| Language modeling | Cross-entropy over next token | Reward predicting the actual next word |
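The two losses in the table can be written in a few lines each. The sketch below is a simplified illustration with made-up numbers: the cross-entropy function handles a single example with an already-normalized probability list, which is an assumption made for brevity.

```python
import math

def mean_squared_error(predictions, targets):
    # Average squared distance between predictions and true values.
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(predicted_probs, true_class):
    # Negative log of the probability the model gave to the correct class.
    return -math.log(predicted_probs[true_class])

mse = mean_squared_error([2.5, 0.0], [3.0, -1.0])

# The true class is index 1 in all three cases; only the model's confidence changes.
confident_right = cross_entropy([0.05, 0.95], true_class=1)
unsure = cross_entropy([0.50, 0.50], true_class=1)
confident_wrong = cross_entropy([0.95, 0.05], true_class=1)

print(mse, confident_right, unsure, confident_wrong)
```

The cross-entropy value grows sharply as the probability assigned to the correct class shrinks, which is the "penalize confident wrong answers heavily" behavior described in the table.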

An LLM is trained with a loss as simple as “predict the next token.” Run that over trillions of tokens and grammar, facts, and reasoning patterns emerge as side effects of getting good at the prediction.
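One way to see how "predict the next token" turns raw text into supervised examples is to build the (context, target) pairs by hand. This sketch assumes whitespace-separated words stand in for tokens and uses a hypothetical helper name, `next_token_pairs`; real LLMs use subword tokenizers and far longer contexts.

```python
def next_token_pairs(tokens, context_size):
    # Each position after the first yields one training example:
    # the preceding tokens are the input, the token at that position is the label.
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens, context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

No human labeling is needed: the text itself supplies the label at every position. The cross-entropy loss is then computed on the probability the model assigns to each target.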

A model that memorizes its training data perfectly is worthless — you already had that data. The point is performance on new, unseen inputs. That’s generalization, and to measure it you must split your data:

  • Training set (~70–80%) — the model learns from this.
  • Validation set (~10–15%) — used to tune settings and pick the best model.
  • Test set (~10–15%) — touched once, at the very end, to estimate real-world performance.
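The three-way split above can be sketched with the standard library alone. The 80/10/10 fractions, the fixed seed, and the function name `split_dataset` are assumptions chosen for illustration.

```python
import random

def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=0):
    # Shuffle a copy so the split is random but reproducible for a given seed.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is the test set
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Shuffling before slicing matters when the data is stored in some order, such as by date or by class. For time-series data a random shuffle can itself leak future information into training, so a chronological split is usually the safer choice there.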
[Figure: training loss and validation loss curves in three panels. Underfitting: high error on both, too simple. Good fit: low error on both, the target. Overfitting: validation loss turns back up.]
  • Underfitting — the model is too simple (or undertrained) to capture the pattern. High error on both training and validation data. Fix: a bigger model, better features, or more training.
  • Overfitting — the model is powerful enough to memorize training data, including its noise. Low training error, high validation error. Fix: more data, regularization, simpler model, or early stopping.
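Early stopping, one of the overfitting fixes listed above, can be sketched as a rule applied to the validation loss recorded after each epoch. The loss values below are invented to mimic an overfitting run, and the `patience` parameter and function name are assumptions for this example.

```python
def early_stopping_epoch(val_losses, patience=2):
    # Track the epoch with the lowest validation loss; stop once `patience`
    # epochs have passed without a new best.
    best_epoch, best_loss = 0, float("inf")
    stopped_at = len(val_losses) - 1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            stopped_at = epoch
            break
    return best_epoch, best_loss, stopped_at

# Made-up curves: training loss keeps falling while validation loss turns back up.
train_losses = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18, 0.12]
val_losses = [1.05, 0.80, 0.65, 0.60, 0.63, 0.70, 0.80]

best_epoch, best_loss, stopped_at = early_stopping_epoch(val_losses)
print(f"best epoch={best_epoch}, best val loss={best_loss}, stopped at epoch={stopped_at}")
```

In practice the parameters saved at the best epoch are the ones kept, and the later, overfit ones are discarded.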

The art of ML is landing in the middle: a model expressive enough to learn the signal, disciplined enough not to memorize the noise.

Even though you won’t train foundation models, this framing is everywhere:

  • Fine-tuning is the same loop, started from a pretrained model.
  • Overfitting explains why a model fine-tuned on 50 examples parrots them back instead of generalizing.
  • Evaluation discipline — a held-out test set — is exactly how you should measure a prompt or RAG system, covered in LLM Engineering.
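The same held-out discipline can be sketched for a prompt or RAG system. Everything here is hypothetical: `toy_system` is a stand-in for whatever function calls your model, the three questions are invented, and exact-match scoring is only one of several possible metrics.

```python
def exact_match_accuracy(answer_fn, test_set):
    # Fraction of held-out questions whose answer matches the expected string,
    # ignoring case and surrounding whitespace.
    correct = sum(
        1
        for question, expected in test_set
        if answer_fn(question).strip().lower() == expected.strip().lower()
    )
    return correct / len(test_set)

def toy_system(question):
    # Hypothetical stand-in for a real prompt or retrieval pipeline.
    canned = {"capital of france?": "Paris", "2 + 2?": "4"}
    return canned.get(question.lower(), "I don't know")

# Held-out examples: never used while writing or tuning the prompt.
held_out_test_set = [
    ("Capital of France?", "paris"),
    ("2 + 2?", "4"),
    ("Largest planet?", "Jupiter"),
]

accuracy = exact_match_accuracy(toy_system, held_out_test_set)
print(f"accuracy={accuracy:.2f}")
```

If you adjust the prompt after looking at these results, the set has effectively become a validation set, and a fresh held-out set is needed for the final estimate.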

Nearly all modern ML training is one loop: predict, measure loss, compute gradients, nudge parameters. Gradient descent does the nudging. The loss function defines what “good” means, so choose it carefully. The real objective is generalization to unseen data, which is why you must hold out a test set and never let it leak into training or tuning. Models fail by underfitting (too simple) or overfitting (memorizing); good ML lives in the gap between them.