
Model Evaluation

Building a model is easy. Knowing whether it’s any good is the hard part, and it is the part that separates engineers who ship reliable AI from those who ship confident-looking failures. This page covers the most transferable skill in the guide: the same discipline evaluates a fraud model and a RAG pipeline.

Before celebrating “92% accuracy,” ask: what does doing nothing get? If 92% of transactions are legitimate, a model that predicts “legitimate” every time scores 92% — and catches zero fraud.

A baseline is the trivial reference point: predict the most common class, predict the average, or use last week’s value. Your model’s score is only meaningful as a delta over the baseline. Compute the baseline first, every time.
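
As a rough illustration of the point above, the sketch below computes a majority-class baseline and reports a model’s score as a delta over it. The data, the helper name `majority_baseline`, and the 0.95 model score are all hypothetical, chosen only to mirror the 92% example.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (label, accuracy) for always predicting the most common label."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Hypothetical data: 92 legitimate transactions and 8 fraudulent ones.
y_true = ["legit"] * 92 + ["fraud"] * 8

label, baseline_acc = majority_baseline(y_true)

# The baseline never predicts "fraud", so it catches none of the fraud cases.
fraud_caught = sum(1 for y in y_true if y == "fraud" and label == "fraud")

model_acc = 0.95  # hypothetical accuracy of a trained model
lift = model_acc - baseline_acc

print(f"baseline predicts {label!r}: accuracy {baseline_acc:.2f}, fraud caught {fraud_caught}")
print(f"model accuracy {model_acc:.2f}, lift over baseline {lift:+.2f}")
```

Seen this way, a 95% model is only three points better than doing nothing, which is a more honest headline than the raw score.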

When classes are imbalanced, accuracy lies. Use the confusion matrix and the metrics built from it.

                   Predicted Positive                 Predicted Negative
Actual Positive    TP: true positive (correct hit)    FN: missed positive
Actual Negative    FP: false alarm                    TN: true negative (correct)

  • Accuracy = (TP + TN) / all. Fine only when classes are balanced.
  • Precision = TP / (TP + FP). Of the items we flagged, how many were right? Matters when false positives are costly.
  • Recall = TP / (TP + FN). Of the items we should have flagged, how many did we catch? Matters when misses are costly.
  • F1 score = the harmonic mean of precision and recall — one number when you care about both.
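
The formulas above can be sketched in a few lines of plain Python. The function names and the ten-item example are illustrative assumptions, and returning 0.0 when a denominator is empty is one convention among several.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Assumption: report 0.0 when nothing was flagged or nothing was positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

In this toy case accuracy is 0.70 while recall is only 0.50: half of the positives were missed, which accuracy alone would hide.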

For models predicting a number:

  • MAE (mean absolute error) — average miss, in the original units. Easy to explain: “off by $12k on average.”
  • RMSE (root mean squared error) — like MAE but punishes large misses much harder. Use when big errors are disproportionately bad.
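  • R² (coefficient of determination): fraction of variance explained. 1 is a perfect fit and 0 is no better than predicting the mean; on held-out data it can even go negative. A quick “how much of the pattern did we capture?”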
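
A minimal sketch of the three regression metrics, using only the standard library. The function name and the four sample values are made up for illustration, and the code assumes the true values are not all identical (otherwise the variance term is zero).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [pred - truth for truth, pred in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # R^2 = 1 - (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # assumed non-zero
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical values; the last prediction misses by much more than the others.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 340]

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

The single large miss pulls RMSE well above MAE, which is the "punishes large misses" behavior described above.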

A single train/test split can be lucky or unlucky. k-fold cross-validation averages out most of that luck: split the data into k parts, train on k−1 and test on the held-out one, rotate so each part is the test set once, then average the k scores.

Fold 1:  test   train  train  train  train
Fold 2:  train  test   train  train  train
Fold 3:  train  train  test   train  train
Fold 4:  train  train  train  test   train
Fold 5:  train  train  train  train  test

Each fold tests on a different slice; average the five scores for a steadier estimate.

You get a more reliable estimate and a sense of variance — if scores swing wildly across folds, your model is unstable.
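
The rotation described above can be sketched without any library. Everything here is illustrative: `k_fold_indices`, `cross_validate`, and the toy majority-class "model" are hypothetical names, and the shuffle assumes the rows are not time-ordered.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) so each index is tested exactly once."""
    indices = list(range(n))
    # Assumption: rows are not time-ordered, so shuffling is safe.
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n, k, fit_and_score):
    """Run fit_and_score(train, test) on every fold; return mean, spread, scores."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n, k)]
    return statistics.mean(scores), statistics.stdev(scores), scores


# Toy stand-in for a model: predict the majority class seen in training.
labels = [0] * 70 + [1] * 30


def majority_accuracy(train_idx, test_idx):
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)


mean_score, spread, scores = cross_validate(len(labels), 5, majority_accuracy)
print(f"mean={mean_score:.3f}  stdev={spread:.3f}  per-fold={scores}")
```

The standard deviation across folds is the "sense of variance" mentioned above: a large spread relative to the mean suggests the score depends heavily on which rows land in the test slice.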

Data leakage is when information unavailable at prediction time sneaks into training. The model looks brilliant in testing, then collapses in production. Classic causes:

  • Target leakage — a feature that’s a proxy for the answer. Predicting churn using cancellation_date — which only exists because the user churned.
  • Train/test contamination — scaling, imputing, or selecting features using the whole dataset before splitting, so test statistics leak into training.
  • Temporal leakage — training on future data to predict the past. Time-series splits must respect the clock; never shuffle randomly.
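
Two of these causes have simple structural guards, sketched below with the standard library only. The helper names, the `timestamp` and `amount` fields, and the sample values are hypothetical. The idea is to split by time first, then learn any scaling statistics from the training rows alone.

```python
import statistics


def time_ordered_split(rows, test_fraction=0.2):
    """Split rows so every test row is later than every training row."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


def fit_scaler(values):
    """Learn mean and standard deviation from the given values only."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return mean, std


def transform(values, scaler):
    """Standardize values with previously learned statistics."""
    mean, std = scaler
    return [(v - mean) / std for v in values]


# Hypothetical rows; the two most recent amounts are much larger.
amounts = [10, 12, 11, 13, 12, 14, 15, 13, 40, 45]
rows = [{"timestamp": t, "amount": float(a)} for t, a in enumerate(amounts)]

# 1. Split first, respecting the clock (no random shuffle).
train, test = time_ordered_split(rows, test_fraction=0.2)

# 2. Fit the scaler on training data only, then apply it to both sets.
scaler = fit_scaler([r["amount"] for r in train])
train_scaled = transform([r["amount"] for r in train], scaler)
test_scaled = transform([r["amount"] for r in test], scaler)

print("scaler (mean, std):", scaler)
print("test rows in training units:", [round(v, 2) for v in test_scaled])
```

Had the scaler been fit on all ten rows, the two late outliers would have shifted the mean and spread seen during training, which is exactly the contamination described above.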

A model that wins on your test set can still lose in production. Distribution shift — real inputs drifting away from your training data — degrades models silently over time. That’s why the workflow doesn’t end at evaluation: you monitor live performance and re-evaluate continuously. That operational loop is MLOps.

Compute a baseline before trusting any score. Use precision and recall, not accuracy, for imbalanced problems, and let the cost of each error type pick the trade-off. Cross-validate to reduce the luck of a single split. Hunt relentlessly for data leakage: it’s one of the most common reasons a model looks great offline and fails in production.