Model Evaluation

Building a model is easy. Knowing whether it’s any good is the hard part, and it’s what separates engineers who ship reliable AI from those who ship confident-looking failures. The discipline on this page is the most transferable skill in the guide: the same practice evaluates a fraud model and a RAG pipeline.

Always start with a baseline

Before celebrating “92% accuracy,” ask: what does doing nothing get? If 92% of transactions are legitimate, a model that predicts “legitimate” every time scores 92% — and catches zero fraud.

A baseline is the trivial reference point: predict the most common class, predict the average, or use last week’s value. Your model’s score is only meaningful as a delta over the baseline. Compute the baseline first, every time.
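A minimal sketch of the majority-class baseline, using only the Python standard library. The data and the `model_accuracy` value are hypothetical, chosen to mirror the 92% example above; `majority_class_baseline` is an illustrative helper name, not part of any library.

```python
from collections import Counter


def majority_class_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical data: 92 legitimate transactions and 8 fraudulent ones.
labels = ["legit"] * 92 + ["fraud"] * 8
baseline_label, baseline_accuracy = majority_class_baseline(labels)

# Hypothetical model score, for illustration only.
model_accuracy = 0.92
lift = model_accuracy - baseline_accuracy

print(baseline_label, baseline_accuracy)   # legit 0.92
print(f"lift over baseline: {lift:.2f}")   # lift over baseline: 0.00
```

A lift of zero means the "92% accurate" model has learned nothing beyond the class balance.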

Classification metrics

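When classes are imbalanced, accuracy lies. Use the confusion matrix, which counts true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), and the metrics built from it.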

Accuracy = (TP + TN) / (TP + TN + FP + FN). Fine only when classes are balanced.
Precision = TP / (TP + FP). Of the items we flagged, how many were right? Matters when false positives are costly.
Recall = TP / (TP + FN). Of the items we should have flagged, how many did we catch? Matters when misses are costly.
F1 score = 2 × precision × recall / (precision + recall), the harmonic mean of precision and recall. One number when you care about both.
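The four definitions above can be sketched directly from confusion-matrix counts. The counts below are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed for this sketch, not a universal rule.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 1,000 transactions, 80 of them fraud.
# The model catches 40 frauds, misses 40, and raises 10 false alarms.
metrics = classification_metrics(tp=40, fp=10, fn=40, tn=910)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.95
# precision: 0.80
# recall: 0.50
# f1: 0.62
```

Accuracy reads 0.95 even though half the fraud goes undetected, which is why recall is the number to watch when misses are costly.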

Regression metrics

For models predicting a number:

MAE (mean absolute error) — average miss, in the original units. Easy to explain: “off by $12k on average.”
RMSE (root mean squared error) — like MAE but punishes large misses much harder. Use when big errors are disproportionately bad.
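R²: the fraction of variance explained. It is at most 1, and on held-out data it can go below 0 when the model does worse than always predicting the mean. A quick “how much of the pattern did we capture?”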
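As a rough illustration, the three regression metrics can be computed in a few lines of standard-library Python. The numbers are hypothetical, and the sketch assumes the true values are not all identical (otherwise the R² denominator is zero).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired lists of true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                     # assumes ss_tot != 0
    return mae, rmse, r2


# Hypothetical values: three small misses and one large one.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 340]
mae, rmse, r2 = regression_metrics(y_true, y_pred)

print(f"MAE:  {mae:.2f}")    # MAE:  22.50
print(f"RMSE: {rmse:.2f}")   # RMSE: 31.22
print(f"R^2:  {r2:.3f}")     # R^2:  0.922
```

The single miss of 60 pulls RMSE well above MAE, which is the "punishes large misses" behaviour in practice.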

Cross-validation

A single train/test split can be lucky or unlucky. k-fold cross-validation averages out much of that luck: split the data into k parts, train on k−1 and test on the held-out one, rotate so each part is the test set once, then average.

You get a more reliable estimate and a sense of variance: if scores swing wildly across folds, the model is unstable or the dataset is too small for any single score to be trusted.
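The rotation described above can be sketched without any library. `k_fold_indices` and `cross_validate` are illustrative names, the `fit` and `score` callables are placeholders for a real model, and the data is hypothetical. The toy "model" simply predicts the training mean.

```python
import statistics


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, and return all k scores."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return scores


# Toy stand-ins: the "model" is the training mean, scored by mean absolute error.
def fit_mean(xs, ys):
    return statistics.fmean(ys)


def score_mae(model, xs, ys):
    return statistics.fmean(abs(y - model) for y in ys)


xs = list(range(10))
ys = [3, 5, 4, 6, 5, 7, 6, 8, 7, 9]  # hypothetical targets
scores = cross_validate(xs, ys, k=5, fit=fit_mean, score=score_mae)

print("fold scores:", [round(s, 2) for s in scores])
print("mean:", round(statistics.fmean(scores), 2))
print("spread (stdev):", round(statistics.stdev(scores), 2))
```

This sketch uses contiguous folds. For data with no time ordering it is common to shuffle the rows once before splitting; for time-ordered data, use a split that respects the clock instead.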

Data leakage: the silent killer

Data leakage is when information unavailable at prediction time sneaks into training. The model looks brilliant in testing, then collapses in production. Classic causes:

Target leakage — a feature that’s a proxy for the answer. Example: predicting churn using cancellation_date, which only exists because the user churned.
Train/test contamination — scaling, imputing, or selecting features using the whole dataset before splitting, so test statistics leak into training.
Temporal leakage — training on future data to predict the past. Time-series splits must respect the clock; never shuffle randomly.
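A minimal sketch of how to avoid the last two causes: split by time first, then fit preprocessing statistics on the training portion only. The row layout (`timestamp`, `amount`), the helper names, and the data are all hypothetical.

```python
import statistics


def chronological_split(rows, train_fraction=0.8):
    """Sort rows by time; the earliest go to train, the latest to test."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def fit_scaler(values):
    """Learn mean and standard deviation from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero spread
    return mean, std


def apply_scaler(values, scaler):
    """Standardize values using previously learned statistics."""
    mean, std = scaler
    return [(v - mean) / std for v in values]


# Hypothetical rows: the two most recent amounts are much larger than the rest.
amounts = [10, 12, 11, 13, 12, 14, 13, 15, 40, 45]
rows = [{"timestamp": t, "amount": float(a)} for t, a in enumerate(amounts)]

train, test = chronological_split(rows)

# Fit on the training rows only, so test statistics never reach training.
scaler = fit_scaler([r["amount"] for r in train])
train_scaled = apply_scaler([r["amount"] for r in train], scaler)
test_scaled = apply_scaler([r["amount"] for r in test], scaler)

print("scaler mean:", scaler[0])  # scaler mean: 12.5
print("test rows, scaled:", [round(v, 2) for v in test_scaled])
```

Had the scaler been fitted on all ten rows, the two large recent amounts would have shifted the mean and spread used for training, which is exactly the contamination described above.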

Offline metrics aren’t the finish line

A model that wins on your test set can still lose in production. Distribution shift — real inputs drifting away from your training data — degrades models silently over time. That’s why the workflow doesn’t end at evaluation: you monitor live performance and re-evaluate continuously. That operational loop is MLOps.

Key takeaways

Compute a baseline before trusting any score. Use precision and recall, not accuracy, for imbalanced problems, and let the cost of each error type pick the trade-off. Cross-validate to average out the luck of a single split. Hunt relentlessly for data leakage: it’s one of the most common causes of models that look great offline and fail in production.