Taking a Machine Learning Model From Notebook to Production

August 10, 2024


The most dangerous words in data science are "the model works great in the notebook."

A notebook is a controlled environment. Your data is a static CSV. The preprocessing happens in a specific order you remember. The feature engineering relies on a global variable you defined three cells earlier. The model is loaded from a pickle file you saved six months ago and probably can't reproduce.

Production is none of those things. Production means live data, concurrent requests, changing schemas, monitoring, retraining, and an angry stakeholder asking why the prediction changed.

Here's the gap I've learned to close.

The notebook is for exploration, not production code

I treat notebook code as research notes — disposable, non-linear, full of dead ends. Once a model approach is validated, I rewrite the pipeline from scratch as clean, modular Python.

That means:

src/
    data/
        ingestion.py        # fetch + validate raw data
        preprocessing.py    # feature engineering, encoding, scaling
    model/
        train.py            # training loop, hyperparameter search
        evaluate.py         # metrics, threshold selection
        predict.py          # inference logic only
    utils/
        config.py           # all parameters in one place
        logging.py

Nothing in predict.py should depend on anything in train.py except the saved artifacts (model, scaler, encoder). If it does, you'll have silent bugs when inference runs in a different environment.
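To make that concrete, here is roughly what predict.py can look like under this structure: it touches only the saved artifacts and nothing from the training modules. The paths and feature handling below are illustrative, not a fixed API.

# predict.py (sketch): depends only on saved artifacts, never on train.py
import joblib
import pandas as pd

scaler = joblib.load("artifacts/scaler.pkl")   # loaded once at startup
model = joblib.load("artifacts/model.pkl")

def predict(record: dict) -> float:
    """Run inference on a single, already-validated input record."""
    frame = pd.DataFrame([record])
    features = scaler.transform(frame)
    return float(model.predict(features)[0])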

Reproducibility starts with your preprocessing

The single most common production bug I've seen: the preprocessing in training doesn't match the preprocessing in inference.

In training, you fit a StandardScaler on your training data. In inference, you need to apply that same scaler — the one fitted on training data — to new inputs. If you refit it on the incoming request, you'll scale the data differently and the model will produce nonsense predictions silently.

The fix is simple: serialize every preprocessing artifact alongside the model.

import joblib

# Save during training
joblib.dump(scaler, "artifacts/scaler.pkl")
joblib.dump(encoder, "artifacts/encoder.pkl")
joblib.dump(model, "artifacts/model.pkl")

# Load during inference — same objects, same parameters
scaler = joblib.load("artifacts/scaler.pkl")
encoder = joblib.load("artifacts/encoder.pkl")
model = joblib.load("artifacts/model.pkl")

Version your artifacts alongside your code. A model artifact with no corresponding code version is a future debugging nightmare.
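One lightweight way to do that, assuming the training code lives in a git repository, is to write a small metadata file next to the artifacts at training time. The fields below are just a starting point, not a required schema.

import json
import subprocess
from datetime import datetime, timezone

import sklearn

# Record the code version and key library versions next to the artifacts
commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

metadata = {
    "git_commit": commit,
    "sklearn_version": sklearn.__version__,
    "trained_at": datetime.now(timezone.utc).isoformat(),
}

with open("artifacts/metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)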

Validate inputs before you predict

Production data is messier than training data. Always.

Before running inference, validate that incoming inputs match what the model was trained on:

  • Are all expected features present?
  • Are numeric features within a reasonable range?
  • Are categorical features within the set seen during training?

A model that is passed an out-of-distribution input will usually still produce a prediction — it won't throw an error, it will just produce the wrong answer. Input validation is your first line of defense.

def validate_input(data: dict, expected_features: list, feature_ranges: dict) -> None:
    missing = set(expected_features) - set(data.keys())
    if missing:
        raise ValueError(f"Missing features: {missing}")
    for feature, (min_val, max_val) in feature_ranges.items():
        if not (min_val <= data[feature] <= max_val):
            raise ValueError(f"{feature} out of expected range")
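The same idea extends to categorical features. A sketch, assuming you saved the set of categories seen during training as another artifact (the feature names and allowed values below are illustrative):

def validate_categories(data: dict, allowed_values: dict) -> None:
    # allowed_values maps feature name -> set of categories seen in training
    for feature, allowed in allowed_values.items():
        if data[feature] not in allowed:
            raise ValueError(f"{feature}={data[feature]!r} not seen during training")

# Example usage (values are illustrative)
record = {"age": 42, "income": 55_000, "plan": "premium"}
validate_input(
    record,
    expected_features=["age", "income", "plan"],
    feature_ranges={"age": (18, 100), "income": (0, 1_000_000)},
)
validate_categories(record, allowed_values={"plan": {"basic", "premium", "enterprise"}})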

Monitor model performance, not just system performance

Most ML monitoring setups track CPU, memory, and latency. That's necessary but not sufficient.

Models degrade silently. A production model trained six months ago might still run without errors but produce systematically wrong predictions because the real-world distribution shifted. This is called data drift, and it's invisible unless you're watching for it.

Monitoring I add to every deployed model:

  • Prediction distribution: Is the distribution of output scores shifting week over week?
  • Feature distribution: Are input values drifting from training distributions?
  • Ground truth lag: For models where labels arrive later (churn prediction, fraud), track actual outcomes vs. predictions once labels are available.

I set thresholds on these metrics and alert when they breach. The alert doesn't mean "redeploy immediately" — it means "go look." Usually there's a benign explanation. Sometimes there isn't.
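As a rough sketch of the first two checks, you can compare a recent window of predictions (or feature values) against a reference sample saved at training time with a two-sample KS test. The p-value threshold is an assumption you'd tune per model, and the arrays below are stand-ins for your real samples.

import numpy as np
from scipy.stats import ks_2samp

def drifted(reference: np.ndarray, recent: np.ndarray, p_threshold: float = 0.01) -> bool:
    """Flag drift when the recent sample differs significantly from the
    reference (training-time) sample under a two-sample KS test."""
    _, p_value = ks_2samp(reference, recent)
    return p_value < p_threshold

# Weekly check of output scores against a training-time snapshot (illustrative data)
reference_scores = np.random.default_rng(0).normal(0.30, 0.10, size=5_000)
recent_scores = np.random.default_rng(1).normal(0.45, 0.10, size=1_000)

if drifted(reference_scores, recent_scores):
    print("Prediction distribution shifted: go look.")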

Build retraining into the design

A model is not a one-time artifact. Treat it like a data product that needs regular updates.

Questions to answer before deploying:

  • How often will this model be retrained?
  • What triggers a retrain — time schedule, performance degradation, or both?
  • What does the approval process look like before a new model version goes live?
  • How do you roll back if a new version performs worse?

For the stock price forecasting model I built, I set up a monthly retraining job on AWS SageMaker that automatically ran backtesting on the new model before promoting it to production. If the new model's RMSE was worse than the current production model's, the job stopped and flagged the run for manual review.
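The gate itself is simple comparison logic. A simplified sketch of that check (the real job ran on SageMaker, so the names and numbers here are illustrative):

def should_promote(candidate_rmse: float, production_rmse: float) -> bool:
    """Promote the candidate only if its backtest RMSE is no worse than production's."""
    return candidate_rmse <= production_rmse

# Illustrative values pulled from the two backtest reports
candidate_rmse = 1.92
production_rmse = 2.05

if should_promote(candidate_rmse, production_rmse):
    print("Promoting candidate model to production.")
else:
    print("Candidate underperforms current production model: flag for manual review.")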

The boring infrastructure is the actual product

The model itself is often 20% of the work. The other 80% is everything around it: data validation, artifact management, monitoring, retraining pipelines, rollback procedures, documentation.

That 80% is unglamorous and easy to skip when you're excited about model performance. But it's the difference between a proof of concept and something that earns trust — and keeps earning it six months after you built it.
