Designing Data Pipelines That Actually Scale
The first pipeline I built was a Python script on a cron job — pulled data from a source system, ran some transforms, dumped results to a database. It worked perfectly for about three months.
Then the source schema changed. Then volume tripled. Then two other teams started depending on the output. Then it silently started producing wrong numbers with no error, no alert, nothing.
That's the lifecycle of a naive data pipeline. Here's how I think about building ones that don't follow that path.
Start with the contract, not the code
Before writing a single transformation, define the contract:
- What is the source? Schema, update frequency, owner, SLA.
- What is the output? Column definitions, expected row counts, update cadence.
- Who depends on it? Downstream tables, dashboards, ML models, reports.
Writing this forces clarity on what "correct" even means. Without it, you have no baseline to detect drift against. I keep a simple markdown doc next to every pipeline. When the source changes, the doc updates first — before any code does.
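For illustration, here's roughly the shape of such a doc. Everything in it, including the table and team names, is hypothetical:

```markdown
# Contract: mart.daily_orders

## Source
- System: orders service (Postgres replica)
- Schema: order_id (uuid), amount (numeric), event_date (date)
- Update frequency: continuous; replication lag SLA of 15 minutes
- Owner: orders-platform team

## Output
- Columns: event_date, order_count, total_amount
- Expected volume: roughly 50k-80k source rows per day
- Update cadence: daily, complete by 06:00 UTC

## Consumers
- Revenue dashboard (Power BI)
- Weekly finance report
```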
Idempotency is non-negotiable
Every pipeline run should produce the same result given the same inputs, and rerunning it must not change what's already written. If running the pipeline twice on the same day produces duplicated rows or different output, you have a problem that will eventually corrupt your data — silently.
This means:
- Avoid `INSERT` without deduplication. Use `MERGE`/`UPSERT` semantics or truncate-and-reload.
- Parameterize by date, not "now." `WHERE event_date = CURRENT_DATE` is dangerous in a pipeline; `WHERE event_date = {{ execution_date }}` is not.
- Test reruns explicitly. Before deploying, run the pipeline twice on the same window and diff the outputs.
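Here's a minimal sketch of that pattern in Python, assuming a daily pipeline writing to a SQL warehouse. The table names and the `run_query` helper are hypothetical:

```python
from datetime import date

def load_daily_orders(execution_date: date, run_query) -> None:
    """Idempotent load: delete-then-insert one date partition.

    Rerunning for the same execution_date overwrites that day's
    output instead of appending duplicates. run_query is a
    hypothetical helper that executes parameterized SQL inside
    the caller's transaction.
    """
    # Clear anything previously written for this window.
    run_query(
        "DELETE FROM mart.daily_orders WHERE event_date = %(d)s",
        {"d": execution_date},
    )
    # Rebuild the window from staging. The date is an explicit
    # parameter, never CURRENT_DATE.
    run_query(
        """
        INSERT INTO mart.daily_orders (event_date, order_count, total_amount)
        SELECT event_date, COUNT(*), SUM(amount)
        FROM staging.orders
        WHERE event_date = %(d)s
        GROUP BY event_date
        """,
        {"d": execution_date},
    )
```

Run inside a single transaction, delete-then-insert gives you MERGE-like semantics even on warehouses without a native MERGE.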
Modular beats monolithic every time
The worst pipeline I inherited was 2,000 lines of SQL in a single stored procedure — extraction, transformation, business logic, and loading all in one pass. Nobody could touch it without fear.
I now structure pipelines in layers:
- `raw/` — Exact copy of source, zero transformation
- `staging/` — Type casting, deduplication, basic cleaning
- `intermediate/` — Business logic, joins, derived fields
- `mart/` — Final output, aggregated, query-optimized
Each layer is independently testable. Breaking changes in the source only affect `raw`. Business logic changes only touch `intermediate`. This is the dbt philosophy, and it holds up.
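To make the layering concrete, here's a sketch of a `staging` step as a standalone pandas function. The column names are hypothetical:

```python
import pandas as pd

def stage_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """staging layer: type casting, deduplication, basic cleaning.

    No business logic lives here; that belongs in intermediate.
    """
    df = raw.copy()
    # Cast types explicitly rather than trusting the source.
    df["order_id"] = df["order_id"].astype(str)
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df["event_date"] = pd.to_datetime(df["event_date"]).dt.date
    # Deduplicate on the natural key, keeping the most recent row.
    df = df.sort_values("event_date").drop_duplicates("order_id", keep="last")
    return df
```

Because it takes a DataFrame and returns a DataFrame, this layer is trivially unit-testable in isolation.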
Data quality checks belong inside the pipeline
Quality checks should run within the pipeline, not as an afterthought downstream. Every mart table I build has at minimum:
- Row count check: Is today's output within ±20% of yesterday's?
- Null check: Are primary key fields fully populated?
- Freshness check: Was the source actually updated before this pipeline ran?
- Range check: Are numeric KPIs within business-plausible bounds?
If any check fails, the pipeline halts and alerts. A pipeline that fails loudly is infinitely better than one that succeeds quietly with wrong data.
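A minimal sketch of those checks as plain Python, independent of any framework. The inputs would come from queries against the source and the mart; the 20% threshold matches the list above:

```python
from datetime import datetime, timedelta

def run_quality_checks(
    todays_rows: int,
    yesterdays_rows: int,
    null_pk_count: int,
    source_last_updated: datetime,
) -> None:
    """Raise (halting the pipeline) on the first failed check."""
    # Row count: today's output within +/-20% of yesterday's.
    if yesterdays_rows and abs(todays_rows - yesterdays_rows) / yesterdays_rows > 0.20:
        raise ValueError(f"Row count drift: {todays_rows} vs {yesterdays_rows}")
    # Null check: primary key fields fully populated.
    if null_pk_count > 0:
        raise ValueError(f"{null_pk_count} rows with null primary key")
    # Freshness: source must have updated recently enough to trust.
    if datetime.utcnow() - source_last_updated > timedelta(hours=24):
        raise ValueError(f"Stale source: last updated {source_last_updated}")
    # Range checks on numeric KPIs follow the same raise-on-violation shape.
```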
At Traffic Management Inc., I built a `great_expectations` check suite for our Power BI data feeds. We caught a broken source integration two hours after it happened — instead of discovering it during a stakeholder review three days later.
Handle late-arriving data deliberately
In event-based systems, data for `event_date = yesterday` keeps arriving hours or days after midnight. If your pipeline closes the window at midnight, you'll systematically undercount late events.
Options I've used:
- Reprocessing windows: Rerun the last 3 days on every execution. Slightly expensive but correct.
- Watermarking: Track the latest event timestamp seen and only close a window once no new events arrive for N hours.
- Late-data tables: Capture late arrivals separately and reconcile on a slower cadence.
Ignoring the problem means your "final" numbers keep changing after you've reported them — which erodes stakeholder trust fast.
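Here's a sketch of the reprocessing-window option, the simplest of the three. It assumes a hypothetical `load_partition` function like the idempotent loader sketched earlier:

```python
from datetime import date, timedelta

REPROCESS_DAYS = 3  # rerun the trailing window on every execution

def run_with_reprocessing(execution_date: date, load_partition) -> None:
    """Reload the last N days so late-arriving events get picked up.

    This is only safe because load_partition is idempotent:
    rerunning a day overwrites it rather than double-counting it.
    """
    for offset in range(REPROCESS_DAYS):
        load_partition(execution_date - timedelta(days=offset))
```

This is also why idempotency comes first: reprocessing windows double-count unless reruns are safe.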
Monitor the pipeline, not just the output
Most teams monitor dashboards. Fewer monitor the pipelines feeding those dashboards. I set up monitoring at three levels:
- Execution: Did it run? How long did it take? Did it fail?
- Volume: Are row counts consistent with expectations?
- Drift: Is the KPI moving in a way that's explainable by real events?
Alerting on execution time is underrated. A pipeline that normally runs in 4 minutes and suddenly takes 40 minutes didn't fail — but something broke. Catching that early saves a lot of downstream fire-fighting.
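A sketch of that runtime alert, assuming run durations are logged somewhere queryable. The `send_alert` callback and the history list are hypothetical:

```python
from statistics import median

def check_runtime(duration_s: float, recent_durations: list[float], send_alert) -> None:
    """Alert when a run takes far longer than its recent baseline."""
    if len(recent_durations) < 5:
        return  # not enough history to judge
    baseline = median(recent_durations)
    # A successful run at 5x the typical duration is a broken run
    # that happened to finish.
    if duration_s > 5 * baseline:
        send_alert(
            f"Runtime anomaly: {duration_s:.0f}s vs median {baseline:.0f}s"
        )
```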
The engineering is in the boring parts
The transformation logic in most pipelines isn't the hard part. The hard part is the scaffolding: contracts, idempotency, quality checks, monitoring, documentation that lets someone else debug it at 9pm when you're unavailable.
Pipelines that scale aren't clever. They're disciplined.