My Data Science Workflow: From Messy Data to Real Decisions
The version of data science in tutorials is clean. You download a perfectly formatted CSV, run a model, get a great accuracy score, and call it done. The version in real life is different. The data is in three systems with conflicting schemas, the business question changes twice during the project, and "accuracy" isn't how success gets measured.
This is the workflow I've developed through real projects — at Genesis Motor America, at Traffic Management Inc., at CSULB, and across seven hackathons. It's not the only way to do data science. It's the way I've found that consistently produces work that actually gets used.
Phase 1: Frame the problem before touching the data
I spend the first chunk of every project on business understanding — not data exploration.
This means interviewing stakeholders to answer:
- What decision does this project exist to improve?
- How is that decision currently being made?
- What would "good enough" look like as an output?
- What happens if we get it wrong?
The last question is underrated. A model that flags healthy customers as churn risks (false positives) wastes outreach budget. A model that misses actual churners (false negatives) loses revenue. Which is worse depends entirely on the business context — and it determines which metric I optimize for.
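To make that concrete, here is a toy sketch of how the same two models can rank differently depending on which error is costlier. Every count and dollar figure below is invented for illustration, not taken from a real project:

```python
# Toy comparison: which error type is costlier decides which model "wins".
# All confusion-matrix counts and per-error costs are illustrative assumptions.

def expected_cost(fp, fn, cost_fp, cost_fn):
    """Total dollar cost of a model's errors, given a cost per error type."""
    return fp * cost_fp + fn * cost_fn

# Model A makes fewer false positives; Model B makes fewer false negatives.
model_a = {"fp": 50, "fn": 200}
model_b = {"fp": 180, "fn": 80}

# Scenario 1: outreach is cheap ($5), a lost customer is expensive ($300).
for name, m in [("A", model_a), ("B", model_b)]:
    print("Scenario 1, model", name, expected_cost(m["fp"], m["fn"], cost_fp=5, cost_fn=300))
# -> Model B is cheaper here, because missed churners dominate the cost.

# Scenario 2: outreach is expensive ($80), lost customers are low-value ($60).
for name, m in [("A", model_a), ("B", model_b)]:
    print("Scenario 2, model", name, expected_cost(m["fp"], m["fn"], cost_fp=80, cost_fn=60))
# -> Model A is cheaper here, because wasted outreach dominates the cost.
```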
Writing a one-page problem statement before starting any analysis has saved me weeks of rework.
Phase 2: Data auditing — trust nothing
When I get access to a new dataset, I don't start modeling. I start interrogating.
```python
import pandas as pd

df = pd.read_csv("data.csv")

# Basic shape
print(df.shape)
print(df.dtypes)

# Missing values
print(df.isnull().sum() / len(df) * 100)

# Duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")

# Distribution of key fields
print(df.describe())

# Categorical cardinality
for col in df.select_dtypes("object").columns:
    print(f"{col}: {df[col].nunique()} unique values")
```
I'm looking for:
- Missing data patterns: Is missingness random, or correlated with something meaningful? (See the sketch after this list.)
- Unexpected distributions: Bimodal distributions, suspicious cliffs at round numbers, date columns with gaps.
- Duplicates and leakage: Are there rows that shouldn't exist? Any features that encode the target?
- Encoding inconsistencies: "CA", "California", "ca" all meaning the same thing.
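For the missingness question, one quick check is to compare the target rate on rows where a field is missing against rows where it is present; a large gap suggests the missingness itself carries signal. A minimal sketch, assuming the `df` from the audit above plus a binary `target` column (the column name is a placeholder):

```python
# Does missingness in each column correlate with the target?
# Assumes df from the audit step and a 0/1 "target" column (placeholder name).
for col in df.columns.drop("target"):
    missing = df[col].isnull()
    if 0 < missing.sum() < len(df):
        rate_missing = df.loc[missing, "target"].mean()
        rate_present = df.loc[~missing, "target"].mean()
        print(f"{col}: target rate {rate_missing:.2%} when missing "
              f"vs {rate_present:.2%} when present")
```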
I document every anomaly I find. Some will be irrelevant. Some will turn out to be the most important thing in the whole project.
Phase 3: Exploratory analysis with a hypothesis list
Before running a single model, I build a set of hypotheses about what might predict the target. Then I test them visually and statistically.
For a customer churn project:
- H1: Longer-tenure customers churn less → Plot churn rate by tenure bucket
- H2: Customers on month-to-month contracts churn more → Cross-tab contract type vs. churn
- H3: High support ticket volume predicts churn → Correlation analysis
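A minimal sketch of how these checks might look in pandas, assuming a DataFrame `df` with columns `tenure`, `contract_type`, `support_tickets`, and a 0/1 `churned` flag (all placeholder names for the example):

```python
import pandas as pd

# H1: churn rate by tenure bucket
tenure_buckets = pd.cut(df["tenure"], bins=[0, 6, 12, 24, 48, 120])
print(df.groupby(tenure_buckets)["churned"].mean())

# H2: churn rate by contract type (row-normalized cross-tab)
print(pd.crosstab(df["contract_type"], df["churned"], normalize="index"))

# H3: correlation between support ticket volume and churn
print(df["support_tickets"].corr(df["churned"]))
```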
Testing hypotheses before modeling gives me two things: a gut-check on whether the data contains signal, and a starting point for feature engineering that's grounded in domain logic rather than pure statistical fishing.
Phase 4: Feature engineering — where the real value is
Raw features rarely arrive in the form a model can make the most of. Feature engineering is where you encode domain knowledge.
A timestamp becomes: day of week, hour, days since first purchase, days since last purchase.
A raw dollar amount becomes: log-transformed value (if skewed), percentile within segment, ratio to segment average.
A zip code becomes: median household income (via census join), urban/rural classification, distance to nearest store.
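As a sketch of what a couple of these transforms look like in pandas (the column names `order_ts`, `customer_id`, `amount`, and `segment` are assumptions for the example, not from a real schema):

```python
import numpy as np
import pandas as pd

# Timestamp-derived features (assumes an order_ts datetime column per customer)
df["order_ts"] = pd.to_datetime(df["order_ts"])
df["day_of_week"] = df["order_ts"].dt.dayofweek
df["hour"] = df["order_ts"].dt.hour
first_purchase = df.groupby("customer_id")["order_ts"].transform("min")
df["days_since_first_purchase"] = (df["order_ts"] - first_purchase).dt.days

# Dollar-amount features (assumes amount and segment columns)
df["log_amount"] = np.log1p(df["amount"])
df["amount_pct_in_segment"] = df.groupby("segment")["amount"].rank(pct=True)
df["amount_vs_segment_avg"] = df["amount"] / df.groupby("segment")["amount"].transform("mean")
```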
I keep a running list of feature ideas, rate each by expected signal strength and implementation cost, and work through the list. The best features I've built have usually come from asking "what would a human analyst use to make this judgment?" and then encoding that reasoning as a variable.
Phase 5: Model selection — start simple
The first model I try is almost always logistic regression or a decision tree. Not because it'll win, but because it gives me a baseline, it's fast to iterate on, and its outputs are interpretable enough to sanity-check against domain knowledge.
If the simple model is clearly underperforming (and it often is), I move to gradient boosting — XGBoost or LightGBM. These are my workhorses for tabular data. They handle missing values, require minimal preprocessing, and consistently outperform simpler models.
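A minimal sketch of that progression, assuming a numeric feature matrix `X` and binary target `y` already exist (scikit-learn and xgboost; the hyperparameters are placeholders, not tuned values):

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: interpretable, fast to iterate on
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
print("Baseline AUC:", roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))

# Workhorse for tabular data: gradient boosting (placeholder hyperparameters)
booster = XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=5)
booster.fit(X_train, y_train)
print("XGBoost AUC:", roc_auc_score(y_test, booster.predict_proba(X_test)[:, 1]))
```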
I use deep learning only when I have a genuine reason: unstructured data (text, images, time series), very large datasets, or a specific architecture requirement. For most business tabular data problems, a well-tuned XGBoost beats a neural network in both performance and explainability.
Phase 6: Evaluation against the business metric
Model accuracy is rarely the metric that matters. I evaluate models against business outcomes wherever possible:
- Churn model: Revenue saved per dollar of outreach spend
- Lead scoring model: Conversion rate lift vs. baseline in the top decile
- Demand forecasting: Reduction in inventory carrying cost
This requires some assumptions and approximations, and I make them explicit rather than burying them. A model that's 85% accurate but generates $500K in recovered revenue is better than one that's 91% accurate with no clear business impact.
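As an example of making those assumptions explicit, here is a back-of-the-envelope calculation for the churn case. Every number in it is a placeholder assumption to be replaced with real figures from the business:

```python
# Back-of-the-envelope value of a churn model; all inputs are assumptions.
customers_contacted = 2_000     # top-scored customers targeted for outreach
true_churners_caught = 700      # of those, how many would actually have churned
save_rate = 0.30                # fraction of contacted churners retained
avg_annual_value = 600          # revenue per retained customer ($)
cost_per_contact = 15           # outreach cost per customer ($)

revenue_saved = true_churners_caught * save_rate * avg_annual_value
outreach_cost = customers_contacted * cost_per_contact

print(f"Revenue saved: ${revenue_saved:,.0f}")
print(f"Outreach cost: ${outreach_cost:,.0f}")
print(f"Revenue saved per outreach dollar: {revenue_saved / outreach_cost:.2f}")
```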
Phase 7: Communicate clearly — then communicate again
The best model in the world fails if nobody understands the output. I structure every project deliverable around three audiences:
Executive summary (1 page): What we found, what we recommend, what it's worth. No methodology, no caveats, no acronyms.
Business narrative (5-10 pages): The key findings with supporting visualizations, confidence levels stated plainly in English, and clear recommendations with estimated impact.
Technical appendix: Methodology, feature importance, model performance across segments, data limitations.
Most stakeholders read the first two. The technical appendix is for reproducibility and for the next analyst who inherits the work.
The workflow is the product
Most data science projects fail not because the model was wrong, but because the process broke down somewhere between framing and delivery. The workflow is what keeps a project on track when the data is messier than expected, the requirements shift, or the timeline compresses.
Build good habits around the process, and the technical work takes care of itself.