What Every Product Builder Should Know About Data Preprocessing
Lessons from studying data preprocessing and cleaning, and why they matter when you're building data-driven products.
I’m a software engineer pursuing an MSc in Data Analytics. For years, my relationship with data quality was through the backend lens: designing normalized database schemas, validating input at API boundaries, enforcing consistent formats before anything hits a table. I understood why data needs to be consistent. What I didn’t have was a structured pipeline for getting it there at scale.
The data preprocessing module in my course changed that. It introduced a formal framework (imputation strategies, outlier detection, feature reduction, dimensionality techniques) that I hadn’t encountered from the engineering side. Here’s what clicked for me as someone who builds products.
Your Data Is Messier Than You Think
The course uses an e-commerce company called ShopWave as a running example. They’re pulling data from website clickstreams, mobile app events, CRM records, payment transactions, and customer support logs. Each source has its own format, its own quirks, and its own failure modes.
This mirrors what happens in every product codebase. You start with one clean data source, maybe your own database. Then you integrate analytics. Then a payment provider. Then a third-party CRM. Each integration adds a layer of inconsistency that compounds over time.
Here are a few things that go wrong in practice:
Format inconsistencies are common. One system stores dates as 2026-01-28, another as 01/28/2026, another as a Unix timestamp. Your user’s name is “John Smith” in one table and “john smith” in another. Phone numbers come with country codes, without them, with dashes, with spaces.
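A minimal sketch of what normalizing these formats at ingestion might look like. The helper names and the accepted date formats are illustrative, not from any particular library:

```python
# Sketch: normalizing mixed date, name, and phone formats at ingestion.
# The accepted formats and helper names here are illustrative assumptions.
from datetime import datetime, timezone

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def parse_date(value):
    """Accept ISO dates, US-style dates, or Unix timestamps; return a date."""
    if isinstance(value, (int, float)):  # Unix timestamp
        return datetime.fromtimestamp(value, tz=timezone.utc).date()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

def normalize_name(name):
    """Collapse whitespace and apply title case: 'john  smith' -> 'John Smith'."""
    return " ".join(name.split()).title()

def normalize_phone(phone):
    """Strip separators, keeping digits and a leading '+' if present."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    return ("+" + digits) if phone.strip().startswith("+") else digits
```

With this in place, `parse_date("2026-01-28")` and `parse_date("01/28/2026")` resolve to the same value, which is the whole point: one canonical representation, however the source spelled it.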
Missing data appears in many forms. Optional fields are sometimes populated, sometimes null, sometimes an empty string. A mobile app upgrade changes which events get tracked, leaving gaps in your dataset that you don’t notice for weeks.
Duplicate records are harder to spot. A user signs up with their email, then later signs in with Google OAuth. Now you have two records for the same person. Your analytics counts them as two users. Your recommendation engine builds two separate profiles. Neither profile has enough data to be useful.
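One hedged sketch of collapsing those two records, assuming a normalized email address is a usable join key (in reality you often don't have one, which is what makes this hard). The record shape is hypothetical:

```python
# Sketch: merging duplicate user records that share a normalized email.
# The record fields (email, auth, created_at) are hypothetical.
def merge_user_records(records):
    """Group records by lowercased email and merge each group into one profile."""
    merged = {}
    for rec in records:
        key = rec["email"].strip().lower()
        profile = merged.setdefault(key, {"email": key, "signup_methods": set()})
        profile["signup_methods"].add(rec["auth"])
        # Keep the earliest creation date as the canonical signup date
        # (ISO date strings compare correctly as strings).
        created = rec["created_at"]
        if "created_at" not in profile or created < profile["created_at"]:
            profile["created_at"] = created
    return list(merged.values())

users = [
    {"email": "Jane@Example.com", "auth": "password", "created_at": "2026-01-02"},
    {"email": "jane@example.com", "auth": "google",   "created_at": "2026-01-10"},
]
profiles = merge_user_records(users)  # one profile, both auth methods
```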
These aren’t edge cases. This is the normal state of production data.
The Preprocessing Pipeline
The course lays out data preprocessing as a pipeline with distinct stages. This mental model is useful because it gives you a vocabulary for what’s actually happening when you “clean data.” It’s not one step. It’s several, and each one matters.
```mermaid
graph LR
    A[Collection] --> B[Cleaning]
    B --> C[Integration]
    C --> D[Transformation]
    D --> E[Reduction]
    E --> F[Normalization]
```
Collection is gathering raw data from your sources. This is where you define what you’re capturing and how.
Cleaning is fixing errors, handling missing values, removing duplicates, and dealing with outliers.
Integration is combining data from multiple sources into a coherent dataset. This is where schema conflicts and entity resolution happen.
Transformation is reshaping data into formats useful for analysis or modeling. Encoding categorical variables, aggregating records, deriving new features.
Reduction is trimming your dataset to what’s actually needed. Removing redundant features, sampling large datasets, compressing dimensions.
Normalization is scaling values so they’re comparable. A user’s age (20–80) and their purchase count (0–10,000) need to be on similar scales for many algorithms to work.
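The normalization stage is the easiest to make concrete. Here's a minimal min-max scaling sketch using the age and purchase-count example above:

```python
# Sketch: min-max scaling, so a 20-80 age range and a 0-10,000 purchase
# count both land on the same 0-1 scale.
def min_max_scale(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate column: every value identical
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 35, 50, 80]
purchases = [0, 12, 480, 10_000]
scaled_ages = min_max_scale(ages)            # [0.0, 0.25, 0.5, 1.0]
scaled_purchases = min_max_scale(purchases)  # all in [0, 1]
```

After scaling, a distance-based algorithm weighs age and purchase count comparably instead of letting the larger-ranged column dominate.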
As a product builder, you don’t need to run all six stages for every feature. But knowing they exist helps you ask the right questions. When someone says “the data is ready,” you can ask: ready how? Cleaned? Integrated? Normalized?
Data Cleaning Is the 80% You Didn’t Budget For
There’s a well-known statistic that data scientists spend 80% of their time on data cleaning. I used to think that was an exaggeration. It’s not.
The course walks through a practice exercise with order data that has realistic problems. Here’s a simplified version of what that looks like.
| Order ID | Customer | Date | Amount | Category | Status |
|---|---|---|---|---|---|
| 1001 | John Smith | 2026-01-15 | 299.99 | Electronics | Completed |
| 1002 | jane doe | 01/16/2026 | -50.00 | electronics | Pending |
| 1001 | John Smith | 2026-01-15 | 299.99 | Electronics | Completed |
| 1003 | Bob Wilson | 2026-01-17 | | Home & Garden | Completed |
| 1004 | JANE DOE | 2026-01-18 | 15000.00 | Electronics | Completed |
Five rows, six problems. A duplicate (rows 1 and 3). Inconsistent name casing (“jane doe” vs “JANE DOE”). Mixed date formats. A negative amount that’s probably a refund mixed in with orders. A missing amount. A suspiciously large value that might be an outlier or a data entry error.
Now multiply this by millions of rows and dozens of columns. That’s what your production data actually looks like.
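To make the cleaning step concrete, here's a sketch of one pass over rows like those above, with one fix per problem. The row shape, the review-queue idea, and the 10,000 cap are my own illustrative choices, not the course's:

```python
# Sketch: one cleaning pass over the sample order rows. The dict layout
# and the max_plausible threshold are illustrative assumptions.
from datetime import datetime

def parse_date(value):
    """Accept either of the two date formats seen in the sample rows."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {value!r}")

def clean_orders(rows, max_plausible=10_000):
    """Drop exact duplicates, unify casing and dates, and route suspect
    amounts (negative, missing, or implausibly large) to a review queue."""
    seen, cleaned, review = set(), [], []
    for row in rows:
        if row["order_id"] in seen:   # duplicate order ID: drop the repeat
            continue
        seen.add(row["order_id"])
        row = dict(row, customer=row["customer"].title(),
                   date=parse_date(row["date"]))
        amount = row["amount"]
        if amount is None or amount < 0 or amount > max_plausible:
            review.append(row)        # refund, gap, or possible outlier
        else:
            cleaned.append(row)
    return cleaned, review

orders = [
    {"order_id": 1001, "customer": "John Smith", "date": "2026-01-15", "amount": 299.99},
    {"order_id": 1002, "customer": "jane doe",   "date": "01/16/2026", "amount": -50.00},
    {"order_id": 1001, "customer": "John Smith", "date": "2026-01-15", "amount": 299.99},
    {"order_id": 1003, "customer": "Bob Wilson", "date": "2026-01-17", "amount": None},
    {"order_id": 1004, "customer": "JANE DOE",   "date": "2026-01-18", "amount": 15000.00},
]
cleaned, review = clean_orders(orders)  # 1 clean row, 3 routed for review
```

Note the design choice: suspect rows go to a review queue rather than being silently dropped. That negative amount might be a refund you want to keep in a different table, and the $15,000 order might be real.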
Missing value handling deserves particular attention. The course covers several strategies: deletion (remove the row), imputation (fill with the mean, median, or a predicted value), and flagging (keep the gap but mark it). Each has product implications.
If you delete rows with missing data, you might lose your most valuable users, the ones who signed up early before you tracked everything. If you impute with averages, you flatten out the interesting variation in your data. If you flag without handling, downstream features break in unexpected ways.
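The three strategies side by side on a toy column, so the tradeoffs are visible in a few lines:

```python
# Sketch: deletion, mean imputation, and flagging on the same column.
import statistics

values = [120.0, None, 80.0, None, 100.0]

# Deletion: drop the gaps (and whatever signal came with them).
deleted = [v for v in values if v is not None]

# Imputation: fill gaps with the mean of the observed values.
# Note how the imputed entries flatten out the variation.
mean = statistics.mean(deleted)
imputed = [v if v is not None else mean for v in values]

# Flagging: keep the gap but record where it was, so downstream
# consumers can decide for themselves.
flags = [v is None for v in values]
```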
Duplicates erode user trust. When a user sees the same product recommended twice, or their order history shows duplicate entries, they lose confidence in your product. Deduplication sounds simple until you’re matching records across systems with different identifiers and no shared key.
Outliers need context. That $15,000 order might be a data entry error, or it might be your best customer. Blindly removing outliers can mean removing your most important data points. The course recommends using domain knowledge to set reasonable bounds, not just statistical thresholds.
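One way to sketch that combination of statistical threshold plus domain bound. The IQR fence is a standard technique; the $50,000 business limit is a hypothetical stand-in for real domain knowledge:

```python
# Sketch: flag outliers statistically (IQR fences), then let a domain
# bound decide what is actually an error. MAX_PLAUSIBLE_ORDER is a
# hypothetical business rule, not a statistic.
import statistics

def iqr_bounds(values, k=1.5):
    """Classic Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

amounts = [29.99, 49.99, 59.99, 79.99, 89.99, 99.99, 120.00, 15000.00]
lo, hi = iqr_bounds(amounts)
statistical_outliers = [a for a in amounts if a < lo or a > hi]

# Statistically, 15000 stands out -- but the domain rule says orders up
# to 50,000 are plausible, so it is flagged for review, not deleted.
MAX_PLAUSIBLE_ORDER = 50_000
confirmed_errors = [a for a in statistical_outliers if a > MAX_PLAUSIBLE_ORDER]
```

Here the $15,000 order trips the statistical fence but survives the domain check, which is exactly the "best customer vs data entry error" distinction.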
Data Integration Is a Product Problem
This is where the course material felt most familiar. As a backend engineer, I’ve dealt with integrating data across services. But the course frames it as a formal discipline with distinct patterns, not just ad hoc API stitching.
The course covers three approaches.
ETL (Extract, Transform, Load) pulls data from sources, transforms it into a consistent format, and loads it into a central store. This is the classic data warehouse approach. It works well when you need consistent historical data and can tolerate some latency.
Data federation leaves data in place and queries across sources in real time. No central store, but every query pays the cost of cross-source joins. This is useful when data is too large to move or changes too frequently.
Data propagation copies data between systems on a schedule or in response to events. Think database replication or event-driven architectures.
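The ETL pattern is simple enough to sketch end to end. The two sources and their field names are hypothetical, chosen to mirror the schema-conflict example below:

```python
# Sketch: a toy ETL pass over two hypothetical sources that name the
# same timestamp differently. In-memory lists stand in for real systems.
def extract():
    payments = [{"transaction_date": "2026-01-15", "amount": 299.99}]
    crm = [{"created_at": "2026-01-15", "customer": "John Smith"}]
    return payments, crm

def transform(payments, crm):
    """Reconcile the schema conflict: both sources emit 'event_date'."""
    rows = [{"event_date": p["transaction_date"], "amount": p["amount"]}
            for p in payments]
    rows += [{"event_date": c["created_at"], "customer": c["customer"]}
             for c in crm]
    return rows

def load(rows, warehouse):
    warehouse.extend(rows)  # stand-in for a bulk insert into a central store

warehouse = []
load(transform(*extract()), warehouse)
```

Even at this scale the structure is visible: all the schema knowledge lives in `transform`, which is where real ETL pipelines accumulate most of their complexity.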
In product terms, the integration approach you choose determines your feature’s latency, consistency, and cost. A real-time recommendation engine can’t wait for a nightly ETL job. An analytics dashboard doesn’t need sub-second data federation. Matching the integration pattern to the product requirement saves you from over-engineering or under-delivering.
Schema conflicts are the real challenge. When your payment provider calls it transaction_date and your analytics platform calls it event_timestamp and your CRM calls it created_at, someone has to reconcile those. The course calls this “schema integration,” and it’s the kind of work that’s invisible until it’s wrong.
Entity resolution is equally tricky. Is “J. Smith” at john@example.com the same person as “John Smith” at johnsmith@example.com? Your product’s ability to build a unified user profile depends on getting this right. Bad entity resolution means fragmented user experiences.
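A deliberately naive sketch of the name-matching half of that question. Production entity resolution uses probabilistic matching across many signals; this toy rule (matching surname, compatible first name or initial) only illustrates the shape of the problem:

```python
# Sketch: a toy entity-resolution rule. Real systems use probabilistic
# record linkage; this heuristic exists only to illustrate the idea.
def likely_same_person(name_a, name_b):
    """Shared surname plus a first name that matches or abbreviates the other."""
    ta, tb = name_a.split(), name_b.split()
    if not ta or not tb:
        return False
    same_last = ta[-1].lower() == tb[-1].lower()
    fa, fb = ta[0].rstrip(".").lower(), tb[0].rstrip(".").lower()
    compatible_first = fa.startswith(fb) or fb.startswith(fa)
    return same_last and compatible_first
```

`likely_same_person("J. Smith", "John Smith")` returns `True`, which might be right, and that "might" is the whole difficulty: every rule like this trades false merges against fragmented profiles.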
What This Means for Product Builders
After working through this material, I’ve distilled a few practical takeaways.
Budget for data prep from the start. If a feature depends on data from multiple sources, assume at least half the work is getting that data into a usable state. This isn’t overhead. It’s the foundation the feature stands on.
Validate at the boundary. The cheapest place to catch data quality issues is when data enters your system. Pick canonical formats for dates, names, currencies, and identifiers. Enforce them in your data models. Schema validation, type checking, range constraints. Apply them on the way in. Every bad record you let through costs more to fix later.
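What boundary validation can look like in miniature. The field names, the canonical status vocabulary, and the $50,000 range are illustrative assumptions, not a prescription:

```python
# Sketch: validating an order at the API boundary before it is stored.
# Field names, statuses, and the amount range are illustrative.
from datetime import date

CANONICAL_STATUSES = {"pending", "completed", "refunded"}

def validate_order(raw):
    """Return a normalized order dict, or raise ValueError listing every problem."""
    errors = []
    amount = raw.get("amount")
    if not isinstance(amount, (int, float)):
        errors.append("amount must be numeric")          # type check
    elif not (0 <= amount <= 50_000):
        errors.append(f"amount out of range: {amount}")  # range constraint
    status = str(raw.get("status", "")).lower()
    if status not in CANONICAL_STATUSES:
        errors.append(f"unknown status: {status!r}")     # canonical vocabulary
    order_date = None
    try:
        order_date = date.fromisoformat(raw.get("date", ""))  # one date format
    except ValueError:
        errors.append(f"date must be ISO 8601: {raw.get('date')!r}")
    if errors:
        raise ValueError("; ".join(errors))
    return {"amount": float(amount), "status": status, "date": order_date}
```

Rejecting `{"amount": -50, "date": "01/16/2026"}` at the boundary costs one error response. Letting it through costs a cleaning pass, a review queue, and possibly a wrong metric.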
Monitor data quality in production. Data quality isn’t a one-time cleanup. Sources change, schemas drift, new edge cases appear. Treat data quality like uptime. Monitor it, alert on it, and have a process for fixing issues.
Know your missing data strategy before you need it. When a critical field is null, what happens? Does the feature degrade gracefully? Does it show an error? Does it silently produce garbage? Decide this during design, not during an incident review.
Document your transformations. When you derive a metric or combine fields, write down what you did and why. Six months from now, someone (probably you) will need to understand why user_score is calculated the way it is.
Wrapping Up
The formal framework changed how I see the problem. Before, I thought of data quality as a series of one-off fixes. Now I see it as a pipeline, each stage with its own failure modes and tradeoffs. That mental model is the real takeaway.
The biggest shift is in how I scope work. I used to estimate features based on the application logic: the API endpoints, the UI components, the business rules. Now I start with the data. What state is it in? What state does it need to be in? How do I get it there reliably?
If you’re building products that depend on data, and most products do now, this stuff isn’t optional. It’s not the exciting part. But it’s the difference between a feature that works in a demo and one that works in production.