Spreadsheets as a Thinking Tool: Data Modeling, Analysis, and Decision-Making
Section 6 of 14

How to Clean and Transform Data in Spreadsheets

We spent the last section designing data entry systems to prevent problems before they start — because the best cleaning is the cleaning you never have to do. But in the real world, clean data entry isn't always under your control. The dataset you receive might come from a legacy system, an ERP export, a PDF conversion, or a colleague's spreadsheet that predates your standardized practices. It comes with trailing spaces, hidden characters, dates formatted as text, and values that look consistent until you try to sum them. This is completely normal.

What separates analysts who produce reliable results from those who don't usually isn't analytical sophistication — it's the discipline and toolkit to clean data before any analysis begins. The decisions made in this phase determine whether your SUMIFS returns the right answer or silently misses half the records because "New York" and "New York " (with a trailing space) are different strings as far as Excel is concerned.

This section is your cleaning toolkit. We'll cover formulas for surgical text repairs, Power Query for when you need an industrial-grade pipeline, and a practical checklist you can run on any unfamiliar dataset. You'll develop a habit of distrust — never trusting data you haven't checked — which is genuinely one of the most valuable analytical skills there is.


The Golden Rule: Keep Source Data Untouched

Before we dig into specific techniques, one principle: leave the source data untouched in one area, and write your cleaning formulas in a separate column or sheet. Never overwrite source data with cleaning formulas — you may need to go back. This sounds obvious until you're three hours deep in analysis and realize your cleaning formula had a logic error, and the only way to fix it is to start over from the mangled original.


TRIM: The Single Most Used Cleaning Formula

Extra spaces are the silent killer of lookups, aggregations, and matches. They're invisible, they're everywhere in exported data, and they cause VLOOKUP to return #N/A while you stare at two cells that look absolutely identical.

TRIM removes leading and trailing spaces and collapses runs of internal spaces to a single space:

=TRIM(A2)

If A2 contains " New York ", TRIM returns "New York". That's it. That's the whole formula. Use it on every text column in any dataset you didn't create yourself.


CLEAN: Removing the Invisible

CLEAN strips non-printing characters — the ASCII codes 0 through 31 that can embed themselves in data exported from legacy systems, web scrapers, or PDFs. These characters are truly invisible; you can't see them, but they'll break your formulas and lookups in ways that feel supernatural until you understand what's happening.

=CLEAN(A2)

Combine TRIM and CLEAN for a powerful one-two punch:

=TRIM(CLEAN(A2))

This combination handles the vast majority of text-is-not-matching problems in exported data. It's worth making this your default move on any unfamiliar text column.
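If you ever script the same cleanup outside the spreadsheet, both operations are a few lines each. Here's a rough Python sketch of what TRIM and CLEAN do (the helper names are mine, not Excel's):

```python
import re

def excel_clean(text: str) -> str:
    # CLEAN: drop non-printing ASCII control characters (codes 0-31)
    return "".join(ch for ch in text if ord(ch) > 31)

def excel_trim(text: str) -> str:
    # TRIM: strip leading/trailing spaces, collapse runs of internal
    # spaces to one space. Like Excel's TRIM, this only handles the
    # plain space character, not non-breaking spaces from web data.
    return re.sub(r" +", " ", text.strip(" "))

# The =TRIM(CLEAN(A2)) one-two punch:
messy = " New \x0b York  "   # \x0b is an invisible control character
print(excel_trim(excel_clean(messy)))  # → "New York"
```

The order matters the same way it does in the formula: strip the invisible characters first, then tidy the spaces that remain.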


PROPER, UPPER, LOWER: Standardizing Case

Inconsistent capitalization is a data integrity problem, not just an aesthetic one. "new york", "New York", and "NEW YORK" are three different strings. If you're grouping by city in a pivot table or using them in a lookup, you'll get three separate buckets when you wanted one.

  • =PROPER(A2) — Capitalizes the first letter of each word: "new york" → "New York"
  • =UPPER(A2) — ALL CAPS: "new york" → "NEW YORK"
  • =LOWER(A2) — all lowercase: "New York" → "new york"

PROPER is the most commonly useful for names and places, though it has quirks: it capitalizes after any non-letter character, so "o'brien" becomes "O'Brien" correctly, but it also lowercases everything else, so "McDonald" becomes "Mcdonald" and "USA" becomes "Usa". Use LOWER when feeding data into lookup formulas if you want case-insensitive matching (SEARCH is already case-insensitive, but EXACT and direct = comparisons are not).

Tip: Chain these together with TRIM for the nuclear option: =LOWER(TRIM(CLEAN(A2))). It's overkill for most tasks, but it's your move when a lookup is mysteriously failing and you've exhausted other explanations.


Extracting Substrings: LEFT, RIGHT, MID, FIND, and SEARCH

Sometimes the data you need is inside a cell, jumbled up with data you don't need. A product code like "SKU-7842-A" might need to be split into its component parts. An email address contains a domain. A full name needs to be broken into first and last. These functions let you surgically extract what you want.

LEFT, RIGHT, MID are the three basic extraction functions:

=LEFT(text, num_chars)       — extract from the left
=RIGHT(text, num_chars)      — extract from the right
=MID(text, start_num, num_chars)  — extract from the middle

The problem with a fixed character count is that it only works if every cell has the exact same structure. "Smith, John" and "Rodriguez, Maria" have different lengths. This is where FIND and SEARCH become essential — they let you find the position of something, and then extract relative to that position.

Both find the position of one text string within another:

  • =FIND("@", A2) — case-sensitive, returns the position of @
  • =SEARCH("@", A2) — case-insensitive, supports wildcards

To extract everything to the left of an @ sign (the username in an email address):

=LEFT(A2, FIND("@", A2) - 1)

The -1 drops the @ itself from the result. To extract the domain (everything after the @):

=MID(A2, FIND("@", A2) + 1, LEN(A2))

LEN returns the length of the entire string, so asking for more characters than exist just returns everything to the end — a useful shortcut that saves you from having to calculate the exact length.

graph LR
    A["'john.smith@company.com'"] --> B["FIND('@', A2) = 11"]
    B --> C["LEFT: extract first 10 chars → 'john.smith'"]
    B --> D["MID: extract from char 12 → 'company.com'"]
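The same position-then-extract logic reads naturally in a script, where Python's str.index plays the role of FIND (a sketch; the helper name is mine):

```python
def split_email(addr: str):
    # at corresponds to FIND("@", A2) - 1, since Python indexes from 0
    at = addr.index("@")
    # addr[:at]      ≈ LEFT(A2, FIND("@", A2) - 1)
    # addr[at + 1:]  ≈ MID(A2, FIND("@", A2) + 1, LEN(A2))
    return addr[:at], addr[at + 1:]

print(split_email("john.smith@company.com"))
# → ('john.smith', 'company.com')
```

Like LEN in the MID formula, slicing past the end of the string in Python simply returns everything to the end, so neither version needs to compute the exact domain length.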

Tip: SEARCH is usually preferable to FIND for cleaning work because it's case-insensitive and supports wildcards. Reserve FIND for when you specifically need case-sensitive matching.


SUBSTITUTE and REPLACE: Fixing Systematic Problems

When you need to swap out text patterns across a whole column, SUBSTITUTE is your scalpel.

=SUBSTITUTE(text, old_text, new_text, [instance_num])

Say every phone number in your dataset was exported as "1-800-555-1234" and you need them as "18005551234" (digits only). Chain SUBSTITUTE calls:

=SUBSTITUTE(SUBSTITUTE(A2, "-", ""), " ", "")

The first SUBSTITUTE removes hyphens, the second removes spaces. You can nest as many as you need.
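If this cleanup ever moves into a script, one regular expression covers the whole family of nested SUBSTITUTE calls (a sketch; note it's slightly more aggressive, dropping every non-digit rather than just hyphens and spaces):

```python
import re

def digits_only(phone: str) -> str:
    # \D matches any non-digit character, generalizing the nested
    # SUBSTITUTE(SUBSTITUTE(A2, "-", ""), " ", "") pattern
    return re.sub(r"\D", "", phone)

print(digits_only("1-800-555-1234"))   # → '18005551234'
print(digits_only("(800) 555 1234"))   # → '8005551234'
```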

The optional fourth argument lets you replace only the nth occurrence. =SUBSTITUTE(A2, ".", "", 2) replaces only the second period — useful for some date format cleanup where you want to preserve the first period but not the second.

REPLACE is different: it replaces characters at a specific position rather than matching a specific string:

=REPLACE(old_text, start_num, num_chars, new_text)

This is less commonly used for general cleaning but is handy when data has a consistent structural problem at a known position — like when the first two characters of a product code are always wrong, or when a legacy system always prepends a code you need to strip.


Type Coercion: The Text-That-Looks-Like-Numbers Problem

This one causes more confusion and silent analysis errors than almost anything else in spreadsheet cleaning. Numbers stored as text look like numbers, they sort incorrectly as text (so "10" comes before "9"), and they're silently excluded from SUM and AVERAGE. Excel sometimes shows a little green triangle warning in the corner; often it doesn't. So your SUMIFS appears to work, but it's actually excluding half the matching rows because they're stored as text.

How to spot the problem: select a column of apparent numbers and check the status bar at the bottom right. If it shows "Count: 10" but shows no "Sum" or "Average," Excel thinks they're text, not numbers.

Three ways to fix it:

1. The VALUE function:

=VALUE(A2)

Converts text that looks like a number into an actual number. Clean and explicit. If the text can't be interpreted as a number, it returns an error, which is actually helpful — it flags bad data.

2. The double-negative trick:

=--A2

The -- (double unary minus) coerces the value to a number. It's a slightly cryptic but widely used shorthand among spreadsheet professionals. Same result as VALUE, one less keystroke.

3. Paste Special Multiply: For a quick in-place fix without formulas: type 1 in an empty cell, copy it, select your text-numbers column, open Paste Special (Ctrl+Alt+V on Windows, Ctrl+Cmd+V on Mac), choose Multiply, OK. Excel multiplies each cell by 1, which forces type conversion. This overwrites the source values, so only use it when you're okay modifying the original — which is rare, given our golden rule about preserving source data.
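The same principle applies if you script the coercion: make bad values fail loudly rather than silently. Here's a rough Python analogue of VALUE (my own helper, handling only plain numbers and thousands separators):

```python
def to_number(text: str):
    # Like VALUE, this raises on anything non-numeric --
    # a loud error beats a silently wrong sum
    cleaned = text.strip().replace(",", "")  # tolerate "1,234"
    return float(cleaned) if "." in cleaned else int(cleaned)

print(to_number(" 1,234 "))  # → 1234
print(to_number("3.5"))      # → 3.5
# to_number("N/A") raises ValueError, flagging the bad cell
```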


Date Parsing: Taming Text Dates

Dates are where spreadsheet cleaning gets genuinely painful. The root problem is that dates are stored internally as serial numbers (in Excel's default date system, day 1 is January 1, 1900), and when they come in from external systems, they often arrive as text strings in any number of formats: "03/15/2024", "March 15, 2024", "15-Mar-24", "20240315". There's no universal standard, so each source invents its own.

Following ISO 8601 date format (YYYY-MM-DD) is the gold standard for storage precisely because it avoids this ambiguity, but you rarely have control over how source data arrives.

For dates in recognizable formats, DATEVALUE converts a text date string to a real date serial number:

=DATEVALUE("March 15, 2024")   → 45366 (a date serial number)

Format that result as a date and it displays correctly. The catch: DATEVALUE is finicky about formats. "15/03/2024" in US locale Excel will confuse it (Excel expects MM/DD/YYYY by default). This is where things get locale-dependent and frustrating.

For non-standard formats, use DATE combined with string extraction functions:

For a date stored as "20240315" (YYYYMMDD format, common in enterprise systems):

=DATE(LEFT(A2,4), MID(A2,5,2), RIGHT(A2,2))

This extracts year, month, and day as separate numbers and assembles a real date value. Works reliably regardless of your locale settings.
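The same fixed-position decomposition works in any scripting language; in Python, the standard library's strptime does it in one call (a sketch, with the helper name my own):

```python
from datetime import datetime

def parse_yyyymmdd(text: str):
    # Fixed positions, no locale guessing -- the same idea as
    # DATE(LEFT(A2,4), MID(A2,5,2), RIGHT(A2,2))
    return datetime.strptime(text.strip(), "%Y%m%d").date()

print(parse_yyyymmdd("20240315"))  # → 2024-03-15
```

Like the DATE formula, strptime fails loudly (raises ValueError) on input that doesn't match the expected shape, which is exactly what you want from a date parser.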

Warning: Excel's automatic date conversion is famously overzealous. It will silently convert text like "2-3" to "February 3" or gene names like "SEPT2" to dates. This has caused documented errors in published scientific research. Always verify your date column after import, especially if your data contains anything that might look like a date to an overeager parser.


Flash Fill: Excel's Pattern-Recognition Shortcut

Flash Fill, introduced in Excel 2013 and available in every version since, is one of those features that feels like magic until you understand its limits, at which point you appreciate it for what it is: a useful shortcut for simple, consistent transformations.

Type the desired result for the first row (or the first few rows) in an adjacent column, then press Ctrl+E. Excel analyzes the pattern and fills the remaining rows. As you type a second example, Excel may also offer a greyed-out preview on its own; press Enter to accept it. Either way, spot-check the filled values before trusting them.

Classic use cases:

  • Splitting "John Smith" into separate first/last name columns
  • Extracting area codes from phone numbers
  • Reformatting dates from "2024-03-15" to "March 15, 2024"
  • Combining first and last name columns into a single field
  • Extracting city from an address field

The catches matter: Flash Fill doesn't update when your source data changes (it's a one-time operation, not a formula). It can guess wrong on complex or inconsistent patterns. And it leaves no audit trail — if someone opens the workbook later, they won't know you used Flash Fill or how. For truly messy data with inconsistent patterns, Flash Fill will stumble and require manual correction. For clean, consistent transformations on a static dataset, though, it's a genuine time-saver.


When Formulas Aren't Enough: The Case for Power Query

If you're doing the same formula-based cleaning process every Monday morning when a new data export arrives — paste the data, copy your cleaning formulas down, convert-to-values, check for anomalies — you're doing it wrong. That workflow is manual, error-prone, and doesn't scale to multiple files or frequent refreshes.

This is exactly the problem Power Query was built to solve.

Power Query (available natively in Excel 2016+ and all versions of Excel 365, as well as Power BI) records every transformation you apply to a dataset as a series of Applied Steps. When new data arrives, you refresh the query and all those steps re-execute automatically. The cleaning logic is captured once, not repeated manually every time.

According to Microsoft's own training materials, Power Query lets you "connect to many different data sources and transform the data into the shape you want" — but what the documentation undersells is the repeatability. That's the real killer feature. You build the pipeline once, then click refresh forever.

graph TD
    A[Source Data\nCSV / Excel / Database / Web] --> B[Power Query Editor]
    B --> C[Applied Steps\nRecorded transformations]
    C --> D[Cleaned Output Table\nin your workbook]
    E[New Data Arrives] --> A
    D --> F[Analysis / Pivot / Dashboard]
    E --> G[Click Refresh]
    G --> D

The Power Query Interface: A Quick Orientation

Get there via Data → Get Data → From File → From Workbook (or From CSV, From Web, From Database, etc.), select your source, and choose "Transform Data" rather than "Load." This opens the Power Query Editor — an interface that looks a bit like a specialized spreadsheet.

The five components you need to know:

  1. Ribbon — where you access transformations (Home, Transform, Add Column tabs)
  2. Queries pane (left sidebar) — shows all queries in the workbook
  3. Data preview (center) — a sample of your data as transformed so far
  4. Query Settings pane (right) — lists your Applied Steps in order
  5. Formula bar — shows the M language code behind each step (you can ignore this initially)

The Applied Steps pane is the heart of Power Query. Each transformation you apply — filter a row, change a column type, remove duplicates — becomes a named step. You can click back to any previous step to see what the data looked like at that point. You can rename steps for clarity, reorder them, or delete them. It's a visual, inspectable transformation pipeline, which means you can understand what happened and undo mistakes without blowing everything up.

Common Power Query Transformations

Change Type — You click a column header, click the data type icon (it looks like "ABC" or "123"), and select the right type from the menu. Power Query will parse text dates, coerce text numbers, and flag failures. This alone solves the text-that-looks-like-numbers problem without writing a single formula.

Split Column — Select a column, choose Split Column → By Delimiter. Power Query asks what delimiter and whether to split at the first, last, or every occurrence. The equivalent of a formula nest of LEFT/FIND/RIGHT, but visual and automatic.

Remove Duplicates — Right-click a column header → Remove Duplicates. Or select multiple columns for composite key deduplication. Done. Power Query shows you how many duplicates it removed.

Fill Down — When data has merged cells or blanks that represent "same as above" (a common export problem from systems designed for humans to read, not machines), Fill Down propagates the last non-null value downward through blanks. This one transformation is worth learning Power Query for if you regularly get data with this structure.
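Fill Down's logic is simple enough to sketch in a few lines of Python, which also makes clear why it's safe: it only ever touches blanks (the helper name is mine):

```python
def fill_down(values):
    # Propagate the last non-blank value through subsequent blanks,
    # like Power Query's Fill Down transformation
    last = None
    filled = []
    for v in values:
        if v not in (None, ""):
            last = v          # remember the most recent real value
        filled.append(last)   # blanks inherit it
    return filled

print(fill_down(["East", "", "", "West", ""]))
# → ['East', 'East', 'East', 'West', 'West']
```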

Pivot and Unpivot — Unpivot is particularly powerful: it converts wide-format data (many columns representing the same variable) into long/tidy format (one column per variable, more rows). This is the transformation that enables proper analysis of data someone structured incorrectly at the source. Select the columns you want to "stack," right-click → Unpivot Columns.
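What Unpivot actually does is easiest to see in a tiny sketch (Python dicts standing in for table rows; the column names here are invented for illustration):

```python
def unpivot(rows, id_col, value_cols):
    # Wide → long: one output row per (id, column) pair,
    # mirroring Power Query's Unpivot Columns
    long_rows = []
    for row in rows:
        for col in value_cols:
            long_rows.append(
                {id_col: row[id_col], "Attribute": col, "Value": row[col]}
            )
    return long_rows

wide = [{"Region": "East", "Jan": 100, "Feb": 120}]
print(unpivot(wide, "Region", ["Jan", "Feb"]))
# → [{'Region': 'East', 'Attribute': 'Jan', 'Value': 100},
#    {'Region': 'East', 'Attribute': 'Feb', 'Value': 120}]
```

One wide row with twelve month columns becomes twelve long rows, which is exactly the shape pivot tables and lookups want.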

Group By — Similar to a SUMIFS or Pivot Table, but built into the cleaning pipeline. Useful when you need to aggregate data as part of the transformation process, not just for reporting afterward.

Merge Queries — The Power Query equivalent of VLOOKUP/INDEX-MATCH. Bring in a lookup table and join it to your main dataset without formulas.

Tip: Don't try to learn every Power Query transformation at once. Start with these six: Change Type, Remove Duplicates, Split Column, Fill Down, Rename Column, and Filter Rows. Those will handle 80% of real cleaning scenarios you'll encounter.


The Refresh Workflow: Why This Changes Everything

Here's the scenario: your company runs a monthly sales report from the ERP. Every month you get a new CSV file, paste it in, run your cleaning steps (copy formulas down, convert to values, check for anomalies), then rebuild your analysis on top of it. Sound familiar?

With Power Query, the workflow becomes:

  1. Build your cleaning pipeline once against the first export
  2. When next month's file arrives, either replace the source file at the same path, or change the source path in Power Query settings
  3. Click Data → Refresh All
  4. Every applied step reruns. Your cleaned output table updates. Your analysis updates. Done.

This isn't just a time-saver — it's a reliability upgrade. Humans make mistakes when they repeat manual steps. They copy a formula down four times, then on the fifth time they forget. They apply one transformation but forget another. Power Query doesn't. The transformation logic is documented, auditable, and deterministic. It either works or it doesn't, every time.


Google Sheets Data Cleaning: What's Available (and What Isn't)

Google Sheets supports all the text-cleaning formulas covered above: TRIM, CLEAN, PROPER, UPPER, LOWER, SUBSTITUTE, REPLACE, VALUE, LEFT, RIGHT, MID, FIND, SEARCH, DATEVALUE. The syntax is identical to Excel, so anything formula-based translates directly.

Flash Fill's equivalent in Google Sheets is Smart Fill, which appears automatically when Sheets detects a pattern. It's less reliable than Excel's Flash Fill but serviceable for straightforward transformations.

What Google Sheets notably lacks is Power Query. There's no native equivalent of a recorded, refreshable transformation pipeline. The closest alternatives are:

  • Google Sheets macros — record UI actions, but fragile and not reliable for production use
  • Apps Script — JavaScript-based automation, powerful but requires coding knowledge
  • Coefficient, Coupler.io, or similar add-ons — third-party tools that add data connector and refresh functionality, but they vary in reliability

If you're doing serious, repeatable data cleaning work, this is a genuine limitation of Google Sheets. For one-time cleaning with formulas, Sheets is perfectly capable. For automated pipelines that run monthly or weekly, Excel with Power Query (or dedicated ETL tools) is the professional-grade choice.


Building Your Cleaning Checklist

The value of a cleaning checklist isn't that it covers every possible problem — it's that it turns "I'll look for issues" (vague, incomplete) into "I'll run these specific checks" (systematic, auditable). Here's a standard sequence to run on any unfamiliar dataset:

1. Inspect first — transform second. Before touching anything, look at the data. How many rows? How many columns? Do column names make sense? Scroll through a sample of values in each column. Do you see obvious inconsistencies or typos?

2. Check for trailing and leading spaces. Add a helper column with =LEN(A2) and another with =LEN(TRIM(A2)). If they differ, you have spaces to clean.

3. Verify numeric columns are actually numeric. Select each numeric column and check the status bar for Sum/Average. If you only see Count, you have text-numbers.

4. Check date columns. Are dates stored as real dates (numeric, formatted as dates) or text strings? Apply DATEVALUE to a sample value and see if it returns a number.

5. Check for case inconsistencies. In categorical columns (status, region, category, location), look for duplicates caused by capitalization. =LOWER(A2) + Remove Duplicates in Power Query will surface these quickly.

6. Check for duplicate rows. Either use Power Query's Remove Duplicates preview or Data → Remove Duplicates in Excel to see how many exact duplicates exist.

7. Check for unexpected blank cells. Use COUNTBLANK on each column. Are blanks expected (legitimate missing data) or accidental? Are there patterns — like entire columns with no data?

8. Spot-check a calculated field against source. If you have something like "Total" that should equal Quantity × Price, verify a few rows manually. If they don't match, there's a data integrity problem upstream that cleaning won't fix.

9. Document what you did and why. A brief notes section — even just a text cell above the cleaning area — explaining what transformations were applied and why is worth its weight in gold when you or a colleague revisit this workbook in six months.
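Several of these checks automate nicely. Here's a minimal Python sketch of a column profiler covering steps 2, 3, and 7 (the function name and report keys are my own):

```python
def profile_column(values):
    # values: one column of raw strings, as exported
    report = {
        "rows": len(values),
        "blanks": sum(1 for v in values if v.strip() == ""),       # step 7
        "extra_spaces": sum(1 for v in values if v != v.strip()),  # step 2
    }

    def is_num(s):
        try:
            float(s)
            return True
        except ValueError:
            return False

    # step 3: do all non-blank entries parse as numbers?
    non_blank = [v.strip() for v in values if v.strip()]
    report["all_numeric"] = bool(non_blank) and all(is_num(s) for s in non_blank)
    return report

print(profile_column([" 12 ", "9", "", "x"]))
# → {'rows': 4, 'blanks': 1, 'extra_spaces': 1, 'all_numeric': False}
```

Running something like this per column turns "I'll look for issues" into a report you can paste into your notes for step 9.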

Warning: Never clean data in your only copy of the source file. Always keep the raw data preserved in a separate sheet, file, or as a Power Query source that remains untransformed. If a cleaning step turns out to be wrong, you need to be able to go back.


The cleaning workflow is unglamorous but foundational. Every analysis technique we cover in subsequent sections — SUMIFS, Pivot Tables, dynamic arrays, dashboards — depends on data that's structurally sound, type-consistent, and free of the silent corruptions that make formulas silently lie to you. Getting this right isn't the boring part of analytics. It is analytics, as much as any chart or model you build on top of it.

Now that your data is clean, we can start making it intelligent.