Data Literacy: Reading, Exploring, and Visualizing Data
Section 18 of 21

How to Create a Data Literacy Workflow for Reading and Visualizing Data

Putting It All Together — A Data Literacy Workflow

From Raw Data to Insight

Let's walk through a complete example that ties together everything we've covered. Imagine you're analyzing data from a fictional bookstore chain. You have a spreadsheet with one row per book sold over the past year, including: title, genre, sale price, date sold, store location, customer age group, and whether the customer was a loyalty member.

Your boss asks: "Tell me something interesting about our sales this year."

This is the kind of open-ended, real-world request that data literacy prepares you for. You don't have a formula to plug numbers into — instead, you need to follow a workflow that moves from understanding your raw material, through systematic exploration, to actionable insight. What follows is exactly how a data-literate person approaches this challenge.

Step 1: Understand the Data

Before doing anything else, you need to orient yourself. What kind of data are you actually holding? This first step is about taxonomy — knowing what you're working with unlocks what you can do with it.

  • Title: nominal categorical (each unique book title is a category with no inherent order)
  • Genre: nominal categorical (Fiction, Non-Fiction, Children's, etc. — no ranking)
  • Sale price: continuous numerical (can be any value within a range; you can do math with it, like calculating an average)
  • Date sold: date/time data (numerical in the sense that dates are ordered and you can measure the gap between any two of them, with the special property that they group naturally by week, month, season, or year)
  • Store location: nominal categorical (store names or IDs — think of these as labels, not numbers)
  • Customer age group: ordinal categorical (18-24, 25-34, 35-44, etc. — there's a clear order, but the gaps between groups aren't necessarily equal)
  • Loyalty member: binary categorical (Yes/No — the simplest case, just two options)

You've already started doing real data work. Categorizing your variables tells you what's possible and what isn't. For instance, you can calculate the mean price of all books. You cannot meaningfully calculate a "mean" genre or a "mean" store location — that would just be nonsense. You could rank age groups, sure, but averaging them would be meaningless. This classification step is what prevents you from making absurd analytical moves later on.
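One way to make this classification concrete is to write it down as a lookup table and check a proposed summary against it before computing anything. The sketch below is only an illustration: the column names and the list of "allowed" summaries per type are assumptions chosen for this example, not a standard.

```python
# Measurement level of each column in the fictional bookstore sheet.
# Column names are assumed spellings, not taken from a real file.
LEVELS = {
    "title": "nominal",
    "genre": "nominal",
    "sale_price": "continuous",
    "date_sold": "temporal",
    "store_location": "nominal",
    "customer_age_group": "ordinal",
    "loyalty_member": "binary",
}

# Summaries that make sense at each level (an illustrative rule of thumb).
ALLOWED = {
    "nominal": {"count", "mode"},
    "binary": {"count", "mode", "proportion"},
    "ordinal": {"count", "mode", "median", "rank"},
    "temporal": {"count", "min", "max", "range", "group_by_period"},
    "continuous": {"count", "min", "max", "range", "median", "mean", "stdev"},
}


def can_summarize(column: str, summary: str) -> bool:
    """Return True if `summary` is meaningful for the column's level."""
    return summary in ALLOWED[LEVELS[column]]


print(can_summarize("sale_price", "mean"))          # True
print(can_summarize("genre", "mean"))               # False
print(can_summarize("customer_age_group", "rank"))  # True
print(can_summarize("customer_age_group", "mean"))  # False
```

A table like this is cheap to write and catches the "mean genre" kind of mistake before it reaches a chart.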

Step 2: Check Data Quality

You run some basic checks — the kind of defensive work that separates solid analysis from the kind that gets you embarrassed in a meeting:

  • Missing values: You find that 12% of customer age groups are missing (people didn't provide this at checkout). That's significant. You'll need to account for it in any analysis involving age. Your move: for some questions, you'll exclude those rows; for others, you'll create a separate "Unknown" age group to see if that cluster shows different spending patterns.
  • Outliers and errors: Three books show a sale price of $0.00 — probably employee discounts, system glitches, or returned items that shouldn't be there. You flag these for investigation before they mess up your numbers. You also spot one entry priced at $9,999, which turns out to be a data entry error (should have been $99.99). Without catching this, your mean and standard deviation would be wildly inflated.
  • Date range: You verify that all dates fall within the past year. You check for impossible dates — anything in the future, anything from before the store opened. Everything checks out.
  • Categorical consistency: The genre "Non-Fiction" shows up as "Nonfiction" in some rows and "non-fiction" in others. Store location sometimes reads "Downtown Store" and sometimes just "Downtown." These inconsistencies break your analysis — when you group by store, the system will treat "Downtown" and "Downtown Store" as completely different categories. You standardize everything to a single format.
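The four checks above could be sketched roughly as follows. Everything in this block is made up for illustration: the rows, the column names, the reporting window, the price ceiling, and the choice of "Downtown" and "Non-Fiction" as the canonical spellings are all assumptions.

```python
from datetime import date

# A few invented rows in the shape described above.
rows = [
    {"genre": "Non-Fiction", "store": "Downtown Store", "price": 16.99,
     "sold": date(2024, 3, 2), "age_group": "25-34"},
    {"genre": "nonfiction", "store": "Downtown", "price": 0.00,
     "sold": date(2024, 5, 9), "age_group": None},
    {"genre": "Nonfiction", "store": "Mall", "price": 9999.00,
     "sold": date(2024, 11, 30), "age_group": "35-44"},
    {"genre": "Fiction", "store": "Riverside", "price": 14.99,
     "sold": date(2024, 12, 20), "age_group": "18-24"},
]

PERIOD_START, PERIOD_END = date(2024, 1, 1), date(2024, 12, 31)  # assumed window
PRICE_CEILING = 500.00  # assumed sanity threshold for a single book

# Known spelling variants (lower-cased) mapped to one canonical label.
GENRE_ALIASES = {"nonfiction": "Non-Fiction", "non-fiction": "Non-Fiction"}
STORE_ALIASES = {"downtown store": "Downtown"}


def normalize(value: str, aliases: dict) -> str:
    """Return the canonical label for a known variant, else the trimmed value."""
    cleaned = value.strip()
    return aliases.get(cleaned.lower(), cleaned)


# 1. Missing values: share of rows with no age group.
missing_age = sum(r["age_group"] is None for r in rows) / len(rows)

# 2. Outliers and errors: prices at or below zero, or implausibly high.
suspect_prices = [r["price"] for r in rows
                  if r["price"] <= 0 or r["price"] > PRICE_CEILING]

# 3. Date range: anything outside the reporting window.
bad_dates = [r["sold"] for r in rows
             if not PERIOD_START <= r["sold"] <= PERIOD_END]

# 4. Categorical consistency: distinct labels after standardizing.
genres = {normalize(r["genre"], GENRE_ALIASES) for r in rows}
stores = {normalize(r["store"], STORE_ALIASES) for r in rows}

print(f"missing age group: {missing_age:.0%}")
print("suspect prices:", suspect_prices)
print("dates out of range:", bad_dates)
print("genres:", sorted(genres))
print("stores:", sorted(stores))
```

Note that the suspect prices are only flagged here, not deleted. Whether a $0.00 sale is an error or a legitimate giveaway is a question for someone who knows the business.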

This data quality work might not feel like it's producing insights, but it's preventing disasters. A messy dataset with undetected errors leads to confident-sounding conclusions that are simply wrong. Think of this as checking your foundation before building the house.
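To see how much damage a single bad entry can do, here is a small sketch comparing summary statistics with the $9,999 typo left in and with it corrected to $99.99. The other five prices are invented for the example.

```python
from statistics import mean, median, stdev

prices = [12.99, 14.99, 16.99, 18.99, 24.99]  # invented clean prices
with_error = prices + [9999.00]               # the data entry error
corrected = prices + [99.99]                  # what it should have been

for label, data in [("with typo", with_error), ("corrected", corrected)]:
    print(f"{label:<10} mean={mean(data):>8.2f} "
          f"median={median(data):>6.2f} stdev={stdev(data):>8.2f}")
```

In this toy sample the median is the same in both versions, because it depends only on the middle values, while the mean and standard deviation change by orders of magnitude. That contrast is one reason to report the median alongside the mean whenever errors or extreme values are plausible.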

graph LR
    A[Raw Data] --> B[Data Type<br/>Audit]
    B --> C[Quality Check<br/>Missing/Outliers]
    C --> D[Clean &<br/>Standardize]
    D --> E[Exploratory<br/>Analysis]
    E --> F[Visualization<br/>& Description]
    F --> G[Insight &<br/>Communication]
    G -.-> |New questions| E

Step 3: Describe Each Variable

Now that you've got clean data, you calculate summary statistics for your key numerical variables. This is the univariate analysis phase — looking at one variable at a time, getting a feel for it.

Sale Price:

  • Mean = $18.40
  • Median = $16.99
  • The mean is higher than the median. This is your signal that the distribution is right-skewed — a handful of expensive titles (art books, box sets, special editions) are pulling the average up.
  • Standard deviation = $11.20
  • IQR (the middle 50%) = $12.99 to $24.99, a spread of $12.00
  • You plot a histogram. There it is: a tall cluster in the $14.99–$19.99 range (your bestsellers and standard paperbacks) with a smaller tail stretching out to $75+ (specialty and premium stuff).
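The same summary can be sketched with Python's built-in statistics module. The eleven prices below are invented, so the results will not match the bookstore figures above; the point is the shape of the calculation. Quartile conventions differ between tools, and the "inclusive" method used here is just one reasonable choice.

```python
from statistics import mean, median, quantiles, stdev

# Invented sample of sale prices with one premium title in the tail.
prices = [9.99, 12.99, 14.99, 14.99, 16.99, 16.99,
          18.99, 19.99, 24.99, 34.99, 75.00]

q1, q2, q3 = quantiles(prices, n=4, method="inclusive")

summary = {
    "mean": mean(prices),
    "median": median(prices),
    "stdev": stdev(prices),  # sample standard deviation
    "q1": q1,
    "q3": q3,
    "iqr": q3 - q1,
}

# A mean well above the median hints at a right-skewed distribution.
right_skew_hint = summary["mean"] > summary["median"]

for name, value in summary.items():
    print(f"{name:<7}${value:.2f}")
print("mean > median:", right_skew_hint)
```

The mean-versus-median comparison is only a hint. The histogram is still what confirms the shape of the distribution.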

Number of Books Sold by Genre:

  • Fiction: 8,240 books (46%)
  • Non-Fiction: 5,100 books (28%)
  • Children's: 3,890 books (22%)
  • Other: 770 books (4%)

Your store is heavily fiction-focused, at nearly half your business. But that 22% for Children's catches your eye. It's a significant slice. That's worth understanding better.
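Shares like these are worth recomputing from the raw counts rather than typing by hand, since it is easy for rounded percentages to drift away from the counts or fail to add up to 100%. A minimal sketch using the counts from the list above:

```python
genre_counts = {
    "Fiction": 8240,
    "Non-Fiction": 5100,
    "Children's": 3890,
    "Other": 770,
}

total = sum(genre_counts.values())
shares = {genre: 100 * n / total for genre, n in genre_counts.items()}

for genre, pct in shares.items():
    print(f"{genre:<12}{genre_counts[genre]:>7,}{pct:>7.1f}%")
print(f"{'Total':<12}{total:>7,}{sum(shares.values()):>7.1f}%")
```

Printing one decimal place, then rounding only at the reporting stage, keeps the displayed shares honest.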

Loyalty Member Distribution:

  • 43% of transactions involve loyalty members
  • 57% are non-members

The program has decent penetration, but there's room to grow.

Step 4: Explore Relationships

Here's where things get interesting. You move to bivariate and multivariate analysis — asking how variables relate to one another. This is where insights usually hide.

Do loyalty members buy more expensive books?
You split the data into two groups and compare:

  • Loyalty members: mean price = $21.30, median = $19.99
  • Non-members: mean price = $16.80, median = $15.49
  • Difference: Loyalty members spend about $4.50 more per book on average. That's a 27% difference.

This is striking. But here's the critical question that immediately jumps out: Does the loyalty program cause higher spending, or do wealthier customers just naturally sign up? This is a causal ambiguity you'll need to flag when you talk about findings.
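The arithmetic behind that comparison is worth spelling out, because a "percent difference" always depends on which group you treat as the baseline. The sketch below uses the figures from the comparison above and takes non-members as the baseline; the function name is just a label chosen for this example.

```python
def pct_gap(group_value: float, baseline: float) -> float:
    """Percent by which group_value exceeds baseline."""
    return 100 * (group_value - baseline) / baseline


member_mean, non_member_mean = 21.30, 16.80
member_median, non_member_median = 19.99, 15.49

gap_dollars = member_mean - non_member_mean
gap_pct = pct_gap(member_mean, non_member_mean)
median_gap_pct = pct_gap(member_median, non_member_median)

print(f"mean gap:   ${gap_dollars:.2f} ({gap_pct:.1f}%)")
print(f"median gap: {median_gap_pct:.1f}%")
```

Checking the gap on the medians as well as the means is a quick way to confirm the difference is not driven by a few expensive purchases in one group.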

Does genre vary by store location?
You create a cross-tabulation showing what's selling where:

Store        Fiction        Non-Fiction    Children's     Other
Downtown     2,100 (48%)    1,200 (27%)      750 (17%)    350 (8%)
Mall         3,200 (46%)    1,500 (21%)    2,100 (30%)    200 (3%)
Riverside    2,940 (45%)    2,400 (36%)    1,040 (16%)    220 (3%)

(Percentages are each genre's share of that store's total sales.)

An interesting pattern emerges. The Mall location has an unusually high Children's share (30% vs. 22% overall). Investigation reveals the Mall store sits near two elementary schools. The Riverside store skews toward Non-Fiction (36% vs. 28% overall), probably because it's in an affluent neighborhood with an older customer base. These insights come from relationships, not just from counting things.
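A cross-tabulation like this is easy to get subtly wrong by hand, so it can help to derive the row percentages from the counts and confirm that the column totals match the genre totals from Step 3. A sketch using the counts in the table above:

```python
crosstab = {
    "Downtown":  {"Fiction": 2100, "Non-Fiction": 1200, "Children's": 750,  "Other": 350},
    "Mall":      {"Fiction": 3200, "Non-Fiction": 1500, "Children's": 2100, "Other": 200},
    "Riverside": {"Fiction": 2940, "Non-Fiction": 2400, "Children's": 1040, "Other": 220},
}


def row_percentages(table: dict) -> dict:
    """Convert each store's counts into shares of that store's total."""
    result = {}
    for store, counts in table.items():
        store_total = sum(counts.values())
        result[store] = {g: 100 * n / store_total for g, n in counts.items()}
    return result


row_pct = row_percentages(crosstab)

# Column totals should reproduce the overall genre counts.
genres = list(crosstab["Downtown"])
column_totals = {g: sum(row[g] for row in crosstab.values()) for g in genres}

for store, shares in row_pct.items():
    cells = "  ".join(f"{g}: {pct:.0f}%" for g, pct in shares.items())
    print(f"{store:<10}{cells}")
print("column totals:", column_totals)
```

Row percentages answer "what does each store sell?", which is the question here. Column percentages would answer a different one: "where does each genre sell?"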

Is there a trend in sales over time?
You plot total revenue (or units sold) by week across the year:

  • Baseline through January–August (pretty flat), apart from a dip in February (post-holiday hangover, plus it's a short month)
  • Spike in September (back-to-school)
  • Much bigger spike in November–December (holiday gift-giving)

This is classic retail seasonality. It matters because it affects your inventory planning, staffing decisions, and when you run promotions.
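Before you can plot a trend, individual sales have to be rolled up into time buckets. Here is one way that grouping might look, with a handful of invented sales and a small helper (a name made up for this sketch) that buckets a date by calendar month or by ISO week.

```python
from collections import defaultdict
from datetime import date

# (date sold, sale price): invented rows
sales = [
    (date(2024, 2, 10), 14.99),
    (date(2024, 9, 3), 19.99),
    (date(2024, 9, 5), 12.99),
    (date(2024, 12, 1), 24.99),
    (date(2024, 12, 15), 16.99),
    (date(2024, 12, 20), 34.99),
]


def period_key(d: date, by: str = "month") -> tuple:
    """Bucket a date by calendar month (default) or ISO week."""
    if by == "week":
        iso = d.isocalendar()
        return (iso[0], iso[1])  # (ISO year, ISO week number)
    return (d.year, d.month)


revenue_by_month = defaultdict(float)
units_by_month = defaultdict(int)
for sold, price in sales:
    key = period_key(sold)
    revenue_by_month[key] += price
    units_by_month[key] += 1

for key in sorted(revenue_by_month):
    print(f"{key[0]}-{key[1]:02d}  units={units_by_month[key]}  "
          f"revenue=${revenue_by_month[key]:.2f}")

peak_month = max(revenue_by_month, key=revenue_by_month.get)
print("peak month:", peak_month)
```

Weekly buckets show short spikes more clearly, while monthly buckets are easier to read across a full year. Note that months of different lengths are not strictly comparable on totals, which is part of why February looks low.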

Step 5: Visualize the Key Finding

After exploring, you've identified your strongest finding: loyalty members outspend non-members by a wide margin, and a month-by-month breakdown shows the gap widening.

You create a visualization: a horizontal bar chart showing average price per book, loyalty members at $21.30 and non-members at $16.80, with a clear title: "Loyalty Members Spend 27% More Per Book." You add a footnote showing your sample size (18,000 transactions total) and your data period (the full year).

Then you add a line chart showing the trend over time: average spend for loyalty members versus non-members, plotted month-by-month across the year. The two lines start closer together (12% gap) early on and gradually diverge to 27% by year-end. This trend visualization tells a richer story than a single bar chart — it shows momentum and a pattern developing.
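The data behind a trend chart like this is a series of monthly gaps. A rough sketch of how that series could be computed is below. The rows are invented and cover only two months, so the numbers are illustrative rather than the bookstore's actual figures.

```python
from collections import defaultdict
from statistics import mean

# (month, is_loyalty_member, sale_price): invented rows
rows = [
    (1, True, 18.00), (1, True, 19.00), (1, False, 16.00), (1, False, 17.00),
    (12, True, 21.00), (12, True, 23.00), (12, False, 17.00), (12, False, 18.00),
]

# Collect prices per (month, membership) group.
prices = defaultdict(list)
for month, is_member, price in rows:
    prices[(month, is_member)].append(price)

# Percent gap per month, with non-members as the baseline.
# Assumes every month has at least one sale in each group.
gap_by_month = {}
for month in sorted({m for m, _ in prices}):
    member_avg = mean(prices[(month, True)])
    other_avg = mean(prices[(month, False)])
    gap_by_month[month] = 100 * (member_avg - other_avg) / other_avg

for month, gap in gap_by_month.items():
    print(f"month {month:>2}: gap = {gap:.1f}%")
```

Plotting the two monthly averages as separate lines, rather than the gap alone, lets the reader see whether the gap grows because members spend more, non-members spend less, or both.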

Step 6: Communicate the Story

You sit down with your boss and tell a story, not just recite statistics:

Setup (Context): "Loyalty members now account for 43% of all transactions, up from 31% a year ago. The program is clearly gaining traction."

Conflict (The Surprise): "Here's what's interesting: the spending gap isn't just big, it's growing. At the start of the year, loyalty members spent only 12% more per book. Today, that's 27%. And it's been climbing steadily month by month."

Resolution (What to Do): "The data suggests that as the loyalty program matures and members become more engaged — through repeat visits, accumulated rewards, exclusive offers — they're also spending more. This probably justifies expanding the program and putting more resources into member engagement."

Caveat (What You Don't Actually Know): "One important note: correlation doesn't equal causation. The data shows loyalty membership and higher spending going together, but we can't say for certain that the program causes the higher spending. It's possible that higher spenders were already more likely to sign up. To separate the two, we'd need to compare each customer's spending before and after they joined, which requires customer-level data this spreadsheet doesn't have, or run a controlled experiment."

Notice what you've done: you've followed a real workflow, reported a finding accurately, and — most importantly — flagged the causal ambiguity without throwing away the insight. You've given your boss something actionable while being intellectually honest about what the data actually proves. That's data literacy.
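If customer-level data were available, the before-and-after comparison from the caveat might look something like the sketch below. This is entirely hypothetical: the spreadsheet described in this example has no customer ID or join date, and the three customers and their figures are invented.

```python
from statistics import mean

# customer_id -> (avg price per book before joining, avg price after joining)
# Hypothetical data: no such columns exist in the bookstore sheet.
before_after = {
    "c001": (20.10, 20.60),
    "c002": (22.40, 22.10),
    "c003": (19.80, 21.00),
}

NON_MEMBER_MEAN = 16.80  # non-member average from Step 4

changes = [after - before for before, after in before_after.values()]
avg_change = mean(changes)
avg_before = mean([before for before, _ in before_after.values()])

print(f"average change after joining: ${avg_change:+.2f}")
print(f"average before joining:       ${avg_before:.2f}")
print(f"already above non-members before joining: {avg_before > NON_MEMBER_MEAN}")
```

In this invented sample, future members were already spending well above the non-member average before they joined, and the change after joining is small. A pattern like that would point toward self-selection rather than a program effect. Even then, a before-and-after comparison can be confounded by seasonality or price changes, which is why a controlled experiment is the stronger test.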

mindmap
  root((Data<br/>Literacy<br/>Workflow))
    Understand
      Classify variables
      Know your sources
      Understand context
    Clean
      Handle missing data
      Spot outliers
      Standardize categories
    Describe
      Summary statistics
      Distributions
      One variable at a time
    Explore
      Ask questions
      Find relationships
      Check for trends
    Visualize
      Choose appropriate charts
      Label clearly
      Tell a story
    Communicate
      Set up context
      Reveal the insight
      Suggest action
      Acknowledge limits

Iteration and Follow-Up Questions

In real life, your boss might ask follow-up questions, and you loop back to exploration:

  • "Who are our highest-spending customers? What age groups, genres, store locations show up?"
  • "How does retention among loyalty members relate to revenue? Are these customers sticking around?"
  • "What if we tested a new reward structure? How would we know if it's working?"

Each new question sends you back into the explore-visualize-communicate loop, refining your understanding. This is why data literacy is a process, not a destination. You're never really done — you're just always asking better questions.