Why You Must Validate Data

basic data analytics1Before you analyze data, you must first validate it.

Otherwise, your analysis may not be accurate, and you may miss some important insights or errors.

This post is part of the Excel: Basic Data Analytic series.

Before analyzing your data, you need to check the following:

  • Duplicate transactions do not exist.
  • Required fields/key fields do not contain blanks, spaces, zeroes, unprintable characters, or other invalid data.
  • Date fields contain real dates, and the range of dates is appropriate.
  • Amount fields don’t contain inappropriate zero, positive, or negative amounts, and the range of values is appropriate.
  • Each field is stored in the correct format. This prevents data from being converted on the fly into something else unexpectantly (e.g., user ID JUL15 becomes 15-Jul).

Why is validating data prior to analysis critical?

  • Duplicates and invalid values will affect totals, counts, percentages, and averages.
  • Blanks can cause trouble with filters and pivot tables, especially if an entire line is blank.
  • Invalid values may be ignored when filters or formulas are applied, and you might miss data that should match or should not.
  • Some key business processes feed data from user-developed MS Office applications that don’t have data entry validation checks.
  • Some data gets changed or corrupted when it is converted, reformatted, or extracted from systems.
  • People make mistakes when querying, copying, or extracting data for you.

To check for duplicate transactions, see Excel: Remove Duplicates and Excel Identify Duplicates.

To check for the other issues noted above, see Check Excel Data for Blank and Invalid Values.

Data Validation versus Population Validation

I do not consider data validation and population validation to be the same thing.

Data validation is checking whether fields have appropriate values; population validation is ensuring that you received data from the correct system, that the procedures used to gather or extract the data are appropriate, and that you received ALL the data.

In some cases, the difference gets blurry. For example, if you want all the transactions between January 1 and July 31 this year, population validation is ensuring you received all the transactions for those dates and only those dates. Data validation is ensuring that all the values in the date column are actually dates, not dollar amounts, fractions, or smiley faces.

For more on population validation and how to do it, see How to Perform Data Validation (I confuse the issue by naming this post data validation instead of population validation, but I did that specifically because that is the broader term most people use for this; in that post, I’m clear on what I’m referring to).

Ultimately, I don’t care what your call them or whether you separate the 2–just make sure you do both BEFORE doing any analysis.

See all the posts for these series at Excel: Basic Data Analytic series.

Leave a comment

Filed under Audit, Data Analytics, Excel, How to...

Leave a Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s