Before you start analyzing data, you need to 1) know you have the right data, and 2) understand the data and the process that produced it.
This post assumes, of course, that you already accomplished some of the hardest tasks already: figuring out what data you need, where to get it, and actually getting the data. Good luck with that. :)
This post is part of the Excel: Basic Data Analytic series.
Before testing or analyzing any data, perform these steps:
- Population validation is ensuring that you received data from the correct system or location, that the procedures used to gather or extract the data are appropriate, and that you received ALL the appropriate data. This is the first step that should be performed when you receive data. See How to Perform Population Validation.
If you discover you have bad or insufficient data, don’t waste your time on the next steps; go back and request better data.
- Data validation is checking whether fields have appropriate values and the data is error-free; for example, date fields contain dates in date format, and numeric fields contain only numbers, not text or blanks. Do this step after population validation. See Why You Must Validate Data.
Again, you should do this as much as possible before you move to the next step, as errors missed here will lead to errors later.
- Data profiling is slicing and dicing the data to help you understand it better. It includes things like simple statistics (counts, totals, ratios), identifying outliers, the frequency that transactions or values occur, and checking relationships between data fields (or fields in other files). See How to Profile Data.
Data profiling gives you ideas of what to review indepth and, sometimes, helps you identify what isn’t worth looking at. It provides you with questions to take back to your business contacts, which will lead you to deeper insights into the process you are testing.
However, these tasks can overlap, and you seldom do them 1-2-3. For example, if you know you have 200 branch offices, and only 196 appear in your data, you need to ask why. That’s part of population validation, but technically, it’s also a count of records, which is a simple statistic.
Let me help you sort this out. For example, regarding counting and totaling items:
- If I perform the count to ensure the data is accurate and complete, it’s part of population validation (are all branch offices included?). The count is just a quick check. After determining the count is correct or not, you would still want to compare the list to a complete list of branch offices.
- If I perform a count to determine which branch office makes the most errors, that’s part of data profiling; this procedure doesn’t validate anything, it gives you insight (branch with the most errors). So you do similar tasks in all population validation, data validation, and data profiling for different reasons, and on different sections of the data, depending on what you’re trying to accomplish.
KEY POINT:
Don’t get hung up on whether a procedure is population validation, data validation, or data profiling. What’s important is that you DO ALL 3 STEPS before your start testing. Otherwise, your testing could be ineffective and a big waste of effort.