Otherwise, your analysis may not be too broad, too narrow, or you may miss some important insights or errors.
This post is part of the Excel: Basic Data Analytic series.
Data profiling is developing a profile of your data, just as facial profiles of a person, taken from various angles, helps you size up a person’s nose, identify whether his chin is sagging, and how far apart the person’s eyes are.
Profiling data gives you ideas of what to review in depth and, sometimes, helps you identify what isn’t worth looking at. It provides you with questions to take back to your business contacts, which will lead you to deeper insights into the process you are testing.
Data profiling is one of the quickest and easiest way to start understanding your data. It includes things like simple statistics (counts, totals, ratios), identifying outliers, the frequency that transactions or values occur, and checking relationships between data fields (or fields in other files), and other methods.
Below are examples of testing that you can do during data profiling.
- Record counts – is this number reasonable, or were you expecting less or more Windows servers in the data center? Does it make sense that the number of shipments for each month is only 20?
- Totals – Does it make sense that the Mail Room has more servers dedicated to run their software than Accounts Payable? Does the total payroll amount seem reasonable for 50 employees?
- Ratios – Do we have 3 times as many Windows Servers than Unix servers? Is the pay and bonuses for the Service Center department substantially less than those for the Legal department?
- Minimum and maximum values – What is the lowest and highest number of calls the Service Center has answered in a single 24-hour period?
- Number of positive, negative, and zero values – and are they acceptable in this particular data set?
- Mean – What is the average number of vacation days per person not used across the company? What is the average number of sales made per person during the year? [Be careful of averages, as high values in one time period can offset low values in another time period.]
- Classification/Percentages – How many transactions have a source type of A, B, or C? How many laptops are still running Windows XP or Windows 7 vs Windows 10?
Outliers are values that are on the ‘ends’ of the data and are not close to any other values. For example, if most people scored between 70-80 out of a 100 on a test except for 2 people, who scored 40 and 100. Those 2 values are your outliers.
Outliers are not always a sign of a problem, but should be considered carefully. The further away from the natural group a value is, the more you should examine it. Yous should ask questions like:
- Why does that value occur? Is is a positive or negative occurrence?
- Do other values in the data correspond with this value?
- If history is available, do the same people/servers/processes/etc. always appear on this end of the data?
Excel has a real easy scatterplot chart to help you visualize outliers. Just select the column with the data you want to analyze and select the scatterplot chart.
Frequency is similar to classification/percentages (see above), but it looks over time or in conjunction with other factors. Inconsistencies may indicate data quality issues, errors, or possible fraud. For example:
- Transactions from bank A fail the weekend of the month-end cycle, but never any other time.
- The number of airplane repairs increase immediately after flights abroad.
- No transactions are entered by user ID JON316 during the third week of each month.
- Client X had changed their mailing address 3 times a year for the past 3 years.
Relationships examine the links between 2 or more fields in a file (or in mulitple files). Again, inconsistencies may point to data quality issues, errors, or wrongdoing. For example:
- Salespeople are marked ‘Active’, but have a date of death recorded.
- Invoice date is always the same as or earlier than payment date.
- Client Z submitted 2 online requests for loans in the past year, but hasn’t logged into the website for 2 years.
- Customer who applied for a retirement account, but is 100 years old.
- Employee submitted hotel expenses from out-of-town, but didn’t submit any transportation or meal expenses (this is more of a test than data profiling, which shows the overlap that can occur).
- Stratification – What is the average number of calls in each 1-week period each month? How many vendors are 30, 60, and 90 days behind in their payments?
- Sequences and gaps – Why does the most heavily used server not have an log entries for Monday and Tuesday? Why is every other invoice number missing?
All of these tests can be performed in Excel by using filters, vLookups or other Excel formulas, pivot tables, and charts.
And of course, this list is not exhaustive, but data profiling should be performed on each file used in an audit.