Quote of the Weak – Clean Data Manually

clean data manuallyIf you are in IT, audit, or security (or any other job requiring data analysis), you should NOT be cleaning data manually.

Let me share a recent experience with you….

A young IT auditor texted me at work and asked for some Active Directory user account data that I capture automatically every week, using some scheduled ACL scripts.

If you’re not familiar with my ‘Quote of the Weak’ series, I described it briefly in About. For a list of posts in this series, see here.

He wanted to add user names, cost center, and other user account data to the list of user IDs he had obtained from his client.

I directed the auditor to the location of the data, to which the entire audit department has access.

Step 1: Map the Drive

The auditor (let’s call him Jamie) texted me again a short time later, saying he could not find it (imagine that, an IT auditor with 2 years of ‘experience’ who can’t map a shared drive [see my list of What IT Auditors Ought to Know – and Don’t!]).

I walked over to his desk and showed him how to connect to the drive.

Step 2: Identify a Problem

After he opened the data I captured and saved a copy of it, I asked to see the user list that he had obtained from the client.

I noticed that his user list included some user IDs that contained extra characters at the end, which (in this company) indicate the ID is a special ID and has increased access (don’t get me going on this ridiculous scheme, that’s another story).

This particular company uses the first 3 letters of the user’s last name and 3 numbers to create user IDs, such as JON316.

However, for special IDs, this company uses an ‘x’ after the ID refers to the first additional security level, ‘y’ to the second, ‘z’ to the third.

So a gal named Bella Jonah is assigned JON316 for her everyday account and JON316x and JON316y for her special accounts.

The data that I capture each week contains only basic user IDs, like JON316; no x, y, or z accounts.

Step 3: Transform the Data

I mentioned this difference to Jamie, and told him that he’d need to use an Excel formula to remove the extra character (x, y, z) at the end of his list of user IDs so that they can be matched to the data he received from me.

Blank stare.

“Have you used the LEFT formula in Excel? Just keep the first 6 characters?”, I prompted.

Blank stare.

“Put your cursor in cell D1 and type =LEFT(A1, 6) and press ENTER. Got it?” I searched his face for an answer.

“Uh, sure,” came back the reply. The words didn’t match the face expression.

“Just copy it down the column. Will you remember to use that formula in the future?” I asked.

“I’m not familiar with Excel formulas. Still learning them.”

(Did I mention this auditor had 2 years of experience?)

Step 4: Eliminate Manual Labor

“So you know how to use vLookup* to match data, but you’re not familiar with the LEFT formula? So how were you going to trim those user accounts in your data?,” I asked.

*I had heard that whenever Jamie used vLookup, he asked another auditor to help get the formula right.

“I was going to clean the data manually,” Jamie said, almost defiantly.

“All 150 accounts? By hand?” I questioned. “Really?”

“Yes.” Silence. Blank stare. Not any trace of guilt or shame.

Look Mom, No Formulas!

“Then let’s forget Excel formulas,” I said, recovering. ” I’ll show you an easier way, no formula involved,” I directed. “Position your cursor in the next blank column, cell E1.”

“Ever heard of Flash Fill?” I asked (I’d written a couple articles on the department Sharepoint site about Flash Fill and had demonstrated how to use it in a previous department meeting. One that this auditor had attended.).

See my post about Flash Fill. If you don’t know it, don’t miss this one. Flash Fill can do so very much.

“No,” the auditor replied swiftly.

“With your cursor in cell E1,” I said, “Note that the user ID is cell A1 is JON316x. Type what you want to appear in cell E1–JON316.”

The auditor typed the user ID flawlessly.

“With your cursor in cell E2, note that the user ID in A2 is ROM323z. Type what you want to appear in cell E2–ROM323.”

Again, flawless typing.

“OK, now put your cursor in cell E3, click on the Data tab, then click the Flash Fill button.”

“It finished the entire column!,” the auditor exclaimed. “Wow!”

“Now go back and click inside cell E1. See a formula?”

“No, I don’t. That’s cool!” the auditor admitted.

“I hope that changes your life,” I said, turning to exit his cube.

Questions for You

  1. What would have been the result if I had not noticed the x, y, z characters added to the end of some of Jamie’s IDs and he successfully performed a vlookup?
  2. Editing data manually takes a lot of time. What is the other major problem with Jamie’s’ manual solution?

Please comment below, and we’ll discuss.

See my post about Flash Fill.



Filed under Audit, Case Files, Data Analytics, Excel, How to..., Quote of the Weak, Security, Technology

3 responses to “Quote of the Weak – Clean Data Manually

  1. William

    I have not used Flash Fill like that before, but that is pretty neat. I will have to show this to my coworkers next time I help them with Excel.

    It is probably best that Flash Fill outputs the values instead of formulas, but it would be an interesting learning opportunity if it gave some insight on how to perform that transformation with formulas.

    If he performed the VLOOKUP() match without cleaning the extra characters, the results would depend on the optional ‘range lookup’ parameter in the VLOOKUP() function:

    ‘FALSE’ – all of the additional security accounts would return an ‘#N/A’ error, making it look like those IDs did not exist. This is bad.

    ‘TRUE’ or omitted entirely – depending on how the data was sorted those security accounts could return:
    – An ‘#N/A’ error.
    – Information for the correct employee.
    – Information for an entirely different employee.
    This is even worse since the wrong information is less likely to be noticed than an error.

    When the ‘range lookup’ parameter is TRUE or omitted, VLOOKUP() looks at the table array and uses “The Price is Right” rules to look for a match, starting from the top it looks for the closest value without going over.

    As for editing data manually, besides the time involved there are several problems:
    – No record/log of what changed. When changes are made manually, if an error is noted in audit testing it could be difficult (or impossible) to know if the testing used accurately cleaned data or the auditor made a mistake in the manual cleaning.

    – Has to be performed again from scratch if updated data is obtained or when that audit is performed next time.

    – Does not scale at all, the next time there might be 2,000 data points instead of 150.

    – Squanders an opportunity to practice and learn. (…and here is where I climb up on my soapbox…)
    When you can think of a way to do it with formulas and features you already know, then you have a great chance to practice, make improvements, or even try a different approach. When you cannot think of a way to do it, then an opportunity to learn and grow just fell into your lap. You might get to learn a new formula or feature, how to use a formula in a new way, how to combine two or more formulas together, a way to leverage other data you already have. You might also discover a great resource you can use later; it could be a reference guide, finding out your coworker is knows JOINs in SQL really well, a detailed tutorial video, a blog, etc. (I believe this is how I came across ITAuditSecurity to begin with).

    The times when using a non-manual approach seems like overkill to others (“it’s only 25 records, it would just take a few minutes”) are actually the PERFECT time to learn a new technique:
    > There is a well defined starting point and objective, so you have something specific to accomplish.
    > You can easily validate if what you tried worked, comparing the result to what you would have done manually.
    > It is a low stakes scenario, which makes it easier to be creative and try different approaches. If you hit a wall and cannot figure out what to do, you can still just do it manually.

    I passionately believe that the benefits from consistently seeking out and taking opportunities to learn and practice are not cumulative; the benefits are compounding, if not exponential.


    • William,
      You are indeed passionate. I think that’s the longest comment anyone has ever left (this blog started in 2009, so that’s an accomplishment!).
      I wish half the auditors I’ve known were half as interested in learning as you are. Are the auditors you know learners?
      I’ll wait until a couple more people comment before I chime in, but you covered it pretty well :)


Leave a Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.