9 ways to clean foreign or imported data

Whether you're importing data into Access or Excel, Excel is a great place to clean data before putting it into production.

Whether you're working in Excel or Access, the accuracy of your data can make or break the information you glean from it. If you own the app, you can combine data validation and data types to eliminate most input errors. On the other hand, if you inherit an app that doesn't employ these features, or you're receiving foreign data, you have little control over what arrives. In this article, I'll review several ways to clean that data before introducing it into your apps.

You can clean data in Excel before importing it into another workbook or even an Access table. I'm going to demonstrate 9 ways to clean foreign or imported data. Make sure there's a backup of the original file, regardless of its format. I'm using Office 365 Excel (desktop) on a Windows 10 64-bit system, but everything applies to earlier versions. There's no demonstration file; you won't need one.

1. Spell check

Even good validation and data types can't prevent typos, so the first thing you should do is run a spell check on the entire data set. But that's only a part of the problem. Sometimes values are spelled correctly but entered inconsistently. For instance, Susan S. Harkins, Susan L. Harkins, and Susan Harkins might all be the same person, or not. Sometimes the inconsistency is simpler, but regardless, it's best to pick a value and use it consistently. This is where input controls shine—you can force users to choose a value rather than allowing them to enter the value from the keyboard. This is a great way to protect the validity of your data when the values are known.
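
If you script part of your cleanup outside Excel, inconsistent entries like the name variants above can be flagged for review before you pick a value. The following is a minimal Python sketch, not something spell check does: it swaps in fuzzy string matching from the standard library's difflib module, and the `similar_values` helper, the 0.8 cutoff, and the sample names are assumptions for illustration. It only suggests candidates; a person still decides whether two values are the same.

```python
from difflib import get_close_matches

def similar_values(values, cutoff=0.8):
    """Map each distinct value to the other distinct values that look similar.

    Hypothetical helper: cutoff is a similarity ratio between 0 and 1.
    """
    distinct = sorted(set(values))
    result = {}
    for value in distinct:
        others = [v for v in distinct if v != value]
        matches = get_close_matches(value, others, n=5, cutoff=cutoff)
        if matches:
            result[value] = matches
    return result

# Sample data for illustration only.
names = ["Susan S. Harkins", "Susan L. Harkins", "Susan Harkins", "Bill Smith"]
suspects = similar_values(names)
```

Here `suspects` groups the three Susan variants with one another and leaves Bill Smith out, which gives you a short list to check by hand.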

To learn how to work more efficiently with Office's spell check feature, read the following article: 10 tips for using spell check more efficiently.

2. Replace mistakes

Running spell check might help you spot inconsistent values, but it won't always help you correct them. When that happens, try Excel's Find and Replace feature, which is flexible and powerful. For instance, you can use it to change all instances of Susan Harkins to Suzanne Harkins or Susan Hawkins. You can also use it to fix lesser errors, such as inconsistent capitalization, or to insert missing commas before Inc. If formatting matters, you can even find and replace formats. Look for this feature under Find & Select in the Editing group on the Home tab.
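
The same kind of bulk correction can be scripted if the data passes through a program on its way to Excel or Access. This is a minimal Python sketch rather than a description of how Find and Replace works internally; the `replace_all` helper and the sample values are hypothetical, and it assumes you want a case-insensitive match.

```python
import re

def replace_all(values, find, replacement):
    """Replace every case-insensitive occurrence of `find` in each value."""
    pattern = re.compile(re.escape(find), re.IGNORECASE)
    # A function replacement keeps backslashes in `replacement` literal.
    return [pattern.sub(lambda match: replacement, value) for value in values]

# Sample data for illustration only.
names = ["Susan Harkins", "susan harkins", "SUSAN HARKINS", "Bill Smith"]
fixed = replace_all(names, "susan harkins", "Susan Harkins")
```

After this runs, every capitalization variant in `fixed` reads Susan Harkins and the unrelated value is untouched.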

3. Remove duplicates

Before you delete anything, you need to define duplicate; the data and how you use it will determine your definition. Must every field match for a record to count as a duplicate, or do you only need one record per person (or entity)? Once you know what constitutes a duplicate, you can decide what to do with the duplicates you find. You might choose to highlight them using conditional formatting or add a new field that flags each record as a duplicate. Most likely, you'll decide to delete duplicates outright. Whichever you choose, Excel can help.

To delete duplicates, use the Remove Duplicates tool in the Data Tools group on the Data tab. Apply your definition of a duplicate by selecting or deselecting the fields that Excel compares during the process. If you'd rather highlight duplicates, there are lots of ways to conditionally format them; the article How to use built-in Excel features to find duplicates should help.
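
To make the idea of "your definition of a duplicate" concrete, here is a minimal Python sketch of the same logic. It is an illustration under assumptions, not Excel's implementation: the `remove_duplicates` helper, the field names, and the sample records are hypothetical. The list of key fields plays the role of the fields you select in the dialog, and the sketch keeps the first record it sees for each key.

```python
def remove_duplicates(records, key_fields):
    """Keep the first record for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Sample data for illustration only.
rows = [
    {"first": "Susan", "last": "Harkins", "city": "Springfield"},
    {"first": "Susan", "last": "Harkins", "city": "Shelbyville"},
    {"first": "Bill", "last": "Smith", "city": "Springfield"},
]
one_per_person = remove_duplicates(rows, ["first", "last"])
exact_only = remove_duplicates(rows, ["first", "last", "city"])
```

With only the name fields as the key, the second Susan Harkins record is dropped; with all three fields as the key, nothing is removed because no two records match completely.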

4. Remove unseen characters

Text values, especially those coming from a foreign source, can contain extra spaces and non-printing characters that cause unexpected results when sorting, filtering, and searching. It's best to remove them early on using one of two functions:

  • TRIM(): Removes leading and trailing spaces as well as multiple spaces between words.
  • CLEAN(): Removes the non-printing ASCII code characters 0 through 31.

Remember to replace the formulas with their calculated results (see #9).
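
For readers who pre-process text before it reaches Excel, the two behaviors described above can be approximated in a few lines. This is a minimal Python sketch based only on those descriptions; the function names `excel_like_trim` and `excel_like_clean` and the sample string are hypothetical, and the sketch makes no claim to match Excel in every edge case.

```python
def excel_like_clean(text):
    """Drop characters whose code is 0 through 31 (tabs, line breaks, etc.)."""
    return "".join(ch for ch in text if ord(ch) > 31)

def excel_like_trim(text):
    """Strip leading and trailing spaces and collapse runs of spaces to one."""
    return " ".join(part for part in text.split(" ") if part)

# Sample value with stray spaces, a tab, and a line break.
raw = "  Susan \t  Harkins\n "
cleaned = excel_like_trim(excel_like_clean(raw))
```

Removing the non-printing characters first and trimming second matters here, because deleting the tab leaves a run of spaces behind that the trim step then collapses.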

5. Divide value into multiple columns

Usually, you'll want each field to contain the smallest autonomous value possible. For instance, if you import a field that stores both the first and last name, you'll want to split those two names into separate fields. Fortunately, Excel's Text to Columns feature makes this simple. Select the cells in question, and then click Text to Columns in the Data Tools group on the Data tab. The resulting wizard will walk you through the process.

This feature overwrites existing data, so insert the appropriate number of blank columns to the right before you start. You can also use this feature to quickly convert numeric values to text (ZIP codes, phone numbers, and so on).

For specific instructions on how to use this feature, read Office Q&A: Built-in Excel features that eliminate the need for complex expressions.
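
The delimited split can also be sketched in code, which helps show why you need enough blank columns to the right. This is a minimal Python illustration, not the wizard's actual logic; the `text_to_columns` helper and the sample names are hypothetical, and it assumes a single-character delimiter (a space by default).

```python
def text_to_columns(values, delimiter=" "):
    """Split each value on the delimiter and pad rows to the same width."""
    rows = [value.split(delimiter) for value in values]
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]

# Sample data for illustration only.
names = ["Susan Harkins", "Bill Smith", "Mary Ann Jones"]
columns = text_to_columns(names)
```

One three-part name is enough to make the result three columns wide, so every row needs three cells of room even though most names only fill two.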

6. Transpose for better structure

If you need imported data arranged in a simple two-dimensional table, you might need to rearrange things a bit, and a general restructuring can be difficult. If, on the other hand, you only need to switch rows to columns and vice versa, you can transpose the values. It sounds difficult, but it's very easy:

  1. Select the row or column and press Ctrl+C to copy those values to the Clipboard.
  2. Select an anchor cell for the pasted values and then choose Transpose (T) from the Paste dropdown (in the Clipboard group on the Home tab).

To learn more about this easy-to-use feature, read the section "Transpose the data set" in the article How to reverse and transpose Excel data with this powerful but simple solution.
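
Transposing is simple enough to show in a few lines of code. The following minimal Python sketch illustrates the operation itself, not Excel's paste option; the `transpose` helper and the sample table are hypothetical, and it assumes every row has the same number of cells.

```python
def transpose(table):
    """Turn rows into columns; assumes every row has the same length."""
    return [list(column) for column in zip(*table)]

# Sample data for illustration only.
table = [
    ["Region", "Q1", "Q2"],
    ["North", 10, 12],
    ["South", 8, 9],
]
flipped = transpose(table)
```

Each original column becomes a row in `flipped`, so the first row now lists the regions and the remaining rows hold one quarter each.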

7. Convert text to numeric

Numeric values often import as text, so take a quick look and convert as necessary. Numbers stored as text are usually easy to spot because, by default, Excel aligns text to the left of the cell and numeric values to the right. Converting them can be as easy as changing the format from Text to General, although existing entries may not convert until you re-enter them. If the values still behave as text, use VALUE() to convert them to numbers. In addition, you can use DATEVALUE() and TIMEVALUE() to convert text into date and time values. Remember to replace the formulas with their calculated results (see #9).
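
If you convert values in a script before importing them, the same idea looks like the minimal Python sketch below. It is not a reimplementation of VALUE(): the `to_number` helper is hypothetical, and it assumes a comma is a thousands separator and a period is the decimal point, which will be wrong for some locales. Anything it can't parse comes back as None so you can review it instead of guessing.

```python
import math

def to_number(text):
    """Return an int or float parsed from text, or None if it isn't numeric.

    Assumes commas are thousands separators and a period is the decimal point.
    """
    cleaned = text.strip().replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        pass
    try:
        value = float(cleaned)
    except ValueError:
        return None
    # Reject "nan" and "inf", which float() would otherwise accept.
    return value if math.isfinite(value) else None

# Sample data for illustration only.
imported = ["1,234", " 3.50 ", "N/A"]
converted = [to_number(item) for item in imported]
```

The unparseable entry stays visible as None in `converted`, which ties in with deciding how to handle blanks and placeholder text.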

8. Deal with blanks

A business rule will determine how you deal with blank cells, but you will want to deal with them. You might want to enter the value 0 or the text NA. Fortunately, you can find and fill those blank cells at the same time, as follows:

  1. Select the entire data set.
  2. Press F5 to open the Go To dialog and click Special (bottom-left corner).
  3. Select Blanks and click OK.
  4. With all the blanks selected, enter 0, NA, or some other value and press Ctrl+Enter. Doing so will enter the value into all the blank cells.
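
The steps above amount to "find every blank, then write one value into all of them." As a minimal Python sketch of that rule, with a hypothetical `fill_blanks` helper and sample table: note one deliberate difference, labeled as an assumption, which is that this sketch also treats cells containing only whitespace as blank.

```python
def fill_blanks(table, fill_value):
    """Return a copy of table with blank cells replaced by fill_value.

    Assumption: None, empty strings, and whitespace-only strings are blank.
    """
    def is_blank(cell):
        return cell is None or (isinstance(cell, str) and cell.strip() == "")

    return [
        [fill_value if is_blank(cell) else cell for cell in row]
        for row in table
    ]

# Sample data for illustration only.
table = [["Susan", "", 3], ["Bill", None, "  "]]
filled = fill_blanks(table, 0)
```

The original table is left unchanged, so you can compare it with `filled` before committing to the business rule.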

9. Replace formulas with values

Anytime you need a function or formula to clean a value, you'll want to convert those expressions into their calculated results before moving the values into production. You won't want to import the expressions; you want to import the clean values. To replace expressions (formulas and functions) with their calculated results, do the following:

  1. Select the formula cells and press Ctrl+C to copy them to the Clipboard.
  2. From the Paste dropdown in the Clipboard group on the Home tab, choose Values (V) from the Paste Values section.

Clean data to protect validity

Whether you're importing from another database, a text file, or even a web page, you must put that data through a set of cleaning tasks to ensure its validity. Don't forget about business rules either. You'll need a set of custom cleaning tasks to enforce those.

The process can be much more complex than what I've alluded to, but these cleaning tips are a great place to start. Please share your data-cleaning tips with others in the comments section below.

By Susan Harkins

Susan Sales Harkins is an IT consultant, specializing in desktop solutions. Previously, she was editor in chief for The Cobb Group, the world's largest publisher of technical journals.