Data Centers

Use these four methods to clean up your data

Is your ERP or CRM data accurate? You may be surprised how easily large databases can become corrupted with incorrect, outdated, and unusable data. Here are four ways you can clean up your enterprise data.

Problematic data can lead users to distrust the very applications they rely on to make marketing and sales decisions. The only way to reverse this situation is to “clean” your data. But writing code to do that can be time-consuming and costly. Fortunately, there are a number of data quality methods that will clean your data for you. This article looks at four of these: data parsing, data correction, data standardization, and data quality matching.
Read Mary Ann Richardson’s article ”Data quality is vital to CRM or ERP user acceptance.”
Data parsing
With data coming in from different sources throughout an organization (e.g., phone orders, catalog orders, and the Internet), it is not surprising that users are often unsure about whether a field entry from one source matches a field entry from another. For example, does Name=Dr. Jones, Mary in the Accounts Receivable database represent the same person as Last Name=Jones, First Name=Mary in the Sales Contact database?

Sponsored by
PeopleSoft is the exclusive sponsor of TechRepublic's special series on Enterprise Applications. For more information, check out TechRepublic's Enterprise Application Center, or visit PeopleSoft

PeopleSoft is the exclusive sponsor of TechRepublic's special series on Enterprise Applications. For more information, check out TechRepublic's Enterprise Application Center, or visit PeopleSoft

To determine whether two records from disparate data sources represent the same entity, data quality software employs parsing tools to identify and isolate the individual elements of a record so that they conform to your organization’s field definitions and business rules. For example, suppose you have the following record in the Accounts Receivable database:
Accounts Receivable
Name: Dr. Jones, Mary
Address: 147 Elm St., Brooklyn, New York

It would be extremely difficult, if not impossible, to match this record with the following record in the Sales Contact database:
Sales Contact
Title: Doctor
First Name: Mary
Last Name: Jones
Address: 145 Pine St.
City: Brooklyn
State: NY

Parsing would transform the Accounts Receivable record as follows:
Accounts Receivable
Title: Dr.
First Name: Mary
Last Name: Jones
Address: 147 Elm St.
City: Brooklyn
State: NY

Once parsed, the individual elements in the Accounts Receivable record better match their corresponding elements in the Sales Contact record. However, upon closer inspection, we find that the data in the Address fields do not match. Before we can truly say whether this is the same person, we need to determine if the data in the parsed fields is correct.

Data correction
The most expeditious way to correct and verify your data is to use data quality software whose data correction tools reference a reliable secondary data source. These tools compare an organization's data files against those of an established data vendor for validation and correction. Tools vendors generally have contractual arrangements with one or more established data vendors. Address cleansing tools, for example, typically use the United States Postal Service’s national ZIP+4 directory to verify names and addresses. Other data sources used to validate data may include those from such vendors as Dun & Bradstreet, Experian, Group 1 Software, Inc., and MAILERS Software.

Tools are also available for cleansing data that deal with customers or businesses outside the United States. These tools, when used with data vendors, allow you to identify country names, format and standardize addresses, correct missing address information, and assign postal codes for each country that you deal with. The software also provides regional language translation for multilingual countries as well as full recognition of diacritical characters.

Data standardization
Data cleansing tools also use standardization software. Data standardization typically employs algorithms based on match standards. Match standards are agreed-upon representations of data elements that can be assigned by standardization software. For example, whereas disparate data sources may list XYX Plumbing as XYZ, Xyz Plumbing, or XYZ Inc., standardization software will ensure that all entries conform to an agreed-upon standard (for example, XYZ Plumbing).

In the Mary Jones example, the Title field of the Sales Contact record contains Doctor, while the record in Accounts Receivable contains Dr. These entries can be standardized against an agreed-upon match standard, which in this case could be Dr.

Information that follows names and addresses can be standardized with data standardization tools. Examples include Junior to Jr. following a name and Building to Bldg or Suite to Ste following an address. While some tools use only algorithms that rely on input from the users for matching standards, others standardize entries by integrating secondary data sources, a prime example of which is standardizing addresses against the U.S. Postal Service’s national directory.

Making the match
After all data is parsed, corrected, and standardized, it is ready to be handed over to data quality matching software that will identify similar data records within and across all data sources. Matching software first establishes the user criteria for determining a match. For example, the matching criteria may be that both name and address information of two records be exactly alike for a match to occur or that there be an exact match between the Name and Phone Number fields.

Matching software may employ one or more matching methods. Typical matching methods include key-code, fuzzy, and soundex matching. A key-code match performs identical comparisons using the first few characters in one or more fields. Fuzzy matching identifies matches by computing a degree of likeness between two discrete data elements. It considers all characters in a field and their position to determine the degree of a match. Because fuzzy matching does not require identical character matches and can adjust for spelling, phonetic, typographical, and transpositional errors, it is the preferred method for matching data that resists standardization, such as last names and house numbers.

Where data entries are received over the phone, soundex matching is recommended. Soundex matches are based on phonetic similarities such as f and ph or Quick and Kwik. Weighted matching can be combined with soundex or fuzzy matching to give users an idea of the relative importance of the fields used in determining the match.

Going beyond your own data
Most data quality methods that employ external data sources for data correction can also append information from those same data sources (such as SIC codes or income ranges) to enhance your in-house data stores. In my next article, I’ll describe how data quality methods can help you clean and enhance your data in real time.
How does your organization ensure data accuracy? What tools and techniques have you found to be most effective in maintaining data integrity? Post a comment or write to Mary Ann Richardson and share your thoughts.

Editor's Picks

Free Newsletters, In your Inbox