Justin James recently had an important data integrity lesson reiterated to him in quite memorable fashion by a glitch in the UPS package tracking system. Read about the glitch, and learn how you can prevent from deploying an application with serious data integrity flaws.
I work from home, and I deal with delivery services quite a bit. (I rarely buy something without checking with Amazon first for reviews and "baseline pricing.") A few weeks ago, I needed to file an Return Merchandise Authorization (RMA) on a hard drive. Imagine my shock when my tracking number arrived, and a glitch in the UPS package tracking system is what I saw (Figure A). Figure A
Click the image to enlarge.
What exactly is wrong here? Look carefully... this tracking number is handling two packages. One from Ontario, California to Carmichaels, Pennsylvania, and the other going from Ontario, California to Lexington, South Carolina (me), separated by several days. This is a huge problem, because anyone with a tracking number to a package can make alterations to its delivery. We count on the supposed uniqueness of the tracking number to provide the security.
A quick call to UPS revealed the problem; apparently, certain major senders are given the means of generating their own tracking numbers. I'm not sure if the senders get a block of numbers, or if they are given an algorithm to generate them. Regardless, the UPS system accepts these customer-created numbers with no error checking, and as a result, a glitch in the sender's system can create the chaos that we see in Figure A.
So, what can we learn about data usage from this glitch? The following are three important three data integrity lessons.
1: Identify values that should usually be generated by the system
While a package tracking number is not a true "key" value, it might as well be; it is the primary identifier for the record in the system as far as the user is concerned. In some cases, it is a good idea to let the primary identifier be determined outside of the system. For example, using an email address to identify a user is fine, as long as the user has an internal key that the system uses, so the email address can change as needed. But for something like a package tracking number, there is no reason why the user should be creating this. Even if there is a block or a special algorithm on a per-customer basis, it should be implemented by the vendor and accessed through some sort of service or remote call.
2: Check for duplicate values
This should be obvious, but you would be surprised how many applications I've encountered that fail to perform this most basic data integrity check. While at a technical level, much of the data in an application does not need to be unique, at a business level it does. A tracking number needs to be unique not because the system will mess up with a duplicate, but because you will have confused customers and potential security breaches with duplicate numbers.
3: Validate data in real time
This is one that's a new idea to me. Clearly, the UPS system is allowing the sender to operate totally connectionless, and probably gets a batch update of packages on a regular basis. The problem is that somewhere along the way, a number that should be unique is being created but not validated as unique in real time. Whether the sender needs to check against a central database first or needs to always go straight to UPS, I do not know, but you cannot accept data into the database and then perform validation after the fact. The horse has left the stables.
These may seem like obvious things to be wary of, but it's good to have a refresher about important data integrity lessons from time to time. Unfortunately, data integrity problems are usually visible to the end user before the developer sees them, and they are quite embarrassing. By putting these three principles in place, it will help prevent you from deploying an application with serious data integrity flaws.