Data Management

Always check for whitespace


Last month Justin talked about his quest for fundamental programming theory resources. One fundamental I keep running into is people who forget to check for whitespace -- or don't even know how to check for whitespace.

I can't tell you the number of times I've seen some poor engineer spend countless hours trying to debug a problem that came down to extraneous whitespace. I don't care what programming language you're using, the string " hello " will not equal the string "hello". The little extra whitespace before and after doesn't affect the readability to us humans but is generally fatal to string comparison operators.

For some reason, the biggest culprit I've found is extra whitespace stored in a database string. I think it happens more often there because when you query a table in a tool like TOAD, the result comes back in a neat little spreadsheet-type display. It's difficult to tell at a glance that you might have some extra trailing whitespace on some of the data.

The simplest trick I've found (which I use pretty much everywhere) is to wrap the string in some special characters like square brackets; that way, I can see when there's extra whitespace. Now I am primarily an Oracle developer these days, so the syntax for that might look something like this:

select '[' || keyword || ']' from keyword_table;

This wraps square brackets around every keyword so I can tell at a glance that there's extra space. I assume there's similar syntax in other variants of SQL.

If the strings don't have internal whitespace, you could also just search for anything that has a space in it -- though this method only catches regular spaces and would miss stuff like tabs. Plus, it could be really slow since it might result in a straight old-fashioned table scan:

select keyword from keyword_table where keyword like '% %';

Whenever some engineer comes and asks me to help debug why their system isn't logging into the database correctly, the first thing I do is check for whitespace in the login credentials they're using. (The second, of course, is to check the credentials themselves.) I've found that erroneous whitespace is the culprit more often than you might think. 

15 comments
C_Tharp
C_Tharp

are also a problem. Capital Oh used when it should have been zero. Lower case L used when it should have been the one digit. Fat finger errors are a big problem, too. They are notoriously hard to find because you do not have any idea which character was used. There really is no better answer for these problems than proper input validation. Even so, many can not be caught. They just have to be dealt with when they are found.

Wayne M.
Wayne M.

Although I have seen my share of whitespace issues, case sensitivity has usually been a bigger problem. Learn the trim commands and case insensitive compares both at the database level and at the language / UI component level.

The family Jules
The family Jules

some environments are case-sensitive and should maintain case. The case insensitive search for MS SQL also blew my mind, though. The DBA over here said he thinks there is a DB flag in configuration settings that can be flipped, though. Speaking of case sensitivity, I came across a bug in VB .NET a few years ago where a method in the FileSystemObject was case insensitive, but the FileSystemWatcher was case sensitive.

paulo
paulo

It is your choice when you set up SQL Server whether you want case sensitive or case insensitive. Personally we set all of our databases up insensitive as that is what we always need (Mr SMITH is the same as Mr Smith to us). In .net we usually find ourselves forcing case insensitive compares as well. My personal opinion is that this is a mistake in our language (English), it is a huge waste having a second complete alphabet that is redundant. Why should everyone learn a second alphabet so that sentences can start with a capital (there is already a full stop, so you already know it is a new sentence). We could easily find an alternative punctuation for proper nouns such as underlining, square brakets, a dot above the letter. Most users are case insensitive, as in they expect to be able to search for "smith" and get SMITH, Smith, smith, SMith etc, and although it costs the computer more it is what people expect. Values should be stored and displayed as entered but treated the same, for example, my last name is O'Donnell. The third character of my name is a capital. I have one company that sends me mail outs twice because in thier system O'Donnell is not the same as O'donnell and one of thier staff recorded my name with a lower case d.

Justin James
Justin James

... I discovered by accident that SQL Server 2005 seems to be case insensitive for the = operator (maybe I need to be using EQ or something liked it instead?). It blew my mind. Microsoft: "You" != "YOU"! Especially when I'm using alphanumerics for a Base62 (a-z, A-Z, and 0 - 9 = 62 values per position, therefore, Base62) encoding system! J.Ja

MikeSQLDBA
MikeSQLDBA

It should be common knowledge that the default collation SQL Server uses is SQL_Latin1_General_Cp1_CI_AS, where the CP1 means code page 1252 and the CI part means case insensitive (CS means case-sensitive). If you temporarily want a case-sensitive search, try a query like this: SELECT myPKey, myVarchar FROM myTable WHERE myVarchar COLLATE SQL_Latin1_General_CP1_CS_AS = 'YOU'

hlhowell
hlhowell

I speak English, as do most of my users. It is imperitive that whitespace be part of the data, because without it, the data becomes one mess. IspeakEnglish,asdomostofmyusers.Itisimperativethat whitespacebepartofthedata,becausewithoutit,theatabecomesamess. See?! I personally hate CRM programs that seem to forbid tabs, tables, fixed space fonts, and other commonly accepted spaces used for language demarcation and emphasis. The lack of proper type programs (such as the entry box for this message, which will not directly enter a tab, thus paragraphs are boxed, means a loss of expression in a medium that already is tough to get the point across in an effective manner. So choose you issues and text parsing carefully. Regards, Les H

Tony Hopkinson
Tony Hopkinson

Select Distinct CustName from Table Case insensitive Put that inside a correlated subquery for an update, and you get "more row returned by subquery" if Bill and bill are in there. This was for a one off fix to rationalise the data enetered from a legacy app by the way, not something I'd recommend for frequent use in an application. F'ing annoying though, had me scratching my head for ages as I wasn't rationalising by for the case differences, initially anyway.

Justin James
Justin James

... is data validation. If your column is supposed to be used for a single word, the fact that a padded word (or one that otherwise won't match) points out that a developer failed to properly validate their data going in! They should having either rejected the value, or quiestly stripped it of the offending characters! Not disagreeing with your approach in the slightest, just saying that the need for it points out a more basic issue at hand! J.Ja

Dr Dij
Dr Dij

is unprintable chars, including control chars, and tabs, for example, often from manual input in m$ programs such as excel. I convert all unprintables to blanks when reading data files then trim begin and end of strings to eliminate this.

Tony Hopkinson
Tony Hopkinson

Don't put bad data in your database, you get it out far more times than you put it in.

RexWorld
RexWorld

You're absolutely correct, catching extraneous whitespace before it enters the data stream is the best approach. But in my experience, the people who do the front-end data capture part of the app are not always the same as those of us who spend most of our time doing the back-end data analysis. So trying to get it fixed on the data capture or data entry side means asking some other team to do the work. And that requires approvals, scheduling, etc. I just am too lazy to push the other team into fixing their code.

Wayne M.
Wayne M.

Though I can understand your frustration, I would recommend that you continue to work with the development team to implement the necessary data validation. There is nothing quite so frustrating to a user than pressing "Save" after a data entry session and having a SQL error thrown back in his face. I agree that there are solid reasons for having data validation at the database level, but the user interface should always be more restrictive than the database.

Justin James
Justin James

... I forgot to give you a suggestion better (IMHO) than the sqare brackets. I'm too tired to get the syntax right, so forgive me... SELECT col1, (IF LENGTH(col1) - LENGTH(RTRIM(LTRIM(col1)) = 0, '', 'X') AS padded_flag That will be MUCH more obvious than the brackets, plus give you a column that you can perform further logic on! J.Ja

RexWorld
RexWorld

Ooh, that's a great idea. I'm going to steal that idea next time I have to do some whitespace hunting.

Editor's Picks