A friend and colleague recently faced an interesting problem. Each month he receives a huge amount of data from a client, which he then must massage before it's usable. The tables contain millions of rows, and the structure of the tables may vary from month to month. To make matters worse, much of the data contains leading or trailing blanks or both.
After my friend did the job by hand the first time around, he quickly discovered the tedium of a manual approach. He wondered whether there was a way to automate the process of trimming all the data—the problem is that he cannot know beforehand how many columns there are or the names of the columns. Fortunately, it is possible to automate this process.
In SQL Server 2005, there are a number of very useful system views that reside in the INFORMATION_SCHEMA schema. The view of interest in this situation is INFORMATION_SCHEMA.COLUMNS.
All of these system views are hidden from your scrutiny, so you can't simply right-click and Script To a query window. Microsoft did this on purpose so it is free to change the innards. If you know something about the system tables, you can guess how the view is built, but for our purposes here, the innards are not important.
You can, however, obtain a wealth of information about the columns in your tables. Here are some of the interesting ones:
- Table_Catalog: name of the database
- Table_Schema: Schema
- Table_Name: name of the table
- Column_Name: name of the column
- Ordinal_Position: column number
- Column_Default: default value if any
- Is_Nullable: contains Yes or No
- Data_Type: indicates the data type of the column
For a list of the columns on any given table, your query would look something like this:
WHERE TABLE_NAME ='ProductDescription'
Try this command on one of your databases, substituting its name and a table of interest. Run it to get an appreciation of its output.
Now I'll use SQL to generate the UPDATE statements I need to trim all the data in the table. I achieve this by assembling a literal for each column in the table of interest. I'll skip all the data types that are irrelevant to our needs and concentrate on the char and varchar types. See Listing A. The output of this command is a series of UPDATE statements similar to what appears in Listing B. All you have to do is collect the output and run the block of statements as a query.
In the examples above, I was interested in only one table, but if I want to process all the tables within a schema, I would simply modify the WHERE clause to interrogate the Schema_Name column instead. See Listing C. Check out Listing D to see the output of this command.
This reveals a small problem. As denoted by my naming scheme, a view got included, and it makes no sense to try to update a view when I am already updating its underlying tables. So I need a way to skip the views. Looking only at the INFORMATION_SCHEMA.COLUMNS, there is no way to avoid this because views are not distinguished from tables.
Another view, INFORMATION_SCHEMA.Tables, comes to the rescue. I just need to add an EXISTS predicate to the above query. See Listing E. This version strips out the views from the list to process.
My colleague is now a happy camper. The solution is simple to use and very flexible. He can use it on a single table, or all the tables in a schema, or even on all the tables in a database, just by modifying the WHERE clause.
As you can see, SQL Server 2005 makes it far easier to work with table and column information than was the case in SQL Server 2000. I have also demonstrated how to generate SQL code using SQL. Now you know how to turn an onerous task into a breeze.
Miss a tip?
Check out the SQL Server archive, and catch up on the most recent editions of Arthur Fuller's column.