Data Management

Dealing with case and accents in SQL Server

Collation describes the code page, case sensitivity, accent sensitivity, and language or alphabet in use. This SQL Server tip illustrates how you can force the use of any specific collation.

Collation describes the code page, case sensitivity, accent sensitivity, and language or alphabet in use. I often see collation handled on a per-database scale. For instance, in my experience, most DBAs in North America go with the default collation, which is case insensitive. The collation has no effect on the data itself, but rather how it is compared and indexed.

This tip illustrates how you can force the use of any specific collation without regard to the collation specified when the database of interest is created. It also shows some of the complications involved when your database attempts to deal with multiple character sets.

To try out the experiments in this column, create a test database and populate it with a few rows, as in Listing A. (Note: Regular readers of my tips will notice that I create a schema in which to house the tables related to any given tip. This is because I always use a single database called SQLTips for this purpose. Using schemas nicely groups the tables relevant to any particular tip.)

Get SQL tips in your inbox
TechRepublic's SQL Server newsletter, delivered each Tuesday, contains hands-on tips that will help you become more adept with this powerful relational database management system.
Automatically sign up today!

Listing B contains a few test queries for you to run. In each case, the query returns all four rows. This is to be expected when a case-insensitive collation is chosen. But suppose your task is to locate one of these rows using an exact match?

This is where collation comes in. Although a DBA usually specifies the collation when the database is created, you can apply a different collation after the fact simply by adding a COLLATE predicate to the WHERE clause. See Listing C. Each of the first three queries returns exactly one row, while the fourth query returns no rows.

Suppose that you want to change all possible spellings of 'fuller' (including the nonsense ones such as 'fUlLEr') to 'Fuller'. You can do that quite easily with the code in Listing D.

I like to run such a transaction when dealing with data inherited from a mainframe database because I tend to think of upper-case words as yelling, and doing it once to the data enables me to avoid worrying about it thereafter.

If you want a list of all the collations available, along with terse explanations, run this SQL:

SELECT * FROM ::fn_helpcollations()

To determine the default collation of the database of interest, do this:

SELECT collation_name 
FROM master.sys.databases
WHERE Name='SQLTips'

In Listing E, I add some rows containing special alphabetic characters. I use the name of my good friend and colleague, DejanSunderic, whose surname is properly written as Šunderic' (the trailing 'c' should actually have an acute accent above it — the display is a result of the selected font).

As you see, interesting things occur without the COLLATE predicate:

FROM CaseCheck.Names

This produces the following result set:

NamePK    Name
7    sUnDeRIc
8    Sunderic

Notice that the special case of 'Šunderic' is not included. Let's experiment further:

SELECT * FROM CaseCheck.Names

This produces the following result set:

NamePK    Name
5    Šunderic
1    Fuller
2    FuLLer
4    fuller
7    sUnDeRIc

This is clearly incorrect. PK 5 should reside somewhere below all occurrences of 'Fuller.' I am no language expert, so I cannot claim to know where among the Ss this ought to reside, but I suspect that this decision depends on local rules. It's clear that 'Šunderic' ought to reside somewhere after all occurrences of 'Fuller' and not before — but where in the list of names beginning with 'S'?

I did another experiment using the name of my friend Alexander Karmanov. To see his name in the Cyrillic alphabet, check out Figure A. I ran the statement in Figure B using the Cyrillic alphabet spelling of his name.

Using the default character set in SSMS, a query on the table returns his name as a series of question marks. However, if I run the following statement, SSMS returns the correct row but still displays his name as a series of question marks due to the default character set. The important thing is that the Unicode characters were entered and accepted. See Figure C.

If you want to determine the native collation in a given database, use this code:

SELECT collation_name 
FROM master.sys.databases
WHERE Name = 'SQLTips'

Miss a tip?

Check out the SQL Server archive, and catch up on the most recent editions of Arthur Fuller's column.

Editor's Picks

Free Newsletters, In your Inbox