Collation describes the code page, case sensitivity, accent
sensitivity, and language or alphabet in use. I often see collation handled on
a per-database scale. For instance, in my experience, most DBAs
in North America go with the default collation, which is case insensitive. The
collation has no effect on the data itself, but rather how it is compared and
indexed.

This tip illustrates how you can force the use of any
specific collation without regard to the collation specified when the database
of interest is created. It also shows some of the complications involved when
your database attempts to deal with multiple character sets.

To try out the experiments in this column, create a test
database and populate it with a few rows, as in
Listing A. (Note: Regular readers of my tips will notice that I
create a schema in which to house the tables related to any given tip. This is
because I always use a single database called SQLTips
for this purpose. Using schemas nicely groups the tables relevant to any
particular tip.)

Get SQL tips in your inbox

TechRepublic’s SQL Server newsletter, delivered each Tuesday, contains hands-on tips that will help you become more adept with this powerful relational database management system.

Automatically sign up today!

Listing B
contains a few test queries for you to run. In each case, the query returns all four rows. This is to be
expected when a case-insensitive collation is chosen. But suppose your task is
to locate one of these rows using an exact match?

This is where collation comes in. Although a DBA usually
specifies the collation when the database is created, you can apply a different
collation after the fact simply by adding a COLLATE predicate to the WHERE
clause. See Listing C. Each of the first three queries returns exactly one row,
while the fourth query returns no rows.

Suppose that you want to change all possible spellings of
‘fuller’ (including the nonsense ones such as ‘fUlLEr’)
to ‘Fuller’. You can do that quite easily with the code in Listing D.

I like to
run such a transaction when dealing with data inherited from a mainframe
database because I tend to think of upper-case words as yelling, and doing it
once to the data enables me to avoid worrying about it thereafter.

If you want
a list of all the collations available, along with terse explanations, run this
SQL:

SELECT * FROM ::fn_helpcollations()

To
determine the default collation of the database of interest, do this:

SELECT collation_name 
FROM master.sys.databases
WHERE Name='SQLTips'

In Listing E, I
add some rows containing special alphabetic characters. I use the name of my
good friend and colleague, DejanSunderic,
whose surname is properly written as Ĺ underic’ (the
trailing ‘c’ should actually have an acute accent above it — the display is a
result of the selected font).

As you see,
interesting things occur without the COLLATE predicate:

SELECT * 
FROM CaseCheck.Names
WHERE Name LIKE N'S%'

This
produces the following result set:

NamePK    Name
6    SUNDERIC
7    sUnDeRIc
8    Sunderic

Notice that
the special case of ‘Ĺ underic’ is not
included. Let’s experiment further:

SELECT * FROM CaseCheck.Names
ORDER BY Name

This
produces the following result set:

NamePK    Name
5    Ĺ underic
1    Fuller
2    FuLLer
3    FULLER
4    fuller
6    SUNDERIC
7    sUnDeRIc

This is
clearly incorrect. PK 5 should reside somewhere below all occurrences of
‘Fuller.’ I am no language expert, so I cannot claim to know where among the Ss this
ought to reside, but I suspect that this decision depends on local rules. It’s clear that ‘Ĺ underic’ ought to reside somewhere
after all occurrences of ‘Fuller’ and not before — but where in the list of
names beginning with ‘S’?

I did another experiment using the name of my friend
Alexander Karmanov. To see his name in the Cyrillic alphabet, check out Figure A. I ran the statement in
Figure B using the Cyrillic
alphabet spelling of his name.

Using the
default character set in SSMS, a query on the table returns his name as a
series of question marks. However, if I run the following statement, SSMS
returns the correct row but still displays his name as a series of question
marks due to the default character set. The important thing is that the Unicode
characters were entered and accepted. See Figure C.

If you want to determine the native collation in a given database, use
this code:

SELECT collation_name 
FROM master.sys.databases
WHERE Name = 'SQLTips'

Miss a tip?

Check out the SQL Server archive, and catch up on the most recent editions of Arthur Fuller’s column.

Subscribe to the Data Insider Newsletter

Learn the latest news and best practices about data science, big data analytics, artificial intelligence, data security, and more. Delivered Mondays and Thursdays

Subscribe to the Data Insider Newsletter

Learn the latest news and best practices about data science, big data analytics, artificial intelligence, data security, and more. Delivered Mondays and Thursdays