
Benign entities and misfit characters

Manage difficult characters in HTML.


By Paul Anderson

Time for another one of my Web peeves. All too often, especially on sites with lots of written copy, you'll come across a strange statement looking something like this:



These odd hieroglyphs tend to occur where apostrophes, quotes, or em dashes should be; you are not seeing an epidemic of remarkably consistent typos. The problem is a poor understanding of character sets among both Web authors and tool vendors.

Computers handle text as a series of numbers. To display those numbers as text, a computer uses a character set that determines which numeric code value represents which character. For example, the value 65 may produce a capital A, while 66 indicates a capital B. The problem is that character sets vary from computer to computer, and there is no hard-and-fast rule that 65 equals A and 66 equals B.
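To make the number-to-character idea concrete, here is a minimal Python sketch. It assumes the ASCII character set, one of many possible mappings, and the variable names are only illustrative.

```python
# Two numeric code values, stored as raw bytes.
codes = bytes([65, 66])

# The character set decides which characters those numbers stand for.
text = codes.decode("ascii")
print(text)  # AB

# Going the other way: look up the numeric code of each character.
print([ord(ch) for ch in text])  # [65, 66]
```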

The trick on the Web is to make sure that your page uses a character set supported by your readers' browsers. Thanks to a number of standards and conventions, this is fairly simple.

Note that we're specifically discussing Western European character sets here. Multiple language support and the universal character set are larger issues beyond the scope of this column. But since anyone reading this presumably understands English, this will probably be useful to you.

Characters disagree; entities mediate
On Windows or a Web browser, the example above is illegible. But if you copy the text to a non-browser Macintosh application, the numeric data comprising the text corresponds to different characters:



The statement, while still strange, becomes legible. This is because the example was originally written on a Macintosh, which has a proprietary MacRoman character set. Windows has the proprietary Windows-1252, while Web browsers, regardless of platform, use still other sets. Luckily, figuring out how your text will look on browsers is not that complicated.
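You can reproduce the effect with a few bytes. The sketch below is an illustration rather than the example from this column: the byte values are an assumed sample, the word "Hi" wrapped in MacRoman curly quotes, and the codec names are the ones Python's standard library uses for these sets.

```python
import unicodedata

# "Hi" wrapped in curly quotes, as a Macintosh application would store it.
raw = bytes([0xD2, 0x48, 0x69, 0xD3])

as_mac = raw.decode("mac_roman")   # what the author typed
as_latin1 = raw.decode("latin-1")  # what a browser assuming Latin-1 shows
as_windows = raw.decode("cp1252")  # what Windows-1252 makes of it

# Print character names so the output is readable on any terminal.
for label, decoded in (("mac_roman", as_mac), ("latin-1", as_latin1), ("cp1252", as_windows)):
    print(label, [unicodedata.name(ch) for ch in decoded])
```

The same four numbers come out as quotation marks under MacRoman and as accented capital O's under the other two sets, which is the kind of substitution behind the odd hieroglyphs.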

Common ground in Latin-1
First of all, the common denominator for most Western character sets is the ASCII set from 0 to 127. The control characters from 0 to 31 vary, but the legible characters from 32 (space) to 126 (tilde) are very consistent. These are the letters, numbers, and symbols you can type directly from any Western keyboard with a Shift key:



These characters reliably transfer from one platform to the next. Even non-Western character sets often include these characters since computer languages such as HTML are written in them.
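As a quick check of that claim, the following Python sketch builds the 95 printable ASCII characters and confirms that four common Western sets decode them identically. The list of codecs is my own selection, not an exhaustive survey.

```python
# The legible ASCII range: 32 (space) through 126 (tilde).
printable = "".join(chr(n) for n in range(32, 127))
print(len(printable))  # 95
print(printable)

# The same byte values decode to the same characters in each set.
raw = bytes(range(32, 127))
agree = all(
    raw.decode(codec) == printable
    for codec in ("ascii", "latin-1", "cp1252", "mac_roman")
)
print(agree)  # True
```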

However, since the ASCII set isn't adequate for human language, HTML has always (since version 2.0) supported the ISO-8859-1 character set, or Latin-1. This is equal to ASCII plus 96 more characters:



Not everyone can type these characters from the keyboard, and operating systems differ over how to encode them. So HTML provides entities that let you signify these characters by their Latin-1 numbers or by a name, such as either &#227; or &atilde; to signify ã.
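If you want to see how the numeric and named forms relate, Python's standard html module knows the same entity tables. This is a small sketch using the a-with-tilde character (Latin-1 code 227); the variable names are illustrative.

```python
import html
from html.entities import codepoint2name, name2codepoint

ch = "\u00e3"  # a with tilde, Latin-1 code 227

# Build both entity forms from the character.
numeric_entity = f"&#{ord(ch)};"
named_entity = f"&{codepoint2name[ord(ch)]};"
print(numeric_entity)  # &#227;
print(named_entity)    # &atilde;

# Both forms resolve back to the same character.
print(html.unescape(numeric_entity) == html.unescape(named_entity) == ch)  # True
print(name2codepoint["atilde"])  # 227
```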

All browsers, including Macintosh browsers, support Latin-1 and use it as the default Western character set. This creates problems for Macintosh documents that don't use HTML entities, since the proprietary Macintosh set uses different codes for those characters. It just happens that the Windows character set is identical to Latin-1 in the range 160 to 255, so Windows documents tend to display properly.
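Here is a rough way to verify that overlap yourself, assuming Python's cp1252, latin-1, and mac_roman codecs are faithful to the three character sets. It compares every code from 160 to 255.

```python
high = bytes(range(160, 256))

# Windows-1252 and Latin-1 decode this whole range identically.
windows_matches = high.decode("cp1252") == high.decode("latin-1")
print(windows_matches)  # True

# MacRoman assigns many of the same codes to different characters.
mac_mismatches = [
    b for b in high
    if bytes([b]).decode("mac_roman") != bytes([b]).decode("latin-1")
]
print(len(mac_mismatches) > 0)  # True

# One example: code 228 is a-umlaut in Latin-1 but a per-mille sign in MacRoman.
print(228 in mac_mismatches)  # True
```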

Danger--or Unicode
But Windows and Macintosh each have characters beyond the Latin-1 set, in the range 128 to 159, and this is where they both lead to trouble. On Windows, this range includes such desirable characters as curly quotes, em dashes, and the trademark symbol.

Macintosh adds most of the same characters as well as ligatures, accents, and mathematical symbols.

Even where the two platforms have the same extended characters, they use different numbers to encode them. The only standard encoding for these characters is Unicode, a universal character set and work in progress that assigns a unique value to every possible character from all human languages. Unicode is a superset of Latin-1, so its character codes (32 to 126, 160 to 255) remain the same. HTML 4.0 lets you display any character using its Unicode value in a numeric entity, and it offers named entities for the most popular ones. However, only recent (5.0) browsers fully support the Unicode entities.
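For the curious, producing Unicode numeric entities is nearly a one-liner in Python. The sketch below relies on the standard xmlcharrefreplace error handler, which swaps every character the target encoding cannot hold for its decimal Unicode entity. The sample sentence is made up.

```python
# An em dash (U+2014) and a curly apostrophe (U+2019), written as escapes.
text = "Check it out\u2014we\u2019re open"

# Anything outside plain ASCII becomes a numeric entity.
escaped = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)  # Check it out&#8212;we&#8217;re open
```

The resulting markup contains only plain ASCII, so it survives any Western platform, but the reader's browser still has to understand the Unicode entity values.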

This is why CNET uses double hyphens--as you see here--for the em dash. While most platforms have an em-dash character, there is no common code number for it. We'll be able to use &#8212; or &mdash; once support for Unicode and HTML 4.0 is sufficiently prevalent. We currently get away with &#149; for the bullet character simply because, for whatever reason, non-Windows browsers almost all support this number as a bullet. But the Latin-1 entity &#183; would be more defensible, and eventually we will switch to &#8226; or &bull; instead.

Guidelines for extended character use
These guidelines will help you keep illegible characters out of your Web pages. They assume that your operating system has a Western character set (Windows-1252 or MacRoman) and your Web pages will be viewed in ISO-8859-1 (Latin-1). But you can apply similar guidelines to non-Western sets based on how your system and viewing sets' character codes do and do not line up. You may also need to adapt these guidelines to your site's content entry system, and advise the people who use it. Word processors, for example, are probably the most frequent source of proprietary characters in Web pages, and simply turning off smart quotes can make all the difference.

If you're using a Macintosh
While Macintosh browsers use the Latin-1 set, authoring tools such as BBEdit and Dreamweaver input and display text using the local MacRoman character set. Unfortunately this set is proprietary beyond the visible ASCII characters, so you'll have to use escaped entities in your code and find substitutes for characters with no Latin-1 equivalents. These tools may at least have a conversion utility to replace the extended Macintosh characters with their appropriate entities for you.
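If your tool lacks such a utility, the conversion is simple enough to sketch. The function below is a hypothetical helper of my own, not a feature of BBEdit or Dreamweaver. It assumes the input bytes are MacRoman, uses named entities for characters that exist in Latin-1, and falls back to Unicode numeric entities (which need HTML 4.0 support) for everything else.

```python
from html.entities import codepoint2name

def macroman_to_entities(raw):
    """Decode MacRoman bytes and replace extended characters with entities."""
    out = []
    for ch in raw.decode("mac_roman"):
        cp = ord(ch)
        if cp < 128:
            # Plain ASCII is safe everywhere; pass it through untouched.
            out.append(ch)
        elif 160 <= cp <= 255 and cp in codepoint2name:
            # The character exists in Latin-1: use its named entity.
            out.append(f"&{codepoint2name[cp]};")
        else:
            # No Latin-1 equivalent: fall back to the Unicode number.
            out.append(f"&#{cp};")
    return "".join(out)

# MacRoman stores o-umlaut as 0x9A and the em dash as 0xD1.
print(macroman_to_entities(b"Bj\x9ark"))  # Bj&ouml;rk
print(macroman_to_entities(b"\xd1"))      # &#8212;
```

For older browsers you would substitute something like a double hyphen in the fallback branch rather than emit a Unicode entity.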



If you're using Windows
While using entities is the kindest and surest approach, with Windows you have extended characters from codes 160 to 255 that map directly to their Latin-1 equivalents. This gives you the option to directly type or copy these characters into your Web pages. If you do this, take two steps to make sure your end users can read these characters.

First, studiously avoid the Windows characters outside the Latin-1 set. The usual hazards are daggers, trademark symbols, ellipses, bullet points, smart quotes, and em dashes. To illustrate, the last three are replaced with HTML 4.0 named entities below:
Acceptable with HTML 4.0 support
<p>&bull; Check it out&mdash;we&rsquo;re selling a 79¢ &ldquo;collectible&rdquo; Björk poster (© 1999) for ½ price!</p>


Second, make certain your Web page clearly identifies the character set to be used in the browser, either in the Content-Type HTTP header or, if you can't arrange that, with a META tag in the page's HEAD:
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

This is in case a reader's browser is configured to use a different set. Although you could try simply setting the character set to Windows-1252, not all browsers reliably support it, whereas Latin-1 (ISO-8859-1) is a sure bet.
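To help with the first step, you can scan a page's raw bytes for the Windows-only codes before you publish. This is a minimal sketch with a hypothetical function name, and it assumes the file was saved in Windows-1252. It flags any byte from 128 to 159 and leaves the Latin-1 range alone.

```python
def find_risky_bytes(raw):
    """Return (offset, byte value) pairs for bytes in the range 128 to 159."""
    return [(i, b) for i, b in enumerate(raw) if 128 <= b <= 159]

# A cent sign and an o-umlaut (both Latin-1) plus curly quotes (Windows-only).
sample = "79\u00a2 \u201ccollectible\u201d Bj\u00f6rk".encode("cp1252")
print(find_risky_bytes(sample))  # [(4, 147), (16, 148)]
```

Only the two curly quotes are reported. The cent sign and the o-umlaut sit in the 160 to 255 range, so they pass.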

Isn't this rather a hassle?
The whole character entity system is a standby for older systems while we wait for widespread Unicode support. Now that the standard is well underway, newer operating systems such as Windows NT and 2000, Mac OS X, Solaris, and even PalmOS support Unicode natively, as do office, browser, and developer applications for these systems. With native Unicode encoding you can enter and store your data so that it can be reliably viewed on any compliant application. So be patient while you struggle with the entities; time is on your side.
