I was recently teaching a class, and a student asked about the character encoding used in a Web page. This led to a good discussion of the topic, and it got me thinking about the myriad options available to Web developers. This column examines why developers use character encoding for Web pages, outlines the character encoding options, and offers guidance on how to choose a character encoding.
Why worry about character encoding?
The character encoding associated with a Web page determines how a Web browser turns the page's bytes into the characters it displays. To understand the concept, it helps to distinguish between a character encoding and a character set.
Dictionary.com defines a character set as a particular mapping between characters and byte strings (in practice, the set of characters required for a certain language). It combines a particular coded character set with a particular character encoding. A coded character set is a set of characters in which each character has been assigned a unique number. A character encoding defines how those numbers are mapped to bytes for storage and manipulation in a computer. To sum it up, the character encoding tells the Web browser how to convert the bytes it receives into characters. Here are several reasons you should specify a character encoding:
- You should worry about character encoding since its declaration became a requirement with the HTML 4.01 specification.
- If a character encoding is not specified in a Web page, the browser will guess at what encoding should be used to render Web page content. This guesswork can result in the wrong encoding scheme being used.
- Browsers allow users to choose a default character encoding. This choice may not match the setting for a Web page.
A Web page's character encoding is typically declared at or near the top of the page, or in the HTTP headers sent with it.
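To make the guesswork problem concrete, here is a minimal Python sketch (Python is used only for illustration; the sample word is an assumption, not taken from any real page). It encodes a short string as UTF-8 and then decodes the same bytes twice, once with the right encoding and once with a wrong guess of ISO 8859-1.

```python
# The same bytes, interpreted under two different character encodings.
text = "caf\u00e9"                       # "café"; the last letter is U+00E9
data = text.encode("utf-8")              # the bytes a UTF-8 page would send

as_utf8 = data.decode("utf-8")           # browser guesses correctly
as_latin1 = data.decode("iso-8859-1")    # browser guesses ISO 8859-1 instead

print(data)        # the raw bytes; the accented letter occupies two of them
print(as_utf8)     # the original text
print(as_latin1)   # the two bytes of the accented letter become two characters
```

The bytes never change; only the encoding used to interpret them does, which is why an explicit declaration matters.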
What is available?
The character encodings supported in HTML are defined in terms of the Unicode character set. Unicode aims to cover every alphabet, with room for more than a million characters, including accented characters. Each character is assigned a unique number (a code point), and an encoding such as UTF-8 or UTF-16 determines how many bytes are used to store it. This contrasts with the ASCII encoding popular in the United States, which defines only 128 characters and fits each one in a single byte.
Here is a sampling of available character encodings:
- ISO 8859-1: This is the standard encoding of the Latin alphabet. Also known as Latin1, it covers most Western European languages.
- UTF-8 (8-bit UCS/Unicode Transformation Format): This variable-length character encoding is able to represent any character in the Unicode standard. A key feature is that its first 128 byte values and character assignments match ASCII, so UTF-8 is backwards compatible with ASCII.
- UTF-16 (16-bit Unicode Transformation Format): This is a variable-length character encoding for Unicode that is capable of encoding every Unicode character.
- US-ASCII: This is a subset of UTF-8 that covers the ASCII standard set of characters.
A full listing of character encoding options is available online, but UTF-8 is the recommended and most popular encoding scheme used today.
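The differences between the encodings listed above can be seen by counting bytes. The following Python sketch is only an illustration; the helper name `encoded_length` and the four sample characters are assumptions chosen for the example. It reports how many bytes each encoding needs for a character, or `None` when the encoding cannot represent it.

```python
# Sample characters: Latin letter, accented letter, euro sign, emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
encodings = ["ascii", "iso-8859-1", "utf-8", "utf-16-le"]


def encoded_length(char, encoding):
    """Return the number of bytes needed, or None if char is not representable."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None


for ch in samples:
    sizes = {enc: encoded_length(ch, enc) for enc in encodings}
    print(f"U+{ord(ch):04X}", sizes)
```

Only the two Unicode encodings can represent all four characters, and both use a variable number of bytes to do so.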
Choosing a character encoding
The main issue with character encoding selection is the need to use one that covers all the languages and requirements of the intended audience. The choice is critical for multilingual applications, where each language may traditionally rely on a different character encoding scheme.
When choosing a character encoding scheme, you must be aware of the characters that you will be using, along with the character encoding supported by the browser and any other applications that may be used to work with the files. The standards UTF-8 (which I stick with for my work) and US-ASCII are widely supported by browsers. You should do your research when working with standards other than these two.
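One way to check whether a candidate encoding covers the characters you plan to use is simply to try encoding a sample of your content. This is a rough Python sketch, not a complete selection procedure; the function name `encodings_that_fit` and the sample strings are assumptions made for illustration.

```python
def encodings_that_fit(text, candidates):
    """Return the candidate encodings able to represent every character in text."""
    fits = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character is outside this encoding
        fits.append(name)
    return fits


candidates = ["ascii", "iso-8859-1", "utf-8"]
print(encodings_that_fit("Hello", candidates))
print(encodings_that_fit("na\u00efve", candidates))          # contains U+00EF
print(encodings_that_fit("\u65e5\u672c\u8a9e", candidates))  # Japanese text
```

A check like this covers only the characters in the sample; browser and tool support still has to be confirmed separately.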
Using a character encoding
When accessing a Web application, a Web browser will use the following steps to determine its character encoding:
- The HTTP Content-Type header sent by the server is the default way to define character encoding. This is the preferred method, and it takes precedence over other items in this list. Here is an example of the Content-Type line sent as part of the HTTP header:
Content-Type: text/html; charset=utf-8
Web developers may specify the Content-Type header for a page via the syntax available to the developer. For example, an ASP.NET developer may use the following line:
A PHP developer may use this line:
header('Content-type: text/html; charset=utf-8');
- XHTML documents may use the XML declaration in the first line of the page to specify character encoding. Here is one example:
<?xml version="1.0" encoding="UTF-8"?>
- You can use the HTML/XHTML meta content-type element. It is placed inside the head portion of the page, with the character encoding specified by the charset parameter of its content attribute.
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
You may also declare the encoding of external CSS style sheets. This step is not necessary with CSS embedded in a page, as the page's character encoding takes care of it. You may designate the character encoding for a CSS file by adding an @charset rule as the first line of the CSS file.
In addition, the charset attribute of the link element may be used.
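The lookup order described above (HTTP header first, then the XML declaration, then the meta element) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function names `charset_from_content_type` and `declared_encoding` are hypothetical, the regular expressions handle only straightforward markup, and real browsers follow a more detailed algorithm that this sketch does not reproduce.

```python
import re


def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header value, if present."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)"?', header_value, re.IGNORECASE)
    return match.group(1).lower() if match else None


def declared_encoding(content_type, page_bytes):
    """Return the declared encoding, checking sources in the order listed above."""
    # 1. The HTTP Content-Type header takes precedence.
    if content_type:
        charset = charset_from_content_type(content_type)
        if charset:
            return charset

    # Declarations use ASCII-compatible characters, so a lossy decode of the
    # first kilobyte is enough for this simplified scan.
    head = page_bytes[:1024].decode("ascii", errors="replace")

    # 2. XML declaration on the first line of an XHTML document.
    match = re.match(r'<\?xml[^>]*encoding\s*=\s*["\']([\w.:-]+)["\']', head)
    if match:
        return match.group(1).lower()

    # 3. meta content-type element in the head of the page.
    match = re.search(r'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)', head, re.IGNORECASE)
    if match:
        return match.group(1).lower()

    return None  # nothing declared; a browser would have to guess


page = b'<html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"></head></html>'
print(declared_encoding("text/html; charset=utf-8", page))
print(declared_encoding(None, page))
```

Returning `None` at the end mirrors the situation described earlier, where an undeclared encoding leaves the browser to guess.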
Web pages have a variety of options that developers often overlook. One such feature is the character encoding, which allows you to specify the set of characters supported by a page. You can specify the character encoding of a page in numerous ways, including the HTTP header and a meta element. You should always specify the character encoding to ensure a page is properly displayed.
Do you specify the character encoding used in your applications? Do you use a standard other than UTF-8? Share your thoughts and experience with the Web Development community.
Tony Patton began his professional career as an application developer earning Java, VB, Lotus, and XML certifications to bolster his knowledge.
Tony Patton has worn many hats over his 15+ years in the IT industry while witnessing many technologies come and go. He currently focuses on .NET and Web Development while trying to grasp the many facets of supporting such technologies in a production environment on a daily basis.