Recently on Experts Exchange there have been lots of questions relating to the initial steps that are required when creating multi-lingual web sites. As I've had a fair bit of experience (and headaches) doing this, I thought it might be helpful to detail the initial steps you need to take.
If you want to just read the essential information without the explanations, please skip to the end for a summary.
Use UTF-8 to globalize your web pages
Every character (letter, number, symbol etc) used on your computer is referred to using something called a character code. All of these different character codes are collectively referred to as a character set. The character set used by your computer depends on the Locale you specify when you install Windows.
So, if you are in the United Kingdom, you would specify English (United Kingdom) as your locale, and Windows will then install the English UK character set, meaning you can then use all of the relevant letters, numbers and symbols on your computer. Obviously there is a large similarity with most of the character sets, but some (e.g. Chinese) can be very different to others.
If you open Character Map in Windows (Start > Run > charmap) you can actually see the character code for each character. For example, the uppercase letter 'S' has a character code of 0053 (in hexadecimal).
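If you don't have Character Map to hand, the same lookup can be sketched in a few lines of Python. This is only an illustration: the helper name `char_code` is my own, and it simply formats the built-in `ord()` value the way Character Map shows it.

```python
def char_code(ch: str) -> str:
    """Return a character's code as four-digit hexadecimal, Character Map style."""
    return format(ord(ch), "04X")


print(char_code("S"))  # 0053
print(char_code("s"))  # 0073
```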
Defining Character Sets in Web Pages (Badly!)
Quite a few years ago, when the internet was still growing, most people never considered that a web page might need to be delivered to users in their own language. WYSIWYG editors such as Dreamweaver would create an empty page that generally included something like this in the header:
<meta http-equiv="Content-Language" content="en-gb">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
The first line tells the browser that the page's text is in English, using British conventions (for example, for dates and numbers). The second line tells the browser which character set it should use to display the text on the page. So, the browser maps each character code in the web page to the Windows-1252 character with the same code.
Some authors wouldn't even bother to include this data, assuming that because a page looked fine on their own computer that it would look correct everywhere else.
If you look up the Windows-1252 character set, you'll see that it doesn't contain many characters (it is a single-byte set, so it has room for at most 256). As only one character set can be defined on each page, there was no way to show English and Chinese characters on the same page. Users would often see a page of 'junk' characters instead of the correct characters from their language.
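Both problems can be reproduced in Python, as a rough sketch. The function names here are my own, and `cp1252` is Python's name for the Windows-1252 codec. The first helper checks whether some text can be stored in Windows-1252 at all; the second shows what happens when bytes written in one encoding are read back using another.

```python
def fits_in_windows_1252(text: str) -> bool:
    """Return True if every character in the text exists in Windows-1252."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


def misread_as_windows_1252(text: str) -> str:
    """Encode the text as UTF-8, then decode the bytes as if they were Windows-1252."""
    return text.encode("utf-8").decode("cp1252", errors="replace")


print(fits_in_windows_1252("Hello"))   # True
print(fits_in_windows_1252("很抱歉"))  # False

greek = "Ίσως"
junk = misread_as_windows_1252(greek)
print(junk == greek)          # False
print(len(greek), len(junk))  # 4 8
```

The four Greek letters come back as eight unrelated characters, which is the kind of 'junk' users used to see.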
Defining Character Sets in Web Pages (Correctly!)
Because of this limitation, a universal character set called Unicode was created in the early 1990s, along with UTF-8, an encoding that stores Unicode characters as bytes. Unicode aims to cover every character for every language, meaning that web pages encoded as UTF-8 are able to display multiple languages correctly. It is declared in pages as follows:
<meta http-equiv="Content-Language" content="en-gb">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
It is important to note that after specifying the above, the actual web page file must be saved as UTF-8. Most editors will do this automatically; however, older versions of Notepad save as ANSI by default, and the 'Unicode' option in its Save As dialogue actually means UTF-16, either of which can cause unexpected results.
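If your pages are generated by a script rather than typed into an editor, the same rule applies: state the encoding explicitly when writing the file. The following is a minimal sketch, assuming Python; `save_page` and `is_valid_utf8` are hypothetical helper names, and the file names are made up for the demonstration.

```python
import tempfile
from pathlib import Path

PAGE = (
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n'
    "<p>很抱歉，但我不明白你</p>\n"
)


def save_page(path: Path, markup: str, encoding: str = "utf-8") -> None:
    """Write the markup to disk using an explicit encoding."""
    path.write_text(markup, encoding=encoding)


def is_valid_utf8(path: Path) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        path.read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.html"
    bad = Path(tmp) / "bad.html"
    save_page(good, PAGE)           # matches the charset declared in the page
    save_page(bad, PAGE, "utf-16")  # declared as UTF-8 but saved as UTF-16
    print(is_valid_utf8(good))  # True
    print(is_valid_utf8(bad))   # False
```

A check like `is_valid_utf8` only proves the bytes are well-formed UTF-8; it can't tell you the text is what you intended, so treat it as a quick sanity test.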
Larger File Sizes
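The downside is that characters outside the basic English (ASCII) range take up more space. UTF-8 is a variable-length encoding: plain English letters, digits and punctuation still take one byte each, but accented, Greek and Cyrillic characters take two bytes, and most Chinese, Japanese and Korean characters take three. This means that mostly-English pages barely grow, while pages in other scripts can grow noticeably. Databases are hit harder, because Unicode columns typically use two bytes for every character, so text storage can roughly double.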
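To see the size difference for yourself, here is a small Python sketch that counts the bytes each character needs. The helper name `encoded_sizes` is my own; `utf-16-le` is used for the comparison because it has no byte-order mark to skew the count.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (bytes in UTF-8, bytes in UTF-16) for the given text."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in ["S", "é", "Ί", "很"]:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} byte(s), UTF-16 = {utf16_len} byte(s)")

# U+0053: UTF-8 = 1 byte(s), UTF-16 = 2 byte(s)
# U+00E9: UTF-8 = 2 byte(s), UTF-16 = 2 byte(s)
# U+038A: UTF-8 = 2 byte(s), UTF-16 = 2 byte(s)
# U+5F88: UTF-8 = 3 byte(s), UTF-16 = 2 byte(s)
```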
Globalizing Databases Using Unicode
The same principle applies to databases, which suffered exactly the same problem as text files. Here, columns are changed to Unicode columns. In SQL Server, this means that VARCHAR is swapped for NVARCHAR (notice the N prefix), which stores its text as UTF-16.
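The effect of the column type can be imitated in Python without a database. This is only a simulation under my own assumptions: `store_single_byte` stands in for a column backed by the Windows-1252 code page, where characters that don't fit are typically replaced with '?', and `store_utf16` stands in for a Unicode column. Neither function talks to a real database.

```python
def store_single_byte(value: str) -> str:
    """Imitate a Windows-1252 backed column: unmappable characters become '?'."""
    return value.encode("cp1252", errors="replace").decode("cp1252")


def store_utf16(value: str) -> str:
    """Imitate a Unicode column: the text round-trips through UTF-16 unchanged."""
    return value.encode("utf-16-le").decode("utf-16-le")


chinese = "很抱歉"
print(store_single_byte("Hello"))       # Hello
print(store_single_byte(chinese))       # ???
print(store_utf16(chinese) == chinese)  # True
```

The important point is that the loss is silent and permanent: once the '?' characters have been stored, the original text cannot be recovered.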
Global Fonts and Intelligent Browsers
Once your pages and database are all storing multinational characters, the next task is to display them correctly. Historically, this has been one of the biggest areas of confusion. Open Character Map again, and pick a funky font that you like to use on your pages, say Tahoma in this example, then scroll down the list of characters used in the font. You will see English, Greek, Hebrew, Latin, Arabic and Cyrillic characters included (a total of 1369 characters). So what about Chinese, Japanese, Hindi, Telugu, Tamil and so on? The answer is that the characters for these cultures do not exist within this font family.
Modern browsers will automatically detect whether the font you've chosen on your page is able to display the characters it contains. If it can't, then the browser will automatically substitute your choice of font with one that can display the respective characters. Older browsers (like IE6) cannot do this, and will display a mass of jumbled symbols. One thing to be aware of here is that the browser's choice of font may be completely different (in size and layout) to your own choice, so this might mess with your layouts a little. Here's an example:
<html>
<head>
<meta http-equiv="Content-Language" content="en-gb" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>My Multi-lingual Page</title>
</head>
<body style="font-family: Tahoma, sans-serif">
<p>He said, "Hello"</p>
<p>She replied, "很抱歉，但我不明白你"</p>
<p>Then a friend said, "Ίσως μπορούμε να χρησιμοποιήσουμε το Google Translate?"</p>
</body>
</html>
The above code will display like this:
Notice that even though the Tahoma font doesn't contain Chinese symbols, a substitute font has been used for line two. Even the size and layout have been matched automatically. Clever stuff. A few years ago this would have looked very different!
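If you want a rough idea of which scripts a page contains, and therefore which ones your chosen font needs to cover, something like the following Python sketch can help. It is a heuristic of my own, not a real font check: `scripts_used` is a hypothetical helper that takes the first word of each letter's Unicode name, which is only an approximation of its script.

```python
import unicodedata


def scripts_used(text: str) -> set[str]:
    """Guess the scripts in the text from the first word of each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


page_text = "Hello 很抱歉，但我不明白你 Ίσως μπορούμε"
print(sorted(scripts_used(page_text)))  # ['CJK', 'GREEK', 'LATIN']
```

Here the result tells you that a font such as Tahoma would handle the Latin and Greek text, while the CJK text would rely on the browser's substitute font.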
Fallback Options for Older Systems
Years ago, when supporting XP users with IE6, we always had to specify Arial Unicode MS as the page font to use. This is because this font covers a very large range of characters across most languages (it is an enormous font!). If creating pages for users who may still be using older technologies (especially in the Middle/Far East), I would recommend taking this option (simply because it always worked for us).
Conclusion & Summary
Developing multi-lingual web applications isn't too difficult anymore, providing that the initial steps have been accounted for correctly. Here are the main points summarised:
- Always define Content-Language and Content-Type within your <head> tags, and set the charset to UTF-8
- Save your web pages as UTF-8 (usually found in the Save As dialogue box) rather than ANSI or plain ASCII
- Amend any database columns to use Unicode encoding (e.g. in SQL Server, VARCHAR should become NVARCHAR)
- Specify Arial Unicode MS as the font for users with older technology, who reside in the Far/Middle East
That's it. I hope that was useful. Please feel free to leave feedback, and in the case that I'm mistaken anywhere above, constructive criticism is most welcome.