Unicode is in Your Now

If this blog entry had been written 10–15 years ago, the title would have been “Unicode is in Your Future”. Luckily, the Unicode standard has been so widely adopted over the last decade that it has almost become a routine part of the development process rather than something that demands much extra effort. It is here now, and has been for some time.

However, Unicode still isn’t quite as widely understood as it needs to be, and it is often adopted as a black box that nobody can really fix when something goes wrong. So it is worth trying to put it into perspective.

You need to understand at least why Unicode should be used, and how not to make it more complicated than it is (even though it can still be quite complicated). So keep reading.

Why Unicode? Because it is currently the best (maybe only) solution to a very hard technical problem in information systems: how to represent multilingual textual content in a global, distributed data processing environment. The birth of the World Wide Web has made this a much bigger problem in software development than it used to be, but it needed solving even before the Web, and it needs solving even more now.

What is Unicode? Essentially, Unicode is a way of assigning a unique numeric code to every character used in human written communication, for both living and “dead” languages. This is very different from previous character encoding schemes, now called “legacy encodings”, which tended to reuse the same numeric codes in various contexts, resulting in near-fatal ambiguity and the need to provide out-of-band context information about the text.
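To make the ambiguity concrete, here is a small Python sketch. The byte value 0xE4 and the three legacy encodings are just examples I picked for illustration; any byte above 0x7F would show the same effect.

```python
import unicodedata

# One and the same byte value, as it might arrive from a legacy system.
raw = b"\xe4"

# Without out-of-band information about the encoding, the byte is ambiguous:
# each legacy encoding maps it to a different character.
legacy_readings = {enc: raw.decode(enc) for enc in ("latin-1", "cp1251", "iso8859-7")}

# In Unicode, each of those characters has its own unique code point and name.
code_points = {ch: f"U+{ord(ch):04X}" for ch in legacy_readings.values()}
names = {ch: unicodedata.name(ch) for ch in legacy_readings.values()}

for enc, ch in legacy_readings.items():
    print(enc, ch, code_points[ch], names[ch])
```

The same byte reads as “ä” in Latin-1, “д” in Windows-1251 and “δ” in ISO 8859-7, while their Unicode code points (U+00E4, U+0434, U+03B4) can never be confused with one another.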

Unicode also provides transformation formats, which are used to transmit Unicode-encoded text over the network and to store it in filesystems. The most widely used transformation format, UTF-8, encodes text as a sequence of single bytes, which makes byte order irrelevant and makes it feasible to treat text files just as any other file, instead of relying on separate “binary” and “text” file types and read-write modes.
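A quick way to see the byte order point is to encode the same character in UTF-8 and in the two flavors of UTF-16. This is only a sketch; the sample character “é” is an arbitrary choice of mine.

```python
ch = "\u00e9"  # é, LATIN SMALL LETTER E WITH ACUTE

# UTF-8 is defined as a byte sequence, so there is only one possible result.
utf8 = ch.encode("utf-8")

# UTF-16 works in 16-bit units, so the bytes depend on the byte order.
utf16_le = ch.encode("utf-16-le")
utf16_be = ch.encode("utf-16-be")

# Plain ASCII text is already valid UTF-8, byte for byte.
ascii_text = "plain text"
ascii_as_utf8 = ascii_text.encode("utf-8")

print(utf8, utf16_le, utf16_be, ascii_as_utf8)
```

The UTF-16 results are byte-swapped versions of each other, which is why UTF-16 data needs a byte order mark or a declared byte order, while the UTF-8 result is the same on every machine.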

How to use it? After 15 years of working in software internationalization, I have come to think that Unicode is a honking great idea, even though it has its quirks. In those years I have also encountered resistance to Unicode based on lack of understanding, lack of motivation, and just plain old carelessness. I’ve solved problems caused by lack of attention that should have been paid to character encodings, and those solutions have caused both enlightenment (as in an enthusiastic “OK, now I get it—this stuff really matters!”) and indifference (as in an annoyed “OK, can we finally ship this now?”).

By adopting just three ground rules you can successfully leverage Unicode and get more of the good stuff, with less of the bad. (There are really more than three things to care about, but you have to start somewhere.)

Rule 1: Pay attention to the length of characters. It used to be true that one character was equal to one byte, but that is not so anymore: in UTF-8 a single code point takes from one to four bytes. One visible character is not necessarily one code point either, because even many “normal” characters can be legally represented in Unicode as a base character and one or more combining characters.
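Here is what Rule 1 looks like in practice, as a minimal Python sketch. The letter “é” is my own example; the same applies to any character that has both a precomposed and a decomposed form.

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Both render as the same visible character, but the lengths differ,
# whether you count code points or UTF-8 bytes.
lengths = {
    "composed code points": len(composed),
    "composed UTF-8 bytes": len(composed.encode("utf-8")),
    "decomposed code points": len(decomposed),
    "decomposed UTF-8 bytes": len(decomposed.encode("utf-8")),
}

# A naive comparison treats the two forms as different strings.
naive_equal = composed == decomposed

# Normalizing both sides to the same form (NFC here) makes them comparable.
normalized_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(lengths, naive_equal, normalized_equal)
```

So a field that “holds 10 characters” needs a decision about what is being counted: bytes, code points, or what the user sees as characters.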

Rule 2: Stick to one transformation format, and make it UTF-8. You will know if you need the others. Only write out UTF-8, but accept “legacy” encodings as input if necessary.
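One possible shape for Rule 2 is a small helper at the input boundary. The function name to_utf8 and the cp1252 default are assumptions of mine, not a standard API, and trying UTF-8 first is only a heuristic: when you actually know the source encoding, pass it in rather than guessing.

```python
def to_utf8(raw: bytes, legacy_encoding: str = "cp1252") -> bytes:
    """Accept UTF-8 or one known legacy encoding; always return UTF-8.

    Hypothetical helper for illustration. The fallback encoding must come
    from what you know about the data source; it cannot be detected reliably.
    """
    try:
        # Strict decoding: input that is not well-formed UTF-8 raises an error.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to the legacy encoding declared for this source.
        text = raw.decode(legacy_encoding)
    return text.encode("utf-8")


legacy_input = b"caf\xe9"       # "café" in Windows-1252 / Latin-1
utf8_input = b"caf\xc3\xa9"     # "café" already in UTF-8

print(to_utf8(legacy_input), to_utf8(utf8_input))
```

Both inputs come out as the same UTF-8 bytes, so everything downstream of this boundary only ever has to deal with one encoding.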

Rule 3: Use Unicode-savvy APIs for text processing, instead of the old stdio library functions or even the early java.lang.String class. Check your language and library documentation. Don’t roll your own unless you are an OS / language / library designer.
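As a small illustration of Rule 3 in Python, compare byte-level handling with the string-level API. The sample words are my own, and str.casefold is just one example of a Unicode-aware library function.

```python
word = "na\u00efve"  # "naïve"; the ï takes two bytes in UTF-8

# Byte-oriented truncation can cut a character in half...
truncated_bytes = word.encode("utf-8")[:3]
try:
    truncated_bytes.decode("utf-8")
    byte_slice_ok = True
except UnicodeDecodeError:
    byte_slice_ok = False

# ...while slicing the string itself works in whole code points.
truncated_text = word[:3]

# Case-insensitive comparison: lower() is not enough for German ß,
# whereas casefold() applies Unicode case folding.
lower_match = "Stra\u00dfe".lower() == "STRASSE".lower()
casefold_match = "Stra\u00dfe".casefold() == "STRASSE".casefold()

print(byte_slice_ok, truncated_text, lower_match, casefold_match)
```

The byte slice fails to decode and the lower() comparison misses the match, while the Unicode-aware calls handle both cases without any custom code.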

Read up! Since Unicode is a subject big enough to fill several books, you will do yourself a favor by getting at least one of the good ones. Personally I recommend Unicode Explained by Jukka K. Korpela (O’Reilly, 2006), although you should always also refer to the standard. If you are in charge of implementing Unicode-enabled systems, get Unicode Demystified by Richard Gillam (Addison-Wesley, 2003).

We can help! If you need a technical, hands-on introduction to Unicode and its uses for the benefit of your software team, contact us and we’ll sort you out.