Character set nightmares

One of the best things about working in IT is solving complex technical problems. Sorting out such problems can prove quite challenging, and you often learn new things along the way. The satisfaction of solving a problem that at first seemed unsolvable is especially wonderful.

However, there are some problems that bring hardly any satisfaction when solved, because you know they could surface again in another project. Despite the extra knowledge you gained on previous occasions, solving them remains tough.

An example of such a knotty problem is something I came across in four projects this last month. It involves the use of special characters when displaying or entering data in a web application, often in combination with a database. Have you ever noticed, while reading an article on a website, all sorts of question marks in the text at positions where you would normally find an accented “a”, “e” or “u”, such as “à”? It looks really bad. Editing mistake? Sloppy programmer? I recently encountered this in content on hyves.nl.

Why is this always such a knotty problem? Not too long ago there were hardly any standards for the use of special characters. In Norway, for example, they use characters we don’t know over here, let alone in China and Japan. Each country has its own character set, or even several variations. With the advent of the internet, web applications had to take all these different character sets into account. There are several standards, each character set with its own ISO standard description. In the Netherlands we mostly use ISO-8859-1 or ISO-8859-15.
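To make the "several variations" concrete, here is a small Python sketch. The byte value and the helper name decode_with are my own choices for illustration; the point is only that the very same stored byte means a different character depending on which ISO-8859 variant the reader assumes.

```python
# One stored byte, read under three different ISO-8859 character sets.
raw = bytes([0xE8])


def decode_with(charsets, data):
    """Return {charset name: decoded text} for the same raw bytes (illustrative helper)."""
    return {name: data.decode(name) for name in charsets}


readings = decode_with(["iso-8859-1", "iso-8859-2", "iso-8859-15"], raw)
for name, text in readings.items():
    # Western European variants show "è", the Central European one shows "č".
    print(f"{name:12} -> {text}")
```

Nothing in the byte itself says which reading is the right one, which is why every layer has to be told.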

Basically, you expect that if an application uses standard X and you enter data according to that standard, there should be no problem. What makes it really difficult is that multiple system layers are involved in entering data in an application, and every layer needs to agree on the standard to use. A tiny difference and question marks appear on your screen!
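A minimal Python sketch of how such a tiny difference shows up. The "layers" here are hypothetical and simply stand for any two parts of a system that disagree; the sample word is my own.

```python
text = "café"

# Layer A stores the text as UTF-8; layer B assumes ISO-8859-1 when reading it back.
stored = text.encode("utf-8")
misread = stored.decode("iso-8859-1")  # two wrong characters instead of "é"

# Layer C can only hold ASCII and substitutes a question mark for anything else.
flattened = text.encode("ascii", errors="replace").decode("ascii")

# Layer D receives ISO-8859-1 bytes but insists they are UTF-8;
# the invalid byte becomes U+FFFD, which browsers often draw as a question mark.
garbled = text.encode("iso-8859-1").decode("utf-8", errors="replace")

print(misread, flattened, garbled)
```

In all three cases no data is "broken" by a bug in the usual sense; two parts of the system merely hold different assumptions about the same bytes.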

What’s more, UTF-8 has recently become the general standard, because it can encode the characters of all these character sets. As a result, many recent operating systems and applications support UTF-8, some more reliably than others. To complicate matters further, some applications use a slightly different form of UTF-8. Many legacy applications use a mix of various standards, and migrating old data to these new systems will undoubtedly bring headaches to many a programmer, application manager and system manager.
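A short Python sketch of both points. The sample words are my own, and the variant shown (UTF-8 with a leading byte order mark, which Python calls utf-8-sig) is only one possible example of a "slightly different form"; I am assuming it for illustration, not claiming it is the one behind any particular problem.

```python
# Words from very different scripts all fit in UTF-8 and survive a round trip.
samples = {"Dutch": "één", "Norwegian": "blåbærsyltetøy", "Japanese": "日本語"}
for language, word in samples.items():
    encoded = word.encode("utf-8")
    assert encoded.decode("utf-8") == word
    print(language, len(word), "characters ->", len(encoded), "bytes")

# The same character written as plain UTF-8 and as UTF-8 with a byte order mark (BOM).
plain = "é".encode("utf-8")
with_bom = "é".encode("utf-8-sig")  # three extra bytes in front

# A reader that does not expect the BOM sees an extra invisible character U+FEFF.
surprise = with_bom.decode("utf-8")
expected = with_bom.decode("utf-8-sig")
```

Both byte strings are valid UTF-8, yet a strict comparison or a naive parser on the receiving side can still trip over the difference.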

While rounding off this article, I realise there is a slim chance this content will also cause problems displaying certain characters. I am typing this in a text editor on a Powerbook, and it will be stored in a character set my Mac is comfortable with. Then I will send it in an email to the content manager of our website. My email program could well have a different notion of the character set of the sent email. When the content manager opens my email, the content is shown in a certain character set, and he will probably (after some spelling corrections) cut and paste the text into a form in our website CMS. The CMS code has, again, its own idea about which character set to use and then stores the text in the database, which also uses a specific character set. Finally, you read this article in your browser, which will also have its own opinion on the character set. Let’s just hope they all agree with each other.
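The journey described above can be imitated in a few lines of Python. This is a toy model under my own assumptions: the hop names, the function pass_through and the chosen character sets are all hypothetical, and real systems have more layers than this.

```python
def pass_through(text, hops):
    """Send text through a chain of hops (toy model).

    Each hop is (label, charset the sender writes, charset the receiver assumes).
    """
    for _label, writes_as, reads_as in hops:
        data = text.encode(writes_as, errors="replace")
        text = data.decode(reads_as, errors="replace")
    return text


# Every layer agrees on UTF-8.
agreeing = [
    ("editor -> email", "utf-8", "utf-8"),
    ("email -> CMS form", "utf-8", "utf-8"),
    ("CMS -> database", "utf-8", "utf-8"),
    ("database -> browser", "utf-8", "utf-8"),
]

# Identical chain, except the database assumes ISO-8859-1 for what the CMS sends.
one_mismatch = list(agreeing)
one_mismatch[2] = ("CMS -> database", "utf-8", "iso-8859-1")

original = "à la carte"
print(pass_through(original, agreeing))      # unchanged
print(pass_through(original, one_mismatch))  # damaged at one hop, stays damaged
```

Once a single hop has misread the bytes, every later hop faithfully passes on the damaged text, which is what makes the culprit so hard to find afterwards.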

Complicated? In the example above, I even omitted a number of layers to keep things brief.

If you ever come across question marks on a website instead of special characters (and you will!), please remember there is probably a programmer or system manager out there racking his brain over this and suffering sleepless nights.
