Character set hacking

This little note is mainly for my own benefit. Dealing with character sets can be a bit of a headache. If you didn’t understand that statement then begin by reading Joel on Software’s beginner’s guide right now! You’re going to need to know all this one day. Every programming language I have ever used has utf-8 support. Java is probably the best and Python 2 is utterly shocking. This is just one of the reasons Python is not, as some people have claimed, “the perfect language”.

Character encoding problems mainly fall into two buckets:

  1. You didn’t know what character encoding you received (or worse, you were told what it was but that was a lie).
  2. You declare your output to be in a different character set than it actually is.

Usually problem 1 leads to problem 2 because you take input from the user, store it incorrectly and then output it incorrectly. It doesn’t take long to get into a real mess.
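To make that concrete, here is a small sketch of how a wrong guess on input turns into garbage on output. It assumes the iconv extension is available, and the variable names are just for illustration.

```php
<?php
// "é" encoded as utf-8 is the two bytes 0xC3 0xA9.
$utf8 = "\xC3\xA9";

// Problem 1: we wrongly assume the input is ISO-8859-1 and "convert" it to utf-8.
// Each input byte is treated as its own character and re-encoded.
$stored = iconv('ISO-8859-1', 'UTF-8', $utf8);

// Problem 2: we output $stored labelled as utf-8, and the reader sees "Ã©".
echo bin2hex($utf8), "\n";   // c3a9
echo bin2hex($stored), "\n"; // c383c2a9
```

The data is now double-encoded, and every further round trip through the same wrong assumption makes it worse.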

There is a command line tool called iconv which can be very helpful for solving these problems. Most languages have an iconv library which you can make use of. Most people’s go-to solution for this problem is to clean up the data on input into utf-8, store it as utf-8 and then (you guessed it) output it as utf-8.

You can get whatever crap the user sent you into utf-8 using iconv. The thing you probably need to do first is detect the character set as best you can. In PHP that looks like this:

// mb_detect_encoding() returns false when no candidate encoding matches.
$current_encoding = mb_detect_encoding($text, 'auto');
$utf8_text = iconv($current_encoding, 'UTF-8', $text);
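Detection is a guess, so it is worth handling the case where it fails. Below is a slightly more defensive sketch of the same idea. The helper name to_utf8 and the fallback candidate list are my own assumptions, not anything the libraries prescribe; pick candidates that match the data you actually expect to receive.

```php
<?php
// Hypothetical helper: best-effort conversion of arbitrary input to utf-8.
// $fallbacks is an assumed list of encodings to try when the input is not utf-8.
function to_utf8(string $text, array $fallbacks = ['ISO-8859-1']): string
{
    // Input that is already valid utf-8 is passed through untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Strict mode rejects encodings in which $text is not valid.
    $encoding = mb_detect_encoding($text, $fallbacks, true);
    if ($encoding === false) {
        throw new InvalidArgumentException('Could not detect the input encoding');
    }

    $converted = iconv($encoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("iconv could not convert from $encoding");
    }
    return $converted;
}

// "\xE9" is "é" in ISO-8859-1 and is not valid utf-8 on its own.
echo bin2hex(to_utf8("\xE9")), "\n"; // c3a9
```

Checking for valid utf-8 first avoids the double-encoding mess, since text that is already utf-8 is never run through a conversion.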

There will be some situations where the person getting data from you won’t be able to accept utf-8. This provides a superb headache if you are storing all your data as utf-8 because you will more than likely have a series of characters they are unable to read. iconv comes to the rescue again: it can convert into smaller character sets. You probably don’t want to throw away the data that doesn’t fit in the output character set, so use “transliterate” to substitute characters in your data with a near alternative. This means nonsense like right-quote and left-quote gets turned into ' (a plain quote), along with other similar substitutes. It’s not perfect but it’s better than just throwing letters away in the middle of words for all the Áá Ắắ Ấấ Ǻǻ Ćć Ḉḉ Éé Ếế Ǵǵ Íí Ḯḯ Ḱḱ Ĺĺ Ḿḿ Ńń of the world.

If the person on the other end doesn’t know what they actually need (frustratingly common) then you probably want to give them ASCII. In PHP that looks like this:

$ascii_output = iconv('UTF-8', 'ASCII//TRANSLIT', $output);
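One thing to watch out for: what //TRANSLIT actually produces depends on the iconv implementation underneath PHP and, on some systems, on the current locale, so the same input can give different substitutes on different machines. Here is a rough sketch that falls back to dropping characters when transliteration is not available. The helper name to_ascii is hypothetical, and I have deliberately left out the expected output because it varies by system.

```php
<?php
// Hypothetical helper: best-effort ASCII output from utf-8 input.
function to_ascii(string $utf8): string
{
    // Try to substitute near alternatives first. Errors are suppressed here
    // because failure is handled explicitly below.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
    if ($ascii === false) {
        // Fall back to dropping anything that has no ASCII equivalent.
        $ascii = @iconv('UTF-8', 'ASCII//IGNORE', $utf8);
    }
    if ($ascii === false) {
        throw new RuntimeException('iconv could not produce ASCII output');
    }
    return $ascii;
}

// A right single quote (U+2019) and an e-acute (U+00E9) in otherwise plain text.
echo to_ascii("It\u{2019}s a caf\u{E9}"), "\n";
```

If you need the same output everywhere, test on the system you deploy to rather than trusting what your laptop does.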

For more information on the PHP iconv library, see the docs at

Posted in Best Practice, PHP, Programming.
