Converting Character Encodings to UTF-8 in PHP - Fresh Blurbs by Irakli Nadareishvili

Character encodings are always “fun” to deal with. To the fortunate and naive ones amongst you who believe that UTF-8 solved all pains in the multilingual encoding arena, let me tell you: “you are lucky to be in your surreal world”.

Today I was trying to hook up a PHP application with a web service wrapped around a legacy application. Unfortunately, that web service was only capable of sending me ISO-8859-15-encoded output. Since the guy who wrote the service was in enough pain having had to script it in Lotus Notes Scripting Language, I did not dare ask him to fix the problem. Nor do I know (or want to know) enough about Lotus Notes to assume that a fix was possible at all.

So I tried fixing the problem on the PHP side. Now, PHP does have a nice function called utf8_encode that encodes ISO-8859-1 strings into UTF-8. You may say: no-brainer? Well, not quite. My input was ISO-8859-15. The bratty “5” at the end stands for some extra characters, mainly used in French and Finnish but, in my case, popping up in Turkish.
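To make the difference concrete, here is a minimal sketch of what goes wrong when ISO-8859-15 input is treated as ISO-8859-1. Byte 0xA4 is the generic currency sign in ISO-8859-1 but the euro sign in ISO-8859-15. The sketch uses iconv() rather than utf8_encode so that the assumed source charset is explicit (utf8_encode is also deprecated as of PHP 8.2). The sample string is made up for illustration.

```php
<?php
// Byte 0xA4 means U+00A4 (currency sign) in ISO-8859-1,
// but U+20AC (euro sign) in ISO-8859-15.
$raw = "10 \xA4";

// Treat the bytes as ISO-8859-1, which is the assumption utf8_encode makes.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $raw);

// Treat the bytes as ISO-8859-15, which is what the service actually sent.
$asLatin9 = iconv('ISO-8859-15', 'UTF-8', $raw);

echo bin2hex($asLatin1), "\n"; // 313020c2a4   -> "10 ¤"
echo bin2hex($asLatin9), "\n"; // 313020e282ac -> "10 €"
```

Both results are valid UTF-8, which is what makes the bug easy to miss: nothing fails, the text just contains the wrong character.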

What finally worked was encoding the tricky characters with htmlentities like this:

$htmlized = htmlentities( $rawinput, ENT_NOQUOTES, 'ISO-8859-15');

In PHP 5, you can actually decode it right away and get “clean” UTF-8 output if you do something like:

$rawinput_utf8 = html_entity_decode( $htmlized, ENT_NOQUOTES, 'UTF-8');

but it does not work in PHP 4 for multibyte encodings (e.g. UTF-8), so watch out.
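Put together, the two calls above can be sketched as one small helper. The function name latin9_to_utf8_via_entities and the sample input are hypothetical, and the scalar type declarations assume PHP 7 or later; the two calls inside are the ones shown above.

```php
<?php
// Hypothetical helper: wraps the htmlentities / html_entity_decode round trip.
function latin9_to_utf8_via_entities(string $rawinput): string
{
    // Step 1: read the bytes as ISO-8859-15 and replace every character
    // that has a named HTML entity (e.g. 0xA4 becomes "&euro;").
    $htmlized = htmlentities($rawinput, ENT_NOQUOTES, 'ISO-8859-15');

    // Step 2: decode those entities again, this time emitting UTF-8.
    return html_entity_decode($htmlized, ENT_NOQUOTES, 'UTF-8');
}

// Made-up sample: "café <b>œuvre</b> 5 €" encoded as ISO-8859-15.
$rawinput = "caf\xE9 <b>\xBDuvre</b> 5 \xA4";

$rawinput_utf8 = latin9_to_utf8_via_entities($rawinput);
echo $rawinput_utf8, "\n";
```

Markup in the input survives the round trip, because the < and > that htmlentities escapes in step 1 are unescaped again in step 2. One caveat, stated as an assumption rather than a tested fact: a character for which htmlentities finds no named entity would presumably pass through as a raw byte, so it seems worth validating the result before trusting it as UTF-8.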