{"id":1221,"date":"2015-04-16T11:52:11","date_gmt":"2015-04-16T11:52:11","guid":{"rendered":"http:\/\/blog.soton.ac.uk\/webteam\/?p=1221"},"modified":"2015-04-16T19:30:03","modified_gmt":"2015-04-16T19:30:03","slug":"character-set-hacking","status":"publish","type":"post","link":"https:\/\/blog.soton.ac.uk\/webteam\/2015\/04\/16\/character-set-hacking\/","title":{"rendered":"Character set hacking"},"content":{"rendered":"<p>This little note is mainly for my own benefit. Dealing with character sets can be a bit of a headache. If you didn&#8217;t understand that statement then begin by <a href=\"http:\/\/www.joelonsoftware.com\/articles\/Unicode.html\">reading Joel on Software&#8217;s beginner&#8217;s guide<\/a> right now! You&#8217;re going to need to know all this one day. All programming languages I have ever used have utf-8 compatibility. Java is probably the best and python2 is utterly shocking. This is just one of the reasons python is not, as some people have claimed, &#8220;the perfect language&#8221;.<\/p>\n<p>Character encoding problems mainly fall into two buckets:<\/p>\n<ol>\n<li>You didn&#8217;t know what character encoding you received (or worse, you were told what it was but that was a lie)<\/li>\n<li>You are saying your output is a different character set than it is.<\/li>\n<\/ol>\n<p>Usually option 1 results in option 2 because you take input from the user, store it incorrectly and then output it incorrectly. It doesn&#8217;t take long to get into a real mess.<\/p>\n<p>There is a command line tool called iconv which can be very helpful for solving these problems. Most languages have an iconv library which you can make use of. Most people&#8217;s go-to solution for this problem is to clean up the data on input by converting it to utf-8, store it as utf-8 and then (you guessed it) output it as utf-8.<\/p>\n<p>You can get whatever crap the user sent you into utf-8 using iconv. The thing you probably need to do is detect the character set as best you can. 
In PHP that looks like this:<\/p>\n<p>$current_encoding = mb_detect_encoding($text, 'auto');<br \/>\n$utf8_text = iconv($current_encoding, 'UTF-8', $text);<\/p>\n<p>Bear in mind that mb_detect_encoding() returns false when it cannot work out the encoding, so check for that before passing the result to iconv.<\/p>\n<p>There will be some situations where the person getting data from you won&#8217;t be able to accept utf-8. This provides a superb headache if you are storing all your data as utf-8 because you will more than likely have a series of characters they are unable to read. iconv comes to the rescue again: it can convert into smaller character sets. You probably don&#8217;t want to throw away the data that doesn&#8217;t fit in the output character set so use &#8220;transliterate&#8221; (the \/\/TRANSLIT suffix on the target character set) to replace characters that don&#8217;t fit with a near alternative. This means nonsense like right-quote and left-quote gets turned into ' (quote) and other similar substitutes. It&#8217;s not perfect but it&#8217;s better than just throwing letters away in the middle of words for all the <span class=\"st\">\u00c1\u00e1 \u1eae\u1eaf \u1ea4\u1ea5 \u01fa\u01fb \u0106\u0107 \u1e08\u1e09 <em>\u00c9\u00e9 \u1ebe\u1ebf<\/em> \u01f4\u01f5 \u00cd\u00ed \u1e2e\u1e2f \u1e30\u1e31 \u0139\u013a \u1e3e\u1e3f \u0143<wbr \/>\u0144<\/span> of the world.<\/p>\n<p>If the person on the other end doesn\u2019t know what they actually need (frustratingly common) then you probably want to give them ASCII. In PHP that looks like this:<\/p>\n<p>$ascii_output = iconv('UTF-8', 'ASCII\/\/TRANSLIT', $output);<\/p>\n<p>For more information on the PHP iconv library see the docs at <a href=\"http:\/\/php.net\/manual\/en\/function.iconv.php\">http:\/\/php.net\/manual\/en\/function.iconv.php<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This little note is mainly for my own benefit. Dealing with character sets can be a bit of a headache. If you didn&#8217;t understand that statement then begin by reading Joel on Software&#8217;s beginner&#8217;s guide right now! You&#8217;re going to need to know all this one day. 
All programming languages I have ever used have utf-8 compatibility. [&hellip;]<\/p>\n","protected":false},"author":21,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[198,352,138,364],"tags":[],"class_list":["post-1221","post","type-post","status-publish","format-standard","hentry","category-best-practice","category-data","category-php","category-programming"],"_links":{"self":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts\/1221","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/comments?post=1221"}],"version-history":[{"count":5,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts\/1221\/revisions"}],"predecessor-version":[{"id":1226,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts\/1221\/revisions\/1226"}],"wp:attachment":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/media?parent=1221"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/categories?post=1221"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/tags?post=1221"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}