PHP and Unicode

March 1, 2014

This website had a Unicode bug, and the person who discovered it pointed it out in a comment on a post:

Who nowadays use non-Unicode strings??

The comment referred to the name field of an earlier comment, which was being displayed as:

JosÃ© DÃez

I felt that simply fixing this was not enough. It deserves an explanation of how Unicode is handled on this site and how you can avoid the same bug. After all, I've read Joel Spolsky's post, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), so I know the minimum I need to know. It's a highly recommended read.

However, even though I planned to support Unicode up front and made special considerations for it, there are still many gotchas, especially with PHP. And that's no exaggeration.

Before I dive into the bug, here's what I do to support Unicode from the browser to the database, and back.

You'll notice the header for this site contains the HTML5 meta tag which informs the browser of the encoding:

<meta charset="utf-8" />

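I know what you're thinking. The browser has to fetch the page that contains the tag before it can determine what the encoding is. It sounds circular, but it works: the tag is made up entirely of ASCII characters, which are encoded identically in UTF-8 and in most legacy encodings, so the browser can find it before committing to an encoding.

Just in case it doesn't work, I also send the encoding in the HTTP Content-Type header, which is a good catch-all.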

header('Content-Type: text/html; charset=utf-8');

All text/varchar fields in the MySQL database are explicitly set to CHARACTER SET utf8. (MySQL's utf8 stores at most three bytes per character, so characters outside the Basic Multilingual Plane, such as emoji, need utf8mb4 instead.) Collation settings for columns vary depending on how I want the database to do matching.

There's also the connection to the database, which needs an encoding of its own. This is often overlooked. You can set it with SQL (SET NAMES), but the preferred way is the following PHP function, so that the same encoding is also used by things like mysqli_real_escape_string().

mysqli_set_charset($conn,'utf8');
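As a rough sketch, the connection setup could be wrapped so that a failed charset change is never silently ignored. The function name open_utf8_connection() and its parameters are hypothetical, chosen only for illustration, and the mysqli extension is assumed to be installed.

```php
<?php
// Hypothetical helper: the name and parameters are placeholders, not part of any library.
function open_utf8_connection($host, $user, $password, $database)
{
    $conn = mysqli_connect($host, $user, $password, $database);
    if ($conn === false) {
        throw new RuntimeException('Connection failed: ' . mysqli_connect_error());
    }

    // mysqli_set_charset() returns false on failure, so check it instead of
    // assuming the connection encoding actually changed.
    if (!mysqli_set_charset($conn, 'utf8')) {
        throw new RuntimeException('Could not set charset: ' . mysqli_error($conn));
    }

    return $conn;
}
```

Calling it once at startup and passing the returned connection around keeps the encoding decision in a single place.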

With all that in place, you still have to be careful when dealing with strings, particularly with any string operation that is sensitive to encoding.

The PHP documentation has a nice big warning that is really a given when working with UTF-8, but it's comforting that they point it out.

Internally, PHP strings are byte arrays. As a result, accessing or modifying a string using array brackets is not multi-byte safe, and should only be done with strings that are in a single-byte encoding such as ISO-8859-1.
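The warning is easy to see in practice. Here is a minimal sketch comparing byte-oriented calls with their multi-byte counterparts. It assumes the mbstring extension is available, and the sample string and variable names are purely illustrative. The string is written with explicit byte escapes so the result does not depend on how the source file is saved.

```php
<?php
// "José" as UTF-8 bytes: "é" takes two bytes (0xC3 0xA9).
$name = "Jos\xC3\xA9";

// Byte-oriented functions count and cut bytes, not characters.
$byteLength = strlen($name);                    // 5
$byteSlice  = substr($name, 0, 4);              // "Jos\xC3", cut in the middle of "é"
$fourthByte = $name[3];                         // "\xC3", half a character

// The mb_* functions work on characters once they are told the encoding.
$charLength = mb_strlen($name, 'UTF-8');        // 4
$charSlice  = mb_substr($name, 0, 4, 'UTF-8');  // the full "José"

// The byte slice is no longer valid UTF-8.
$sliceIsValid = mb_check_encoding($byteSlice, 'UTF-8'); // false
```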

So, this means no using strlen(), substr(), strpos() or strcmp() without careful consideration. And no using $str[2] to get the third character. In some cases, it makes sense to configure and use the mb_* alternatives.

But what about PHP functions that deal with strings indirectly? This is what caused my bug. One of those functions is htmlentities(). The PHP documentation itself makes no bones about how messy this is:

Of course, in order to be useful, functions that operate on text may have to make some assumptions about how the string is encoded. Unfortunately, there is much variation on this matter throughout PHP’s functions.

The function htmlentities() is defined like so, with all but one parameter optional.

string htmlentities ( string $string [, int $flags = ENT_COMPAT | ENT_HTML401 [, string $encoding = 'UTF-8' [, bool $double_encode = true ]]] )

This means it's really tempting to just call it like this:

htmlentities($text);

This function depends on knowing the encoding, which is where the other parameters come into play. It should be called like this:

htmlentities($text, ENT_QUOTES, 'UTF-8');
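One way to make the explicit form the default is a small wrapper, so the encoding is stated once and not at every call site. This is only a sketch: the name escape_html() is hypothetical, and ENT_QUOTES with 'UTF-8' simply mirrors the call shown above.

```php
<?php
// Hypothetical wrapper: always passes the flags and encoding explicitly,
// so the result does not depend on the PHP version's default.
function escape_html($text)
{
    return htmlentities($text, ENT_QUOTES, 'UTF-8');
}

// "é" is written as explicit UTF-8 bytes (0xC3 0xA9).
$escaped = escape_html("Jos\xC3\xA9 <b>");
// "Jos&eacute; &lt;b&gt;"
```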

I do believe spelling out the encoding is the preferred approach. The documentation for this function does describe defaults that let you get around it, but they depend on the PHP version:

Like htmlspecialchars(), htmlentities() takes an optional third argument encoding which defines encoding used in conversion. From PHP 5.6.0, default_charset value is used as default. From PHP 5.4.0, UTF-8 is the default. PHP prior to 5.4.0, ISO-8859-1 is used as the default. Although this argument is technically optional, you are highly encouraged to specify the correct value for your code.
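Here is a minimal sketch of how a mismatched encoding can produce output like the garbled name above. It assumes the original name was "José Díez" stored as UTF-8 and that htmlentities() ran with ISO-8859-1 as its encoding, which is the default before PHP 5.4. Both are assumptions made for illustration.

```php
<?php
// Assumed original name, written as explicit UTF-8 bytes:
// "é" is 0xC3 0xA9 and "í" is 0xC3 0xAD.
$name = "Jos\xC3\xA9 D\xC3\xADez";

// Encoding stated correctly: each two-byte sequence is one character.
$correct = htmlentities($name, ENT_QUOTES, 'UTF-8');
// "Jos&eacute; D&iacute;ez"

// Encoding wrong: every byte is treated as a separate ISO-8859-1 character.
$mangled = htmlentities($name, ENT_QUOTES, 'ISO-8859-1');
// "Jos&Atilde;&copy; D&Atilde;&shy;ez"
```

The &shy; entity is a soft hyphen, which browsers normally don't render. That would explain why the second word shows up with nothing visible after the Ã.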

In summary, using the UTF-8 encoding doesn't mean you need to go out of your way to handle Unicode, but it does mean you have to be cautious with how you work with strings. Also, make sure your development and production servers are running the same version of PHP with the same settings, so that version-dependent defaults like this one show up during development and not in production.
