UTF-8 Encoding: Apache, PHP and MySQL

Encoding and programming

The characters that appear on computer screens, like any other computer data, are just a succession of 0s and 1s from the machine's point of view. The number and order of these bits is what an encoding standard defines. The more bits an encoding can use per character, the more characters it can represent.

The problems encountered when switching to UTF-8 come from the differences between that standard and the Western European ISO-8859-1 encoding. The two agree on plain ASCII, so the trouble shows up with “special” characters such as accented letters.

Despite these problems, UTF-8 can represent a far larger number of characters and can therefore handle languages with non-Latin scripts, which ISO-8859-1, with its 256 possible values, cannot.

But UTF-8 can do this only because it encodes many characters on more bits than ISO-8859-1 does (from one to four bytes per character instead of always one). This affects not only display but also string processing in code and storage in the database.

Let’s imagine that we want to know the length of this string: ‘éé’. Basically, a language will count the number of bytes in this string.

A function dedicated to this task will find 16 bits, that is two bytes, and so two characters for ISO-8859-1. The same function will find 32 bits for the UTF-8 encoding and will return a length of 4 characters if it believes it is dealing with ISO-8859-1. That is the problem.
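As a rough illustration of this count, here is a minimal PHP sketch. It assumes the mbstring extension is loaded, and the variable names are made up for the example. The strings are written as byte escapes so that the result does not depend on how the file itself is saved.

```php
<?php
// 'éé' in ISO-8859-1: one byte (0xE9) per character.
$latin1 = "\xE9\xE9";
// 'éé' in UTF-8: two bytes (0xC3 0xA9) per character.
$utf8 = "\xC3\xA9\xC3\xA9";

// strlen() counts bytes, whatever the encoding.
$latin1Bytes = strlen($latin1);
$utf8Bytes = strlen($utf8);

// mb_strlen() counts characters once it is told the encoding (mbstring assumed).
$utf8Chars = mb_strlen($utf8, 'UTF-8');

echo "ISO-8859-1 bytes: $latin1Bytes\n";
echo "UTF-8 bytes: $utf8Bytes\n";
echo "UTF-8 characters: $utf8Chars\n";
```

The byte-oriented strlen() reports 4 for the UTF-8 string, while mb_strlen() reports the 2 characters a reader would expect.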

This tutorial covers setting up a compliant environment, using it, and quickly recognizing the display problems that arise between UTF-8 and ISO-8859-1.
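To give an idea of what such display problems look like, the following sketch reproduces the two classic symptoms. It assumes the mbstring extension is available, and the variable names are illustrative only.

```php
<?php
// 'é' as UTF-8 bytes and as a single ISO-8859-1 byte.
$utf8E = "\xC3\xA9";
$latin1E = "\xE9";

// Case 1: UTF-8 bytes read as ISO-8859-1. Each byte becomes its own
// character, so 'é' is displayed as the two characters 'Ã©'.
$misreadAsLatin1 = mb_convert_encoding($utf8E, 'UTF-8', 'ISO-8859-1');

// Case 2: an ISO-8859-1 byte read as UTF-8. A lone 0xE9 is not a valid
// UTF-8 sequence, so browsers usually show a replacement character.
$validAsUtf8 = mb_check_encoding($latin1E, 'UTF-8');

echo $misreadAsLatin1, "\n";
var_dump($validAsUtf8);
```

Seeing pairs such as 'Ã©' usually means UTF-8 data is being shown as ISO-8859-1. Seeing question marks or replacement symbols usually means the reverse.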

Preparation of the environment

For everything to work properly, the entire environment must follow the same standard. Mix encodings, or forget to save the source files in the right format, and all the server configuration in the world will be of no use.

Editors and BOM

Files must be encoded and saved in UTF-8. This sounds simple, but it depends on the goodwill of the text editor.

Some editors write a Byte Order Mark (BOM) at the beginning of the file, which is unnecessary for UTF-8.

Inserting this character at the beginning of a PHP file (that is, before the opening <?php tag) can cause an error like “headers already sent”, because the BOM is sent as output before any header can be set.

We must therefore be careful not to let the editor insert such a character (Notepad does, and so does SciTE unless “UTF-8 Cookie” is selected).
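When in doubt about what an editor saved, the first three bytes of the file can be checked directly. The sketch below is one possible way to do it. The helper names has_utf8_bom and strip_utf8_bom are hypothetical, not part of PHP.

```php
<?php
// Hypothetical helper: true when the contents start with the UTF-8 BOM
// (the three bytes 0xEF 0xBB 0xBF).
function has_utf8_bom(string $contents): bool
{
    return strncmp($contents, "\xEF\xBB\xBF", 3) === 0;
}

// Hypothetical helper: return the contents without a leading BOM.
function strip_utf8_bom(string $contents): string
{
    return has_utf8_bom($contents) ? (string) substr($contents, 3) : $contents;
}

// Example with an in-memory string standing in for a saved PHP file.
$saved = "\xEF\xBB\xBF<?php echo 'hello';";
var_dump(has_utf8_bom($saved));
var_dump(has_utf8_bom(strip_utf8_bom($saved)));
```

On a real project the string would come from file_get_contents() on each source file, with the path being whatever the project uses.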

HTML code

For HTML code, specify the encoding using this tag:

<meta http-equiv="Content-type" content="text/html; charset=UTF-8"/>

Apache

Historically, Apache has worked in ISO-8859-1, so it may send its headers with that charset.

The directive to set in httpd.conf or in a .htaccess file is:

AddDefaultCharset UTF-8

Otherwise via PHP:

header('Content-type: text/html; charset=UTF-8'); 

To find out which charset an Apache server sends, look at the encoding the browser reports when it receives a page (in a menu along the lines of View -> Encoding).

Otherwise here.

The HTTP header takes precedence over the meta tag!
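The charset announced by a server can also be read programmatically. The sketch below parses a Content-Type header value. The helper name charset_from_content_type is hypothetical, and the header strings are examples only.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type
// header value. Returns null when the header carries no charset.
function charset_from_content_type(string $header): ?string
{
    if (preg_match('/charset\s*=\s*"?([^";\s]+)/i', $header, $m) === 1) {
        return strtoupper($m[1]);
    }
    return null;
}

echo charset_from_content_type('Content-Type: text/html; charset=UTF-8'), "\n";
echo charset_from_content_type('Content-Type: text/html; charset=iso-8859-1'), "\n";
var_dump(charset_from_content_type('Content-Type: text/html'));
```

Against a live server, the header lines could be obtained with PHP's get_headers() and passed to this helper. The URL to query depends on your own setup.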

MySQL

MySQL has supported UTF-8 since version 4.1. The instructions given here assume that version or later, and any serious UTF-8 development should use one.

So yes, it is possible to store Unicode data in a 3.23 database, but expect a string of 25 Cyrillic characters to be truncated in a varchar field of 40: at two bytes per character in UTF-8 it occupies 50 bytes, and that version counts bytes rather than characters (not to mention the problems with SQL string functions).
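The arithmetic behind this truncation can be reproduced in PHP. This is only a sketch: substr() stands in for a column that counts bytes, the variable names are invented for the example, and the mbstring extension is assumed.

```php
<?php
// 25 copies of the Cyrillic letter 'Ж' (U+0416), which takes two bytes
// (0xD0 0x96) in UTF-8.
$text = str_repeat("\xD0\x96", 25);

$chars = mb_strlen($text, 'UTF-8'); // characters as a reader sees them
$bytes = strlen($text);             // bytes as a byte-oriented store sees them

// A field limited to 40 bytes keeps only the first 40 bytes.
$kept = substr($text, 0, 40);
$keptChars = mb_strlen($kept, 'UTF-8');

echo "$chars characters, $bytes bytes, $keptChars characters kept\n";
```

With two bytes per character, the 40-byte limit leaves room for only 20 of the 25 characters.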

Here, all settings are applied with SQL commands rather than through compile-time options or my.cnf settings.

Example of creating a database:

CREATE DATABASE mydatabase CHARACTER SET utf8 COLLATE utf8_general_ci;

For more information, see Database Character Set and Collation in the MySQL Reference Manual.

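Note: The following is now considered better practice, because MySQL's utf8 stores at most three bytes per character and so cannot hold four-byte characters such as emoji: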

CREATE DATABASE mydatabase CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
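MySQL's utf8 character set stores at most three bytes per character, while utf8mb4 allows four. The sketch below shows one way to tell whether a string contains characters that only utf8mb4 can hold. The helper name needs_utf8mb4 is hypothetical.

```php
<?php
// Hypothetical helper: true when the text contains at least one character
// above U+FFFF, which takes four bytes in UTF-8. Input that is not valid
// UTF-8 makes preg_match() fail, and the helper then returns false.
function needs_utf8mb4(string $text): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

$accented = "\xC3\xA9";       // 'é', two bytes
$emoji = "\xF0\x9F\x98\x80";  // U+1F600, four bytes

var_dump(needs_utf8mb4($accented));
var_dump(needs_utf8mb4($emoji));
```

Accented and Cyrillic text fits in either character set. Emoji and other characters outside the Basic Multilingual Plane need utf8mb4.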