The Essentials of UTF-8: What Are 4-Byte Characters?

Mayeul
August 25, 2024
12:15 pm
3 minutes

Introduction to UTF-8 Encoding

UTF-8 (Unicode Transformation Format – 8-bit) is a widely used character encoding system that can represent every character in the Unicode character set. It is designed to be backward compatible with ASCII and to handle a wide variety of characters, symbols, and even emojis. UTF-8 is the dominant encoding on the web, and its flexibility makes it a standard choice for many applications.

How UTF-8 Works

In UTF-8 encoding, characters can be represented using one to four bytes, depending on their Unicode code point. Here’s a brief overview of how many bytes are used:

1-byte (7-bit) characters: Code points in the ASCII range (U+0000 to U+007F). They use a single byte whose most significant bit is 0 (e.g., basic ASCII characters like A, 0, !).
2-byte (11-bit) characters: Code points from U+0080 to U+07FF. The first byte begins with 110, and the second byte begins with 10 (e.g., Latin-1 Supplement, Greek, Cyrillic).
3-byte (16-bit) characters: Code points from U+0800 to U+FFFF, excluding the surrogate range U+D800 to U+DFFF, which UTF-8 does not encode. The first byte begins with 1110, and the following two bytes each begin with 10 (e.g., most CJK ideographs, Devanagari, Thai, and the euro sign €).
4-byte (21-bit) characters: These are the focus of our guide. They represent code points from U+10000 to U+10FFFF, the supplementary characters outside the Basic Multilingual Plane. The first byte begins with 11110, followed by three additional bytes that each begin with 10 (e.g., most emojis, historic scripts, musical notation).
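
The ranges above can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation: `utf8_length` is a hypothetical helper name, and the sample characters are picked only to hit each of the four ranges.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


# One sample per range: A, e-acute, euro sign, rocket emoji.
for ch in "A\u00E9\u20AC\U0001F680":
    encoded = ch.encode("utf-8")
    # Print each byte in binary so the 0 / 110 / 1110 / 11110 / 10 prefixes are visible.
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(f"U+{ord(ch):04X}  {utf8_length(ord(ch))} byte(s)  {bits}")
```

Printing the bytes in binary makes the prefixes easy to spot: the lead byte announces the sequence length, and every continuation byte starts with 10.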

What Are 4-Byte Characters?

4-byte characters in UTF-8 correspond to Unicode code points from U+10000 to U+10FFFF. These characters are outside the Basic Multilingual Plane (BMP) and include:

Emojis: Many emojis are represented as 4-byte characters.
- 🐱 (U+1F431, “CAT FACE”)
- 😎 (U+1F60E, “SMILING FACE WITH SUNGLASSES”)
- 🚀 (U+1F680, “ROCKET”)
Supplementary Characters:
- 𐍈 (U+10348, “GOTHIC LETTER HWAIR”)
- 𐎀 (U+10380, “UGARITIC LETTER ALPA”)
Ancient Script Signs:
- 𒀀 (U+12000, “CUNEIFORM SIGN A”)
- 𓀀 (U+13000, “EGYPTIAN HIEROGLYPH A001”)
Historic Scripts:
- 𑀀 (U+11000, “BRAHMI SIGN CANDRABINDU”)
- 𐤀 (U+10900, “PHOENICIAN LETTER ALF”)
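
In code, "outside the BMP" comes down to a single comparison on the code point. The sketch below assumes Python 3, where `ord` returns the full code point of a one-character string. The helper names `is_supplementary` and `plane` are made up for illustration.

```python
def is_supplementary(ch: str) -> bool:
    """True if the character lies outside the Basic Multilingual Plane."""
    return ord(ch) > 0xFFFF


def plane(ch: str) -> int:
    """Unicode plane number: 0 is the BMP, 1 to 16 are supplementary planes."""
    return ord(ch) >> 16


# CAT FACE, GOTHIC LETTER HWAIR, CUNEIFORM SIGN A, PHOENICIAN LETTER ALF,
# plus a BMP character (e-acute) for contrast.
for ch in "\U0001F431\U00010348\U00012000\U00010900\u00E9":
    print(
        f"U+{ord(ch):04X}",
        "supplementary" if is_supplementary(ch) else "BMP",
        f"plane {plane(ch)}",
        f"{len(ch.encode('utf-8'))} UTF-8 byte(s)",
    )
```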

Examples of 4-Byte UTF-8 Characters

Here are some specific examples of characters encoded as 4 bytes in UTF-8:

Emoji: 😄 (U+1F604, “SMILING FACE WITH OPEN MOUTH AND SMILING EYES”)
- UTF-8 Encoding: F0 9F 98 84
Emoji: 🌍 (U+1F30D, “EARTH GLOBE EUROPE-AFRICA”)
- UTF-8 Encoding: F0 9F 8C 8D
Musical Notation: 𝄞 (U+1D11E, “MUSICAL SYMBOL G CLEF”)
- UTF-8 Encoding: F0 9D 84 9E
Ancient Script: 𐎄 (U+10384, “UGARITIC LETTER DELTA”)
- UTF-8 Encoding: F0 90 8E 84
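
These byte sequences are easy to reproduce yourself. Here is a small sketch, assuming Python 3.8 or later (for the separator argument of `bytes.hex`). `utf8_hex` is a hypothetical helper, and `unicodedata.name` looks up the official character name in whatever Unicode database ships with your Python build.

```python
import unicodedata


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated uppercase hex."""
    return ch.encode("utf-8").hex(" ").upper()


# The four code points listed above, written as escapes to avoid font issues.
examples = ["\U0001F604", "\U0001F30D", "\U0001D11E", "\U00010384"]

for ch in examples:
    name = unicodedata.name(ch, "<no name in this Unicode database>")
    print(f"U+{ord(ch):04X}  {utf8_hex(ch)}  {name}")
```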

Why Do These Characters Use 4 Bytes?

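Characters that fall outside the BMP (Basic Multilingual Plane) need 4 bytes in UTF-8 because their Unicode code points require more than 16 bits to represent. A 3-byte sequence carries at most 16 payload bits (4 in the first byte and 6 in each of the two continuation bytes), which tops out at U+FFFF. A 4-byte sequence carries 21 payload bits (3 in the first byte and 6 in each of the three continuation bytes), which is enough to cover every code point from U+10000 up to U+10FFFF, the highest code point Unicode defines.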
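
To make the bit layout concrete, here is a hand-rolled sketch of the 4-byte case. In practice you would just call `str.encode("utf-8")`; the function name `encode_4_byte` is hypothetical and exists only to show where the 21 bits go.

```python
def encode_4_byte(code_point: int) -> bytes:
    """Encode a supplementary code point (U+10000 to U+10FFFF) by hand."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("this sketch only handles U+10000 to U+10FFFF")
    return bytes([
        0xF0 | (code_point >> 18),           # 11110xxx: top 3 bits
        0x80 | ((code_point >> 12) & 0x3F),  # 10xxxxxx: next 6 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: lowest 6 bits
    ])


# Compare the hand-built bytes with Python's built-in encoder.
smiley = 0x1F604
print(encode_4_byte(smiley).hex(" "))
print(chr(smiley).encode("utf-8").hex(" "))
```

Both lines should print the same four bytes, since the built-in encoder follows the same layout.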

Summary

4-byte characters in UTF-8 typically include emojis, rare or historic script characters, and certain symbols that are not part of the BMP. Systems that only handle up to 3 bytes per character often reject, truncate, or filter them out. A well-known example is MySQL with the utf8 charset (an alias for utf8mb3, which stores at most 3 bytes per character), as opposed to utf8mb4, which supports the full range.
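
If you are stuck with a 3-byte-only column and need to guard it from the application side, one possible approach is to detect or replace supplementary characters before writing. This is a sketch under that assumption: the function names are hypothetical, and U+FFFD (REPLACEMENT CHARACTER) as the default substitute is just one reasonable choice.

```python
def has_4_byte_chars(text: str) -> bool:
    """True if any character in the text needs 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


def strip_4_byte_chars(text: str, replacement: str = "\uFFFD") -> str:
    """Replace every character outside the BMP with a substitute string."""
    return "".join(ch if ord(ch) <= 0xFFFF else replacement for ch in text)


message = "Launch \U0001F680"
if has_4_byte_chars(message):
    # Pass replacement="" to drop the characters instead of marking them.
    message = strip_4_byte_chars(message)
print(message)
```

Filtering loses data, so where you control the schema, switching the column to utf8mb4 is usually the better fix.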

Resources

Unicode Blocks (Character Ranges)
Emoji Unicode Blocks (Character Ranges)
- 1F600-1F64F : Emoticons
- 1F300-1F5FF : Miscellaneous Symbols and Pictographs
- 1F680-1F6FF : Transport and Map Symbols
- 1F700-1F77F : Alchemical Symbols
- 1F780-1F7FF : Geometric Shapes Extended
- 1F800-1F8FF : Supplemental Arrows-C
- 1F900-1F9FF : Supplemental Symbols and Pictographs
- 1FA00-1FA6F : Chess Symbols
- 1FA70-1FAFF : Symbols and Pictographs Extended-A
- 2600-26FF : Miscellaneous Symbols (in the BMP, so 3 bytes in UTF-8)
- 2700-27BF : Dingbats (in the BMP, so 3 bytes in UTF-8)
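
The ranges listed above can be turned into a simple lookup table. This is a rough sketch: `EMOJI_BLOCKS` and `emoji_block` are hypothetical names, and the table copies only the blocks from this list.

```python
from typing import Optional

# (first code point, last code point, block name), copied from the list above.
EMOJI_BLOCKS = [
    (0x1F600, 0x1F64F, "Emoticons"),
    (0x1F300, 0x1F5FF, "Miscellaneous Symbols and Pictographs"),
    (0x1F680, 0x1F6FF, "Transport and Map Symbols"),
    (0x1F700, 0x1F77F, "Alchemical Symbols"),
    (0x1F780, 0x1F7FF, "Geometric Shapes Extended"),
    (0x1F800, 0x1F8FF, "Supplemental Arrows-C"),
    (0x1F900, 0x1F9FF, "Supplemental Symbols and Pictographs"),
    (0x1FA00, 0x1FA6F, "Chess Symbols"),
    (0x1FA70, 0x1FAFF, "Symbols and Pictographs Extended-A"),
    (0x2600, 0x26FF, "Miscellaneous Symbols"),
    (0x2700, 0x27BF, "Dingbats"),
]


def emoji_block(ch: str) -> Optional[str]:
    """Return the name of the listed block containing ch, or None."""
    code_point = ord(ch)
    for first, last, name in EMOJI_BLOCKS:
        if first <= code_point <= last:
            return name
    return None


# SMILING FACE WITH SUNGLASSES, ROCKET, BLACK SUN WITH RAYS (BMP), and plain "A".
for ch in "\U0001F60E\U0001F680\u2600A":
    print(f"U+{ord(ch):04X}", emoji_block(ch), len(ch.encode("utf-8")), "byte(s)")
```

Keep in mind that block membership is only an approximation. Not every character in these blocks is an emoji, and some emojis live in other blocks, so treat this as a quick filter rather than a definitive emoji test.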