Unicode notes

These are just some of my notes about Unicode.

Unicode doesn't just encode letters and numbers; it also encodes higher-level and abstract concepts found in human languages and cultures. This is because many written languages do not have a clean distinction between letters, characters, words, etc. the way English does.

So while encoding English in 8 bits is trivial, encoding hieroglyphics, Chinese, Arabic, etc. is not.

Data Representation

Whereas a "character" is the basic unit in ASCII, a ''Code Point'' is the basic unit in Unicode. Both are just numbers; the main difference between a Unicode codepoint and an ASCII character code is the range. The latter is limited to the range 0-127 (7 bits), but the former is limited to the range 0-1114111 (hex 0 to 10FFFF).

Codepoints are often stored in 32-bit integers, but only 21 bits are actually needed to cover that range, not the full 32.
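As a rough illustration, the codepoint range can be checked in Python, whose built-in ord() and chr() convert between characters and codepoints. The names MAX_CODE_POINT and bits_needed are made up for this sketch.

```python
import sys

# Highest valid Unicode codepoint (hex 10FFFF); the constant name is ours.
MAX_CODE_POINT = 0x10FFFF


def bits_needed(code_point: int) -> int:
    """Return how many bits are needed to store the given codepoint."""
    return max(1, code_point.bit_length())


print(ord("A"))                          # 65, same value as the ASCII code
print(hex(ord("\u578b")))                # 0x578b
print(sys.maxunicode == MAX_CODE_POINT)  # True on Python 3.3 and later
print(bits_needed(MAX_CODE_POINT))       # 21
```

Asking for a codepoint past the end of the range, for example chr(0x110000), raises ValueError.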

Encodings

  • UTF-8 encodes each Unicode codepoint as one to four 8-bit units
  • UTF-16 encodes each Unicode codepoint as one or two 16-bit units
  • UTF-32 encodes each Unicode codepoint as exactly one 32-bit unit

UTF-8 and UTF-16 are variable-width encodings, which means that a codepoint could span multiple units (where a "unit" is, for example, 8-bits for UTF-8, 16-bits for UTF-16). UTF-32 is fixed-width: every codepoint takes exactly one 32-bit unit.
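A small sketch of how many units a single codepoint takes in each encoding, using Python's built-in codecs. The helper name code_units is hypothetical, and the explicit little-endian codec names are used only so that no byte-order mark is added to the output.

```python
def code_units(ch: str) -> dict:
    """Count the code units needed for one character in each encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),           # 8-bit units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit units
    }


samples = [
    "A",           # U+0041
    "\u00e9",      # U+00E9, e with acute accent
    "\u578b",      # U+578B, a Chinese character
    "\U0001F600",  # U+1F600, an emoji outside the first 65,536 codepoints
]

for ch in samples:
    print(f"U+{ord(ch):04X}", code_units(ch))
```

The UTF-32 count stays at 1 for every sample, while the UTF-8 count grows from 1 to 4 and the UTF-16 count becomes 2 for the last one.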

For example, the codepoint for the Chinese character 型 is ''U+578B'' (aka hex 578B). As a decimal value, that is ''22411'', which means that it won't fit in 8-bits, but it will fit in 16. That doesn't mean that it can't be encoded in UTF-8 though:

  • as UTF-8, that character would be encoded as three 8-bit bytes one after the other (E5 9E 8B)
  • as UTF-16, that character would be encoded as a single 16-bit short
  • as UTF-32, that character would be encoded as a single 32-bit int, with the upper 16 bits all zero.
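The actual bytes for this character can be inspected with Python's str.encode. This is only a quick check; the big-endian codec names are chosen so the bytes read in the same order as the hex codepoint and no byte-order mark is emitted.

```python
ch = "\u578b"  # the character U+578B

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

print(len(utf8), utf8.hex())    # 3 e59e8b
print(len(utf16), utf16.hex())  # 2 578b
print(len(utf32), utf32.hex())  # 4 0000578b
```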

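This means that constant-time random access to the Nth codepoint is impossible for UTF-8, because the codepoints before it may each take a different number of bytes. It is possible for UTF-16 only as long as the entire string doesn't contain any codepoints above U+FFFF (that is, 2^16 or higher); otherwise it behaves like UTF-8 in this respect.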

Random access for UTF-32 is always possible, since 32 bits is more than enough to contain every single unicode codepoint in existence.
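To make the random access point concrete, here is a minimal sketch that indexes directly into encoded buffers. The helper names utf32_code_point_at and utf16_unit_at are hypothetical, and little-endian byte order is assumed throughout.

```python
import struct

# Three codepoints; the middle one (U+1F600) does not fit in 16 bits.
text = "a\U0001F600b"

utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")


def utf32_code_point_at(data: bytes, index: int) -> int:
    """Constant-time lookup: codepoint number `index` starts at byte 4 * index."""
    return struct.unpack_from("<I", data, 4 * index)[0]


def utf16_unit_at(data: bytes, index: int) -> int:
    """Return the 16-bit unit at `index`, which may be only half a codepoint."""
    return struct.unpack_from("<H", data, 2 * index)[0]


print(hex(utf32_code_point_at(utf32, 2)))  # 0x62, the letter b
print(hex(utf16_unit_at(utf16, 2)))        # 0xde00, second half of the emoji
```

With UTF-16, finding the third codepoint reliably means scanning from the start and counting two-unit codepoints along the way.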

ASCII compatibility

UTF-8 provides backwards compatibility with ASCII, because the character codes for all ASCII characters are identical to their Unicode codepoint counterparts, and UTF-8 encodes each of those codepoints as a single byte with the same value.

So a program that knows how to handle UTF-8 will be able to process ASCII without any special changes, and a program that knows nothing about Unicode can still work with UTF-8 text, as long as no codepoints above 127 (that is, 2^7 or higher) are used. Anything from 128 upwards is encoded as two or more bytes.
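A quick check of where the byte-for-byte compatibility holds and where it ends, using only Python's built-in codecs. The variable names are just for illustration.

```python
ascii_text = "Hello"

# Pure ASCII text produces identical bytes under both codecs.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True

# U+007F is the last codepoint that UTF-8 stores in a single byte.
last_single = "\u007f".encode("utf-8")
print(len(last_single))  # 1

# U+00E9 fits in 8 bits as a number, but UTF-8 needs two bytes for it.
e_acute = "\u00e9".encode("utf-8")
print(len(e_acute), e_acute.hex())  # 2 c3a9
```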

If a program that is not aware of Unicode is fed UTF-16 text, it might have trouble rendering more than 1 character. For example:

  • A 16-bit encoding of the letter A might look like 41 00 (little-endian) or 00 41 (big-endian).
  • Taking the little-endian form, when the algorithm encounters the first byte, it will see 41 and interpret it as an A from ASCII correctly.
  • When the algorithm steps forward by 8 bits (instead of 16), it will encounter 00 and incorrectly interpret that as a null terminator.
  • Any other codepoints in the string will not be reached. (In the big-endian form the very first byte is 00, so not even the A is reached.)
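The steps above can be simulated in a few lines. This sketch assumes a byte-oriented program that treats the first 00 byte as the end of the string, the way C string functions do; c_style_read is a made-up helper standing in for such a program.

```python
def c_style_read(data: bytes) -> bytes:
    """Return the bytes before the first 0x00, like a null-terminated read."""
    end = data.find(b"\x00")
    return data if end == -1 else data[:end]


utf16_le = "ABC".encode("utf-16-le")  # 41 00 42 00 43 00
utf16_be = "ABC".encode("utf-16-be")  # 00 41 00 42 00 43
utf8 = "ABC".encode("utf-8")          # 41 42 43

print(c_style_read(utf16_le))  # b'A'
print(c_style_read(utf16_be))  # b''
print(c_style_read(utf8))      # b'ABC'
```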

UCS2 vs UTF-16

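The only difference between UCS2 and UTF-16 is that UCS2 is fixed-width (always one 16-bit unit per codepoint) but UTF-16 is variable-width. So UCS2 will never be able to represent codepoints above U+FFFF (2^16 and up), but UTF-16 can via "surrogate pairs".

A surrogate pair in UTF-16 is when you use two consecutive 16-bit values to define a full codepoint, which is necessary when encoding a codepoint with a value above 65,535 (U+FFFF). The first value of the pair is taken from the range D800-DBFF and the second from DC00-DFFF, ranges that are reserved for this purpose.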
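The arithmetic behind a surrogate pair can be sketched as follows. The function names are hypothetical; the constants are the standard UTF-16 ones (subtract hex 10000, then split the remaining 20 bits into two 10-bit halves that are added to D800 and DC00).

```python
import struct


def to_surrogate_pair(code_point: int) -> tuple:
    """Split a codepoint above U+FFFF into (high, low) 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only codepoints above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000       # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original codepoint."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))                  # 0xd83d 0xde00
print(hex(from_surrogate_pair(high, low)))  # 0x1f600

# Cross-check against Python's own UTF-16 encoder.
print(struct.pack(">HH", high, low) == "\U0001F600".encode("utf-16-be"))  # True
```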

Planes

A plane in Unicode is like a unit of measurement, defined as a contiguous group of 65,536 codepoints. There are 17 planes in total.

The first plane, plane 0, is called the Basic Multilingual Plane (or BMP). This plane contains the most commonly used characters, including a lot of Asian characters.

This is convenient because it allows programs that only support UCS2 to work with almost every language in the world currently in use. While not all Chinese/Korean/Japanese characters are included in the BMP, the most common ones are.
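Since each plane holds 65,536 (2^16) codepoints, the plane number is simply the codepoint shifted right by 16 bits. A minimal sketch, with plane_of and in_bmp as made-up helper names:

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that a codepoint belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a valid Unicode codepoint")
    return code_point >> 16  # same as integer division by 65,536


def in_bmp(code_point: int) -> bool:
    """True if the codepoint lies in plane 0, the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


print(plane_of(0x578B))    # 0, a common Chinese character in the BMP
print(plane_of(0x1F600))   # 1, an emoji outside the BMP
print(plane_of(0x10FFFF))  # 16, the last of the 17 planes
print(in_bmp(ord("A")))    # True
```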