Unicode

2023-10-17

You never have to think about Unicode until you do.

I recently read this blog post: The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!). It’s a nod to a similarly titled 2003 post by Joel Spolsky. And if you are a Go programmer, you may have encountered this blog post, which explains how Go handles strings (it also references the Spolsky post).

I highly recommend all three - the third even if you aren’t a Go programmer.

The premise is that we (software engineers) often take strings for granted, but there are fundamentals to text encoding that are worth grokking. The majority of the time you don’t need to deal with these details. But, like many topics, eventually the edge cases come and demand a deeper understanding.

Key Take-Aways

Unicode

Unicode is a standard that maps all the characters in the world to “code points”.

Code points are just numbers, usually written as hexadecimal digits with a U+ prefix, e.g.:

Name	Character	Code Point	Decimal Value
Latin Capital Letter A	A	U+0041	65
Rightwards arrow	→	U+2192	8594
Dog face	🐶	U+1F436	128054

Unicode is an evolving standard. Its codespace covers U+0000 to U+10FFFF, about 1.1M code points, of which currently only ~15% are defined. The Unicode Consortium periodically releases new versions which include new character mappings.

Encodings

Code points are somewhat theoretical: they just map characters to numbers. But there are different ways for computers to store these values in memory.

Character Encodings are the concrete implementations for storing strings in memory.

UTF-8 is the most popular encoding (98% of web pages according to the post). UTF-8 is a variable-length encoding, so characters are represented by sequences of 1 to 4 bytes depending on their value.

First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
U+0000	U+007F	0xxxxxxx	-	-	-
U+0080	U+07FF	110xxxxx	10xxxxxx	-	-
U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx	-
U+10000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

The first 128 code points (U+0000 - U+007F) are stored in 1-byte and are equivalent to ASCII, making UTF-8 backward compatible with ASCII. Higher code points use 2, 3, or 4 bytes according to the table above. The byte prefixes indicate whether a byte is the leading, or non-leading, byte in a 1-, 2-, 3- or 4-byte sequence.

UTF-16 is another encoding. Despite being less popular than UTF-8, UTF-16 is important for web developers because it’s what JavaScript uses.

UTF-16 encodes strings as sequences of 16-bit “code units”. Because some code points are larger than 16-bits, UTF-16 uses a technique called “surrogate pairs”, which, as the name suggests, are pairs of 16-bit code units that represent a single character. Thus UTF-8 and UTF-16 are both variable length, but the implementation is different.

To illustrate with some JavaScript:

const message = 'hi 😍';
message.length; // => 5
message.split(''); // => [ 'h', 'i', ' ', '\ud83d', '\ude0d' ]

// iterate over UTF-16 code units
for (let i = 0; i < message.length; i++) {
  console.log(`${message[i]} ${message.charCodeAt(i).toString(16)}`);
}
// Output:
// h 68
// i 69
//   20
// � d83d
// � de0d

// iterate over Unicode code points
for (const char of message) {
  console.log(`${char} ${char.codePointAt(0).toString(16)}`);
}
// Output:
// h 68
// i 69
//   20
// 😍 1f60d

This shows a few things:

h, i, and [space] correspond to single UTF-16 code units, \u0068, \u0069, and \u0020 respectively
the 😍 emoji is comprised of a surrogate pair: \ud83d\ude0d; each value alone is invalid and prints as �, but together they are interpreted correctly as the emoji
the length of a string in JS corresponds to the number of UTF-16 code units, NOT Unicode code points, and NOT bytes
when you .split('') a string, or access its indexes, it uses the UTF-16 code points
on the other hand, when you iterate over a string with for...of, or anything that uses the iterable protocol, each loop yields the full unicode character, which could be multiple UTF-16 code units in length

We can also decode and encode UTF-8 in JS:

const bytes = new Uint8Array([
  0x68, // h
  0x69, // i
  0x20, // [space]
  0xf0, 0x9f, 0x98, 0x8d, // 😍
]);

const string = new TextDecoder('utf-8').decode(bytes); // => "hi 😍"
string.length; // => 5

console.log(new TextEncoder().encode('😍')); // => Uint8Array(4) [ 240, 159, 152, 141 ]

Note that the emoji is represented by 4-bytes in UTF-8, rather than the surrogate pair in UTF-16, but in both encodings it would occupy 32 bits. The Latin characters have the same numeric values, but UTF-16 is using 16 bits per character, making UTF-8 more space-efficient.

Character	Code Point	UTF-8 Hex	UTF-8 Binary	UTF-16 Hex	UTF-16 Binary
h	U+0068	68	01101000	00 68	00000000 01101000
i	U+0069	69	01101001	00 69	00000000 01101001
[space]	U+0020	20	00100000	00 20	00000000 00100000
😍	U+1F60D	F0 9F 98 8D	11110000 10011111 10011000 10001101	D8 3D DE 0D	11011000 00111101 11011110 00001101

Extended Grapheme Clusters

There is one more layer to this onion, and that is the “grapheme cluster”. Grapheme clusters are certain sequences of Unicode code points that should be displayed as one visual unit.

Common examples are modified emoji, like thumbs up with a skin tone: 👍🏽.

This symbol is actually comprised of two code points: Thumbs Up (U+1F44D) and a skin tone modifer (U+1F3FD). In UTF-16, each of those code points is comprised of a surrogate pair. In UTF-8, each would be a 4-byte value.

const grapheme = '👍🏽';
grapheme.length; // => 4
[...grapheme]; // => [ '👍', '🏽' ]
for (const char of grapheme) {
  console.log(`${char} ${char.codePointAt(0).toString(16)}`);
}
// Output:
//  👍 1f44d
// 🏽 1f3fd

In Summary

As a programmer, knowing how encodings work is useful when you are slicing, dicing, counting, splitting, or in any way manipulating or parsing strings. A mental model of Unicode and character encoding is prerequisite to understanding these operations.

As shown in the JS examples above, some functions operate on UTF-16 code units (String.length, indexes), while others operate on Unicode code points (iterables such as for...of loops and ... spread). In other languages string methods may operate on bytes.

Handling grapheme clusters is a layer above the encoding, and often times languages don’t have a built-in way to work with them. The JavaScript MDN docs simply concede:

Iterating through grapheme clusters will require some custom code.
MDN

��

Brian's Website

Unicode

Key Take-Aways

Unicode

Encodings

Extended Grapheme Clusters

In Summary