The absolute minimum every software developer must know about Unicode in 2023

Last night I stumbled across this wonderful post about Unicode that I wanted to share with you guys.

It contains a lot of curious information I did not know about Unicode, and I believe it may be useful (up to a point) for those of you who struggle to understand Unicode and UTF.

TL;DR

To sum it up:

  • Unicode has won.
  • UTF-8 is the most popular encoding for data in transfer and at rest.
  • UTF-16 is still sometimes used as an in-memory representation.
  • The two most important views for strings are bytes (allocate memory/copy/encode/decode) and extended grapheme clusters (all semantic operations).
  • Using code points for iterating over a string is wrong. They are not the basic unit of writing. One grapheme could consist of multiple code points.
  • To detect grapheme boundaries, you need Unicode tables.
  • Use a Unicode library for everything Unicode, even boring stuff like strlen, indexOf and substring.
  • Unicode updates every year, and rules sometimes change.
  • Unicode strings need to be normalized before they can be compared.
  • Unicode depends on locale for some operations and for rendering.
  • All this is important even for pure English text.
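
To make a few of these points concrete, here is a small Python sketch of my own (it is not from the linked post). It uses only the standard library and shows that the same text has different lengths depending on whether you count UTF-8 bytes, UTF-16 code units, or code points, and that two strings which look identical can compare as different until both are normalized. The helper names describe and same_text are made up for illustration. As far as I know, Python's standard library does not segment extended grapheme clusters, so that part is left out; you would need a Unicode library for it.

```python
import unicodedata

# "é" written two ways: one precomposed code point,
# or a plain "e" followed by a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"

# Facepalm emoji with skin tone and gender: one visible symbol,
# built from five code points joined by a zero-width joiner.
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"


def describe(s: str) -> dict:
    """Return the size of a string in three different views (hypothetical helper)."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "code_points": len(s),  # Python's len() counts code points
    }


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(describe(composed))    # {'utf8_bytes': 2, 'utf16_code_units': 1, 'code_points': 1}
print(describe(decomposed))  # {'utf8_bytes': 3, 'utf16_code_units': 2, 'code_points': 2}
print(describe(facepalm))    # {'utf8_bytes': 17, 'utf16_code_units': 7, 'code_points': 5}

print(composed == decomposed)           # False: different code points
print(same_text(composed, decomposed))  # True: equal once normalized
```

None of the three numbers printed for the emoji is 1, which is what a user would call its length. That is why the summary says to use a Unicode library for grapheme boundaries instead of counting bytes or code points.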

Overall, yes, Unicode is not perfect, but the fact that

  • an encoding exists that covers all possible languages at once,
  • the entire world agrees to use it,
  • we can completely forget about encodings and conversions and all that stuff

is a miracle. Send this to your fellow programmers so they can learn about it, too.