A few questiosn about encoding

Wed Jun 12 16:30:23 EDT 2013

On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:

> So, how many bytes does UTF-8 stored for codepoints > 127 ?

U+0000..U+007F  1 byte
U+0080..U+07FF  2 bytes
U+0800..U+FFFF 	3 bytes
>=U+10000       4 bytes

So, 1 byte for ASCII, 2 bytes for other Latin characters, Greek, Cyrillic,
Arabic, and Hebrew, 3 bytes for Chinese/Japanese/Korean, 4 bytes for dead
languages and mathematical symbols.

The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a total
of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than 20 bits).