[Tutor] why is unichr(sys.maxunicode) blank?

Albert-Jan Roskam fomcl at yahoo.com
Sat May 18 12:01:40 CEST 2013



>> I was curious what the "high" four-byte utf-8 unicode characters look like.
>

>By the way, your sentence above reflects a misunderstanding. Unicode characters (strictly speaking, code points) are not "bytes", four or otherwise. They are abstract entities represented by a number between 0 and 1114111, or in hex, 0x10FFFF. Code points can represent characters, or parts of characters (e.g. accents, diacritics, combining characters and similar), or non-characters.



Thanks for all your replies. I knew about code points, but to represent a unicode string (code points) as a utf-8 byte string (bytes), code points 0-127 take 1 byte (of 8 bits), then 128-2047 (e.g. accented chars) take 2 bytes, and so on up to 4 bytes for the highest code points (most East Asian characters already fit in 3). But later, on Joel Spolsky's "standard" page about unicode, I read that it goes up to 6 bytes. That's what I implied when I mentioned "utf-8".
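Those byte-length boundaries are easy to check; a quick sketch in Python 3 syntax (Python 3's chr/str play the role of Python 2's unichr/unicode):

```python
# How many UTF-8 bytes each code point needs (Python 3; in Python 2
# use unichr(cp) and u"" literals instead of chr/str).
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print("U+%04X -> %d byte(s)" % (cp, len(chr(cp).encode("utf-8"))))
# U+0041 -> 1 byte(s)    (up to U+007F:   ASCII)
# U+00E9 -> 2 byte(s)    (up to U+07FF:   accented Latin, Greek, Cyrillic, ...)
# U+20AC -> 3 byte(s)    (up to U+FFFF:   the rest of the BMP, incl. East Asian)
# U+1F600 -> 4 byte(s)   (up to U+10FFFF: the supplementary planes)
```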



>Much confusion comes from conflating bytes and code points, or bytes and characters. The first step to being a Unicode wizard is to always keep them distinct in your mind. By analogy, the floating point number 23.42 is stored in memory or on disk as a bunch of bytes, but there is nothing to be gained from confusing the number 23.42 with the bytes 0xEC51B81E856B3740, which is how it is stored as a C double.
>
>Unicode code points are abstract entities, but in the real world, they have to be stored in a computer's memory, or written to disk, or transmitted over a wire, and that requires *bytes*. So there are three Unicode schemes for storing code points as bytes. These are called *encodings*. Only encodings involve bytes, so it is nonsense to talk about "four-byte" unicode characters, since it conflates the abstract Unicode character set with one of various concrete encodings.


I would admit it if otherwise, but that's what I meant ;-)



>There are three standard Unicode encodings. (These are not to be confused with the dozens of "legacy encodings", a.k.a. code pages, used prior to the Unicode standard. They do not cover the entire range of Unicode, and are not part of the Unicode standard.) These encodings are:



I always viewed a codepage as "the bunch of chars on top of ascii", e.g. cp1252 (often loosely called latin-1, though latin-1 proper is ISO 8859-1) is ascii (0-127) + another 128 characters used in Western Europe (the euro sign, Scandinavian and Mediterranean (Spanish) characters, but not Slavic ones). A certain locale implies a certain codepage (on Windows), but where does the locale category LC_CTYPE fit into this story?
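As a concrete check of that "ascii + 128 extras" picture, a small Python 3 sketch (note that cp1252 and latin-1 are not quite the same codec — they differ in the 0x80-0x9F range):

```python
# cp1252 and latin-1 (ISO 8859-1) agree with ASCII below 128, and with
# each other for most accented characters, but differ in 0x80-0x9F:
# latin-1 keeps those as control codes, while cp1252 uses them for
# printable characters such as the euro sign.
assert "é".encode("cp1252") == "é".encode("latin-1") == b"\xe9"
assert "€".encode("cp1252") == b"\x80"
try:
    "€".encode("latin-1")            # latin-1 has no euro sign at all
except UnicodeEncodeError:
    print("latin-1 cannot encode the euro sign")
```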



>
>UTF-8
>UTF-16
>UTF-32 (also sometimes known as UCS-4)
>
>plus at least one older, obsolete encoding, UCS-2.

Isn't UCS-2 the internal unicode encoding for CPython (narrow builds)? Or maybe this is a different abbreviation. I read about the Basic Multilingual Plane (BMP) and surrogate pairs and all. The author suggested that messing with surrogate pairs is a topic to dive into in case one's nail bed is being derusted. I wholeheartedly agree.
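For what it's worth, a surrogate pair is easy to see from the outside; a Python 3 sketch (Python 3.3+ no longer has narrow builds, but UTF-16 itself still uses surrogate pairs for anything outside the BMP):

```python
# A code point above U+FFFF cannot fit in one 16-bit unit, so UTF-16
# stores it as a surrogate pair: two 16-bit units, i.e. four bytes.
s = "\U0001F600"                    # U+1F600, outside the BMP
data = s.encode("utf-16-be")
print(data.hex())                   # d83dde00: the surrogate pair D83D, DE00
# A BMP character is a single 16-bit unit, i.e. two bytes:
assert len("A".encode("utf-16-be")) == 2
```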



>UTF-32 is the least common, but simplest. It simply maps every code point to four bytes. In the following, I will follow this convention:
>
>- code points are written using the standard Unicode notation, U+xxxx where the x's are hexadecimal digits;
>
>- bytes are written in hexadecimal, using a leading 0x.
>
>Code point U+0000 -> bytes 0x00000000
>Code point U+0001 -> bytes 0x00000001
>Code point U+0002 -> bytes 0x00000002
>...
>Code point U+10FFFF -> bytes 0x0010FFFF
>
>
>It is simple because the mapping is trivially simple, and uncommon because for typical English-language text, it wastes a lot of memory.
>
>The only complication is that UTF-32 depends on the endianess of your system. In the above examples I glossed over this factor. In fact, there are two common ways that bytes can be stored:
>
>- "big endian", where the most-significant (largest) byte is on the left (lowest address);
>- "little endian", where the most-significant (largest) byte is on the right.


Why is endianness relevant only for utf-32, but not for utf-8 and utf-16? Is "utf-8" a shorthand for saying "utf-8-le"?
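A quick Python 3 probe makes the question concrete (utf-16 does in fact come in two byte orders too; only utf-8's output is the same everywhere, since its code unit is a single byte):

```python
# The euro sign U+20AC in each encoding:
assert "€".encode("utf-16-be") == b"\x20\xac"     # big-endian UTF-16
assert "€".encode("utf-16-le") == b"\xac\x20"     # little-endian UTF-16
assert "€".encode("utf-8") == b"\xe2\x82\xac"     # UTF-8: one byte order only,
# because its code unit is a single byte, so there is nothing to swap.
```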



>So in a little-endian system, we have this instead:
>
>Code point U+0000 -> bytes 0x00000000
>Code point U+0001 -> bytes 0x01000000
>Code point U+0002 -> bytes 0x02000000
>...
>Code point U+10FFFF -> bytes 0xFFFF1000
>
>(Note that little-endian is not merely the reverse of big-endian. It is the order of bytes that is reversed, not the order of digits, or the order of bits within each byte.)
>
>So when you receive a bunch of bytes that you know represents text encoded using UTF-32, you can bunch the bytes in groups of four and convert them to Unicode code points. But you need to know the endianess. One way to do that is to add a Byte Order Mark at the beginning of the bytes. If you look at the first four bytes, and it looks like 0x0000FEFF, then you have big-endian UTF-32. But if it looks like 0xFFFE0000, then you have little-endian.

So each byte starts with a BOM? Or each file? I find utf-32 indeed the easiest to understand. In utf-8, how does a system "know" that a given octet of bits is to be interpreted as a single-byte character, or rather as "hold on, these eight bits are gibberish as they are right now, let's check what happens if we add the next eight bits", in other words a multibyte char (forgive me the naive phrasing ;-). Why I mention this in the context of BOMs: why aren't they needed to indicate "multibyte char ahead!"?
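The leading bits of each utf-8 byte can be inspected directly; a Python 3 sketch:

```python
# In UTF-8 the first byte of each character announces its length:
# 0xxxxxxx = 1 byte, 110xxxxx = 2 bytes, 1110xxxx = 3, 11110xxx = 4,
# and every continuation byte has the form 10xxxxxx -- so no BOM or
# marker is needed to spot a multibyte character.
for ch in ("A", "é", "€"):
    print(ch, [format(b, "08b") for b in ch.encode("utf-8")])
# A ['01000001']
# é ['11000011', '10101001']
# € ['11100010', '10000010', '10101100']
```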



>So that's UTF-32. UTF-16 is a little more complicated.
>
>UTF-16 divides the Unicode range into two groups:
>
>* The first (approximately) 65000 code points which are represented as two bytes;
>
>* Everything else, which are represented as a pair of double bytes, so-called "surrogate pairs".


Just as I thought I was starting to understand it.... Sorry. len(unichr(63000).encode("utf-8")) returns 3, i.e. three bytes.
What should I do to arrive at two? Something like len(unichr(63000).encode("<internal unicode encoding that Python 2.7 uses>"))?
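A Python 3 sketch of what I mean (chr here plays the role of Python 2's unichr); code point 63000 is inside the BMP, so an explicit-endian utf-16 codec does give two bytes, while the plain "utf-16" codec prepends a two-byte BOM:

```python
assert len(chr(63000).encode("utf-8")) == 3       # 3 bytes: 63000 > 0x7FF
assert len(chr(63000).encode("utf-16-be")) == 2   # 2 bytes: inside the BMP
assert len(chr(63000).encode("utf-16")) == 4      # 2 bytes + 2-byte BOM
```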


>Last but not least, we have UTF-8. UTF-8 is slowly becoming the standard for storing Unicode on disk, because it is very compact for common English-language text, backwards-compatible with ASCII text files, and doesn't require a BOM. (Although Microsoft software sometimes adds a UTF-8 signature at the start of files, namely the three bytes 0xEFBBBF.)

Ah, ok, this answers one of my questions above.
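That signature is just the BOM character U+FEFF encoded as utf-8; a Python 3 check (the "utf-8-sig" codec name is Python's own, handling the signature automatically):

```python
# The Microsoft "UTF-8 signature" is U+FEFF (the BOM) encoded in UTF-8:
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"
# Python's utf-8-sig codec writes the signature on encode and strips
# it again on decode:
assert "hi".encode("utf-8-sig") == b"\xef\xbb\xbfhi"
assert b"\xef\xbb\xbfhi".decode("utf-8-sig") == "hi"
```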


Thanks again, all, it is much appreciated!



More information about the Tutor mailing list