Ascii to Unicode.

Thu Jul 29 19:49:40 EDT 2010

On Thu, 29 Jul 2010 11:14:24 -0700, Ethan Furman wrote:

> Don't think of unicode as a byte stream.  It's a bunch of numbers that
> map to a bunch of symbols.

Not only are Unicode strings a bunch of numbers ("code points", in 
Unicode terminology), but once stored, the numbers are not necessarily 
all the same width.

The full Unicode system allows for 1,114,112 code points, far more than 
will fit in two bytes. The Basic Multilingual Plane (BMP) includes the 
first 2**16 (65536) of those, code points U+0000 through U+FFFF; there 
are a further 16 supplementary planes of 2**16 code points each, 
covering U+10000 through U+10FFFF.
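
As a rough sketch of that arithmetic (the names PLANE_SIZE, NUM_PLANES 
and plane_of are mine, purely for illustration):

```python
# Sketch: the plane layout described above, as plain arithmetic.
PLANE_SIZE = 2**16      # code points per plane
NUM_PLANES = 17         # the BMP plus 16 supplementary planes

total_code_points = NUM_PLANES * PLANE_SIZE
max_code_point = total_code_points - 1

def plane_of(code_point):
    """Return the plane number (0 = BMP) for a code point."""
    if not 0 <= code_point <= max_code_point:
        raise ValueError("not a Unicode code point: %r" % (code_point,))
    return code_point // PLANE_SIZE

print(total_code_points)      # 1114112
print(hex(max_code_point))    # 0x10ffff
print(plane_of(0xFFFF))       # 0, the last code point of the BMP
print(plane_of(0x40000))      # 4, a supplementary plane
```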

As I understand it (and I welcome corrections), some implementations of 
Unicode only support the BMP and use a fixed-width representation of 
16-bit characters for efficiency reasons. Supporting the entire range of 
code points would require either a fixed width of 21 bits (which would 
then probably be padded to four bytes), or a more complex variable-width 
implementation.
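
To make the trade-off concrete, here is a small sketch comparing a 
fixed-width encoding (UTF-32) with a variable-width one (UTF-16). The 
helper encoded_size is hypothetical, just for this illustration, and I 
use the "-le" codec variants so that no byte order mark is counted:

```python
# Sketch: bits needed for the largest code point, and bytes per code
# point under a fixed-width and a variable-width encoding.
MAX_CODE_POINT = 0x10FFFF
bits_needed = MAX_CODE_POINT.bit_length()

def encoded_size(code_point, encoding):
    """Number of bytes one code point occupies in the given encoding."""
    return len(chr(code_point).encode(encoding))

print(bits_needed)                          # 21

# UTF-32 is fixed width: four bytes everywhere.
print(encoded_size(0x41, "utf-32-le"))      # 4
print(encoded_size(0x40000, "utf-32-le"))   # 4

# UTF-16 is variable width: two bytes in the BMP, four outside it.
print(encoded_size(0x41, "utf-16-le"))      # 2
print(encoded_size(0x40000, "utf-16-le"))   # 4
```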

It looks to me like Python uses a 16-bit implementation internally, which 
leads to some rather unintuitive results for code points in the 
supplementary planes...

>>> c = chr(2**18)
>>> c
'\U00040000'
>>> len(c)
2
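
If that 16-bit implementation is UTF-16 (an assumption on my part), the 
length of 2 would be the two 16-bit "surrogate" code units that stand in 
for the single code point. A minimal sketch of the standard UTF-16 
calculation follows; the function name to_surrogates is made up for 
illustration:

```python
# Sketch: split a supplementary-plane code point into the two 16-bit
# code units (a surrogate pair) that UTF-16 uses to represent it.
def to_surrogates(code_point):
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000       # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

high, low = to_surrogates(2**18)
print(hex(high), hex(low))              # 0xd8c0 0xdc00

# Cross-check against Python's own UTF-16 codec (big-endian, no BOM).
encoded = chr(2**18).encode("utf-16-be")
print(encoded == bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF]))
```

On an interpreter that stores each code point in a wider unit, I would 
expect len(c) to report 1 instead.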

-- 
Steven