Ascii to Unicode.
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Thu Jul 29 19:49:40 EDT 2010
On Thu, 29 Jul 2010 11:14:24 -0700, Ethan Furman wrote:
> Don't think of unicode as a byte stream. It's a bunch of numbers that
> map to a bunch of symbols.
Not only are Unicode strings a bunch of numbers ("code points", in
Unicode terminology), but the numbers are not necessarily all stored at
the same width.
The full Unicode system allows for 1,114,112 code points, far more than
will fit in two bytes. The Basic Multilingual Plane (BMP) holds the
first 2**16 (65536) of them, code points U+0000 through U+FFFF; there
are a further 16 supplementary planes of 2**16 code points each,
covering U+10000 through U+10FFFF.
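The plane arithmetic above is easy to check. This is only a sketch in
Python 3; the helper name plane_of is made up for illustration and is
not something the standard library provides.

```python
# Sketch only: plane_of is a hypothetical helper, not a stdlib function.
PLANE_SIZE = 2**16                            # code points per plane
NUM_PLANES = 17                               # the BMP plus 16 supplementary planes
TOTAL_CODE_POINTS = PLANE_SIZE * NUM_PLANES   # 1,114,112

def plane_of(code_point):
    """Return the plane number (0 is the BMP) for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (code_point,))
    return code_point >> 16                   # same as dividing by 2**16

print(TOTAL_CODE_POINTS)     # 1114112
print(plane_of(0xFFFF))      # 0  (last code point of the BMP)
print(plane_of(0x10000))     # 1  (first supplementary plane)
print(plane_of(0x10FFFF))    # 16 (last code point of all)
```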
As I understand it (and I welcome corrections), some implementations of
Unicode support only the BMP and store each character in a fixed-width
16-bit unit for efficiency reasons. Supporting the entire range of code
points would require either a fixed width of 21 bits (which would then
probably be padded to four bytes), or a more complex variable-width
implementation.
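To put numbers on the 21-bit figure and on the fixed-width versus
variable-width trade-off, here is a rough Python 3 sketch. The helper
encoded_sizes and the sample characters are my own choices for
illustration; the codecs themselves are the standard ones.

```python
# Sketch only: encoded_sizes is a hypothetical helper for illustration.
MAX_CODE_POINT = 0x10FFFF
print(MAX_CODE_POINT.bit_length())   # 21

def encoded_sizes(ch):
    """Return the number of bytes one character takes in each encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")   # explicit byte order, so no BOM
    return {enc: len(ch.encode(enc)) for enc in encodings}

print(encoded_sizes("A"))            # ASCII letter, in the BMP
print(encoded_sizes("\u20ac"))       # euro sign, in the BMP
print(encoded_sizes("\U00040000"))   # a supplementary-plane code point
```

UTF-32 is the fixed-width option padded to four bytes; UTF-8 and UTF-16
are variable-width, and UTF-16 needs two 16-bit units for anything
outside the BMP.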
It looks to me like Python uses a 16-bit implementation internally, which
leads to some rather unintuitive results for code points in the
supplementary planes...
>>> c = chr(2**18)
>>> c
'\U00040000'
>>> len(c)
2
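A length of 2 is what you would expect if the string is held as UTF-16,
where a supplementary code point becomes a surrogate pair of two 16-bit
units. Assuming that is what is going on, the pair can be worked out by
hand. The function name surrogate_pair is hypothetical; the arithmetic
is the standard UTF-16 rule.

```python
# Sketch only: surrogate_pair is a hypothetical helper for illustration.
def surrogate_pair(code_point):
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = surrogate_pair(2**18)
print(hex(high), hex(low))               # 0xd8c0 0xdc00

# Cross-check against the standard UTF-16 codec (big-endian, no BOM).
expected = bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])
assert chr(2**18).encode("utf-16-be") == expected
```

This does not depend on how the interpreter stores strings, so it gives
the same answer everywhere. Python 3.3 and later use a different
internal representation and report len(c) as 1.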
--
Steven