unichr() question

Martin v. Löwis martin at v.loewis.de
Wed Nov 5 14:27:59 EST 2003


"Ezequiel, Justin" <j.ezequiel at spitech.com> writes:

> I am converting XML files with entities to utf-8 using a lookup table:
> 
> ⏞	0FE37
> ⏟	0FE38
> <sc>O</sc>	1D4AA

The last one is not an XML entity reference, of course. Also, you are
not converting to UTF-8, at least not in this table: the table maps
to Unicode code points.

> I have no idea what I am doing but I sure think that I absolutely
> need it.

If you eventually need UTF-8, you might just as well create a mapping
table that translates to UTF-8.
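
A minimal sketch of such a table, assuming a tab-separated file of
"key, hex code point" lines like the one quoted above. It is written
for Python 3, where chr() accepts any code point; on a narrow Python 2
build, unichr(0x1D4AA) raises ValueError. The entity names and the
helper name load_utf8_table are hypothetical, chosen only for
illustration.

```python
# Hypothetical lookup table: key text -> hex code point, tab-separated.
# The entity names are assumed for illustration only.
TABLE = """\
&OverBrace;\t0FE37
&UnderBrace;\t0FE38
<sc>O</sc>\t1D4AA
"""


def load_utf8_table(text):
    """Map each key to the UTF-8 bytes of its code point."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, hex_cp = line.split("\t")
        # int(..., 16) parses the hex digits; chr() gives the character.
        table[key] = chr(int(hex_cp, 16)).encode("utf-8")
    return table


utf8_table = load_utf8_table(TABLE)
```

With the table built this way, the conversion step only has to
substitute byte strings and never handles a non-BMP character itself.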

> Can you explain more on non-BMP characters (and Python's
> capabilities to represent these) and how it applies (if it does) to
> my needs?

Well, the BMP (Basic Multilingual Plane) is the first 65536 code
points of Unicode, U+0000 through U+FFFF. Recent Unicode revisions
added rarely used characters beyond the first 64k; the MathML
characters got allocated there as well.

Python has traditionally used a two-byte type to represent Unicode,
so it cannot represent characters outside the BMP, at least not in
Unicode strings of length 1. If you compile Python with --enable-ucs4,
you can readily represent all these characters. If you have only
UCS-2, you need a surrogate pair (two code units) to represent each
non-BMP character; this encoding is called UTF-16.
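
The surrogate-pair arithmetic can be sketched as follows. This follows
the UTF-16 encoding rule (subtract 0x10000, then split the remaining
20 bits into two 10-bit halves); the function name to_surrogate_pair
is hypothetical and not part of Python.

```python
def to_surrogate_pair(cp):
    """Split a non-BMP code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % cp)
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> low surrogate
    return high, low


high, low = to_surrogate_pair(0x1D4AA)
print("%04X %04X" % (high, low))  # prints: D835 DCAA
```

On a UCS-2 build, the character from the lookup table would therefore
be stored as the two code units D835 and DCAA.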

If you want to learn more about UTF-16, see

http://www.wikipedia.org/wiki/UTF-16
http://www.faqs.org/rfcs/rfc2781.html

Python supports UTF-16 in the following contexts:
- encoding and decoding surrogate pairs in the UTF-8 codec
- representing surrogate pairs as a single \U unicode string
  escape sequence.
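
A short illustration of both points, written for Python 3 as an
assumption; on the Python 2 releases discussed here the literal would
be spelled u"\U0001D4AA" instead.

```python
# One \U escape names the non-BMP character U+1D4AA.
s = "\U0001D4AA"

# The UTF-8 codec encodes it as a single four-byte sequence ...
encoded = s.encode("utf-8")
print(encoded)  # prints: b'\xf0\x9d\x92\xaa'

# ... and decodes those bytes back to the same string.
decoded = encoded.decode("utf-8")
print(decoded == s)  # prints: True
```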

Other aspects of UTF-16, such as distinguishing between the length of
a string in code points and its length in code units, are not
considered.
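
To make the difference concrete, here is a small sketch that counts
both lengths. It assumes Python 3.3 or later, where len() counts code
points; on a UCS-2 build, len() would report the code-unit count
instead.

```python
s = "a\U0001D4AA"  # one BMP character plus one non-BMP character

# Python 3.3+: len() counts code points.
code_points = len(s)

# Each UTF-16 code unit is two bytes, so halve the encoded length.
# "utf-16-le" is used because it does not prepend a byte order mark.
code_units = len(s.encode("utf-16-le")) // 2

print(code_points, code_units)  # prints: 2 3
```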

Regards,
Martin



