[XML-SIG] Re: Issues with Unicode type

Martin v. Loewis martin@v.loewis.de
25 Sep 2002 19:47:20 +0200


Lars Marius Garshol <larsga@garshol.priv.no> writes:

> | 1. Ignore the problem. This is probably fine: nobody is using non-BMP
> |    characters right now. Most systems have serious problems displaying
> |    them, since font systems are restricted to 64k glyphs, and, in many
> |    cases, to displaying characters in the BMP only.
> 
> Actually, Windows 2000 displays non-BMP characters just fine. MSIE can
> be made to do it, Opera 6.0 does it just fine, Mozilla does not (I
> think) do it.

Can you demonstrate this? I tried it myself and failed, because:

- I have no fonts that have characters outside the BMP,
- TrueType is limited to 64k glyphs,
- OpenType fonts that want to include non-BMP characters need
  two char-to-glyph tables, one for UCS-2, and one for UCS-4.

  Reportedly, W2k will only use the UCS-2 table in a font that
  contains non-BMP characters, so I somewhat doubt your statement. WXP
  reportedly does support such fonts - but I have none.

- charmap.exe cannot display characters outside the BMP.

> Also, there are locales where non-BMP characters are essential.
> Cantonese is probably the best example. You can't write the Cantonese
> equivalent of the "-ing" ending in Cantonese with the BMP...

W2k/WXP support GB18030 with a special support package, but the font
included (SimSun18030 aka NSimSun) does *not* support CJK
Extension B, only CJK Extension A.
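
To make the distinction concrete, here is a minimal sketch (written for
a present-day Python 3, not the Python of this thread) of why a
UCS-2-only lookup cannot reach Extension B. The helper names
utf16_code_units and is_bmp are made up for illustration; U+3400 and
U+20000 are the first code points of Extension A and Extension B.

```python
def utf16_code_units(ch):
    """Return the 16-bit code units needed to encode one character in UTF-16."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_bmp(ch):
    """True if the character lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


ext_a = "\u3400"      # first code point of CJK Extension A (inside the BMP)
ext_b = "\U00020000"  # first code point of CJK Extension B (plane 2)

for ch in (ext_a, ext_b):
    units = " ".join("%04X" % u for u in utf16_code_units(ch))
    print("U+%04X  BMP=%s  UTF-16 units: %s" % (ord(ch), is_bmp(ch), units))
```

The Extension A character fits in a single 16-bit unit, while the
Extension B character needs a surrogate pair, which a table indexed by
single UCS-2 values cannot address.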

> Is the plan that Python will eventually be UCS-4 only?

It's my plan, but I don't think GvR shares it. When I first
presented a Unicode type for Python at IPC6, Guido was quite
upset about my proposal to use a 4-byte wchar_t as the underlying
type, since he considered the space wastage unacceptable.

When Fredrik and I implemented PEP 261, I had to back out my change to
make Py_UNICODE equal to wchar_t by default if wchar_t is four bytes.
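
The space concern can be illustrated with a rough sketch. It assumes a
present-day Python 3, which no longer has the narrow/wide build choice
of PEP 261, so it only simulates 2-byte and 4-byte code units by
encoding the text; storage_bytes is a made-up helper name.

```python
import sys


def storage_bytes(text):
    """Return (bytes with 2-byte code units, bytes with 4-byte code units)."""
    narrow = len(text.encode("utf-16-le"))  # 2 bytes per unit, no BOM
    wide = len(text.encode("utf-32-le"))    # 4 bytes per code point, no BOM
    return narrow, wide


narrow, wide = storage_bytes("Unicode type")
print("2-byte units: %d bytes, 4-byte units: %d bytes" % (narrow, wide))

# On a PEP 261 build, sys.maxunicode tells the two configurations apart:
# 0xFFFF for a narrow build, 0x10FFFF for a wide one.
print(hex(sys.maxunicode))
```

For text that stays within the BMP, the 4-byte representation takes
exactly twice the space of the 2-byte one.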

Regards,
Martin