[Python-Dev] New Py_UNICODE doc

Mon May 9 06:59:59 CEST 2005

Nicholas Bastin wrote:
>> Changing the documentation that goes along with the option
>> would be fine.
>
> That is exactly what I proposed originally, which you shot down.  Please
> actually read the contents of my messages.  What I said was "change the
> configure option and related documentation".

What I mean is "change just the documentation, do not change the
configure option". This seems to be different from your proposal,
which I understand as "change both the configure option and the
documentation".

> Wow, what an inane way of looking at it.  I don't know what world you
> live in, but in my world, users read the configure options and suppose
> that they mean something.  In fact, they *have* to go off on their own
> to assume something, because even the documentation you refer to above
> doesn't say what happens if they choose UCS-2 or UCS-4.  A logical
> assumption would be that Python would use those CEFs internally, and
> that would be incorrect.

Certainly. That's why the documentation should be improved. Changing
the option breaks existing packaging systems, and should not be done
lightly.

> The current implementation supports the UTF-16 CEF, i.e., it supports a
> variable-width encoding form capable of representing all of the Unicode
> space using surrogate pairs.  Please point out a code point that the
> current 2 byte implementation does not support, either directly, or
> through the use of surrogate pairs.

Try to match regular expression classes for non-BMP characters:

>>> import re
>>> re.match(u"[\u1234]",u"\u1234").group()
u'\u1234'

works fine, but

>>> re.match(u"[\U00011234]",u"\U00011234").group()
u'\ud804'

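returns only the high surrogate: the 2-byte build treats the character
class as two separate code units rather than as one character.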
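
As an illustration (a sketch in present-day Python, not code taken from
the interpreter; the helper name to_surrogate_pair is made up for this
example), the standard UTF-16 arithmetic for U+11234 shows where the
u'\ud804' in the second result comes from:

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x11234)
print("%04x %04x" % (high, low))  # prints "d804 de34"
```

The first element of the pair, U+D804, is exactly what the match returns.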

Regards,
Martin