[Python-Dev] PEP 393 Summer of Code Project

Nick Coghlan ncoghlan at gmail.com
Thu Aug 25 04:47:20 CEST 2011


On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum <guido at python.org> wrote:
> Now I am happy to admit that for many Unicode issues the level at
> which we have currently defined things (code units, I think -- the
> thingies that encodings are made of) is confusing, and it would be
> better to switch to the others (code points, I think). But characters
> are right out.

Indeed, code points are the abstract concept and code units are the
fixed-width values (8, 16 or 32 bits, depending on the encoding) that
a code point is serialised into (FWIW, I'm going to try to keep this
straight in the future by remembering that the Unicode character set
is defined as abstract points on planes, just like geometry).
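
To make the distinction concrete, here is a rough sketch (assuming
Python 3; the variable names are just for illustration) showing one
code point and the code units it turns into under two encodings:

```python
# One abstract code point: U+20AC EURO SIGN.
ch = "\u20ac"
code_point = ord(ch)

# UTF-8 uses 8-bit code units; this code point needs three of them.
utf8_units = list(ch.encode("utf-8"))

# UTF-16 uses 16-bit code units; this code point fits in a single one.
raw = ch.encode("utf-16-be")
utf16_units = [int.from_bytes(raw[i:i + 2], "big")
               for i in range(0, len(raw), 2)]

print(hex(code_point), [hex(u) for u in utf8_units],
      [hex(u) for u in utf16_units])
```

The code point stays the same throughout; only the number and width of
the code units change with the encoding.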

With narrow builds, code units can currently come into play
internally, but with PEP 393 everything internal will be working
directly with code points. Normalisation, combining characters and
bidi issues may still affect the correctness of Unicode comparison and
slicing (and other text manipulation), but there are limits to how
much of the underlying complexity we can effectively hide without
being misleading.
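
As an illustration of both points, a minimal sketch, assuming an
interpreter where PEP 393 is implemented (Python 3.3 or later); the
sample strings are chosen purely as examples:

```python
import unicodedata

# A code point outside the BMP. With PEP 393, len() counts code points,
# so this is 1; UTF-16 still needs a surrogate pair (2 code units).
astral = "\U0001F600"
n_code_points = len(astral)
n_utf16_units = len(astral.encode("utf-16-le")) // 2

# Code points do not hide combining characters: the same text can be
# spelled as one precomposed code point or as a base plus a combiner.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

equal_raw = composed == decomposed
equal_nfc = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# Slicing by code point can split the base letter from its accent.
first = decomposed[:1]

print(n_code_points, n_utf16_units, equal_raw, equal_nfc, repr(first))
```

Comparison only comes out equal after normalising both sides, and the
slice silently drops the accent, which is the kind of complexity that
working in code points does not remove.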

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

