RE Module Performance

Thu Jul 25 06:07:41 EDT 2013

On Thu, Jul 25, 2013 at 7:22 PM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> What I'm trying to say is that it is possible to use UTF-16 internally,
> but *not* assume that every code point (character) is represented by a
> single 2-byte unit. For example, the len() of a UTF-16 string should not
> be calculated by counting the number of bytes and dividing by two. You
> actually need to walk the string, inspecting each double-byte

Anything's possible. But since underlying representations can be
changed fairly easily (relative term of course - it's a lot of work,
but it can be changed in a single release, no deprecation required or
anything), there's very little reason to continue using UTF-16
underneath. May as well switch to UTF-32 for convenience, or to PEP 393
(the flexible string representation) for convenience and efficiency, or
maybe to some other system that's still mostly fixed-width.
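
To make the quoted point concrete, here is a minimal sketch, not anyone's
actual implementation. It shows why bytes-divided-by-two miscounts a
UTF-16 string, and how storage width varies per string under PEP 393.
The helper names (utf16_units, count_code_points, bytes_per_char) are
made up for illustration, and the sys.getsizeof measurement assumes
CPython 3.3 or later.

```python
import sys


def utf16_units(s):
    """Return the string as a list of 16-bit UTF-16 code units."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def count_code_points(units):
    """Count code points by walking the units; a surrogate pair counts once."""
    count = 0
    i = 0
    while i < len(units):
        # A high surrogate (0xD800-0xDBFF) followed by a low surrogate
        # (0xDC00-0xDFFF) together encode one code point outside the BMP.
        if (0xD800 <= units[i] <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            i += 2
        else:
            i += 1
        count += 1
    return count


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by growing a string of ch by one.

    Assumption: CPython 3.3+ (PEP 393); other implementations may differ.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


text = "a\u0394\U0001F600"  # one ASCII, one BMP, one non-BMP character
units = utf16_units(text)

naive_len = len(text.encode("utf-16-le")) // 2  # count bytes, divide by two
walked_len = count_code_points(units)           # walk and pair up surrogates

print(naive_len, walked_len, len(text))
for ch in ("a", "\u0394", "\U0001F600"):
    print(hex(ord(ch)), bytes_per_char(ch))
```

The naive count is one too high because the non-BMP character takes two
code units. With a fixed-width representation, len() needs no walk at all,
which is the convenience referred to above.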

ChrisA