PEP 393 vs UTF-8 Everywhere

Chris Kaynor ckaynor at zindagigames.com
Fri Jan 20 18:06:24 EST 2017


On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman <petef4+usenet at gmail.com> wrote:
> Can anyone point me at a rationale for PEP 393 being incorporated in
> Python 3.3 over using UTF-8 as an internal string representation? I've
> found good articles by Nick Coghlan, Armin Ronacher and others on the
> matter. What I have not found is discussion of pros and cons of
> alternatives to the old narrow or wide implementation of Unicode
> strings.

The PEP itself has the rationale for the problems with the narrow/wide
approach; quoting from https://www.python.org/dev/peps/pep-0393/:
There are two classes of complaints about the current implementation
of the unicode type: on systems only supporting UTF-16, users complain
that non-BMP characters are not properly supported. On systems using
UCS-4 internally (and also sometimes on systems using UCS-2), there is
a complaint that Unicode strings take up too much memory - especially
compared to Python 2.x, where the same code would often use ASCII
strings (i.e. ASCII-encoded byte strings). With the proposed approach,
ASCII-only Unicode strings will again use only one byte per character;
while still allowing efficient indexing of strings containing non-BMP
characters (as strings containing them will use 4 bytes per
character).

Basically, narrow builds had very odd behavior with non-BMP
characters, namely that indexing into the string could easily produce
mojibake. Wide builds used quite a bit more memory, which generally
translates to reduced performance.
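
A quick way to see the flexible representation at work (my own rough
sketch, not from the PEP; exact byte counts vary a little by build):

    import sys

    ascii_s  = 'a' * 1000           # all code points < 256 -> 1 byte each
    bmp_s    = '\u0416' * 1000      # BMP, above Latin-1    -> 2 bytes each
    astral_s = '\U0001F600' * 1000  # non-BMP (emoji)       -> 4 bytes each

    for s in (ascii_s, bmp_s, astral_s):
        # roughly 1000, 2000 and 4000 bytes respectively,
        # plus a small fixed per-object overhead
        print(len(s), sys.getsizeof(s))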

> ISTM that most operations on strings are via iterators and thus agnostic
> to variable or fixed width encodings. How important is it to be able to
> get to part of a string with a simple index? Just because old skool
> strings could be treated as a sequence of characters, is that a reason
> to shoehorn the subtleties of Unicode into that model?

I think you are underestimating how often strings are indexed. With a
UTF-8 representation, every operation on a string that contains
multi-byte characters has to locate code points by scanning from index
0 - you can never safely jump straight to an arbitrary position.
rfind/rsplit/rindex/rstrip and the other reverse functions would
likewise have to deal with variable-width characters rather than
simply short-circuiting by reading from right to left. With indexing
becoming linear time, many simple algorithms need to be written with
that in mind to avoid O(n*n) behavior. Such performance regressions
often go unnoticed by developers, who are likely to be testing with
small data, and can turn into (accidental) DoS attacks when the code
meets real data. The old narrow builds had exactly the same problem
(UTF-16 is also variable width; note that narrow builds did NOT do
this linear scanning, which is what caused the mojibake problems) -
only a UTF-32 or PEP 393 implementation avoids it.
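
To make that indexing cost concrete, here is a rough sketch (my own
illustration, not CPython code) of the scan that s[i] would need if
strings were stored internally as UTF-8:

    def utf8_offset(data, i):
        """Byte offset of code point i in the UTF-8 bytes `data`.

        Continuation bytes match the bit pattern 10xxxxxx, so only
        lead bytes are counted - and the scan always starts at byte 0.
        """
        seen = 0
        for offset, byte in enumerate(data):
            if byte & 0xC0 != 0x80:      # lead byte: a new code point
                if seen == i:
                    return offset
                seen += 1
        raise IndexError(i)

    # utf8_offset('héllo'.encode('utf-8'), 2) == 3, because 'é' takes
    # two bytes. A PEP 393 string computes the same position in O(1).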

Note that from a user's perspective (which covers most developers, if
not almost all), PEP 393 strings can be treated as if they were
UTF-32, while keeping many of the memory benefits of UTF-8. As far as
I'm aware, only developers writing extension modules need to care -
and even then only if they need maximum performance and therefore
cannot afford to convert every string they access to UTF-32 or UTF-8.
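
For example, on a PEP 393 build the following behaves the way most
users expect, with constant-time indexing (a quick interactive sketch,
not from the original post):

    >>> s = 'a\U0001F600b'        # non-BMP character in the middle
    >>> len(s)                    # one code point, not two surrogates
    3
    >>> s[1] == '\U0001F600'      # O(1) indexing, no mojibake
    True
    >>> s[2]
    'b'

On the old narrow builds the same string reported a length of 4 and
s[1] was a lone surrogate half - exactly the mojibake problem
described above.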

--
Chris Kaynor


