PEP 393 vs UTF-8 Everywhere

Pete Forman petef4+usenet at gmail.com
Fri Jan 20 19:30:16 EST 2017


Chris Kaynor <ckaynor at zindagigames.com> writes:

> On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman <petef4+usenet at gmail.com> wrote:
>> Can anyone point me at a rationale for PEP 393 being incorporated in
>> Python 3.3 over using UTF-8 as an internal string representation?
>> I've found good articles by Nick Coghlan, Armin Ronacher and others
>> on the matter. What I have not found is discussion of pros and cons
>> of alternatives to the old narrow or wide implementation of Unicode
>> strings.
>
> The PEP itself has the rationale for the problems with the
> narrow/wide idea; quoting from
> https://www.python.org/dev/peps/pep-0393/: There are two classes of
> complaints about the current implementation of the unicode type: on
> systems only supporting UTF-16, users complain that non-BMP
> characters are not properly supported. On systems using UCS-4
> internally (and also sometimes on systems using UCS-2), there is a
> complaint that Unicode strings take up too much memory - especially
> compared to Python 2.x, where the same code would often use ASCII
> strings (i.e. ASCII-encoded byte strings). With the proposed
> approach, ASCII-only Unicode strings will again use only one byte per
> character; while still allowing efficient indexing of strings
> containing non-BMP characters (as strings containing them will use 4
> bytes per character).
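
That per-character saving is easy to see on a PEP 393 interpreter
(3.3+). A quick illustration with sys.getsizeof - the exact totals
depend on the build's fixed header sizes, so take the numbers as
indicative:

    >>> import sys
    >>> # ASCII-only text uses the compact 1-byte-per-char representation
    >>> sys.getsizeof('a' * 1000) - sys.getsizeof('')
    1000
    >>> # BMP characters above U+00FF are stored 2 bytes per char
    >>> sys.getsizeof('\u20ac' * 1000) > 2 * 1000
    True
    >>> # non-BMP characters are stored 4 bytes per char
    >>> sys.getsizeof('\U0001F600' * 1000) > 4 * 1000
    True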
>
> Basically, narrow builds had very odd behavior with non-BMP
> characters, namely that indexing into the string could easily produce
> mojibake. Wide builds used quite a bit more memory, which generally
> translates to reduced performance.
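
For anyone who never ran a narrow build, the difference in behaviour
looks like this (the 3.3+ output is real; the narrow-build results are
quoted from memory of 3.2, so treat them as a sketch):

    >>> s = '\U0001F600'    # a single non-BMP character
    >>> len(s)              # PEP 393 (3.3+)
    1
    >>> s[0] == s
    True

    On a 3.2 narrow build the same string was stored as a surrogate
    pair, so len(s) was 2 and s[0] was the lone surrogate '\ud83d' -
    exactly the mojibake being described.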

I'm taking it as a given that the old way was sub-optimal in many
scenarios. My questions were about the alternatives, and why PEP 393
was chosen over other approaches.

>> ISTM that most operations on strings are via iterators and thus
>> agnostic to variable or fixed width encodings. How important is it to
>> be able to get to part of a string with a simple index? Just because
>> old skool strings could be treated as a sequence of characters, is
>> that a reason to shoehorn the subtleties of Unicode into that model?
>
> I think you are underestimating the indexing usages of strings. Every
> indexing operation on a UTF-8 string that contains wider characters
> has to start from index 0 - you can never safely start anywhere else.
> rfind/rsplit/rindex/rstrip and the other reverse functions would
> require walking the string from start to end, rather than
> short-circuiting by reading from right to left. With indexing
> becoming linear time, many simple algorithms need to be written with
> that in mind to avoid O(n^2) behaviour. Such performance regressions
> can easily go unnoticed by developers, who are likely to be testing
> with small data, and may thus cause (accidental) DoS vulnerabilities
> when used on real data. The same problems apply to the old narrow
> (UTF-16) builds - except that those builds did NOT do the linear
> walk: they indexed straight into the code units, which is what caused
> the mojibake problems. Only a UTF-32 or PEP 393 implementation avoids
> both issues.
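
For the record, the linearity described above comes from having to
turn a code point index into a byte offset by scanning from the front.
A minimal sketch of such a lookup (my own illustration, not CPython's
code):

    def byte_offset(data: bytes, index: int) -> int:
        """Return the byte offset of code point `index` in UTF-8 `data`.

        Continuation bytes match the pattern 10xxxxxx, so we count
        only lead bytes while walking from the start - O(n) per lookup.
        """
        seen = 0
        for offset, byte in enumerate(data):
            if byte & 0xC0 != 0x80:         # a lead byte, not 10xxxxxx
                if seen == index:
                    return offset
                seen += 1
        raise IndexError(index)

    >>> byte_offset('naïve'.encode('utf-8'), 3)    # 'v'; 'ï' took 2 bytes
    4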

I was asserting that most useful operations on strings start from index
0. The r* operations would not be slowed down that much, as UTF-8 has
the useful property that a byte which is not at the start of a sequence
(in the sense of a code point rather than Python) is recognisable as
such, and so quick to skip over while working backwards from the end.
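
Concretely, every continuation byte matches 10xxxxxx and UTF-8 allows
at most three of them per code point, so stepping backwards is a
bounded skip. A sketch (again my own illustration):

    def prev_code_point_start(data: bytes, offset: int) -> int:
        """Return the start of the code point ending just before `offset`.

        Skips backwards over continuation bytes (10xxxxxx); UTF-8
        permits at most 3 in a row, so each step is O(1).
        """
        offset -= 1
        while data[offset] & 0xC0 == 0x80:  # continuation byte
            offset -= 1
        return offset

    >>> data = 'naïve'.encode('utf-8')
    >>> prev_code_point_start(data, len(data))   # the final 'e'
    5
    >>> prev_code_point_start(data, 4)           # the two-byte 'ï'
    2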

The only significant use of an index dereference that I could come up
with was the result of a find() or index(). I put out this public
question so that I could be clued in to other uses. My personal
experience is that in most cases where I might consider find(), I end
up using re instead and taking the match groups from the result, which
hold copies of the (sub)strings that I want.
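
For example (a made-up pattern, but typical of what I mean):

    >>> import re
    >>> line = 'host=example.com port=8080'
    >>> m = re.search(r'host=(\S+)\s+port=(\d+)', line)
    >>> m.group(1), m.group(2)   # substring copies; no index arithmetic
    ('example.com', '8080')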

> Note that from a user's point of view (including most, if not almost
> all, developers), PEP 393 strings can be treated as if they were
> UTF-32, but with many of the memory benefits of UTF-8. As far as I'm
> aware, only developers writing extension modules need to care - and
> then only if they need maximum performance and thus cannot afford to
> convert every string they access to UTF-32 or UTF-8.

PEP 393 already says that "the specification chooses UTF-8 as the
recommended way of exposing strings to C code".
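
On CPython you can even peek at that UTF-8 view from Python itself via
ctypes - this leans on the C-API function PyUnicode_AsUTF8, so it is
CPython-specific and only suitable for a quick look:

    >>> import ctypes
    >>> as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
    >>> as_utf8.restype = ctypes.c_char_p
    >>> as_utf8.argtypes = [ctypes.py_object]
    >>> as_utf8('héllo')         # the cached UTF-8 form of the string
    b'h\xc3\xa9llo'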

-- 
Pete Forman


