PEP 393 vs UTF-8 Everywhere

MRAB python at mrabarnett.plus.com
Fri Jan 20 19:18:51 EST 2017


On 2017-01-20 23:06, Chris Kaynor wrote:
> On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman <petef4+usenet at gmail.com> wrote:
>> Can anyone point me at a rationale for PEP 393 being incorporated in
>> Python 3.3 over using UTF-8 as an internal string representation? I've
>> found good articles by Nick Coghlan, Armin Ronacher and others on the
>> matter. What I have not found is discussion of pros and cons of
>> alternatives to the old narrow or wide implementation of Unicode
>> strings.
>
> The PEP itself gives the rationale and the problems with the narrow/wide
> approach; quoting from https://www.python.org/dev/peps/pep-0393/:
> There are two classes of complaints about the current implementation
> of the unicode type: on systems only supporting UTF-16, users complain
> that non-BMP characters are not properly supported. On systems using
> UCS-4 internally (and also sometimes on systems using UCS-2), there is
> a complaint that Unicode strings take up too much memory - especially
> compared to Python 2.x, where the same code would often use ASCII
> strings (i.e. ASCII-encoded byte strings). With the proposed approach,
> ASCII-only Unicode strings will again use only one byte per character;
> while still allowing efficient indexing of strings containing non-BMP
> characters (as strings containing them will use 4 bytes per
> character).
>
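That per-string sizing is easy to see from Python itself. A minimal
sketch (CPython 3.3+ with PEP 393; sys.getsizeof includes object
overhead, so the exact numbers vary a little between versions):

    import sys

    ascii_s  = "a" * 1000             # code points < 128     -> 1 byte each
    latin_s  = "\xe9" * 1000          # code points < 256     -> 1 byte each
    bmp_s    = "\u20ac" * 1000        # code points < 0x10000 -> 2 bytes each
    astral_s = "\U0001F600" * 1000    # non-BMP code points   -> 4 bytes each

    for name, s in [("ascii", ascii_s), ("latin-1", latin_s),
                    ("bmp", bmp_s), ("non-BMP", astral_s)]:
        print(name, len(s), sys.getsizeof(s))

Each string uses just enough bytes per code point for its widest
character, and indexing stays O(1) for all of them.
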
> Basically, narrow builds had very odd behavior with non-BMP
> characters, namely that indexing into the string could easily produce
> mojibake. Wide builds used quite a bit more memory, which generally
> translates to reduced performance.
>
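To make the narrow-build oddity concrete - a small sketch, run on a
PEP 393 interpreter; the narrow-build behaviour described in the
comments is what a 3.2 narrow build did, because it indexed by UTF-16
code unit:

    s = "a\U0001F600b"

    # On 3.3+ (PEP 393) the non-BMP character is a single element:
    print(len(s), s[1])      # 3 😀   (one code point, one index slot)

    # A 3.2 narrow build stored the same literal as UTF-16 code units,
    # so len(s) was 4 and s[1] was the lone surrogate '\ud83d'.
    # These are the code units it would have seen:
    units = s.encode("utf-16-le")
    print([hex(int.from_bytes(units[i:i+2], "little"))
           for i in range(0, len(units), 2)])
    # ['0x61', '0xd83d', '0xde00', '0x62']
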
>> ISTM that most operations on strings are via iterators and thus agnostic
>> to variable or fixed width encodings. How important is it to be able to
>> get to part of a string with a simple index? Just because old skool
>> strings could be treated as a sequence of characters, is that a reason
>> to shoehorn the subtleties of Unicode into that model?
>
> I think you are underestimating how often strings are indexed. With a
> UTF-8 representation, any indexing operation on a string that contains
> multi-byte characters has to count code points from the start - you can
> never jump straight to position i. rfind/rsplit/rindex/rstrip and the
> other related reverse
> functions would require walking the string from start to end, rather
> than short-circuiting by reading from right to left. With indexing
> becoming linear time, many simple algorithms need to be written with
> that in mind, to avoid n*n time. Such performance regressions can
> often go unnoticed by developers, who are likely to be testing with
> small data, and thus may cause (accidental) DOS attacks when used on
> real data. The old narrow builds (UTF-16) had exactly the same problem
> in principle; they simply didn't do the scanning at all and indexed by
> code unit instead, which is what caused the mojibake problems. Only a
> UTF-32 or PEP 393 implementation avoids both issues.
>
You could implement rsplit and rstrip easily enough, but rfind and 
rindex return the index, so you'd need to scan the string to return that.
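A rough sketch of why, using a bytes object to stand in for a
hypothetical UTF-8 internal representation: the right-to-left search
itself is cheap (UTF-8 is self-synchronising), but it yields a byte
offset, and turning that into the code-point index rfind has to return
means counting lead bytes from the start:

    data = "héllo wörld".encode("utf-8")

    byte_off = data.rfind("ö".encode("utf-8"))   # cheap: right-to-left scan
    cp_index = sum(1 for b in data[:byte_off]
                   if (b & 0xC0) != 0x80)        # O(n): count lead bytes
    print(byte_off, cp_index)                    # 8 7

A PEP 393 string can hand back that index directly.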

> Note that from the point of view of a user (including most developers,
> if not almost all), PEP393 strings can be treated as if they were
> UTF32, but with many of
> the benefits of UTF8. As far as I'm aware, it is only developers
> writing extension modules that need to care - and only then if they
> need maximum performance, and thus cannot convert every string they
> access to UTF32 or UTF8.
>
As someone who has written an extension, I can tell you that I much
prefer dealing with a fixed number of bytes per codepoint to a variable
number of bytes per codepoint, especially as I'm also supporting
earlier versions of Python, where that was the case.
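For anyone who hasn't written one, here's roughly what the difference
feels like, sketched in pure Python with encoded buffers standing in
for the internal representation (the real extension API is C, of
course, so this is only an illustration):

    import struct

    s = "a\u20ac\U0001F600z"

    utf32 = s.encode("utf-32-le")        # fixed 4 bytes per code point
    def cp_fixed(i):
        # O(1): the i-th code point sits at byte offset 4*i
        return chr(struct.unpack_from("<I", utf32, 4 * i)[0])

    utf8 = s.encode("utf-8")             # 1-4 bytes per code point
    def cp_variable(i):
        # O(i): step over lead and continuation bytes from the start
        off = 0
        for _ in range(i):
            off += 1
            while off < len(utf8) and (utf8[off] & 0xC0) == 0x80:
                off += 1
        end = off + 1
        while end < len(utf8) and (utf8[end] & 0xC0) == 0x80:
            end += 1
        return utf8[off:end].decode("utf-8")

    print(cp_fixed(2), cp_variable(2))   # 😀 😀

The first is roughly what you get with the old wide builds or with
PEP 393 (where the width is fixed per string); the second is what every
hot loop looks like with a variable-width representation.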



