PEP 393 vs UTF-8 Everywhere

Fri Jan 20 21:10:15 EST 2017

On 2017-01-21 00:51, Pete Forman wrote:
> MRAB <python at mrabarnett.plus.com> writes:
>
>> As someone who has written an extension, I can tell you that I much
>> prefer dealing with a fixed number of bytes per codepoint than a
>> variable number of bytes per codepoint, especially as I'm also
>> supporting earlier versions of Python where that was the case.
>
> At the risk of sounding harsh, if supporting variable bytes per
> codepoint is a pain you should roll with it for the greater good of
> supporting users.
>
Or I could decide not to bother and leave it to someone else to continue
the project. After all, it's not like I'm getting paid for the work,
it's purely voluntary.

> PEP 393 / Python 3.3 required extension writers to revisit their access
> to strings. My explicit question was about why PEP 393 was adopted to
> replace the deficient old implementations rather than another approach.
> The implicit question is whether a UTF-8 internal representation should
> replace that of PEP 393.
>
I already had to handle 1-byte bytestrings and 2/4-byte (narrow/wide)
Unicode strings, so switching to 1/2/4-byte strings wasn't too bad.
Switching to a completely different, variable-width system would've been
a lot more work.
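
To illustrate the difference, here is a minimal Python sketch. It is not code
from the extension: the helper names (fixed_width_encode, codepoint_at_fixed,
codepoint_at_utf8, utf8_len) are hypothetical, and little-endian byte buffers
stand in for the C-level storage. It assumes well-formed input with no lone
surrogates. With a fixed width, finding the Nth codepoint is a multiplication;
with UTF-8 it means walking every lead byte before it.

```python
def fixed_width_encode(text: str) -> tuple[bytes, int]:
    """Pick 1, 2 or 4 bytes per codepoint from the largest codepoint,
    in the spirit of PEP 393 (illustrative only)."""
    biggest = max(map(ord, text), default=0)
    if biggest < 0x100:
        return text.encode("latin-1"), 1
    if biggest < 0x10000:
        return text.encode("utf-16-le"), 2  # no surrogate pairs needed here
    return text.encode("utf-32-le"), 4


def codepoint_at_fixed(buf: bytes, width: int, index: int) -> int:
    """Constant-time lookup: the offset is simply index * width."""
    start = index * width
    return int.from_bytes(buf[start:start + width], "little")


def utf8_len(lead: int) -> int:
    """Length of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    return 4


def codepoint_at_utf8(buf: bytes, index: int) -> int:
    """Linear-time lookup: skip over `index` variable-length sequences."""
    pos = 0
    for _ in range(index):
        pos += utf8_len(buf[pos])
    end = pos + utf8_len(buf[pos])
    return ord(buf[pos:end].decode("utf-8"))


text = "naïve €"
fixed, width = fixed_width_encode(text)
utf8 = text.encode("utf-8")
print(width, hex(codepoint_at_fixed(fixed, width, 6)))
print(hex(codepoint_at_utf8(utf8, 6)))
```

Both lookups return the same codepoint; the difference is that the UTF-8
version needs either the scan shown here or an auxiliary index to get there.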