[Python-Dev] Internal representation of strings and Micropython

Thu Jun 5 14:20:04 CEST 2014

On 5 June 2014 22:01, Paul Sokolovsky <pmiscml at gmail.com> wrote:
>> Aside from
>> some of the POSIX locale handling issues on Linux, many of the
>> concerns are with the usability of bytes and bytearray, not with str -
>> that's why binary interpolation is coming back in 3.5, and there will
>> likely be other usability tweaks for those types as well.
>
> All these changes are what let me dream on and speculate on
> possibility that Python4 could offer an encoding-neutral string type
> (which means based on bytes), while move unicode back to an explicit
> type to be used explicitly only when needed (bloated frameworks like
> Django can force users to it anyway, but that will be forcing on
> framework level, not on language level, against which people rebel.)
> People can dream, right?

If you don't model strings as arrays of code points, or at least
assume a particular universal encoding (like UTF-8), you have to give
up string concatenation in order to tolerate arbitrary encodings -
otherwise you end up with unintelligible data that nobody can decode
because it switches encodings without notice. Assuming a universal
encoding is a viable model if your OS guarantees it (Mac OS X does, for
example, so Python 3 assumes UTF-8 for all OS interfaces there), but
Linux currently has no such guarantee. Many runtimes just decide they
don't care, and assume UTF-8 anyway. Python 3 may even join them some
day, due to the problems caused by trusting the locale encoding to be
correct, but the startup code will need non-trivial changes for that to
happen (the C.UTF-8 locale may even become widespread before we get
there).
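
As a rough illustration of the concatenation problem, here is a minimal
sketch. The sample strings, the variable names and the try_decode helper
are made up for the example, and it assumes the two pieces arrive as
bytes from sources that used different encodings:

```python
# Two pieces of text, each encoded by a different (hypothetical) source.
latin1_part = "caf\u00e9".encode("latin-1")   # b'caf\xe9'
utf8_part = "na\u00efve".encode("utf-8")      # b'na\xc3\xafve'

# An "encoding-neutral" byte string lets us concatenate them blindly.
mixed = latin1_part + b" " + utf8_part


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# No single encoding recovers the original text from the mixed bytes:
# UTF-8 rejects the lone 0xE9 byte, and Latin-1 "succeeds" but turns
# the UTF-8 bytes for the second word into mojibake.
utf8_result = try_decode(mixed, "utf-8")
latin1_result = try_decode(mixed, "latin-1")

# Decoding each piece with its own encoding *before* concatenating
# (i.e. working with code points) keeps the text intact.
text = latin1_part.decode("latin-1") + " " + utf8_part.decode("utf-8")

print(utf8_result)
print(ascii(latin1_result))
print(ascii(text))
```

Once the bytes are joined, nothing in the data records where the
encoding changes, which is why the decoding has to happen at the
boundary, before any concatenation.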

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia