[Python-Dev] Bytes path support

Nick Coghlan ncoghlan at gmail.com
Thu Aug 28 01:54:31 CEST 2014


On 28 Aug 2014 04:20, "Glenn Linderman" <v+python at g.nevcal.com> wrote:
>
> On 8/27/2014 5:16 AM, Nick Coghlan wrote:
>>
>> On 27 August 2014 08:52, Nick Coghlan <ncoghlan at gmail.com> wrote:
>>>
>>> On 27 Aug 2014 02:52, "Terry Reedy" <tjreedy at udel.edu> wrote:
>>>>
>>>> Nick, I think the first half of your post is one of the clearest
>>>> expositions yet of 'why Python 3' (in particular, the str to unicode
>>>> change).  It is worthy of wider distribution and without much change,
>>>> it would be a great blog post.
>>>
>>> Indeed, I had the same idea - I had been assuming users already
>>> understood this context, which is almost certainly an invalid
>>> assumption.
>>>
>>> The blog post version is already mostly written, but I ran out of
>>> weekend. Will hopefully finish it up and post it some time in the next
>>> few days :)
>>
>> Aaand, it's up:
>>
>> http://www.curiousefficiency.org/posts/2014/08/multilingual-programming.html
>>
>> Cheers,
>> Nick.
>>
>
> Indeed, I also enjoyed and found enlightening your response to this
> issue, including the broader historical context. I remember when Unicode
> was first published back in 1991, and it sounded interesting, but far
> removed from the reality of implementations of the day. I was intrigued
> by UTF-8 at the time, and even wrote an encoder and decoder for it for a
> software package that eventually never reached any real customers.
>
> Your blog post says:
>>
>> Choosing UTF-8 aims to treat formatting text for communication with the
>> user as "just a display issue". It's a low impact design that will "just
>> work" for a lot of software, but it comes at a price:
>>
>> because encoding consistency checks are mostly avoided, data in
>> different encodings may be freely concatenated and passed on to other
>> applications. Such data is typically not usable by the receiving
>> application.
>
>
> I don't believe this is a necessary result of using UTF-8. It is a
> possible result, and I guess some implementations are using it this way,
> but a proper language could still provide and/or require proper usage of
> UTF-8 data through its type system just as Python3 is doing with PEP 393.

Yes, Go works that way, for example. I doubt it actually checks for valid
UTF-8 at OS boundaries though - that would be a potentially expensive
check, and as a network service centric language, Go can afford to place
more constraints on the operating environment than we can.

> In fact, if it were not for the requirement to support passing character
> strings in other formats (UTF-16, UTF-32) to historical APIs (in CPython
> add-on packages) and the resulting practical performance considerations
> of converting to/from UTF-8 repeatedly when calling those APIs, Python3
> could have evolved to using UTF-8 as its underlying data format, and
> obtained the same encoding consistency as it has today.

We already have string processing algorithms that work for fixed width
encodings (and are known not to work for variable width encodings, hence
the bugs in Unicode handling on the old narrow builds).
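
To make the narrow build problem concrete, here is a minimal sketch (the
helper name utf16_units is made up for illustration) comparing how many
units one string occupies when counted as code points, as UTF-16 code
units, and as UTF-8 bytes. On a fixed width representation, indexing by
code point is plain array access; on the variable width ones, position 1
does not line up with unit 1 once a non-BMP character is present.

```python
def utf16_units(text):
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


# One non-BMP code point sitting between two ASCII letters.
s = "a\U0001F600b"

code_points = len(s)                  # counts code points on Python 3.3+
code_units = utf16_units(s)           # the non-BMP character needs a surrogate pair
utf8_bytes = len(s.encode("utf-8"))   # the non-BMP character needs four bytes

# Indexing by code point returns the whole character, not half of a pair.
middle = s[1]

print(code_points, code_units, utf8_bytes, middle == "\U0001F600")
```

An algorithm that treats the UTF-16 code unit count as the string length
is the kind of thing that misbehaved on the old narrow builds.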

It isn't that variable width encodings aren't a viable choice for
programming language text modelling, it's that the assumption of a fixed
width model is more deeply entrenched in CPython (and especially the C API)
than the exact number of bits used per code point.

> Of course, nothing can be "required" if the user chooses to continue
> operating in the encoded domain, and manipulate data using the necessary
> byte-oriented features of whatever language is in use.
>
> One of the choices of Python3 was to retain character indexing as an
> underlying arithmetic implementation citing algorithmic speed, but that
> is a seldom needed operation, and of limited general applicability when
> considering grapheme clusters.

The choice that was made was to say no to the question "Do we rewrite from
scratch a Unicode type that we already know works?". The decisions about
how to handle *text* were made way back before the PEP process even
existed, and later captured as PEP 100.

What changed in Python 3 was dropping the hybrid 8-bit str type with its
locale dependent behaviour, and parcelling its responsibilities out to
either the existing unicode type (renamed as str, as it was the default
choice), or the new locale independent bytes type.
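
As a rough illustration of that split (the helper try_concat is
hypothetical, written only for this example), Python 3 keeps the two
types apart: moving between them needs an explicit encode or decode, and
mixing them raises an error instead of silently guessing an encoding.

```python
# Text lives in str; encoded data lives in bytes.
text = "caf\u00e9"
data = text.encode("utf-8")       # explicit step into the encoded domain
round_trip = data.decode("utf-8") # explicit step back to text


def try_concat(a, b):
    """Return a + b, or the TypeError raised when the types do not mix."""
    try:
        return a + b
    except TypeError as exc:
        return exc


# str + bytes is refused; there is no locale dependent implicit conversion.
mixed = try_concat(text, data)

print(len(text), len(data), type(mixed).__name__)
```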

> An iterator based approach can solve both problems, but would have been
> best introduced as part of Python3.0, although it may have made 2to3
> harder, and may have made it less practical to implement six and other
> "run on both Py2 and Py3" type solutions, without introducing those same
> iterative solutions into Python 2.6 or 2.7.

The option of fundamentally changing the text handling design was never on
the table. The Python 2 unicode type works fine, it is the Python 2 str
type that needed changing.

> Such solutions could still be implemented as options. Even PEP 393
> grudgingly supports some use of UTF-8 when requested by the user, as I
> understand it.

Not quite. PEP 393 heavily favours and optimises UTF-8, trading memory for
speed by implicitly caching the UTF-8 representation. The support isn't
begrudged, it's enthusiastic. We just don't use it for the text processing
algorithms, because those assume a fixed width encoding.
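
The fixed width storage can be observed from Python, at least roughly.
The sketch below assumes CPython 3.3 or later, and bytes_per_char is a
hypothetical helper: it compares sys.getsizeof for two lengths of the
same repeated character so the constant object header cancels out. The
numbers are implementation details, not a language guarantee.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by differencing two string sizes."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n


ascii_width = bytes_per_char("a")            # all code points below 128
bmp_width = bytes_per_char("\u0394")         # needs more than 8 bits
astral_width = bytes_per_char("\U0001F600")  # needs more than 16 bits

print(ascii_width, bmp_width, astral_width)
```

Each string uses one width for every character it contains, chosen from
its largest code point, which is what keeps indexing a simple array
access.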

> Whether such an implementation would be better based on bytes or str is
> uncertain without further analysis, although type checking would probably
> be easier if based on str. A high-performance implementation would likely
> need to be implemented at least partly in C rather than Python, although
> it could be prototyped in Python for proof of functionality. The
> iterators could obviously be implemented to work on top of solutions such
> as PEP 393, by simply using indexing underneath, when fixed-width
> characters are available, and other techniques when UTF-8 is the only
> available format (rather than converting from UTF-8 to fixed-width
> characters because of calling the iterator).

For the cost of rewriting every single string manipulation algorithm in
CPython to avoid relying on C array access, the only thing you would save
over PEP 393 is a bit of memory - we already store the UTF-8 representation
when appropriate.
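
For a sense of what "avoid relying on C array access" means, here is a
sketch of code point indexing done directly on UTF-8 bytes. The function
utf8_index is hypothetical and assumes its input is already valid UTF-8;
it is not how CPython does anything. The point is only that the lookup
has to scan from the start, where a fixed width array can jump straight
to the answer.

```python
def utf8_index(data, index):
    """Return code point number `index` from valid UTF-8 bytes.

    Walks the buffer from the beginning, counting lead bytes, so the cost
    grows with the position being looked up.
    """
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    seen = -1
    start = None
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts
        # a new code point.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is not None:
        # The requested code point was the last one in the buffer.
        return data[start:].decode("utf-8")
    raise IndexError("code point index out of range")


sample = "a\u00e9\U0001F600z"
encoded = sample.encode("utf-8")

# Same answers as str indexing, but found by scanning.
results = [utf8_index(encoded, i) for i in range(len(sample))]
print(results == list(sample))
```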

There's simply not a sufficient payoff to justify the cost.

Cheers,
Nick.
