[Python-Dev] bytes / unicode

P.J. Eby pje at telecommunity.com
Sun Jun 27 19:02:28 CEST 2010


At 03:53 PM 6/27/2010 +1000, Nick Coghlan wrote:
>We could talk about this even longer, but the most effective way
>forward is going to be a patch that improves the URL parsing
>situation.

Certainly, it's the only practical solution for the immediate problems in 3.2.

I only mentioned that I "hate the idea" because I'd be more
comfortable if it were explicitly declared to be a temporary hack to
work around the absence of a string coercion protocol, due to the
moratorium on language changes.

But, since the moratorium *is* in effect, I'll try to make this my 
last post on string protocols for a while...  and maybe wait until 
I've looked at the code (str/bytes C implementations) in more detail 
and can make a more concrete proposal for what the protocol would be 
and how it would work.  (Not to mention closer to the end of the moratorium.)


>There are a *very small* number of APIs where it is appropriate to 
>be polymorphic

This is only true if you focus exclusively on bytes vs. unicode, 
rather than the general issue that it's currently impractical to pass 
*any* sort of user-defined string type through code that you don't 
directly control (stdlib or third-party).


>The virtues of a separate poly_str type are that:
>1. It can be simple and implemented in Python, dispatching to str or
>bytes as appropriate (probably in the strings module)
>2. No chance of impacting the performance of the core interpreter (as
>builtins are not affected)

Note that adding a string coercion protocol isn't going to change 
core performance for existing cases, since any place where the 
protocol would be invoked would be a code branch that either throws 
an error or *already* falls back to some other protocol (e.g. the 
buffer protocol).


>3. Lower impact if it turns out to have been a bad idea

How many protocols have been added that turned out to be bad 
ideas?  The only ones that have been removed in 3.x, IIRC, are 
three-way compare, slice-specific operations, and __coerce__...  and 
I'm going to miss __cmp__.  ;-)

However, IIUC, the reason these protocols were dropped isn't that
they were "bad ideas".  Rather, they're things that can be
implemented in terms of a finer-grained protocol.  That is, if you want
__cmp__ or __getslice__ or __coerce__, you can always implement them
via a mixin that converts the newer fine-grained protocols into
invocations of the older protocol.  (As I plan to do for __cmp__ in
the handful of places I use it.)
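
To illustrate the mixin idea for __cmp__, here is a minimal sketch, assuming
Python 3.  The names CmpMixin and Version are hypothetical, made up for this
example, and are not part of the stdlib; this is one possible shape for such a
mixin, not a definitive implementation.

```python
class CmpMixin:
    """Hypothetical mixin: route the rich comparisons through __cmp__."""

    def __cmp__(self, other):
        raise NotImplementedError("subclasses must define __cmp__")

    # Each rich comparison asks __cmp__ for a negative/zero/positive result
    # and passes NotImplemented through so Python can try the other operand.
    def __eq__(self, other):
        r = self.__cmp__(other)
        return r if r is NotImplemented else r == 0

    def __ne__(self, other):
        r = self.__cmp__(other)
        return r if r is NotImplemented else r != 0

    def __lt__(self, other):
        r = self.__cmp__(other)
        return r if r is NotImplemented else r < 0

    def __le__(self, other):
        r = self.__cmp__(other)
        return r if r is NotImplemented else r <= 0

    def __gt__(self, other):
        r = self.__cmp__(other)
        return r if r is NotImplemented else r > 0

    def __ge__(self, other):
        r = self.__cmp__(other)
        return r if r is NotImplemented else r >= 0


class Version(CmpMixin):
    """Hypothetical example type that only knows the old-style __cmp__."""

    def __init__(self, major, minor):
        self.key = (major, minor)

    def __cmp__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        # Classic replacement for the removed cmp() builtin.
        return (self.key > other.key) - (self.key < other.key)


versions = sorted([Version(1, 10), Version(1, 2), Version(0, 9)])
print([v.key for v in versions])
```

Note that because the mixin defines __eq__, instances are unhashable unless a
subclass also supplies __hash__.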

At the moment, however, this isn't possible for multi-string 
operations outside of __add__/__radd__ and comparison -- the coercion 
rules are hard-wired and can't be overridden by user-defined types.
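
A small sketch of the asymmetry, assuming Python 3.  The Tagged class is a
hypothetical user-defined string-like type invented for this example (it is
deliberately not a str subclass).  Concatenation can be intercepted through
__radd__, but str.join() insists on real str instances and gives the type no
hook to participate.

```python
class Tagged:
    """Hypothetical string-like type; not a subclass of str."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

    def __add__(self, other):
        # Tagged + anything -> Tagged
        return Tagged(self.text + str(other))

    def __radd__(self, other):
        # anything + Tagged -> Tagged; called when the left operand
        # (e.g. a plain str) cannot handle the addition itself.
        return Tagged(str(other) + self.text)


# Concatenation works, and the user-defined type wins the coercion.
combined = "abc" + Tagged("def")
print(type(combined).__name__, str(combined))

# A multi-string operation such as join() cannot be overridden this way.
try:
    ", ".join([Tagged("a"), "b"])
except TypeError as exc:
    join_error = exc
    print("join failed:", exc)
```

The only way around the join() failure today is to convert to str first,
which loses whatever the user-defined type was carrying.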


