[Python-Dev] Generalised String Coercion

Phillip J. Eby pje at telecommunity.com
Mon Aug 8 15:54:20 CEST 2005


At 10:07 AM 8/8/2005 +0200, Martin v. Löwis wrote:
>Phillip J. Eby wrote:
> >>Hm. What would be the use case for using %s with binary, non-text data?
> >
> >
> > Well, I could see using it to write things like netstrings,
> > i.e.  sock.send("%d:%s," % (len(data),data)) seems like the One Obvious Way
> > to write a netstring in today's Python at least.  But perhaps there's a
> > subtlety I've missed here.
>
>As written, this would stop working when strings become Unicode. It's
>pretty clear what '%d' means (format the number in decimal numbers,
>using "\N{DIGIT ZERO}" .. "\N{DIGIT NINE}" as the digits). It's not
>all that clear what %s means: how do you get a sequence of characters
>out of data, when data is a byte string?
>
>Perhaps there could be byte string literals, so that you would write
>
>   sock.send(b"%d:%s," % (len(data),data))

Thinking about it some more, it seems to me it's actually more like this:

    sock.send(("%d:%s," % (len(data), data.decode('latin1'))).encode('latin1'))

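That is, if all we have is unicode and bytes, and 'data' is bytes, then 
decoding from latin1 and encoding back to latin1 is the right way to do a 
netstring, since latin1 maps each byte value 0-255 to the code point with 
the same number and so round-trips arbitrary binary data unchanged.  It's 
a bit more painful, but still doable.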
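
A minimal sketch of that round trip, written against a Python where text and bytes are already separate types (here `str` stands in for unicode). The helper name `netstring` is hypothetical and only for illustration:

```python
def netstring(data: bytes) -> bytes:
    """Frame `data` as a netstring: b"<length>:<data>,"."""
    # Decode via latin1 so every byte 0-255 becomes the code point
    # with the same number; no byte value is rejected or altered.
    text = "%d:%s," % (len(data), data.decode("latin1"))
    # Encode back with the same codec to recover the original bytes,
    # now wrapped in the length prefix and trailing comma.
    return text.encode("latin1")


# Arbitrary binary data, including bytes that are not valid ASCII or UTF-8.
payload = bytes(range(256))
framed = netstring(payload)
```

The length is taken from the byte string before decoding, so it counts bytes rather than characters; with latin1 the two counts happen to be equal anyway.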


>but this would raise different questions:
>- what does %d mean for a byte string formatting? str(len(data))
>   returns a character string, how do you get a byte string?
>   In the specific case of %d, encoding as ASCII would work, though.
>- if byte strings are mutable, what about byte string literals?
>   I.e. if I do
>
>   x = b"%d:%s,"
>   x[1] = b'f'
>
>   and run through the code the second time, will the literal have
>   changed? Perhaps these would be displays, not literals (although
>   I never understood why Guido calls these displays)

I'm thinking that bytes.decode and unicode.encode are the correct way to 
convert between the two, and there's no such thing as a bytes literal.  We 
can always optimize "constant.encode(constant)" to a bytes display 
internally if necessary, although it will be a pain for programs that have 
lots of bytestring constants.  OTOH, we've previously discussed having a 
'bytes()' constructor, and perhaps it should use latin1 as its default 
encoding.
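
As a rough illustration of the idea, here is a sketch of what a constructor with a latin1 default might look like. The function `make_bytes` is hypothetical, not an existing or proposed API; it simply wraps the encode step described above:

```python
def make_bytes(text: str, encoding: str = "latin1") -> bytes:
    """Hypothetical constructor: build a byte string from a text constant."""
    # With latin1, code points 0-255 map directly to byte values 0-255.
    return text.encode(encoding)


# A bytestring "constant" built from a text constant.
template = make_bytes("%d:%s,")

# Going the other way uses decode with the same codec.
roundtrip = template.decode("latin1")

# Text outside the latin1 range cannot be converted and raises an error.
try:
    make_bytes("\u20ac")  # EURO SIGN, code point above 255
    euro_failed = False
except UnicodeEncodeError:
    euro_failed = True
```

Programs with many bytestring constants would have to repeat such a call for each one, which is the pain point mentioned above.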


