Binary strings, unicode and encodings

Thu Jan 15 18:17:04 EST 2004

On Thu, Jan 15, 2004 at 11:38:39AM -0800, Laurent Therond wrote:
> Maybe you have a minute to clarify the following matter...
> 
> Consider:
> 
> ---
> 
> from cStringIO import StringIO
> 
> def bencode_rec(x, b):
>     t = type(x)
> 
>     if t is str:
>         b.write('%d:%s' % (len(x), x))
>     else:
>         assert 0
> 
> def bencode(x):
>     b = StringIO()
> 
>     bencode_rec(x, b)
> 
>     return b.getvalue()
> 
> ---
> 
> Now, if I write bencode('failure reason') into a socket, what will I get
> on the other side of the connection?
> 
> a) A sequence of bytes where each byte represents an ASCII character

  Yes.

> 
> b) A sequence of bytes where each byte represents the UTF-8 encoding of a
> Unicode character

  Coincidentally, yes.  This is not because anything is encoded as UTF-8
before it is sent (no unicode is written to the socket at all), but because
the *non*-unicode byte string you wrote *happened* to be valid UTF-8.  All
ASCII byte strings fall into this coincidental case, since UTF-8 encodes
every ASCII character as the same single byte.
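
  To make the coincidence concrete, here is a minimal sketch.  It assumes
Python 3, where byte strings are written b'...'; in the Python 2 of the
example above, a plain 'failure reason' is already a byte string.  The
variable names are only illustrative.

```python
# Python 3 sketch (an assumption; the quoted program is Python 2, where
# 'failure reason' is already a byte string).
payload = b'failure reason'

# Decoding the same bytes as ASCII or as UTF-8 gives the same text,
# because UTF-8 represents code points 0-127 as the identical single byte.
as_ascii = payload.decode('ascii')
as_utf8 = payload.decode('utf-8')
same = (as_ascii == as_utf8)

# Every byte is below 128, which is what makes the payload pure ASCII.
all_ascii = all(byte < 128 for byte in payload)

print(same, all_ascii)
```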

> 
> c) It depends on the system locale/it depends on what the site module
> specifies using setdefaultencoding(name)

  Not at all.  'failure reason' isn't unicode, and there are no unicode
transformations going on in the example program.  The default encoding is
never used and has no effect on the program's behavior.

  bencode_rec has an assert in it for a reason.  *Only* byte strings can be
sent using it.  If you want to send unicode, you'll have to encode it
yourself and send the encoded bytes, then decode it on the other end.  If
you depend on the default system encoding, the two ends may not agree on it
and you'll probably end up with problems; if you explicitly select an
encoding yourself and use it on both ends, you won't.
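
  A rough sketch of "encode it yourself, decode it on the other end" might
look like the following.  It assumes Python 3 and UTF-8 as the agreed
encoding; bencode_bytes and bdecode_bytes are hypothetical stand-ins for the
bencode function above and its receiving counterpart, not part of any
library.

```python
from io import BytesIO

ENCODING = 'utf-8'  # assumption: both ends agree on this explicitly

def bencode_bytes(x):
    # Hypothetical Python 3 analogue of the quoted bencode(): bytes only.
    assert type(x) is bytes
    b = BytesIO()
    # The length prefix counts bytes, not characters.
    b.write(str(len(x)).encode('ascii') + b':' + x)
    return b.getvalue()

def bdecode_bytes(data):
    # Hypothetical receiver: split "<length>:<payload>" and return the payload.
    length, _, rest = data.partition(b':')
    n = int(length.decode('ascii'))
    return rest[:n]

text = 'r\u00e9sum\u00e9'                        # unicode text, 6 characters
wire = bencode_bytes(text.encode(ENCODING))      # sender encodes explicitly
received = bdecode_bytes(wire).decode(ENCODING)  # receiver decodes explicitly

print(wire)
print(received == text)
```

  Note that the length written to the wire is 8, the number of encoded
bytes, even though the text has only 6 characters.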

  Jp