Binary strings, unicode and encodings

Fri Jan 16 02:38:23 EST 2004

Laurent Therond wrote:

> Now, if I write bencode('failure reason') into a socket, what will I get
> on the other side of the connection?

Jp has already explained this, but let me stress his observations.

> a) A sequence of bytes where each byte represents an ASCII character

A sequence of bytes, period. 'failure reason' is a byte string. The
bytes in this string are literally copied from the source code .py file
to the cStringIO object.

If your source code was in an encoding that is an ASCII superset
(such as ascii, iso-8859-1, cp1252), then yes: the text 'failure reason'
will come out as a byte string representing ASCII characters.
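
To make that concrete, here is a minimal sketch (an illustration, not
code from the thread). It encodes the same ASCII-only text with the
three codecs named above and checks that the bytes are identical. The
u'' and b'' prefixes are used so that the snippet should run unchanged
on Python 2.6+ and Python 3.3+; the variable names are made up for the
example.

```python
# Encode the same ASCII-only text with several ASCII-superset codecs.
text = u'failure reason'
encodings = ['ascii', 'iso-8859-1', 'cp1252']
encoded = [text.encode(name) for name in encodings]

# Show the bytes each codec produced.
for name, data in zip(encodings, encoded):
    print('%-10s %r' % (name, data))

# Every codec yields the identical byte sequence, one byte per character.
all_same = all(data == b'failure reason' for data in encoded)
print(all_same)
```

The result only agrees because every character in the text is ASCII;
the codecs start to differ as soon as other characters are involved.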

Python has a second, independent string type, called unicode. Literals
of that type are not written in plain quotes, but with a leading u, as
in u'failure reason'.

You should never use the unicode type in a place where byte strings
are expected. Python will implicitly encode such objects with the
system default encoding, which raises an exception if any character is
outside the range that encoding supports (the default is us-ascii).
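
A rough sketch of that failure, assuming the default encoding is
'ascii': the implicit conversion is spelled out here as an explicit
.encode('ascii') call so that the snippet behaves the same on
Python 2.6+ and Python 3.3+. The names text, outcome and explicit are
invented for the example.

```python
text = u'string\xe9'   # ends in LATIN SMALL LETTER E WITH ACUTE

# Stand-in for the implicit conversion: encode with the ascii codec.
try:
    text.encode('ascii')
    outcome = 'encoded'
except UnicodeEncodeError:
    outcome = 'UnicodeEncodeError'
print(outcome)

# Choosing an encoding explicitly avoids the problem.
explicit = text.encode('utf-8')
print(repr(explicit))
```

Encoding explicitly at the point where the data leaves the program
(for example, just before it is written to a socket) keeps the choice
of encoding visible instead of leaving it to the default.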

You also should avoid byte string literals with non-ASCII characters
such as 'stringé'; use unicode literals instead. The user invoking
your script may use a different encoding on his system, so he would
get mojibake, as the last character in the string literal does *not*
denote LATIN SMALL LETTER E WITH ACUTE, but instead denotes the byte
'\xe9' (which is that character only if you use a latin-1-like
encoding).
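
As an illustration (again a sketch, not something from the thread),
the snippet below decodes the byte string b'string\xe9' with a few
codecs and prints the Unicode name of the last character each one
produces. The choice of cp437 and koi8-r as "other" encodings is
arbitrary; any non-latin-1-like codec would serve. It should run on
both Python 2.6+ and Python 3.3+.

```python
import unicodedata

data = b'string\xe9'   # the bytes a latin-1 source file would contain

# The same final byte names a different character under each codec.
last_chars = {}
for name in ['iso-8859-1', 'cp1252', 'cp437', 'koi8-r']:
    last_chars[name] = data.decode(name)[-1]
    print('%-10s %s' % (name, unicodedata.name(last_chars[name])))

# Interpreted as UTF-8, the lone byte is not even a valid sequence.
try:
    data.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

A unicode literal such as u'string\xe9' does not have this problem,
because it names the character itself rather than one particular byte.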

HTH,
Martin