Totally confused by the str/bytes/unicode differences introduced in Python 3.x

MRAB google at mrabarnett.plus.com
Fri Jan 16 20:24:04 EST 2009


Giampaolo Rodola' wrote:
 > Hi, I'm sure the message I'm going to write will seem quite dumb to
 > most people but I really don't understand the  str/bytes/unicode
 > differences introduced in Python 3.0 so be patient. What I'm trying
 > to do is porting pyftpdlib to Python 3.x. I don't want to support
 > Unicode. I don't want pyftpdlib for py 3k to do anything new or
 > different. I just want it to behave exactly the same as in the 2.x
 > version and I'd like to know if that's possible with Python 3.x.
 >
 > Now. The basic difference is that socket.recv() returns a bytes
 > object instead of a string object and that's the thing which confuses
 > me mainly. My question is: is there a way to convert that bytes
 > object into exactly *the same thing* returned by socket.recv() in
 > Python 2.x (a string)?
 >
 > I know I can do:
 >
 > data = socket.recv(1024)
 > data = data.decode(encoding)
 >
 > ...to convert bytes into a string but that's not exactly the same
 > thing. In Python 2.x I didn't have to care about the encoding. What
 > socket.recv() returned was just a string. That was all. Now doing
 > something like b''.decode(encoding) puts me in serious troubles since
 > that can raise an exception in case client and server use a different
 > encoding.
 >
 > As far as I've understood the basic difference I see now is that a
 > Python 2.x based FTP server could handle a 3.x based FTP client using
 >  "latin1" encoding or "utf-8" or anything else while with Python 3.x
 > I'm forced to tell my server which encoding to use and I don't know
 > how to deal with that.
 >
Originally Python had a single string type, 'str', with 8 bits per
character. That was limiting for international use, so a second string
type, 'unicode', was introduced alongside it.

Now, in Python 3.x, it's time to tidy things up.

In effect, the old 'str' type has been renamed 'bytes' and the old
'unicode' type has been renamed 'str'. If you're truly working with
strings of _characters_ then 'str' is what you need, but if you're
working with strings of _bytes_ then 'bytes' is what you need.
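
To make the distinction concrete, here is a small Python 3 sketch. The
sample string and the choice of UTF-8 are my own, purely for
illustration; nothing here is specific to pyftpdlib.

```python
# Text: a sequence of Unicode code points (Python 3 'str').
text = "caf\u00e9"          # four characters, the last one is e-acute

# Bytes: a sequence of integers in range(256) (Python 3 'bytes').
raw = text.encode("utf-8")  # an encoding turns characters into bytes

print(type(text).__name__, len(text))  # str 4
print(type(raw).__name__, len(raw))    # bytes 5 (e-acute takes two bytes in UTF-8)
print(raw[0])                          # 99: indexing bytes gives an int
print(raw.decode("utf-8") == text)     # True: decoding reverses encoding
```

An encoding only comes into play when you cross from one type to the
other; as long as you stay within 'bytes', none is needed.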

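socket.send() and socket.recv() still behave the same; it's just now
clearer that they work with bytes and not strings. So if you never
decode what recv() returns, and pass bytes to send(), you don't have to
pick an encoding at all, just as in 2.x.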
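
As a rough sketch of what staying in bytes could look like for a
line-based protocol such as FTP: parse_command() below is a hypothetical
helper of my own, not part of pyftpdlib, and the sample input stands in
for whatever socket.recv() would return. The latin-1 round trip at the
end is just one option if some API insists on a str; it is an
assumption on my part that this fits your case.

```python
def parse_command(line):
    """Split one raw command line into (verb, argument), both as bytes.

    Hypothetical helper for illustration; not part of pyftpdlib.
    """
    line = line.rstrip(b"\r\n")
    verb, _, arg = line.partition(b" ")
    return verb.upper(), arg


# Bytes as they might arrive from socket.recv(). The file name is not
# valid UTF-8, yet nothing here can raise a decoding error because
# nothing is ever decoded.
received = b"retr caf\xe9.txt\r\n"
verb, arg = parse_command(received)
print(verb)  # b'RETR'
print(arg)   # b'caf\xe9.txt'

# If a str is unavoidable, latin-1 maps each byte to exactly one code
# point, so decoding never fails and the round trip is lossless.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```

The point is that the argument is carried through untouched, whatever
encoding the client happened to use, which is what the 2.x code was
doing implicitly.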


