Totally confused by the str/bytes/unicode differences introduced in Python 3.x

Giampaolo Rodola' gnewsg at gmail.com
Fri Jan 16 21:34:25 EST 2009


On 17 Jan, 03:09, Steven D'Aprano <st... at REMOVE-THIS-cybersource.com.au>
wrote:
> On Fri, 16 Jan 2009 17:32:17 -0800, Giampaolo Rodola' wrote:
> > On 17 Jan, 02:24, MRAB <goo... at mrabarnett.plus.com> wrote:
>
> >> If you're truly working with strings of _characters_ then 'str' is what
> >> you need, but if you're working with strings of _bytes_ then 'bytes' is
> >> what you need.
>
> > I work with strings of characters, but to convert bytes into a string
> > I need to specify an encoding, and that's what confuses me. Before,
> > there was no need to deal with that.
>
> In Python 2.x, str means "string of bytes". This has been renamed "bytes"
> in Python 3.
>
> In Python 2.x, unicode means "string of characters". This has been
> renamed "str" in Python 3.
>
> If you do this in Python 2.x:
>
>     my_string = str(bytes_from_socket)
>
> then you don't need to convert anything, because you are going from a
> string of bytes to a string of bytes.
>
> If you do this in Python 3:
>
>     my_string = str(bytes_from_socket)
>
> then you *do* have to convert, because you are going from a string of
> bytes to a string of characters (unicode). The Python 2.x equivalent code
> would be:
>
>     my_string = unicode(bytes_from_socket)
>
> and when you convert to unicode, you can get encoding errors. A better
> way to do this would be some variation on:
>
>     my_str = bytes_from_socket.decode('utf-8')
>
> You should read this:
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
> --
> Steven
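
For what it's worth, here is a minimal sketch of how the two spellings
above behave in Python 3. The sample payload is made up for illustration
and is not from the thread. As far as I can tell, a bare str() call on a
bytes object does not decode at all: it returns the repr of the bytes.

```python
# Hypothetical payload; assume these bytes arrived from a socket.
bytes_from_socket = "230 Benvenuto, caffè".encode("utf-8")

# In Python 3, str() with no encoding does not decode the bytes:
# it returns their repr, i.e. a string that starts with b'.
as_repr = str(bytes_from_socket)

# Passing an encoding to str() decodes, same as bytes.decode().
via_str = str(bytes_from_socket, "utf-8")
via_decode = bytes_from_socket.decode("utf-8")

# Going back to bytes requires an explicit encode.
round_trip = via_decode.encode("utf-8")
```

So the explicit .decode('utf-8') form is the one that actually turns
bytes into characters.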

Thanks, that clarifies things a bit, even if I still have a lot of doubts.
I wish I could do:

my_str = bytes_from_socket.decode('utf-8')

That would mean I could avoid replacing "" with b"" almost everywhere in
my code, but I doubt it would actually be a good idea.
RFC 2640 states that UTF-8 is the preferred encoding for both clients
and servers, but I see that Python 3.x's ftplib uses latin1, for
example (bug?). How is my server supposed to deal with that?
I think that using bytes everywhere, as Christian recommended, would
be the only way to behave exactly like the 2.x version, but that's not
easy at all.
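
To make the encoding mismatch concrete, here is a rough sketch of what
happens when one side sends UTF-8 and the other decodes as latin1. The
file name is a made-up example, and this only shows the codec behaviour,
not what ftplib or any server actually does internally.

```python
# Hypothetical file name sent by a client as UTF-8, per RFC 2640.
wire = "caffè.txt".encode("utf-8")

# Decoding with the encoding the peer used recovers the original text.
as_utf8 = wire.decode("utf-8")

# latin-1 maps every byte to a code point, so decoding never fails,
# but multi-byte UTF-8 sequences come out as mojibake.
as_latin1 = wire.decode("latin-1")

# The latin-1 round trip is lossless, so the original bytes can still
# be recovered and decoded properly afterwards.
recovered = as_latin1.encode("latin-1").decode("utf-8")

# Bytes that are not valid UTF-8 raise unless an error handler is given.
bad = b"caf\xe8.txt"  # 'è' encoded as latin-1, invalid as UTF-8
try:
    bad.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
lenient = bad.decode("utf-8", errors="replace")
```

The latin1 side never raises, which may be why it is attractive as a
default, but a peer that really speaks UTF-8 will see garbled names
unless the bytes are re-decoded.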


--- Giampaolo
http://code.google.com/p/pyftpdlib



