Need debugging knowhow for my creeping Unicodephobia

mk mrkafk at gmail.com
Thu Feb 11 16:43:17 EST 2010


MRAB wrote:

> When working with Unicode in Python 2, you should use the 'unicode' type
> for text (Unicode strings) and limit the 'str' type to binary data
> (bytestrings, ie bytes) only.

Well OK, always use u'something', that's simple -- but isn't str what I 
get from files and sockets and the like?
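For what it's worth, here is a minimal sketch of how I understand the advice: bytes arrive from the file, get decoded once at the boundary, and everything after that point is unicode. The temporary file, the sample text and the choice of UTF-8 are assumptions made up for illustration; data received from a socket would need the same explicit .decode() call.

```python
import io
import os
import tempfile

# Hypothetical sample: write some UTF-8 bytes to a temporary file.
raw = u'za\u017c\u00f3\u0142\u0107'.encode('utf-8')
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'wb') as f:
    f.write(raw)

# Binary mode gives a bytestring (str in Python 2, bytes in Python 3).
with open(path, 'rb') as f:
    data = f.read()
text_from_bytes = data.decode('utf-8')  # decode explicitly at the boundary

# io.open with an encoding decodes while reading and returns unicode.
with io.open(path, encoding='utf-8') as f:
    text_from_io = f.read()

os.remove(path)
```

Either way the encoding has to be known up front; the file itself does not say which one it uses.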

> In Python 3 they've been renamed to 'str' for Unicode _strings_ and
> 'bytes' for binary data (bytes!).

Neat, except that porting most projects and external libraries to
Python 3 seems to be, how should I put it, standing still? Or am I
wrong? That's the impression I get, anyway.

Take web frameworks for example. Do any of them have serious plans and
work in place to port to Python 3?

> Strictly speaking, only Unicode can be encoded.

How so? Can't bytestrings containing characters in, say, KOI8-R encoding
be encoded?
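To make the question concrete, here is a small sketch of the round trip as I understand the terminology: going from unicode to bytes is "encoding", going from bytes to unicode is "decoding", and moving KOI8-R bytes to another encoding means doing one after the other. The sample word and the variable names are mine, chosen only for illustration.

```python
# Russian "Privet", written with escapes so the source stays pure ASCII.
text = u'\u041f\u0440\u0438\u0432\u0435\u0442'

koi8_bytes = text.encode('koi8_r')      # unicode -> bytes ("encoding")
decoded = koi8_bytes.decode('koi8_r')   # bytes -> unicode ("decoding")
utf8_bytes = decoded.encode('utf-8')    # unicode -> bytes, different encoding
```

So the KOI8-R bytestring is not encoded again directly; it is decoded to unicode first, and that unicode is what gets encoded.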

> What Python 2 is doing here is trying to be helpful: if it's already a
> bytestring then decode it first to Unicode and then re-encode it to a
> bytestring.

It's really cumbersome sometimes, even when two libraries are written by
one author: for instance, Mako and SQLAlchemy are written by the same
guy. They are both top-of-the-line in my humble opinion, but when you
connect them you get things like this:

1. You query a SQLAlchemy object that happens to have string fields in
the relational DB.

2. The corresponding Python attributes of those objects then have type
str, not unicode.

3. Then I pass those objects to Mako for HTML rendering.

Typically it works, but only as long as no character in there falls
outside the ASCII range. If one does, an unsuspecting user gets a
UnicodeDecodeError.
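The failure can be reproduced without either library. This sketch spells out the ASCII decode that Python 2 performs implicitly when a str meets unicode; db_value is a hypothetical stand-in for a str attribute coming back from the database, assumed here to hold UTF-8 bytes.

```python
# Hypothetical str value as it might come back from the DB (UTF-8 bytes).
db_value = u'za\u017c\u00f3\u0142\u0107'.encode('utf-8')

try:
    # The step Python 2 does behind the scenes with the default encoding.
    db_value.decode('ascii')
    failed = False
except UnicodeDecodeError:
    failed = True

# A pure-ASCII value goes through the same step without complaint,
# which is why the problem stays hidden until real data shows up.
ascii_ok = b'plain'.decode('ascii')
```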

Sure, I wrote myself a helper that iterates over the keyword dictionary,
converts every str to unicode, and only then passes the dictionary to
render_unicode. It's an overhead, though. It would be nicer to get
everything as unicode from the db and just pass it on for rendering.
(Unless there's something in filters that I missed: I found settings for
the encoding of templates and tags, but nothing on automatic conversion
of objects passed to the method rendering the template.)
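Roughly, the helper looks like the sketch below. The name to_unicode_dict and the UTF-8 default are illustrative assumptions, not anything Mako or SQLAlchemy provides, and it only converts top-level values of the dictionary.

```python
def to_unicode_dict(params, encoding='utf-8'):
    """Return a copy of params with every bytestring value decoded."""
    converted = {}
    for key, value in params.items():
        # bytes is an alias for str in Python 2.6+, so this check
        # catches exactly the values that would trip up the template.
        if isinstance(value, bytes):
            value = value.decode(encoding)
        converted[key] = value
    return converted


# Hypothetical keyword arguments destined for a template.
context = to_unicode_dict({'name': b'za\xc5\xbc', 'count': 3})
# template.render_unicode(**context) would follow here.
```

Attributes of objects passed into the template are not touched by this, so str fields on a mapped object still need converting separately.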

But maybe I'm whining.


> Unfortunately, the default encoding is ASCII, and the bytestring isn't
> valid ASCII. Python 2 is being 'helpful' in a bad way!

And the default encoding is set up in such a way that it cannot be
changed in sitecustomize (without code modification, that is).

Regards,
mk



