Python 2.7.5: Strange and differing behavior depending on sys.setdefaultencoding being set
Hans-Peter Jansen
hpj at urpla.net
Tue Dec 3 19:15:34 EST 2013
Hi Chris,
On Wednesday, 4 December 2013 10:20:31 Chris Angelico wrote:
> On Wed, Dec 4, 2013 at 9:32 AM, Hans-Peter Jansen <hpj at urpla.net> wrote:
> > I'm experiencing strange behavior with attached code, that differs
> > depending on sys.setdefaultencoding being set or not. If it is set, the
> > code works as expected, if not - what should be the usual case - the code
> > fails with some non-sensible traceback.
>
> Interesting. You're mixing str and unicode objects a lot here. The
> cleanest solution, IMO, would be to either switch to Python 3 or add
> this to the top of your code:
>
> from __future__ import unicode_literals
>
> Either way, you'll have all your quoted strings be Unicode, rather
> than byte, strings. Then take away the requirement that Unicode
> strings contain non-ASCII characters, and let everything go through
> that code branch.
>
> Looking at this line in reprstr():
>
> s = "u'%s'" % s.replace("'", "\\'")
>
> Two potential problems with that. Firstly, the representation is
> flawed: a backslash in the input string won't be changed, so it's not
> a true repr; but if this is just for debugging output, that's not a
> big deal. Secondly, this code might produce either a str or a unicode,
> depending on the type of s. That may cause messes later; since you
> seem to be mostly working with the unicode type after that, it'd
> probably be simpler/safer to make that always return one:
The code serves three purposes: make simple strings more readable, document
the others as being unicode, and display those correctly ;)
> s = u"u'%s'" % s.replace("'", "\\'")
>
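For the record, a minimal sketch of how both points could be addressed. The name reprstr_sketch is hypothetical, it assumes s is already a unicode string, and it is still not a full repr (control characters are left untouched):

```python
def reprstr_sketch(s):
    """Hypothetical variant of reprstr(); assumes s is a unicode string."""
    # Escape backslashes first, so that the backslashes added for the
    # quotes in the second step are not escaped a second time.
    escaped = s.replace(u"\\", u"\\\\").replace(u"'", u"\\'")
    # A unicode template keeps the result type unicode.
    return u"u'%s'" % escaped


sample = u"it's a\\b"            # the text: it's a\b
result = reprstr_sketch(sample)  # the text: u'it\'s a\\b'
```

The order of the two replace() calls matters: swapping them would double the backslash that was just inserted in front of each quote.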
> But the actual problem, I think, is that repr() guarantees to return a
> str, and you're trying to return a unicode. Here's an illustration:
>
> # -*- coding: utf-8 -*-
> class Foo(object):
>     def __repr__(self):
>         return u'äöü'
>
> foo = Foo()
> print(foo.__repr__())
> print(repr(foo))
>
> The first one succeeds, because building up that string isn't at all a
> problem. The second one then tries to turn the return value of
> __repr__ into a string using the default encoding - which defaults to
> 'ascii', hence the problem you're seeing.
>
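The implicit step can be spelled out explicitly. This is only a sketch: implicit_repr_conversion is a hypothetical helper that mimics what Python 2's repr() does with a unicode return value; it is not what the interpreter literally calls. Written this way, it behaves the same on Python 2 and 3:

```python
text = u'\xe4\xf6\xfc'  # the string u'äöü' from the example above


def implicit_repr_conversion(value, default_encoding='ascii'):
    """Hypothetical helper: encode a unicode __repr__ result the way
    Python 2's repr() does, using the default encoding."""
    return value.encode(default_encoding)


# With the usual default encoding ('ascii') the conversion fails.
try:
    implicit_repr_conversion(text)
    failed = False
except UnicodeEncodeError:
    failed = True

# With 'utf-8' (the effect of calling sys.setdefaultencoding('utf-8'))
# the same conversion succeeds and yields a byte string.
utf8_result = implicit_repr_conversion(text, 'utf-8')
```

This would explain why the code only works when sys.setdefaultencoding has been set.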
> Solution 1: Switch to Python 3, in which this will work fine (because
> repr() in Py3 returns a Unicode string, since _everything_ is
> Unicode).
>
> Solution 2: Explicitly encode in frec, or at the end of Record.__repr__():
>
> def __repr__(self):
>     s = u'%s(\n%s\n)' % (self.__class__.__name__,
>                          frec(self.__dict__))
>     return s.encode("utf-8")
>
> (that could be a one-liner, but it's already pushing 80-chars, so if
> you have a length limit, breaking it helps)
>
> Solution 3: Don't use __repr__ here, but simply have your frec
> function intelligently handle Record types. Effectively, you have your
> own method of generating a debug description of a Record, which could
> then return a unicode instead of a str.
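A rough sketch of what Solution 3 might look like. The Record class, describe() and describe_fields() below are hypothetical stand-ins, not the real Record and frec from the attached code, and the sketch assumes all text values are already unicode:

```python
class Record(object):
    """Hypothetical stand-in for the real Record class."""

    def __init__(self, **fields):
        self.__dict__.update(fields)


def describe(value, indent=0):
    """Return a unicode debug description without going through repr()."""
    pad = u'  ' * indent
    if isinstance(value, Record):
        body = describe_fields(value.__dict__, indent + 1)
        return u'%s(\n%s\n%s)' % (value.__class__.__name__, body, pad)
    # Assumption: plain values format cleanly into a unicode template.
    return u'%s' % (value,)


def describe_fields(fields, indent):
    """Format the attributes of a record, one per line, sorted by name."""
    pad = u'  ' * indent
    return u'\n'.join(
        u'%s%s: %s' % (pad, key, describe(fields[key], indent))
        for key in sorted(fields))


description = describe(Record(name=u'\xe4\xf6\xfc', count=3))
```

Since nothing here passes through repr(), the result stays unicode until the caller decides how to encode it for output.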
Thanks for all your suggestions; they are very helpful indeed. Even better,
I now understand the issue in question. I will take some rest and then
decide what to do about it, with your valuable help.
> I personally recommend switching to Python 3 :) But presumably that's
> not an option, or you'd already have considered it.
You nailed it ;)
The amount of special unicode handling code that is necessary to keep
Python 2 happy makes proceeding with it no real fun in the long term.
And the biggest argument for hacking in Python IS the fun part of it. Then
productivity, elegance, ..., you name it.
Have-a-good-day-ly y'rs,
Pete