[Tutor] Polymorphic function in Python 2 & 3?

Steven D'Aprano steve at pearwood.info
Sun Sep 8 05:28:58 CEST 2013


On Sat, Sep 07, 2013 at 12:45:02PM -0700, Albert-Jan Roskam wrote:
> Hi,
> 
> I have a class and I want its initializer to be able to take both 
> byte strings (python 3: byte objects) and unicode strings (python 3: 
> strings). [...] I need bytes because I am 
> working with binary data.

Consider only accepting binary data. It is not very difficult for the 
caller to explicitly convert their text strings into binary data ahead 
of time, and besides, "convert text to binary" is ambiguous. As the Zen 
of Python says, resist the temptation to guess.

Consider the *text* string "abcdef". Which of the following binary 
data (shown in hex) does it represent?

Six values, each between 0 and FF?
1) 61 62 63 64 65 66
2) 81 82 83 84 85 86  # Hint: IBM mainframe users might expect this.

Six values, each between 0 and FFFF?
3) 6100 6200 6300 6400 6500 6600
4) 0061 0062 0063 0064 0065 0066

Twelve values between 0 and FF?
5) 61 00 62 00 63 00 64 00 65 00 66 00
6) 00 61 00 62 00 63 00 64 00 65 00 66

Three values between 0 and FF?
7) AB CD EF

Something else? 
8) ValueError: expected decimal digits but got "abcdef"

Even assuming that you are expecting single byte data, not double bytes, 
there are still six legitimate ways to convert this string. It seems 
risky to assume that if the caller passes you "▼□■" they actually meant 
E2 96 BC E2 96 A1 E2 96 A0.
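
Just to make the ambiguity concrete, here is a quick sketch (Python 3 
syntax, all standard library codecs) of how the same six characters 
come out under a few of those interpretations:

text = "abcdef"

text.encode("ascii")      # 1) b'abcdef' -> 61 62 63 64 65 66
text.encode("cp500")      # 2) EBCDIC    -> 81 82 83 84 85 86
text.encode("utf-16-le")  # 5) 61 00 62 00 63 00 64 00 65 00 66 00
text.encode("utf-16-be")  # 6) 00 61 00 62 00 63 00 64 00 65 00 66
bytes.fromhex(text)       # 7) b'\xab\xcd\xef'
int(text)                 # 8) ValueError: invalid literal for int()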


If I were designing this, I'd probably take the rule:

- byte strings are accepted by numeric value, e.g. b'a' -> hex 61

- text strings are expected to be pairs of hex digits, e.g. u'a' is an 
error, u'abcdef' -> hex AB CD EF, and u'hello' is an error.


That seems more useful to me than silently assuming UTF-8.
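
A minimal sketch of that rule, assuming the unicode = str shim shown 
further down for Python 3 (the name as_bytes is just for illustration; 
binascii.unhexlify does the hex-pair conversion on both 2 and 3):

import binascii

def as_bytes(value):
    # byte strings pass through untouched
    if isinstance(value, bytes):
        return value
    # text strings must be an even number of hex digits
    if isinstance(value, unicode):
        try:
            return binascii.unhexlify(value.encode('ascii'))
        except (UnicodeEncodeError, TypeError, binascii.Error):
            raise ValueError('expected pairs of hex digits, '
                             'got %r' % (value,))
    raise TypeError('expected bytes or text, '
                    'got %s' % type(value).__name__)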


> So it's forward compatible Python 2 code (or backward 
> compatible Python 3 code, if you like). If needed, the arguments of 
> __init__ are converted into bytes using a function called encode(). I 
> pasted the code that I wrote here: http://pastebin.com/2WBQ0H87. Sorry 
> if it's a little long. It works for Python 2.7 and 3.3. But is this 
> the best way to do this? In particular, is inspect.getargs the best 
> way to get the argument names? (I don't want to use **kwargs). Also, I 
> am using setattr to set the parameters, so the encode() method has 
> side effects, which may not be desirable. 

Some questions/observations:

* Why do you bother setting attributes a, b, ... e only to then set them 
again in the encode method?

* It isn't clear to me what the encode method is supposed to do, which 
suggests it is trying to do too much. The doc string says:

    Params can be bytes, str, unicode,
    dict, dict of dics, list of str/bytes/unicode

but it isn't clear what will happen if you pass these different values 
to encode. For instance, if params = {1: None, 2: None}, what do you 
expect to happen? How about {None: 42} or ['a']?

My *guess* is that it is actually expecting a list of (attribute name, 
string value) pairs; that, at least, is how you seem to be using it, 
but the documentation gives me no help here.


I think a much better approach would be to use a helper function:

def to_bytes(string):
    if isinstance(string, unicode):
        return string.encode('utf-8')  # But see above, why UTF-8?
    elif isinstance(string, bytes):
        return string
    raise TypeError


class Test:
    def __init__(self, a, b, *args):
        self.a = to_bytes(a)
        try:
            self.b = to_bytes(b)
        except TypeError:
            self.b = None
        self.extras = [to_bytes(s) for s in args]


Short, sweet, easy to understand, and efficient: it requires no 
self-inspection magic, doesn't try to guess the caller's intention, and 
is easily usable without worrying about side-effects, which means it is 
easy to test.
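
For example, with the unicode = str shim in place on Python 3, you 
would get something like:

t = Test(b"spam", u"ham", u"eggs", b"beans")
t.a       # -> b'spam'
t.b       # -> b'ham' (encoded to UTF-8)
t.extras  # -> [b'eggs', b'beans']

t = Test(b"spam", 42)
t.b       # -> None, because to_bytes(42) raised TypeError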


Some other observations... 

I dislike the fact that on UnicodeDecodeError you assume that the dodgy 
bytes given must be encoded in the default encoding:

        except UnicodeDecodeError:
            # str, python 2
            cp = locale.getdefaultlocale()[-1].lower()
            if cp != "utf-8":
                return arg.decode(cp).encode("utf-8")
            return arg

I think the whole approach is complicated, convoluted and quite frankly 
nasty (you have no comments explaining the logic of why you catch some 
exceptions and why you do what you do), but if you're going to use the 
default locale, this is how you ought to do it IMO:

    lang, encoding = locale.getdefaultlocale()
    if encoding is None:
        # no default locale, or can't determine it
        # what the hell do we guess now???
    else:
        try:
            unistr = arg.decode(encoding)
        except UnicodeDecodeError:
            # And again, what guess do we make now???
        else:
            return unistr.encode('utf-8')  # Why UTF-8?


but as already mentioned, I think that being less "Do What I Mean" and 
more "be explicit about what you want" is a better idea.


As for your 2/3 compatibility code at the top of the module:

try:
    unichr
except NameError:
    unichr = chr  # python 3
 
try:
    unicode
except NameError:
    unicode = basestring = str  # python 3


I don't believe you use either unichr or basestring, so why bother with 
them? I normally do something like this:

try:
    unicode
except NameError:
    # Python 3
    unicode = str

which I think is all you need.

Also, rather than building all the 2-and-3 logic into _bytify, I think 
it is better to split it into two functions:

def _bytify2(s):
    ...

def _bytify3(s):
    ...

if sys.version < '3':
    _bytify = _bytify2
else:
    _bytify = _bytify3


which makes it much easier to understand the code, and much easier to 
drop support for version 2 eventually.
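
If you keep the original encode-text-as-UTF-8 behaviour, the two bodies 
stay tiny; roughly (a sketch, untested against your pastebin):

def _bytify2(s):
    # Python 2: str is already bytes; only unicode needs encoding
    return s.encode('utf-8') if isinstance(s, unicode) else s

def _bytify3(s):
    # Python 3: str is text; bytes pass through unchanged
    return s.encode('utf-8') if isinstance(s, str) else s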



-- 
Steven

