[Tutor] %s %r with custom type

Sat Mar 13 10:40:31 CET 2010

On Sat, 13 Mar 2010 13:50:55 +1100
Steven D'Aprano <steve at pearwood.info> wrote:

> On Fri, 12 Mar 2010 10:29:17 pm spir wrote:
> > Hello again,
> >
> > A different issue. On the custom Unicode type discussed in another
> > thread, I have overloaded __str__ and __repr__ to get encoded byte
> > strings (here with debug prints & special formats to distinguish from
> > builtin forms):
> [...]
> > Note that Unicode.__str__ is called neither by "print us", nor by
> > %s. What happens? Why does the issue only occur when using both
> > format %s & %r?
> 
> The print statement understands how to directly print strings 
> (byte-strings and unicode-strings) and doesn't call your __str__ 
> method.
> 
> http://docs.python.org/reference/simple_stmts.html#the-print-statement

Right. But then how to print out customized strings?

> As for string interpolation, I have reported this as a bug:
> 
> http://bugs.python.org/issue8128

Yes, at least the actual behaviour should be clear and properly documented. And it should be the same for the str and unicode types and their subclasses. But I cannot see the advantage of not calling str(): this only prevents customization -- or the use of string interpolation with customized string types.

> I have some additional comments on your class below:
> 
> 
> > class Unicode(unicode):
> >     ENCODING = "utf8"
> >     def __new__(self, string='', encoding=None):
> 
> This is broken according to the Liskov substitution principle.
> 
> http://en.wikipedia.org/wiki/Liskov_substitution_principle
> 
> The short summary: subclasses should only ever *add* functionality, they 
> should never take it away.
> 
> The unicode type has a function signature that accepts an encoding and 
> an errors argument, but you've missed errors.

All right, I'll have a closer look at the semantics of unicode's errors arg and see if it makes sense in my case.
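
For reference, here is a minimal sketch of a constructor that keeps both arguments. It assumes Python 3 (the class in this thread is Python 2), where str is the text type and bytes takes the place of the old byte-string. The class name Text and the default errors="strict" are illustrative choices, not the code under discussion.

```python
class Text(str):
    """Hypothetical str subclass that decodes byte input on construction."""

    ENCODING = "utf-8"  # default used when the caller gives no encoding

    def __new__(cls, string="", encoding=None, errors="strict"):
        # Mirror the builtin signature: accept both encoding and errors.
        if isinstance(string, bytes):
            encoding = cls.ENCODING if encoding is None else encoding
            string = string.decode(encoding, errors)
        return super().__new__(cls, string)


raw = b"caf\xe9"  # Latin-1 bytes; the last byte is not valid UTF-8

from_latin1 = Text(raw, "latin-1")       # decodes cleanly
replaced = Text(raw, errors="replace")   # bad byte becomes U+FFFD
ignored = Text(raw, errors="ignore")     # bad byte is dropped
```

With the default errors="strict", Text(raw) raises UnicodeDecodeError, which is what the builtin type does for undecodable input.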

Notes for the following comments of yours:
(1) What I posted is test code written only to show the issue (e.g. the debug prints are not in the original code).
(2) This class is intended for a kind of parsing and string processing library (think of pyparsing, but designed very differently). It should work only with unicode strings, so it converts the source and every bit of string in pattern defs (e.g. for literal matches). __str__ and __repr__ are intended for feedback (programmer tests and user information, in both cases mainly error messages). __repr__ should normally not be used; I wrote it rather for completeness.

[...] 

> >         if isinstance(string,str):
> >             encoding = Unicode.ENCODING if encoding is None else encoding
> >             string = string.decode(encoding)
> >         return unicode.__new__(Unicode, string)
> >     def __repr__(self):
> >         print '+',
> >         return '"%s"' %(self.__str__())
> 
> This may be a problem. Why are you making your unicode class pretend to 
> be a byte-string? 

(This answer applies rather to __str__.)
So as not to pollute the output. E.g. parse tree nodes (= match results) are shown like:
integer:[sign:- digit:123]

> Ideally, the output of repr(obj) should follow this rule:
> 
> eval(repr(obj)) == obj
> 
> For instance, for built-in unicode strings:
> 
> >>> u"éâÄ" == eval(repr(u"éâÄ"))
> True

> but for your subclass, us != eval(repr(us)). So again, code that works 
> perfectly with built-in unicode objects will fail with your subclass.
> 
> Ideally, repr of your class should return a string like:
> 
> "Unicode('...')"

I 100% agree with your comment, and this is what I do in general. But it does not make much sense in my case, I guess. Once I'm rather sure __repr__ will not normally be used, I will probably rewrite it to show Unicode("...").
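
A repr of that shape could be sketched as follows. This assumes Python 3 and a hypothetical subclass named Text; it reuses the builtin repr for quoting and escaping, so that eval(repr(obj)) == obj holds as long as the class name is in scope.

```python
class Text(str):
    """Hypothetical str subclass with an eval()-friendly repr."""

    def __repr__(self):
        # Let the builtin repr handle quoting and escaping, then wrap
        # the result in the class name, giving Text('...').
        return "%s(%s)" % (type(self).__name__, str.__repr__(self))


us = Text("\xe9\xe2\xc4")        # the three test characters from the thread
round_trip = eval(repr(us))      # rebuilds an equal Text instance
```

Using type(self).__name__ rather than a hard-coded name keeps the repr correct for further subclasses.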

> >     def __str__(self):
> >         print '*',
> >         return '`'+ self.encode(Unicode.ENCODING) + '`'
> 
> What's the purpose of the print statements in the __str__ and __repr__ 
> methods?

Note (1).

> Again, unless you have a good reason to do different, you are best to 
> just inherit __str__ from unicode. Anything else is strongly 
> discouraged.

Note (2).

> > An issue happens in particular cases, when using both %s and %r:
> >
> > s = "éâÄ"
> 
> This may be a problem. "éâÄ" is not a valid str, because it contains 
> non-ASCII characters.

It's just a test case (note (1)) for non-ascii input, precisely.

> As far as I know, the behaviour of stuffing unicode characters into 
> byte-strings is not well-defined in Python, and will depend on external 
> factors like the terminal you are running in, if any. It may or may not 
> work as you expect. It is better to do this:
> 
> u = u"éâÄ"
> s = u.encode('utf-8')

Yes, but I cannot expect every user to use only unicode everywhere as input to my lib (both in sources to be parsed and in pattern defs), like a robot. This is one main reason for my Unicode type, which accepts both str and unicode.
Anyway, all that source of troubles disappears with py3 :-)
Then, I only need __str__ to produce nice, clear, unpolluted output.
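
As a rough illustration of the py3 situation, assuming Python 3 and a hypothetical subclass named Text: there, %s interpolation, str() and print() all go through an overridden __str__ on a str subclass, while %r still uses repr(). This does not reproduce the Python 2 behaviour reported in the bug above.

```python
import io


class Text(str):
    """Hypothetical str subclass with a decorated __str__."""

    def __str__(self):
        # Concatenation returns a plain str, so this does not recurse.
        return "`" + self + "`"


us = Text("abc")

interpolated = "%s" % us       # goes through Text.__str__
both = "%s %r" % (us, us)      # %r uses the inherited repr
explicit = str(us)             # explicit conversion, same result as %s

buf = io.StringIO()
print(us, file=buf)            # print() also calls Text.__str__
```

The underlying value is unchanged: us == "abc" still holds, and only the displayed form carries the backquotes.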

> which will always work consistently so long as you declare a source 
> encoding at the top of your module:
> 
> # -*- coding: UTF-8 -*-

Yes, this applies to my own code. But what about user code calling my lib? (This is the reason for the Unicode.ENCODING config param.)
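
The idea behind such a config param can be sketched like this, assuming Python 3. DEFAULT_ENCODING and to_text are hypothetical names used only for illustration: text passes through untouched, and bytes from user code are decoded with either the library-wide default or an encoding given by the caller.

```python
DEFAULT_ENCODING = "utf-8"  # hypothetical library-wide setting


def to_text(value, encoding=None):
    """Return text unchanged; decode bytes with the configured encoding."""
    if isinstance(value, bytes):
        chosen = DEFAULT_ENCODING if encoding is None else encoding
        return value.decode(chosen)
    return value


text = "\xe9\xe2\xc4"                   # the test characters from the thread
utf8_bytes = text.encode("utf-8")       # 6 bytes: 2 per character
latin1_bytes = text.encode("latin-1")   # 3 bytes: 1 per character

same_text = to_text(text)
from_utf8 = to_text(utf8_bytes)
from_latin1 = to_text(latin1_bytes, "latin-1")

# A wrong guess is not always detected: UTF-8 bytes decode "successfully"
# as Latin-1, but the result is garbage rather than the original text.
garbled = to_text(utf8_bytes, "latin-1")
```

The reverse mistake, decoding these Latin-1 bytes as UTF-8, raises UnicodeDecodeError, so only some encoding mismatches fail loudly.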

Denis
________________________________

la vita e estrany

spir.wikidot.com