[Python-Dev] Known doctest bug with unicode?

Fri Apr 18 18:05:19 CEST 2008

On Fri, Apr 18, 2008 at 8:27 AM, Jeroen Ruigrok van der Werven
<asmodai at in-nomine.org> wrote:
> # vim: set fileencoding=utf-8 :
>
>  kanamap = {
>     u'あ': 'a'
>  }
>
>  def transpose(word):
>     """Convert a word in kana to its equivalent Hepburn romanisation.
>
>     >>> transpose(u'あ')
>     'a'
>     """
>     transposed = ''
>     for character in word:
>         transposed += kanamap[character]
>     return transposed
>
>  if __name__ == '__main__':
>     import doctest
>     doctest.testmod()
>
>  doctest:
>
>  [16:24] [ruigrok at akuma] (1) {20} % python trans.py
>  **********************************************************************
>  File "trans.py", line 11, in __main__.transpose
>  Failed example:
>     transpose(u'あ')
>  Exception raised:
>     Traceback (most recent call last):
>       File "doctest.py", line 1212, in __run
>         compileflags, 1) in test.globs
>       File "<doctest __main__.transpose[0]>", line 1, in <module>
>         transpose(u'あ')
>       File "trans.py", line 16, in transpose
>         transposed += kanamap[character]
>     KeyError: u'\xe3'
>  **********************************************************************
>  1 items had failures:
>    1 of   1 in __main__.transpose
>  ***Test Failed*** 1 failures.
>
>  normal interpreter:
>
>  >>> from trans import transpose
>  >>> transpose(u'あ')
>  'a'

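What you've got is an 8-bit string (the docstring) containing a unicode
literal.  The module's own compilation honours the utf-8 declaration,
but doctest then passes the example to the compiler a second time,
without that declaration, and it defaults to iso-8859-1.  Thus
u'あ'.encode('utf-8').decode('latin-1') -> u'\xe3\x81\x82', and the
first of those characters, u'\xe3', is the key in your KeyError.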
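
The round trip is easy to check by hand.  A minimal sketch (the variable
names are mine, and I spell the kana as the escape \u3042 so the snippet
itself has no encoding dependency):

```python
# Sketch: reproduce the mis-decoding by hand.
kana = u'\u3042'                        # HIRAGANA LETTER A
utf8_bytes = kana.encode('utf-8')       # the bytes as stored in a utf-8 source file
misread = utf8_bytes.decode('latin-1')  # the same bytes read as iso-8859-1

# One kana character turns into three latin-1 characters; iterating over
# the word then looks up the first of them, which is not in kanamap.
assert misread == u'\xe3\x81\x82'
assert misread[0] == u'\xe3'
```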

Possible solutions:
1. Make the docstring itself unicode, assuming doctest allows this.
2. Call doctest explicitly, giving it the correct encoding.
3. See if you can put an encoding declaration in the doctest itself.
4. Make doctest smarter, so that it can grab the original module's encoding.
5. Wait until 3.0, where this is hopefully fixed by making doctests
use unicode by default?
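
Not strictly one of the options above, but in the same spirit: if you
just need the test to pass today, you can probably keep the non-ASCII
bytes out of the doctest altogether by writing the kana as an escape
inside a raw docstring.  A rough sketch, assuming you don't mind the
example being less readable (the join is my simplification of your
loop):

```python
import doctest

kanamap = {
    u'\u3042': 'a',  # HIRAGANA LETTER A
}

def transpose(word):
    r"""Convert a word in kana to its equivalent Hepburn romanisation.

    The docstring is raw, so doctest sees the six ASCII characters
    \u3042 and the compiler expands them; no source encoding is involved.

    >>> transpose(u'\u3042')
    'a'
    """
    return ''.join(kanamap[character] for character in word)

if __name__ == '__main__':
    doctest.testmod()
```

The cost is that the docstring no longer shows the actual kana, so this
is a stopgap rather than a fix for doctest itself.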

-- 
Adam Olsen, aka Rhamphoryncus