[issue12691] tokenize.untokenize is broken

Gareth Rees report at bugs.python.org
Sat Aug 6 01:04:05 CEST 2011


Gareth Rees <gdr at garethrees.org> added the comment:

Please find attached a patch containing four bug fixes for untokenize():

* untokenize() now always returns a bytes object, defaulting to UTF-8 if no ENCODING token is found (previously it returned a string in this case).
* In compatibility mode, untokenize() successfully processes all tokens from an iterator (previously it discarded the first token).
* In full mode, untokenize() now returns successfully (previously it failed with an AssertionError).
* In full mode, untokenize() successfully processes tokens that were separated by a backslashed newline in the original source (previously it ran these tokens together).
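To illustrate the two modes the fixes touch, here's a minimal round-trip sketch (this assumes the post-fix behaviour, where untokenize() returns bytes whenever an ENCODING token is present; the source string and variable names are just for demonstration):

```python
import io
import tokenize

source = b'x = 1 + 2\n'

# Full mode: feed the complete 5-tuples back in. Positions are
# preserved, so the round-trip reproduces the source exactly,
# and the ENCODING token makes the result a bytes object.
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
full = tokenize.untokenize(tokens)
assert isinstance(full, bytes)
assert full == source

# Compatibility mode: feed only (type, string) 2-tuples.
# Spacing may differ from the original, but the token stream
# itself round-trips, and the result is still bytes.
pairs = [(tok.type, tok.string) for tok in tokens]
compat = tokenize.untokenize(pairs)
assert isinstance(compat, bytes)
```

Re-tokenizing the compatibility-mode output yields the same sequence of token strings as the original, even though the whitespace between them may not match.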

In addition, I've added some unit tests:

* Test case for backslashed newline.
* Test case for missing ENCODING token.
* roundtrip() tests both modes of untokenize() (previously it just tested compatibility mode).

and updated the documentation:

* Updated the docstring for untokenize to better describe its actual behaviour, and removed the false claim "Untokenized source will match input source exactly". (We can restore this claim if we ever fix tokenize/untokenize so that it's true.)
* Updated the documentation for untokenize in tokenize.rst to match the docstring.

I welcome review: this is my first proper patch to Python.

----------
keywords: +patch
Added file: http://bugs.python.org/file22842/Issue12691.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12691>
_______________________________________
