[issue12486] tokenize module should have a unicode API

Martin Panter report at bugs.python.org
Sun Oct 4 23:27:56 EDT 2015


Martin Panter added the comment:

I agree it would be very useful to be able to tokenize arbitrary text without worrying about encodings. I left some suggestions on the documentation changes. Also, some test cases for this would be good.

However, I wonder if a separate function would be better for text-mode tokenization. It would make it clearer when an ENCODING token is expected and when it isn't, and would avoid any confusion about what happens when readline() returns a byte string one time and a text string another time. Also, having untokenize() change its output type depending on the presence of an ENCODING token seems like bad design to me.
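
For illustration (a minimal sketch against the current stdlib, not part of any proposed patch): the bytes API emits an ENCODING token first, and untokenize() switches to returning bytes when that token is present.

    import io
    import tokenize

    source = "x = 1\n"

    # tokenize() expects a readline that returns bytes; the first token
    # it yields is an ENCODING token carrying the detected encoding.
    byte_tokens = list(tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline))
    assert byte_tokens[0].type == tokenize.ENCODING

    # Because the stream starts with ENCODING, untokenize() returns bytes.
    assert isinstance(tokenize.untokenize(byte_tokens), bytes)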

Why not just bless the existing generate_tokens() function as a public API, perhaps renaming it to something clearer like tokenize_text() or tokenize_text_lines() at the same time?
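
As a rough sketch of what that would look like from the caller's side (using the function under its current, undocumented name): generate_tokens() takes a readline that returns str, never emits an ENCODING token, and untokenize() on its output stays in the str domain.

    import io
    import tokenize

    source = "x = 1\n"

    # generate_tokens() expects a readline that returns str and starts
    # directly with the source tokens; no ENCODING token is emitted.
    text_tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    assert text_tokens[0].type == tokenize.NAME  # the name 'x'

    # With no ENCODING token in the stream, untokenize() returns str.
    assert isinstance(tokenize.untokenize(text_tokens), str)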

----------
nosy: +martin.panter
stage:  -> patch review

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12486>
_______________________________________