readline tokenizer newline sticky wicket

Arthur ajsiegel at optonline.net
Mon Feb 6 21:55:22 EST 2006


Given a "linemess.py" file with inconsistent line endings:

line 1\r\n
line 2\n

tokenized as per:

import tokenize
f = open('linemess.py', 'r')
tokens = tokenize.generate_tokens(f.readline)
for t in tokens:
    print t

I get output as follows:

(1, 'line', (1, 0), (1, 4), 'line 1\r\n')
(2, '1', (1, 5), (1, 6), 'line 1\r\n')
(4, '\r\n', (1, 6), (1, 8), 'line 1\r\n')
(1, 'line', (2, 0), (2, 4), 'line 2\n')
(2, '2', (2, 5), (2, 6), 'line 2\n')
(4, '\n', (2, 6), (2, 7), 'line 2\n')
(0, u'', (3, 0), (3, 0), u'')

So the Windows \r\n is tokenized as a single literal '\r\n' token, rather 
than being translated to '\n' under the convention of universal newline 
support.
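The behaviour is easy to reproduce without the file; a minimal sketch (variable names are mine) that feeds the same two lines through generate_tokens via an in-memory buffer:

```python
import tokenize
from io import StringIO

# Same two lines as the sample file: one ending in CRLF, one in LF.
src = 'line 1\r\nline 2\n'
toks = list(tokenize.generate_tokens(StringIO(src).readline))

# The NEWLINE token for the first line keeps the literal '\r\n';
# no translation to '\n' is performed by the tokenizer itself.
newline_strings = [t[1] for t in toks if t[0] == tokenize.NEWLINE]
```

Because generate_tokens only sees whatever readline hands it, any newline translation has to happen in the file object, not in tokenize.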

Isn't this a problem? 

I think this must have been at the root of the issue I ran into, when a 
file with messy, inconsistent line endings that nonetheless compiled and 
ran without a problem was rejected by tokenize.py as having an indent 
problem.

On the theory that if tokenize needs to fail when crap is thrown at it, 
it should at least do so more gracefully - is this bug reportable?
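For what it's worth, a workaround sketch (the file name and sample content are from the post above; the normalization step is my assumption about what universal-newline file mode would do):

```python
import tokenize
from io import StringIO

# Recreate the messy sample file from the post.
with open('linemess.py', 'wb') as f:
    f.write(b'line 1\r\nline 2\n')

# Read raw bytes and normalize CRLF and lone CR to LF ourselves,
# mimicking universal-newline translation, before tokenizing.
with open('linemess.py', 'rb') as f:
    raw = f.read()
src = raw.replace(b'\r\n', b'\n').replace(b'\r', b'\n').decode('ascii')

toks = list(tokenize.generate_tokens(StringIO(src).readline))
for tok in toks:
    print(tok)
```

With the endings normalized up front, tokenize never sees a '\r' at all, so mixed-ending files can't trip it up this way.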

Art






