2to3 chokes on bad character

Fri Feb 25 03:49:33 EST 2011

John Machin wrote:

> On Feb 25, 12:00 am, Peter Otten <__pete... at web.de> wrote:
>> John Machin wrote:
> 
>> > Your Python 2.x code should be TESTED before you poke 2to3 at it. In
>> > this case just trying to run or import the offending code file would
>> > have given an informative syntax error (you have declared the .py file
>> > to be encoded in UTF-8 but it's not).
>>
>> The problem is that Python 2.x accepts arbitrary bytes in string
>> constants.
> 
> Ummm ... isn't that a bug? According to section 2.1.4 of the Python
> 2.7.1 Language Reference Manual: """The encoding is used for all
> lexical analysis, in particular to find the end of a string, and to
> interpret the contents of Unicode literals. String literals are
> converted to Unicode for syntactical analysis, then converted back to
> their original encoding before interpretation starts ..."""
> 
> How do you reconcile "used for all lexical analysis" and "String
> literals are converted to Unicode for syntactical analysis" with the
> actual (astonishing to me) behaviour?

You are right; the current behaviour is probably an implementation accident
stemming from the assumption that

s.decode("utf-8").encode("utf-8") == s

always holds. That identity is only guaranteed when s is valid UTF-8; for
arbitrary bytes the decode step itself fails. Other encodings (I tried
cp1252) produce the expected SyntaxError.
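
A minimal sketch of where the round-trip assumption breaks down, using plain byte strings rather than the tokenizer itself. The helper name roundtrip_holds and the sample values are made up for illustration; the sketch only shows the codec behaviour and makes no claim about how the interpreter is actually implemented.

```python
def roundtrip_holds(raw, encoding):
    """Return True if raw survives decode/encode unchanged.

    Return False if raw cannot be decoded in the given encoding at all,
    or if re-encoding yields different bytes.
    """
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return text.encode(encoding) == raw


valid = b"caf\xc3\xa9"   # "cafe" with e-acute, correctly encoded as UTF-8
invalid = b"caf\xe9"     # the same word in cp1252; not valid UTF-8

print(roundtrip_holds(valid, "utf-8"))      # well-formed UTF-8
print(roundtrip_holds(invalid, "utf-8"))    # truncated multi-byte sequence
print(roundtrip_holds(invalid, "cp1252"))   # 0xE9 is defined in cp1252
print(roundtrip_holds(b"\x81", "cp1252"))   # 0x81 is undefined in cp1252
```

Running the same kind of check over the bytes of a source file, with the encoding named in its coding declaration, is one way to catch a mislabelled file before handing it to 2to3.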