2to3 chokes on bad character
Peter Otten
__peter__ at web.de
Fri Feb 25 03:49:33 EST 2011
John Machin wrote:
> On Feb 25, 12:00 am, Peter Otten <__pete... at web.de> wrote:
>> John Machin wrote:
>
>> > Your Python 2.x code should be TESTED before you poke 2to3 at it. In
>> > this case just trying to run or import the offending code file would
>> > have given an informative syntax error (you have declared the .py file
>> > to be encoded in UTF-8 but it's not).
>>
>> The problem is that Python 2.x accepts arbitrary bytes in string
>> constants.
>
> Ummm ... isn't that a bug? According to section 2.1.4 of the Python
> 2.7.1 Language Reference Manual: """The encoding is used for all
> lexical analysis, in particular to find the end of a string, and to
> interpret the contents of Unicode literals. String literals are
> converted to Unicode for syntactical analysis, then converted back to
> their original encoding before interpretation starts ..."""
>
> How do you reconcile "used for all lexical analysis" and "String
> literals are converted to Unicode for syntactical analysis" with the
> actual (astonishing to me) behaviour?

You are right, the current behaviour is probably an implementation accident
stemming from the assumption that

    s.decode("utf-8").encode("utf-8") == s

always holds. Other encodings (I tried cp1252) produce the expected
SyntaxError.
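
To make the round-trip assumption concrete, here is a rough sketch. It only exercises the codecs on raw byte strings; it does not reproduce what the tokenizer does internally. The helper name roundtrips and the sample byte strings are made up for illustration.

```python
def roundtrips(raw, encoding):
    """Return True if raw survives decode -> encode unchanged in `encoding`.

    Hypothetical helper, not part of Python or 2to3.
    """
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeError:
        # The bytes are not valid in this encoding, so there is no round trip.
        return False


valid = b"caf\xc3\xa9"   # "cafe" with e-acute, correctly encoded as UTF-8
stray = b"caf\xe9"       # the same word in Latin-1; not valid UTF-8

print(roundtrips(valid, "utf-8"))      # True
print(roundtrips(stray, "utf-8"))      # False
print(roundtrips(b"\x81", "cp1252"))   # False: 0x81 is unassigned in Python's cp1252 codec
```

So the identity holds only for byte strings that are already valid in the declared encoding. A check that actually performs the decode rejects the stray byte, which would be consistent with the SyntaxError seen for cp1252.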