[Python-3000] Invalid \U escape in source code give hard-to-trace error

"Martin v. Löwis" martin at v.loewis.de
Wed Jul 18 05:36:05 CEST 2007


> When a source file contains a string literal with an out-of-range \U
> escape (e.g. "\U12345678"), instead of a syntax error pointing to the
> offending literal, I get this, without any indication of the file or
> line:
> 
> UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in
> position 0-9: illegal Unicode character
> 
> This is quite hard to track down.

I think the fundamental flaw is that a codec is used to implement
the Python syntax (or, rather, lexical rules).

I'm not quite sure what the rationale for this design was; doing it
at the lexical level was tricky because \u escapes were allowed
only for Unicode literals, and the lexer had no knowledge of the
prefix preceding a literal. (In 3k, it's still similar, because
\U escapes have no effect in bytes and raw literals.)
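The asymmetry is easy to observe in a current Python 3; a small sketch (using an in-range code point, since out-of-range ones are rejected at compile time):

```python
# \U is only interpreted in ordinary str literals; in raw and bytes
# literals the backslash is kept literally, so the treatment of the
# escape depends entirely on the prefix of the literal.
assert len("\U0001F600") == 1     # interpreted: a single code point
assert len(r"\U0001F600") == 10   # raw str: backslash kept as-is
assert len(rb"\U0001F600") == 10  # raw bytes: \U is never an escape
```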

Still, even if it is "only" handled at the parsing level, I
don't see why it needs to be a codec. Instead, implementing
escapes in the compiler would still allow for proper diagnostics
(notice that in the AST the original lexical form of the string
literal is gone).
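For comparison, later CPython releases did end up wrapping this failure in a SyntaxError; a quick check of what compiler-level diagnostics can provide (current Python 3 behaviour; the filename "example.py" is just an arbitrary example):

```python
# Compiling the offending literal shows what proper diagnostics look
# like: a SyntaxError carrying filename and line number, rather than
# a bare UnicodeDecodeError with no context.
try:
    compile('s = "\\U12345678"\n', "example.py", "exec")
except SyntaxError as exc:
    print(exc.filename, exc.lineno)
```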

> (Both the location of the bad
> literal in the source file, and the origin of the error in the parser.
> :-) Can someone come up with a fix?

The language definition makes it difficult to fix it where I would
consider the "proper" place, i.e. in the tokenization:

http://docs.python.org/ref/strings.html

says that escapeseq is "\" <any ASCII character>. So
"\x" is a valid shortstring.

Then it becomes fuzzy: the spec says that any unrecognized escape
sequence is left in the string. While that appears to be a clear
specification, it has not actually been implemented since Python
2.0. According to the spec, '\U12345678' is well-formed,
and denotes the same string as '\\U12345678'.
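The inconsistency can be shown directly through the codec (a minimal sketch against a current Python 3):

```python
import codecs

# Per the spec, an unrecognized escape is passed through unchanged,
# and the codec does honor that for e.g. '\q':
assert codecs.decode(b"\\q", "unicode_escape") == "\\q"

# ...but the implementation does NOT extend this to malformed \U
# escapes; instead of yielding '\\U12345678', the codec raises --
# with no file or line information attached.
try:
    codecs.decode(b"\\U12345678", "unicode_escape")
except UnicodeDecodeError as exc:
    print(exc)
```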

I now see the following choices:
1. Go back to implementing the spec. Stop complaining about
   invalid escapes for \x and \U, and just interpret the \
   as '\\'. In this case, the current design could be left in
   place, and the codecs would just stop raising these errors.
2. Change the spec to make it an error if \x is not followed
   by two hex digits, \u not by four hex digits, \U not by
   8, or the value denoted by the \U digits is out of range.
   In this case, I would propose to move the lexical analysis
   back into the parser, or just make an internal API that
   will raise a proper SyntaxError (it will be tricky to
   compute the column in the original source line, though).
3. Change the spec to constrain escapeseq, giving up
   the rule that uninterpreted escapes silently become
   two characters. That's difficult to write down in EBNF,
   so it should be formulated through constraints in natural
   language. The lexer would have to keep track of what kind
   of literal it is processing, and reject invalid escapes
   directly at the source level.
There are probably other options as well.
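A hypothetical sketch of what the internal API in option 2 might look like (all names here are invented): the codec call is wrapped so that a failure becomes a proper SyntaxError carrying the file and line, though, as noted, the column in the original source line is not recovered.

```python
def decode_string_literal(raw, filename, lineno):
    """Hypothetical helper for option 2: interpret escapes in a
    string literal, turning codec failures into a SyntaxError that
    points at the offending file and line."""
    try:
        return raw.decode("unicode_escape")
    except UnicodeDecodeError as exc:
        err = SyntaxError("invalid escape in string literal: %s" % exc)
        err.filename = filename
        err.lineno = lineno
        raise err from None
```

A caller (e.g. the compiler's literal-handling code) would then surface the error like any other syntax error, with file and line attached.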

Regards,
Martin
