[Python-Dev] SRE incompatibility

Andrew Kuchling akuchlin@mems-exchange.org
Fri, 30 Jun 2000 10:29:00 -0400


On Fri, Jun 30, 2000 at 04:18:13PM +0200, Fredrik Lundh wrote:
>    re.match('\\x00ffffffffffffff', '\377') != None
>or in other words, long hexadecimal escapes are cast
>down to 8-bit characters in RE.

This is for compatibility with Python string literals:

kronos Python-1.6>./python
>>> '\x00fffffff'
'\377'
>>> u'\x00fffffff'
u'\uFFFF'

(Where do these semantics come from, BTW?  C's \x seems to take any
number of hex digits but then reports an error if the resulting value
is greater than 255, i.e. too large to fit into a byte.)
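
To make the casting-down concrete, here is a minimal sketch of the
behaviour the session above suggests: consume every hex digit after \x,
then keep only the low bits that fit the character type.  The helper name
parse_greedy_hex_escape and its width_bits parameter are made up for
illustration; this is not the actual tokenizer or SRE code, just one
reading of the observed results.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def parse_greedy_hex_escape(text, width_bits):
    # Hypothetical helper: parse a leading backslash-x escape greedily,
    # consuming every hex digit that follows, then truncate the value
    # to the low `width_bits` bits (8 for byte strings, 16 for the
    # Unicode strings shown in the session above).
    if not text.startswith("\\x"):
        raise ValueError("expected a \\x escape")
    pos = 2
    value = 0
    while pos < len(text) and text[pos] in HEX_DIGITS:
        value = value * 16 + int(text[pos], 16)
        pos += 1
    if pos == 2:
        raise ValueError("\\x with no hex digits")
    mask = (1 << width_bits) - 1
    # Return the truncated code point and the unparsed remainder.
    return value & mask, text[pos:]
```

Under these assumptions, the literal from the session comes out as
character 0xFF with an 8-bit width and 0xFFFF with a 16-bit width, which
agrees with '\377' and u'\uFFFF' above.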

Note that the \u escape for Unicode characters uses exactly 4 digits,
no more, no less.  It would certainly be simpler and clearer to support
only a fixed number of digits with \x, since I find the casting-down
behaviour magical and not obvious.  But I don't know if we
want to make that change now.  (Guido now realizes the downside to
numbering it 2.0, as everyone hurries to suggest their favorite
backward-incompatible change.)

That doesn't help with regexes, of course, since a pattern might be
written as a regular string but be intended to match Unicode.  Maybe
the simplest rule is the best: always take 4 digits, even if that winds
up being incompatible with the \x in string literals.
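
A rough sketch of what the fixed-width rule could look like, assuming
exactly 4 hex digits are required and anything after them is ordinary
pattern text.  The name parse_fixed_hex_escape and the ndigits parameter
are hypothetical, not part of SRE.

```python
import string


def parse_fixed_hex_escape(text, ndigits=4):
    # Hypothetical helper: accept a backslash-x escape followed by
    # exactly `ndigits` hex digits; no truncation is ever needed.
    if not text.startswith("\\x"):
        raise ValueError("expected a \\x escape")
    digits = text[2:2 + ndigits]
    if len(digits) != ndigits or any(c not in string.hexdigits for c in digits):
        raise ValueError("\\x needs exactly %d hex digits" % ndigits)
    # Return the code point and whatever follows the escape untouched.
    return int(digits, 16), text[2 + ndigits:]
```

With this rule, \x00ffffff would mean character 0x00FF followed by the
four literal characters "ffff", and a short escape such as \xff would be
rejected rather than silently accepted.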

--amk