[Python-Dev] SRE incompatibility
Tim Peters
tpeters@beopen.com
Fri, 30 Jun 2000 12:38:21 -0400
[Andrew Kuchling]
> ...
> This is for compatibility with Python string literals:
>
> kronos Python-1.6>./python
> >>> '\x00fffffff'
> '\377'
> >>> u'\x00fffffff'
> u'\uFFFF'
>
> (Where do these semantics come from, BTW? C's \x seems to take any
> number of hex digits but then reports an error if the character is
> greater than 256, too large to fit into a byte.)
The behavior of \x in C is mostly implementation-defined. The committee
knew that C had to do *something* to support "large characters" down the
road, but in those early days they had no clear idea exactly what. So,
rather than do something sensible <0.5 wink>, they invented a perfectly
general mechanism without portable semantics. "C itself" isn't complaining
if the character "is greater than 256", it's the specific implementation of
C you're using that's complaining. A different implementation is free to (&
probably will!) do something different.
Guido adopted the most commonly implemented semantics (ignore all but the
last byte) in Python, apparently under the delusion that this would be a
Good Thing <wink>. Marc-Andre followed suit by generalizing this madness to
Unicode.
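A minimal sketch of the "take every hex digit, keep only the low bits" rule described above, for readers who want to see it run. The helper name old_style_x_escape and its width_bits parameter are made up for illustration. This emulates the behaviour shown in Andrew's session; it is not the interpreter's actual code.

```python
import re

# A backslash, an 'x', then as many hex digits as follow (greedy).
_HEX_ESCAPE = re.compile(r'\\x([0-9a-fA-F]+)')

def old_style_x_escape(text, width_bits=8):
    """Expand \\x escapes greedily, keeping only the low `width_bits` bits.

    Hypothetical helper: 8 bits mimics the 8-bit string case in the
    quoted session, 16 bits mimics the Unicode string case.
    """
    mask = (1 << width_bits) - 1

    def replace(match):
        # Parse all the digits, then throw away everything above the mask.
        return chr(int(match.group(1), 16) & mask)

    return _HEX_ESCAPE.sub(replace, text)

narrow = old_style_x_escape(r'\x00fffffff')      # keeps the low 8 bits
wide = old_style_x_escape(r'\x00fffffff', 16)    # keeps the low 16 bits
```

Under these assumptions, narrow is the single character 0xFF and wide is the single character 0xFFFF, matching the two results in the quoted session.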
> Note that the \u escape for Unicode characters uses exactly 4 digits,
> no more, no less.
I pushed for that obnoxiously. Glad you appreciate it <wink>. Java does
the same.
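A small sketch of what "exactly 4 digits" means in practice, assuming a current Python 3 interpreter (where every string literal is Unicode) rather than the 1.6 build in the quoted session. The helper name compiles is made up for illustration.

```python
def compiles(source):
    """Return True if `source` is accepted by the compiler as an expression."""
    try:
        compile(source, "<sketch>", "eval")
        return True
    except SyntaxError:
        return False

# Raw strings keep the backslash, so the compiler sees the escape itself.
four_digits = compiles(r"'\u0041bc'")   # \u0041 is consumed, 'bc' is literal text
two_digits = compiles(r"'\u41'")        # too few digits for a \u escape

value = eval(r"'\u0041bc'")
```

Here four_digits is True, two_digits is False, and value is the three-character string 'Abc': the escape stops after four digits, however many hex-looking characters follow.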
> It would certainly be simpler and clearer to only support a fixed
> number of digits with \x, since I find the casting down behaviour is
> magical and not obvious.
Yes, it's basically nuts.
> But I don't know if we want to make that change now.
No from me, because it may break stuff. Wait for Python 2.0 <ahem>.
> (Guido now realizes the downside to numbering it 2.0, as everyone
> hurries to suggest their favorite backward-incompatible change.)
Guido always realized that, I believe. It's a "least of evils" kind of
thing, mixed with a celebration, not a pure win.
> That doesn't help with regexes, of course, since a pattern might be
> written as a regular string but be intended to match Unicode. Maybe
> the simplest rule is the best; always take 4 digits, even if it winds
> up being incompatible with the \x in string literals.
I vote for backward compatibility for now, and not only because that will
irritate /F the most.
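For comparison, a sketch of how the re module in a current Python 3 handles these escapes inside a pattern. This assumes a modern interpreter and is not the 1.6-era SRE behaviour under discussion: there, \x in a pattern takes exactly two hex digits and \u (in str patterns) takes exactly four.

```python
import re

# \x41 is 'A'; the trailing "bc" is ordinary pattern text.
x_match = re.fullmatch(r'\x41bc', 'Abc')

# \u0041 is also 'A'; again only the fixed number of digits is consumed.
u_match = re.fullmatch(r'\u0041bc', 'Abc')

# An escape with too few digits is rejected when the pattern is compiled.
try:
    re.compile(r'\x4')
    short_escape_rejected = False
except re.error:
    short_escape_rejected = True
```

Both matches succeed and the short escape is rejected, so a pattern written as a plain string and one written with \u escapes behave the same way regardless of the text being searched.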