UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
M.-A. Lemburg
mal@lemburg.com
Wed, 17 Nov 1999 11:03:59 +0100
Tim Peters wrote:
>
> [MAL]
> > ...demo script...
>
> It looks like
>
> r'\\u0000'
>
> will get translated into a 2-character Unicode string.
Right...
> That's probably not
> good, if for no other reason than that Java would not do this (it would
> create the obvious 7-character Unicode string), and having something that
> looks like a Java escape that doesn't *work* like the Java escape will be
> confusing as heck for JPython users. Keeping track of even-vs-odd number of
> backslashes can't be done with a regexp search, but is easy if the code is
> simple <wink>:
> ...Tim's version of the demo...
Guido and I have decided to turn \uXXXX into a standard
escape sequence with no further magic applied. \uXXXX will
only be expanded in u"" strings.
Here's the new scheme, with the 'unicode-escape' encoding defined as follows:
· all non-escape characters represent themselves as a Unicode ordinal
(e.g. 'a' -> U+0061).
· all existing defined Python escape sequences are interpreted as
Unicode ordinals; note that \xXXXX can represent all Unicode
ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.
· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
error to have fewer than 4 digits after \u.
Examples:
u'abc' -> U+0061 U+0062 U+0063
u'\u1234' -> U+1234
u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+000A
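A quick way to experiment with these rules is the 'unicode_escape' codec that ships with present-day Python. This is only a sketch: the codec is not guaranteed to match the scheme described here in every detail (for instance, \x now takes exactly two hex digits), and code_points() is a hypothetical helper written for this illustration, not part of any library.

```python
def code_points(text):
    """Format each character of text as a U+XXXX ordinal (illustrative helper)."""
    return " ".join("U+%04X" % ord(ch) for ch in text)


# The escape sequences arrive as literal backslashes in a byte string;
# the 'unicode_escape' codec expands them into Unicode ordinals.
raw = b"abc\\u1234\\n"
decoded = raw.decode("unicode_escape")

print(code_points("abc"))    # U+0061 U+0062 U+0063
print(code_points(decoded))  # U+0061 U+0062 U+0063 U+1234 U+000A
```

Note that the trailing \n decodes to a single line feed, U+000A, just like any other standard Python escape.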
Now how should we define ur"abc\u1234\n" ... ?
--
Marc-Andre Lemburg