UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)

M.-A. Lemburg mal@lemburg.com
Wed, 17 Nov 1999 11:03:59 +0100


Tim Peters wrote:
> 
> [MAL]
> > ...demo script...
> 
> It looks like
> 
>     r'\\u0000'
> 
> will get translated into a 2-character Unicode string.

Right...

> That's probably not
> good, if for no other reason than that Java would not do this (it would
> create the obvious 7-character Unicode string), and having something that
> looks like a Java escape that doesn't *work* like the Java escape will be
> confusing as heck for JPython users.  Keeping track of even-vs-odd number of
> backslashes can't be done with a regexp search, but is easy if the code is
> simple <wink>:
> ...Tim's version of the demo...

Guido and I have decided to turn \uXXXX into a standard
escape sequence with no further magic applied. \uXXXX will
only be expanded in u"" strings.

Here's the new scheme:

The 'unicode-escape' encoding is defined as follows:

· all non-escape characters represent themselves as a Unicode ordinal
  (e.g. 'a' -> U+0061).

· all existing defined Python escape sequences are interpreted as
  Unicode ordinals; note that \xXXXX can represent all Unicode
  ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.

· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
  error to have fewer than 4 digits after \u.

Examples:

u'abc'          -> U+0061 U+0062 U+0063
u'\u1234'       -> U+1234
u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+000A
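
For anyone who wants to check mappings like these mechanically, here is a
minimal sketch. It assumes a present-day Python 3 interpreter and its
'unicode_escape' codec, which postdates this proposal and may differ from it
in detail (for instance, \x there takes exactly two hex digits). The helper
name ordinals() is made up for illustration only.

```python
def ordinals(escaped):
    """Decode escaped source bytes and list each character as U+XXXX.

    Assumption: Python 3's 'unicode_escape' codec stands in for the
    'unicode-escape' encoding described above; it is not guaranteed to
    match the proposal in every detail.
    """
    text = escaped.decode("unicode_escape")
    return ["U+%04X" % ord(ch) for ch in text]


# Raw bytes literals keep the backslashes intact so the codec sees them.
print(ordinals(b"abc"))
print(ordinals(rb"\u1234"))
print(ordinals(rb"abc\u1234\n"))

# Fewer than 4 hex digits after \u is rejected by the codec.
try:
    ordinals(rb"\u12")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

With this codec the trailing \n comes out as a single line feed character,
i.e. it is treated as an ordinary escape with no extra magic.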

Now how should we define ur"abc\u1234\n"  ... ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    44 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/