UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)

M.-A. Lemburg mal@lemburg.com
Thu, 18 Nov 1999 10:39:30 +0100


Tim Peters wrote:
> 
> [MAL]
> > Guido and I have decided to turn \uXXXX into a standard
> > escape sequence with no further magic applied. \uXXXX will
> > only be expanded in u"" strings.
> 
> Does that exclude ur"" strings?  Not arguing either way, just don't know
> what all this means.
> 
> > Here's the new scheme:
> >
> > With the 'unicode-escape' encoding being defined as:
> >
> > · all non-escape characters represent themselves as a Unicode ordinal
> >   (e.g. 'a' -> U+0061).
> 
> Same as before (scream if that's wrong).
> 
> > · all existing defined Python escape sequences are interpreted as
> >   Unicode ordinals;
> 
> Same as before (ditto).
> 
> > note that \xXXXX can represent all Unicode ordinals,
> 
> This means that the definition of \xXXXX has changed, then -- as you pointed
> out just yesterday <wink>, \xABCDq currently acts like \xCDq.  Does the new
> \x definition apply only in u"" strings, or in "" strings too?  What is the
> new \x definition?

Guido decided to make \xYYXX return U+YYXX *only* within u""
strings. In "" (ordinary Python strings) the same sequence will
result in chr(0xXX), i.e. only the last two hex digits count.
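
To make the two rules concrete, here is a rough sketch. The helper names
are made up for illustration, and the code only models the rule as
described above; it is not the behaviour of any released interpreter.
It assumes the hex run stays within the range chr() accepts.

```python
import re

# Hypothetical helpers modelling the proposed \x rule.
# A \x escape swallows every hex digit that follows it.
_X_ESCAPE = re.compile(r'\\x([0-9a-fA-F]+)')

def expand_x_unicode(text):
    # u"" rule: \xYYXX becomes the single code point U+YYXX.
    return _X_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

def expand_x_plain(text):
    # "" rule: only the low 8 bits survive, i.e. chr(0xXX).
    return _X_ESCAPE.sub(lambda m: chr(int(m.group(1), 16) & 0xFF), text)

sample = r'\xABCDq'   # raw: backslash, x, A, B, C, D, q
print([hex(ord(c)) for c in expand_x_unicode(sample)])
print([hex(ord(c)) for c in expand_x_plain(sample)])
```

With Tim's \xABCDq example, the first helper yields U+ABCD followed by
'q', the second yields chr(0xCD) followed by 'q'.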
 
> > and \OOO (octal) can represent Unicode ordinals up to U+01FF.
> 
> Same as before (ditto).
> 
> > · a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
> >   error to have fewer than 4 digits after \u.
> 
> Same as before (ditto).
> 
> IOW, I don't see anything that's changed other than an unspecified new
> treatment of \x escapes, and possibly that ur"" strings don't expand \u
> escapes.

The difference is that we no longer take the two step approach.
\uXXXX is treated at the same time all other escape sequences
are decoded (the previous version first scanned and decoded
all standard Python sequences and then turned to the \uXXXX
sequences in a second scan).
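
As an illustration of where the two orders can differ, here is a
deliberately tiny sketch that handles only \\, \n and \uXXXX. It
assumes the old second scan ran over the output of the first one; the
function names are hypothetical.

```python
import re

# Only two "standard" escapes are modelled here: \\ and \n.
SIMPLE = {'\\\\': '\\', '\\n': '\n'}

def _u(match):
    # Turn the four hex digits of \uXXXX into one code point.
    return chr(int(match.group(1), 16))

def two_pass(text):
    # Pass 1: standard escapes. Pass 2: \uXXXX in whatever pass 1 produced.
    first = re.sub(r'\\[\\n]', lambda m: SIMPLE[m.group(0)], text)
    return re.sub(r'\\u([0-9a-fA-F]{4})', _u, first)

def one_pass(text):
    # All escapes are recognised in a single left-to-right scan.
    def repl(m):
        return _u(m) if m.group(1) else SIMPLE[m.group(0)]
    return re.sub(r'\\u([0-9a-fA-F]{4})|\\[\\n]', repl, text)

tricky = r'\\u1234'   # raw: two backslashes, then u1234
print(ascii(two_pass(tricky)))
print(ascii(one_pass(tricky)))
```

For the doubled backslash, the two-pass version ends up expanding
\u1234 because the first pass leaves a lone backslash in front of the
u, while the one-pass version consumes \\ as an escaped backslash and
leaves u1234 alone. On input such as abc\u1234\n both agree.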
 
> > Examples:
> >
> > u'abc'          -> U+0061 U+0062 U+0063
> > u'\u1234'       -> U+1234
> > u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+05c
> 
> The last example is damaged (U+05c isn't legit).  Other than that, these
> look the same as before.

Corrected; thanks.
 
> > Now how should we define ur"abc\u1234\n"  ... ?
> 
> If strings carried an encoding tag with them, the obvious answer is that
> this acts exactly like r"abc\u1234\n" acts today except gets a
> "unicode-escaped" encoding tag instead of a "[whatever the default is
> today]" encoding tag.
> 
> If strings don't carry an encoding tag with them, you're in a bit of a
> pickle:  you'll have to convert it to a regular string or a Unicode string,
> but in either case have no way to communicate that it may need further
> processing; i.e., no way to distinguish it from a regular or Unicode string
> produced by any other mechanism.  The code I posted yesterday remains my
> best answer to that unpleasant puzzle (i.e., produce a Unicode string,
> fiddling with backslashes just enough to get the \u escapes expanded, in the
> same way Java's (conceptual) preprocessor does it).

They don't have such tags... so I guess we're in trouble ;-)

I guess to make ur"" have a meaning at all, we'd need to go
the Java preprocessor way here, i.e. scan the string *only*
for \uXXXX sequences, decode these and convert the rest as-is
to Unicode ordinals.

Would that be OK?
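
A minimal sketch of that idea, with a hypothetical function name: it
looks only for \uXXXX and passes everything else through untouched. It
does not try to settle how a \u preceded by another backslash should be
treated (Java counts the preceding backslashes for that case).

```python
import re

_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_raw_unicode(text):
    # Hypothetical model of ur"": expand \uXXXX only; every other
    # character, including other backslash sequences, is kept as-is.
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

result = decode_raw_unicode(r'abc\u1234\n')
print(' '.join('U+%04X' % ord(c) for c in result))
```

Here ur"abc\u1234\n" would come out as a, b, c, U+1234, a literal
backslash and the letter n, i.e. the \n is not turned into a newline.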

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/