UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)

Guido van Rossum guido@CNRI.Reston.VA.US
Thu, 18 Nov 1999 12:11:51 -0500


> > > Now how should we define ur"abc\u1234\n"  ... ?
> > 
> > If strings carried an encoding tag with them, the obvious answer is that
> > this acts exactly like r"abc\u1234\n" acts today except gets a
> > "unicode-escaped" encoding tag instead of a "[whatever the default is
> > today]" encoding tag.
> > 
> > If strings don't carry an encoding tag with them, you're in a bit of a
> > pickle:  you'll have to convert it to a regular string or a Unicode string,
> > but in either case have no way to communicate that it may need further
> > processing; i.e., no way to distinguish it from a regular or Unicode string
> > produced by any other mechanism.  The code I posted yesterday remains my
> > best answer to that unpleasant puzzle (i.e., produce a Unicode string,
> > fiddling with backslashes just enough to get the \u escapes expanded, in the
> > same way Java's (conceptual) preprocessor does it).
> 
> They don't have such tags... so I guess we're in trouble ;-)
> 
> I guess to make ur"" have a meaning at all, we'd need to go
> the Java preprocessor way here, i.e. scan the string *only*
> for \uXXXX sequences, decode these and convert the rest as-is
> to Unicode ordinals.
> 
> Would that be ok ?

Read Tim's code (posted about 40 messages ago in this list).

Like Java, it interprets \u.... when the number of backslashes is odd,
but not when it's even.  So \\u.... returns exactly that, while
\\\u.... returns two backslashes and a Unicode character.

This is nice and can be done regardless of whether we are going to
interpret other \ escapes or not.
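
To make the odd/even rule concrete, here is a minimal sketch in
present-day Python.  It is not Tim's posted code; the name
expand_u_escapes and the regular expression are assumptions made up for
illustration, and it only handles a single "u" followed by exactly four
hex digits.

```python
import re

# A run of one or more backslashes, then 'u', then exactly four hex digits.
# (Hypothetical helper pattern, not taken from the original thread.)
_U_ESCAPE = re.compile(r'(\\+)u([0-9a-fA-F]{4})')


def expand_u_escapes(raw):
    """Expand \\uXXXX only where the run of backslashes before 'u' is odd.

    Every other character, including all other backslash sequences,
    is passed through unchanged.
    """
    def replace(match):
        slashes, hex_digits = match.group(1), match.group(2)
        if len(slashes) % 2 == 0:
            # Even run: the backslashes pair up, so this is not an escape.
            return match.group(0)
        # Odd run: the last backslash belongs to the escape; keep the rest.
        return slashes[:-1] + chr(int(hex_digits, 16))

    return _U_ESCAPE.sub(replace, raw)
```

With this sketch, one backslash before u1234 gives the single character
U+1234, two backslashes leave the text exactly as written, and three
give two backslashes followed by U+1234.  A trailing \n stays as a
backslash and an "n", since no other escapes are touched.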

--Guido van Rossum (home page: http://www.python.org/~guido/)