UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)

Tim Peters tim_one@email.msn.com
Thu, 11 Nov 1999 01:25:27 -0500


[/F, dripping with code]
> ...
> Note that the 'u' must be followed by four hexadecimal digits.  If
> fewer digits are given, the sequence is left in the resulting string
> exactly as given.

Yuck -- don't let a probable error pass without comment.  "must be" == "must
be"!
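
For comparison, here is a minimal sketch of how present-day Python 3 treats a
short \u escape.  It shows the modern stdlib "unicode_escape" codec, not
necessarily what was proposed in this thread, and the helper name
decode_escapes is made up for illustration:

```python
def decode_escapes(raw: bytes) -> str:
    """Expand backslash escapes using the stdlib 'unicode_escape' codec."""
    return raw.decode("unicode_escape")


# Four hex digits: the escape expands to a single code point.
euro = decode_escapes(b"\\u20ac")
print(len(euro), hex(ord(euro)))

# Three hex digits: the codec raises instead of passing the text through.
try:
    decode_escapes(b"\\u20a")
    truncated_ok = True
except UnicodeDecodeError as exc:
    truncated_ok = False
    print("rejected:", exc.reason)
```

In other words, the modern codec takes the "must be" literally and refuses the
truncated form rather than leaving it in the string.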

[moving backwards]
> \uxxxx -- Unicode character with hexadecimal value xxxx.  The
> character is stored using UTF-8 encoding, which means that this
> sequence can result in up to three encoded characters.

The code is fine, but I've gotten confused about what the intent is now.
Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8
literals, but now he's got Unicode-escaped literals instead -- and you favor
an internal 2-byte-per-char Unicode storage format.  In that combination of
worlds, is there any use in the *language* (as opposed to in a runtime
module) for \uxxxx -> UTF-8 conversion?
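
To make the size difference concrete, the sketch below encodes three sample
characters both ways, using modern Python 3 codecs.  The "utf-16-be" codec
stands in here for a 2-byte-per-char internal format; that is an assumption
for illustration, not a claim about the storage format under discussion:

```python
samples = ["\u0041", "\u00e0", "\u20ac"]  # 'A', a-grave, euro sign

for ch in samples:
    utf8 = ch.encode("utf-8")
    two_byte = ch.encode("utf-16-be")  # always 2 bytes for these code points
    print(f"U+{ord(ch):04X}  utf-8={utf8.hex()} ({len(utf8)} bytes)  "
          f"2-byte={two_byte.hex()}")

# UTF-8 needs 1, 2 and 3 bytes respectively for these three characters.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
two_byte_lengths = [len(ch.encode("utf-16-be")) for ch in samples]
```

A single \uxxxx escape therefore becomes one to three bytes under UTF-8, but
always exactly one 2-byte unit in the fixed-width form.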

And MAL, if you're listening, I'm not clear on what a Unicode-escaped
literal means.  When you had UTF-8 literals, the meaning of something like

    u"a\340\341"

was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals
were just a way of specifying a byte stream.  As a Unicode-escaped string, I
assume the "a" maps to the Unicode "a", but what of the rest?  Are the octal
escapes to be taken as two separate Latin-1 characters (in their role as a
Unicode subset), or as an especially clumsy way to specify a single 16-bit
Unicode character?  I'm afraid I'd vote for the former.  Same issue wrt \x
escapes.
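
For what it's worth, present-day Python 3 reads such escapes the former way.
The sketch below checks that against the modern interpreter (not the proposal
in this thread).  It also shows that these particular three bytes do not
decode as UTF-8, so the byte-stream reading works only as "some bytes", not
as well-formed UTF-8 text:

```python
text = "a\340\341"   # str literal with octal escapes
data = b"a\340\341"  # the same escapes in a bytes literal

# Each octal escape becomes one code point in the Latin-1 range.
code_points = [hex(ord(c)) for c in text]
print(code_points)

# The str literal matches a Latin-1 decoding of the three bytes.
same_as_latin1 = (text == data.decode("latin-1"))

# 0xE0 followed by 0xE1 is not a well-formed UTF-8 sequence.
try:
    data.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```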

One other issue:  are there "raw" Unicode strings too, as in ur"\u20ac"?
There probably should be; and while Guido will hate this, a ur string should
probably *not* leave \uxxxx escapes untouched.  Nasties like this are why
Java defines \uxxxx expansion as occurring in a preprocessing step.
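
The behaviour suggested here (expand \uxxxx, leave every other backslash
alone) can be tried out in modern Python 3 with the stdlib
"raw_unicode_escape" codec.  This is only a sketch of the idea: Python 3's own
r"..." literals do not expand \u escapes, and the ur"..." prefix is not
accepted there at all:

```python
raw = r"\u20ac\n"  # Python 3 raw string: nothing is expanded
print(len(raw))    # counts every backslash, letter and digit separately

# raw_unicode_escape expands \uXXXX but leaves other backslashes untouched.
expanded = raw.encode("ascii").decode("raw_unicode_escape")
print(len(expanded), [hex(ord(c)) for c in expanded])
```

After decoding, the euro sign is a single character while the trailing
backslash-n pair stays as two characters.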

BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or
is \uxxxx simply not allowed in a non-Unicode string?  disallowing it is what
I would do ...).