UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)

Tim Peters tim_one@email.msn.com
Fri, 12 Nov 1999 00:18:09 -0500


[MAL]
> ...
> The conversion goes as follows:
> · for single characters (and this includes all \XXX sequences
>   except \uXXXX), take the ordinal and interpret it as Unicode
>   ordinal
> · for \uXXXX sequences, insert the Unicode character
>   with ordinal 0xXXXX instead

Perfect!
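
For illustration only, here is a minimal sketch of those two rules in
present-day Python 3.  The helper name convert_literal is hypothetical
and is not part of the proposal.  The sketch assumes the literal body
arrives as raw bytes with any other \XXX escapes already resolved, and
it does not handle an escaped backslash in front of a "u".

```python
import re

# Matches a literal backslash-u followed by exactly four hex digits.
_U_ESCAPE = re.compile(rb"\\u([0-9a-fA-F]{4})")

def convert_literal(raw: bytes) -> str:
    """Hypothetical helper: apply the two conversion rules to a literal body."""
    out = []
    pos = 0
    for m in _U_ESCAPE.finditer(raw):
        # Rule 1: each single character's ordinal is taken as a Unicode ordinal.
        out.extend(chr(b) for b in raw[pos:m.start()])
        # Rule 2: \uXXXX becomes the Unicode character with ordinal 0xXXXX.
        out.append(chr(int(m.group(1), 16)))
        pos = m.end()
    # Trailing single characters after the last \uXXXX sequence.
    out.extend(chr(b) for b in raw[pos:])
    return "".join(out)

result = convert_literal(b"a\xe9\\u20ac")
```

With this input, result holds three characters: "a", the character with
ordinal 0xE9, and the character with ordinal 0x20AC.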

[about "raw" Unicode strings]
> ...
> Not sure whether we really need to make this even more complicated...
> The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or
> filenames won't hurt much in the context of those \uXXXX monsters :-)

Alas, this won't stand over the long term.  Eventually people will write
Python using nothing but Unicode strings -- "regular strings" will
eventually become a backward compatibility headache <0.7 wink>.  IOW,
Unicode regexps and Unicode docstrings and Unicode formatting ops ...
nothing will escape.  Nor should it.

I don't think it all needs to be done at once, though -- existing languages
usually take years to graft in gimmicks to cover all the fine points.  So,
happy to let raw Unicode strings pass for now, as a relatively minor point,
but without agreeing it can be ignored forever.

> ...
> BTW, if you want to type in UTF-8 strings and have them converted
> to Unicode, you can use the standard:
>
> u = unicode('...string with UTF-8 encoded characters...','utf-8')

That's what I figured, and thanks for the confirmation.
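
As a rough sketch of the same idea in present-day Python 3 (an
assumption on my part, since the unicode() builtin discussed above does
not exist there), the str constructor accepts bytes plus an encoding
name.  The sample bytes are made up for the example.

```python
# UTF-8 encoded bytes: "caf" followed by the two-byte encoding of U+00E9.
utf8_bytes = b"caf\xc3\xa9"

# Decode the UTF-8 bytes into a Unicode string.
u = str(utf8_bytes, "utf-8")

# bytes.decode is an equivalent spelling of the same conversion.
u_alt = utf8_bytes.decode("utf-8")
```

Here the five input bytes decode to a four-character string.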