Unicode program representation

François Pinard pinard at iro.umontreal.ca
Sun Apr 2 23:44:58 EDT 2000


"Neil Hodgson" <neilh at hare.net.au> writes:

> su = u"hi ?"

> I think this should be changed to interpreting the literal as a UTF-8
> literal.  The advantage here is that non-roman string literals become
> a natural part of the language.

It might not be convenient to expect all sources to be expressed in
UTF-8, which would be a consequence of your suggestion.  A lot of people
use other representations, and Python 1.6 is doing well, in my opinion.
Although it _does_ favour UTF-8 as a representation, it does not force it.

Instead of writing:

   u"hi ?"

which is indeed a way to write Unicode strings using 7 bits, or even 8 bits
in a Latin-1 environment, you might write:

   unicode("hi ?")

(without the `u' prefix) to trigger a UTF-8 to Unicode conversion.
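
As a rough illustration of why the source encoding matters here, the
sketch below decodes the same bytes once as UTF-8 and once as Latin-1.
It assumes a present-day Python 3 interpreter, where `bytes.decode'
roughly stands in for the 1.6 `unicode' function; the sample word and
the variable names are assumptions made for the example.

```python
# The UTF-8 byte sequence for the word "café" (5 bytes, 4 characters).
raw = b"caf\xc3\xa9"

# Decoding as UTF-8 combines the last two bytes into one character.
as_utf8 = raw.decode("utf-8")

# Decoding as Latin-1 maps every byte to its own character instead,
# so the result is one character longer and no longer reads "café".
as_latin1 = raw.decode("latin-1")

print(len(as_utf8), len(as_latin1))
```

The same bytes therefore yield different Unicode strings depending on
which encoding the conversion assumes.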

> Saving the scripts as UCS-2 shows that the interpreter is unable to
> deal with UCS-2 scripts, which is what I expected. I think very few
> people will be creating script files in UCS-2, instead preferring to
> keep source code in UTF-8 files.

We might guess right now that a writing style will soon develop in which
Python sources can be recoded between charsets (or encodings, to use
the Python terminology) by changing a single variable in each module.
The `unicode' function, for example, accepts a second argument stating
the encoding, when it is not UTF-8.  If that argument is always given,
and always holds the same module-wide variable, such recoding is easily
achieved.
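
The module-wide variable idea might look something like the sketch
below.  It is written for a present-day Python 3 interpreter rather
than 1.6, so `bytes.decode' plays the part of the two-argument
`unicode' call; the names SOURCE_ENCODING and `text' are hypothetical,
chosen only for this example.

```python
# Hypothetical module-wide setting: the one variable to change when the
# byte strings in this module are recoded to another charset.
SOURCE_ENCODING = "latin-1"


def text(raw):
    """Decode a byte string using the module-wide encoding."""
    return raw.decode(SOURCE_ENCODING)


# Every string goes through the same helper, so no call site names an
# encoding directly.  0xE9 is "é" in Latin-1.
greeting = text(b"caf\xe9")
print(greeting)
```

Recoding the module to UTF-8 would then mean converting its byte
strings and setting SOURCE_ENCODING to "utf-8", with no other change.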

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard





