unicode experiments + questions

Wed Mar 27 18:28:24 EST 2002

"Irmen de Jong" <irmen at NOSPAMREMOVETHISxs4all.nl> writes:

> First, what is the best way to enter international and/or unicode characters
> in your Python source? I can't just type the Euro sign and trust that it
> is read as a Euro sign on another platform, because how does Python know
> the character encoding of the source file!

For Python 2.2 and earlier, I recommend maintaining the file in UTF-8
and then writing

msg = unicode("<utf-8 string containing euro sign>", "utf-8")

To view this source code properly, you need a UTF-8 editor; the same
approach would work with any other encoding, as long as the encoding
name passed to unicode() matches the encoding the file is saved in.
Python will treat this code identically on all platforms.
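
As a rough illustration of the decoding step, here is a sketch written
for a current Python 3 interpreter (an assumption; the unicode() call
above is the Python 2 spelling). The UTF-8 bytes of the euro sign are
spelled out as escapes so the snippet does not depend on any editor's
encoding:

```python
# Sketch for Python 3 (assumption). In Python 2.2 the decoding call
# would be unicode(raw, "utf-8") rather than raw.decode("utf-8").
raw = b"\xe2\x82\xac"        # the three UTF-8 bytes of U+20AC EURO SIGN
msg = raw.decode("utf-8")    # decode the bytes into a one-character string

print(len(raw), len(msg))    # byte count versus character count
print(hex(ord(msg)))         # code point of the decoded character
```

The point of the sketch is only that the byte string and the decoded
string are different objects: three bytes on disk, one character in
memory.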

If you want the code to display properly in all editors, you can write

msg = u"\N{EURO SIGN}"
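
A small sketch of why this form is editor-independent: the source line
contains only ASCII characters, yet it denotes the same string as the
numeric escape. It is written for Python 3 (an assumption); in
Python 2 each literal needs the u prefix.

```python
# Python 3 sketch (assumption); Python 2 would write u"\N{EURO SIGN}".
by_name = "\N{EURO SIGN}"    # symbolic character name
by_number = "\u20ac"         # numeric escape for the same code point

print(by_name == by_number)  # both literals denote the same string
print(ascii(by_name))        # ASCII-only representation of the string
```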

> So I'm using the unicode escape char syntax, but that is cumbersome
> (where do I look up all my special characters?) and hard on the eyes.

Numeric values are indeed difficult to read; I recommend using the
symbolic names instead. On Windows, you can find those using
charmap.exe; on Linux, /usr/share/i18n/charmaps/UTF-8{.gz} lists the
character names. If neither is available, refer to unicode.org.
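
Another possible source for the names, not mentioned above, is
Python's own unicodedata module. The following is a sketch for
Python 3 (an assumption); the set of known names depends on the
Unicode version bundled with the interpreter.

```python
import unicodedata

# Name -> character; lookup() raises KeyError for an unknown name.
euro = unicodedata.lookup("EURO SIGN")

# Character -> name; name() raises ValueError for unnamed characters.
print(unicodedata.name(euro))
print(unicodedata.name("\u00e9"))
```

This gives the exact spelling needed inside \N{...} without leaving
the interpreter.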

> I also have the following question:
> what exactly happens when I type  "print u" in Python, where u
> is a unicode string? 

It computes str(u), which in turn invokes
u.encode(sys.getdefaultencoding()).
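
A consequence worth spelling out: if the default encoding is ASCII,
as it normally was in Python 2, that implicit encode step fails for
the euro sign. The sketch below emulates the step by hand in Python 3
(an assumption; Python 3's print works differently and uses the
encoding of sys.stdout). The helper encode_for_print is hypothetical
and exists only for this illustration.

```python
# Sketch: emulate the implicit encode step explicitly.
# "ascii" stands in for the usual Python 2 default encoding
# (assumption); Python 3 reports "utf-8" from sys.getdefaultencoding().
u = "\N{EURO SIGN}"

def encode_for_print(text, encoding):
    """Hypothetical helper: return the encoded bytes, or None on failure."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

print(encode_for_print(u, "ascii"))    # ASCII has no euro sign
print(encode_for_print(u, "cp1252"))   # a single byte in cp1252
print(encode_for_print(u, "utf-8"))    # three bytes in UTF-8
```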

> I'm on Win2000, so when I type
> >>> print e.encode('cp1252')
> I get the Euro symbol. 

What do you mean by that statement? Where do you "get" it? In the
cmd.exe window? Unlikely, since that window uses the OEM encoding.
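
To make the mismatch concrete, here is a sketch that assumes the
console's OEM code page is cp437 or cp850 (common on US and Western
European installations, but an assumption here; check with chcp).
The byte that cp1252 uses for the euro sign maps to a different
character in those code pages. Written for Python 3.

```python
euro = "\N{EURO SIGN}"
ansi_bytes = euro.encode("cp1252")   # the single byte cp1252 uses

for oem in ("cp437", "cp850"):       # assumed OEM code pages
    # What a console using this code page would display for that byte.
    shown = ansi_bytes.decode(oem)
    print(oem, "displays", ascii(shown))
    try:
        euro.encode(oem)
    except UnicodeEncodeError:
        # Neither code page contains a euro sign at all.
        print(oem, "cannot encode the euro sign")
```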

Regards,
Martin