unicode mystery

John Machin sjmachin at lexicon.net
Tue Jan 11 15:26:48 EST 2005


Sean McIlroy wrote:
> I recently found out that unicode("\347", "iso-8859-1") is the
> lowercase c-with-cedilla, so I set out to round up the unicode numbers
> of the extra characters you need for French, and I found them all just
> fine EXCEPT for the o-e ligature (oeuvre, etc). I examined the unicode
> characters from 0 to 900 without finding it; then I looked at
> www.unicode.org but the numbers I got there (0152 and 0153) didn't
> work. Can anybody put a help on me wrt this? (Do I need to give a
> different value for the second parameter, maybe?)

Characters that are in iso-8859-1 are mapped directly into Unicode.
That is, the first 256 characters of Unicode are identical to
iso-8859-1.

Consider this:

>>> c_cedilla = unicode("\347", "iso-8859-1")
>>> c_cedilla
u'\xe7'
>>> ord(c_cedilla)
231
>>> ord("\347")
231

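What you did with c_cedilla "worked" because it was effectively doing
nothing. However, if you do unicode(char, encoding) where char is not in
encoding, it won't "work". The o-e ligature is not in iso-8859-1, and the
numbers 0152 and 0153 from www.unicode.org are hexadecimal (338 and 339
in decimal).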
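
A rough sketch of both points, written so that it should run on Python 2
and on Python 3.3 or later. It uses the .decode() method, which on a
Python 2 byte string does the same job as unicode(s, encoding). The
variable names are only illustrative.

```python
# Every iso-8859-1 byte decodes to the Unicode code point with the
# same number, which is why the c-cedilla conversion changed nothing.
latin1_identity = all(
    ord(bytes(bytearray([i])).decode("iso-8859-1")) == i
    for i in range(256)
)

# The o-e ligature lies outside that range, so iso-8859-1 has no byte
# for it and encoding it in iso-8859-1 fails.
oe_lower = u"\u0153"  # LATIN SMALL LIGATURE OE
try:
    oe_lower.encode("iso-8859-1")
    oe_in_latin1 = True
except UnicodeEncodeError:
    oe_in_latin1 = False

print(latin1_identity)
print(oe_in_latin1)
```

So changing the second parameter to unicode() is not what is needed
here; the character simply is not in that encoding.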

As John Lenton has pointed out, if you find a character in the Unicode
tables, you can just use it directly. There is no need in this
circumstance to use unicode().
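
For instance, here is a minimal sketch of using the code-chart numbers
directly, assuming a \u escape in a unicode literal is an acceptable way
to write the character (the names below are made up for illustration):

```python
# The code charts print code points in hexadecimal: 0152 and 0153.
upper_oe_codepoint = int("0152", 16)
lower_oe_codepoint = int("0153", 16)

# A \u escape in a unicode literal takes the same four hex digits.
upper_oe = u"\u0152"  # LATIN CAPITAL LIGATURE OE
lower_oe = u"\u0153"  # LATIN SMALL LIGATURE OE
word = lower_oe + u"uvre"

print(ord(upper_oe) == upper_oe_codepoint)
print(ord(lower_oe) == lower_oe_codepoint)
print(len(word))
```

Whether the ligature then displays correctly depends on the terminal or
font, which is a separate matter from the code point itself.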

HTH,
John



