ascii to latin1

richie at entrian.com
Tue May 9 16:55:50 EDT 2006


[Luis]
> The script converted the ÇÃ from the first line, but not the º from
> the second one.

That's because º, 0xba, MASCULINE ORDINAL INDICATOR is classed as a
letter and not a diacritic:

  http://www.fileformat.info/info/unicode/char/00ba/index.htm

You can't encode it in ascii because it's not an ascii character, and
the script doesn't remove it because it only removes diacritics.
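
To make that concrete, here is a minimal sketch of the usual
decompose-and-drop-combining-marks approach. The script under
discussion isn't shown in this message, so strip_diacritics() is a
hypothetical stand-in for it, and I'm assuming Python 3 string
semantics:

```python
import unicodedata

def strip_diacritics(text):
    # Hypothetical stand-in for the script discussed in this thread.
    # NFD splits a precomposed letter into base letter + combining marks;
    # the combining marks (nonzero combining class) are then dropped.
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

ordinal = '\u00ba'  # MASCULINE ORDINAL INDICATOR

print(unicodedata.name(ordinal))             # official character name
print(unicodedata.category(ordinal))         # a letter category, not a mark
print(strip_diacritics('\u00c7\u00c3'))      # C-cedilla, A-tilde: marks removed
print(ascii(strip_diacritics('5' + ordinal)))  # ordinal indicator survives
```

The ordinal indicator has no canonical decomposition and is not a
combining mark, so a filter like this passes it through untouched.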

I don't know what the best thing to do with it would be. Could you use
latin-1 as your base encoding and leave it in there?  I don't speak any
language that uses it, but I'd guess that anyone searching for e.g. 5º
(forgive me if I have the gender wrong 8-) would actually type 5º.
Are there any Italian/Spanish/Portuguese speakers here who can confirm
or deny that?

In the general case, you have to decide what happens to characters that
aren't diacritics and don't live in your base encoding - what happens
when a Chinese user searches for a Chinese character?  Probably you
should just encode(base_encoding, 'ignore'), which silently drops
anything the base encoding can't represent.
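
A rough sketch of what that does, assuming Python 3 strings; the
to_base() helper and the sample strings are made up for illustration:

```python
def to_base(text, base_encoding='latin-1'):
    # 'ignore' drops every character the target encoding cannot represent
    # instead of raising UnicodeEncodeError.
    return text.encode(base_encoding, 'ignore')

ordinal = '\u00ba'            # MASCULINE ORDINAL INDICATOR
chinese = '\u4e2d\u6587'      # two Chinese characters

print(to_base('5' + ordinal))                   # latin-1 keeps it as byte 0xba
print(to_base('5' + ordinal, 'ascii'))          # ascii drops it
print(to_base(chinese + ' 5' + ordinal))        # Chinese characters dropped
```

Note that a search term made up entirely of characters outside the base
encoding comes out empty, so you may want to handle that case
separately.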

-- 
Richie Hindle
richie at entrian.com




More information about the Python-list mailing list