Simple character translation problem

Martin von Loewis loewis at informatik.hu-berlin.de
Fri Sep 21 11:08:29 EDT 2001


David Eppstein <eppstein at ics.uci.edu> writes:

> I have: user input text, in Mac character set encoding
> I want: ASCII with HTML-entities coding the accented characters.
> 
> E.g. "café" should become "caf&eacute;".
> Is there code already in Python to do this easily?

First, you should convert the string into a Unicode string, using the
proper codec. Then, there is an easy approach and a difficult one. The
easy one is to convert all non-ASCII characters (i.e. those with
ordinals > 127) into numeric character references, using the
&#digits; notation.
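
A minimal sketch of the easy approach, assuming a current Python 3
interpreter rather than the Python of this thread. The function name
to_numeric_entities is made up for illustration, and "mac_roman" is
assumed to be the right codec for the input. The "xmlcharrefreplace"
error handler of the standard codecs does the &#digits; replacement:

```python
def to_numeric_entities(raw: bytes, encoding: str = "mac_roman") -> str:
    """Decode raw bytes, then replace every non-ASCII character
    with a numeric character reference (&#digits;)."""
    # Step 1: bytes in the source character set -> Unicode string.
    # "mac_roman" is an assumption about the input encoding.
    text = raw.decode(encoding)
    # Step 2: encode to ASCII; characters that do not fit are written
    # as &#digits; by the xmlcharrefreplace error handler.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# 0x8E is e-acute in the Mac Roman character set.
print(to_numeric_entities(b"caf\x8e"))
```

Note that this leaves &, < and > untouched; if the text is not already
HTML, those need escaping separately.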

Or, you could try to use named entities where possible. For that,
please have a look at htmlentitydefs.entitydefs. Using that is not
straightforward: you have to invert the dictionary, and you have to
convert the keys of the inverted dictionary into Unicode keys. For
the keys that are single-character strings (e.g. '\306'), you can use
the Unicode character with the same ordinal. For characters above 255,
the key is itself a character reference (e.g. '&#913;'), which you
have to convert into the corresponding Unicode character.
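
A rough sketch of the named-entity approach, again assuming Python 3,
where htmlentitydefs lives on as html.entities. That module provides
codepoint2name, which is already the inverted mapping (Unicode ordinal
to entity name), so the manual inversion is not needed there. The
function name to_named_entities is hypothetical; characters without a
named entity fall back to the &#digits; notation:

```python
from html.entities import codepoint2name


def to_named_entities(text: str) -> str:
    """Replace non-ASCII characters with named entities where one
    exists, and with numeric character references otherwise."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # ASCII passes through unchanged (including &, <, >).
            out.append(ch)
        elif cp in codepoint2name:
            # A named entity is known for this ordinal.
            out.append("&%s;" % codepoint2name[cp])
        else:
            # No named entity: fall back to &#digits;.
            out.append("&#%d;" % cp)
    return "".join(out)


print(to_named_entities("caf\u00e9"))
```

The input here is already a Unicode string, so decoding from the Mac
character set has to happen first.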

If you can come up with patches to htmlentitydefs that make use of
Unicode, please do so and submit them to sf.net/projects/python.

Regards,
Martin


