[OT] does the charset lie?

David Goodger goodger at python.org
Sun May 2 13:33:41 EDT 2004


Skip Montanaro wrote:
> OTOH, this means if I need the raw
> content of the page (after expanding any entities), I need to do something
> like (assuming the raw bytes are already in data):
> 
>     data = unicode(data, "iso-8859-1").encode("utf-8")
>     data = map_entities_to_utf_8(data)
>     data = unicode(data, "utf-8")

Or, even simpler, skip the intermediate UTF-8 round trip and expand the
entities directly in Unicode:

     data = unicode(data, "iso-8859-1")
     data = map_entities_to_unicode(data)

map_entities_to_unicode() could be built on htmlentitydefs.name2codepoint
from the stdlib, which maps entity names to Unicode code points.  Someone
must have written this already.
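
For illustration, here is a rough sketch of what such a helper might look
like. It is not an existing stdlib function: the regular expression and the
decision to leave unknown entities untouched are my own assumptions. It is
written for Python 3, where htmlentitydefs is named html.entities and
unicode(data, "iso-8859-1") becomes data.decode("iso-8859-1").

```python
import re
from html.entities import name2codepoint  # htmlentitydefs in Python 2

# Matches &name;, &#123; and &#x7B; style references (assumed pattern).
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def map_entities_to_unicode(text):
    """Replace HTML entity references in a Unicode string with characters."""

    def replace(match):
        ref = match.group(1)
        try:
            if ref[:2] in ("#x", "#X"):
                return chr(int(ref[2:], 16))   # hexadecimal numeric reference
            if ref.startswith("#"):
                return chr(int(ref[1:]))       # decimal numeric reference
            return chr(name2codepoint[ref])    # named entity, e.g. "eacute"
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range code point: keep the text as-is.
            return match.group(0)

    return _ENTITY_RE.sub(replace, text)


# Example input: Latin-1 bytes containing a mix of entity styles.
raw = b"caf\xe9 &amp; cr&egrave;me &#8364;5"
data = raw.decode("iso-8859-1")
data = map_entities_to_unicode(data)
```

Decoding first means the entity expansion only ever sees Unicode text, so
the result does not depend on the page's original charset.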

-- David Goodger



