[OT] does the charset lie?
David Goodger
goodger at python.org
Sun May 2 13:33:41 EDT 2004
Skip Montanaro wrote:
> OTOH, this means if I need the raw
> content of the page (after expanding any entities), I need to do something
> like (assuming the raw bytes are already in data):
>
> data = unicode(data, "iso-8859-1").encode("utf-8")
> data = map_entities_to_utf_8(data)
> data = unicode(data, "utf-8")
Or, even simpler, skip the intermediate step:
data = unicode(data, "iso-8859-1")
data = map_entities_to_unicode(data)
map_entities_to_unicode() could use htmlentitydefs.name2codepoint from
the stdlib. This must have already been done somewhere.
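For illustration, here is a rough sketch of what such a map_entities_to_unicode() might look like. It is not taken from any existing library. The regular expression, the handling of numeric references (&#169; and &#x20AC;), and the choice to leave unknown entities untouched are all assumptions made for this example. The try/except imports are there only so that the sketch also runs on later Python versions, where htmlentitydefs was renamed html.entities and unichr() became chr().

```python
import re

try:
    from htmlentitydefs import name2codepoint      # Python 2 module name
except ImportError:
    from html.entities import name2codepoint       # same table under its Python 3 name

try:
    unichr                                          # exists on Python 2 only
except NameError:
    unichr = chr                                    # Python 3: chr() returns a Unicode character

# Matches &name;, &#123; and &#x1F; style references (an assumed, simplified grammar).
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def _replace_entity(match):
    """Return the character for one entity, or the original text if it is unknown."""
    body = match.group(1)
    try:
        if body[:2] in ("#x", "#X"):
            return unichr(int(body[2:], 16))        # hexadecimal character reference
        if body[0] == "#":
            return unichr(int(body[1:]))            # decimal character reference
        return unichr(name2codepoint[body])         # named entity, looked up in the stdlib table
    except (KeyError, ValueError, OverflowError):
        return match.group(0)                       # unknown or out of range: leave as-is


def map_entities_to_unicode(text):
    """Expand HTML entities in an already-decoded Unicode string."""
    return _ENTITY_RE.sub(_replace_entity, text)


# Decode the raw bytes first, then expand the entities on the Unicode string.
raw = b"caf\xe9 &amp; cr&egrave;me &#169; &#x20AC; &bogus;"
data = raw.decode("iso-8859-1")
data = map_entities_to_unicode(data)
```

Decoding first means the entity mapping only ever has to produce Unicode characters, so there is no second encoding to keep track of.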
-- David Goodger