[OT] does the charset lie?

Robert Brewer fumanchu at amor.org
Sun May 2 17:14:03 EDT 2004


David Goodger wrote:
> Skip Montanaro wrote:
> > OTOH, this means if I need the raw content of the page (after
> > expanding any entities), I need to do something like (assuming the
> > raw bytes are already in data):
> > 
> >     data = unicode(data, "iso-8859-1").encode("utf-8")
> >     data = map_entities_to_utf_8(data)
> >     data = unicode(data, "utf-8")
> 
> Or, even simpler, skip the intermediate step:
> 
>      data = unicode(data, "iso-8859-1")
>      data = map_entities_to_unicode(data)
> 
> map_entities_to_unicode() could use htmlentitydefs.name2codepoint from
> the stdlib.  This must have already been done somewhere.

On average, I'd guess at least once per Python web app. ;)
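
For reference, here is a minimal sketch of the two-step approach David
describes. It is written for Python 3, where htmlentitydefs lives at
html.entities and unicode(data, enc) is spelled data.decode(enc). The
function name map_entities_to_unicode is only the hypothetical one from
the quoted message, and the regex and error handling are my own
assumptions, not anyone's canonical implementation.

```python
import re
from html.entities import name2codepoint

# Matches named (&eacute;), decimal (&#233;) and hex (&#xE9;) references.
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def map_entities_to_unicode(text):
    """Replace HTML character references in text with the characters they name."""
    def replace(match):
        ref = match.group(1)
        try:
            if ref[:2] in ("#x", "#X"):
                return chr(int(ref[2:], 16))
            if ref.startswith("#"):
                return chr(int(ref[1:]))
            return chr(name2codepoint[ref])
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range number: leave the reference as-is.
            return match.group(0)
    return _ENTITY_RE.sub(replace, text)


# Assumed sample input: raw bytes that really are ISO-8859-1.
raw = b"caf\xe9 &amp; cr&egrave;me &#233; &#xE9; &bogus;"
data = raw.decode("iso-8859-1")       # bytes -> text, no UTF-8 round trip
data = map_entities_to_unicode(data)  # entities -> characters
```

Decoding first means the entity step only ever sees text, so there is no
need for the intermediate UTF-8 encoding. In Python 3.4 and later,
html.unescape() in the stdlib covers the same ground.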


FuManChu



