Web page special characters encoding

mattia gervaz at gmail.com
Sat Jul 10 17:03:41 EDT 2010


On Sat, 10 Jul 2010 18:09:12 +0100, MRAB wrote:

> mattia wrote:
>> Hi all, I'm using py3k and the urllib package to download web pages.
>> Can you suggest a package that can translate HTML character entities
>> like "&egrave;", "&ograve;", "&eacute;" into the corresponding
>> characters?
>> 
> import re
> from html.entities import entitydefs
> 
> # The downloaded web page will be bytes, so decode it to a string.
> webpage = downloaded_page.decode("iso-8859-1")
> 
> # Then decode the named HTML entities; unknown names are left as-is
> # instead of raising KeyError.
> webpage = re.sub(r"&(\w+);",
>                  lambda m: entitydefs.get(m.group(1), m.group(0)),
>                  webpage)

Thanks, very useful, didn't know about the entitydefs dictionary.
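
The regular expression above only matches named entities such as
"&egrave;"; numeric character references such as "&#232;" or "&#xe8;"
pass through unchanged. A minimal sketch of one way to cover both,
assuming Python 3.4 or later (where html.unescape is available, so it
was not an option when this thread was written). The function name
decode_page and the default encoding are illustrative assumptions, not
part of the original exchange:

```python
import html


def decode_page(raw: bytes, encoding: str = "iso-8859-1") -> str:
    """Decode downloaded bytes, then replace HTML character references.

    `encoding` is an assumption; ideally it is taken from the HTTP
    Content-Type header or the page's <meta> declaration.
    """
    text = raw.decode(encoding)
    # html.unescape handles named, decimal, and hexadecimal references
    # and leaves unrecognized names untouched.
    return html.unescape(text)


if __name__ == "__main__":
    print(decode_page(b"perch&eacute; &#232; cos&#xec;"))
```

Note that decoding entities turns "&lt;" and "&amp;" into literal "<"
and "&", so this is best applied to extracted text rather than to a
whole page that will be parsed as HTML afterwards.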


