Web page special characters encoding

John Nagle nagle at animats.com
Sat Jul 10 17:48:49 EDT 2010


On 7/10/2010 2:03 PM, mattia wrote:
> On Sat, 10 Jul 2010 18:09:12 +0100, MRAB wrote:
>
>> mattia wrote:
>>> Hi all, I'm using py3k and the urllib package to download web pages.
>>> Can you suggest a package that can translate HTML character entities
>>> such as "&egrave;", "&ograve;", "&eacute;" into the corresponding
>>> correctly encoded characters?
>>>
>> import re
>> from html.entities import entitydefs
>>
>> # The downloaded web page will be bytes, so decode it to a string.
>> webpage = downloaded_page.decode("iso-8859-1")
>>
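>> # Then decode the named HTML entities, leaving unknown names untouched
>> # (a plain entitydefs[...] lookup raises KeyError on an unknown name).
>> webpage = re.sub(r"&(\w+);",
>>                  lambda m: entitydefs.get(m.group(1), m.group(0)),
>>                  webpage)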
>
> Thanks, very useful, didn't know about the entitydefs dictionary.

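    You also need to decode the HTML numeric character references
(the decimal "&#233;" and hexadecimal "&#xE9;" forms), which the
entitydefs lookup does not handle.  Expect that in real-world HTML,
out-of-range values will occasionally appear, so the conversion must
not assume every number is a valid code point.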
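
    As a rough illustration of that point, here is a minimal sketch
that handles named, decimal and hexadecimal references in one pass.
The function name decode_entities is made up for this example, and
leaving out-of-range or unknown references untouched is only one
possible policy (substituting U+FFFD is another).  Newer Python
releases (3.4 and later) also ship html.unescape() in the standard
library, which covers the same ground.

```python
import re
from html.entities import name2codepoint

# Matches &name;, &#233; and &#xE9; style references.
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);")


def decode_entities(text):
    """Replace named and numeric HTML character references in text.

    Hypothetical helper for illustration: references that cannot be
    decoded (unknown names, out-of-range numbers, surrogates) are left
    exactly as they appeared in the input.
    """
    def replace(match):
        ref = match.group(1)
        if ref.startswith("#"):
            try:
                if ref[1] in "xX":
                    codepoint = int(ref[2:], 16)
                else:
                    codepoint = int(ref[1:])
                # Surrogates are valid for chr() but are not characters.
                if 0xD800 <= codepoint <= 0xDFFF:
                    return match.group(0)
                return chr(codepoint)
            except (ValueError, OverflowError):
                # chr() rejects values above 0x10FFFF.
                return match.group(0)
        codepoint = name2codepoint.get(ref)
        if codepoint is None:
            return match.group(0)
        return chr(codepoint)

    return _ENTITY_RE.sub(replace, text)
```

    Calling decode_entities(webpage) after the bytes have been
decoded to a string would replace the separate entitydefs step.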

					John Nagle



