xHTML/XML to Unicode (and back)

Robin Haswell rob at digital-crocus.com
Tue Jan 24 09:34:34 EST 2006


On Tue, 24 Jan 2006 14:46:46 +0100, Fredrik Lundh wrote:

> Robin Haswell wrote:
> 
>> I'm currently screenscraping some Swedish site, and I need a method to
>> convert XML entities (&amp; etc, plus &#100; etc) to Unicode characters.
>> I'm sure one of python's myriad of XML processors can do this but I can't
>> find which one.
>>
>> Can anyone make any suggestions?
> 
> any decent html-aware screen scraper library should be able to do
> this for you.

I'm using BeautifulSoup and it appears that it doesn't. I'd also like to
know the answer to this for when I do screenscraping with regular
expressions :-)

Thanks

> 
> if you've already extracted the strings, the strip_html function on
> this page might be what you need:
> 
>     http://effbot.org/zone/re-sub.htm#strip-html
> 
> </F>
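
For the regular-expression case, here is a minimal sketch of the kind of
entity replacement being discussed. It is not the strip_html function from
the page above; the names unescape_entities and escape_non_ascii are made up
for illustration, and it assumes Python 3, where the entity table lives in
html.entities. Unknown or malformed references are left as they are.

```python
import re
from html.entities import name2codepoint

# Matches &#xHH; (hex), &#NNN; (decimal) and &name; references.
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);")


def unescape_entities(text):
    """Replace named and numeric character references with Unicode characters."""
    def fix(match):
        body = match.group(1)
        try:
            if body[:2] in ("#x", "#X"):
                return chr(int(body[2:], 16))   # hexadecimal reference
            if body.startswith("#"):
                return chr(int(body[1:]))       # decimal reference
            return chr(name2codepoint[body])    # named entity
        except (KeyError, ValueError, OverflowError):
            return match.group(0)               # unknown: leave untouched
    return _ENTITY_RE.sub(fix, text)


def escape_non_ascii(text):
    """Go back: turn non-ASCII characters into decimal character references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(unescape_entities("G&ouml;teborg &amp; Malm&#246;"))
    print(escape_non_ascii("Malmö"))
```

Note that escape_non_ascii only handles the non-ASCII characters; it does not
escape &, < or >. If all you need is the decoding direction, the standard
library's html.unescape (Python 3.4 and later) does the same job without a
hand-written regular expression.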



