html parser?

Laszlo Zsolt Nagy gandalf at designaproduct.biz
Wed Oct 19 05:56:14 EDT 2005


Thorsten Kampe wrote:

>* Christoph Söllner (2005-10-18 12:20 +0100)
>>right, that's what I was looking for. Thanks very much.
>
>For simple things like that "BeautifulSoup" might be overkill.
>
>import formatter, htmllib, urllib
>
>url = 'http://python.org'
>
>htmlp = htmllib.HTMLParser(formatter.NullFormatter())
>
The problem with HTMLParser is that it does not handle unclosed tags and/or
attributes given with invalid syntax.
Unfortunately, many sites on the internet serve malformed HTML pages. You
are right, BeautifulSoup is overkill
(it is rather slow), but I'm afraid it is the only fault-tolerant
solution.
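
A quick way to check how a parser copes with the kind of markup described
above is to feed it a deliberately broken snippet. The sketch below is only
an illustration, not the code from the quoted message: it assumes Python 3,
where htmllib no longer exists, and uses the standard library's
html.parser.HTMLParser instead. The names LinkCollector, collect_links and
the sample string are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(markup):
    """Parse markup and return every href found on an <a> tag."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


# Deliberately malformed input: <p> and <b> are never closed, one href
# value is unquoted, and the last <a> has no closing tag.
broken = (
    '<p><b>Intro <a href=about.html>About</a>'
    '<p><a href="http://python.org">Python'
)
links = collect_links(broken)
print(links)
```

If the links come back intact, the parser is tolerant enough for that
particular kind of damage; real pages can be broken in other ways, so it is
worth testing against samples from the sites you actually need to read.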

  Les



