urllib2.urlopen(url) pulling something other than HTML

Stefan Behnel stefan.behnel-n05pAM at web.de
Wed Aug 22 02:47:32 EDT 2007


Gabriel Genellina wrote:
> On 21 ago, 18:36, j... at pobox.com (John J. Lee) wrote:
>> Gabriel Genellina <gagsl-... at yahoo.com.ar> writes:
>>
>> [...]
>>> Don't even try to understand it - it's a mess. Use the HTMLParser
>>> module instead.
>> [...]
>>
>> Module sgmllib (and therefore module htmllib also) is more tolerant of
>> bad HTML than module HTMLParser.
> 
> I had the impression it was the opposite; anyway, neither of them can
> handle really bad html.
> I just don't *like* htmllib.HTMLParser - but that's only a matter of
> taste.

lxml.html handles bad HTML and it's a powerful tool that is very easy to use.
And if one day you have to deal with really, *really* broken tag soup, it also
comes with BeautifulSoup parser integration.
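
To give a rough idea, here is a minimal sketch of what that looks like. It
assumes the third-party lxml package is installed; the sample markup and the
helper names extract_links() and parse_with_soup() are made up for
illustration. In current lxml releases the BeautifulSoup integration lives in
lxml.html.soupparser and additionally needs BeautifulSoup installed.

```python
# Sketch only: requires the third-party lxml package.
import lxml.html


def extract_links(markup):
    """Parse possibly broken HTML and return (hrefs, visible text).

    extract_links is a hypothetical helper name, not part of lxml.
    """
    # fromstring() repairs unclosed tags and missing <html>/<body>
    # wrappers instead of raising on them.
    root = lxml.html.fromstring(markup)
    hrefs = [a.get("href") for a in root.iter("a")]
    return hrefs, root.text_content()


def parse_with_soup(markup):
    """Fallback for really broken tag soup (hypothetical helper).

    Imported lazily because it only works if BeautifulSoup is installed.
    """
    from lxml.html import soupparser
    return soupparser.fromstring(markup)


# Deliberately broken sample: unclosed <p>, <b> and <a>, no <html>/<body>.
broken = "<p>First paragraph<p>Second <b>bold <a href='/x'>link"
links, text = extract_links(broken)
print(links)
print(text)
```

The result is an ordinary element tree, so the usual iteration and XPath
calls work on it no matter how the input was mangled.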

Stefan


