beautifulsoup vs. tidy

Fredrik Lundh fredrik at pythonware.com
Sun Jul 2 03:22:09 EDT 2006


Ravi Teja wrote:

>> Of course, lxml should be able to do this kind of thing as well. I'd be
>> interested to know why this "is not a good idea", though.
> 
> No reason that you don't know already.
> 
> http://www.boddie.org.uk/python/HTML.html
> 
> "If the document text is well-formed XML, we could omit the html
> parameter or set it to have a false value."
> 
> XML parsers are not required to be forgiving to be regarded compliant.
> And much HTML out there is not well formed.

so?  once you run it through an HTML-aware parser, the *resulting* 
structure is well formed.

a site generator->converter->xpath approach is no less reliable than any 
other HTML-scraping approach.
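
as a rough illustration of that point (a minimal sketch, not anyone's actual converter): the standard library's html.parser can be wrapped in a forgiving tree builder, and the resulting ElementTree is well formed even when the input is tag soup. the names ForgivingTreeParser, html_to_tree, VOID and IMPLIED_END are made up for this example, the "document" root element is an arbitrary wrapper, and the recovery rules are far cruder than what tidy or BeautifulSoup do (nested lists, for instance, are not handled properly).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical, deliberately incomplete rule sets for this sketch.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}
IMPLIED_END = {"li", "p"}  # a new <li>/<p> closes a still-open one


class ForgivingTreeParser(HTMLParser):
    """Builds an ElementTree from tag soup, repairing unclosed tags."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")  # arbitrary wrapper element
        self.stack = [self.root]

    def _close_through(self, tag):
        # Pop open elements up to and including the nearest one named
        # `tag`; a stray end tag with no matching open element is ignored.
        if tag in [e.tag for e in self.stack[1:]]:
            while self.stack.pop().tag != tag:
                pass

    def handle_starttag(self, tag, attrs):
        if tag in IMPLIED_END:
            self._close_through(tag)
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        self._close_through(tag)

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a child element belongs to that child's tail.
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_tree(markup):
    parser = ForgivingTreeParser()
    parser.feed(markup)
    parser.close()  # flush buffered text; open elements stay in the tree
    return parser.root


messy = ('<html><body><ul><li><a href="/a">first<li><a href="/b">second</a>'
         '</ul><p>unclosed<br>text')

tree = html_to_tree(messy)

# The serialized result is accepted by a strict XML parser.
xml_text = ET.tostring(tree, encoding="unicode")
ET.fromstring(xml_text)

# ElementTree's XPath subset now works on the repaired structure.
hrefs = [a.get("href") for a in tree.findall(".//li/a")]
print(hrefs)  # ['/a', '/b']
```

feeding the same messy string straight to ET.fromstring raises a ParseError, which is the situation the quoted text describes; the difference is only in which parser sees the markup first.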

</F>



