beautifulsoup vs. tidy
Fredrik Lundh
fredrik at pythonware.com
Sun Jul 2 03:22:09 EDT 2006
Ravi Teja wrote:
>> Of course, lxml should be able to do this kind of thing as well. I'd be
>> interested to know why this "is not a good idea", though.
>
> No reason that you don't know already.
>
> http://www.boddie.org.uk/python/HTML.html
>
> "If the document text is well-formed XML, we could omit the html
> parameter or set it to have a false value."
>
> XML parsers are not required to be forgiving to be regarded compliant.
> And much HTML out there is not well formed.

so? once you run it through an HTML-aware parser, the *resulting*
structure is well formed.

a site generator->converter->xpath approach is no less reliable than any
other HTML-scraping approach.
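
a rough sketch of the converter->xpath half of that pipeline, using only the
standard library (html.parser plus ElementTree's limited XPath subset) as a
stand-in for tidy or lxml. the SketchTreeBuilder class, the VOID set and the
sample markup are made up for illustration, and the tag-closing rules are far
cruder than what a real HTML-aware parser applies:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical, incomplete list of elements that never take an end tag.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class SketchTreeBuilder(HTMLParser):
    """Illustrative helper: turns sloppy HTML into an ElementTree."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("root")   # synthetic wrapper element
        self.stack = [self.root]         # currently open elements

    def handle_starttag(self, tag, attrs):
        # Crude rule: a new <p> or <li> closes an open one of the same kind.
        if tag in ("p", "li") and self.stack[-1].tag == tag:
            self.stack.pop()
        elem = ET.SubElement(self.stack[-1], tag,
                             {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # stray end tags with no match are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        cur = self.stack[-1]
        if len(cur):                     # text after a child goes in its tail
            cur[-1].tail = (cur[-1].tail or "") + data
        else:
            cur.text = (cur.text or "") + data


# Sample input that is NOT well-formed XML: unclosed <li>, <br> and <p>,
# plus an unquoted attribute value.
html = '<ul><li><a href="/a">one</a><li><a href=/b>two</a><br></ul><p>unclosed'

parser = SketchTreeBuilder()
parser.feed(html)
parser.close()
root = parser.root

# The resulting tree can be queried with XPath-style expressions...
hrefs = [a.get("href") for a in root.findall(".//li/a")]

# ...and serialized to text that a strict XML parser accepts.
xml_text = ET.tostring(root, encoding="unicode")
reparsed = ET.fromstring(xml_text)
```

feeding the original string straight to ET.fromstring raises a ParseError;
the serialized tree does not, which is the whole point. in practice you would
let a real HTML parser do the repair work rather than hand-roll the rules.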
</F>