[python-uk] Favourite ways of scrubbing HTML/whitelisting specific HTML tags?

Jon Ribbens jon+python-uk at unequivocal.co.uk
Fri Feb 8 13:14:02 CET 2008


On Fri, Feb 08, 2008 at 09:01:06AM +0000, Andy Robinson wrote:
> FWIW, we parse tens of thousands of pages every week to let
> people republish content into nice PDFs.  Beautiful Soup was the only
> thing that made this sane, as many pages are not structured to be easy
> to parse.  Like you we found the network was the limit, and simply
> kicking off several scraping processes in parallel solved that (e.g.
> one run of a script parses hotels from A-F, the next from G-M and so
> on...). I can't imagine using anything else.
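
[Editorial sketch, not from the original thread: the alphabetical split
described above could look roughly like this. The A-F and G-M ranges come
from the quoted message; the remaining ranges, the function names, and the
placeholder scrape_bucket() are assumptions. Threads are used here only to
keep the example short; the quoted message describes separate processes.]

```python
from concurrent.futures import ThreadPoolExecutor

# A-F and G-M are from the quoted message; N-S and T-Z are assumed.
RANGES = [("A", "F"), ("G", "M"), ("N", "S"), ("T", "Z")]


def partition_by_initial(names, ranges=RANGES):
    """Group names into one bucket per letter range, keyed by (low, high)."""
    buckets = {r: [] for r in ranges}
    for name in names:
        initial = name[:1].upper()
        for low, high in ranges:
            if low <= initial <= high:
                buckets[(low, high)].append(name)
                break
        # Names that start with anything outside A-Z are skipped in this sketch.
    return buckets


def scrape_bucket(names):
    """Hypothetical stand-in for the real fetch-and-parse work."""
    return [name.lower() for name in names]


def scrape_all(names):
    """Run one worker per bucket so network waits overlap."""
    buckets = partition_by_initial(names)
    with ThreadPoolExecutor(max_workers=len(buckets)) as pool:
        results = pool.map(scrape_bucket, buckets.values())
    return dict(zip(buckets.keys(), results))
```

[Because the work is network-bound, the speed-up comes from overlapping
the waits rather than from extra CPU, which is why a handful of parallel
workers is usually enough.]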

We do HTML parsing all day every day, so I wrote a Python extension
module in C to do it. But we have very particular requirements:
we need not only to understand "real-life" HTML, but also to generate
detailed, precise diagnostics whenever the HTML is not correct
according to the spec. The C module is only 900 lines of code, though.
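
[Editorial sketch, not the C module described above: a rough idea of
"parse leniently but report problems", using Python's standard
html.parser. It only checks tag balance and reports line numbers; it
does not validate against the HTML spec. The class name, function name,
and message wording are all assumptions.]

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DiagnosticParser(HTMLParser):
    """Tolerant parser that records tag-balance problems with line numbers."""

    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, line) for each element still open
        self.diagnostics = []  # human-readable problem reports

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # e.g. the implicit end tag of <br/>
        line = self.getpos()[0]
        if tag not in [open_tag for open_tag, _ in self.stack]:
            self.diagnostics.append(
                f"line {line}: </{tag}> has no matching open tag")
            return
        # Pop back to the matching element, reporting anything left open.
        while self.stack:
            open_tag, open_line = self.stack.pop()
            if open_tag == tag:
                break
            self.diagnostics.append(
                f"line {line}: <{open_tag}> opened on line {open_line} "
                f"was not closed before </{tag}>")


def check(html):
    """Parse html and return a list of diagnostics (empty if balanced)."""
    parser = DiagnosticParser()
    parser.feed(html)
    parser.close()
    for tag, line in parser.stack:
        parser.diagnostics.append(
            f"line {line}: <{tag}> was never closed")
    return parser.diagnostics
```

[For example, check("<p><b>bold</p>") reports the unclosed <b>, while
the parser still carries on to the end of the input instead of giving up.]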

