[Chicago] BeautifulSoup gone bad

Martin Maney maney at two14.net
Fri Mar 13 00:04:57 CET 2009


On Thu, Mar 12, 2009 at 10:31:51AM -0500, Kumar McMillan wrote:
> http://codespeak.net/lxml/lxmlhtml.html

Where it says:

  The normal HTML parser is capable of handling broken HTML, but for
  pages that are far enough from HTML to call them 'tag soup', it may
  still fail to parse the page. A way to deal with this is ElementSoup,
  which deploys the well-known BeautifulSoup parser to build an lxml
  HTML tree.

So when you need to parse nasty real-world web pages, you'll be using
BeautifulSoup anyway.  I only ever seem to need to scrape really nasty
pages, I think.  :-(
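
For anyone who wants to see the general idea without installing
anything, here is a rough stdlib-only sketch. It is not lxml,
ElementSoup, or BeautifulSoup: it swaps in Python's forgiving
html.parser tokenizer to feed an xml.etree.ElementTree tree, then
queries that tree with ElementTree's limited XPath subset. The names
SoupTreeParser, parse_soup, and VOID_TAGS are made up for the
illustration, and the repair rules (ignore stray close tags,
auto-close whatever is left open) are my own simplifying assumptions,
far cruder than what the real tag-soup parsers do.

```python
from html.parser import HTMLParser
from xml.etree.ElementTree import TreeBuilder

# Elements that never take a closing tag in HTML (partial list, assumed
# sufficient for this illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SoupTreeParser(HTMLParser):
    """Hypothetical helper: build an ElementTree from sloppy HTML."""

    def __init__(self):
        super().__init__()
        self._builder = TreeBuilder()
        self._open = []  # names of the tags that are currently open
        self._builder.start("root", {})  # synthetic wrapper element

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as "".
        self._builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self._builder.end(tag)
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray close tag: ignore it
        # Close any unclosed children first, then the tag itself.
        while self._open:
            name = self._open.pop()
            self._builder.end(name)
            if name == tag:
                break

    def handle_data(self, data):
        self._builder.data(data)

    def finish(self):
        self.close()  # flush any text still buffered by the tokenizer
        while self._open:
            self._builder.end(self._open.pop())
        self._builder.end("root")
        return self._builder.close()


def parse_soup(text):
    parser = SoupTreeParser()
    parser.feed(text)
    return parser.finish()


# Unclosed <b>, <p>, and <a>, plus a stray </i>.
broken = '<p>One<b>bold<p>Two <a href="/x">link</p></i>'
tree = parse_soup(broken)
links = [a.get("href") for a in tree.findall(".//a[@href]")]
```

The resulting nesting is whatever the crude repair rules produce, not
what a browser would build, which is rather the point of reaching for
a real tag-soup parser on pages that are bad enough.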

> What's really nice is that you can use full xpath expressions on
> crummy, poorly-formed HTML (the language of the Web!).  For a while
> lxml was a bit unstable and hard to build on Mac but as of recent
> versions I have not had any problems.

XPath has never appealed to me, though I suppose it's just the bee's
knees for the right applications.

-- 
To be alive, is that not to be
again and again surprised?  -- Nicholas van Rijn


