Parsing HTML, extracting text and changing attributes.

Mon Jun 18 15:04:24 EDT 2007

sebzzz at gmail.com wrote:
> I see there is a couple of tools I could use, and I also heard of
> sgmllib and htmllib. So now there is lxml, Beautiful soup, sgmllib,
> htmllib ...
> 
> Is there any of those tools that does the job I need to do more easily
> and what should I use? Maybe a combination of those tools, which one
> is better for what part of the work?

Well, as I said, use lxml. It's fast, pythonically easy to use, extremely
powerful and extensible. Apart from being the main author :), I actually use
it for lots of tiny things more or less like what you're trying to do. It's
just plain great for a quick script that gets you from A to B for a bag of
documents.

Parse the document with the HTML parser (it can even read from URLs), then use
XPath to extract exactly what you want and work on the result as you wish.
That's short and simple in lxml.
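
A minimal sketch of that workflow, assuming a reasonably recent lxml. The
sample markup, the process() helper and the example.com host names are made up
for illustration and are not from the original question:

```python
from lxml import etree

# Hypothetical input document, for illustration only.
SAMPLE = """<html><body>
<h1>Report</h1>
<p class="intro">Hello <b>world</b>.</p>
<a href="http://old.example.com/a">first</a>
<a href="/local">second</a>
</body></html>"""

OLD_PREFIX = "http://old.example.com/"  # assumed host to rewrite
NEW_PREFIX = "http://new.example.com/"  # assumed replacement host


def process(html_text):
    """Parse HTML, extract some text, and rewrite matching href attributes."""
    # etree.HTML() parses a string with lxml's forgiving HTML parser.
    root = etree.HTML(html_text)

    # Extract text: string() concatenates all text inside the first match.
    intro = root.xpath("string(//p[@class='intro'])")

    # Change attributes: select only the links that point at the old host.
    for link in root.xpath("//a[starts-with(@href, $prefix)]", prefix=OLD_PREFIX):
        link.set("href", link.get("href").replace(OLD_PREFIX, NEW_PREFIX, 1))

    # Serialize the modified tree back to an HTML string.
    return intro, etree.tostring(root, method="html", encoding="unicode")


if __name__ == "__main__":
    intro, new_html = process(SAMPLE)
    print(intro)
    print(new_html)
```

For a file or URL instead of a string, etree.parse(source, etree.HTMLParser())
should do the parsing step, and the rest stays the same.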

http://codespeak.net/lxml/dev/tutorial.html
http://codespeak.net/lxml/dev/parsing.html#parsing-html
http://codespeak.net/lxml/dev/xpathxslt.html#xpath

Stefan