Parsing HTML, extracting text and changing attributes.

Mon Jun 18 10:39:09 EDT 2007

sebzzz at gmail.com wrote:
> I work at this company and we are re-building our website: http://caslt.org/.
> The new website will be built by an external firm (I could do it
> myself, but since I'm just the summer student worker...). Anyways, to
> help them, they first asked me to copy all the text from all the pages
> of the site (and there is a lot!) to Word documents. I found the idea
> pretty stupid, since the styling would have to be applied from scratch
> anyway: we want to keep neither the old HTML code nor Microsoft Word's
> BS code.
> 
> I proposed taking each page and making a copy with only the text, with
> class names on the textual elements (h1, h2, p, strong, em ...), and
> then defining a CSS file to give them some style.
> 
> Now, we have around 1,600 documents to work on, and I thought I could
> challenge myself a bit and automate all the dull work. I thought about
> the possibility of parsing all those pages with Python, stripping out the
> navigation bars and keeping just the text and layout tags, and then
> applying class names to specific tags. The program would also have to
> remove the table that the text is located in. Another difficulty is
> that I want to be able to keep tables that are actually used for
> tabular data and not for positioning.
> 
> So, I'm writing to ask for your opinion on which tools and techniques
> I should use to do this.

lxml is what you're looking for, especially if you're familiar with XPath.

http://codespeak.net/lxml/dev
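
As a rough sketch of how that might look with lxml.html and XPath: the
sample markup, the "nav" class used to spot navigation cells, the "text-"
class prefix, and the rule "a table with no <th> cells in its own rows is a
layout table" are all assumptions made up for illustration, not taken from
the actual site. Real pages will almost certainly need different selectors.

```python
from lxml import html

# Hypothetical page: a layout table wrapping a nav bar and the real content.
SAMPLE = """
<html><body>
<table id="layout" width="100%"><tr>
  <td class="nav"><a href="/">Home</a></td>
  <td class="content">
    <h1 align="center">Title</h1>
    <p><font color="red">Some <b>bold</b> text.</font></p>
    <table><tr><th>Year</th></tr><tr><td>2007</td></tr></table>
  </td>
</tr></table>
</body></html>
"""

# Presentational attributes to strip (an assumed, incomplete list).
PRESENTATIONAL = ("style", "align", "bgcolor", "width", "border",
                  "cellpadding", "cellspacing")


def clean_page(source):
    doc = html.fromstring(source)

    # 1. Remove navigation cells together with everything inside them.
    for nav in doc.xpath('//td[@class="nav"]'):
        nav.drop_tree()

    # 2. Unwrap layout tables but keep their content. Assumed heuristic:
    #    a table whose own rows contain no <th> is used for positioning.
    for table in doc.xpath('//table[not(tr/th) and not(*/tr/th)]'):
        parts = table.xpath('./tr | ./tr/td | ./tbody | '
                            './tbody/tr | ./tbody/tr/td')
        for el in parts + [table]:
            el.drop_tag()  # removes the tag, keeps children and text

    # 3. Drop <font> wrappers and presentational attributes.
    for font in doc.xpath('//font'):
        font.drop_tag()
    for el in doc.xpath('//*'):
        for name in PRESENTATIONAL:
            el.attrib.pop(name, None)

    # 4. Add class names to the textual elements.
    for el in doc.xpath('//h1 | //h2 | //p | //strong | //em'):
        el.set('class', 'text-' + el.tag)

    return html.tostring(doc, encoding='unicode')


if __name__ == '__main__':
    print(clean_page(SAMPLE))
```

If only the plain text is needed, calling text_content() on the parsed
document returns it without any markup. The same function could be run
over all the files in a loop, but the heuristics should be checked by hand
on a few representative pages first.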

Stefan