Parsing HTML, extracting text and changing attributes.

Mon Jun 18 10:00:39 EDT 2007

Hi,

I work at this company and we are re-building our website: http://caslt.org/.
The new website will be built by an external firm (I could do it
myself, but since I'm just the summer student worker...). Anyway, to
help them, they first asked me to copy all the text from all the pages
of the site (and there is a lot!) into Word documents. I found the idea
pretty stupid, since the styling would have to be redone from scratch
anyway: we don't want to carry over either the old HTML code or
Microsoft Word's BS code.

I proposed instead to take each page and make a copy containing only
the text, with class names on the textual elements (h1, h2, p, strong,
em ...), and then define a CSS file giving them some style.
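
To make the idea concrete, here is a minimal sketch using only Python's
standard-library html.parser. It keeps a whitelist of text tags, throws
away their old attributes, and writes a class name on each one. The
class names in TAG_CLASSES and the names TextTagCleaner and clean are
made up for illustration; they are assumptions, not anything the
external firm has specified.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical mapping from tag name to the class it should receive.
TAG_CLASSES = {
    "h1": "title",
    "h2": "subtitle",
    "p": "body-text",
    "strong": "strong-text",
    "em": "emphasis",
}


class TextTagCleaner(HTMLParser):
    """Keep only whitelisted text tags, drop old attributes, add a class."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []  # pieces of the cleaned document

    def handle_starttag(self, tag, attrs):
        # Old attributes (style, align, ...) are deliberately ignored.
        if tag in TAG_CLASSES:
            self.out.append('<%s class="%s">' % (tag, TAG_CLASSES[tag]))

    def handle_endtag(self, tag):
        if tag in TAG_CLASSES:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is kept wherever it appears; re-escape &, < and >.
        self.out.append(escape(data, quote=False))


def clean(html_text):
    """Return html_text reduced to classed text tags plus their text."""
    parser = TextTagCleaner()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(clean('<p style="color:red">Hi <font size="2">there</font></p>'))
```

This sketch keeps all text, including text inside navigation bars and
script or style blocks, so those would still have to be skipped
separately.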

Now, we have around 1,600 documents to work on, and I thought I could
challenge myself a bit and automate all the dull work. I thought about
parsing all those pages with Python, ripping off the navigation bars
and keeping just the text and layout tags, and then applying class
names to specific tags. The program would also have to remove the
layout table that the text is located in. Another difficulty is that I
want to keep tables that are actually used for tabular data and not
for positioning.
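
One possible way to tell the two kinds of table apart is sketched
below, again with only the standard-library html.parser. The rule it
uses is an assumption and only a heuristic: a table counts as "data" if
it has a th or caption element and no table nested inside it, and as
"layout" otherwise. The names TableClassifier and classify_tables are
hypothetical, and real pages would probably need the rule tuned by hand.

```python
from html.parser import HTMLParser


class TableClassifier(HTMLParser):
    """Record, for every table, whether it has headers or nested tables."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tables, innermost last
        self.tables = []  # every table seen, in the order it opened

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            info = {"has_header": False, "has_nested": False}
            if self.stack:
                # The enclosing table contains another table.
                self.stack[-1]["has_nested"] = True
            self.stack.append(info)
            self.tables.append(info)
        elif tag in ("th", "caption") and self.stack:
            # Header cells and captions belong to the innermost table.
            self.stack[-1]["has_header"] = True

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            self.stack.pop()


def classify_tables(html_text):
    """Return 'data' or 'layout' for each table, in document order."""
    parser = TableClassifier()
    parser.feed(html_text)
    parser.close()
    return [
        "data" if t["has_header"] and not t["has_nested"] else "layout"
        for t in parser.tables
    ]
```

A cleaning pass could then unwrap the tables marked "layout" and copy
the ones marked "data" through unchanged. A data table written without
th or caption would be misclassified as layout by this rule.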

So, I'm writing to get your opinion on which tools and which technique
I should use to do this.