Parsing HTML, extracting text and changing attributes.

Neil Cerutti horpner at yahoo.com
Mon Jun 18 11:23:30 EDT 2007


On 2007-06-18, sebzzz at gmail.com <sebzzz at gmail.com> wrote:
> I work at this company and we are re-building our website: http://caslt.org/.
> The new website will be built by an external firm (I could do it
> myself, but since I'm just the summer student worker...). Anyways, to
> help them, they first asked me to copy all the text from all the pages
> of the site (and there is a lot!) into Word documents. I found the idea
> pretty stupid, since the styling would have to be applied from scratch
> anyway: we want to keep neither the old HTML markup nor Microsoft
> Word's BS markup.
>
> I proposed to take each page and make a copy with only the text,
> with class names on the textual elements (h1, h2, p, strong, em, ...),
> and then define a CSS file giving them some style.
>
> Now, we have around 1,600 documents to work on, and I thought I could
> challenge myself a bit and automate all the dull work. I thought about
> parsing all those pages with Python, ripping off the navigation bars,
> keeping just the text and layout tags, and then applying class names
> to specific tags. The program would also have to remove the table the
> text sits inside. Another difficulty is that I want to be able to keep
> the tables that are actually used for tabular data rather than
> positioning.
>
> So, I'm writing to ask your opinion on what tools and techniques I
> should use to do this.
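
If you do go the parsing route you describe, BeautifulSoup would be a
natural fit. Below is a rough, untested sketch of that kind of pass;
the id="nav" selector, the "contains a <th>" test for real data
tables, the class mapping and the directory names are all guesses, not
facts about the caslt.org markup:

from pathlib import Path
from bs4 import BeautifulSoup

# How old tags map to class names in the new stylesheet (an example
# mapping, not the real one).
CLASS_MAP = {"h1": "page-title", "h2": "section-title", "p": "body-text"}

def clean(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop navigation bars, assumed here to carry id="nav".
    for nav in soup.find_all(id="nav"):
        nav.decompose()
    # Unwrap positioning tables; a table containing <th> cells is
    # treated as real tabular data and left alone.
    for table in soup.find_all("table"):
        if table.find("th") is None:
            for cell in table.find_all(["tr", "td"]):
                cell.unwrap()
            table.unwrap()
    # Put class names on the textual elements for the new CSS.
    for name, cls in CLASS_MAP.items():
        for tag in soup.find_all(name):
            tag["class"] = cls
    return str(soup)

for src in Path("old_site").rglob("*.html"):
    dst = Path("clean_site") / src.relative_to("old_site")
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(clean(src.read_text(errors="replace")), encoding="utf-8")

Nested tables and nav markup without a stable id would need extra
care, so try it on a handful of pages before letting it loose on all
1,600.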

Alternatively, you could get good results, and save yourself some
effort, by using links or lynx with their command-line options for
dumping page text to a file. Python would still be needed to automate
running links or lynx over all your documents.
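
Something along these lines would do (modern Python shown; the
directory layout and the choice of lynx switches are guesses):

import subprocess
from pathlib import Path

SRC = Path("old_site")   # where the existing .html files live (assumption)
DST = Path("text_dump")  # where the plain-text versions go (assumption)

for page in SRC.rglob("*.html"):
    out = DST / page.relative_to(SRC).with_suffix(".txt")
    out.parent.mkdir(parents=True, exist_ok=True)
    # -dump renders the page and prints the formatted text on stdout;
    # -nolist drops the numbered list of link URLs at the end.
    result = subprocess.run(
        ["lynx", "-dump", "-nolist", str(page)],
        capture_output=True, text=True, check=True,
    )
    out.write_text(result.stdout, encoding="utf-8")

links has a similar -dump mode if you prefer it over lynx.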

-- 
Neil Cerutti


