Right tool and method to strip off html files (python, sed, awk?)

Sat Jul 14 07:28:55 EDT 2007

sebzzz at gmail.com wrote:
> 1- Find all html files in the folders (sub-folders ...)
> 2- Do some file I/O and feed Sed or Python or what else with the file.
> 3- Apply recursively some regular expression on the file to do the
> things I want. (delete when it encounters certain tags, certain
> attributes)
> 4- Write the changed file, and go through all the files like that.

Use the lxml.html.clean module, which is made exactly for that purpose. It's
not released yet, but you can use it from the current html branch of lxml.
There will soon be an official alpha of the 2.0 series, which will contain
lxml.html:

http://codespeak.net/svn/lxml/branch/html/

It looks like you're on Ubuntu, so building it from source after an SVN
checkout should be as simple as the usual setup.py dance. Please report back
to the lxml mailing list if you find any problems or have any further ideas on
how to make it even more versatile than it already is.
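
As a rough sketch of how your four steps might fit together once
lxml.html.clean is importable: the function names (find_html_files,
make_lxml_cleaner, clean_files) and the tags being dropped are made up for
illustration, the files are assumed to be UTF-8, and the Cleaner options
shown are only one possible configuration, so treat this as a starting point
rather than the way to do it.

```python
import os


def find_html_files(root):
    """Step 1: yield every .html/.htm file below root, sub-folders included."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit sub-folders in a predictable order
        for name in sorted(filenames):
            if name.lower().endswith((".html", ".htm")):
                yield os.path.join(dirpath, name)


def make_lxml_cleaner(remove_tags=("font", "span"), kill_tags=("script",)):
    """Step 3: build a str -> str cleaning function on top of lxml's Cleaner.

    The tag choices are placeholders (assumptions), not a recommendation.
    remove_tags drops the tag but keeps its content; kill_tags drops both.
    """
    # Imported lazily so the file handling below can be tried without lxml.
    # Recent lxml releases need the separate lxml_html_clean package for this.
    from lxml.html.clean import Cleaner

    cleaner = Cleaner(remove_tags=list(remove_tags), kill_tags=list(kill_tags))
    return cleaner.clean_html


def clean_files(root, transform, encoding="utf-8"):
    """Steps 2 and 4: read each file, apply transform, write back if changed.

    transform is any callable taking and returning the file content as str.
    Returns the list of paths that were rewritten.
    """
    changed = []
    for path in find_html_files(root):
        with open(path, encoding=encoding, newline="") as f:
            original = f.read()
        cleaned = transform(original)
        if cleaned != original:
            with open(path, "w", encoding=encoding, newline="") as f:
                f.write(cleaned)
            changed.append(path)
    return changed
```

Calling clean_files("/path/to/site", make_lxml_cleaner()) would then rewrite
the files in place, so try it on a copy of the tree first. Because the
cleaning step is just a callable, you can swap in a different Cleaner setup
without touching the file handling.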

For lxml in general, see:

http://codespeak.net/lxml/

Stefan