Right tool and method to strip off html files (python, sed, awk?)

Eric_Dexter at msn.com Eric_Dexter at msn.com
Fri Jul 13 20:12:48 EDT 2007


On Jul 13, 7:07 pm, "Eric_Dex... at msn.com" <Eric_Dex... at msn.com> wrote:
> On Jul 13, 1:57 pm, seb... at gmail.com wrote:
>
> > Hi,
>
> > I'm in the process of refactoring a lot of HTML documents and I'm
> > using HTML Tidy to do part of this work (clean up, convert to XHTML,
> > and remove font and center tags).
>
> > Tidy will only do part of the work I need, though. I also have to
> > remove all the presentational tags and attributes from the pages (in
> > other words, strip the pages down), including the tables that are
> > used for layout (how do I tell those apart?).
>
> > I thought about doing that with Python (which I'm in the process of
> > learning), but maybe another tool (like sed?) would be better suited
> > for this job.
>
> > I kind of know generally what I need to do:
>
> > 1- Find all HTML files in the folders (sub-folders ...)
> > 2- Do some file I/O and feed sed or Python or whatever else with the file.
> > 3- Apply recursively some regular expressions on the file to do the
> > things I want (delete when it encounters certain tags, certain
> > attributes).
> > 4- Write the changed file, and go through all the files like that.
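
A rough sketch of those four steps in Python, using only the standard
library (html.parser rather than BeautifulSoup or lxml, so nothing needs
installing). The DROP_TAGS and DROP_ATTRS sets and the names Stripper,
strip_html and process_tree are made up for illustration; adjust them to
your pages. It assumes the files are UTF-8, it parses each file instead
of applying regular expressions, and it leaves tables alone, since
telling layout tables from data tables usually needs a human look.

```python
import os
from html import escape
from html.parser import HTMLParser

# Assumed lists, for illustration only; extend them to match your pages.
DROP_TAGS = {"font", "center"}
DROP_ATTRS = {"align", "valign", "bgcolor", "color", "face", "size"}


class Stripper(HTMLParser):
    """Re-emit HTML, omitting presentational tags and attributes."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _start(self, tag, attrs, close=""):
        if tag in DROP_TAGS:
            return  # drop the tag itself, keep whatever it wraps
        kept = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
            if k not in DROP_ATTRS
        )
        self.out.append(f"<{tag}{kept}{close}>")

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs, " /")  # XHTML style, e.g. <br />

    def handle_endtag(self, tag):
        if tag not in DROP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")  # e.g. the DOCTYPE line

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")  # e.g. the <?xml ...?> prolog


def strip_html(text):
    """Return text with the listed tags and attributes removed."""
    parser = Stripper()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


def process_tree(root):
    """Rewrite every .html/.htm file under root in place; return the count."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith((".html", ".htm")):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8") as fh:
                    cleaned = strip_html(fh.read())
                with open(path, "w", encoding="utf-8") as fh:
                    fh.write(cleaned)
                count += 1
    return count
```

Calling process_tree("site") would rewrite everything under a folder
named site (a placeholder). It overwrites the files, so try it on a
copy first.
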
>
> > But I don't know how to do it for real, the syntax and everything. I
> > also want to pick the tool that's the easiest for this job. I heard
> > about BeautifulSoup and lxml for Python, but I don't know if those
> > modules would help.
>
> > Now, I know I'm not in the best place to ask if Python is the right
> > choice (anyway, even my little finger tells me it is), but if I can do
> > the same thing more simply with another tool it would be good to know.
>
> > Another argument for the other tools is that I know how to use the
> > Unix find program to find the files and feed them to grep or sed, but
> > I still don't know what the syntax is with Python (fetch files, change
> > them, then write them) and I don't know if I should read the files and
> > treat them as a whole or just line by line. Of course I could mix
> > shell commands with some Python: pipe the find command into my
> > program's standard input, and send my program's standard output back
> > to the original file. But how do I control STDIN and STDOUT with
> > Python?
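
On the STDIN/STDOUT part: in Python they are the file-like objects
sys.stdin and sys.stdout. A minimal filter sketch follows; the pattern
and the names TAG_RE, filter_stream and main are assumptions for
illustration. It reads the whole input at once rather than line by
line, because a tag can be split across lines. A regular expression
like this is fragile on real-world HTML, so treat it as a demonstration
of the plumbing more than of the stripping.

```python
import re
import sys

# Assumed pattern: opening and closing <font> and <center> tags only.
TAG_RE = re.compile(r"</?(?:font|center)\b[^>]*>", re.IGNORECASE)


def filter_stream(src, dst):
    """Read all of src, delete the matching tags, write the rest to dst."""
    dst.write(TAG_RE.sub("", src.read()))


def main():
    # Standard input and output are ordinary file-like objects.
    filter_stream(sys.stdin, sys.stdout)

# Call main() at the bottom of the script when using it as a filter.
```

Saved as strip_tags.py (a placeholder name) with main() called at the
end, it could be run as
    python strip_tags.py < page.html > page.clean.html
Send the output to a new file: redirecting onto the input file makes
the shell truncate it before Python gets to read it.
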
>
> > Sorry if that's a lot of questions in one, and I will probably get a
> > lot of RTFM (which I'm doing, btw), but I feel a little lost in all
> > that right now.
>
> > Any help would be really appreciated.
> > Thanks
>
> You might find a text editor is the way to go. You can use AutoIt,
> either through Python or by itself, to control the text editor you
> use. I just downloaded PSPad and it looks like it will do that. It
> may be a pain to script, though.
>
> http://sourceforge.net/projects/dex-tracker/

Let me add to that: it may be a pain to script with AutoIt, and I am not
giving more of an example because it won't insert a text file at a
location like mdipad will.



