Right tool and method to strip off html files (python, sed, awk?)

Jay Loden python at jayloden.com
Fri Jul 13 18:38:31 EDT 2007


sebzzz at gmail.com wrote:
> I thought about doing that with python (for which I'm in process of
> learning), but maybe an other tool (like sed?) would be better suited
> for this job.

Generally speaking, in my experience, the best tool for the job is the one you know how to use ;) There are of course places where certain tools are very well suited - e.g. Perl when it comes to regular expressions and text processing. BUT, the time it will take you to learn Perl would be better spent getting the work done in Python or sed/awk etc. Similarly, maintaining a script in a language you don't know well will introduce headaches later. In short, you're almost always best off using the tool you are most comfortable with.

> I kind of know generally what I need to do:

That's usually a good start ;)

> 1- Find all html files in the folders (sub-folders ...)
> 2- Do some file I/O and feed Sed or Python or what else with the file.
> 3- Apply recursively some regular expression on the file to do the
> things a want. (delete when it encounters certain tags, certain
> attributes)
> 4- Write the changed file, and go through all the files like that.

This is one valid approach. There are a lot of things that you can do to help define your problem better though. For instance: 

* Do the files match a predefined template of some kind? 
  Can you use that to help define some of your processing rules? 

* Do you know what kind of regular expressions you are going to need?
  For that matter, are you even comfortable using regular expressions?
  From the sound of your post, you may not have experience with them, so
  that's going to be a hurdle to overcome when it comes to using them.

* Regular expressions are one approach to the problem. However, they 
  may not be the most maintainable or practical choice, depending on the
  actual requirements. An HTML or XML processing module might be a better
  option, particularly if the pages you ran through HTML Tidy are valid
  XHTML. 

* Define your program requirements in smaller, more specific terms, e.g. 
  "need to remove all of the following tags: <font>, <center>" or 
  "need to clean orphaned/invalid tags" - this will help you define 
  the actual problem statement better and make it easier to see what
  the best solution is. Are you just looking to strip all the HTML from 
  some files? Perhaps lynx/links with the -dump option is all you need, 
  as opposed to a full HTML parsing script. (There's a rough sketch of 
  the tag-stripping route just after this list.)
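
To make that concrete, here's a rough sketch of the four quoted steps 
using BeautifulSoup (which you mention below). Assumptions on my part: 
BeautifulSoup is installed (imported here under its current "bs4" 
package name), your files end in .html, and <font>/<center> are just 
the example tags from above.

    import os
    from bs4 import BeautifulSoup

    TAGS_TO_STRIP = ['font', 'center']       # example tags only

    def clean_tree(top):
        # 1: find all the .html files under top, sub-folders included
        for dirpath, dirnames, filenames in os.walk(top):
            for name in filenames:
                if not name.endswith('.html'):
                    continue
                path = os.path.join(dirpath, name)
                # 2: read the whole file in and parse it
                with open(path) as f:
                    soup = BeautifulSoup(f.read(), 'html.parser')
                # 3: drop the unwanted tags, keeping their contents
                for tag in soup.find_all(TAGS_TO_STRIP):
                    tag.unwrap()
                # 4: write the changed file back out
                with open(path, 'w') as f:
                    f.write(str(soup))

    clean_tree('.')

The point isn't this exact code - it's that once the requirement is 
pinned down to "remove these two tags", the whole job fits in a dozen 
lines.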
 
> But I don't know how to do it for real, the syntax and everything. I
> also want to pick up the tool that's the easiest for this job. I heard
> about BeautifulSoup and lxml for Python, but I don't know if those
> modules would help.

See above about defining the problem statement. If you get it pinned down to a finite set of requirements, you can take those smaller problems and determine if, for example, lxml is the right tool for the job. If you come back to the Python mailing list with a smaller problem, e.g. "how can I remove all <center> tags from HTML pages", you're much more likely to get a quick, practical, and useful answer to your question(s). 
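
For that particular smaller problem, for instance, lxml gets you most 
of the way in three lines. A minimal sketch, assuming lxml is installed 
and that "page.html" stands in for one of your files:

    from lxml import etree, html

    doc = html.parse('page.html')            # parse the page into a tree
    etree.strip_tags(doc, 'center')          # remove the tags, splicing their
                                             # children back into the parent
    doc.write('page.html', method='html')    # write the result back out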

> Now, I know I'm not at the best place to ask if python is the right
> choice (anyways even my little finger tells me it is), but if I can do
> the same thing more simply with another tool it would be good to know.

If all you've got is a hammer, everything looks like a nail ;) - it's important not to be so dogmatic about any one programming language or tool that you can't see when there's a much more efficient solution available. However, if you end up deciding that what you need is a good all-purpose scripting/programming language, I'm sure you'll find Python plenty capable, and this list quite helpful in conquering any problems along the way. 

> Another argument for the other tools is that I know how to use the
> find unix program to find the files and feed them to grep or sed, but
> I still don't know what the syntax is with python (fetch files, change
> them, then write them) and I don't know if I should read the files and
> treat them as a whole or just line by line. Of course I could mix
> commands with some python, find command to my program's standard
> input, and my command's standard output to the original file. But how
> do I control STDIN and STDOUT with python?

Any of those approaches is perfectly valid should you end up using Python: you can feed a list of filenames to Python on the command line, write a recursive directory-reading function to collect them, or read them from STDIN. Again, see my first point about defining a problem statement, and then you can Google for example code to help you. The Python Cookbook is often enormously helpful as well, since you can find sample code for manipulating STDIN/STDOUT, reading a directory recursively, and handling command line arguments. But it's important to know which one you want before you can search for it...
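
If you do take the find-driven route described above, controlling 
STDIN/STDOUT is straightforward. A tiny sketch, assuming the script is 
invoked as "find . -name '*.html' | python fixup.py" (fixup.py and the 
<center> substitution are placeholders of mine):

    import re
    import sys

    for line in sys.stdin:                   # one filename per line, from find
        path = line.rstrip('\n')
        with open(path) as f:
            text = f.read()                  # whole file at once; HTML pages
                                             # are usually small enough for that
        text = re.sub(r'(?i)</?center>', '', text)    # placeholder edit
        with open(path, 'w') as f:
            f.write(text)
        sys.stdout.write('rewrote %s\n' % path)       # ordinary STDOUT

The standard library's fileinput module covers similar ground, 
including in-place editing of files named on the command line.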
 
> Sorry if that's a lot of questions in one, and I will probably get a
> lot of RTFM (which I'm doing btw), but I feel a little lost in all
> that right now.

Reading the manual is excellent and important, but it won't always help with feeling overwhelmed. The best thing to do is break a big problem into little problems and work on those, so they don't seem so insurmountable. (You may be detecting a pattern to the advice I'm giving by now.) 

HTH, 

-Jay


