[Tutor] Remove certain tags in html files
Eric Brunson
brunson at brunson.com
Fri Jul 27 20:27:44 CEST 2007
Eric Brunson wrote:
> Sebastien Noel wrote:
>
>> Hi,
>>
>> I'm writing a little script with the help of the BeautifulSoup HTML parser
>> and uTidyLib (an HTML Tidy wrapper for Python).
>>
>> Essentially, it fetches all the HTML files in a given directory (and
>> its subdirectories), cleans the code with Tidy (removes deprecated
>> tags, changes the output to XHTML), and then uses BeautifulSoup to
>> remove a couple of things that I don't want in the files (because I'm
>> stripping the files to the bare bones, just keeping layout information).
>>
>> Finally, I want to remove all traces of layout tables (because the new
>> layout will use CSS for positioning). Now, there are tables that lay out
>> things on the page and tables that represent tabular data, but I think it
>> would be too hard to write a script that tells the difference.
>>
>> My question, since I'm quite new to Python, is which tool I should
>> use to remove the table, tr and td tags, but not what's enclosed in them.
>> I think BeautifulSoup isn't good for that because it removes what's
>> enclosed as well.
>>
>>
>
> You want to look at htmllib: http://docs.python.org/lib/module-htmllib.html
>
I'm sorry, I should have pointed you to HTMLParser:
http://docs.python.org/lib/module-HTMLParser.html
It's a bit more straightforward than the HTMLParser defined in htmllib.
Everything I was talking about below pertains to the HTMLParser module
and not the htmllib module.
> If you've used a SAX parser for XML, it's similar. Your parser parses
> the file and every time it hits a tag, it runs a callback which you've
> defined. You can assign a default callback that simply prints out the
> tag as parsed, then a custom callback for each tag you want to clean up.
>
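As a rough sketch of that callback style: the class below echoes
everything it parses and simply skips the table, tr and td tags, so
whatever they enclose is kept. It assumes the Python 3 module name
html.parser (in the Python 2 series the same class lives in the
HTMLParser module). TableStripper, LAYOUT_TAGS and strip_layout_tables
are names made up for illustration, not part of any library.

```python
from html.parser import HTMLParser  # Python 3 name; in Python 2 the module is "HTMLParser"

# Tags to drop while keeping whatever they enclose (assumed set, extend as needed).
LAYOUT_TAGS = {"table", "tr", "td"}


class TableStripper(HTMLParser):
    """Echo the input unchanged, except for the tags listed in LAYOUT_TAGS."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; reach the
        # handle_entityref/handle_charref callbacks and can be echoed as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in LAYOUT_TAGS:
            # get_starttag_text() returns the start tag as it appeared in the source.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # XHTML-style empty tags such as <br />.
        if tag not in LAYOUT_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in LAYOUT_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def strip_layout_tables(html):
    """Return html with the LAYOUT_TAGS removed and their contents kept."""
    parser = TableStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling strip_layout_tables() on the output of Tidy should leave the
paragraphs, links and so on in place and drop only the table scaffolding.
If your files also contain tbody or th, add them to LAYOUT_TAGS.
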
> It took me a little time to wrap my head around it the first time I used
> it, but once you "get it" it's *really* powerful and really easy to
> implement.
>
> Read the docs and play around a little bit, then if you have questions,
> post back and I'll see if I can dig up some examples I've written.
>
> e.
>
>
>> Is re the right module for that? Basically, if I write a loop that
>> scans the text and tries to match every occurrence of a given regular
>> expression, would it be a good idea?
>>
>> Now, I'm quite new to the concept of regular expressions, but would it
>> resemble something like this: re.compile("<table.*>")?
>>
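On the regular expression itself: a pattern like "<table.*>" is greedy,
so ".*" runs to the last ">" on the line and takes the enclosed content
with it. A narrower pattern that stops at the first ">" behaves better
on simple input. The sketch below is only an illustration; the sample
string and the variable names are made up, and a ">" inside an attribute
value would still confuse it, which is why a parser is the safer tool.

```python
import re

# Made-up sample input for illustration.
html = "<table><tr><td>keep me</td></tr></table>"

# Greedy: ".*" extends to the LAST ">" on the line, swallowing the content too.
greedy = re.compile(r"<table.*>")

# Narrower: only opening or closing table, tr and td tags, each ending at
# the first ">" after the tag name.
layout_tag = re.compile(r"</?(?:table|tr|td)\b[^>]*>", re.IGNORECASE)

greedy_match = greedy.search(html).group(0)  # the whole line, not just "<table>"
stripped = layout_tag.sub("", html)          # tags removed, enclosed text kept

print(greedy_match)
print(stripped)
```
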
>> Thanks for the help.
>> _______________________________________________
>> Tutor maillist - Tutor at python.org
>> http://mail.python.org/mailman/listinfo/tutor
>>
>>
>