[Tutor] Remove certain tags in html files

Eric Brunson brunson at brunson.com
Fri Jul 27 20:27:44 CEST 2007


Eric Brunson wrote:
> Sebastien Noel wrote:
>   
>> Hi,
>>
>> I'm writing a little script with the help of the BeautifulSoup HTML parser 
>> and uTidyLib (an HTML Tidy wrapper for Python).
>>
>> Essentially, it fetches all the HTML files in a given directory (and 
>> its subdirectories) and cleans the code with Tidy (which removes 
>> deprecated tags and changes the output to XHTML). Then BeautifulSoup 
>> removes a couple of things that I don't want in the files (because I'm 
>> stripping the files to the bare bones, just keeping layout information).
>>
>> Finally, I want to remove all traces of layout tables (because the new 
>> layout will use CSS for positioning). Now, there are tables that lay out 
>> things on the page and tables that represent tabular data, but I think it 
>> would be too hard to write a script that tells the difference.
>>
>> My question, since I'm quite new to Python, is about what tool I should 
>> use to remove the table, tr and td tags, but not what's enclosed in them. 
>> I think BeautifulSoup isn't good for that because it removes what's 
>> enclosed as well.
>>   
>>     
>
> You want to look at htmllib:  http://docs.python.org/lib/module-htmllib.html
>   

I'm sorry, I should have pointed you to HTMLParser:  
http://docs.python.org/lib/module-HTMLParser.html

It's a bit more straightforward than the HTMLParser defined in htmllib.  
Everything I was talking about below pertains to the HTMLParser module 
and not the htmllib module.

> If you've used a SAX parser for XML, it's similar.  Your parser parses 
> the file and every time it hits a tag, it runs a callback which you've 
> defined.  You can assign a default callback that simply prints out the 
> tag as parsed, then a custom callback for each tag you want to clean up.
>
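A minimal sketch of the callback approach described above, for illustration
only. It assumes Python 3, where the HTMLParser module lives in html.parser;
the names TableStripper and strip_tables are hypothetical and not part of
any library. Each callback echoes what the parser saw, except that table, tr
and td tags are dropped while their contents pass through untouched.

```python
from html.parser import HTMLParser  # Python 3 location of the HTMLParser class


class TableStripper(HTMLParser):
    """Echo the document back, minus table/tr/td tags (hypothetical helper)."""

    STRIP = {"table", "tr", "td"}

    def __init__(self):
        # convert_charrefs=False routes entities such as &amp; to the
        # handle_entityref/handle_charref callbacks so they can be echoed.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Default behaviour: re-emit the start tag exactly as written.
        if tag not in self.STRIP:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # XHTML-style empty tags such as <br />.
        if tag not in self.STRIP:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.STRIP:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text between tags is always kept, including text inside a <td>.
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def strip_tables(html):
    """Return html with table/tr/td tags removed and everything else kept."""
    parser = TableStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling strip_tables('<table><tr><td><p>Hi</p></td></tr></table>') should
leave just '<p>Hi</p>'. Other table-related tags (th, tbody and so on) are
not in STRIP here; add them to the set if they should go as well.
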
> It took me a little time to wrap my head around it the first time I used 
> it, but once you "get it" it's *really* powerful and really easy to 
> implement.
>
> Read the docs and play around a little bit, then if you have questions, 
> post back and I'll see if I can dig up some examples I've written.
>
> e.
>
>   
>> Is re the right module for that? Basically, if I write a loop that 
>> scans the text and tries to match every occurrence of a given regular 
>> expression, would it be a good idea?
>>
>> Now, I'm quite new to the concept of regular expressions, but would it 
>> resemble something like this: re.compile("<table.*>")?
>>
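On the pattern itself: ".*" is greedy, so "<table.*>" runs to the last ">"
on the line rather than stopping at the end of the table tag. The sketch
below shows that, along with a tighter pattern that matches only the tags
themselves. The sample string and variable names are made up for
illustration, and even the tighter pattern is only an approximation: it can
be fooled by a ">" inside an attribute value or by tags inside comments and
scripts, which is one reason a real parser is the safer choice.

```python
import re

# Made-up sample input for illustration.
html = '<table border="0"><tr><td><b>Hello</b></td></tr></table>'

# Greedy: ".*" swallows everything up to the LAST ">" on the line.
greedy = re.compile(r"<table.*>")
print(greedy.findall(html))

# Tighter: an opening or closing table/tr/td tag, where "[^>]*" stops at
# the first ">" and "\b" keeps other tag names starting with "td" or "tr"
# from matching.
tag_only = re.compile(r"</?(?:table|tr|td)\b[^>]*>", re.IGNORECASE)
print(tag_only.sub("", html))
```

The first print shows a single match covering the entire string; the second
leaves only the enclosed markup, <b>Hello</b>.
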
>> Thanks for the help.
>> _______________________________________________
>> Tutor maillist  -  Tutor at python.org
>> http://mail.python.org/mailman/listinfo/tutor
>>   
>>     
>


