[Tutor] Remove certain tags in html files

Eric Brunson brunson at brunson.com
Fri Jul 27 21:34:04 CEST 2007


Man, the docs on the HTMLParser module are really sparse.

Attached is some code I just whipped up that will parse an HTML file, 
suppress the output of the tags you mention and spew the HTML back out.  
It's just a rough thing; you'll still have to read the docs and make 
sure to expand on some of the things it's doing, but I think it'll 
handle 95% of what it comes across.  Be sure to override all the 
"handle_*()" methods I didn't.

My recommendation would be to shove your HTML through BeautifulSoup to 
ensure it is well formed, then run it through the html parser to do 
whatever you want to change it, then through tidy to make it look nice.

If you wanted to take the time, you could probably write the entire tidy 
process in the parser.  I got a fair ways there, but decided it was too 
long to be instructional, so I pared it back to what I've included.
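
Since the attachment did not survive in the archive, here is a minimal 
sketch of the approach described above. It is not the original 
replacetags.py. It assumes Python 3, where the HTMLParser module lives at 
html.parser, and the names TagStripper and strip_tags are made up for 
illustration.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Re-emit HTML as parsed, except for the tags named in `drop`."""

    def __init__(self, drop=("table", "tr", "td")):
        # convert_charrefs=False keeps entities such as &amp; out of
        # handle_data so they can be echoed back verbatim below.
        super().__init__(convert_charrefs=False)
        self.drop = set(drop)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.drop:
            # get_starttag_text() returns the start tag as it appeared.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>.
        if tag not in self.drop:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.drop:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text between tags is always kept, even inside dropped tags.
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def strip_tags(html, drop=("table", "tr", "td")):
    """Return `html` with the `drop` tags removed but their contents kept."""
    parser = TagStripper(drop)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, strip_tags('<td>a<br/>b</td>') gives 'a<br/>b'. End tags are 
rebuilt from the lowercased tag name, so their original case is not 
preserved.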

Hope this gets you started,
e.

Eric Brunson wrote:
> Eric Brunson wrote:
>> Sebastien Noel wrote:
>>> Hi,
>>>
>>> I'm doing a little script with the help of the BeautifulSoup HTML parser 
>>> and uTidyLib (HTML Tidy wrapper for Python).
>>>
>>> Essentially what it does is fetch all the HTML files in a given 
>>> directory (and its subdirectories), clean the code with Tidy (removes 
>>> deprecated tags, changes the output to be XHTML) and then BeautifulSoup 
>>> removes a couple of things that I don't want in the files (Because I'm 
>>> stripping the files to bare bone, just keeping layout information).
>>>
>>> Finally, I want to remove all trace of layout tables (because the new 
>>> layout will be in CSS for positioning). Now, there are tables to lay out 
>>> things on the page and tables to represent tabular data, but I think it 
>>> would be too hard to make a script that finds out the difference.
>>>
>>> My question, since I'm quite new to python, is about what tool I should 
>>> use to remove the table, tr and td tags, but not what's enclosed in them. 
>>> I think BeautifulSoup isn't good for that because it removes what's 
>>> enclosed as well.
>> You want to look at htmllib:  http://docs.python.org/lib/module-htmllib.html
>
> I'm sorry, I should have pointed you to HTMLParser:  
> http://docs.python.org/lib/module-HTMLParser.html
>
> It's a bit more straightforward than the HTMLParser defined in htmllib.  
> Everything I was talking about below pertains to the HTMLParser module 
> and not the htmllib module.
>
>> If you've used a SAX parser for XML, it's similar.  Your parser parses 
>> the file and every time it hits a tag, it runs a callback which you've 
>> defined.  You can assign a default callback that simply prints out the 
>> tag as parsed, then a custom callback for each tag you want to clean up.
>>
>> It took me a little time to wrap my head around it the first time I used 
>> it, but once you "get it" it's *really* powerful and really easy to 
>> implement.
>>
>> Read the docs and play around a little bit, then if you have questions, 
>> post back and I'll see if I can dig up some examples I've written.
>>
>> e.
>>
>>> Is re the right module for that? Basically, if I make an iteration that 
>>> scans the text and tries to match every occurrence of a given regular 
>>> expression, would it be a good idea?
>>>
>>> Now, I'm quite new to the concept of regular expressions, but would it 
>>> resemble something like this: re.compile("<table.*>")?
>>>
>>> Thanks for the help.
>>> _______________________________________________
>>> Tutor maillist  -  Tutor at python.org
>>> http://mail.python.org/mailman/listinfo/tutor
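
On the regular-expression idea in the original question quoted above: a 
quick check shows why "<table.*>" is risky. The ".*" is greedy, so it runs 
to the last ">" on the line and takes the enclosed content with it. The 
sketch below compares it with a tighter pattern that matches only the tags 
themselves. The function names are made up for illustration, and even the 
tighter pattern can be fooled by a ">" inside an attribute value or a 
comment, which is one reason a parser is the sturdier choice.

```python
import re

# The pattern from the question: greedy, so it can swallow content.
GREEDY = re.compile(r"<table.*>")

# A tighter pattern: an opening or closing table/tr/td tag and nothing else.
# [^>]* stops at the first ">", and \b keeps "tr" from matching "<track>".
TAG_ONLY = re.compile(r"</?(?:table|tr|td)\b[^>]*>", re.IGNORECASE)


def greedy_strip(html):
    """Remove whatever the greedy pattern matches."""
    return GREEDY.sub("", html)


def tag_strip(html):
    """Remove only the table, tr and td tags, keeping what they enclose."""
    return TAG_ONLY.sub("", html)


sample = '<table border="0"><tr><td><b>Hello</b></td></tr></table>'
print(repr(greedy_strip(sample)))  # '' (the content is gone too)
print(repr(tag_strip(sample)))     # '<b>Hello</b>'
```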

-------------- next part --------------
A non-text attachment was scrubbed...
Name: replacetags.py
Type: text/x-python
Size: 660 bytes
Desc: not available
Url : http://mail.python.org/pipermail/tutor/attachments/20070727/6c7650ef/attachment.py 

