convert xhtml back to html

Gary Herron gherron at islandtraining.com
Thu Apr 24 12:11:50 EDT 2008


Tim Arnold wrote:
> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to 
> create  CHM files. That application really hates xhtml, so I need to convert 
> self-ending tags (e.g. <br />) to plain html (e.g. <br>).
>
> Seems simple enough, but I'm having some trouble with it. regexps trip up 
> because I also have to take into account 'img', 'meta', 'link' tags, not 
> just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do 
> that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not 
> enough of a regexp pro to figure out that lookahead stuff.
>
> I'm not sure where to start now; I looked at BeautifulSoup and 
> BeautifulStoneSoup, but I can't see how to modify the actual tag.
>
> thanks,
> --Tim Arnold
>
>
I don't know whether an existing application does what you want, but I 
can at least say this much.

You should not be reading and parsing the text yourself!  XHTML is valid 
XML, and there are lots of ways to read and parse XML with Python.  
(ElementTree is what I use, but other choices exist.)  Once you use an 
existing package to read your files into an internal tree structure, 
it should be a relatively easy job to traverse the tree and emit the 
tags and text you want.
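
Here is a minimal sketch of that tree-walking approach using the standard 
library's xml.etree.ElementTree. The names VOID_TAGS, local_name, emit and 
xhtml_to_html are made up for this illustration, and the list of void tags 
is only a partial guess at what your pages contain. It assumes the input is 
well-formed XML; it drops the DOCTYPE and comments, and the parser will 
likely reject named entities such as &nbsp; unless they are defined, so 
treat it as a starting point rather than a finished converter.

```python
import xml.etree.ElementTree as ET
from html import escape

# Elements that HTML writes without a closing tag (partial list, an assumption).
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


def local_name(tag):
    """Drop an XML namespace prefix: '{http://www.w3.org/1999/xhtml}br' -> 'br'."""
    return tag.rsplit("}", 1)[-1]


def emit(elem, out):
    """Append the HTML form of elem and its subtree to the list 'out'."""
    name = local_name(elem.tag)
    attrs = "".join(
        ' %s="%s"' % (local_name(key), escape(value, quote=True))
        for key, value in elem.attrib.items()
    )
    out.append("<%s%s>" % (name, attrs))
    if name not in VOID_TAGS:
        # Ordinary element: text, children, then an explicit closing tag.
        if elem.text:
            out.append(escape(elem.text, quote=False))
        for child in elem:
            emit(child, out)
        out.append("</%s>" % name)
    # Text that follows this element inside its parent.
    if elem.tail:
        out.append(escape(elem.tail, quote=False))


def xhtml_to_html(source):
    """Parse an XHTML string and return it re-serialized with HTML-style tags."""
    root = ET.fromstring(source)
    out = []
    emit(root, out)
    return "".join(out)


if __name__ == "__main__":
    sample = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        '<p>a<br />b</p><img src="x.png" alt="x" />'
        "</body></html>"
    )
    print(xhtml_to_html(sample))
```

Because the parser builds the tree first, 'img', 'meta' and 'link' are 
handled the same way as 'br' and 'hr', with no per-tag regexp needed.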


Gary Herron



