convert xhtml back to html

John Krukoff jkrukoff at ltgc.com
Thu Apr 24 13:16:35 EDT 2008


> -----Original Message-----
> From: python-list-bounces+jkrukoff=ltgc.com at python.org [mailto:python-
> list-bounces+jkrukoff=ltgc.com at python.org] On Behalf Of Tim Arnold
> Sent: Thursday, April 24, 2008 9:34 AM
> To: python-list at python.org
> Subject: convert xhtml back to html
> 
> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop
> to
> create  CHM files. That application really hates xhtml, so I need to
> convert
> self-ending tags (e.g. <br />) to plain html (e.g. <br>).
> 
> Seems simple enough, but I'm having some trouble with it. regexps trip up
> because I also have to take into account 'img', 'meta', 'link' tags, not
> just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
> that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm
> not
> enough of a regexp pro to figure out that lookahead stuff.
> 
> I'm not sure where to start now; I looked at BeautifulSoup and
> BeautifulStoneSoup, but I can't see how to modify the actual tag.
> 
> thanks,
> --Tim Arnold
> 
> 
> --
> http://mail.python.org/mailman/listinfo/python-list


One method which wouldn't require much python code, would be to run the
XHTML through a simple identity XSL tranform with the output method set to
HTML. It would have the benefit that you wouldn't have to worry about any of
the specifics of the transformation, though you would need an external
dependency.

As far as I know, both 4suite and lxml (my personal favorite:
http://codespeak.net/lxml/) support XSLT in python. 

It might work out fine for you, but mixing regexps and XML always seems to
work out badly in the end for me.
---------
John Krukoff
jkrukoff at ltgc.com




More information about the Python-list mailing list