Problem with minidom and special chars in HTML

Fredrik Lundh fredrik at pythonware.com
Tue Feb 22 11:55:50 EST 2005


Horst Gutmann wrote:

> I currently have quite a big problem with minidom and special chars (for example ü) in HTML.
>
> Let's say I have the following input file:
> --------------------------------------------------
> <?xml version="1.0"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
>             "http://www.w3.org/TR/html4/strict.dtd">
> <html>
> <body>
> ü
> </body>
> </html>
> --------------------------------------------------

> test3.html only has a blank line where the ü should be. It is simply
> removed.
>
> Any idea how I could solve this problem?

umm.  doesn't that doctype point to an SGML DTD?  even if minidom did fetch
external DTDs (I don't think it does), it would probably choke on that DTD.

running your documents through "tidy -asxml -numeric" before parsing them as
XML might be a good idea...

    http://tidy.sourceforge.net/ (command-line binaries, library)
    http://utidylib.berlios.de/ (python bindings)
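if installing tidy is not an option, a rough stdlib-only approximation of the
"-numeric" step might look like the sketch below.  this is not tidy and does
none of its markup cleanup; it only rewrites named HTML entities as numeric
character references, which an XML parser accepts without any DTD.  it assumes
the ü in the input file was written as the named entity &uuml;, and it uses the
Python 3 module html.entities (htmlentitydefs in Python 2).  the helper name
numeric_entities is made up for this example.

```python
import re
from html.entities import name2codepoint
from xml.dom import minidom

# XML itself predefines these five entities, so they can stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def numeric_entities(text):
    """Rewrite named HTML entities (&uuml;) as numeric references (&#252;).

    Hypothetical helper: a stand-in for tidy's -numeric option only.
    """
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined or unknown names alone
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)


# Same shape as the input file in the question, with the entity spelled out.
source = """<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
            "http://www.w3.org/TR/html4/strict.dtd">
<html>
<body>
&uuml;
</body>
</html>
"""

doc = minidom.parseString(numeric_entities(source))
body = doc.getElementsByTagName("body")[0]

# Collect the text children of <body>; the character survives the parse.
body_text = "".join(
    node.data for node in body.childNodes if node.nodeType == node.TEXT_NODE
)
print(repr(body_text.strip()))
```

real-world HTML usually has more problems than entities (unclosed tags,
unquoted attributes), which is why running it through tidy first is still the
safer route.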

</F> 





