XML/HTML Encoding problem

Dale Strickland-Clark dale at riverhall.nospam.co.uk
Mon May 22 11:00:48 EDT 2006


A colleague has asked me this and I don't know the answer. Can anyone here
help with this? Thanks in advance.

Here is his email:

I am trying to parse an HTML document using the xml.dom.minidom parser and
then output a valid HTML document, all using the ISO-8859-1 charset.
For example:

My input:
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>

Desired output:
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>

Note that it doesn't matter if the '<?xml version="1.0"
encoding="ISO-8859-1"?>' header gets stripped.  What does matter is that the
input document has the 'ISO-8859-1' charset and is an ANSI encoded file.

The problem I get is that when I run, for example:

from xml.dom.minidom import parseString
output = parseString(strHTML).toxml()

The output is:

<?xml version="1.0" encoding="iso-8859-1"?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
€
</body>
</html>

So it converts the character reference &#8364; to a literal € (Euro sign).  I
need it to remain as &#8364; so that the resulting HTML renders properly in a
browser.  Is there a way to stop the parser converting character references?
Or is there a convenient post-processing function that will convert them back?
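
One possible post-processing step, sketched here under some assumptions: it
uses Python 3 syntax, and the SAMPLE document and the helper name
to_latin1_with_charrefs are made up for illustration. The parser always
resolves character references while reading, so the idea is to restore them
when writing: encode the serialized text as ISO-8859-1 with the standard
"xmlcharrefreplace" error handler, which writes any character the charset
cannot represent as a numeric character reference.

```python
from xml.dom.minidom import parseString

# Hypothetical sample input (not the real document); 8364 is the decimal
# code point of the Euro sign, which ISO-8859-1 cannot represent directly.
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<html><body>&#8364;</body></html>'
)


def to_latin1_with_charrefs(markup):
    """Parse markup and return it serialized as ISO-8859-1 bytes.

    Characters outside ISO-8859-1 are written as numeric character
    references instead of raising UnicodeEncodeError.
    """
    # Serializing from documentElement leaves out the XML declaration.
    text = parseString(markup).documentElement.toxml()
    return text.encode("iso-8859-1", "xmlcharrefreplace")


result = to_latin1_with_charrefs(SAMPLE)
print(result)
```

Note that this does not preserve the original spelling of each reference: every
character outside ISO-8859-1 comes out in decimal form, whether the input used
a decimal reference, a hexadecimal one, or none at all.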

-- 
Dale Strickland-Clark
Riverhall Systems www.riverhall.co.uk



