XML/HTML Encoding problem

Duncan Booth duncan.booth at invalid.invalid
Mon May 22 11:29:36 EDT 2006


Dale Strickland-Clark wrote:

> from xml.dom.minidom import parseString
> output = parseString(strHTML).toxml()
> 
> The output is:
> 
><?xml version="1.0" encoding="iso-8859-1"?>
><html>
><head>
><title/>
><meta content="text/html; charset=iso-8859-1"
>http-equiv="Content-Type"/> </head>
><body>
> €
></body>
></html>
> 
> So it encodes the entity reference to € (Euro sign).  I need it to
> remain as &#8364; so that the resulting HTML can render properly in a
> browser.  Is there a way to make the parser not convert the entity
> references?  Or is there a convenient post processing function that
> will do the conversion? 

First up, when I repeat what you did I don't get the same output. toxml() 
without an encoding argument returns a unicode string, with no encoding 
attribute in the <?xml ...?> declaration.
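
A quick way to check this for yourself (a minimal sketch, not taken from the original session; the one-element document and the names plain and encoded are made up for illustration, and it is written so that it should also run on Python 3, where the result without an encoding is a str):

```python
from xml.dom.minidom import parseString

# Tiny made-up document, used only to compare the two calls.
doc = parseString(b'<a>x</a>')

# No encoding argument: a text string comes back and the
# XML declaration carries no encoding attribute.
plain = doc.toxml()

# With an encoding argument: an encoded byte string comes back and
# the declaration names the encoding.
encoded = doc.toxml('iso-8859-1')

print(repr(plain))
print(repr(encoded))
```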

toxml() only takes a single encoding argument, so unfortunately there isn't 
any way to tell it what to do for unicode characters which are not 
supported in the encoding you are using. However, if you then encode the 
unicode output to ascii with entity escapes, I think you should be alright 
(unless I've missed something):

>>> from xml.dom.minidom import parseString
>>> strHTML = '''<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>'''
>>> print parseString(strHTML).toxml().encode('ascii', 'xmlcharrefreplace')
<?xml version="1.0" ?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
&#8364;
</body>
</html>
>>> 

You lose the encoding at the top of the output, but since the output is 
entirely ascii I don't think that matters.
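
If you want this as a post-processing step, the same idea can be wrapped in a small function. This is only a sketch: the name to_ascii_xml is hypothetical, the input document is cut down from the one above, and it is written so that it should also run on Python 3, where encode() returns bytes.

```python
from xml.dom.minidom import parseString

def to_ascii_xml(markup):
    """Hypothetical helper: reserialise markup as pure-ASCII XML.

    Any character outside ASCII is written as a numeric character
    reference, so the Euro sign comes out as &#8364;.
    """
    text = parseString(markup).toxml()  # text string, no encoding attribute
    return text.encode('ascii', 'xmlcharrefreplace')

# Cut-down version of the document from the session above.
source = b'''<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<body>
&#8364;
</body>
</html>'''

result = to_ascii_xml(source)
print(result.decode('ascii'))
```

The parser turns the character reference into the real Euro character, and the encode step turns it back into a reference on the way out.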


