[XML-SIG] Encoding argument for toxml and toprettyxml

Mon, 01 Jul 2002 19:18:38 +0200

Martin v. Löwis wrote:

 > Walter Dörwald <walter@livinglogic.de> writes:
 >
 >>I like it. Even better would be if those methods were able to escape
 >>unencodable characters with character references inside Text nodes.
 >>That's exactly what PEP 293 was made for.
 >
 > Right. Does that mean that the "errors" argument of the StreamWriter
 > needs to be exposed,

As long as the stream writer class is derived from codecs.StreamWriter
the errors attribute is already exposed.

It just has to be documented that the errors attribute can be
changed between calls to StreamWriter.write() to switch between
different error handling modes.

I should probably add this to PEP 293.
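
A minimal sketch of switching modes between writes, assuming a
current Python 3 where codecs.StreamWriter.write() consults
self.errors on every call (the sample string and variable names are
only for illustration):

```python
import codecs
import io

buffer = io.BytesIO()
writer = codecs.getwriter("ascii")(buffer, errors="strict")

# Strict mode, as one would want for comments or tag names:
# an unencodable character raises instead of being written.
try:
    writer.write("caf\u00e9")
    strict_failed = False
except UnicodeError:
    strict_failed = True

# Switch the same writer to character references for Text content.
writer.errors = "xmlcharrefreplace"
writer.write("caf\u00e9")

result = buffer.getvalue()
```

Here the strict write fails before anything reaches the buffer, so
only the second write contributes to result.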

 > or should we silently use "xmlcharrefreplace" in
 > 2.3?

If we use "xmlcharrefreplace" now, Python 2.2 or codecs that haven't 
been updated will fail with:

ValueError: foo encoding error; unknown error handling code: xmlcharrefreplace

instead of:

UnicodeError: foo encoding error: ordinal not in range

as it is now. I guess this shouldn't be a problem. We could add
a comment that will be visible in the stack trace, e.g. for
minidom:

def _write_data(writer, data, errors):
    "Writes datachars to writer."
    data = data.replace("&", "&amp;")
    data = data.replace("<", "&lt;")
    data = data.replace("\"", "&quot;")
    data = data.replace(">", "&gt;")
    writer.errors = errors
    # The comment stays on the write line so that it shows up in the traceback.
    writer.write(data) # if this fails with an "unknown error handling code" exception for Text content, you should use Python 2.3 or an encoding that can encode all characters in your document. If it fails with a UnicodeError, you have unencodable characters in comments, processing instructions, or tag names.

For comments or processing instructions we could define new error
handling names that raise a different exception and tell the user
that unencodable characters are not allowed in comments etc.

If we want escaping to work with Python 2.2 or old codecs, we could do 
something like this:

    try:
        writer.errors = errors
        writer.write(data)
    except ValueError:
        # unknown error handling name
        # i.e. Python 2.2 or an old codec
        if errors != "xmlcharrefreplace":
            raise # we're in a comment or procinst
        else:
            # use the slow workaround; go back to strict mode first,
            # otherwise the unknown name would raise ValueError again
            writer.errors = "strict"
            for c in data:
                try:
                    writer.write(c)
                except UnicodeError:
                    writer.write(u"&#%d;" % ord(c))

But this will be slow and it isn't 100% safe, because the first
writer.write(data) call might have written partial data. (It is safe
with the PEP 293 method, because writing never has to be retried.)
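
One way to sidestep the partial-write problem in such a workaround,
sketched for a current Python 3 and not taken from minidom, is to
encode into memory first and hand the stream one complete chunk. The
helper name write_escaped is hypothetical, and encoding character by
character is only sound for stateless encodings such as ASCII or
Latin-1 (UTF-16, for example, would emit a BOM on every call):

```python
import codecs
import io

def write_escaped(stream, data, encoding):
    """Encode data one character at a time, then write it in one go."""
    encode = codecs.getencoder(encoding)
    chunks = []
    for ch in data:
        try:
            encoded, _ = encode(ch, "strict")
        except UnicodeError:
            # Fall back to a decimal character reference.
            encoded, _ = encode("&#%d;" % ord(ch), "strict")
        chunks.append(encoded)
    # Nothing reaches the stream until every character has been encoded.
    stream.write(b"".join(chunks))

out = io.BytesIO()
write_escaped(out, "a\u20acb", "ascii")
```

This is still slow, but a failure can no longer leave half of the
data in the output.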

 > If so, how do we find out whether this is a valid error mode?

The same way we find out if it's a valid encoding: just try it.
If it doesn't work, catch the ValueError exception (for Python 2.2).

Unfortunately a ValueError is not very specific and might have
a different source.

For Python 2.3 a LookupError will be raised by new codecs, but
"xmlcharrefreplace" is supported by all builtin codecs anyway.

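With the error handler registry from PEP 293, the "just try it" check
can also be made without writing anything. A small sketch, assuming a
Python where codecs.lookup_error() is available (the function name
error_handler_known is only illustrative; a codec that has not been
updated may ignore the registry, so this says nothing about old
codecs):

```python
import codecs

def error_handler_known(name):
    """Return True if name is a registered error handler."""
    try:
        codecs.lookup_error(name)
    except LookupError:
        return False
    return True

known = error_handler_known("xmlcharrefreplace")
unknown = error_handler_known("no-such-handler")
```
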
Bye,
    Walter Dörwald