[XML-SIG] Encoding argument for toxml and toprettyxml

Wed, 03 Jul 2002 00:01:20 +0200

Martin v. Loewis wrote:

 > Walter Dörwald <walter@livinglogic.de> writes:
 >
 >> > Right. Does that mean that the "errors" argument of the StreamWriter
 >> > needs to be exposed,
 >>
 >>As long as the stream writer class is derived from codecs.StreamWriter
 >>the errors attribute is already exposed.
 >
 > No, it's not: toxml reads
 >
 >     def toprettyxml(self, indent="\t", newl="\n", encoding = None):
 >         writer = _get_StringIO()
 >         if encoding is not None:
 >             import codecs
 >             # Can't use codecs.getwriter to preserve 2.0 compatibility
 >             writer = codecs.lookup(encoding)[3](writer)
 >         if self.nodeType == Node.DOCUMENT_NODE:
 >             # Can pass encoding only to document, to put it into XML header
 >             self.writexml(writer, "", indent, newl, encoding)
 >         else:
 >             self.writexml(writer, "", indent, newl)
 >         return writer.getvalue()
 >
 > So the user never sees the stream writer, and can thus not adjust the
 > errors attribute.

If by "user" you mean the caller of toprettyxml(), the user does not
have to adjust the errors attribute; the writexml methods of the
various node classes do that:

def _write_data(writer, data, errors):
    "Writes datachars to writer."
    data = data.replace("&", "&amp;")
    data = data.replace("<", "&lt;")
    data = data.replace("\"", "&quot;")
    data = data.replace(">", "&gt;")
    writer.errors = errors
    writer.write(data)

class Text:
     def writexml(self, writer, indent="", addindent="", newl=""):
         _write_data(writer,
            "%s%s%s"%(indent, self.data, newl),
            "xmlcharrefreplace")

class Comment:
     def writexml(self, writer, indent="", addindent="", newl=""):
         writer.errors = "strict"
         writer.write("%s<!--%s-->%s" % (indent,self.data,newl))

Of course the above code only works when an encoding is specified,
so that writer is a StreamWriter. For the StringIO the assignment
to writer.errors should be wrapped in a
    try:
       ...
    except AttributeError:
       pass
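
A minimal sketch of that guard as a helper, written for current
Python 3 rather than the 2.2-era code above. The name set_errors is
hypothetical and not part of minidom or codecs; the sketch only assumes
that a codecs.StreamWriter reads self.errors on each write() call.

```python
import codecs
import io


def set_errors(writer, errors):
    """Try to set writer.errors; return True if the assignment was accepted."""
    try:
        writer.errors = errors
    except AttributeError:
        # The writer does not allow the attribute to be assigned: leave it alone.
        return False
    return True


# A StreamWriter wrapping a byte buffer: the mode can change between writes.
raw = io.BytesIO()
writer = codecs.getwriter("ascii")(raw)
set_errors(writer, "xmlcharrefreplace")
writer.write("caf\u00e9")      # text node: unencodable chars become &#NNN;
set_errors(writer, "strict")
writer.write("<!--ok-->")      # comment: must be encodable as-is

# A plain text buffer: the call is harmless whether or not it takes effect.
plain = io.StringIO()
set_errors(plain, "xmlcharrefreplace")
plain.write("caf\u00e9")
```

The helper returns a flag instead of raising, so writexml code can call
it unconditionally for both kinds of writer.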

 >>It just has to be documented that the errors attribute can be
 >>changed between calls to StreamWriter.write() to switch between
 >>different error handling modes.
 >
 > Is that really something you can guarantee for all stream writers?

I didn't find anything in the documentation

http://www.python.org/doc/current/lib/stream-writer-objects.html

about the errors attribute of the StreamWriter class, so we
can't guarantee anything now.

That's why PEP 293 should state that errors should be assignable in
the stream classes. I'll add it in the next few days.

I took a look at JapaneseCodecs 1.4.5: The StreamWriter/Reader
classes are derived from codecs.StreamWriter/codecs.StreamReader
(for both the C and Python based versions), so it'll work there.
I don't know anything about the other third party codecs.
Where can they be found?

 > What if the stream writer needs to be stateful, and the state
 > interpretation is affected by the error handling?

Using a custom error handling name should be completely equivalent
to applying the replacement of unencodable characters beforehand
and then using the strict method.

Currently I can't imagine a stream writer where the state management
depends on the error handling name. Can you give an example? The
PEP 293 machinery is explicitly designed to make state management
and error handling independent.
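
The equivalence claimed above can be checked with a small sketch for a
stateless codec, written for current Python 3. charref_then_strict is a
hypothetical helper that does the replacement by hand before a strict
encode; it tests one character at a time, so it says nothing about
stateful codecs.

```python
def charref_then_strict(text, encoding):
    """Replace unencodable characters by hand, then encode strictly."""
    pieces = []
    for ch in text:
        try:
            ch.encode(encoding, "strict")
            pieces.append(ch)
        except UnicodeEncodeError:
            # Same replacement text the xmlcharrefreplace handler produces.
            pieces.append("&#%d;" % ord(ch))
    return "".join(pieces).encode(encoding, "strict")


sample = "gr\u00fc\u00dfe \u20ac"
by_handler = sample.encode("latin-1", "xmlcharrefreplace")
by_hand = charref_then_strict(sample, "latin-1")
```

Both by_handler and by_hand should hold the same bytes, with only the
euro sign turned into a character reference.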

 >>If we use "xmlcharrefreplace" now, Python 2.2 or codecs that haven't
 >>been updated will fail with:
 >>
 >>ValueError: foo encoding error; unknown error handling code:
 >>xmlcharrefreplace
 >
 > Is this also the case for all third-party codecs?

JapaneseCodecs 1.4.5 does the following:

Python 2.2 (#1, Jan 30 2002, 17:32:28)
[GCC 2.96 20000731 (Red Hat Linux 7.1 2.96-98)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> u"gurk".encode("sjis", "gurk")
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "/usr/local/src/JapaneseCodecs-1.4.5/japanese/python/shift_jis.py", line 12, in encode
     raise UnicodeError, "unknown error handling"

i.e. it raises a UnicodeError instead of a ValueError. Catching
ValueError would still work, because issubclass(UnicodeError, ValueError),
but UnicodeError is not the correct exception here, because this is an
error in the provided arguments, not in the encoding algorithm. (Note
that I was the one who convinced Tamito to use UnicodeError instead of
ValueError; maybe I can convince him to change it back.)

 >>If we want escaping to work with Python 2.2 or old codecs, we could do
 >>something like this:
 >
 > I was rather looking for code like
 >
 > try:
 >   codecs.xmlcharrefreplace
 >   have_xmlcharrefreplace = 1
 > except AttributeError:
 >   have_xmlcharrefreplace = 0

or:

try:
   codecs.register_error
   have_xmlcharrefreplace = 1
except AttributeError:
   have_xmlcharrefreplace = 0

But this only checks if the error handling API is present in
Python. It does not check whether a certain codec uses the API.
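
A per-codec check could look like the following sketch, written for
current Python 3. handler_usable and its probe character are
assumptions for illustration only. Note the limitation: if the codec
can encode the probe directly (UTF-8, for example), the handler is
never exercised and the function returns True regardless.

```python
def handler_usable(encoding, errors, probe="\u20ac"):
    """Return True if `encoding` can encode `probe` using handler `errors`."""
    try:
        probe.encode(encoding, errors)
    except (LookupError, ValueError):
        # LookupError: unknown codec or handler name (Python 3).
        # ValueError: older codecs, or the handler could not cope.
        return False
    return True


ascii_ok = handler_usable("ascii", "xmlcharrefreplace")
ascii_bad = handler_usable("ascii", "gurk")
```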

 >>The same way we find out if it's a valid encoding: just try it.
 >>If it doesn't work, catch the ValueError exception (for Python 2.2).
 >
 > That's not good enough. If the minidom implementation uses it, it
 > better be sure that it, if not supported, has the same meaning like
 > the "strict" error mode.

How about the following code? This will raise the same exception
if assigning the errors attribute or using it doesn't work:

try:
    try:
        writer.errors = errors
    except AttributeError:
        pass
    writer.write(data)
except ValueError:
    try:
        writer.errors = "strict"
    except AttributeError:
        pass
    writer.write(data)
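
The same idea as a self-contained function, sketched for current
Python 3. write_with_fallback is a hypothetical name. One deliberate
difference from the code above: Python 3 reports an unknown handler
name as LookupError rather than ValueError, so that is what the sketch
catches.

```python
import codecs
import io


def write_with_fallback(writer, data, errors):
    """Write data using `errors`; retry in strict mode if that mode is rejected."""
    for mode in (errors, "strict"):
        try:
            writer.errors = mode
        except AttributeError:
            pass  # writer has no assignable errors attribute
        try:
            writer.write(data)
            return mode  # report which mode was actually used
        except LookupError:
            # Unknown handler name: fall through and retry with "strict".
            if mode == "strict":
                raise


buf = io.BytesIO()
w = codecs.getwriter("ascii")(buf)
used = write_with_fallback(w, "a\u00e9", "xmlcharrefreplace")
```

With an unknown handler name and unencodable data, the retry in strict
mode raises the usual UnicodeEncodeError, which is the behaviour asked
for above: unsupported means the same as "strict".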

BTW, thanks for your comments. The PEP did go unnoticed in python-dev
and python-list so far.

Bye,
    Walter Dörwald