[XML-SIG] Invalid character encoding handling in PyXML-0.8.4

Paweł Sakowski pawel at sakowski.pl
Thu Apr 14 21:59:05 CEST 2005


A simple test case:

$ LANG=pl_PL.ISO-8859-2 python
Python 2.4 (#1, Dec 23 2004, 10:29:41)
[GCC 3.3.5 (PLD Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.marshal import generic
>>> generic.dumps("piątek")
'<?xml version="1.0"?><marshal><string>pi\xb1tek</string></marshal>'

"\xb1" is the ISO 8859-2 encoding of "ą". However, the XML specification
makes it clear that "In the absence of external character encoding
information (such as MIME headers), parsed entities which are stored in
an encoding other than UTF-8 or UTF-16 MUST begin with a text
declaration (see 4.3.1 The Text Declaration) containing an encoding
declaration". The output above carries no encoding declaration, so a
parser has to treat it as UTF-8, in which a lone "\xb1" byte is invalid.
The XML obtained above is therefore not well-formed:

>>> generic.loads(generic.dumps("piątek"))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "xml/marshal/generic.py", line 321, in loads
    return m._load(file)
  File "xml/marshal/generic.py", line 331, in _load
    p.parseFile(file)
  File "xml/sax/drivers/drv_pyexpat.py", line 68, in parseFile
    if self.parser.Parse(buf, 0) != 1:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 40
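
The effect of the missing encoding declaration can be reproduced without
PyXML. The following is only a sketch, assuming a modern Python 3 and the
standard library's xml.parsers.expat module rather than the Python 2.4 and
PyXML setup shown above; parse_text and try_parse are helper names made up
for this illustration.

```python
import xml.parsers.expat

# The marshalled body from the session above, as ISO 8859-2 bytes.
BODY = b"<marshal><string>pi\xb1tek</string></marshal>"


def parse_text(document):
    """Parse a byte string with expat and return its character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)


def try_parse(document):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return parse_text(document), None
    except xml.parsers.expat.ExpatError as exc:
        return None, exc


# No encoding declaration: the parser assumes UTF-8 and rejects "\xb1".
undeclared = try_parse(b'<?xml version="1.0"?>' + BODY)

# With an encoding declaration the same bytes are accepted.
declared = try_parse(b'<?xml version="1.0" encoding="ISO-8859-2"?>' + BODY)

print(undeclared)
print(declared)
```

If this behaves as expected, only the declaration differs between the two
inputs, which suggests that dumps() would need either to emit an encoding
declaration or to produce UTF-8.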

I'd also like to make a related feature request:

>>> generic.dumps(u"czwartek")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/_xmlplus/marshal/generic.py", line 59, in dumps
  File "/usr/lib/python2.4/site-packages/_xmlplus/marshal/generic.py", line 104, in m_root
  File "/usr/lib/python2.4/site-packages/_xmlplus/marshal/generic.py", line 92, in _marshal
AttributeError: Marshaller instance has no attribute 'm_unicode'

Given XML's well-defined character encoding semantics, it would be
useful (and IMO pretty straightforward) to support unicode strings by
simply encoding them with the document's encoding.
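
A rough sketch of what I mean, written for a modern Python 3 and using only
the standard library. The names dumps_text and loads_text are hypothetical
and are not part of PyXML or xml.marshal; the only thing taken from the
real module is the <marshal><string> layout seen in the output above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def dumps_text(value, encoding="UTF-8"):
    """Hypothetical helper: marshal one text string into XML bytes.

    The document declares its encoding and is then encoded with it.
    Characters the encoding cannot represent become numeric character
    references, so the result remains well-formed.
    """
    declaration = '<?xml version="1.0" encoding="%s"?>' % encoding
    body = "<marshal><string>%s</string></marshal>" % escape(value)
    return (declaration + body).encode(encoding, "xmlcharrefreplace")


def loads_text(document):
    """Hypothetical helper: read the string back from the XML bytes."""
    return ET.fromstring(document).findtext("string")


latin2 = dumps_text("piątek", "ISO-8859-2")
utf8 = dumps_text("czwartek")

print(latin2)
print(loads_text(latin2))
print(loads_text(utf8))
```

Because the declaration and the byte encoding come from the same argument,
the output should round-trip through any conforming parser that supports
the chosen encoding.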

-- 
+----------------------------------------------------------------------+
| Paweł Sakowski <pawel at sakowski.pl>                Never trust a man  |
|                            who can count up to 1023 on his fingers.  |
+----------------------------------------------------------------------+


