[ANN] pyxser-1.2r --- Python-Object to XML serialization module

Daniel Molina Wegener dmw at coder.cl
Tue Aug 25 00:03:55 EDT 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Stefan Behnel <stefan_ml at behnel.de>
on Monday 24 August 2009 09:00
wrote in comp.lang.python:


> Daniel Molina Wegener wrote:
>> unicode objects are encoded into the
>> encoding that the XML document encoding has, and as you say, the whole
>> XML document has one encoding. There is no mixing of byte encoded strings
>> with different encodings in the output document.
> 
> Ok, that's what I hoped anyway. It just wasn't clear from your
> description.
> 
> 
>> When the object is restored, by using pyxser.unserialize:
>> 
>> pyobj = pyxser.unserialize(obj = xmldocstr, enc = "utf-8")
> 
> But this is XML, right? What do you need to pass the encoding for at this
> point?

  The user may want an encoding other than utf-8; it can be any
encoding supported by libxml2.

> 
> 
>> Another issue: if the byte string objects in your object tree mix
>> encodings, such as iso-8859-1 and utf-8, and you try to serialize that
>> object, pyxser will print serialization errors to stdout when it tries
>> to handle byte strings that do not match the document encoding.
> 
> There shouldn't be any serialisation errors (unless you try to recode byte
> strings on the way out, which is a no-no for arbitrary user input). All
> you have to do is properly escape the byte string so that it passes the
> XML encoding step.

  Yup, but if encodings are mixed inside Python byte strings, I think
there is no way to know which encoding each string uses. This may cause
XML serialization errors, because a string may be in a different encoding
than the one the user has set as the document encoding.
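
  A minimal sketch of the problem, written for Python 3 and using only
the standard codecs. It does not call pyxser; the sample strings and the
helper name decode_for_document are made up for illustration, and utf-8
is assumed as the document encoding.

```python
# Python 3 sketch; the sample strings and the helper name are hypothetical.
samples = {
    "ascii": "plain".encode("ascii"),
    "iso-8859-1": "caf\u00e9".encode("iso-8859-1"),
    "big5": "\u4e2d\u6587".encode("big5"),
}

def decode_for_document(raw, doc_encoding="utf-8"):
    """Return raw decoded as doc_encoding, or None if the bytes do not fit."""
    try:
        return raw.decode(doc_encoding)
    except UnicodeDecodeError:
        return None

# Try every sample against the single encoding chosen for the document.
decoded = {label: decode_for_document(raw) for label, raw in samples.items()}
for label, text in decoded.items():
    print(label, "ok" if text is not None else "not valid in the document encoding")
```

  A byte string carries no label saying how it was encoded, so the
mismatch only shows up when the bytes are decoded.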

> 
> One trick to do that is to decode the byte string as ISO-8859-1 and
> serialise the result as a normal Unicode string. Then you can re-encode
> the unicode string on input back to ISO-8859-1.
> 
> I choose ISO-8859-1 here because it has the well-defined side-effect of
> mapping byte values directly to Unicode characters with an identical code
> point value. So you do not risk any failures or data loss.

  Sure, but if the object tree holds Python byte strings (not Unicode
strings), some encoded in big5 and others in iso-8859-1, the XML
serialization would throw errors on the encoding conversion when those
bytes are placed inside the document...
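
  For reference, a small Python 3 sketch of the ISO-8859-1 round trip
quoted above, using only the standard codecs. The sample data is made
up, and this says nothing about what pyxser or libxml2 do internally. It
shows that the byte values survive the round trip, while the
intermediate text is not the text the big5 bytes were meant to
represent.

```python
# Python 3 sketch of the ISO-8859-1 round trip; sample data is made up.
original = "\u4e2d\u6587".encode("big5")      # bytes whose real encoding is big5

as_text = original.decode("iso-8859-1")       # each byte becomes U+0000..U+00FF
restored = as_text.encode("iso-8859-1")       # exact inverse of the step above

# The same holds for every possible byte value.
every_byte = bytes(range(256))
round_trip = every_byte.decode("iso-8859-1").encode("iso-8859-1")

print(restored == original)                   # do the bytes survive the round trip?
print(as_text == "\u4e2d\u6587")              # is the text the big5 reading?
```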

> 
> Stefan

  Thanks for commenting, and sorry for the late answer. Today was
busy...

Best regards,
- -- 
 .O. | Daniel Molina Wegener   | FreeBSD & Linux
 ..O | dmw [at] coder [dot] cl | Open Standards
 OOO | http://coder.cl/        | FOSS Developer
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (FreeBSD)

iQIcBAEBCgAGBQJKk2KrAAoJEHxqfq6Y4O5N6mAQAK6a121n6ZeTl/Xm/UqlFw3S
YyN0ZLY7qEgBPOz4NO3PC9DxQbo1F5/V4ZOLS86Tdc4OgRjq1dRHG2EoxzV7wQFJ
yBzGTWbgyOOX1lNlnBSVLvlPooz/pBBYa0WMejUG2Sa4xDVXSXDEA5aPFy3xrYHy
gj7zkWcBg3on0U+OC2l7Xdy4vIXmtTKSXMLQc01C2/yJdsNKGm2vOm8SG/gsEHo4
XYaOje3fI9WE9HUbeGrFMEnmPXpXTvodkdHY6mzRBqXws9K/ot5pz03R94boALKz
MAgc/eZbPTkxViy8N98G0d4aXutNWy3cEr4B9kk6c5ZjIhmGzpSers/MrqJS+LiY
t8O8d/1sT6uHQKKYOoFWCojagsDG3HFXClQyvNlqoZyj9IdN2fHNrrjmgCyeNdr3
njJNhfu7IVuZTtjwHQWscG2TVgh5slsTjEpzB/LR3V4Kt+x6Ptiy636LF7L4dqmm
7lL9dXhWRnZK7W8FzzzZDnk0qtJdsYRXsXdZ8opOqQwTnx47+HdwFClp88vbpASH
YvNVn76m/Jx37WfXUXVoPDVuiQHsWDPNn3anZ60d6pDOCK9x7A065f7OCtyjq12k
2sDR1RUMBbJ0u11m7+JxIqTdcut/cJS7piiSE95vqviob4jKOQgF9y5i5eUget60
uWmCsGjI65Naxq+BWFrb
=BJOF
-----END PGP SIGNATURE-----


