[ANN] pyxser-1.2r --- Python-Object to XML serialization module

Tue Aug 25 01:11:01 EDT 2009

Daniel Molina Wegener wrote:
> Stefan Behnel <stefan_ml at behnel.de> wrote:
>> Daniel Molina Wegener wrote:
>>> When the object is restored, by using pyxser.unserialize:
>>>
>>> pyobj = pyxser.unserialize(obj = xmldocstr, enc = "utf-8")
>> But this is XML, right? What do you need to pass the encoding for at this
>> point?
> 
>   The user may want an encoding other than utf-8; it can be any
> encoding supported by libxml2.

I really meant what I wrote: this is XML. The encoding is well defined in
the XML declaration at the start of the document (and will default to UTF-8
if not provided). Passing it externally will allow users to override that,
which doesn't make any sense at all.
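
A minimal sketch of that point, assuming Python 3 and the standard
library's xml.etree.ElementTree as a stand-in parser (this is not the
pyxser API, and the element name and sample text are made up for
illustration). The same text is serialised in two encodings, and the
parser recovers it from the XML declaration alone:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, serialised twice with a matching declaration.
body = "<name>Jos\u00e9</name>"
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body.encode("iso-8859-1")
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?>' + body.encode("utf-8")

# No encoding argument is passed anywhere: the parser reads the
# declaration at the start of each byte string and decodes accordingly.
from_latin1 = ET.fromstring(latin1_doc).text
from_utf8 = ET.fromstring(utf8_doc).text

print(from_latin1 == from_utf8)
```

Both calls yield the same unicode string, so an external encoding
argument would have nothing left to decide.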

> if the encodings are mixed among the Python byte strings, I think
> that there is no way to know which encoding each of them uses.

Correct.

> This may cause XML serialization errors

Yes, but only if you try to recode the strings (which, as I said, is a no-no).

>> One trick to do that is to decode the byte string as ISO-8859-1 and
>> serialise the result as a normal Unicode string. Then you can re-encode
>> the unicode string on input back to ISO-8859-1.
> 
>> I choose ISO-8859-1 here because it has the well-defined side-effect of
>> mapping byte values directly to Unicode characters with an identical code
>> point value. So you do not risk any failures or data loss.
> 
>   Sure, but if there are Python byte strings (not Unicode strings), some
> encoded in big5 and others in iso-8859-1 inside the object tree, the
> XML serialization would throw errors on the encoding conversion when
> setting those bytes inside the document...

No, I really meant: decoding from ISO-8859-1 to Unicode, for all byte
strings, regardless of their encoding (since you can't even know if they
represent encoded text at all). So you get a unicode string that you can
serialise to the target encoding, although it may result in character
references (&#xyz;) being output. But you won't get any errors, at least.

On the way in, you get a unicode string again, which you can encode to
ISO-8859-1 to get the original byte string back.
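
The round trip described above can be sketched as follows, again
assuming Python 3 and xml.etree.ElementTree instead of pyxser. The
helper names wrap and unwrap and the <value> element are hypothetical.
The sample payloads only contain bytes of 0x20 and above, since XML 1.0
does not allow most control characters:

```python
import xml.etree.ElementTree as ET

def wrap(raw: bytes) -> bytes:
    """Store an opaque byte string as text in a (hypothetical) <value> element."""
    elem = ET.Element("value")
    # ISO-8859-1 maps every byte value to the code point with the same
    # number, so this decode cannot fail whatever the bytes really mean.
    elem.text = raw.decode("iso-8859-1")
    # An ASCII target forces every non-ASCII character to be written as
    # a numeric character reference.
    return ET.tostring(elem, encoding="us-ascii")

def unwrap(xml_doc: bytes) -> bytes:
    """Recover the original byte string from the parsed unicode text."""
    return ET.fromstring(xml_doc).text.encode("iso-8859-1")

# Byte strings in different encodings, as they might be mixed in one object tree.
payloads = [
    "\u4e2d\u6587".encode("big5"),
    "caf\u00e9".encode("utf-8"),
    "caf\u00e9".encode("iso-8859-1"),
]

documents = [wrap(p) for p in payloads]
restored = [unwrap(d) for d in documents]

print(restored == payloads)
```

The serialiser never needs to know which encoding each byte string
used; it only sees unicode text, and the original bytes come back
unchanged on input.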

Stefan