[XML-SIG] XML Unicode and UTF-8

"Martin v. Löwis" martin at v.loewis.de
Thu Aug 5 15:30:48 CEST 2004


n.youngman at ntlworld.com wrote:
> Sorry, I missed a key point out. Segment[0] is the decoded part of
> the output from email.Header.decode_header(). I believed this was a
> unicode string, but checking back in the documentation it doesn't
> actually say that, so I guess at least part of the problem is I'm
> getting some sort of binary data, which I thought was Unicode, but
> isn't.

Indeed. decode_header gives you a list of (byte string, encoding) pairs
precisely because it does not attempt to decode them. In turn, it
does not try to decode them because Python might not have a codec
for some of the encodings. Generally, you would do

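from email import Header

def u_decode_header(header):
   result = []
   for h, enc in Header.decode_header(header):
       # enc is None for parts that were not encoded; treat them as ASCII
       result.append(h.decode(enc or 'us-ascii'))
   return u"".join(result)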

which will raise a LookupError if there is an unsupported encoding.
As you are going to put the header into an XML document, you really
have little choice what to do in that case - if giving up is not
acceptable,

      try:
        result.append(h.decode(enc))
      except LookupError:
        result.append(h.decode('us-ascii', 'replace'))

might be your next best choice: this will assume that any encoding
is an ASCII superset, and replace all non-ASCII bytes with U+FFFD,
the Unicode replacement character (not a literal question mark).
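
Putting the two pieces together, here is a minimal sketch of the whole
helper. It assumes Python 3 and its email.header module rather than the
email.Header module used above, and the name decode_header_to_text is
made up for illustration. In Python 3, decode_header hands back a plain
str when the header contains no encoded words, so the sketch passes
those through untouched.

```python
from email.header import decode_header


def decode_header_to_text(header):
    """Hypothetical helper: turn an RFC 2047 header into one text string."""
    parts = []
    for chunk, charset in decode_header(header):
        if isinstance(chunk, str):
            # Python 3 returns str as-is when nothing in the header was encoded.
            parts.append(chunk)
            continue
        try:
            # charset is None for unencoded pieces; treat those as ASCII.
            parts.append(chunk.decode(charset or "us-ascii"))
        except LookupError:
            # No codec for this charset: assume an ASCII superset and
            # let the 'replace' handler mark every undecodable byte.
            parts.append(chunk.decode("us-ascii", "replace"))
    return "".join(parts)
```

Note that a UnicodeDecodeError (bytes that are invalid in a charset
Python does know) still propagates from this sketch; only the missing
codec case is handled.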

All that decode_header does is decode the transfer encoding (i.e.
Q or B).
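
To illustrate, here is a small sketch, again assuming Python 3's
email.header module. The two sample headers are made up: they carry
the same five bytes, once in Q and once in B form. decode_header
undoes the transfer encoding in both cases and leaves the charset
decoding to you.

```python
from email.header import decode_header

# The same ISO-8859-1 bytes, once Q-encoded and once B-encoded (base64).
q_form = "=?iso-8859-1?q?L=F6wis?="
b_form = "=?iso-8859-1?b?TPZ3aXM=?="

q_pairs = decode_header(q_form)
b_pairs = decode_header(b_form)

# Both results hold raw bytes plus the charset name, not decoded text.
print(q_pairs)
print(b_pairs)
```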

>>> Leaves binary data in the document. I have assumed that this was
>>> raw Unicode, may be that's a flawed assumption?
[...]
> XML doesn't, Python does. If I ask it to print without encoding it, I
> don't know whether it's passed through unchanged. Raw Unicode seems
> to me like a reasonable term for the data in a unicode string.

Ah, that. Don't worry about the internal representation of a Unicode
string. It may use 2 or 4 bytes per character, and be big or little
endian. You are never going to see that directly, as there is *always*
an encoding going on to convert the Unicode object into a byte string.
Of course, you could create a buffer object to really find out, but
that should not be done.
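
A short sketch of what "always an encoding" means in practice. It
assumes Python 3, where str plays the role of the unicode type
discussed here. The same string yields different bytes depending on
the encoding you ask for, and none of them is the "raw" form.

```python
text = "L\u00f6wis"

# Each explicit encoding produces its own byte sequence.
utf8 = text.encode("utf-8")
latin1 = text.encode("iso-8859-1")
utf16le = text.encode("utf-16-le")

for name, data in [("utf-8", utf8), ("iso-8859-1", latin1), ("utf-16-le", utf16le)]:
    print(name, len(data), data)
```

For an XML document, the bytes you write out have to match the
encoding named in the XML declaration (UTF-8 if none is given).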

> You have neatly pinpointed where I was confused. Your assistance is
> much appreciated.

You are welcome!

Martin

