[XML-SIG] PyExpat encoding (was: XML support in Python 1.6)

Andrew M. Kuchling akuchlin@mems-exchange.org
Thu, 1 Jun 2000 16:04:16 -0400


On Thu, Jun 01, 2000 at 12:56:28PM -0700, Greg Stein wrote:
>IMO, we should have a fixed output format, which is the Expat default:
>UTF-8.

I don't know; it seems a bit odd to parse a Unicode string and then
have to convert from an 8-bit encoding back to Unicode in your
character data handlers, attributes, etc.  The problem is that it's
also odd to parse a regular Python string and get back Unicode.  

OTOH, if Latin-1-encoded XML has something like
<!ENTITY unichar "&#1972;"> &unichar; in it, Unicode is the only thing
it could possibly return.  Maybe PyExpat could attempt to convert its
Unicode output into an 8-bit string (but using what encoding?), and
only return Unicode if it has to.
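
To make the Latin-1 case concrete, here is a rough sketch using the
xml.parsers.expat module as it exists in current Python, where handlers
receive str (Unicode) objects.  The helper to_8bit_if_possible() is
hypothetical, made up purely to illustrate the "8-bit if possible,
Unicode otherwise" idea; it is not part of PyExpat, and Latin-1 as the
target encoding is an assumption.

```python
from xml.parsers import expat

# A Latin-1 document whose internal entity expands to a character
# (code point 1972) that Latin-1 cannot represent.
doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<!DOCTYPE doc [<!ENTITY unichar "&#1972;">]>\n'
    '<doc>caf\xe9 &unichar;</doc>'
).encode("latin-1")

chunks = []
parser = expat.ParserCreate()
# Character data may arrive in several pieces, so collect and join.
parser.CharacterDataHandler = chunks.append
parser.Parse(doc, True)
text = "".join(chunks)


def to_8bit_if_possible(s, encoding="latin-1"):
    """Hypothetical helper: return bytes if s fits the 8-bit encoding,
    otherwise hand back the Unicode string unchanged."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError:
        return s


print(repr(to_8bit_if_possible("caf\xe9")))  # fits in Latin-1: bytes
print(repr(to_8bit_if_possible(text)))       # does not fit: stays str
```

The caller ends up with two possible return types depending on the
data, which is the oddity in question.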

Hmmm... on the third hand, XML is a Unicode-based standard, and
sometimes returning Unicode and sometimes an 8-bit string is also
strange.  Maybe it's best to just always return Unicode, and leave
further conversion to the caller.

I think I'd go for the third option: always returning Unicode strings.
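
A minimal sketch of what "always Unicode, caller converts" looks like
from the calling side, again written against the present-day
xml.parsers.expat module rather than the 1.6-era code under
discussion.  parse_text() is a hypothetical convenience function, not
an existing API.

```python
from xml.parsers import expat


def parse_text(data):
    """Hypothetical helper: return all character data in the document.

    data may be bytes (encoded XML) or str; either way the handler
    is given Unicode strings.
    """
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)


source = '<doc>na\xefve</doc>'
from_bytes = parse_text(source.encode("utf-8"))  # 8-bit input
from_str = parse_text(source)                    # Unicode input

# Same type and same value regardless of the input type.
print(type(from_bytes) is str, from_bytes == from_str)

# Any 8-bit form is the caller's choice, made explicitly.
as_utf8 = from_str.encode("utf-8")
as_latin1 = from_str.encode("latin-1")
```

The encoding decision moves out of the parser entirely, so the "but
using what encoding?" question is answered by whoever needs the bytes.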

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
I was somebody else once. I... I... don't think I was a very good person.
  -- The detective in THE MYSTERY PLAY