[XML-SIG] PyExpat encoding

tpassin@home.com
Thu, 1 Jun 2000 22:11:11 -0400


With all the talk about default encodings, compiling for different
encodings, and passing "unicode" to Python objects, I'm losing track (or my
grip :) ).  Do these considerations affect the ability of either pyexpat or
other Python XML code to handle the basic required encodings, per the
XML 1.0 Rec:

"All XML processors must accept the UTF-8 and UTF-16 encodings of 10646"

also,

"In the absence of information provided by an external transport protocol
(e.g. HTTP or MIME), it is an error for an entity including an encoding
declaration to be presented to the XML processor in an encoding other than
that named in the declaration, for an encoding declaration to occur other
than at the beginning of an external entity, or for an entity which begins
with neither a Byte Order Mark nor an encoding declaration to use an
encoding other than UTF-8. Note that since ASCII is a subset of UTF-8,
ordinary ASCII entities do not strictly need an encoding declaration."

Since there is a lot of XML out there without encoding declarations, it
would seem that UTF-8 would HAVE to be used as the default.  I admit, I
can't find anything in the Rec that says what encoding a processor must use
to send results to other pieces of code.  And don't these other pieces of
code also constitute XML processors, so that they should follow the same
rules?

I'd appreciate some enlightenment in this area - if expat/pyexpat are compiled
to "use" encoding X, how does this fact interact with the Rec's
requirements?  Or doesn't it?
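
To make the question concrete, here is a minimal sketch of the check I have in
mind. It assumes a current Python in which pyexpat is exposed as
xml.parsers.expat and hands str objects to callbacks, which postdates this
thread; the helper name parse_text is made up for illustration. It feeds the
same document to the parser once as UTF-8 with no declaration and once as
UTF-16 with a Byte Order Mark, the two cases the Rec says must be accepted:

```python
import xml.parsers.expat

def parse_text(data):
    """Parse raw bytes and return the document's character data as one string."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()  # no encoding override given
    # Character data may arrive in several pieces, so collect and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)

doc = "<doc>caf\u00e9</doc>"
utf8_text = parse_text(doc.encode("utf-8"))    # no BOM, no encoding declaration
utf16_text = parse_text(doc.encode("utf-16"))  # Python's codec prepends a BOM
print(utf8_text == utf16_text)
```

If both calls yield the same text, the input side of the Rec is satisfied
regardless of what the callbacks receive internally.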

Tom Passin

Greg Stein (and lots of others) wrote:

> On Thu, 1 Jun 2000, Andrew M. Kuchling wrote:
> > On Thu, Jun 01, 2000 at 02:28:06PM -0700, Greg Stein wrote:
> > >In other words, I think you're confusing the character set that Expat
> > >operates with (Unicode) with the encoding of that charset (UTF-8 or
> > >UTF-16; the latter is used by the Unicode object).
> >
> > Perhaps; I'm asking what's the Python type of the Python objects
> > passed to callbacks used by Expat.
>
> Right. I'm saying that it can be either, depending on how Expat was built.
>
> > >Expat is characterized by its speed. Throwing conversions in there is
> > >not going to help.
> >
> > I thought Paul said Expat could be compiled to return 16-bit Unicode.
> > Or... damn, does it return UCS-2 and we need UTF-16?  <looks at
> > xmlparse.h> In Expat 1.1, it looks to me that if you #define
> > XML_UNICODE, and don't #define XML_UNICODE_WCHAR_T, Expat will return
> > "UTF-16 encoded as unsigned shorts".  Wouldn't that be just what we
> > need to return Unicode objects?
>
> Python is the same: UTF-16 encoded as unsigned shorts.
>
> > On the other hand, that means you can't use the system's copy of
> > Expat, since who knows what it was compiled with?
>
> Bingo. My point exactly. By default, Expat is going to be built using
> UTF-8 for the output.
>
> > Actually, this
> > seems like a bug in Expat; if I have an Expat library, I have no way
> > of figuring out what it'll be outputting: C 'char's containing UTF-8,
> > unsigned short holding UTF-16, or wchar_t holding UTF-16.  (Argh, my
> > head explodes every time character encodings come up.)
>
> Eek. You're right. This can be determined at compile-time, so we can Do
> The Right Thing when building pyexpat. But things will be hosed if
> somebody drops in a libexpat.a that was compiled differently.
>
> Bleh. This says we should simply depend on it being compiled to output
> UTF-8, or we should include a copy of the library. The latter is already
> "not recommended" by the BDFL, so we can only assume that Expat will
> return UTF-8.
>
> This still doesn't discount pyexpat from having a setting to do a decoding
> on the UTF-8 text and calling into Python with Unicode objects.
>
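
The "setting" suggested in that last paragraph could look roughly like the
sketch below. This is not pyexpat's API: decode_utf8_args is a hypothetical
wrapper, and the call at the end only simulates what a UTF-8 build of Expat
would pass to a callback.

```python
def decode_utf8_args(handler):
    """Wrap a callback so any bytes arguments are decoded from UTF-8 first."""
    def wrapper(*args):
        decoded = [a.decode("utf-8") if isinstance(a, bytes) else a for a in args]
        return handler(*decoded)
    return wrapper

received = []
on_text = decode_utf8_args(received.append)

# Simulated callback invocation with raw UTF-8 bytes (hypothetical input).
on_text("caf\u00e9".encode("utf-8"))
print(received)
```

The cost is one decode per callback argument, which is the conversion
overhead the thread is worried about, so making it optional seems reasonable.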