[XML-SIG] XML and Unicode

Mark Nottingham mnot@mnot.net
Wed, 23 May 2001 08:46:25 -0700


The funny character is the em dash in the middle of the data file. If
the parser really does assume UTF-8, that behaviour would be a bug,
no? Is there any kind of workaround possible (such as detecting the
encoding of the XML file outside of the parser and .encode()ing to
suit)?
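
A minimal sketch of that kind of workaround, written for present-day
Python 3 rather than the Python of this thread, and not tested against
the attached script. The helper names to_utf8 and collect_text are
hypothetical, and it assumes the file starts with an XML declaration
naming an ASCII-compatible encoding (no BOM or UTF-16 handling). It
reads the declared encoding with a regular expression, transcodes the
bytes to UTF-8, and swaps in a matching declaration before handing the
result to expat:

```python
import re
import xml.parsers.expat

# Matches an XML declaration that names an encoding, e.g.
# <?xml version="1.0" encoding="ISO-8859-1"?>
_DECL_RE = re.compile(
    rb'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\'][^>]*\?>'
)


def to_utf8(raw, default="utf-8"):
    """Hypothetical helper: transcode raw XML bytes to UTF-8.

    Assumes the declared encoding is one Python has a codec for.
    """
    match = _DECL_RE.match(raw)
    if match:
        declared = match.group(1).decode("ascii")
        body = raw[match.end():]  # everything after the old declaration
    else:
        declared, body = default, raw
    text = body.decode(declared)  # bytes -> encoding-neutral str
    # Emit a new declaration that agrees with the bytes that follow it.
    return b'<?xml version="1.0" encoding="utf-8"?>' + text.encode("utf-8")


def collect_text(raw):
    """Hypothetical helper: parse the transcoded bytes, return all text."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(to_utf8(raw), True)
    return "".join(chunks)


# A Latin-1 encoded document containing one non-ASCII character.
sample = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>'.encode(
    "latin-1"
)
result = collect_text(sample)
```

Note that the rewritten declaration always says version 1.0 and drops
any standalone attribute, which is fine for a quick test but not for
general use.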

Thanks again,


On Wed, May 23, 2001 at 09:38:14AM +0200, M.-A. Lemburg wrote:
> Mark Nottingham wrote:
> > 
> > OK, so I'm not getting something then. The attached test script (and
> > data file) is the problem pared down - if u'string' is
> > encoding-neutral, and .encode('utf-8') generates a UTF-8 encoded
> > byte string from it, then the utf-8.html output file should display
> > correctly; however, it doesn't, while the latin-1 output does
> > (because the input is latin-1).
> > 
> > It seems like the XML parser isn't converting the ISO-8859-1 to
> > Unicode; does this make sense?
> 
> That's a possibility (even though I don't see any funny characters
> in your example XML file); looking through the pyexpat.c code
> it seems as if the parser assumes that the XML file is encoded 
> as UTF-8 -- at least all Unicode conversions are done using UTF-8.
> 
> -- 
> Marc-Andre Lemburg
> CEO eGenix.com Software GmbH
> ______________________________________________________________________
> Company & Consulting:                           http://www.egenix.com/
> Python Software:                        http://www.lemburg.com/python/

-- 
Mark Nottingham
http://www.mnot.net/