[XML-SIG] XML and Unicode
Mark Nottingham
mnot@mnot.net
Tue, 22 May 2001 19:33:18 -0700
--jI8keyz6grp/JLjh
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
OK, so I'm not getting something then. The attached test script (and
data file) is the problem pared down. If u'string' is an
encoding-neutral representation, and .encode('utf-8') generates a
UTF-8 encoded byte string from it, then the UTF-8.html output file
should display correctly; however, it doesn't, while the Latin-1.html
output does (because the input is Latin-1).
It seems like the XML parser isn't converting the ISO-8859-1 to
Unicode; does this make sense?
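
A minimal sketch of the round trip being asked about, written for a current Python 3 rather than the Python 2.0 the attached script targets. The sample bytes are made up for illustration. The second half rests on an assumption the message does not confirm: that the dash in the data file is stored as the single byte 0x96, which is an en dash in Windows-1252 but a C1 control character in ISO-8859-1.

```python
# Bytes as an ISO-8859-1 file would store "cafe" with an e-acute (0xE9).
latin1_bytes = b"caf\xe9"

# Decoding yields a Unicode string that no longer carries an encoding.
text = latin1_bytes.decode("iso-8859-1")

# Encoding again produces bytes specific to the chosen encoding.
utf8_bytes = text.encode("utf-8")
latin1_bytes_again = text.encode("iso-8859-1")

print(repr(utf8_bytes))          # b'caf\xc3\xa9'
print(repr(latin1_bytes_again))  # b'caf\xe9'

# ASSUMPTION: hypothetical raw bytes for the <content> text, with the
# dash stored as 0x96. The original message does not show the raw bytes.
suspect = b"Net 21 \x96 The Survivors"

as_latin1 = suspect.decode("iso-8859-1")  # what a conforming parser does
as_cp1252 = suspect.decode("cp1252")      # what many browsers do in practice

print(hex(ord(as_latin1[7])))  # 0x96   (C1 control character)
print(hex(ord(as_cp1252[7])))  # 0x2013 (en dash)
```

If that assumption holds, the parser would be decoding the declared ISO-8859-1 correctly, and the UTF-8 output would faithfully carry U+0096, which has no visible glyph; the Latin-1 output would only look right because the browser reads the raw 0x96 byte as Windows-1252.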
Thanks,
On Wed, May 23, 2001 at 12:38:34AM +0200, M.-A. Lemburg wrote:
> Mark Nottingham wrote:
> >
> > How does one detect the charset used in an XML document from a SAX2
> > parser (PyXML 0.6.5)?
> >
> > Also, if I have an XML document encoded ISO-8859-1 (and properly
> > identified), should I have a reasonable expectation that the output
> > of a SAX processor, post- .encode('utf-8'), should be correct if
> > viewed in a Web browser with UTF-8 selected as a character encoding?
>
> This should work...
>
> > In other words, is the post-parse unicode string a neutral
> > representation of the 8859-x string, which can then be encoded as
> > utf-8?
>
> Unicode is encoding neutral in the sense that it provides
> space for the characters of most scripts. If the parser returns
> Unicode, then you can encode it as UTF-8 and have the original
> contents of the attribute/element represented as a UTF-8 string.
>
> > Or, is it in the charset of the original XML document (my
> > testing seems to indicate the latter - what was a 8859 character in
> > the original text does not successfully come out the other side)?
> >
> > (Sorry if this is obtuse - just getting into i18n, and Python docs
> > are thin on the ground)
>
> --
> Marc-Andre Lemburg
> CEO eGenix.com Software GmbH
> ______________________________________________________________________
> Company & Consulting: http://www.egenix.com/
> Python Software: http://www.lemburg.com/python/
--
Mark Nottingham
http://www.mnot.net/
--jI8keyz6grp/JLjh
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="testuni.py"
#!/usr/bin/env python2.0
from xml import sax
import string

def run(i, e):
    dh = Parser()
    p = sax.sax2exts.make_parser()
    p.setContentHandler(dh)
    p.setFeature(sax.handler.feature_namespaces, 1)
    p.parse(i + '.xml')
    content = dh.content.encode(e)
    file = open(e + ".html", 'w')
    file.write(template % (e, content))
    file.close()

class Parser(sax.handler.ContentHandler):
    def __init__(self):
        self._tmp_buf = ''
        self.content = None

    def startElementNS(self, name, qname, attrs):
        pass

    def endElementNS(self, name, qname):
        if name[1] == 'content':
            self.content = string.strip(self._tmp_buf)

    def characters(self, content):
        self._tmp_buf = self._tmp_buf + content

template = """\
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=%s">
</head>
<body>
<p>%s</p>
</body>
</html>
"""

if __name__ == '__main__':
    run('ISO-8859-1', 'UTF-8')
    run('ISO-8859-1', 'Latin-1')
--jI8keyz6grp/JLjh
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: attachment; filename="ISO-8859-1.xml"
Content-Transfer-Encoding: 8bit
<?xml version="1.0" encoding="ISO-8859-1" ?>
<content>Net 21 – The Survivors</content>
--jI8keyz6grp/JLjh--