unicode and xml/xsl

Mon Aug 9 13:40:00 EDT 2004

Hello,

I'm a python (& xml, & unicode!) newbie working on an interface to a
bibliographic reference server (refdb); I'm running into some encoding
problems & am finding the plethora of tools a little confusing.  Here
is the basic situation:

I connect to the server and receive an xml document whose content is a
bibliographic dataset.  The document can be encoded in two ways:
ISO-8859-1 or unicode.  My program simply takes the document and
passes it to an xsl stylesheet using libxslt & libxml2.  Here's the
relevant code:

# this is how I get the results & generate either a string or a
# unicode string
    def getref (self, query = ':ID:>0',  cmd = 'getref ', 
                reftype = default_reftype): 
        cmd += ' ' + query 
        self.send(cmd + self.CS_TERM) 
        results = self.tread() 
        if self.encoding == 'UNICODE': 
            print ' decoding unicode string: ' 
            results = results.decode('utf-8', 'replace') 
        return results 

# this is where I generate the html:
    def risx_to_html (self, risxSet, xsl = xsl_ss,  
                    css=css_url, use_css = 1): 
        styledoc = libxml2.parseFile(xsl) 
        style = libxslt.parseStylesheetDoc(styledoc) 
        doc = libxml2.parseDoc(risxSet) 
        result = style.applyStylesheet(doc, None) 
        # style.saveResultToFilename("results.html", result, 0) 
        htmlString = style.saveResultToString(result) 
        style.freeStylesheet() 
        doc.freeDoc() 
        result.freeDoc() 
        return htmlString 
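
An aside on the decode call in getref: the 'replace' error handler
never raises, so if the bytes are not really UTF-8 the damage is
silent.  A minimal sketch of that behaviour, written in current Python
syntax rather than the Python 2.3 used above; the sample word is made
up for illustration.

```python
# Hypothetical sample data: the same word in both encodings.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")
utf8_bytes = "caf\u00e9".encode("utf-8")

# Correctly labelled UTF-8 round-trips cleanly.
good = utf8_bytes.decode("utf-8", "replace")

# Latin-1 bytes are not valid UTF-8; 'replace' substitutes U+FFFD
# instead of raising UnicodeDecodeError, which hides the mismatch.
bad = latin1_bytes.decode("utf-8", "replace")

print(repr(good))
print(repr(bad))
```

Using 'strict' (the default) instead of 'replace' would make a wrongly
labelled document fail loudly at this point rather than later in the
parser.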

The server's default encoding is iso-8859-1, and since I mostly use
english-language references, this usually works fine; but occasionally
the server will pass me an entity like '&mu;' (for the Greek letter mu).
This generates an error like this:

Entity: line 57: parser error : Entity 'mu' not defined
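
This error is consistent with how XML works: only five entities (amp,
lt, gt, quot, apos) are predefined, and any other name has to be
declared in a DTD.  One possible workaround, sketched here under the
assumption that the server uses HTML-style entity names, is to turn
named entities into numeric character references before parsing.  The
function name is hypothetical, and the sketch uses the standard
library's html.entities table (htmlentitydefs in Python 2) and
xml.etree rather than libxml2 so that it is self-contained.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser already understands.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def named_to_numeric(xml_text):
    """Replace HTML-style named entities with numeric character references.

    Hypothetical helper; it is a plain text substitution, so it would
    also rewrite entity-like text inside CDATA sections.
    """
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave it for the parser to handle
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, xml_text)

sample = "<title>5 &mu;m &amp; more</title>"
fixed = named_to_numeric(sample)
title = ET.fromstring(fixed).text
print(fixed)
```

Numeric references need no DTD, so the parser resolves them in either
encoding.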

The entity error is not so bad, because the parsing continues
nonetheless.  With unicode it's worse: I get different errors
depending on how I set the system up:

with iso-8859-1 set as default encoding in sitecustomize.py:

  File "/home/matt/bibpython/refdbclient.py", line 268, in risx_to_html
    doc = libxml2.parseDoc(risxSet)
  File "/usr/lib/python2.3/site-packages/libxml2.py", line 1149, in parseDoc
    ret = libxml2mod.xmlParseDoc(cur)
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)

with utf-8 set as default encoding:

  File "/home/matt/bibpython/refdbclient.py", line 268, in risx_to_html
    doc = libxml2.parseDoc(risxSet)
  File "/usr/lib/python2.3/site-packages/libxml2.py", line 1149, in parseDoc
    ret = libxml2mod.xmlParseDoc(cur)
TypeError: xmlParseDoc() argument 1 must be string without null bytes or None, not unicode
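
Both tracebacks suggest that parseDoc wants a byte string rather than
a unicode object.  A minimal sketch of one way to prepare the
document, assuming the parser accepts UTF-8 bytes and honours the
encoding named in the XML declaration: encode the text once, and make
the declaration agree with the bytes.  The helper name is
hypothetical, the syntax is current Python rather than 2.3, and
xml.etree stands in for libxml2 so that the sketch runs with only the
standard library.

```python
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_ENCODING = re.compile(
    r"""^(<\?xml[^>]*?encoding\s*=\s*)(["'])[^"']*\2""")

def to_utf8_bytes(xml_text):
    """Encode a unicode XML document as UTF-8 bytes (hypothetical helper)."""
    # If the declaration names an encoding, rewrite it to match the bytes.
    fixed = _DECL_ENCODING.sub(r"\g<1>\g<2>UTF-8\g<2>", xml_text, count=1)
    return fixed.encode("utf-8")

doc = '<?xml version="1.0" encoding="ISO-8859-1"?>\n<ref>caf\u00e9 \u03bc</ref>'
data = to_utf8_bytes(doc)

# A byte-oriented parser can now decode the document itself.
text = ET.fromstring(data).text
print(data)
```

The alternative is to skip the decode step entirely and hand the
server's bytes straight to the parser, provided the declaration in the
document already names the encoding the server used.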

So I guess I have two questions:

(1) Am I using the right python tools for this job?  My excellent
python books unfortunately don't cover either unicode or xml in much
depth, so I'm a little uncertain as to whether I'm doing the right
thing.

(2) Is there a way to make libxml2 parse unicode documents?  Do I need
to pass it more information alerting it that it's getting unicode?  

Anyway, thanks very much for your help.  Much appreciated,  

Matt

-------------------------------------------
Matt Price	    matt.price at utoronto.ca
History Department, University of Toronto
(416) 978-2094
--------------------------------------------