[XML-SIG] HTML<->UTF-8 'codec'?

M.-A. Lemburg mal@lemburg.com
Mon, 22 Oct 2001 15:50:47 +0200


Bill Janssen wrote:
> 
> Perhaps you'd be kind enough to review my sample code at
> ftp://ftp.parc.xerox.com/transient/janssen/htmlcodec.py, and advise of
> glaring errors or any interesting improvements that occur to you?
> 
> Thanks in advance!

Here are some comments:

First of all, you are encoding Unicode to an 8-bit string, right?
If so, then you don't need to build the output as a Unicode string;
an ordinary 8-bit string will do.

    def encode(self,input,errors='strict'):

        output = u''
        i = 0
        input_len = len(input)
        while (i < input_len):
            if ord(input[i]) > 0x7F:
                output = output + u'&#' + unicode(str(ord(input[i]))) + u';'

Wouldn't this be easier? u"&#%i;" % ord(input[i])

            else:
                output = output + unicode(input[i])
            i = i + 1
        return (str(output), len(output))

This should be return (str(output), i): the codec API expects
(output, length_consumed), i.e. how much of the input was consumed,
not the length of the output.

Same for decode().
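
To pull the comments above together, here is a rough sketch of what the
two functions could look like. It is written for current Python 3 (where
str is Unicode and bytes is the 8-bit type) rather than the Python of the
original code, and the function names, the regular expression and the
handling of the errors argument are my own assumptions, not taken from
htmlcodec.py.

```python
import re

# Matches decimal character references such as "&#233;".
_CHARREF = re.compile(r"&#([0-9]+);")

def encode(text, errors="strict"):
    """Replace every non-ASCII character with a decimal character reference."""
    pieces = []
    for ch in text:
        if ord(ch) > 0x7F:
            pieces.append("&#%i;" % ord(ch))
        else:
            pieces.append(ch)
    # Second item is the amount of *input* consumed, not the output length.
    # The errors argument is accepted but unused: every character is encodable.
    return "".join(pieces).encode("ascii"), len(text)

def decode(data, errors="strict"):
    """Turn decimal character references back into the characters they name."""
    text = bytes(data).decode("ascii", errors)
    text = _CHARREF.sub(lambda m: chr(int(m.group(1))), text)
    # Again: report how much of the input was consumed.
    return text, len(data)
```

Collecting the pieces in a list and joining once also avoids the repeated
string concatenation in the loop.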

A note about the search function: if you give the codec module
a name like 'html_utf_8.py' and put it in the encodings package,
then the search function in encodings/__init__.py can find it.
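
If installing a module into the encodings package is not an option, a
different route is to register your own search function with
codecs.register(). The following is only a sketch of that alternative in
current Python 3; the codec name "html_charref_sketch" and the helper
names are made up for the illustration.

```python
import codecs
import re

_CHARREF = re.compile(r"&#([0-9]+);")

def _encode(text, errors="strict"):
    # Non-ASCII characters become decimal character references.
    out = "".join(ch if ord(ch) < 0x80 else "&#%i;" % ord(ch) for ch in text)
    return out.encode("ascii"), len(text)

def _decode(data, errors="strict"):
    text = bytes(data).decode("ascii", errors)
    return _CHARREF.sub(lambda m: chr(int(m.group(1))), text), len(data)

def search(name):
    """Hypothetical search function: answer only for our made-up codec name."""
    if name.lower().replace("-", "_") == "html_charref_sketch":
        return codecs.CodecInfo(_encode, _decode, name="html_charref_sketch")
    return None  # let the other registered search functions try

codecs.register(search)
```

After registration the codec can be looked up by name, for example
codecs.encode(u, "html_charref_sketch").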

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/