xmlrpclib and decoding entity references

Bengt Richter bokr at oz.net
Wed May 4 18:33:43 EDT 2005


On 4 May 2005 08:17:07 -0700, "Chris Curvey" <ccurvey at gmail.com> wrote:

>Here is the solution.  Incidentally, the client is Cold Fusion.
>
I suspect your solution may not be general, though it would seem to
satisfy your use case. It seems to be true for python's latin-1 that
all the first 256 character codes are acceptable and match unicode 1:1,
even though the windows character map for lucida sans unicode font
with latin-1 codes shows undefined-char boxes for codes 0x7f-0x9f.

 >>> sum(chr(i).decode('latin-1') == unichr(i) for i in xrange(256))
 256
 >>> sum(unichr(i).encode('latin-1') == chr(i) for i in xrange(256))
 256

Not sure what to make of that. E.g. should unichr(0x7f).encode('latin-1')
really be legal, or is it just expedient to have latin-1 serve as a kind of
compressed utf_16_le? E.g., there's 256 Trues in these:

 >>> sum(unichr(i).encode('utf_16_le')[0] == chr(i) for i in xrange(256))
 256
 >>> sum(unichr(i).encode('utf_16_le')[1] == '\x00' for i in xrange(256))
 256

Maybe we could have a 'u_as_str' or 'utf_16_le_lsbyte' codec for that, so the above would be spelled
 >>> sum(unichr(i).encode('u_as_str') == chr(i) for i in xrange(256)) # XXX faked, not implemented
 256

Utf-8 only goes half way:
 >>> sum(unichr(i).encode('utf-8') == chr(i) for i in xrange(256))
 128
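
For comparison, here is a sketch of the same four counts for a Python where str is already unicode. It assumes Python 3, where bytes([i]) stands in for chr(i) and chr(i) stands in for unichr(i). The variable names are only for illustration.

```python
# Python 3 spelling of the checks above (assumption: Python 3 semantics).
# bytes([i]) is the single byte i; chr(i) is the unicode code point i.
latin1_decode = sum(bytes([i]).decode('latin-1') == chr(i) for i in range(256))
latin1_encode = sum(chr(i).encode('latin-1') == bytes([i]) for i in range(256))

# utf-16-le of a code point < 256 is the low byte followed by a zero byte.
utf16_low = sum(chr(i).encode('utf-16-le') == bytes([i, 0]) for i in range(256))

# utf-8 matches the single byte only for the ASCII half.
utf8_same = sum(chr(i).encode('utf-8') == bytes([i]) for i in range(256))

print(latin1_decode, latin1_encode, utf16_low, utf8_same)
```

The first three counts come out as 256 and the utf-8 one as 128, matching the sessions above.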


<aside>
What do you think, Martin? ;-)
Maybe 'ubyte' or 'u256' would be a user-friendlier codec name? Or 'ustr'?
</aside>

>import re
>import logging
>import logging.config
>import os
>import SimpleXMLRPCServer
>
>logging.config.fileConfig("logging.ini")
>
>########################################################################
>class LoggingXMLRPCRequestHandler(SimpleXMLRPCServer.CGIXMLRPCRequestHandler):
>    def __dereference(self, request_text):
>        entityRe = re.compile("((?P<er>&#x)(?P<code>..)(?P<semi>;))")
What about entity &#x263a; ? Or the same in decimal: &#9786;
  :)
>        for m in re.finditer(entityRe, request_text):
>            hexref = int(m.group(3),16)
>	    charref = chr(hexref)
                      unichr(hexref) would handle >= 256, if you used unicode.
>	    request_text = request_text.replace(m.group(1), charref)
>
>	return request_text
>
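
A rough sketch of a more general dereference, in case it helps. It handles hex and decimal references of any length and works on unicode text, not on bytes. This is written for Python 3, where chr() covers the whole unicode range; the names dereference and _CHARREF_RE are made up for this example. Note that expanding references like &#60; before the XML parser sees the text can change the markup, so this is only reasonable if you know the client never sends those.

```python
import re

# Matches hex (&#x263a;) and decimal (&#9786;) numeric character references.
_CHARREF_RE = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

def dereference(text):
    """Replace numeric character references in a str with the characters they name."""
    def _sub(m):
        hex_digits, dec_digits = m.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the reference untouched.
            return m.group(0)
    # A single pass with re.sub avoids re-scanning text that was already replaced.
    return _CHARREF_RE.sub(_sub, text)

result = dereference("&#x263a; &#9786; &#xe9;")
```

Here result holds two U+263A characters and one U+00E9, separated by spaces.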
>
>#-------------------------------------------------------------------
>    def handle_xmlrpc(self, request_text):
>        logger = logging.getLogger()
>	#logger.debug("************************************")
>	#logger.debug(request_text)
                      ^^^^^^^^^^^^  I would suggest repr(request_text) for debugging, unless you
know that your logger is going to do that for you. Otherwise a '%s' format may hide things that you'd like to know.
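
To illustrate what a '%s' format can hide, here is a small sketch, assuming Python 3 and a made-up sample string. The no-break space and the DEL character are invisible or ambiguous when printed as-is, but show up as escapes with %r or %a.

```python
import logging

logger = logging.getLogger("demo")  # hypothetical logger name

# Sample text containing e-acute, a no-break space and DEL (0x7f).
request_text = "caf\u00e9\u00a0ok\x7f"

plain = "%s" % request_text    # the odd characters pass through unchanged
escaped = "%a" % request_text  # like repr(), but escapes everything non-ASCII

# Letting the logger apply %r keeps the escapes in the log output.
logger.debug("request_text=%r", request_text)
print(escaped)
```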

>	try:
>	    #logger.debug("-------------------------------------")
>	    request_text = self.__dereference(request_text)
>	    #logger.debug(request_text)
>	    request_text = request_text.decode("latin-1").encode('utf-8')
AFAIK, XML can be encoded with many encodings other than latin-1, so you are essentially
saying here that you know it's latin-1 somehow. Theoretically, your XML could
start with something like <?xml encoding='UTF-8'?> and .decode("latin-1") is only going to
"work" when the source is plain ascii. I wouldn't be surprised if that's what's happening
up to the point where you __dereference, but str.replace doesn't care that you are potentially
making a utf-8 encoding invalid by just replacing 8-bit characters with what is legal latin-1.
After that, you are decoding your utf-8_clobbered_with_latin-1 as latin-1 anyway, so it "works".
At least I think this is a consistent theory. See if you can get the client to send something
with characters >= 128 that aren't represented as &#x..; to see if it's actually sending utf-8.
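
Here is a sketch of that theory in code, assuming Python 3 bytes and an invented request body. The body is utf-8 with one raw non-ascii character and one character reference. Replacing the reference with a latin-1 byte makes the result invalid as utf-8, yet decoding as latin-1 never raises, it just garbles the character that really was utf-8.

```python
# Hypothetical body: utf-8 encoded, containing a raw i-diaeresis and a reference.
utf8_body = "na\u00efve &#xe9;".encode("utf-8")   # b'na\xc3\xafve &#xe9;'

# What a byte-level replace with a latin-1 character does to it.
clobbered = utf8_body.replace(b"&#xe9;", b"\xe9")

try:
    clobbered.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    # The lone 0xe9 byte is a utf-8 lead byte with no continuation bytes.
    valid_utf8 = False

# latin-1 accepts every byte value, so this always "works",
# but the two utf-8 bytes of the i-diaeresis come out as two separate characters.
as_latin1 = clobbered.decode("latin-1")
print(valid_utf8, ascii(as_latin1))
```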


>	    #logger.debug("************************************")
>	except Exception, e:
>	    logger.error(request_text)
again, suggest repr(request_text)
>	    logger.error("had a problem dereferencing")
>	    logger.error(e)
>
>	SimpleXMLRPCServer.CGIXMLRPCRequestHandler.handle_xmlrpc(self,
>request_text)
>########################################################################
>class Foo:
>    def settings(self):
>        return os.environ
>    def echo(self, something):
>        logger = logging.getLogger()
>	logger.debug(something)
repr it, unless you know ;-)

>        return something
>    def greeting(self, name):
>        return "hello, " + name
>
># these are used to run as a CGI
>handler = LoggingXMLRPCRequestHandler()
>handler.register_instance(Foo())
>handler.handle_request()
>

Regards,
Bengt Richter


