xmlrpclib and decoding entity references
Bengt Richter
bokr at oz.net
Wed May 4 18:33:43 EDT 2005
On 4 May 2005 08:17:07 -0700, "Chris Curvey" <ccurvey at gmail.com> wrote:
>Here is the solution. Incidentally, the client is Cold Fusion.
>
I suspect your solution may not be general, though it would seem to
satisfy your use case. It seems to be true for python's latin-1 that
all the first 256 character codes are acceptable and match unicode 1:1,
even though the windows character map for lucida sans unicode font
with latin-1 codes shows undefined-char boxes for codes 0x7f-0x9f.
>>> sum(chr(i).decode('latin-1') == unichr(i) for i in xrange(256))
256
>>> sum(unichr(i).encode('latin-1') == chr(i) for i in xrange(256))
256
Not sure what to make of that. E.g. should unichr(0x7f).encode('latin-1')
really be legal, or is it just expedient to have latin-1 serve as a kind of
compressed utf_16_le? E.g., there's 256 Trues in these:
>>> sum(unichr(i).encode('utf_16_le')[0] == chr(i) for i in xrange(256))
256
>>> sum(unichr(i).encode('utf_16_le')[1] == '\x00' for i in xrange(256))
256
Maybe we could have a 'u_as_str' or 'utf_16_le_lsbyte' codec for that, so the above would be spelled
>>> sum(unichr(i).encode('u_as_str') == chr(i) for i in xrange(256)) # XXX faked, not implemented
256
Utf-8 only goes half way:
>>> sum(unichr(i).encode('utf-8') == chr(i) for i in xrange(256))
128
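For anyone trying this on a newer interpreter, here is a rough sketch of the same checks in Python 3 spelling. The assumption is only that bytes([i]) stands in for chr(i) and chr(i) for unichr(i); the variable names are mine and not from any library.

```python
# Python 3 sketch of the checks above.
# Assumption: bytes([i]) plays the role of Python 2's chr(i),
# and chr(i) plays the role of Python 2's unichr(i).
codes = range(256)

# latin-1 maps every byte value to the code point with the same number, both ways.
latin1_decode = sum(bytes([i]).decode('latin-1') == chr(i) for i in codes)
latin1_encode = sum(chr(i).encode('latin-1') == bytes([i]) for i in codes)

# utf-16-le puts the low byte first; for code points below 256 the high byte is zero.
# Indexing a bytes object in Python 3 yields an int, hence the comparison with i and 0.
utf16_low = sum(chr(i).encode('utf-16-le')[0] == i for i in codes)
utf16_high = sum(chr(i).encode('utf-16-le')[1] == 0 for i in codes)

# utf-8 is single-byte only for the ASCII half.
utf8_same = sum(chr(i).encode('utf-8') == bytes([i]) for i in codes)

print(latin1_decode, latin1_encode, utf16_low, utf16_high, utf8_same)
```

The counts should come out the same as in the sessions above: 256 for the latin-1 and utf_16_le checks, 128 for utf-8.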
<aside>
What do you think, Martin? ;-)
Maybe 'ubyte' or 'u256' would be a user-friendlier codec name? Or 'ustr'?
</aside>
>import re
>import logging
>import logging.config
>import os
>import SimpleXMLRPCServer
>
>logging.config.fileConfig("logging.ini")
>
>########################################################################
>class LoggingXMLRPCRequestHandler(SimpleXMLRPCServer.CGIXMLRPCRequestHandler):
> def __dereference(self, request_text):
> entityRe = re.compile("((?P<er>&#x)(?P<code>..)(?P<semi>;))")
What about entity &#x263a; ? Or the same in decimal: &#9786;
:)
> for m in re.finditer(entityRe, request_text):
> hexref = int(m.group(3),16)
> charref = chr(hexref)
unichr(hexref) would handle >= 256, if you used unicode.
> request_text = request_text.replace(m.group(1), charref)
>
> return request_text
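A more general version might look something like the following sketch, which accepts hex and decimal references of any length and works on text rather than bytes. It is written in Python 3 spelling (chr where Python 2 would need unichr), and the names CHAR_REF and dereference are just mine for illustration, not part of xmlrpclib or SimpleXMLRPCServer.

```python
import re

# Numeric character references: &#x followed by hex digits, or &# followed by
# decimal digits, terminated by a semicolon. Any number of digits is accepted.
CHAR_REF = re.compile(r'&#(?:x(?P<hex>[0-9a-fA-F]+)|(?P<dec>[0-9]+));')


def dereference(text):
    """Return text (a str) with numeric character references replaced.

    Named references such as &amp; are deliberately left alone.
    chr() raises ValueError for numbers above 0x10FFFF; that is not handled here.
    """
    def _replace(match):
        if match.group('hex') is not None:
            return chr(int(match.group('hex'), 16))
        return chr(int(match.group('dec'), 10))

    return CHAR_REF.sub(_replace, text)


print(ascii(dereference('caf&#xe9; &#x263a; &#9786; a&amp;b')))
```

One caveat: doing this before the XML parser sees the text means a reference like &#x3c; becomes a literal "<" and can break the markup, so this is only a sketch of the substitution itself.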
>
>
>#-------------------------------------------------------------------
> def handle_xmlrpc(self, request_text):
> logger = logging.getLogger()
> #logger.debug("************************************")
> #logger.debug(request_text)
^^^^^^^^^^^^ I would suggest repr(request_text) for debugging, unless you
know that your logger is going to do that for you. Otherwise a '%s' format may hide things that you'd like to know.
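To illustrate what I mean, here is a small sketch in Python 3 spelling. The logger name 'repr_demo', the StringIO capture, and the sample string are all made up for the demonstration; the sample contains a no-break space and a trailing tab, which look like ordinary whitespace when printed as-is.

```python
import io
import logging

# Capture log output in memory so the two renderings can be compared.
stream = io.StringIO()
logger = logging.getLogger('repr_demo')  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False  # keep the demo output out of the root logger

# Sample input (made up): a no-break space (0xa0) and a trailing tab.
request_text = 'hello\xa0world\t'

logger.debug('%s', request_text)  # invisible characters stay invisible
logger.debug('%r', request_text)  # repr() spells them out as escapes

plain_line, repr_line = stream.getvalue().split('\n')[:2]
print(repr_line)
```

The '%r' line shows the escapes \xa0 and \t explicitly, whereas the '%s' line is indistinguishable from 'hello world' followed by some whitespace.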
> try:
> #logger.debug("-------------------------------------")
> request_text = self.__dereference(request_text)
> #logger.debug(request_text)
> request_text = request_text.decode("latin-1").encode('utf-8')
AFAIK, XML can be encoded with many encodings other than latin-1, so you are essentially
saying here that you somehow know it's latin-1. Theoretically, your XML could
start with something like <?xml version='1.0' encoding='UTF-8'?>, in which case
.decode("latin-1") is only going to "work" when the source is plain ascii. I wouldn't be
surprised if that's what's happening up to the point where you __dereference, but str.replace
doesn't care that you are potentially making a utf-8 encoding invalid by replacing the
character references with 8-bit characters that are only legal as latin-1.
After that, you are decoding your utf-8_clobbered_with_latin-1 as latin-1 anyway, so it "works".
At least I think this is a consistent theory. See if you can get the client to send something
with characters >= 128 that aren't represented as &#x..; to see if it's actually sending utf-8.
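Here is a sketch of that theory in Python 3 spelling, assuming (and it is only an assumption) that the client sends utf-8. The sample body is made up; it contains one real non-ascii character and one character reference, both meaning U+00E9.

```python
import re

# Made-up request fragment, utf-8 encoded: one literal e-acute, one reference to it.
body = 'caf\u00e9 and caf&#xe9;'.encode('utf-8')

# Roughly what the quoted __dereference does: splice in a single latin-1 byte.
clobbered = body.replace(b'&#xe9;', bytes([0xe9]))

# The result is no longer valid utf-8, because a lone 0xe9 byte is an
# incomplete multi-byte sequence.
try:
    clobbered.decode('utf-8')
    still_utf8 = True
except UnicodeDecodeError:
    still_utf8 = False

# Decoding as latin-1 never fails, but the genuine utf-8 character
# comes out as two characters.
as_latin1 = clobbered.decode('latin-1')

# Decoding first and substituting on text avoids the mix-up.
decoded = body.decode('utf-8')
fixed = re.sub(r'&#x([0-9a-fA-F]+);', lambda m: chr(int(m.group(1), 16)), decoded)

print(still_utf8, ascii(as_latin1), ascii(fixed))
```

If the client really does send latin-1, the first decode would be .decode('latin-1') instead; the point is only that the substitution is safer after decoding than before.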
> #logger.debug("************************************")
> except Exception, e:
> logger.error(request_text)
again, suggest repr(request_text)
> logger.error("had a problem dereferencing")
> logger.error(e)
>
> SimpleXMLRPCServer.CGIXMLRPCRequestHandler.handle_xmlrpc(self, request_text)
>########################################################################
>class Foo:
> def settings(self):
> return os.environ
> def echo(self, something):
> logger = logging.getLogger()
> logger.debug(something)
repr it, unless you know ;-)
> return something
> def greeting(self, name):
> return "hello, " + name
>
># these are used to run as a CGI
>handler = LoggingXMLRPCRequestHandler()
>handler.register_instance(Foo())
>handler.handle_request()
>
Regards,
Bengt Richter