xmlrpclib and decoding entity references

Bengt Richter bokr at oz.net
Wed May 4 12:12:01 EDT 2005


On 3 May 2005 08:07:06 -0700, "Chris Curvey" <ccurvey at gmail.com> wrote:

>I'm writing an XMLRPC server, which is receiving a request (from a
>non-Python client) that looks like this (formatted for legibility):
>
><?xml version="1.0"?>
><methodCall>
><methodName>echo</methodName>
><params>
><param>
><value>
><string>Le Martyre de Saint Andr&#xe9; &lt;BR&gt; avec inscription
>'Le Dominiquain.' et 'Le tableau fait par le dominicain,
>d'apr&#xe8;s son dessein &#xe0;... est &#xe0; Rome, &#xe0;
>l'&#xe9;glise Saint Andr&#xe9; della Valle' sur le
>cadre&lt;BR&gt; craie noire, plume et encre brune, lavis brun
>rehauss&#xe9; de blanc sur papier brun&lt;BR&gt; 190 x 228 mm. (7 1/2 x
>9 in.)</string>
></value>
></param>
></params>
></methodCall>
>
>But when my "echo" method is invoked, the value of the string is:
>
>Le Martyre de Saint Andr; <BR> avec inscription 'Le Dominiquain.' et
>'Le tableau fait par le dominicain, d'apr:s son dessein 2... est 2
>Rome, 2 l';glise Saint Andr; della Valle' sur le cadre<BR> craie noire,
>plume et encre brune, lavis brun rehauss; de blanc sur papier brun<BR>
>190 x 228 mm. (7 1/2 x 9 in.)
>
>Can anyone give me a lead on how to convert the entity references into
>something that will make it through to my method call?
>
I haven't used XMLRPC but superficially this looks like a quoting and/or encoding
problem. IOW, your "request" is XML, and the <string>...</string> part is also XML
which is part of the whole, not encapsulated in e.g. <![CDATA[...stuff...]]>
(which would tell an XML parser to suspend markup interpretation of ...stuff...).
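To make the CDATA point concrete, here is a minimal sketch using the standard xml.etree.ElementTree parser in current Python 3. The parser choice and the shortened sample strings are my own assumptions for illustration, not anything the XMLRPC server necessarily uses. It shows that escaped markup and a CDATA section both reach the application as the same plain text:

```python
import xml.etree.ElementTree as ET

# Escaped form: the parser decodes &#xe9; and &lt;BR&gt; into plain characters.
escaped = ET.fromstring("<string>Andr&#xe9; &lt;BR&gt; avec</string>")

# CDATA form: markup interpretation is suspended, so the literal "<BR>"
# is passed through as text instead of being read as a tag.
cdata = ET.fromstring("<string><![CDATA[Andr\u00e9 <BR> avec]]></string>")

print(repr(escaped.text))
print(repr(cdata.text))
print(len(cdata))  # number of child elements found inside the CDATA version
```

Either way the application sees one text node and no child element, so the choice only affects how the sender has to write the payload.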

So IWT you would at least need the <string>...</string> content to be converted to
unicode to preserve all the represented characters. It wouldn't surprise me if the
whole request is routinely converted to unicode, and the "value" you are showing
above is a result of converting from unicode to an encoding that can't represent
everything, and maybe just silently drops the characters it can't convert.
What do you get if you print repr(value)? (assuming value is what is passed
to your echo method)
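As a rough illustration of that kind of loss (written for current Python 3, with a sample string of my own; it is not a claim about what xmlrpclib itself did here), repr() shows whether the accented characters are still present, and a lenient error handler shows how they can vanish without any exception:

```python
value = "Saint Andr\u00e9, d'apr\u00e8s son dessein \u00e0 Rome"

# repr() makes it obvious whether the non-ASCII characters survived.
print(repr(value))

# By default, encoding to a charset that lacks them fails loudly.
try:
    value.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Lenient error handlers lose the characters without raising anything.
dropped = value.encode("ascii", errors="ignore")
replaced = value.encode("ascii", errors="replace")

print(dropped)
print(replaced)
```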

If it is a unicode string, you will just have to choose an appropriate value.encode('appropriate')
from available codecs. If it looks like e.g., a utf-8 encoding of unicode, you could try
value.decode('utf-8').encode('appropriate')
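A small sketch of both cases, in current Python 3 terms where text is str and encoded data is bytes. The choice of latin-1 as the target is only an assumption for the example; substitute whatever codec the consumer of the text expects:

```python
# Case 1: value is already a Unicode string, so just pick a target codec.
value = "rehauss\u00e9 de blanc"
latin1_bytes = value.encode("latin-1")
utf8_bytes = value.encode("utf-8")

# Case 2: value arrived as UTF-8 encoded bytes, so decode first and
# then re-encode to the target codec.
incoming = b"rehauss\xc3\xa9 de blanc"
converted = incoming.decode("utf-8").encode("latin-1")

print(latin1_bytes, utf8_bytes, converted)
```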

I'm just guessing here. But something is interpreting the basic XML, since
&lt;BR&gt; is being converted to <BR>. Seems not unlikely that the rest are
also being converted, and to unicode. You just wouldn't notice a glitch when
unicode <BR> is converted to any usual western text encoding.
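One way to check the guess is to feed a cut-down copy of the request straight to the unmarshaller and look at what comes out. This sketch assumes current Python 3, where xmlrpc.client is the successor to xmlrpclib, and uses a shortened version of the string; behaviour on the 2005-era module may differ:

```python
import xmlrpc.client

# Shortened copy of the request; the <string> body is abbreviated.
request = """<?xml version="1.0"?>
<methodCall>
<methodName>echo</methodName>
<params>
<param>
<value>
<string>Le Martyre de Saint Andr&#xe9; &lt;BR&gt; avec inscription</string>
</value>
</param>
</params>
</methodCall>"""

# loads() returns the tuple of parameters and the method name.
params, method = xmlrpc.client.loads(request)
value = params[0]

print(method)
print(repr(value))
```

If the repr shows the accented character intact at this stage, the loss is happening later, when the string is printed or written out in a narrower encoding.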

OTOH, if the intent (which I doubt) of the non-python client were to pass through
a block of pre-formatted XML as such (possibly for direct pasting into e.g. web page XHTML?)
then a way to avoid escaping every & and < would be to use CDATA to encapsulate it. That
would have to be fixed on that end.

Regards,
Bengt Richter


