How to use xmlrpc properly with Korean (non-ascii characters)

Martin v. Loewis martin at v.loewis.de
Tue Oct 22 02:55:54 EDT 2002


lamb100 at korea.com (Seunghyun Kim) writes:

> Recently I tried to make xmlrpc support Korean. In the newsgroup,
> there are many complaints about xmlrpc from non-English speakers,
> because it formally supports just ASCII. Here is a solution for Korean
> language support. I guess it could be adapted to other languages.

While I trust that your approach works for you, I wonder whether it
wouldn't be easier to just use UTF-8 instead of EUC-KR. I think a
number of problems would go away, and the standard xmlrpclib module
would work out of the box, with no modifications required.
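
To illustrate the UTF-8 route, here is a minimal sketch using the
Python 3 descendant of xmlrpclib (xmlrpc.client). The method name
"echo" and the sample string are made up for the example, and no
server is involved; it only shows that Korean text survives a
marshal/unmarshal round trip when the bytes on the wire are UTF-8.

```python
import xmlrpc.client

# Sample Korean text; any Unicode string would do.
korean_text = "안녕하세요"

# Marshal a call to a hypothetical method named "echo".
# dumps() returns the XML request as a str.
request_text = xmlrpc.client.dumps((korean_text,), methodname="echo")

# Encode the request as UTF-8, as it would travel over HTTP.
wire_bytes = request_text.encode("utf-8")

# Unmarshal the raw bytes again; with no encoding declaration,
# the XML parser treats them as UTF-8.
params, method_name = xmlrpc.client.loads(wire_bytes)

print(method_name, params[0])
```

No patching of the library is needed for this to work.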

> First of all, you have to get xmlrpclib 0.9.9 from pythonware, and
> install it to your $PYTHON_PATH$/lib. Be sure to overwrite the
> previous files. Then open xmlrpclib.py with a text editor and modify it.

Can you please elaborate why the standard library module fails to work?
My guess is that this is because of missing EUC-KR support in Expat.
Expat has mechanisms to support other encodings as well; currently,
pyexpat uses them only for 8-bit encodings.

If you can extend this to work for arbitrary encodings, contributions
are appreciated. Also, if you can extend this to work for specific
multi-byte encodings only, contributions would still be appreciated.

> ----------------------------------------------------------------------
> def _decode(data, encoding, is8bit=re.compile("[\x80-\xff]").search):
>     # decode non-ascii string (if possible)
>     if unicode and is8bit(data):
>         data = unicode(data, encoding)
>     return data
> -----------------------------------------------------------------------
> to
> -----------------------------------------------------------------------
> def _decode(data, encoding, is8bit=re.compile("[\x80-\xff]").search):
>     # decode non-ascii string (if possible)
>     if unicode and is8bit(data):
>         UTF8_encode = codecs.lookup('UTF-8')[0]    # set UTF8 codec
>         data = UTF8_encode(unicode(data, 'mbcs'))[0] 
>     return data
> ------------------------------------------------------------------------

That is inherently wrong: it ignores the encoding argument and
hard-codes 'mbcs', a codec that exists only on Windows. Don't you think
you should take the document encoding into account, somehow? What was
wrong with the original code?
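
As a rough sketch of what "taking the document encoding into account"
could look like in modern Python: the helper names below
(declared_encoding, decode_document) are invented for illustration and
are not part of xmlrpclib. The regular expression is a simplification
that only handles a declaration at the very start of an
ASCII-compatible document.

```python
import re

# Matches the encoding pseudo-attribute of a leading XML declaration.
_DECLARATION = re.compile(
    rb"^<\?xml[^>]*?encoding=[\"']([A-Za-z0-9._-]+)[\"']"
)


def declared_encoding(document: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    match = _DECLARATION.match(document)
    return match.group(1).decode("ascii") if match else default


def decode_document(document: bytes) -> str:
    """Decode the document using the encoding it declares for itself."""
    return document.decode(declared_encoding(document))


# An EUC-KR document that says so in its declaration.
euc_kr_document = (
    b'<?xml version="1.0" encoding="euc-kr"?>'
    + "<value>한글</value>".encode("euc-kr")
)

print(declared_encoding(euc_kr_document))
print(decode_document(euc_kr_document))
```

The point is only that the codec is chosen from the document, not from
the platform the code happens to run on.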

Regards,
Martin


