[Python-Dev] Unicode <--> UTF-8 in CPython extension modules
John Dennis
jdennis at redhat.com
Sat Feb 23 01:50:32 CET 2008
Colin Walters wrote:
> On Fri, Feb 22, 2008 at 4:23 PM, John Dennis <jdennis at redhat.com> wrote:
>
>> Python programs which use Unicode string objects for their i18n and
>> which "link" to C libraries expecting UTF-8, but whose CPython
>> binding only uses the 's' or 's#' formats, seem to often
>> fail with encoding errors.
>
> One thing to be aware of is that PyGTK actually sets Python's
> default encoding for Unicode objects to UTF-8.
>
> http://bugzilla.gnome.org/show_bug.cgi?id=132040
>
> I mention this because PyGTK is a very popular library related to
> Python and Linux. So currently if you "import gtk", then libraries
> which are using UTF-8 (as you say, the vast majority) will work with
> Python unicode objects unmodified.
Thank you, Colin, your input was very helpful. The fact that PyGTK's i18n
handling worked was the counterexample which made me doubt my analysis,
but I can see from the GNOME bug report and Martin's
subsequent comment that the analysis was sound. It had perplexed me
enormously why i18n handling worked in some circumstances but failed in
others. Apparently it was a side effect of importing gtk, a problem
made harder to diagnose when either the order of imports or the complete
set of imports was not taken into account.
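
To make the import side effect concrete, here is a rough toy model in plain Python. It is only a sketch: `_default_encoding`, `simulate_import_gtk`, `binding_using_s_format` and `try_call` are hypothetical names invented for illustration, not real CPython or PyGTK APIs. It assumes a binding that converts Unicode through an interpreter-wide default encoding, which starts as ASCII and is switched to UTF-8 by the simulated import.

```python
# Toy model only: none of these names are real CPython or PyGTK APIs.
_default_encoding = "ascii"  # stands in for the interpreter-wide default


def simulate_import_gtk():
    """Hypothetical stand-in for the side effect of importing gtk."""
    global _default_encoding
    _default_encoding = "utf-8"


def binding_using_s_format(text):
    """Model a binding that relies on the implicit default encoding."""
    return text.encode(_default_encoding)


def try_call(text):
    """Return the encoded bytes, or None if the implicit conversion fails."""
    try:
        return binding_using_s_format(text)
    except UnicodeEncodeError:
        return None


before = try_call("caf\u00e9")  # ASCII default cannot represent U+00E9
simulate_import_gtk()
after = try_call("caf\u00e9")   # same call, now converted as UTF-8
```

The same call gives a different outcome depending on whether the simulated import has already run, which is the kind of order dependence that made the behaviour so confusing.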
I am aware of other Python bindings (libxml2 is one example) which make
the same mistake of not using the 'es' family of format conversions when
the underlying library expects UTF-8. At least I now understand why
incorrectly coded bindings produced correct results in some circumstances
when logic dictated they shouldn't.
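
The difference between the two conversion styles can be sketched the same way. This is again a toy model in plain Python, not the C API: `convert_like_s` and `convert_like_es` are hypothetical helpers, and the ASCII default is an assumption standing in for an interpreter where nothing has changed the default encoding.

```python
# Toy model only: these functions are hypothetical, not the CPython C API.
def convert_like_s(text, default_encoding="ascii"):
    """Model 's': the result depends on whatever the default encoding is."""
    return text.encode(default_encoding)


def convert_like_es(text, encoding="utf-8"):
    """Model 'es': the binding names the encoding the C library expects."""
    return text.encode(encoding)


name = "Gr\u00fc\u00dfe"

# Explicit encoding: independent of any process-wide setting.
explicit = convert_like_es(name)

# Implicit encoding: fails for non-ASCII text under an ASCII default.
try:
    implicit = convert_like_s(name)
except UnicodeEncodeError:
    implicit = None
```

In a real binding the explicit variant corresponds to passing the encoding name alongside the 'es' format code, so the result no longer depends on what other modules happen to have been imported.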
--
John Dennis <jdennis at redhat.com>