[Python-Dev] Unicode <--> UTF-8 in CPython extension modules

Sat Feb 23 00:26:07 CET 2008

> I've uncovered what seems to me to be a problem with Python Unicode
> string objects passed to extension modules. Or perhaps it's revealing
> a misunderstanding on my part :-) So I would like to get some
> clarification.

It seems to me that there are indeed one or more misunderstandings
on your part. Please discuss them on comp.lang.python.

> Extension modules written in C receive strings from python via the
> PyArg_ParseTuple family. Most extension modules use the 's' or 's#'
> format parameter.
> 
> Many C libraries in Linux use the UTF-8 encoding.
> 
> The 's' format when passed a Unicode object will encode the string
> according to the default encoding which is immutably set to 'ascii' in
> site.py. Thus a C library expecting UTF-8 which uses the 's' format in
> PyArg_ParseTuple will get an encoding error when passed a Unicode
> string which contains any code points outside the ascii range.

The C library isn't using the 's' format. A Python module
wrapping the C library is. So whatever conversion is necessary should
be done by that Python module.
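
To make the failure concrete, here is a minimal sketch of what the
implicit ASCII conversion amounts to, next to an explicit UTF-8
conversion. It is written to run on a current Python interpreter (an
assumption; the names text, ascii_ok and utf8_bytes are purely
illustrative), and it mimics the conversion in pure Python rather than
calling PyArg_ParseTuple itself.

```python
# A Unicode string with one code point outside the ASCII range.
text = u"caf\xe9"

# Encoding with ASCII (what the default encoding implies) fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly as UTF-8 succeeds and yields a byte string
# suitable for a library that documents its API in terms of UTF-8.
utf8_bytes = text.encode("utf-8")

print(ascii_ok)      # False
print(utf8_bytes)    # b'caf\xc3\xa9' on Python 3
```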

> Now my questions:
> 
> * Is the use of the 's' or 's#' format parameter in an extension
>    binding expecting UTF-8 fundamentally broken and not expected to
>    work?  Instead should the binding be using a format conversion which
>    specifies the desired encoding, e.g. 'es' or 'es#'?

Yes. Alternatively, require the callers to pass UTF-8 byte strings,
not Unicode strings.
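
As a rough sketch of the second option, the Python-level wrapper can
normalize its argument to a UTF-8 byte string before it ever reaches
the C code. The helper name to_utf8 is hypothetical and not part of
any library; the code assumes a current Python interpreter.

```python
def to_utf8(value):
    """Return value as a UTF-8 encoded byte string (hypothetical helper)."""
    text_type = type(u"")
    if isinstance(value, text_type):
        # Unicode string: encode explicitly, never via the default encoding.
        return value.encode("utf-8")
    if isinstance(value, bytes):
        # Byte string: check that it really is valid UTF-8, then pass it on.
        value.decode("utf-8")
        return value
    raise TypeError("expected a Unicode or byte string, got %s"
                    % type(value).__name__)


encoded = to_utf8(u"caf\xe9")
passthrough = to_utf8(b"plain ascii")
```

With such a helper in the wrapper, the C side can keep using 's' or
's#' and only ever sees bytes that are already UTF-8.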

> * The extension modules could successfully use the 's' or 's#' format
>    conversion in a UTF-8 environment if the default encoding was
>    UTF-8. Changing the default encoding to UTF-8 would in one easy
>    stroke "fix" most extension modules, right?

Wrong. This assumes that "most" libraries do indeed specify their
APIs in terms of UTF-8. I don't think that is a fact; not in the world
of 2008.

> Why is the default
>    encoding 'ascii' in UTF-8 environments and why is the default
>    encoding prohibited from being changed from ascii?

There are several reasons, all off-topic for python-dev.
ASCII was considered the safest assumption: when
converting between byte and Unicode strings in the absence of an
encoding specification, you can't assume anything but ASCII
(technically, not even that, as the bytes may be EBCDIC, but ASCII
is safe for the majority of the systems - unlike UTF-8).
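The encoding can't be changed because that would break hash(): a byte
string and a Unicode string that compare equal must also hash equal,
and that is only guaranteed as long as the default encoding is ASCII.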
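
The point about ASCII being the only safe assumption can be
illustrated with a small sketch (assuming a current Python
interpreter; the variable names are illustrative only). Pure ASCII
bytes decode to the same text under ASCII, Latin-1 and UTF-8, whereas
a byte outside the ASCII range means different things, or nothing
valid at all, depending on the encoding that is guessed.

```python
encodings = ("ascii", "latin-1", "utf-8")

# Pure ASCII bytes: every one of these encodings agrees on the result.
ascii_bytes = b"hello"
ascii_results = set(ascii_bytes.decode(enc) for enc in encodings)

# A non-ASCII byte: valid as Latin-1, but not valid as UTF-8.
non_ascii = b"caf\xe9"
as_latin1 = non_ascii.decode("latin-1")
try:
    non_ascii.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(len(ascii_results))  # 1
print(utf8_ok)             # False
```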

> * Did Python 2.5 introduce anything which now makes this issue visible
>    whereas before it was masked by some other behavior?

I don't know. Can you please be a bit more specific (on 
comp.lang.python) where you suspect such a change?

Regards,
Martin