[pypy-commit] cffi default: Expand the docs about wchar_t/char16_t/char32_t
arigo
pypy.commits at gmail.com
Thu Aug 31 03:16:29 EDT 2017
Author: Armin Rigo <arigo at tunes.org>
Branch:
Changeset: r3005:4a39df6e9df5
Date: 2017-08-31 09:16 +0200
http://bitbucket.org/cffi/cffi/changeset/4a39df6e9df5/
Log: Expand the docs about wchar_t/char16_t/char32_t
diff --git a/doc/source/ref.rst b/doc/source/ref.rst
--- a/doc/source/ref.rst
+++ b/doc/source/ref.rst
@@ -796,20 +796,8 @@
`[8]` ``wchar_t``, ``char16_t`` and ``char32_t``
- The ``wchar_t`` type has the same signedness as the underlying
- platform's. For example, on Linux, it is a signed 32-bit integer.
- However, the types ``char16_t`` and ``char32_t`` (*new in version
- 1.11*) are always unsigned. **Warning:** for now, if you use
- ``char16_t`` and ``char32_t`` with ``cdef()`` and ``set_source()``,
- you have to make sure yourself that the types are declared by the C
- source you provide to ``set_source()``. They would be declared if
- you ``#include`` a library that explicitly uses them, for example,
- or when using C++11. Otherwise, you need ``#include <uchar.h>`` on
- Linux, or more generally something like ``typedef uint_least16_t
- char16_t;``. This is not done automatically by CFFI because
- ``uchar.h`` is not standard across platforms, and writing a
- ``typedef`` like above would crash if the type happens to be
- already defined.
+ See `Unicode character types`_ below.
+
.. _file:
@@ -842,3 +830,66 @@
The special support for ``FILE *`` is anyway implemented in a similar manner
on CPython 3.x and on PyPy, because these Python implementations' files are
not natively based on ``FILE *``. Doing it explicitly offers more control.
+
+
+.. _unichar:
+
+Unicode character types
++++++++++++++++++++++++
+
+The ``wchar_t`` type has the same signedness as the underlying
+platform's. For example, on Linux, it is a signed 32-bit integer.
+However, the types ``char16_t`` and ``char32_t`` (*new in version 1.11*)
+are always unsigned.
+
+Note that CFFI assumes that these types are meant to contain UTF-16 or
+UTF-32 characters in the native endianness. More precisely:
+
+* ``char32_t`` is assumed to contain UTF-32, or UCS4, which is just the
+ unicode codepoint;
+
+* ``char16_t`` is assumed to contain UTF-16, i.e. UCS2 plus surrogates;
+
+* ``wchar_t`` is assumed to contain either UTF-32 or UTF-16 based on its
+ actual platform-defined size of 4 or 2 bytes.
+
+Whether this assumption is true or not is unspecified by the C language.
+In theory, the C library you are interfacing with could use one of these
+types with a different meaning. You would then need to handle it
+yourself---for example, by using ``uint32_t`` instead of ``char32_t`` in
+the ``cdef()``, and building the expected arrays of ``uint32_t``
+manually.
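As a minimal sketch of that manual approach, assuming Python 3 and an optionally installed cffi, the array of ``uint32_t`` could be built as follows. The helper name ``to_code_units`` is hypothetical and not part of CFFI; the only CFFI call used is ``ffi.new()``.

```python
def to_code_units(text):
    """Return the codepoints of ``text`` plus a terminating zero.

    Assumes Python 3, where iterating over a str yields whole codepoints.
    """
    return [ord(ch) for ch in text] + [0]


units = to_code_units(u"h\u00e9\U0001F600")

try:
    from cffi import FFI      # optional: only needed to build the C array
except ImportError:
    FFI = None

if FFI is not None:
    ffi = FFI()
    # The cdef() would declare the C function with uint32_t* instead of
    # char32_t*, so CFFI performs no unicode conversion of its own.
    buf = ffi.new("uint32_t[]", units)
```

The C library then receives exactly the integers placed in the list, whatever meaning it attaches to them.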
+
+Python itself can be compiled with ``sys.maxunicode == 65535`` or
+``sys.maxunicode == 1114111`` (Python >= 3.3 always uses 1114111). This
+changes the handling of surrogates (pairs of 16-bit "characters" that
+together stand for a single codepoint whose value is greater than
+65535). If your Python has ``sys.maxunicode == 1114111``, then it can
+store arbitrary unicode codepoints; surrogates are automatically
+inserted when converting from Python unicode strings to UTF-16, and
+automatically removed when converting back. On the other hand, if
+your Python has ``sys.maxunicode == 65535``, then it is the other way
+around: surrogates are removed when converting from Python unicode
+strings to UTF-32, and added when converting back. In other words,
+surrogate conversion is done only when there is a size mismatch.
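To illustrate what a surrogate pair looks like, here is a small sketch in plain Python 3 (no CFFI involved) that splits a string into UTF-16 code units and joins them back. The helper names are hypothetical; this shows the encoding itself, not CFFI's internal conversion code.

```python
import struct


def to_utf16_units(text):
    """Return the list of 16-bit UTF-16 code units for ``text``."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def from_utf16_units(units):
    """Rebuild a str from 16-bit code units, merging surrogate pairs."""
    data = struct.pack("<%dH" % len(units), *units)
    return data.decode("utf-16-le")


# U+1F600 is greater than 65535, so it needs two 16-bit units.
pair = to_utf16_units(u"\U0001F600")
high, low = pair
# Undo the surrogate encoding by hand to recover the codepoint.
codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

Here ``pair`` holds the two surrogates and ``codepoint`` is the single value that a 32-bit representation would store instead.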
+
+Note that Python's internal representation is not specified. For
+example, on CPython >= 3.3, it uses arrays of 1-, 2- or 4-byte items
+depending on what the string actually contains. With CFFI, when you
+pass a Python byte string to a C function expecting a ``char*``, we
+directly pass a pointer to the existing data without needing a
+temporary buffer; however, the same cannot cleanly be done with
+*unicode* string arguments and the ``wchar_t*`` / ``char16_t*`` /
+``char32_t*`` types, because of the changing internal
+representation. As a result, and for consistency, CFFI always allocates
+a temporary buffer for unicode strings.
+
+**Warning:** for now, if you use ``char16_t`` and ``char32_t`` with
+``set_source()``, you have to make sure yourself that the types are
+declared by the C source you provide to ``set_source()``. They would be
+declared if you ``#include`` a library that explicitly uses them, for
+example, or when using C++11. Otherwise, you need ``#include
+<uchar.h>`` on Linux, or more generally something like ``typedef
+uint16_t char16_t;``. This is not done automatically by CFFI because
+``uchar.h`` is not standard across platforms, and a ``typedef`` like the
+one above may fail to compile if the type is already defined.
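A hedged sketch of the ``#include <uchar.h>`` variant of that warning, assuming Linux and cffi >= 1.11. The module name ``_unichar_example`` and the C function ``count_units`` are hypothetical, made up for illustration only.

```python
# Declarations passed to cdef(); char16_t is known to cffi >= 1.11.
C_DECLS = "size_t count_units(const char16_t *s);"

# C source passed to set_source(); it must declare char16_t itself.
C_SOURCE = r"""
#include <uchar.h>   /* declares char16_t and char32_t on Linux */

size_t count_units(const char16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
"""


def make_builder():
    """Return an FFI builder configured with the declarations above."""
    from cffi import FFI      # imported lazily; requires cffi >= 1.11
    ffibuilder = FFI()
    ffibuilder.cdef(C_DECLS)
    ffibuilder.set_source("_unichar_example", C_SOURCE)
    return ffibuilder
```

Calling ``make_builder().compile()`` would then invoke the C compiler. On a platform without ``uchar.h``, the ``#include`` line would have to be replaced by a suitable ``typedef``, as described above.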
diff --git a/doc/source/using.rst b/doc/source/using.rst
--- a/doc/source/using.rst
+++ b/doc/source/using.rst
@@ -206,6 +206,9 @@
from a unicode string,
and calling ``ffi.string()`` on the cdata object returns the current unicode
string stored in the source array (adding surrogates if necessary).
+See the `Unicode character types`__ section for more details.
+
+__: ref.html#unichar
Note that unlike Python lists or tuples, but like C, you *cannot* index in
a C array from the end using negative numbers.