Notice: While Javascript is not essential for this website, your interaction with the content will be limited. Please turn Javascript on for the full experience.

PEP 624 -- Remove Py_UNICODE encoder APIs

PEP:624
Title:Remove Py_UNICODE encoder APIs
Author:Inada Naoki <songofacandy at gmail.com>
Status:Draft
Type:Standards Track
Created:06-Jul-2020
Python-Version:3.11
Post-History:08-Jul-2020

Abstract

This PEP proposes to remove deprecated Py_UNICODE encoder APIs in Python 3.11:

  • PyUnicode_Encode()
  • PyUnicode_EncodeASCII()
  • PyUnicode_EncodeLatin1()
  • PyUnicode_EncodeUTF7()
  • PyUnicode_EncodeUTF8()
  • PyUnicode_EncodeUTF16()
  • PyUnicode_EncodeUTF32()
  • PyUnicode_EncodeUnicodeEscape()
  • PyUnicode_EncodeRawUnicodeEscape()
  • PyUnicode_EncodeCharmap()
  • PyUnicode_TranslateCharmap()
  • PyUnicode_EncodeDecimal()
  • PyUnicode_TransformDecimalToASCII()

Note

PEP 623 propose to remove Unicode object APIs relating to Py_UNICODE. On the other hand, this PEP is not relating to Unicode object. These PEPs are split because they have different motivation and need different discussion.

Motivation

In general, reducing the number of APIs that have been deprecated for a long time and have few users is a good idea for not only it improves the maintainability of CPython, but it also helps API users and other Python implementations.

Rationale

Deprecated since Python 3.3

Py_UNICODE and APIs using it have been deprecated since Python 3.3.

Inefficient

All of these APIs are implemented using PyUnicode_FromWideChar. So these APIs are inefficient when user want to encode Unicode object.

Not used widely

When searching from top 4000 PyPI packages [1], only pyodbc use these APIs.

  • PyUnicode_EncodeUTF8()
  • PyUnicode_EncodeUTF16()

pyodbc uses these APIs to encode Unicode object into bytes object. So it is easy to fix it. [2]

Alternative APIs

There are alternative APIs to accept PyObject *unicode instead of Py_UNICODE *. Users can migrate to them.

Deprecated API Alternative APIs
PyUnicode_Encode() PyUnicode_AsEncodedString()
PyUnicode_EncodeASCII() PyUnicode_AsASCIIString() (1)
PyUnicode_EncodeLatin1() PyUnicode_AsLatin1String() (1)
PyUnicode_EncodeUTF7() (2)
PyUnicode_EncodeUTF8() PyUnicode_AsUTF8String() (1)
PyUnicode_EncodeUTF16() PyUnicode_AsUTF16String() (3)
PyUnicode_EncodeUTF32() PyUnicode_AsUTF32String() (3)
PyUnicode_EncodeUnicodeEscape() PyUnicode_AsUnicodeEscapeString()
PyUnicode_EncodeRawUnicodeEscape() PyUnicode_AsRawUnicodeEscapeString()
PyUnicode_EncodeCharmap() PyUnicode_AsCharmapString() (1)
PyUnicode_TranslateCharmap() PyUnicode_Translate()
PyUnicode_EncodeDecimal() (4)
PyUnicode_TransformDecimalToASCII() (4)

Notes:

  1. const char *errors parameter is missing.
  2. There is no public alternative API. But user can use generic PyUnicode_AsEncodedString() instead.
  3. const char *errors, int byteorder parameters are missing.
  4. There is no direct replacement. But Py_UNICODE_TODECIMAL can be used instead. CPython uses _PyUnicode_TransformDecimalAndSpaceToASCII for converting from Unicode to numbers instead.

Plan

Python 3.9

Add Py_DEPRECATED(3.3) to following APIs. This change is committed already [3]. All other APIs have been marked Py_DEPRECATED(3.3) already.

  • PyUnicode_EncodeDecimal()
  • PyUnicode_TransformDecimalToASCII().

Document all APIs as "will be removed in version 3.11".

Python 3.11

These APIs are removed.

  • PyUnicode_Encode()
  • PyUnicode_EncodeASCII()
  • PyUnicode_EncodeLatin1()
  • PyUnicode_EncodeUTF7()
  • PyUnicode_EncodeUTF8()
  • PyUnicode_EncodeUTF16()
  • PyUnicode_EncodeUTF32()
  • PyUnicode_EncodeUnicodeEscape()
  • PyUnicode_EncodeRawUnicodeEscape()
  • PyUnicode_EncodeCharmap()
  • PyUnicode_TranslateCharmap()
  • PyUnicode_EncodeDecimal()
  • PyUnicode_TransformDecimalToASCII()

Alternative ideas

Instead of just removing deprecated APIs, we may be able to use thier names with different signature.

Make some private APIs public

PyUnicode_EncodeUTF7() doesn't have public alternative APIs.

Some APIs have alternative public APIs. But they are missing const char *errors or int byteorder parameters.

We can rename some private APIs and make them public to cover missing APIs and parameters.

Rename to Rename from
PyUnicode_EncodeASCII() _PyUnicode_AsASCIIString()
PyUnicode_EncodeLatin1() _PyUnicode_AsLatin1String()
PyUnicode_EncodeUTF7() _PyUnicode_EncodeUTF7()
PyUnicode_EncodeUTF8() _PyUnicode_AsUTF8String()
PyUnicode_EncodeUTF16() _PyUnicode_EncodeUTF16()
PyUnicode_EncodeUTF32() _PyUnicode_EncodeUTF32()

Pros:

  • We have more consistent API set.

Cons:

  • We have more public APIs to maintain.
  • Existing public APIs are enough for most use cases, and PyUnicode_AsEncodedString() can be used in other cases.

Replace Py_UNICODE* with Py_UCS4*

We can replace Py_UNICODE (typedef of wchar_t) with Py_UCS4. Since builtin codecs support UCS-4, we don't need to convert Py_UCS4* string to Unicode object.

Pros:

  • We have more consistent API set.
  • User can encode UCS-4 string in C without creating Unicode object.

Cons:

  • We have more public APIs to maintain.
  • Applications which uses UTF-8 or UTF-16 can not use these APIs anyway.
  • Other Python implementations may not have builtin codec for UCS-4.
  • If we change the Unicode internal representation to UTF-8, we need to keep UCS-4 support only for these APIs.

Replace Py_UNICODE* with wchar_t*

We can replace Py_UNICODE to wchar_t.

Pros:

  • We have more consistent API set.
  • Backward compatible.

Cons:

  • We have more public APIs to maintain.
  • They are inefficient on platforms wchar_t* is UTF-16. It is because built-in codecs supports only UCS-1, UCS-2, and UCS-4 input.

Rejected ideas

Using runtime warning

These APIs doesn't release GIL for now. Emitting a warning from such APIs is not safe. See this example.

PyObject *u = PyList_GET_ITEM(list, i);  // u is borrowed reference.
PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u),
        PyUnicode_GET_SIZE(u), NULL);
// Assumes u is still living reference.
PyObject *t = PyTuple_Pack(2, u, b);
Py_DECREF(b);
return t;

If we emit Python warning from PyUnicode_EncodeUTF8(), warning filters and other threads may change the list and u can be a dangling reference after PyUnicode_EncodeUTF8() returned.

Additionally, since we are not changing behavior but removing C APIs, runtime DeprecationWarning might not helpful for Python developers. We should warn to extension developers instead.

Deprecate PyUnicode_Decode* APIs too

Not only remove PyUnicode_Encode*() APIs, but also deprecate following APIs too for symmetry and reducing number of APIs.

  • PyUnicode_DecodeASCII()
  • PyUnicode_DecodeLatin1()
  • PyUnicode_DecodeUTF7()
  • PyUnicode_DecodeUTF8()
  • PyUnicode_DecodeUTF16()
  • PyUnicode_DecodeUTF32()
  • PyUnicode_DecodeUnicodeEscape()
  • PyUnicode_DecodeRawUnicodeEscape()
  • PyUnicode_DecodeCharmap()
  • PyUnicode_DecodeMBCS()

This idea is excluded from this PEP because of several reasons:

  • We can not remove them anytime soon because they are part of stable ABI.
  • PyUnicode_DecodeASCII() and PyUnicode_DecodeUTF8() are used very widely. Deprecating them is not worth enough.
  • Decoding from const char* is independent from Unicode representation.
    • PyUnicode_Decode*() APIs are useful for applications and extensions using UTF-8 or Python Unicode objects to store Unicode string. But PyUnicode_Encode*() APIs are not useful for them.
    • Python implementations using UTF-8 for Unicode internal representation (e.g. PyPy and micropython) may not have encoder with wchar_t* or UCS-4 input. But decoding from char* is very natural for them too.

References

[1]Source package list chosen from top 4000 PyPI packages. (https://github.com/methane/notes/blob/master/2020/wchar-cache/package_list.txt)
[2]pyodbc -- Don't use PyUnicode_Encode API #792 (https://github.com/mkleehammer/pyodbc/pull/792)
[3]Uncomment Py_DEPRECATED for Py_UNICODE APIs (GH-21318) (https://github.com/python/cpython/commit/9c3840870814493fed62e140cfa43c2883e12181)
Source: https://github.com/python/peps/blob/master/pep-0624.rst