[New-bugs-announce] [issue39087] No efficient API to get UTF-8 string from unicode object.
Inada Naoki
report at bugs.python.org
Wed Dec 18 07:10:15 EST 2019
New submission from Inada Naoki <songofacandy at gmail.com>:
Assume you are writing an extension module that reads strings, for example an HTML escaper or a JSON encoder.
There are two approaches:
(a) Support three KINDs in the flexible unicode representation.
(b) Get UTF-8 data from the unicode.
(a) will be the fastest on CPython, but there are a few drawbacks:
* It is tightly coupled to the CPython implementation. It will be slow on PyPy.
* CPython may change the internal representation to UTF-8 in the future, like PyPy.
* You cannot easily reuse algorithms written in C that handle `char*`.
So I believe (b) should be the preferred way.
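To illustrate why (b) composes well with existing C code, here is a minimal sketch of an HTML escaper that works directly on a UTF-8 `char*` buffer. The function name `html_escape_utf8` and its size-returning convention are assumptions made for this example, not part of any existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: escape HTML-significant ASCII bytes in a UTF-8 buffer.
 *
 * Every byte of a multi-byte UTF-8 sequence is >= 0x80, so it can never be
 * confused with '&', '<', '>' or '"' and is copied through unchanged.
 *
 * Returns the number of bytes the escaped output needs. The contents of dst
 * are only meaningful when the return value is <= dst_cap. Pass dst == NULL
 * to compute the required size without writing anything.
 */
size_t
html_escape_utf8(const char *src, size_t src_len, char *dst, size_t dst_cap)
{
    size_t out = 0;

    assert(src != NULL || src_len == 0);

    for (size_t i = 0; i < src_len; i++) {
        const char *rep;
        size_t rep_len;

        switch (src[i]) {
        case '&': rep = "&amp;";  rep_len = 5; break;
        case '<': rep = "&lt;";   rep_len = 4; break;
        case '>': rep = "&gt;";   rep_len = 4; break;
        case '"': rep = "&quot;"; rep_len = 6; break;
        default:  rep = &src[i];  rep_len = 1; break;
        }
        /* Never write past the end of dst; keep counting regardless. */
        if (dst != NULL && out + rep_len <= dst_cap) {
            memcpy(dst + out, rep, rep_len);
        }
        out += rep_len;
    }
    return out;
}
```

The routine never needs to know which of the three KINDs the original string used; it only needs a pointer and a length, which is exactly what a UTF-8 accessor would provide.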
But CPython doesn't provide an efficient way to get UTF-8 from the unicode object.
* PyUnicode_AsUTF8AndSize(): When the unicode contains a non-ASCII character, it creates a UTF-8 cache. The cache is kept for longer than required (it lives as long as the unicode object), and creating it costs an additional malloc + memcpy.
* PyUnicode_AsUTF8String(): It creates a bytes object even when the unicode object is ASCII-only or already has a UTF-8 cache.
For speed and efficiency, I propose a new API:
```
/* Borrow the UTF-8 C string from the unicode.
*
* Store a pointer to the UTF-8 encoding of the unicode to *utf8* and its size to *len*.
* The returned object is the owner of *utf8*. You need to Py_DECREF() it after
* you have finished using *utf8*. The owner may not be the unicode object itself.
* Returns NULL if an error occurred while encoding the unicode.
*/
PyObject* PyUnicode_BorrowUTF8(PyObject *unicode, const char **utf8, Py_ssize_t *len);
```
When the unicode object is ASCII or has a UTF-8 cache, this API increments the refcount of the unicode object and returns it.
Otherwise, this API calls `_PyUnicode_AsUTF8String(unicode, NULL)` and returns the resulting bytes object.
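The ownership rule can be approximated with public C API calls only. The following is a rough sketch, not the proposed implementation: `borrow_utf8_sketch` and `utf8_size` are hypothetical names, the sketch assumes the CPython 3.9-era API (where `PyUnicode_READY()` is required before using the PEP 393 macros), and it omits the "has UTF-8 cache" fast path because that state is, as far as I know, not exposed through a public macro.

```c
#define PY_SSIZE_T_CLEAN
#include <Python.h>

/* Hypothetical sketch of the described behaviour, using public APIs only. */
static PyObject *
borrow_utf8_sketch(PyObject *unicode, const char **utf8, Py_ssize_t *len)
{
    if (!PyUnicode_Check(unicode)) {
        PyErr_BadArgument();
        return NULL;
    }
    /* Assumption: 3.9-era CPython, where strings may need to be readied. */
    if (PyUnicode_READY(unicode) < 0) {
        return NULL;
    }
    if (PyUnicode_IS_ASCII(unicode)) {
        /* ASCII data is already valid UTF-8: the unicode owns the buffer. */
        *utf8 = (const char *)PyUnicode_DATA(unicode);
        *len = PyUnicode_GET_LENGTH(unicode);
        Py_INCREF(unicode);
        return unicode;
    }
    /* Otherwise encode into a new bytes object, which becomes the owner. */
    PyObject *bytes = PyUnicode_AsUTF8String(unicode);
    if (bytes == NULL) {
        return NULL;
    }
    *utf8 = PyBytes_AS_STRING(bytes);
    *len = PyBytes_GET_SIZE(bytes);
    return bytes;
}

/* Hypothetical caller: return the UTF-8 size of a str as a Python int. */
static PyObject *
utf8_size(PyObject *self, PyObject *arg)
{
    const char *utf8;
    Py_ssize_t len;
    PyObject *owner = borrow_utf8_sketch(arg, &utf8, &len);
    if (owner == NULL) {
        return NULL;
    }
    /* utf8[0..len) stays valid until the owner is released. */
    PyObject *result = PyLong_FromSsize_t(len);
    Py_DECREF(owner);
    return result;
}
```

The caller never has to know whether the owner is the unicode object or a temporary bytes object; it only releases the owner once it is done with the pointer.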
----------
components: C API
messages: 358623
nosy: inada.naoki
priority: normal
severity: normal
status: open
title: No efficient API to get UTF-8 string from unicode object.
type: enhancement
versions: Python 3.9
_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue39087>
_______________________________________