[pypy-dev] Unicode encode/decode speed

Mon Feb 18 12:37:21 CET 2013

On 17/02/13 11:43, Armin Rigo wrote:
> Hi,
>
> On Tue, Feb 12, 2013 at 7:14 PM, Eleytherios Stamatogiannakis
> <estama at gmail.com> wrote:
>> Also we are looking into adding a special ffi.string_decode_UTF8 in CFFI's
>> backend to reduce the number of calls that are needed to go from utf8_char*
>> to PyPy's unicode.
>
> A first note: I'm wondering why you need to convert from
> utf-8-that-contains-only-ascii, to unicode, and back.  What is the
> point of having unicode strings in the first place?  Can't you just
> pass around your complete program plain non-unicode strings?
>

The problem is that SQLite uses UTF-8 internally, so you cannot know in 
advance whether the char* that you get from it is plain ASCII or 
UTF-8-encoded Unicode. So we end up always converting the char* that 
SQLite returns to Unicode.
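
As a rough sketch of why one decode path is enough (Python 3 names, where 
bytes plays the role of Python 2's str; the helper name is made up for 
illustration and is not part of any wrapper): ASCII is a subset of UTF-8, 
so the same call handles both kinds of input.

```python
def decode_sqlite_text(raw):
    """Decode bytes that came back from SQLite through a char*.

    Hypothetical helper. Pure-ASCII input is also valid UTF-8, so no
    up-front check of the contents is needed.
    """
    return raw.decode("utf-8")


# Pure ASCII and multilingual text go through the same path.
ascii_text = decode_sqlite_text(b"hello")
greek_text = decode_sqlite_text("\u03b3\u03b5\u03b9\u03ac".encode("utf-8"))
```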

When sending to it, we have different code paths for Python's str() and 
unicode() string representations. Unfortunately, due to the nature of 
our data (it's multilingual), and to make our life easier when we code 
our relational operators (written in Python), we always convert to 
Unicode inside our operators. So the str() path inside the MSPW SQLite 
wrapper mostly sits unused.
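
The two sending paths can be pictured roughly like this. This is only a 
sketch, not the actual wrapper code: the function name is hypothetical, 
and it uses Python 3's bytes and str in place of Python 2's str and 
unicode.

```python
def to_sqlite_bytes(value):
    """Return the UTF-8 bytes to hand to SQLite (hypothetical helper)."""
    if isinstance(value, bytes):
        # Byte-string path: passed through unchanged.
        return value
    if isinstance(value, str):
        # Unicode path: one encode per value sent.
        return value.encode("utf-8")
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)
```

Since the operators always produce Unicode, nearly every value takes the 
second branch and pays for an encode.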

> If not, then indeed, it would make (a bit of) sense to have ways to
> convert directly between "char *" and unicode strings, in both
> directions, assuming utf-8.  This could be done with an API like:
>
> ffi.encode_utf8(unicode_string) -> new_char*_cdata
> ffi.encode_utf8(unicode_string, target_char*_cdata, maximum_length)
> ffi.decode_utf8(char*_cdata, [length]) -> unicode_string
>
> Alternatively, we could accept unicode strings whenever a "char*" is
> expected and encode it to utf-8, but that sounds a bit too magical.
>

An API like the one you propose would be very nice, and IMHO would give 
a substantial speedup.

May I suggest that, for generality, the same API functions also be 
added for UTF-16 and UTF-32?
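
To illustrate the kind of semantics I mean, here is a sketch that uses 
the standard ctypes module as a stand-in for a C buffer. It is not CFFI 
code, the function names are invented for this example, and the codec is 
simply a parameter so that UTF-8, UTF-16 and UTF-32 share one code path.

```python
import ctypes


def encode_to_buffer(text, encoding="utf-8"):
    """Encode text into a new C char buffer; return (buffer, length).

    Hypothetical helper. The buffer is exactly as long as the encoded
    data, with no NUL terminator, so the length must be kept alongside it.
    """
    data = text.encode(encoding)
    return ctypes.create_string_buffer(data, len(data)), len(data)


def decode_from_buffer(buf, length, encoding="utf-8"):
    """Decode `length` bytes starting at `buf` into a unicode string."""
    return ctypes.string_at(buf, length).decode(encoding)


# Round trip with the three codecs (little-endian variants, no BOM).
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    buf, n = encode_to_buffer("\u03b3\u03b5\u03b9\u03ac", codec)
    assert decode_from_buffer(buf, n, codec) == "\u03b3\u03b5\u03b9\u03ac"
```

An explicit length matters more for UTF-16 and UTF-32, because their 
encoded data can contain zero bytes and so cannot be treated as a 
NUL-terminated char*.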

Thanks Armin and Maciej for looking into this,

l.