[pypy-dev] Unicode encode/decode speed

Eleytherios Stamatogiannakis estama at gmail.com
Mon Feb 11 16:48:22 CET 2013


Hi,

We have been following the nightly builds of PyPy, with our testing 
workload (first described in the "CFFI speed results" thread).

The news is very good: the performance of PyPy + CFFI has gone up 
considerably (~30% faster) since the last time we wrote about it!

Adding to that speed-up our own optimizations of the CFFI-based 
SQLite3 wrapper (MSPW) that we are developing, the end result is that 
most of our test queries now run at the same speed as, or faster than, 
CPython + APSW.

Unfortunately, one of the queries where PyPy is slower [*] than CPython 
+ APSW is central to all of our workflows, which means that we cannot 
fully convert to PyPy.

The main culprit of PyPy's slowness is the conversion (encoding, 
decoding) between PyPy's unicode strings and UTF-8. It is the only item 
left at the top of our performance profiles, accounting for a large 
share (~48%) of the time.

Right now we are using PyPy's "codecs.utf_8_encode" and 
"codecs.utf_8_decode" to do this conversion.

Is there a faster way to do these conversions (encoding, decoding) in 
PyPy? Does CPython do something more clever than PyPy, such as storing 
unicode strings whose content is entirely ASCII in an ASCII 
representation?
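To frame the question: for pure-ASCII text the UTF-8 encoding is byte-for-byte identical to the ASCII encoding, so a wrapper could try a cheap ASCII encode first and fall back to the general encoder. This is only a sketch of a possible workaround on our side, not a claim about either interpreter's internals:

```python
import codecs

def encode_utf8(s):
    # Fast path: ASCII and UTF-8 produce identical bytes for
    # pure-ASCII text, so try the (potentially cheaper) codec first.
    try:
        return s.encode("ascii")
    except UnicodeEncodeError:
        # Fall back to the general UTF-8 encoder for non-ASCII text.
        return codecs.utf_8_encode(s)[0]

assert encode_utf8("hello") == b"hello"
assert encode_utf8("καλημέρα") == "καλημέρα".encode("utf-8")
```

Whether the ASCII path is actually faster under PyPy's JIT is exactly what we would like to find out.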

Thank you very much,

lefteris.

[*]
  For 1M rows:
  CPython + APSW: 10.5 sec
  PyPy + MSPW: 15.5 sec
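The timings above are for the full query. A standalone micro-benchmark along these lines could isolate just the conversion cost on each interpreter (the row count and sample string are illustrative, not our actual data):

```python
import codecs
import timeit

# Time the encode/decode round-trip alone, isolated from SQLite
# and the wrapper; string content and row count are illustrative.
ROWS = 1000000
sample = "a moderately sized unicode row value: καλημέρα"

def round_trip():
    encoded, _ = codecs.utf_8_encode(sample)
    decoded, _ = codecs.utf_8_decode(encoded)
    return decoded

elapsed = timeit.timeit(round_trip, number=ROWS)
print("%d round-trips: %.2f sec" % (ROWS, elapsed))
```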
