RE Module Performance

Thu Jul 25 01:56:46 EDT 2013

On Wed, 24 Jul 2013 09:00:39 -0600, Michael Torrie wrote about JMF:

> His most recent argument that Python should use UTF as a representation
> is very strange to be honest.

He's not arguing for anything; he is just hating on anything that gives 
even the tiniest benefit to ASCII users. This isn't about Python 3.3 
hurting non-ASCII users, because that is demonstrably untrue: they are 
*better off* in Python 3.3. This is about denying even a tiny benefit to 
ASCII users.

In Python 3.3, non-ASCII users have these advantages compared to previous 
versions:

- strings will usually take less memory, and aside from trivial changes 
to the object header, they never take more memory than a wide build would 
use;

- consequently nearly all objects will take less memory (especially 
builtins and standard library objects, which are all ASCII), since 
objects contain dozens of internal strings (attribute and method names in 
__dict__, class name, etc.);

- consequently whole-application benchmarks show most applications will 
use significantly less memory, which leads to faster speeds;

- you cannot break surrogate pairs apart by accident, which you can do in 
narrow builds;

- in previous versions, code which works when run in a wide build may 
fail in a narrow build, but that is no longer an issue since the 
distinction between wide and narrow builds is gone;

- Latin-1 users, who include JMF himself, will likewise see memory 
savings, since Latin-1 strings will take half the size they would in 
narrow builds and a quarter the size they would in wide builds.
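
For anyone who wants to check the memory claims above, here is a rough 
sketch. It assumes CPython 3.3 or later, where sys.getsizeof reports the 
size of the string object; the helper name bytes_per_char is made up for 
this illustration. Comparing two strings of the same kind cancels out the 
fixed object header, leaving the marginal cost per character.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Marginal storage per character for a string made only of `ch`.

    Hypothetical helper: relies on CPython's sys.getsizeof, which is an
    implementation detail and may differ on other interpreters.
    """
    # Both strings use the same internal width, so the header cancels out.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


samples = [
    ("ASCII", "a"),
    ("Latin-1", "\u00e9"),      # e with acute accent
    ("BMP", "\u20ac"),          # euro sign
    ("astral", "\U0001F600"),   # outside the Basic Multilingual Plane
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On CPython 3.3+ I would expect 1, 1, 2 and 4 bytes per character 
respectively, against a flat 2 (narrow) or 4 (wide) in older builds.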

The cost of all these benefits is a small overhead when creating a string 
in the first place, and some purely internal added complication to the 
string implementation.

I'm the first to argue against complication unless there is a 
corresponding benefit. This is a case where the benefit has proven itself 
doubly: Python 3.3's Unicode implementation is *more correct* than 
before, and it uses less memory to do so.
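
To make the "more correct" point concrete, here is a small sketch, 
assuming Python 3.3 or later. A narrow build cannot be run here, so it is 
simulated by encoding to UTF-16 and cutting at a code-unit boundary; the 
function name cut_is_valid is invented for the example.

```python
s = "a\U0001F600b"   # 'a', one astral character, 'b'

# In 3.3+ the astral character is a single item, so slicing keeps it whole.
print(len(s))                    # 3 code points
print(s[1] == "\U0001F600")      # True

# Simulated narrow build: the same text as UTF-16 code units (2 bytes each).
units = s.encode("utf-16-le")


def cut_is_valid(data: bytes, n_units: int) -> bool:
    """Return True if the first n_units UTF-16 code units decode cleanly."""
    try:
        data[: 2 * n_units].decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        # The cut landed between the two halves of a surrogate pair.
        return False


print(cut_is_valid(units, 1))    # cut after 'a'
print(cut_is_valid(units, 2))    # cut in the middle of the surrogate pair
print(cut_is_valid(units, 3))    # cut after the complete pair
```

The middle cut is the accident a narrow build allowed with an ordinary 
slice; in 3.3 there is no index at which a str slice can do that.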

> The cons of UTF are apparent and widely
> known.  The main con is that UTF strings are O(n) for indexing a
> position within the string.

Not so for UTF-32.
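
A minimal sketch of the difference, working on raw encoded bytes rather 
than on str objects. The function names utf32_index and utf8_index are 
made up for this illustration and are not part of any library. UTF-32 
uses a fixed four bytes per code point, so the offset is simple 
arithmetic; UTF-8 is variable-width, so finding the i-th code point means 
scanning from the start.

```python
import struct


def utf32_index(buf: bytes, i: int) -> str:
    """Return the i-th code point of a BOM-less UTF-32-LE buffer in O(1)."""
    # Fixed width: code point i always starts at byte offset 4 * i.
    (code_point,) = struct.unpack_from("<I", buf, 4 * i)
    return chr(code_point)


def utf8_index(buf: bytes, i: int) -> str:
    """Return the i-th code point of a UTF-8 buffer by scanning, O(n)."""
    count = -1
    for pos, byte in enumerate(buf):
        # Continuation bytes look like 0b10xxxxxx; anything else starts
        # a new code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == i:
                end = pos + 1
                while end < len(buf) and buf[end] & 0xC0 == 0x80:
                    end += 1
                return buf[pos:end].decode("utf-8")
    raise IndexError(i)


text = "a\u00e9\u20ac\U0001F600z"
as_utf32 = text.encode("utf-32-le")
as_utf8 = text.encode("utf-8")

print(utf32_index(as_utf32, 3) == text[3])   # True
print(utf8_index(as_utf8, 3) == text[3])     # True
```

The price of the constant-time lookup is four bytes for every character, 
which is exactly what a wide build paid.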

-- 
Steven