String performance regression from Python 3.2 to 3.3

Roy Smith roy at panix.com
Sat Mar 16 09:29:01 EDT 2013


In article <51440235$0$29965$c3e8da3$5496439d at news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python at pearwood.info> wrote:

> UTF-32 is a *fixed width* storage mechanism where every code point takes 
> exactly four bytes. Since the entire Unicode range will fit in four 
> bytes, that ensures that every code point is covered, and there is no 
> need to walk the string every time you perform an indexing operation. But 
> it means that if you're one of the 99.9% of users who mostly use 
> characters in the BMP, your strings take twice as much space as 
> necessary. If you only use Latin1 or ASCII, your strings take four times 
> as much space as necessary.
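
For what it's worth, you can watch this play out in CPython 3.3 itself.
Here's a minimal sketch, assuming the PEP 393 flexible representation
(the exact byte counts include per-object overhead and vary by build):

import sys

# CPython 3.3+ stores each string at the narrowest fixed width that
# covers every code point in it: 1, 2, or 4 bytes per character.
ascii_text  = "a" * 1000            # Latin1/ASCII range: 1 byte each
bmp_text    = "\u03b1" * 1000       # Greek alpha, in the BMP: 2 bytes each
astral_text = "\U0001F600" * 1000   # emoji, outside the BMP: 4 bytes each

for s in (ascii_text, bmp_text, astral_text):
    # Roughly 1000, 2000, and 4000 bytes, plus a fixed per-object overhead.
    print(sys.getsizeof(s))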

I suspect that eventually, UTF-32 will win out.  I'm not sure when 
"eventually" is, but maybe sometime in the next 10-20 years.

When I was starting out, the computer industry had a variety of 
character encodings designed to take up less than 8 bits per character.  
Sixbit, Rad-50, BCD, and so on.  Each of these added complexity and took 
away character set richness, but saved a few bits.  At the time, memory 
was so expensive and so precious, it was worth it.

Over the years, memory became cheaper, address spaces grew from 16 to 32 
to 64 bits, and the pressure to use richer character sets kept 
increasing.  So, now we're at the point where people are (mostly) using 
Unicode, but are still arguing about which encoding to use because the 
"best" complexity/space tradeoff isn't obvious.

At some point in the future, memory will be so cheap, and so ubiquitous, 
that people will be wondering why we neanderthals bothered worrying 
about trying to save 16 bits per character.  Of course, by then, we'll 
be migrating to Mongocode and arguing about UTF-64 :-)


