String performance regression from python 3.2 to 3.3

rusi rustompmody at gmail.com
Sat Mar 16 12:39:29 EDT 2013


On Mar 16, 6:29 pm, Roy Smith <r... at panix.com> wrote:
> In article <51440235$0$29965$c3e8da3$54964... at news.astraweb.com>,
>  Steven D'Aprano <steve+comp.lang.pyt... at pearwood.info> wrote:
>
> > UTF-32 is a *fixed width* storage mechanism where every code point takes
> > exactly four bytes. Since the entire Unicode range will fit in four
> > bytes, that ensures that every code point is covered, and there is no
> > need to walk the string every time you perform an indexing operation. But
> > it means that if you're one of the 99.9% of users who mostly use
> > characters in the BMP, your strings take twice as much space as
> > necessary. If you only use Latin1 or ASCII, your strings take four times
> > as much space as necessary.
>
> I suspect that eventually, UTF-32 will win out.  I'm not sure when
> "eventually" is, but maybe sometime in the next 10-20 years.

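The ratios Steven quotes are easy to check by encoding a few strings
and counting bytes. This is only a rough sketch: the sample strings
and the helper name encoded_sizes are mine, not from the thread, and
it measures encoded byte counts, not how any interpreter stores
strings internally.

```python
# Bytes needed for the same text under three Unicode encodings.
# The "-le" codecs are used so that no byte-order mark inflates the counts.
CODECS = ("utf-8", "utf-16-le", "utf-32-le")

# Hypothetical sample strings, chosen only for illustration.
samples = {
    "ascii": "hello world",
    "bmp": "привет мир",          # Cyrillic: every code point is in the BMP
    "astral": "\U0001F600" * 4,   # code points outside the BMP
}


def encoded_sizes(text):
    """Return a dict mapping codec name to encoded length in bytes."""
    return {codec: len(text.encode(codec)) for codec in CODECS}


for name, text in samples.items():
    print(name, encoded_sizes(text))
```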
There is an article by Tim O'Reilly IIRC that talks of a certain
prognostication that went wrong.
[If someone knows this article please give me the link]

The gist as I remember it was:
First there were audio cassettes and LPs.
Then came CDs with far better fidelity.
As Moore's law went its relentless way, the audio industry put its
hopes into formats that would double CD quality.  Whereas the public
went with mp3s, i.e. a distinctly lower quality format, because putting
a thousand CDs into my pocket beats the pants off some super-duper
hi-fi new CD.
So while Moore's law takes its course, public demand and therefore big
money and therefore new standards may go some other way, including
reverse.

I believe that there are many things about Unicode that are less than
satisfactory. Some are downright asinine, like the 'prime-real-estate'
devoted to the control characters and never used.

In short, I am not betting on UTF-32.
Of course, the reverse side is also there: some of the world's most
suboptimal standards are also the most ubiquitous, like the QWERTY
keyboard.


