Flexible string representation, unicode, typography, ...

Thu Aug 30 12:27:04 EDT 2012

On Thu, Aug 30, 2012 at 2:51 AM,  <wxjmfauth at gmail.com> wrote:
> But as soon as you introduce artificially a "latin-1"
> bottleneck, all this machinery just become useless.

How is this a bottleneck?  If you removed the Latin-1 encoding
altogether and limited the flexible representation to just UCS-2 /
UCS-4, I doubt very much that you would see any significant speed
gains. The flexibility is the part that makes string creation slower,
not the Latin-1 option in particular.
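
For anyone who wants to see the flexible representation at work, here is
a minimal sketch. It assumes CPython 3.3 or later, where sys.getsizeof
reflects the internal string storage; bytes_per_char is a helper name I
made up for illustration, not part of any API.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage for a string made of `ch` repeated.

    Assumption: CPython 3.3+ (PEP 393). Other implementations may report
    different sizes. Comparing two lengths cancels the fixed object header.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\u00e9"),            # e with acute accent
    ("BMP, above Latin-1", "\u20ac"),  # euro sign
    ("outside the BMP", "\U0001F600"),
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On such an interpreter this should report 1, 1, 2 and 4 bytes per
character respectively, so the Latin-1 form is just one of three widths
the same machinery chooses between.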

> This flexible representation is working absurdly.
> It optimizes the characters you are not using (in one
> sense), it defaults to a non optimized form for the
> characters you wish to use.

I'm sure that if you wanted to you could patch Python to use Latin-9
instead.  Just be prepared for it to be slower than UCS-2, since it
would mean having to encode the code points rather than merely
truncating them.
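
To illustrate the difference between truncating and encoding, here is a
small sketch using only the standard codecs. The helper name
byte_equals_code_point is hypothetical; the point is that in Latin-1 the
stored byte is the code point itself, whereas Latin-9 (ISO-8859-15)
needs a table lookup for characters such as the euro sign.

```python
def byte_equals_code_point(ch, codec):
    """Return True if `ch` encodes in `codec` to a single byte whose value
    equals its Unicode code point (i.e. no translation table is needed)."""
    try:
        encoded = ch.encode(codec)
    except UnicodeEncodeError:
        return False
    return encoded == bytes([ord(ch)]) if ord(ch) < 256 else False

euro = "\u20ac"      # EURO SIGN, code point 0x20AC
e_acute = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, code point 0xE9

# Latin-1: the byte is simply the code point.
print(byte_equals_code_point(e_acute, "latin-1"))

# Latin-9: the euro sign lives at byte 0xA4, far from its code point,
# so storing it means a lookup rather than a truncation.
print(hex(ord(euro)), euro.encode("iso8859-15"))
print(byte_equals_code_point(euro, "iso8859-15"))
```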

> Pick up a random text and see the probability this
> text match the most optimized case 1 char / 1 byte,
> practically never.

Pick up a random text and see that this text matches the next most
optimized case, 1 char / 2 bytes: practically always.
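
If you want to check this against your own texts, a rough sketch
follows. It mirrors the rule from PEP 393 as I understand it: the
highest code point in the string decides the width for the whole
string. storage_width is an illustrative name, not an existing
function.

```python
def storage_width(text):
    """Bytes per character a PEP 393 style representation would need,
    based only on the highest code point present in `text`."""
    top = max(map(ord, text), default=0)
    if top < 0x100:       # ASCII and Latin-1
        return 1
    if top < 0x10000:     # rest of the Basic Multilingual Plane
        return 2
    return 4              # anything beyond the BMP

print(storage_width("na\u00efve caf\u00e9"))        # Latin-1 only
print(storage_width("\u0152uvre \u00e0 5 \u20ac"))  # OE ligature and euro sign
print(storage_width("smile \U0001F600"))            # one astral character
```

A single character such as the OE ligature or the euro sign moves the
whole string to 2 bytes per character, which is still no worse than the
UCS-2 storage of a narrow build.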

> If a user will use exclusively latin-1, she/he is  better
> served by using a dedicated tool for "latin-1"

Speaking as a user who almost exclusively uses Latin-1, I strongly
disagree.  What you're describing is Python 2.x.  The user is almost
always better served by not having to worry about the full extent of
the character set their program might use.  That's why we moved to
Unicode strings in Python 3 in the first place.

> If a user will comfortably work with Unicode, she/he is
> better served by using one of this tools which is using
> properly one of the available Unicode schemes.
>
> In a funny way, this is what Python was doing and it
> performs better!

Seriously, please show us just one *real world* benchmark in which
Python 3.3 performs demonstrably worse than Python 3.2.  All you've
shown so far is this one microbenchmark of string creation that is
utterly irrelevant to actual programs.