Py 3.3, unicode / upper()

Ian Kelly ian.g.kelly at gmail.com
Thu Dec 20 19:34:03 EST 2012


On Thu, Dec 20, 2012 at 12:19 PM,  <wxjmfauth at gmail.com> wrote:
> The first (and it should be quite obvious) consequence is that
> you create bloated, unnecessary and useless code. I will simplify
> the flexible string representation (FSR) and use an "ascii" /
> "non-ascii" model/terminology.
>
> If you are an "ascii" user, an FSR model makes no sense. An
> "ascii" user will use, per definition, only "ascii" characters.
>
> If you are a "non-ascii" user, the FSR model is also
> nonsense, because you are, per definition, a "non-ascii" user
> of "non-ascii" characters. Any optimisation for "ascii" users
> just becomes irrelevant.
>
> In one sense, to escape from this, you have to be at the same time
> a non-"ascii" user and a non-"non-ascii" user. Impossible.
> In both cases, an FSR model is useless, and in both cases you are
> forced to use bloated and unnecessary code.

As Terry and Steven have already pointed out, there is no such thing
as a "non-ascii" user.  Here I will take the complementary approach
and point out that there is also no such thing as an "ascii" user.
There are only users whose strings are 99.99% (or more) ASCII.  A user
may think that his program will never be given any non-ASCII input to
deal with, but experience tells us that this thought is probably
wrong.
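
A tiny sketch makes the point (the function and the sample names
here are invented for illustration, not anything from this thread):

    # A function that assumes ASCII-only input works right up until
    # the first real-world string arrives.
    def ascii_only(s):
        # stand-in for a hypothetical ascii-only string type
        return s.encode("ascii").decode("ascii")

    print(ascii_only("John Smith"))   # fine
    print(ascii_only("Zoë Müller"))   # UnicodeEncodeError -- the
                                      # "ascii user" assumption breaks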

Suppose you were to split the Unicode representation into separate
"ASCII-only" and "wide" data types.  Then which data type is the
correct one to choose for an "ascii" user?  The correct answer is
*always* the wide data type, for the reason stated above.  If the user
chooses the ASCII-only data type, then as soon as his program encounters
non-ASCII data, it breaks.  The only users of the ASCII-only data type
then would be the authors of buggy programs.  The same issue applies
to narrow (UTF-16) data types.  So there really are only two viable,
non-buggy options for Unicode representations: FSR, or always wide
(UTF-32).  The latter is wildly inefficient in many cases, so Python
went with FSR.
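
The "wildly inefficient" part is easy to quantify on a 3.3 build.
A quick sketch using sys.getsizeof (exact numbers vary by platform
and version; the ratio is the point):

    import sys

    # Under the FSR, an all-ASCII string costs about 1 byte per
    # character plus a fixed header; always-wide would cost 4 per.
    s = "x" * 1000
    print(sys.getsizeof(s))            # ~1049 on 64-bit CPython 3.3
    print(len(s.encode("utf-32-le")))  # 4000: what always-wide pays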

A third option might be proposed, which would be to have a build
switch between FSR or always wide, with the promise that the two will
be indistinguishable at the Python level (apart from the amount of
memory used).  This is probably not on the table, however, as it would
have a non-negligible maintenance cost, and it's not clear that
anybody other than you would actually want it.

> A solution à la FSR cannot work, or cannot work in an optimized
> way. It is not a coding scheme, it is a composite of coding
> schemes handling several character sets. Hard to imagine
> something worse.

It is not a composite of coding schemes.  The str type deals with
exactly *one* character set -- the UCS.  The different representations
are not different coding schemes.  They are *all* UTF-32.  The only
significant difference between the representations is that the leading
zero bytes of each character are made implicit (i.e. truncated) if the
nature of the string allows it.
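
That is easy to verify with nothing but the standard codecs:

    # For a string in the Latin-1 range, the narrow representation is
    # literally UTF-32 (big-endian) with the three leading zero bytes
    # of each code point stripped off.
    s = "abc"
    wide = s.encode("utf-32-be")   # b'\x00\x00\x00a\x00\x00\x00b\x00\x00\x00c'
    narrow = s.encode("latin-1")   # b'abc'
    assert wide[3::4] == narrow    # drop the zeros, recover the narrow form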

> Contrary to what has been said, the bad cases I presented here are
> not corner cases.

The only significantly regressive case that you've presented here has
been str.replace on inputs engineered for bad performance.  That's why
people characterize them as corner cases -- because that's exactly
what they are.
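
For anyone curious what such a case looks like, here is a rough
reconstruction (my own sketch, not the original benchmark):
replacing an ASCII character with a non-ASCII one forces the whole
result string to be widened, which is the engineered worst case.

    import timeit

    setup = "s = 'abcdefgh' * 10000"
    # The replacement char U+20AC is outside Latin-1, so the result
    # must be rebuilt at 2 bytes per character.
    print(timeit.timeit("s.replace('a', '\\u20ac')",
                        setup=setup, number=100))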

> There is practically and systematically a regression
> in Py33 compared to Py32.
> That's very easy to test. I did all my tests in the light of
> what I explained above. It was not a surprise for me to see
> this expectedly bad behaviour.

Have you run stringbench.py yet?  When I ran it on my system, the full
set of Unicode benchmarks ran in 268.15 seconds for Python 3.2 versus
198.77 seconds for Python 3.3.  That's a 26% overall speedup for the
covered benchmarks, which seem reasonably thorough.  That does not
demonstrate a "systematic regression".  If anything, that shows a
systematic improvement.
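
For those who don't want to dig stringbench.py out of the CPython
source tree, the same idea in miniature is to time a *spread* of
ordinary operations rather than one engineered case, and run the
script under both interpreters:

    import timeit

    cases = {
        "find":    ("s.find('xyz')",       "s = 'a' * 10000 + 'xyz'"),
        "upper":   ("s.upper()",           "s = 'hello world ' * 1000"),
        "join":    ("'-'.join(parts)",     "parts = ['word'] * 1000"),
        "replace": ("s.replace('l', 'L')", "s = 'hello world ' * 1000"),
    }
    for name, (stmt, setup) in cases.items():
        print(name, timeit.timeit(stmt, setup=setup, number=1000))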

Your cherry-picking of benchmarks is like a driver who has two routes
to their destination; one takes ten minutes on average but has one
annoyingly long traffic light, while the second takes fifteen minutes
on average but has no traffic lights (and a correspondingly higher
accident rate).  Yet for some reason you insist that the second route
is better because the traffic light makes the first route
"systematically" slower.


