Py 3.3, unicode / upper()

wxjmfauth at gmail.com
Thu Dec 20 14:19:50 EST 2012


Fact.
In order to work comfortably and efficiently with a "scheme for
the coding of characters", be it Unicode or any other coding
scheme, one has to take two things into account: 1) work with a
single set of characters and 2) work with a contiguous block of
code points.

At this point, it should be noted that I have not even written
about the actual encoding, only about characters and code points.
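
A minimal illustration of that distinction in Python: only
characters and their code points, no byte encoding involved yet.

    # Each character corresponds to exactly one code point and
    # vice versa; no encoding to bytes is involved here.
    for ch in "aé€":
        print(ch, hex(ord(ch)))
    print(chr(0x20AC))   # '€'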

Now, let's take a look at what happens when one breaks the rules
above, specifically when one attempts to work with multiple
character sets or divides - artificially - the whole range of
Unicode code points into chunks.

The first (and quite obvious) consequence is that you create
bloated, unnecessary and useless code. I will simplify the
flexible string representation (FSR) and use an "ascii" /
"non-ascii" model/terminology.

If you are an "ascii" user, an FSR model makes no sense. An
"ascii" user will, by definition, use only "ascii" characters.

If you are a "non-ascii" user, the FSR model makes no sense
either, because you are by definition a "non-ascii" user of
"non-ascii" characters. Any optimisation for "ascii" users
becomes irrelevant.

In a sense, to escape from this you would have to be at the same
time a non-"ascii" user and a non-"non-ascii" user. Impossible.
In both cases an FSR model is useless, and in both cases you are
forced to use bloated and unnecessary code.
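
For reference, this "splitting" can be observed directly in
CPython 3.3; a minimal sketch (the exact byte counts depend on
the build, only the per-character width matters):

    import sys

    # Under PEP 393 (the FSR), CPython picks a 1-, 2- or 4-byte
    # storage unit per string, depending on the widest code point
    # the string contains.
    one_byte   = "a" * 1000            # code points <= 0xFF
    two_bytes  = "\u20ac" * 1000       # a code point <= 0xFFFF
    four_bytes = "\U0001F600" * 1000   # a code point >  0xFFFF

    for s in (one_byte, two_bytes, four_bytes):
        print(len(s), sys.getsizeof(s))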

The rule is to treat every character of the single character set
of a coding scheme in, so to say, an "equal way". The problem can
be seen the other way around: every coding scheme has been built
to work with a single set of characters, otherwise it does not
work properly!

The second negative aspect of this splitting is the splitting
itself. One can optimize every subset of characters, but one will
always pay for the "switch" between the subsets. This is one more
reason to work with a single set of characters, or rather the
reason why every coding scheme handles a single set of
characters.

Up to now, I have spoken only about characters and sets of
characters, not about the encoding of the characters.
There is a point which is quite hard to understand and also hard
to explain; it becomes obvious with some experience.

When one works with a coding scheme, one always has to think in
characters / code points. If one takes the perspective of encoded
code points, it simply does not work, or may not work very well
(memory/speed). The whole problem is that it is impossible to
work with characters directly; one is forced to manipulate
encoded code points as if they were characters. Unicode is built
and thought out to work with code points, not with encoded code
points. The serialization, the transformation code point ->
encoded code point, is "only" a technical and secondary process.
No surprise, all the Unicode coding schemes (utf-8, 16, 32) work
with the same set of characters. They differ in the
serialization, but they all work with a single set of characters.
utf-16 / ucs-2 is an interesting case. Their encoding mechanisms
are almost the same; the difference lies in the sets of
characters they can represent.
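
A small sketch of that last point: the code points stay the same,
only the serialization changes, and the utf-16 / ucs-2 difference
is precisely which characters can be reached at all.

    s = "a\u20ac\U0001F600"          # three characters, three code points

    print([hex(ord(c)) for c in s])  # ['0x61', '0x20ac', '0x1f600']

    # The same code points, three different serializations:
    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        print(codec, s.encode(codec))

    # utf-16 turns the non-BMP code point U+1F600 into a surrogate
    # pair (d8 3d de 00); ucs-2 cannot represent it at all.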

There is another way to understand the problem empirically: the
historical evolution of character coding. Practically all coding
schemes have been created to handle different sets of characters,
or rather coding schemes have been created because that is the
only way to work properly. If it had been possible to work with
multiple character sets at once, I'm pretty sure a solution would
have emerged. It never happened, and otherwise it would not have
been necessary to create ISO 10646 or Unicode, nor to create all
these encodings iso-8859-***, cp***, mac** which are all *based
on a set of characters*.

Plan 9 attempted to work with multiple character sets; it did not
work very well, the main issue being the switch between the
encodings.

A solution à la FSR cannot work, or cannot work in an optimized
way. It is not a coding scheme, it is a composite of coding
schemes handling several character sets. Hard to imagine
something worse.

Contrary to what has been said, the bad cases I presented here
are not corner cases. There is practically and systematically a
regression in Py33 compared to Py32.
That is very easy to test. I did all my tests in the light of
what I explained above; this expectedly bad behaviour was no
surprise to me.
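
The kind of micro-benchmark I mean can be run with timeit; a
sketch (the strings and repeat counts are arbitrary, and the same
snippet has to be run under Py32 and Py33 on the same machine):

    import timeit

    # The same operation on an "ascii-only" string and on a string
    # that forces a wider internal representation.
    ascii_stmt = "('abc' * 1000).replace('c', 'x')"
    wide_stmt  = "('abc' * 1000).replace('c', '\\u20ac')"

    print(timeit.timeit(ascii_stmt, number=10000))
    print(timeit.timeit(wide_stmt, number=10000))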

Python is not my tool. If I may allow myself a piece of advice, a
scientific approach: I suggest the core devs first spend their
time proving that an FSR model can beat the existing models
(purely at the C level) and then, if they succeed, implement it.

My feeling is that most people are defending this FSR simply
because it exists, not because of its intrinsic quality.

Hint: I suggest the experts take a comprehensive look at the
cmap table of OpenType fonts (pure Unicode technology).
Those people know how to work.
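
For what it's worth, the cmap table is exactly a mapping from
Unicode code points to glyphs, and it can be inspected with the
third-party fontTools package; a sketch (the font path is a
placeholder, and a Unicode cmap subtable is assumed to exist):

    # pip install fonttools
    from fontTools.ttLib import TTFont

    font = TTFont("SomeFont.otf")        # placeholder font file
    cmap = font["cmap"].getBestCmap()    # dict: code point -> glyph name

    print(len(cmap), "code points mapped")
    print(hex(0x20AC), "->", cmap.get(0x20AC))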

I would be very happy to be wrong. Unfortunately, I'm afraid that
is not the case.

jmf


