'Straße' ('Strasse') and Python 2

Thu Jan 16 09:07:59 EST 2014

On Thu, 16 Jan 2014 10:51:42 +0000, Robin Becker wrote:

> On 16/01/2014 00:32, Steven D'Aprano wrote:
>>> >Or are you saying that www.unicode.org is wrong about the definitions
>>> >of Unicode terms?
>> No, I think he is saying that he doesn't know Unicode anywhere near as
>> well as he thinks he does. The question is, will he cherish his
>> ignorance, or learn from this thread?
> 
> I assure you that I fully understand my ignorance of unicode.

Robin, while I'm very happy to see that you have a good grasp of what you 
don't know, I'm afraid that you're misrepresenting me. You deleted the 
part of my post that made it clear that I was referring to our resident 
Unicode crank, JMF <wxjmfauth at gmail.com>.

> Until
> recently I didn't even know that the unicode in python 2.x is considered
> broken and that str in python 3.x is considered 'better'.

No need for scare quotes.

The unicode type in Python 2.x is less-good because:

- it is not the default string type (you have to prefix the string 
  with a u to get Unicode);

- it is missing some functionality, e.g. casefold;

- there are two distinct implementations, narrow builds and wide builds;

- wide builds take up to four times as much memory per string as needed;

- narrow builds take up to twice as much memory per string as needed;

- worse, narrow builds have very naive (possibly even "broken") 
  handling of code points in the supplementary planes, that is, 
  anything above U+FFFF.
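
To illustrate that last point: a narrow build stores text as UTF-16 code 
units, so a code point outside the BMP occupies two units (a surrogate 
pair), and len(), indexing and slicing all work on those units. The 
sketch below runs on Python 3.3+ and only simulates the narrow-build view 
by counting UTF-16 code units. The helper name utf16_units is made up for 
illustration.

```python
def utf16_units(text):
    """Count UTF-16 code units, which is what len() reports on a narrow build."""
    return len(text.encode("utf-16-le")) // 2

bmp = "Stra\u00dfe"       # 'Straße': every code point is inside the BMP
astral = "\U0001F40D"     # U+1F40D SNAKE: outside the BMP

bmp_units = utf16_units(bmp)        # one unit per code point
astral_units = utf16_units(astral)  # two units: a surrogate pair
astral_len = len(astral)            # Python 3.3+ counts code points instead
```

Slicing between the two halves of a surrogate pair is what produces the 
"broken" results on a narrow build.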

The unicode string type in Python 3 is better because:

- it is the default string type;

- it includes more functionality;

- starting in Python 3.3, it gets rid of the distinction between 
  narrow and wide builds;

- which reduces the memory overhead of strings by up to a factor 
  of four in many cases;

- and fixes the handling of supplementary-plane code points.
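
A rough sketch of two of these points, assuming CPython 3.3 or later. 
The helper bytes_per_char is hypothetical, and the sizes it reports are 
an implementation detail of CPython's flexible string representation, 
not something the language guarantees.

```python
import sys

# casefold() is available on str from Python 3.3; Python 2's unicode lacks it.
folded = "Stra\u00dfe".casefold()   # sharp s folds to 'ss'
lowered = "Stra\u00dfe".lower()     # lower() leaves the sharp s unchanged

def bytes_per_char(ch, n=100):
    """Estimate per-character storage by differencing two string sizes.

    Assumption: CPython 3.3+, where the fixed header cancels out in the
    subtraction. Other implementations may store strings differently.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

widths = {
    "ascii": bytes_per_char("a"),
    "latin1": bytes_per_char("\u00df"),
    "bmp": bytes_per_char("\u20ac"),
    "astral": bytes_per_char("\U0001F40D"),
}
```

A string only pays for the widest character it actually contains, which 
is where the saving over a wide build comes from.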

> I can say that having made a lot of reportlab work in both 2.7 & 3.3 I
> don't understand why the latter seems slower especially since we try to
> convert early to unicode/str as a desirable internal form. 

*shrug*

Who knows? Is it slower or does it only *seem* slower? Is the performance 
regression platform specific? Have you traded correctness for speed, that 
is, does the 2.7 version break when given astral characters on a narrow build?

Earlier in January, you commented in another thread that 

"I'm not sure if we have any non-bmp characters in the tests."

If you don't, you should have some.
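
One way to get that coverage is to keep a small set of sample strings 
that includes at least one supplementary-plane character and run each 
text-handling function over all of them. This is only a sketch, written 
for Python 3.3+. reverse_text is a hypothetical stand-in for the code 
under test and has nothing to do with reportlab.

```python
SAMPLES = [
    "plain ascii",
    "Stra\u00dfe",             # Latin-1 range
    "\u20ac100",               # BMP, outside Latin-1
    "snake \U0001F40D here",   # supplementary plane (non-BMP)
]

def has_non_bmp(text):
    """True if text contains a code point above U+FFFF (Python 3.3+ semantics)."""
    return any(ord(ch) > 0xFFFF for ch in text)

def reverse_text(text):
    """Hypothetical stand-in for the function under test."""
    return text[::-1]

def failing_samples(func, samples):
    """Samples that func fails to round-trip or turns into unencodable text."""
    bad = []
    for s in samples:
        out = func(s)
        try:
            out.encode("utf-8")   # on Python 3 a lone surrogate raises here
        except UnicodeEncodeError:
            bad.append(s)
            continue
        if func(out) != s:
            bad.append(s)
    return bad

covers_astral = any(has_non_bmp(s) for s in SAMPLES)
bad = failing_samples(reverse_text, SAMPLES)
```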

There are all sorts of reasons why your code might be slower under 3.3, 
including the possibility of a non-trivial performance regression. If you 
can demonstrate a test case with a significant slowdown for real-world 
code, I'm sure that a bug report will be treated seriously.
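
A reproducible timing is the usual starting point for such a report. The 
sketch below uses the standard timeit module and should run unchanged on 
both 2.7 and 3.3. The workload function is a made-up placeholder, not 
reportlab code, so the numbers it prints only illustrate the method.

```python
import sys
import timeit

def workload():
    """Placeholder workload (hypothetical): format and join some text."""
    parts = [u"Stra\u00dfe %d" % i for i in range(1000)]
    return u", ".join(parts)

def best_time(func, repeat=5, number=20):
    """Best-of-`repeat` wall time in seconds for `number` calls of func."""
    return min(timeit.repeat(func, repeat=repeat, number=number))

# Run the same file under each interpreter and compare the two lines.
print("%s %.4f s" % (sys.version.split()[0], best_time(workload)))
```

Taking the minimum over several repeats reduces noise from whatever else 
the machine is doing.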

> Probably I
> have some horrible error going on (e.g. one of the C extensions is working
> in 2.7 and not in 3.3).

Well that might explain a slowdown.

But really, one should expect that moving from single byte strings to up 
to four-byte strings will have *some* cost. It's exchanging functionality 
for time. The same thing happened years ago: people used to be extremely 
opposed to using floating point doubles instead of singles because of 
performance. And, I suppose it is true that back when 64K was considered 
a lot of memory, using eight whole bytes per floating point number (let 
alone ten like the IEEE Extended format) might have seemed the height of 
extravagance. But today we use doubles by default, and if singles would 
be a tiny bit faster, who wants to go back to the bad old days of single 
precision?

I believe the same applies to Unicode versus single-byte strings.

-- 
Steven