Py 3.3, unicode / upper()

Wed Dec 19 10:40:43 EST 2012

On Thu, Dec 20, 2012 at 2:18 AM, Johannes Bauer <dfnsonfsduifb at gmx.de> wrote:
> On 19.12.2012 15:23, wxjmfauth at gmail.com wrote:
>> I was using the German word "Straße" (Strasse) — German
>> translation from "street" — to illustrate the catastrophic and
>> completely wrong-by-design Unicode handling in Py3.3, this
>> time from a memory point of view (not speed):
>>
>>>>> sys.getsizeof('Straße')
>> 43
>>>>> sys.getsizeof('STRAẞE')
>> 50
>>
>> instead of a sane (Py3.2)
>>
>>>>> sys.getsizeof('Straße')
>> 42
>>>>> sys.getsizeof('STRAẞE')
>> 42
>
> How do those arbitrary numbers prove anything at all? Why do you draw
> the conclusion that it's broken by design? What do you expect? You're
> very vague here. Just to show how ridiculously pointless your numers
> are, your example gives 84 on Python3.2 for any input of yours.

You may not be familiar with jmf. He's one of our resident trolls, and
he has a bee in his bonnet about PEP 393 strings, on the basis that
they take up more space in memory than a narrow build of Python 3.2
would, for a string with lots of BMP characters and one non-BMP. In
3.2 narrow builds, strings were stored in UTF-16, with *surrogate
pairs* for non-BMP characters. This means that len() counts them
twice, as does string indexing/slicing. That's a major bug, especially
as your Python code will do different things on different platforms -
most Linux builds of 3.2 are "wide" builds, storing characters in four
bytes each.

PEP 393 brings wide build semantics to all Pythons, while achieving
memory savings better than a narrow build can (with PEP 393 strings,
any all-ASCII or all-Latin-1 strings will be stored one byte per
character). Every now and then, though, jmf points out *yet again*
that his beloved and buggy narrow build consumes less memory and runs
faster than the oh so terrible 3.3 on some contrived example. It gets
rather tiresome.

Interestingly, IDLE on my Windows box can't handle the bolded
characters very well...

>>> s="\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, \U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!"
>>> print(s)
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    print(s)
UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001d407'
in position 0: Non-BMP character not supported in Tk

I think this is most likely a case of "yeah, Windows XP just sucks".
But I have no reason or inclination to get myself a newer Windows to
find out if it's any different.

ChrisA