Py 3.3, unicode / upper()

Ian Kelly ian.g.kelly at gmail.com
Wed Dec 19 13:27:38 EST 2012


On Wed, Dec 19, 2012 at 8:40 AM, Chris Angelico <rosuav at gmail.com> wrote:
> You may not be familiar with jmf. He's one of our resident trolls, and
> he has a bee in his bonnet about PEP 393 strings, on the basis that
> they take up more space in memory than a narrow build of Python 3.2
> would, for a string with lots of BMP characters and one non-BMP. In
> 3.2 narrow builds, strings were stored in UTF-16, with *surrogate
> pairs* for non-BMP characters. This means that len() counts them
> twice, as does string indexing/slicing. That's a major bug, especially
> as your Python code will do different things on different platforms -
> most Linux builds of 3.2 are "wide" builds, storing characters in four
> bytes each.
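
To make the quoted point about surrogate pairs concrete, here is a small sketch. It assumes Python 3.3 or later, where len() counts code points; the helper utf16_code_units is a hypothetical name (not a stdlib function) that approximates what a 3.2 narrow build's len() would have reported by counting UTF-16 code units.

```python
def utf16_code_units(s):
    """Count the UTF-16 code units needed to store s.

    Hypothetical helper: a narrow build stored strings as UTF-16,
    so its len() effectively returned this number.
    """
    return len(s.encode("utf-16-le")) // 2


s = "abc\U0001F600"  # three BMP characters plus one non-BMP character

code_points = len(s)              # what PEP 393 strings report
narrow_len = utf16_code_units(s)  # what a narrow build would have reported

print(code_points, narrow_len)
```

The non-BMP character accounts for the difference between the two numbers, because it needs a surrogate pair in UTF-16.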

From what I've been able to discern, his actual complaint about PEP
393 stems from misguided moral concerns.  With PEP 393, strings that
can be fully represented in Latin-1 can be stored in half the space
(ignoring fixed overhead) compared to strings containing at least one
non-Latin-1 character.  jmf thinks this optimization is unfair to
non-English users and immoral; he wants Latin-1 strings to be treated
exactly like non-Latin-1 strings (I don't think he actually cares
about non-BMP strings at all; if narrow-build Unicode is good enough
for him, then it must be good enough for everybody).  Unfortunately
for him, the Latin-1 optimization is rather trivial in the wider
context of PEP 393, and simply removing that part alone clearly
wouldn't be doing anybody any favors.  So for him to get what he
wants, the entire PEP has to go.
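
The storage difference is easy to measure. The following is a rough sketch, assuming CPython 3.3 or later (sys.getsizeof reports implementation-specific sizes); bytes_per_char is a hypothetical helper that cancels out the fixed overhead by comparing two strings of different lengths.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Marginal storage cost per character of a string made only of ch.

    Subtracting the sizes of two strings of different lengths cancels
    the fixed per-object overhead.
    """
    small = ch * n
    large = ch * (2 * n)
    return (sys.getsizeof(large) - sys.getsizeof(small)) // n


latin1 = bytes_per_char("\xe9")        # U+00E9, representable in Latin-1
bmp = bytes_per_char("\u20ac")         # U+20AC, in the BMP but not in Latin-1
astral = bytes_per_char("\U0001F600")  # outside the BMP

print(latin1, bmp, astral)
```

On CPython 3.3+ this should show one byte per character for the Latin-1 string, two for the BMP string, and four for the non-BMP string, which is the "half the space" difference described above.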

It's rather like trying to solve the problem of wealth disparity by
forcing everyone to dump their excess wealth into the ocean.


