psss...I want to move from Perl to Python

Chris Angelico rosuav at gmail.com
Sun Jan 31 18:48:35 EST 2016


On Mon, Feb 1, 2016 at 9:34 AM, Fillmore <fillmore_remove at hotmail.com> wrote:
> On 01/30/2016 05:26 AM, wxjmfauth at gmail.com wrote:
>
>>> Python 2 vs python 3 is anything but "solved".
>>
>> Python 3.5.1 is still suffering from the same buggy
>> behaviour as in Python 3.0 .
>
> Can you elaborate?

This is jmf. His posts are suppressed from the mailing list, because
the only thing he ever says is that Python 3's "Unicode by default"
behaviour is fundamentally and mathematically wrong, on the basis of
microbenchmarks showing a performance regression compared to his
beloved - and buggy - narrow build of Python 2.7. (I'm not certain,
but I think the regression might even have been fixed now. Or maybe he
has other regressions to moan about.)

Here's a facts-only summary of Unicode handling in several different
CPython [0] builds.

* Python 2.7 comes in two flavours, selected at compile time. A "Wide"
build is the default on Unix-like platforms, and it uses 32-bit
Unicode characters. In other words, the string b"abc" takes up three
bytes, but the string u"abc" takes up twelve. [1] These builds are
perfectly consistent; a Unicode character *always* takes exactly 4
bytes, and indexing and slicing are perfectly correct.

* A "Narrow" build of Python 2.7 (the default on Windows) uses 16-bit
Unicode characters. The string b"abc" still takes up three bytes, but
u"abc" takes only six - however, the same string with three astral
characters would take up twelve bytes. These builds are thus
inconsistent, but potentially more efficient - a thousand BMP
characters followed by a single SMP character would take up only 2004
bytes, rather than 4004 as a wide build would use (the sketch after
this list reproduces that arithmetic).

* Starting with Python 3.0, a plain, unprefixed string literal is a
Unicode string. That doesn't change anything about these
considerations, but it does mean that "abc" suddenly takes up a lot
more room than it used to (because it's now equivalent to u"abc"
rather than b"abc").

* Python 3.3 introduced a new "Flexible String Representation", which
you can read about in detail in PEP 393. Strings are now stored as
compactly as possible; u"Hello!" (all ASCII) takes up six bytes,
u"¡Hola!" (Latin-1) also takes up six bytes, u"Привет" (Basic
Multilingual Plane) takes up twelve, and u"Hi! 😀😁" (or u"Hi!
\U0001f600\U0001f601" if your mailer doesn't have those characters)
takes up twenty-four. Each string has a length of 6, as given by
len(x), but takes up differing amounts of space according to actual
needs.
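
If you want to poke at these numbers yourself, here's a rough sketch,
assuming CPython (the sys.maxunicode check works on 2.7 as well; the
getsizeof() differencing needs 3.3+, and the exact figures include the
per-object overhead mentioned in [1], which the differencing cancels
out):

import sys

# 0x10ffff on a wide 2.7 build (and on any 3.3+); 0xffff on a narrow 2.7 build.
print(hex(sys.maxunicode))

def bytes_per_char(ch):
    # Difference two lengths of the same repeated character so that the
    # fixed object overhead cancels, leaving the per-character cost.
    return (sys.getsizeof(ch * 2000) - sys.getsizeof(ch * 1000)) // 1000

for ch in u'a', u'\xa1', u'\u041f', u'\U0001f600':
    print(repr(ch), bytes_per_char(ch))   # 1, 1, 2, 4 on a 3.3+ build

# The narrow-build arithmetic from above: a narrow 2.7 build stores UTF-16
# code units, so counting them reproduces the 2004-byte figure (a thousand
# BMP characters at 2 bytes each, plus one astral surrogate pair at 4).
s = u'\u041f' * 1000 + u'\U0001f600'
print(len(s.encode('utf-16-le')))   # 2004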


The issue jmf has is with the way the FSR has to "widen" a string. If
you take a megabyte of all-ASCII text (stored one byte per character)
and append one astral character to it, the resulting string has to be
stored four bytes per character, even for the ASCII ones. This is to
make sure that indexing and slicing work correctly and efficiently,
but it does come at a cost - it takes time to copy all those
characters into the new wider string. On microbenchmarks doing exactly
this, it's clear that Python 3 is paying a price. But has it truly
suffered?

rosuav at sikorsky:~$ python -m timeit -s "s=u'a'*1048576" "len(s+u'\U0001f600')"
10000 loops, best of 3: 197 usec per loop
rosuav at sikorsky:~$ python3 -m timeit -s "s=u'a'*1048576" "len(s+u'\U0001f600')"
10000 loops, best of 3: 148 usec per loop
rosuav at sikorsky:~$ python -m timeit -s "s=u'a'*1048576" "len(s+u'b')"
10000 loops, best of 3: 187 usec per loop
rosuav at sikorsky:~$ python3 -m timeit -s "s=u'a'*1048576" "len(s+u'b')"
10000 loops, best of 3: 31.6 usec per loop
rosuav at sikorsky:~$ python -c 'import sys; print(sys.version)'
2.7.11 (default, Jan 11 2016, 21:04:40)
[GCC 5.3.1 20160101]
rosuav at sikorsky:~$ python3 -c 'import sys; print(sys.version)'
3.6.0a0 (default:5452e4b5c007, Feb  1 2016, 07:28:50)
[GCC 5.3.1 20160121]
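
For the memory side of the same operation, a quick check (3.3+ only,
and only approximate - getsizeof() includes the object overhead from
[1]):

import sys

ascii_meg = u'a' * 1048576
widened = ascii_meg + u'\U0001f600'

print(sys.getsizeof(ascii_meg))   # roughly 1 MB: one byte per character
print(sys.getsizeof(widened))     # roughly 4 MB: every character widened to four bytes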

The other consideration is that, *on Windows only*, this operation
takes more memory under 3.6 than under 2.7, because 2.7 will keep
storing the 'a' in 16 bits and then just slap a two-code-unit smiley
to the end; but on the flip side, 3.6 has been storing that all-ASCII
string in *8* bits per character. Most of your programs will be full
of ASCII strings - remember, all your variable names are string keys
into some dictionary [2], and every time you call up a built-in
function or standard library module, you'll be using an ASCII-only
name to reference it. Halving their storage space makes a significant
difference; and doubling the size of a very few strings in a very few
programs is worth the correctness we gain by not having to worry about
string index bugs.
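
As a tiny illustration of that "names are string keys" point, on any
CPython 3.x (globals() and vars() just show you dicts that already
exist):

import math

greeting = 'hello'
print('greeting' in globals())       # True - the module-level name is a dict key
print('sqrt' in vars(math))          # True - module attributes live in a str-keyed dict
print(type(next(iter(globals()))))   # <class 'str'>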

So in summary: Take no notice of jmf; he's a crank.

ChrisA

[0] Other Python implementations may be very different, but it's
CPython that most people are looking at.
[1] If you use sys.getsizeof() on these strings, you'll find that they
actually take up a lot more space than I'm talking about. That's
because there's a fixed overhead on every string object, which dominates
tiny strings. But for large strings, where the performance difference
actually matters, the storage space of the characters themselves
dominates the overhead.
[2] Local names in functions might get compiled out and replaced with
numeric slot indices. But module-level names, names of built-ins,
attribute names, etc, are all stored in the code as actual strings.
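
And if you want to see footnote [2] in action (bytecode details shift
a little between CPython versions, but the shape is the same):

import dis

def f():
    x = len('abc')   # 'x' is a local, 'len' is a global name
    return x

dis.dis(f)                      # LOAD_GLOBAL for len, STORE_FAST/LOAD_FAST for x
print(f.__code__.co_names)      # ('len',) - stored in the code object as a real string
print(f.__code__.co_varnames)   # ('x',)   - locals become numbered slots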


