The Cost of Dynamism (was Re: Python 2.x or 3.x, which is faster?)

Steven D'Aprano steve at pearwood.info
Wed Mar 23 19:55:40 EDT 2016


On Thu, 24 Mar 2016 03:24 am, Random832 wrote:

> On Wed, Mar 23, 2016, at 12:08, Mark Lawrence wrote:
>> > And doing it 'Pythonically' can lead to suggestions such as the
>> > following the other day:
>> >
>> >   c, psource = psource[0], psource[1:]
>> >
>> > (where psource is a very long string), which even I could tell, from
>> > knowing what goes on behind the scenes, wasn't going to work well
>> > (duplicating the rest of the string roughly every other character).
>> >
>> 
>> It would work perfectly. How would it duplicate the rest of the string
>> roughly every other character?
> 
> Er, I think he's suggesting that this would be in an inner loop
> (something like while psource: c, psource = psource[0], psource[1:]).
> What I'm not sure of is why he thinks this is pythonic.

Because somebody here (Dennis) criticised his earlier code for passing "your
entire source string along with the character from it to the function", and
suggested splitting the string into its head and tail instead. Dennis' code
started:

        while psource:
                c, psource = psource[0], psource[1:]
                lxsymbol = disptable[min(ord(c), 256)](c, psource)


But one positive: this conclusively proves that "Pythonic" is in the eye of
the beholder. Dennis thinks that c, psource = psource[0], psource[1:] is
reasonable Python code; you think it's not ("I'm not sure why [Bart] thinks
this is pythonic").

Pythonic or not, Bart is correct to be concerned about the performance,
since this ends up copying the string over and over again. If we start with
the string "aardvark", say, it gets split:

"a", "ardvark"
"a", "rdvark"
"r", "dvark"
"d", "vark"
"v", "ark"
"a", "rk"
"r", "k"
"k", ""

each of which involves creating two string objects. Ignoring the effects of
string interning (caching), an 8-character string gets split up into:

8 x 1 character strings;
1 x 7 character string;
1 x 6 character string;
1 x 5 character string;
1 x 4 character string;
1 x 3 character string;
1 x 2 character string;
1 x 1 character string;
1 x 0 character string;

thus creating 16 string objects in total, and copying 8+7+6+5+4+3+2+1+0
characters, or 36 characters in total.
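You can confirm both counts mechanically by instrumenting the split loop (a quick sanity check, not anybody's production code):

```python
# Count the string objects created and the characters copied when
# "aardvark" is demolished one head/tail split at a time.
s = "aardvark"
objects = 0
chars_copied = 0
while s:
    head, tail = s[0], s[1:]
    objects += 2                      # the 1-char head and the new tail
    chars_copied += len(head) + len(tail)
    s = tail

print(objects, chars_copied)  # -> 16 36
```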

For an N character string, we have:

N x 1 character strings;
1 x N-1 character string;
1 x N-2 character string;
1 x N-3 character string;
...
1 x 2 character string;
1 x 1 character string;
1 x 0 character string;

giving a total of sum(range(N+1)) which is N*(N+1)/2. In big-O notation,
this is quadratic behaviour: O(N**2), and Bart is right to be concerned.
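A few lines of Python show the quadratic growth directly: doubling N roughly quadruples the number of characters copied.

```python
# Characters copied for an N-character string under repeated head/tail
# splitting: sum(range(N + 1)), which equals N*(N+1)//2.
def chars_copied(n):
    return sum(range(n + 1))

for n in (8, 1000, 2000, 4000):
    # The closed form and the sum agree; the totals scale as ~N**2.
    print(n, chars_copied(n), n * (n + 1) // 2)
```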

This same sort of thing is precisely why building up a string by repeated
string concatenation can be horribly slow. This is the same, only in
reverse: we're demolishing the string, one character at a time.

Now, there are practical reasons why this may not be *quite* as bad as Bart
fears:

- caching of small strings may mean that we're more like O( (N-1)**2 ),
which is (barely) better;

- the coefficient is in our favour (1/2);

- string copying may be quite efficient (although not as efficient as in C,
where it can be implemented by just a memcpy, more or less);

- the strings are getting smaller all the time, which makes memory
management easier, when compared to having to allocate growing strings;

- although against that, perhaps this ends up fragmenting memory even more?

- it *may* turn out that the cost of the rest of the work done dwarfs the
cost of copying the string in the first place

- but then again it might not.
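For comparison, the quadratic copying disappears entirely if the loop walks the string directly instead of re-slicing the tail each time. A sketch of the two approaches side by side (neither is Bart's or Dennis's actual code):

```python
def chars_by_slicing(source):
    # Quadratic: each iteration copies the remaining tail.
    out = []
    while source:
        c, source = source[0], source[1:]
        out.append(c)
    return out

def chars_by_iteration(source):
    # Linear: Python hands us each character directly; no tail copies.
    return list(source)

s = "aardvark"
assert chars_by_slicing(s) == chars_by_iteration(s)
print(chars_by_iteration(s))
# -> ['a', 'a', 'r', 'd', 'v', 'a', 'r', 'k']
```

The catch, of course, is that the original dispatch function wants the tail passed along; an index into the unchanged source string gives handlers the same information without any copying.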



-- 
Steven




More information about the Python-list mailing list