Speed revisited

Sun Jan 9 13:53:04 EST 2005

On 8 Jan 2005 18:25:56 -0800, "John Machin" <sjmachin at lexicon.net>
wrote:

>> Both versions use local variables, etc. Both have their
>> lists initially sorted. Both essentially use a loop with
>> conditional for comparison,
>> then process the record in the
>> same way.
>
>"process the record in the same way"??? That's an interesting use of
>"same".

Meaning when the records with the same 'english' key is found, just
write the comparison result to the result file.

>> The overhead of second version is that it also
>> uses cmp() function and two additional integer
>> variables - that should not slow the program _so much_.

>> I have measured the run of snippet 2 with time checkpoints
>> written to a log (write time delta to log every 200 records),
>> even made a graph of time deltas in spreadsheet and in fact
>> snippet 2 seems after initial slowdown looks exactly linear,
>> like  that:
>>
>> ^ (time)
>> |
>> |  /-----------
>> | /
>> |/
>> ---------------> (# of records written)
>>
>> So yes, it would scale to big files.
>
>On your empirical evidence, as presented. However, do read on ...

>>However, why is it so
>> frigging slow?!

>Mainly, because you are (unnecessarily) deleting the first item of a
>list. This requires copying the remaining items. It is O(N), not O(1).
>You are doing this O(N) times, so the overall result is O(N**2). Your
>graph has no obvious explanation; after how many cycles does the speed
>become constant?

After some 2000 records. I was "tracking" it with printouts of
variables and initially the main loop was taking advantage of 
records being sorted, with the first condition (with keys
being equal) quickly writing the records into a file, and the 
further into the dataset, the greater the distance was between 
the beginning of the list where the record was sought to the
appropriate record, i.e. either o or n were becoming steadily 
higher. This could  explain linearity further down the path: the 
list is becoming gradually shorter, so the deletion times 
were decreasing linearly, but in the meantime the search time 
was linearly increasing. 

>Secondly, you are calling cmp() up to THREE times when once is enough.
>Didn't it occur to you that your last elif needed an else to finish it
>off, and the only possible action for the else suite was "assert
>False"?

Sure, but I was trying to make it shorter and absolutely clear
what I mean in this place (when people see the comparison in every
place, they see immediately what it was, they don't have to
recall the variable). Obviously I'd optimise it in practice. It
was supposed to be "proof of concept" rather than any working
code.

BTW, this particular example didn't work out, but I still found 
Python to be the most convenient language for prototyping, I 
wrote smth like five versions of this thing, just to learn
dictionaries, lists, sets, etc. In no other language I'd do
it so quickly.

>It would appear after reading your "snippet 2" a couple of times that
>you are trying to implement the old 3-tape update method.

Well, ahem, I admit you got me curious just how it would work
with tapes (never used them), so I was sort of  trying to simulate
that - it's just a bit weird undertaking, I did it rather to explore
the issue and try to learn smth rather than to get working code. 

Deleting the first item from a list was to be a convenient 
equivalent of forwarding the  reading head on the tape one 
record ahead, because I assumed that any deletion from the 
list was to take more or less the same time, just like reading 
heads on tapes were probably reading the records with similar 
speeds regardless of what length of tape was already read.

>It would also appear that you are assuming/hoping that there are never
>more than one instance of a phrase in either list.

Sure. Because using advice of Skip Montanaro I initially used sets 
to eliminate duplicates. It's just not shown for brevity. If the
keys are guaranteed to be unique, it makes it easier to think
about the algorithm. 

>You need something a little more like the following.

>Note that in your snippet2 it was not very obvious what you want to do
>in the case where a phrase is in "new" but not in "old", and vice versa
>-- under one circumstance (you haven't met "end of file") you do
>nothing but in the the other circumstance you do something but seem to
>have not only a polarity problem but also a copy-paste-edit problem.

What exactly is a "polarity problem"?

I concede that the code is rather contrived, but I didn't bother
to do smth like writing classes or functions that would simulate
a tape reader, so the result is rather ugly. I only posted it 
because I could not explain why it were so slow. 

> In
>the following code I have abstracted the real requirements as
>handle_XXX_unmatched()

Thanks, I'll have to try that out.

--
It's a man's life in a Python Programming Association.