Complexity of standard Python data structures (Re: Speed of string += string)

Mon Apr 14 14:37:19 EDT 2003

On Mon, 14 Apr 2003 14:15:12 GMT, Alex Martelli <aleax at aleax.it> wrote:
>Maybe I should clarify that I'm not _attacking_ your statement (nor
>you for making it, Marcus)

Good, it is far too easy to misinterpret each other. Apologies
if I have sounded a bit harsh.

>While performing optimizations may be premature, _understanding_
>what's going on can hardly be

Exactly.

>However, I still don't understand why this would be a problem
>specific to Python rather than any other language -- including
>C++ for those operations that are not part of that subset of C++'s
>standard library for which O()-behavior is specified in the standard.

Actually, I don't think this is specific to Python---I frequently
ask myself about the complexity of the code I'm programming, even in
C/C++. You asked this several times in your answer, so I'll just
answer here once: the memory and speed complexity of common operations
is in my opinion interesting in any language, and in the long
run, an absolute necessity to understand.

Also consider that Python is a relatively new language, at least it is
a language which evolves very fast. Thus, various issues are probably
unfinished, under development and it is uncertain how they act. Common
sense suggests to tread carefully, and learn what the language
actually does. Plus, I'm personally interested in computer languages.

As an example of "unfinished" interfaces: there is no list.clear()
function, even though dict.clear() exists. (Actually I'll send
python-dev@ a mail about this soon, when I get the time to dive into
the python source code).

Don't mistake any of this for python-bashing, it's _great_ that the
language developers are keen on discussing and implementing new things,
and as far as I see, there are a lot of very bright people there.
Perhaps I just was more interested in "how does it do it?" than
"what does it do?" :-)

>What's the O()-behavior of C+'s dynamic_cast<>, for example, and
>on what parameters does that performance behavior depend?  I do

Ah, but here's an interesting part: it probably doesn't matter, you're
probably going to use it only a few times. You do use data structures
and various algorithms much more, and that's why they matter. If I were
to continuously use dynamic_cast<> in lot's of loops, I'd probably check
its behaviour closer.

In a way, it's the same thing as said by higher-level language people:
optimize where it matters. I just draw the line of optimization to
include data structures and algorithms, and leave dynamic_cast<> out.

>not know, and I don't think I *CAN* know -- every implementation is
>free to do its best (or its worst...) to optimize dynamic_cast's
>performance, time-vs-space tradeoffs, whatever.

True, but sometimes there are only so few sane choices. People have
come to expect a certain behaviour from e.g. C/C++ compilers. I
_expect_ good performance from the C++ and the STL algorithms and
their corresponding implementation, since that's what it was designed
for. When seeing e.g. python's list.__getitem__() and list.insert(), I
got confused: either Python is doing something very very odd to get
both methods fast, or then a list is really just a vector and
list.insert() is O(N) --- something which I would've expected a
"warning" about.

But naturally, sometimes expectations bite you, no matter what the
language: http://dforce.sh.cvut.cz/~seli/en/linking2/

>So why did you perceive Python as being different in this respect?
>C doesn't specify the O()-behavior of, e.g., qsort (indeed it's
>often pretty horrible...!) -- C++ doesn't specify performance for

Exactly, but how did you know about qsort? Legacy. Previous
experience. Things that I, quite naturally, lacked regarding Python.

>such things as RTTI, exception handling, or even for that matter
>for the dispatch of virtual methods

And for the record, C++ is by far not clear to me in all respects:
e.g. I've recently become interested in exception handling in C++, and
wonder how fast the actual stack unwinding is, and how much "extra"
code it creates. (Not done any measurements) They are not however the
common path of execution, and so they are less interesting than
e.g. virtual method dispatching.

>But -- DO Python beginners typically start learning the language with
>a *REFERENCE* manual?  Probably not -- yet you explain that the time
>when you were left wondering was just when you STARTED learning.  So,
>perhaps, the document to retouch with a few small mentions of typical
>performance behavior (perhaps in a footnote or two) could be the
>tutorial, rather than either of the reference manuals (language and
>library).  Clearly, the tutorial is being descriptive (of one specific
>implementation it comes with) and not prescriptive (it does not give
>any constraints to what future implementations might choose), which
>may make it more acceptable to give such indications there.
>
>So what do you think -- is it the tutorial that such changes should
>be targeting...?

Um, no, perhaps people new to programming don't care about performance
at all and would be confused.

On the other hand, I've always wished for a document "language X for
people know language Y / programming in general / think they know". We
could skip all that "let's hold hands while I take you though the
mysteries of a dictionary" nonsense, and just concentrate on the issue
at hand: how does our knowledge from language X map to this new
language? What pitfalls do we have in the language at the moment
(e.g. python default arguments are in my humble opinion still
counter-intuitive) ?

>In "Python in a Nutshell", I took a more traditional approach --
>*assuming* that the target reader would want to understand the
>*functionality* of operations first, the *performance
>characteristics* of opeations only quite a bit later.  So, I
>have descriptions of fundamental functionality quite early, in
>chapter 4, with almost no mention of performance issues (except
>a very few ones, such as the fact that pow(a,b,c) returns the
>same result as (a**b)%c "but faster" -- or the fact that
>passing a comparison function to the sort method of a list

All this does sound very good for learning to program.

>"slows down the sort").  Only much later (in Chapter 17,
>"Testing, Debugging, and Optimizing") do I cover the big-O
>behavior of operations on lists, strings and dicts -- including
>the key facts, from which almost all others follow, that lists
>are internally implemented with vectors, dicts with hash tables.

Hmm. "Only much later...the key facts...from which almost all other
[big-O facts] follow". Is there a specific reason to hide the
key facts until much later? Why would they be confusing?

If somebody would say "python's list is a really a vector/array",
that's more or less the only documentation one will ever need,
help(list) tells the rest. The key characteristics are obvious from
that one single sentence, right?

>Do you think that this traditional approach (teach functionality
>first, performance issues only later) is fundamentally flawed?

No, people usually want to get things done, it doesn't matter how
fast or elegant it is. *However*, I don't consider *anybody*
to actually have learnt programming if they don't know that
- programming is about meeting requirements
- those requirements usually involve correctness to a certain
degree, _but_ also memory usage, speed, user interface functionality,
error handling, modularity, documentation, extendability.

So it's not just a functionality-performance relationship.

>That's basically what I see the "first thing I wondered about"
>comment as potentially implying, and what I'm trying to get a
>deeper understanding of, particularly when you seem to single
>out Python wrt other languages in this respect.  Of course, in

This was not my intention: I would have the same concern for
any language :-)

Regarding the implication, I really couldn't say.

>the Nutshell I was constrained by other considerations too (it's
>primarily a *reference* work, only secondarily intended to be
>read from the start onwards -- having decided to devote a part
>of a chapter to optimization, it seemed to me that optimization
>related information was best grouped there, rather than spread
>around -- and _repeating_ information would go against the grain
>of what's meant above all to be a *concise* reference...).  But,
>that book may well not be the last time I have occasion to write
>about Python (or anyway teach it...) -- so, I think I need to
>understand these issues more deeply.

Right, that can't be easy. As you said, repeating is almost out of the
question, but on the other hand, sometimes you have to see problems
from different aspects, and you'd still like to repeat information...

>in the cache.  (Incidentally, cache behavior and other such issues of
>memory hierarchies are by far the most complicated performance aspects
>to understand, analyze or predict -- and yet far too often they can at
>the same time be _dominating_ the performance behavior of your program...
>and no language spec is going to be very helpful there;-).

Indeed.

"Almost all programming can be viewed as an exercise in caching"
  --Terje Mathisen

Hopefully this hasn't turned into incoherent rambling, cheers,
Marcus

-- 
Marcus Alanen
maalanen at abo.fi