how to store great array?

Quinn Dunkan quinn at regurgitate.ugcs.caltech.edu
Wed Jan 31 17:21:45 EST 2001


On Tue, 30 Jan 2001 10:21:05 +0100, Thomas Wouters <thomas at xs4all.net> wrote:
>On Mon, Jan 29, 2001 at 08:13:58PM +0000, Quinn Dunkan wrote:
>
>> But python's lists are arrays.  Don't ask me why this confusing
>> terminology was chosen, probably because some Dutch people doing usability
>> tests thought 'list' sounded nicer for ABC a million years ago.
>
>I'm not sure where list comes from, but I'm sure glad they're called lists
>and not arrays! To C programmers, arrays mean 'lists of a single type'. To
>people who never heard of arrays (like newbie programmers), arrays mean
>'wtf?'.

That's true... the term 'list' is more likely to immediately be obvious to
someone completely new to programming.  Of course, 'array' should also be
intuitive for anyone who knows english resonably well (arrays of satellites,
bottles arrayed in a display case, etc.).  Whether or not that is a good thing
is debatable since computer list/arrays are different in many ways from
real-world list/arrays.  But I conceed that newbies might prefer 'list'.

But for anyone who knows any other language, or a newbie who then goes on to
learn another language, it causes confusion.  Or when talking with someone who
is familiar with another language (I had this problem before I knew python: I
talked with a pythoneer about 'lists' and we both were confused when thought
in cons cells and recursion and he thought in subscipting and iteration (and
we both, naturally, knew the other's approach was "wrong").

>        Only to people who judge a datatype by name instead of documentation
>*and* are used to a programming language where a list is defined as a linked
>list, might 'list' seem 'wrong'.

I don't know LambdaMOO, but in every single language I do know, a list is a
series of cons cells and an array is a glob of contiguous memory with O(1)
access.  It's true that languages use different words to mean different
things, and any attempt to come up with a "correct" defn for a term is likely
to be doomed.  But in a situation where there is a widely held convention, you
should only break it if the benefits outweigh the cost.  For example, if you
have a function that applies a function to a sequence and returns the
resulting sequence should probably be called 'map', or possibly 'collect' if
you come from smalltalk world.  Deciding 'transform' is a better name and
calling it that is probably not worth it unless you have a pretty solid
reason.  I don't think a small difference in newbie-friendly vocabulary is
good enough reason for a fundamental data structure.

Anyway, you know what a python list means, and I know what list means, and
everyone on this group knows what list means, and changing that now would be
even worse (it's worse to violate your own convention than someone elses), so
my whole arguement is irrelevant anyway :)  (as Tim would say, arguing about
imaginary newbies).

But when someone new to python is confused about whether x[2] is as fast as
x[5000], I think that perhaps the terminology has confused them.  And whenever
I want to describe the characteristics of lists vs. arrays to someone who only
really knows python, I have to be extra careful they don't get confused.

>> Dict access is also constant, as is usual for a hash table (internally, it
>> gets turned into simple array access).  But if your indices will be inteeger,
>> then a plain array aka list will have less overhead.
>
>Only if you have (near) sequential keys. If you have large gaps in your keys
>(say 60% or more 'unused' keys, but a dict expert like Tim can probably
>correct me :), you're better off with a dictionary -- you won't have to
>allocate quite as much room for it.

Yah, yah I made an assumption :)

>> Python experts could tell you the exact memory usage.  It will be something
>> like bytes_in_string+null + pointer + type_field + maybe_a_cached_hash +
>> interned_flag.  So with ~4 bytes overhead for each 5 byte string, you're
>> looking at 9*2_000_000 / 1024 / 1024 = 17MB for 2 million elements.
>
>Actually, a string object is defined like this (from Include/stringobject.h):

Argh, and I assumed a pointer is one byte :)  You see why I stick to garbage
collected languages :)



More information about the Python-list mailing list