[Tutor] A couple of somewhat esoteric questions

Steven D'Aprano steve at pearwood.info
Wed Oct 22 13:40:16 CEST 2014


On Tue, Oct 21, 2014 at 04:54:33PM -0700, Clayton Kirkwood wrote:

> As I've contemplated the usage of dictionaries, I face the question of
> efficiency. Going back before most of you were probably born;<)) if I
> remember correctly dictionaries(assoc. arrays), having hashes, are efficient
> for storing sparse arrays with the added benefit of hiding the traversal of
> the dictionary looking for the proper key, and being much less efficient if
> looking for the value and trying to retrieve its key. But it also depends on
> the hash key and algorithm and how many spaces are "available" for filling.
> If a hash is created for a small array, which is known ahead of time with
> some possible hints, the array is efficient but has the distinct difficulty
> of use when more and more pairs are created, because either the new pair
> matches an already used hash point and a trail must be descended off of that
> particular hash point or a mechanism must be used to determine how the new
> pair should be stored elsewhere in the array like at the end and matching
> becomes time consuming. 

This is standard for hash tables. Python, if I recall correctly, uses 
linear addressing rather than chaining.

Python's hash function can return approximately 4294967296 different 
values (2**32), which means that for random or arbitrary keys, you won't 
expect more than a tiny handful of collisions. Of course a hostile 
adversary might, sometimes, be able to feed you keys that all hash the 
same, but in practice that's very unlikely to happen. And even if it 
does happen, the worst that will occur is that dict look-ups will 
degenerate to being proportional to the number of items, which isn't too 
bad.

> And/or, a new hash key and perhaps a different
> algorithm must be used for creating the larger array. 

Python always uses the same hash algorithm. It doesn't try to change 
algorithms as the table gets bigger, it just scales the hash value to 
the size of the table. As the table gets full, the table is doubled in 
size to make room.

(To be pedantic, I'm talking here about the standard CPython reference 
implementation. There are other implementations, such as Jython and 
IronPython and PyPy, which may do other things.)


> Also, since the array
> allows for multiple keys pointing to possibly identical values, the array is
> not necessarily 1 to 1 (could be 1 to multiple although we use replacement
> of the value normally) and definitely cannot be onto: there is no rule that
> says the mapping from value back to key is singular and unique.

Yes, but I don't see your point. Hash tables are designed to be 
one-to-many.

If you want to understand the *precise* details of how Python dicts 
work, you will need to read the source code written in C. But as an over 
simplification:

- when you create an empty dict, it is created with a certain number 
  of free slots in a table;
- as you add items to the dict, the dict knows how many free slots
  remain;
- when the table is half full, the dict resizes; very small dicts
  are increased to four times bigger, while larger dicts are 
  doubled in size;
- that means that *on average* adding new items to a dict takes
  the same amount of time (amortized over the lifetime of the
  dict).


> I've not had much use of dictionaries in the past so maybe I am missing
> something. As I contemplated dictionary use, it seems changing the order of
> entries is best left to double single arrays - which can be modified
> together when changing order. Dicts don't seem to support moving pairs
> around and in fact would be broken if changes were attempted.

I don't understand what you are trying to say here. Hash tables are 
unordered, and "changing the order of entries" makes no sense for them. 
Can you give a concrete example of what you're trying to do?


> Kind of like
> using a real dictionary, you can have a word (key) have multiple meanings
> (values),

In Python, the way to do that is by associating a list of meanings with 
the one key:

d = {'key': ['first', 'second', 'third'],
     'another key': ['alpha', 'beta']}


> and you can have multiple words (keys) having the same meaning
> (value). Again, with the difference from a real dictionary, our dicts can't
> have keys pointing to different values. 

They certainly can.


> But there are definitely uses for dicts.

One or two, one or two. Hundred. Thousand.

Seriously, dicts are probably the most commonly used data structure in 
Python, by far. They are used for modules, classes, instances, global 
variables are stored inside a dict, built-in functions are stored inside 
a dict, modules are cached inside a dict. They are efficient and fast 
and highly tuned, and there is almost never any need to try to re-invent 
the wheel.


-- 
Steven


More information about the Tutor mailing list