[Tutor] Clarification questions about how Python uses references.

boB Stepp robertvstepp at gmail.com
Thu Jun 24 13:36:33 EDT 2021


On Thu, Jun 24, 2021 at 3:10 AM Cameron Simpson <cs at cskk.id.au> wrote:
>
> On 23Jun2021 23:13, boB Stepp <robertvstepp at gmail.com> wrote:
> >Hashes/hashing have not made it into my education yet.  I am hoping
> >that this intro CSc book I am reading will eventually address this.
> >Plus it seems that there are other contexts/meanings as in things like
> >checksums using MD5 hashing or some such, et al.
>
> They're related terms.

> A hash function takes a source value (eg a string such as you might use
> to index a dict) or, really, anything else, and computes a value,
> normally an integer. It is always the same for a given source value.

Then I suppose "MD5" and similar terms refer to the algorithm being
employed to come up with this mapping of values.

> The usual objective is that the various source values are evenly
> distributed over some range.
>
> In the case of a "hash table", this is what gives dicts and sets O(1)
> lookup times. When you store things in a hash table, the table has an
> array of "slots" where things get stored according to their key.
> You compute a hash value which dicates the slot a particular item is
> stored in. Then to see if an item is present, you only need to to look
> at the items stores in that particular slot, not at every item in the
> table as a whole.
>
> For maximum speed the number of slots is around the number of items
> stored, so that the number of items in a slot is around 1. Obviously
> that depends a bit on the source values, and this is why hash functions
> are chosen to produce evenly distributes values - to get a fairly even
> spread of slot usage regardless of the source values.

So if it should happen that more than one value can get assigned to a
slot, this is what "hash collision" is referring to?  I vaguely recall
reading concerns over potential hashing collision issues.  I don't
recall what hashing algorithm that reading was talking about, but it
seems it was a potential serious security concern.

> Imagine you're storing strings. You might compute a hash function by
> summing the ordinals of the characters in the string. Then take that
> modulo the number of slots in the hash table. Then your hash function
> computes an index into the table.
>
> This function is actually not a great one, because string values are
> often short and often have very skewed character frequencies. But it
> illustrates the idea: you're computing a number from the source value
> (the string) - it is always the same for a given source value.
>
> Suppose "abcdef" results in 6. The if you stored something keyed by
> "abcdef" it would be kept in slot 6, and to find thing thing keyed by
> "abcdef" you only need to look in slot 6.

This makes a whole lot more sense now.  Thanks for taking the time for
this brief intro!

boB Stepp


More information about the Tutor mailing list