hashing strings to integers

Chris Angelico rosuav at gmail.com
Tue May 27 15:16:24 EDT 2014


On Wed, May 28, 2014 at 3:02 AM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> But I know that Python is a high-level language with
> lots of high-level data structures like dicts which trade-off time and
> memory for programmer convenience, and that I'd want to see some real
> benchmarks proving that my application was too slow before giving up that
> convenience with a complicated strategy like this.

And they trade off that time and memory on the basis of X years of
development expertise. A while ago (about ten years or so, now... wow,
that's quite a while) I wrote a C++ program that needed an
ever-growing array; for simplicity, I went with a very basic system of
doubling the size every time, from a base of something like 1024 or
8192. (Note that it was storing and moving around only pointers, so
it's comparable to Python's list.) That means it has an average 25%
slack space at all times, more until its first allocation, and every
now and then it has a huge job of copying a pile of pointers into a
new array. (Imagine it's now at 16777216 and it needs to add the
16,777,217th string to the array. Bye-bye CPU caches.) These
boundaries became *user-visible pauses*, fortunately at predictable
points, but on a not-terrible computer it could cause a >1s pause just
copying heaps of pointers. How do you think a Python list will
perform, under the same workload (periodic appending of single strings
or small groups of strings, very frequent retrieval based on index -
probably about a 20:1 read:write ratio)? Not only would it be far more
convenient, it's probably going to outperform my hand-rolled code -
unless I'm so brilliant (or lucky) that I can stumble to something as
good as can be achieved with years of dedicated development. Yeah, I
don't think so.

Same goes for hashing algorithms. I can at least boast that I've never
written one of those... although I have several times written a binary
tree of one form or another, in order to provide comparable features
to a dict. And once again, now that I know how convenient and
performant high level languages can be (which may or may not have been
true 15 years ago, but I certainly didn't know it), I don't rewrite
what can be done better by just tossing in a couple of curly braces
and saying "here, map this to that".

ChrisA



More information about the Python-list mailing list