Determining number of dict key collisions in a dictionary

python at bdurham.com
Tue Dec 2 23:26:54 EST 2008


Roger,

Apologies for responding directly to your email vs. the list (just
realized this now).

Thank you very much for your detailed answers and the reminder about
premature optimization.

You have answered all my questions.

Regards,

Malcolm


----- Original message -----
From: "Roger Binns" <rogerb at rogerbinns.com>
To: python at bdurham.com
Date: Tue, 02 Dec 2008 19:29:05 -0800
Subject: Re: Determining number of dict key collisions in a dictionary


python at bdurham.com wrote:
> Any ideas on why the Windows 64-bit version of Python has 32-bit vs.
> 64-bit longs? (Is this the case for both the 2.6 and 3.0 64-bit versions
> of Python for Windows?)

It is part of the ABI for the platform.

http://www.unix.org/version2/whatsnew/lp64_wp.html
http://en.wikipedia.org/wiki/64-bit#64-bit_data_models
http://blogs.msdn.com/oldnewthing/archive/2005/01/31/363790.aspx
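
You can also see the data model from inside Python itself. A minimal
sketch (struct sizes follow the C compiler that built the interpreter):

    import struct

    # On LLP64 (64-bit Windows) a C long is still 4 bytes; on LP64
    # (64-bit Linux/Unix) it is 8.  Pointers are 8 bytes either way.
    print(struct.calcsize('l'))  # 4 on LLP64, 8 on LP64
    print(struct.calcsize('P'))  # 8 on any 64-bit build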

> If this is the case, then it sounds like Linux/Unix are much better
> Python hosts than Windows when it comes to developing 64-bit Python
> applications - especially when it comes to working with large
> dictionaries (your point about hash collisions notwithstanding)?

Python uses a type Py_ssize_t, which is typedef'ed to 64 bits on a
64-bit platform and 32 bits on a 32-bit platform, for things like sizes
(e.g. the length of a string or the number of items in a container), so
there won't be any programmatic or correctness issues.
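
As an illustration (assuming Python 2.6 or later, where sys.maxsize
exposes the largest Py_ssize_t):

    import sys

    # sys.maxsize reflects Py_ssize_t: 2**63 - 1 on a 64-bit build,
    # 2**31 - 1 on a 32-bit one, independent of the size of a C long,
    # so a 64-bit Windows build can still index huge containers.
    print(sys.maxsize == 2**63 - 1)  # True on any 64-bit build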

You never said what your idea of large was in the original posting. For
example, you really only need to start worrying about hash collisions on
LP64 if you have more than 4 billion items, and even then things will
still work.
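
A back-of-the-envelope birthday estimate makes the scale concrete (a
sketch assuming uniformly distributed hash values, not CPython's
actual open-addressing scheme):

    def expected_colliding_pairs(n, bits):
        # expected pairs sharing a hash value when n items are hashed
        # uniformly into 2**bits values: n*(n-1)/2 / 2**bits
        return n * (n - 1) / 2.0 / 2 ** bits

    print(expected_colliding_pairs(4 * 10**9, 32))  # ~1.9e9 pairs
    print(expected_colliding_pairs(4 * 10**9, 64))  # ~0.4 pairs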

For people who have really large data sets, it is usually beneficial to
write their own custom code, often resorting to assembly. This is
because the data set is so much larger than the caches that saving even
a few instructions per item pays off, given how many items there are.

My advice would be not to worry about collisions, premature
optimization, etc. initially. Get the code working, and get it working
right. Then profile and improve your algorithms. You will have test
code from earlier, so you can verify that the optimizations are correct.
Finally, you may have to rewrite core parts in assembly, but again the
tests will be there to help.
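
For the profiling step, something like this is enough to get started
(a minimal sketch; build_index is a hypothetical stand-in for your
real workload):

    import cProfile

    def build_index(items):
        # hypothetical stand-in for the real dictionary-building work
        return dict((item, i) for i, item in enumerate(items))

    profiler = cProfile.Profile()
    profiler.runcall(build_index, range(10**6))
    profiler.print_stats('cumulative')  # show where the time goes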

Roger


