From chris at inetd.com.au  Thu May 11 08:56:34 2006
From: chris at inetd.com.au (Chris Foote)
Date: Thu, 11 May 2006 16:26:34 +0930 (CST)
Subject: [sapug] Large dictionaries
Message-ID: 

Hi all.

I have the need to store a large (10M) number of keys in a hash table,
based on a tuple of (long_integer, integer).  The standard Python
dictionary works well for small numbers of keys, but starts to perform
badly for me inserting roughly 5M keys:

    # keys    dictionary    metakit    (both using psyco)
    ------    ----------    -------
    1M              8.8s      22.2s
    2M             24.0s      43.7s
    5M            115.3s     105.4s

Does anyone know of a fast hash module which is more optimal for
large datasets?

p.s. Disk-based DBs are out of the question because most key lookups
will result in a miss, and lookup time is critical for this application.

Cheers,

-- 
Chris Foote 
Inetd Pty Ltd T/A HostExpress
Web:   http://www.hostexpress.com.au
Blog:  http://www.hostexpress.com.au/drupal/chris
Phone: (08) 8410 4566

From spam at afoyi.com  Thu May 11 10:29:53 2006
From: spam at afoyi.com (Darryl Ross)
Date: Thu, 11 May 2006 17:59:53 +0930
Subject: [sapug] Large dictionaries
In-Reply-To: 
References: 
Message-ID: <4462F601.3020409@afoyi.com>

Chris Foote wrote:
> Does anyone know of a fast hash module which is more optimal for
> large datasets?
>
> p.s. Disk-based DBs are out of the question because most key lookups
> will result in a miss, and lookup time is critical for this application.

An in-memory sqlite database (use ':memory:' as the filename) with
appropriate indexes?

Make sure you put any inserts into a transaction.  Using a disk-based
database, inserting 10,000 rows went from being a multi-minute operation
to a split-second operation, just by putting them inside a BEGIN/COMMIT.

Cheers
D

From michael.cohen at netspeed.com.au  Thu May 11 10:15:57 2006
From: michael.cohen at netspeed.com.au (Michael Cohen)
Date: Thu, 11 May 2006 18:15:57 +1000
Subject: [sapug] Large dictionaries
In-Reply-To: 
References: 
Message-ID: <20060511081556.GC2241@OpenWrt>

Hi Chris,

You probably need to balance the hash table.  Normally a dict works by
hashing the key into an array of a certain size.  The array is indexed
by the hash value, and a linked list of values is attached to each slot
in the array.

If you load the table too much, there will be many hash collisions,
which will cause the linked lists to get very long.  This will slow
down access time dramatically.

Apart from not using a tuple for a dict key (I'm not sure how efficient
the algorithm for hashing a Python tuple struct is), I would recommend
splitting the single dict into a dict of dicts.  This will have the
effect of spreading the load on the hash table.  If your key is:

    key = (first, second)

    # instead of something like this:
    hash = {}
    hash[key] = value

    # use (off the top of my head, not necessarily correct):
    hash = {}

    # insertion:
    try:
        tmp = hash[first]
    except KeyError:
        tmp = {}
        hash[first] = tmp
    tmp[second] = value

    # retrieval:
    value = hash[first][second]

This might be faster because the number of elements stored in each hash
table is smaller.  Let us all know how you go.

Michael.
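
(A side note on Michael's code above: dict.setdefault collapses the
try/except lookup-or-create into one step.  A minimal sketch, with
illustrative sample data:)

    table = {}

    # insertion: setdefault returns the existing inner dict for `first`,
    # or inserts and returns a fresh empty one the first time that
    # prefix is seen
    for first, second, value in [(10**12, 1, 'a'), (10**12, 2, 'b')]:
        table.setdefault(first, {})[second] = value

    # retrieval:
    print table[10**12][2]        # -> 'b'

(This tightens the code, but it still creates one inner dict per
distinct first element, so it would not avoid the memory blow-up Chris
reports further down the thread.)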
From Daryl.Tester at iocane.com.au  Thu May 11 11:39:28 2006
From: Daryl.Tester at iocane.com.au (Daryl Tester)
Date: Thu, 11 May 2006 19:09:28 +0930
Subject: [sapug] Large dictionaries
In-Reply-To: 
References: 
Message-ID: <44630650.6010905@iocane.com.au>

Chris Foote wrote:

> I have the need to store a large (10M) number of keys in a hash table,
> based on a tuple of (long_integer, integer).  The standard Python
> dictionary works well for small numbers of keys, but starts to perform
> badly for me inserting roughly 5M keys:

Is this a startup operation, or are you expecting to insert that
many records that often?  I was going to suggest using cPickle to
load a precomputed dictionary, but a quick test shows the performance
is probably worse.

I haven't looked at the modern dictionary code, but seem to recall
that 1.5's code would expand the internal hash table and perform
rebalancing operations when the collision rate got too high (the
rebalancing being the heavy cost).  I thought there might be a
parameter to {}.__init__() to predefine the hash table size,
but alas this seems to have put the kibosh on that:

I find writing new types for Python to be pretty straightforward;
you could always find the (pre-) hash library of your dreams and
bolt that to a new type (or class).

>> p.s. Disk-based DBs are out of the question because most key lookups
>> will result in a miss, and lookup time is critical for this application.

A Python interface to cdb perhaps?

Given enough memory you could probably cache the entire file in RAM.

Cheers.

-- 
Regards,
  Daryl Tester, IOCANE Pty. Ltd.

From Daryl.Tester at iocane.com.au  Thu May 11 11:48:03 2006
From: Daryl.Tester at iocane.com.au (Daryl Tester)
Date: Thu, 11 May 2006 19:18:03 +0930
Subject: [sapug] Large dictionaries
In-Reply-To: <4462F601.3020409@afoyi.com>
References: <4462F601.3020409@afoyi.com>
Message-ID: <44630853.2040501@iocane.com.au>

Darryl Ross wrote:

> Make sure you put any inserts into a transaction.  Using a disk-based
> database, inserting 10,000 rows went from being a multi-minute operation
> to a split-second operation, just by putting them inside a BEGIN/COMMIT.

This applies to a lot of databases (I'm thinking of Postgres and Oracle
in particular) - INSERTs outside of a transaction are wrapped in their
own individual transactions, instead of around a block of 'em.  I think
there's an option in later versions of Postgres to switch it off on a
per-session basis, at least from psql.

-- 
Regards,
  Daryl Tester, IOCANE Pty. Ltd.
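
(A minimal sketch of the in-memory SQLite approach suggested above,
with all the inserts going through a single transaction.  This is
written against the sqlite3 module that ships with Python 2.5 and
later, not the pysqlite v1 used in the thread; the schema and names
are illustrative assumptions:)

    import sqlite3

    conn = sqlite3.connect(':memory:')      # in-memory database
    conn.execute('CREATE TABLE kv (k1 INTEGER, k2 INTEGER, v TEXT)')
    conn.execute('CREATE INDEX kv_idx ON kv (k1, k2)')

    rows = [(i, i % 1000, 'x') for i in xrange(100000)]

    # executemany runs inside one implicit transaction; the single
    # commit below is what avoids paying a COMMIT per INSERT
    conn.executemany('INSERT INTO kv VALUES (?, ?, ?)', rows)
    conn.commit()

    cur = conn.execute('SELECT v FROM kv WHERE k1 = ? AND k2 = ?',
                       (42, 42))
    print cur.fetchone()                    # -> (u'x',)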
From chris at inetd.com.au  Thu May 11 15:15:28 2006
From: chris at inetd.com.au (Chris Foote)
Date: Thu, 11 May 2006 22:45:28 +0930 (CST)
Subject: [sapug] Large dictionaries
In-Reply-To: <4462F601.3020409@afoyi.com>
References: <4462F601.3020409@afoyi.com>
Message-ID: 

On Thu, 11 May 2006, Darryl Ross wrote:

>> p.s. Disk-based DBs are out of the question because most key lookups
>> will result in a miss, and lookup time is critical for this application.
>
> An in-memory sqlite database (use ':memory:' as the filename) with
> appropriate indexes?
>
> Make sure you put any inserts into a transaction.  Using a disk-based
> database, inserting 10,000 rows went from being a multi-minute operation
> to a split-second operation, just by putting them inside a BEGIN/COMMIT.

I just ran a test - the overhead of parsing SQL is too high:

                  dictionary    metakit    sqlite[1]
                  ----------    -------    ---------
    1M numbers          8.8s      22.2s        89.6s
    2M numbers         24.0s      43.7s       190.0s
    5M numbers        115.3s     105.4s          N/A

    [1] pysqlite V1 & sqlite V3.

No go :-(

-- 
Chris Foote 
Inetd Pty Ltd T/A HostExpress
Web:   http://www.hostexpress.com.au
Blog:  http://www.hostexpress.com.au/drupal/chris
Phone: (08) 8410 4566

From chris at inetd.com.au  Thu May 11 15:36:51 2006
From: chris at inetd.com.au (Chris Foote)
Date: Thu, 11 May 2006 23:06:51 +0930 (CST)
Subject: [sapug] Large dictionaries
In-Reply-To: <20060511081556.GC2241@OpenWrt>
References: <20060511081556.GC2241@OpenWrt>
Message-ID: 

On Thu, 11 May 2006, Michael Cohen wrote:

> Apart from not using a tuple for a dict key (I'm not sure how efficient
> the algorithm for hashing a Python tuple struct is), I would recommend
> splitting the single dict into a dict of dicts.  This will have the
> effect of spreading the load on the hash table.

Good thinking, but alas, the overhead of dictionary creation for the
second value was too much:

                  dictionary    metakit    sqlite[1]    hash-in-hash
                  ----------    -------    ---------    ------------
    1M numbers          8.8s      22.2s        89.6s           21.9s
    2M numbers         24.0s      43.7s       190.0s           56.8s
    5M numbers        115.3s     105.4s          N/A        > 185s[2]

    [1] pysqlite V1 & sqlite V3.
    [2] I had to kill the process because it chewed up all 1Gig of RAM :-(

Thanks for the suggestion, but no go.

-- 
Chris Foote 
Inetd Pty Ltd T/A HostExpress
Web:   http://www.hostexpress.com.au
Blog:  http://www.hostexpress.com.au/drupal/chris
Phone: (08) 8410 4566
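
(Chris's benchmark code isn't shown in the thread; a rough harness of
the kind that could produce numbers like these might look as follows.
The sequential key values and key shapes are assumptions, not his
actual test:)

    import time

    def time_inserts(n):
        # keys shaped like Chris's: (long_integer, integer)
        t0 = time.time()
        d = {}
        for i in xrange(n):
            d[(i + 2**40, i % 1000000)] = None
        return time.time() - t0

    for n in (1000000, 2000000, 5000000):
        print '%dM keys: %.1fs' % (n // 1000000, time_inserts(n))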
From chris at inetd.com.au  Thu May 11 15:47:18 2006
From: chris at inetd.com.au (Chris Foote)
Date: Thu, 11 May 2006 23:17:18 +0930 (CST)
Subject: [sapug] Large dictionaries
In-Reply-To: <44630650.6010905@iocane.com.au>
References: <44630650.6010905@iocane.com.au>
Message-ID: 

On Thu, 11 May 2006, Daryl Tester wrote:

> Chris Foote wrote:
>
>> I have the need to store a large (10M) number of keys in a hash table,
>> based on a tuple of (long_integer, integer).  The standard Python
>> dictionary works well for small numbers of keys, but starts to perform
>> badly for me inserting roughly 5M keys:
>
> Is this a startup operation, or are you expecting to insert that
> many records that often?

At startup, and on receiving a signal every few hours.

> I was going to suggest using cPickle to
> load a precomputed dictionary, but a quick test shows the performance
> is probably worse.

You'd run out of RAM pretty quickly parsing it as well :-)

> I haven't looked at the modern dictionary code, but seem to recall
> that 1.5's code would expand the internal hash table and perform
> rebalancing operations when the collision rate got too high (the
> rebalancing being the heavy cost).  I thought there might be a
> parameter to {}.__init__() to predefine the hash table size,
> but alas this seems to have put the kibosh on that:

Ah, that sheds some light on why there isn't an 'nelems' argument
available :-(

> I find writing new types for Python to be pretty straightforward;
> you could always find the (pre-) hash library of your dreams and
> bolt that to a new type (or class).

Yes, that sounds like the way to go, but I can't believe that someone
hasn't written one already.

>> p.s. Disk-based DBs are out of the question because most key lookups
>> will result in a miss, and lookup time is critical for this application.
>
> A Python interface to cdb perhaps?
>
> Given enough memory you could probably cache the entire file in RAM.

I actually tried the dbdbm module with the 'None' file argument, but it
stored the data as a temporary file on disk.  Very, very slow, presumably
due to flushing for every k+v pair, and Sleepycat DB 3 isn't so
lightweight anymore.  I imagine cdb might be faster, so I'll have to try
it on a RAM disk.

-- 
Chris Foote 
Inetd Pty Ltd T/A HostExpress
Web:   http://www.hostexpress.com.au
Blog:  http://www.hostexpress.com.au/drupal/chris
Phone: (08) 8410 4566

From Daryl.Tester at iocane.com.au  Fri May 12 01:06:26 2006
From: Daryl.Tester at iocane.com.au (Daryl Tester)
Date: Fri, 12 May 2006 08:36:26 +0930
Subject: [sapug] Large dictionaries
In-Reply-To: 
References: <44630650.6010905@iocane.com.au>
Message-ID: <4463C372.7030709@iocane.com.au>

Chris Foote wrote:

>> Is this a startup operation, or are you expecting to insert that
>> many records that often?
>
> At startup, and on receiving a signal every few hours.

Yeah, can't trust those dictionaries to hold onto anything these
days. ;-)  (I'm presuming the dictionary isn't being dynamically
updated.)

>> I was going to suggest using cPickle to
>> load a precomputed dictionary, but a quick test shows the performance
>> is probably worse.
>
> You'd run out of RAM pretty quickly parsing it as well :-)

The parsing is pretty good in that regard - very little state is
required to reconstruct a dictionary.  But on subsequent rethink it's
going to suffer the same insert problems that you're experiencing, so
as crap solutions go, that idea is right up there with 'em.

> Yes, that sounds like the way to go, but I can't believe that someone
> hasn't written one already.

I can see other hash table itches that people have scratched, but not
that one.  Looking at the 1.5.2 code for dictobject (it's the only one
I have conveniently unpacked), it would be straightforward to add a
resize method, possibly even to the constructor, but then you'd wind up
with a non-standard Python.

Of course, all this assumes that it's the resize that's killing your
performance.  Remember the words of the Great Dilbert:

  PHB:   "Measure twice ... cut twice ..."
  Wally: "And give the ruler a bad performance review?"

-- 
Regards,
  Daryl Tester, IOCANE Pty. Ltd.
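
(Daryl's cPickle "quick test" isn't shown either; something along
these lines would measure the dump and load cost of a precomputed
dictionary.  The sizes and key shapes here are assumptions, not his
actual test:)

    import cPickle, time

    # a precomputed dictionary of 1M (long_integer, integer) keys
    d = dict(((i + 2**40, i % 1000), None) for i in xrange(1000000))

    t0 = time.time()
    blob = cPickle.dumps(d, cPickle.HIGHEST_PROTOCOL)
    print 'dump: %.1fs' % (time.time() - t0)

    t0 = time.time()
    d2 = cPickle.loads(blob)
    print 'load: %.1fs' % (time.time() - t0)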
From ryan at uanywhere.com.au  Mon May 29 15:25:50 2006
From: ryan at uanywhere.com.au (Ryan Verner)
Date: Mon, 29 May 2006 22:55:50 +0930
Subject: [sapug] Another meeting
Message-ID: <447AF65E.6010305@uanywhere.com.au>

Anytime soon, anybody?

From bofh at afoyi.com  Tue May 30 16:55:05 2006
From: bofh at afoyi.com (Darryl Ross)
Date: Wed, 31 May 2006 00:25:05 +0930
Subject: [sapug] Another meeting
In-Reply-To: <447AF65E.6010305@uanywhere.com.au>
References: <447AF65E.6010305@uanywhere.com.au>
Message-ID: <447C5CC9.8050002@afoyi.com>

Ryan Verner wrote:
> Anytime soon, anybody?

Sounds good to me!

Cheers
D

From george.patterson at gmail.com  Wed May 31 07:29:57 2006
From: george.patterson at gmail.com (George Patterson)
Date: Wed, 31 May 2006 14:59:57 +0930
Subject: [sapug] Another meeting
In-Reply-To: <447C5CC9.8050002@afoyi.com>
References: <447AF65E.6010305@uanywhere.com.au> <447C5CC9.8050002@afoyi.com>
Message-ID: <1149053397.6213.2.camel@beast64.localnet>

On Wed, 2006-05-31 at 00:25 +0930, Darryl Ross wrote:
> Ryan Verner wrote:
>> Anytime soon, anybody?
>
> Sounds good to me!
>
> Cheers
> D

Not this week, but the week after next is good for me (rotating shift
work is annoying for scheduling stuff).  Otherwise, start having sapug
meetings at the same time each month and I will be at some of them.

Regards

George