[Tutor] (de)serialization questions

Steven D'Aprano steve at pearwood.info
Sun Oct 3 07:03:42 CEST 2010


On Sat, 2 Oct 2010 04:08:22 am Albert-Jan Roskam wrote:

> Maybe my main question is as follows: what permanent object is most
> suitable to store a large amount of entries (maybe too many to fit
> into the computer's memory), which can be looked up very fast.

"Very fast" and "cannot fit in main memory" do not go together. Reading 
from disk is 100-1000 times slower than reading from main memory, and 
writing to disk even slower.

The best you can hope for is "fast enough". And that means using a 
database. Which database will depend on what data you are storing, how 
much there is, what you plan to do with it, and many other questions.
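For what it's worth, here is a rough sketch of what a disk-based lookup 
could look like using the sqlite3 module from the standard library (the 
table layout and the sample record are just invented for illustration):

import sqlite3

# Rough sketch only: table layout and sample record are invented.
conn = sqlite3.connect("addresses.db")
conn.execute("CREATE TABLE IF NOT EXISTS streets "
             "(zipcode TEXT PRIMARY KEY, street TEXT, city TEXT)")
conn.execute("INSERT OR REPLACE INTO streets VALUES (?, ?, ?)",
             ("1012AB", "Damrak", "Amsterdam"))
conn.commit()

# Indexed lookup by zip code: no linear scan, and the data set
# never has to fit in memory.
row = conn.execute("SELECT street, city FROM streets WHERE zipcode = ?",
                   ("1012AB",)).fetchone()
print(row)
conn.close()

Because zipcode is the primary key, the database does the indexing for 
you and each lookup only touches a handful of disk pages.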


> Eventually, I want to create two objects:
> 1-one to look up street name and city using zip code
> 2-one to look up zip code using street name, apartment number and
> city
>
> I stored object1 in a marshalled dictionary. Its length is about
> 450.000 (I live in Holland, not THAT many streets). 

That will probably fit in main memory. If not, add some more memory. 
Even if each address takes 1000 bytes of memory, that's only 450 MB in 
total. Double it for the second object and add 100 MB of overhead and 
you've only got 1 GB in total -- well within the capabilities of a 
server, or even a high-end desktop. Put 4 GB of the fastest memory you 
can find into the machine, and with luck it will be fine.

But the only way to tell for sure will be to try it, and see how much 
memory it takes. Then consider what will happen if you double the 
number of records.
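If you want a quick estimate before committing, a rough sketch along 
these lines might do. Random strings stand in for real street names, and 
sys.getsizeof only counts what you explicitly add up, so treat the 
result as a ballpark figure:

import sys, random, string

# Rough sketch only: random strings stand in for real addresses.
# sys.getsizeof is shallow, so the keys and values are added by hand.
def random_text(n):
    return "".join(random.choice(string.ascii_letters) for _ in range(n))

table = dict((random_text(6), (random_text(20), random_text(15)))
             for _ in range(450000))

total = sys.getsizeof(table)
for key, value in table.items():
    total += sys.getsizeof(key) + sys.getsizeof(value)
    total += sum(sys.getsizeof(part) for part in value)

print("approximately %.0f MB" % (total / 1e6))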

Consider using pickle instead of marshal. marshal is really only 
intended for Python byte-code files.
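Something like this sketch is all it takes (the file name and the single 
sample entry are made up):

import pickle

# Rough sketch: the file name and the single sample entry are invented.
lookup = {"1012AB": ("Damrak", "Amsterdam")}

with open("zip_lookup.pickle", "wb") as f:
    pickle.dump(lookup, f, pickle.HIGHEST_PROTOCOL)

with open("zip_lookup.pickle", "rb") as f:
    lookup = pickle.load(f)

print(lookup["1012AB"])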


[...]
> You suggest to simply use a file. I like simple solutions, but
> doesn't that, by definition, require a slow, linear search? I chose a
> dictionary for object1 because hash tables are very fast in terms of
> lookup times. Another option might be to use the bisect module. Not
> sure if using a binary search algorithm is much faster with these
> still relatively small numbers. It's interesting for sure, but I
> don't know whether using bisect would justify the added amount of
> complexity and bug risks.

Oh come on now, bisect is not that complicated! Yes, there's a small 
amount of added complexity, but it's still a simple algorithm. If 
you're seriously considering a linear search over 450 thousand items 
just to avoid doing a binary search, there's something seriously wrong.

Is it worth it? Log base 2 of 450000 is less than 19, so a binary 
search will take at most 19 lookups. The average linear search will 
take 225000 lookups. I'm confident that will be worth it.
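
Here is a rough sketch of the sort of thing I mean, using bisect over a 
sorted list of (zipcode, street, city) tuples (the sample records are 
invented):

import bisect

# Rough sketch: records are invented (zipcode, street, city) tuples,
# kept sorted by zip code so bisect can do a binary search.
records = sorted([
    ("1012AB", "Damrak", "Amsterdam"),
    ("2511CV", "Spui", "Den Haag"),
    ("3011HE", "Blaak", "Rotterdam"),
])
keys = [rec[0] for rec in records]   # parallel sorted list of zip codes

def lookup(zipcode):
    # about log2(n) comparisons instead of n/2 for a linear scan
    i = bisect.bisect_left(keys, zipcode)
    if i != len(keys) and keys[i] == zipcode:
        return records[i]
    raise KeyError(zipcode)

print(lookup("2511CV"))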


-- 
Steven D'Aprano

