[Tutor] (de)serialization questions
Steven D'Aprano
steve at pearwood.info
Sun Oct 3 07:03:42 CEST 2010
On Sat, 2 Oct 2010 04:08:22 am Albert-Jan Roskam wrote:
> Maybe my main question is as follows: what permanent object is most
> suitable to store a large amount of entries (maybe too many to fit
> into the computer's memory), which can be looked up very fast.
"Very fast" and "cannot fit in main memory" do not go together. Reading
from disk is 100-1000 times slower than reading from main memory, and
writing to disk even slower.
The best you can hope for is "fast enough". And that means using a
database. Which database will depend on what data you are storing, how
much there is, what you plan to do with it, and many other questions.
> Eventually, I want to create two objects:
> 1-one to look up street name and city using zip code
> 2-one to look up zip code using street name, apartment number and
> city
>
> I stored object1 in a marshalled dictionary. Its length is about
> 450.000 (I live in Holland, not THAT many streets).
That will probably fit in main memory. If not, add some more memory.
Even if each address takes 1000 bytes of memory, that's only 450 MB in
total. Double it for the second object and add 100 MB of overhead and
you've only got 1 GB in total -- well within the capabilities of a
server, or even a high-end desktop. Put 4 GB of the fastest memory you
can find into the machine, and with luck it will be fine.
But the only way to tell for sure will be to try it, and see how much
memory it takes. Then consider what will happen if you double the
number of records.
Consider using pickle instead of marshal. marshal is really only
intended for Python byte-code files.
[...]
> You suggest to simply use a file. I like simple solutions, but
> doesn't that, by definition, require a slow, linear search? I chose a
> dictionary for object1 because hash tables are very fast in terms of
> lookup times. Another option might be to use the bisect module. Not
> sure if using a binary search algoritm is much faster with these
> still relatively small numbers. It's interesting for sure, but I
> don't know whether using bisect would justify the added amount of
> complexity and bug risks.
Oh come on now, bisect is not that complicated! Yes, there's a small
amount of added complexity, but it's still a simple algorithm. If
you're seriously considering a linear search over 450 thousand items
just to avoid doing a binary search, there's something seriously wrong.
Is it worth it? Log base 2 of 450000 is less than 19, so the average
binary search will take 19 lookups. The average linear search will take
225000 lookups. I'm confident that will be worth it.
--
Steven D'Aprano
More information about the Tutor
mailing list