Running queries on large data structure

jay graves jaywgraves at gmail.com
Thu Aug 3 11:40:45 EDT 2006


Christoph Haas wrote:
> On Wednesday 02 August 2006 22:24, Christoph Haas wrote:
> I suppose my former posting was too long and concrete. So allow me to try
> it in a different way. :)

OK.  I'll bite.

> The situation is that I have input data that take ~1 minute to parse while
> the users need to run queries on that within seconds. I can think of two
> ways:

What is the raw data size?
Are there any efficiencies to be gained in the parsing code?

> (1) Database
>     (very quick, but the input data is deeply nested and it would be
>      ugly to convert it into some relational shape for the database)

Depending on your tolerance for this ugliness, you could use an SQLite
in-memory database.  It _might_ be faster than PostgreSQL, but you can't
tell until you profile it.
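Roughly what I mean, using Python's sqlite3 module -- the table layout
here is made up purely for illustration; the real schema depends on how
you decide to flatten your nested data:

import sqlite3

# A throwaway in-memory database; nothing ever touches disk.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Hypothetical flattened table for the nested input data.
cur.execute("CREATE TABLE records (parent_id INTEGER, key TEXT, value TEXT)")
cur.executemany("INSERT INTO records VALUES (?, ?, ?)",
                [(1, 'name', 'foo'), (1, 'size', '42'), (2, 'name', 'bar')])
conn.commit()

# Queries are now plain SQL and should come back well inside a second.
cur.execute("SELECT key, value FROM records WHERE parent_id = ?", (1,))
print(cur.fetchall())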

> (2) cPickle
>     (Read the data every now and then, parse it, write the nested Python
>      data structure into a pickled file. The let the other application
>      that does the queries unpickle the variable and use it time and
>      again.)

How expensive is it to build this nested structure in the first place?
I've found that pickling really large data structures doesn't save a
huge amount of time over rebuilding them when you reload from disk, but
YMMV -- you'd have to profile it to know for sure.
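A rough way to profile that, assuming the parsed result is an ordinary
nested dict/list structure.  build_structure() here is just a stand-in
for your parser, and pickle is cPickle if you're on Python 2:

import pickle
import time

def build_structure():
    # Stand-in for the ~1 minute parse of the raw input.
    return {'item%d' % i: {'children': list(range(20))}
            for i in range(100000)}

data = build_structure()

t0 = time.time()
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
print('dump: %.2fs' % (time.time() - t0))

t0 = time.time()
with open('data.pkl', 'rb') as f:
    data2 = pickle.load(f)
print('load: %.2fs' % (time.time() - t0))

Compare the load time against simply re-running build_structure() and
you'll know whether the pickle step buys you anything.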

> So the question is: would you rather force the data into a relational
> database and write object-relational wrappers around it? Or would you
> pickle it and load it later and work on the data? The latter application
> is currently a CGI. I'm open to whatever. :)

Convert your CGI to a persistent Python webserver (I use CherryPy, but
pick whatever works for you) and store the nested data structure
globally.  Reload/reparse as necessary.  That saves the
pickle/unpickle step entirely.
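A minimal sketch of that idea against the current CherryPy API (the
parse_input() helper and the /query URL are made up for illustration):

import time
import cherrypy

def parse_input():
    # Stand-in for the ~1 minute parse; returns the nested structure.
    return {'example': {'nested': [1, 2, 3]}}

class QueryApp(object):
    def __init__(self):
        # Parsed once at process startup, then kept in memory for
        # every request.  Re-run parse_input() whenever the raw data
        # changes.
        self.data = parse_input()
        self.loaded = time.time()

    @cherrypy.expose
    def query(self, key):
        # Requests work on the already-parsed in-memory structure.
        return repr(self.data.get(key, 'not found'))

if __name__ == '__main__':
    cherrypy.quickstart(QueryApp())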

In an application I'm working on, I create multiple 'views' from a
single expensive database query.  I tuck all of these views (read:
'deeply nested Python structures') into a cache with an expiration time
(currently 5 minutes in the future).  My data layer checks the cache
before doing any queries and uses the appropriate view for the request.
If the cache misses or the entry has expired, I run the expensive query
and reload the cache.  This way there is one 'fat' page load every 5
minutes (about 4 seconds on my dev box) and almost every other page is
sub-second.
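A bare-bones version of that kind of time-based cache (names here are
illustrative, not lifted from my actual code):

import time

CACHE_TTL = 5 * 60          # five minutes, as above
_cache = {}                 # view name -> (timestamp, data)

def get_view(name, rebuild):
    """Return a cached view, rebuilding it via rebuild() when it expires."""
    entry = _cache.get(name)
    if entry is not None:
        stamp, data = entry
        if time.time() - stamp < CACHE_TTL:
            return data                      # cache hit, still fresh
    data = rebuild()                         # expensive query / reshaping
    _cache[name] = (time.time(), data)
    return data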

> Thanks for any enlightenment.
Just my 2 cents.

...
jay graves
