Running queries on large data structure

jay graves jaywgraves at gmail.com
Thu Aug 3 15:11:55 EDT 2006


Christoph Haas wrote:
> On Thursday 03 August 2006 17:40, jay graves wrote:
> > How hard would it be to create this nested structure?
> Not hard. Instead of doing "INSERT INTO" I would add values to a dictionary
> or list. That's even simpler.
> > I've found
> > pickling really large data structures doesn't really save a huge amount
> > of time when reloading them from disk but YMMV and you would have to
> > profile it to know for sure.
> Okay, that takes a bit of pickle's magic away. :)

But since it is so easy to create your nested structure, it may be
worth trying.  I've rarely used pickled files and maybe my specific
data structure caused a lot of churn in the pickle/unpickle code.
Doesn't hurt to try.  You also need to try walking your data structure
to see how easy/efficient it is to get the results you want.  If you
have to do a text search for every node, it might actually be slower.
In the app I described, every time I do a reload (equivalent to your
parse step) I interrogate each row and update multiple dictionaries
with different sets of key tuples, with the dictionary value being the
row itself (just like indexes in a SQL db).  The row is the same object
so the only extra memory I need is for the key tuples.  It sure beats
iterating over a list with 50K entries top to bottom and testing for
the right condition, but I don't know your app so I can't tell if this
is a valid strategy.
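
For example, something like this rough sketch is what I mean (the
field names and the parse step are made up for illustration, and I'm
assuming each key tuple is unique in your data):

rows = parse_source_data()    # hypothetical: returns a list of row dicts

# build several "indexes": each maps a key tuple to the very same row
# object, so the only extra memory used is for the key tuples themselves
by_src_dst = {}
by_port = {}
for row in rows:
    by_src_dst[(row['src'], row['dst'])] = row
    by_port[(row['port'],)] = row

# a lookup is now a dictionary access instead of a scan over 50K entries
hit = by_src_dst.get(('10.0.0.1', '192.168.1.5'))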

> > > So the question is: would you rather force the data into a relational
> > > database and write object-relational wrappers around it? Or would you
> > > pickle it and load it later and work on the data? The latter
> > > application is currently a CGI. I'm open to whatever. :)
> > Convert your CGI to a persistent Python web server (I use CherryPy but
> > you can pick whatever works for you.) and store the nested data
> > structure globally.  Reload/Reparse as necessary.  It saves the
> > pickle/unpickle step.

> Up to now I have just used CGI. But that doesn't stop me from looking at
> other web frameworks. However the reparsing as necessary makes a quick
> query take 10-30 seconds. And my users usually query the database just
> once every now and then and expect to have little delay. That time is not
> very user-friendly.

I'm not sure I made the advantages of a Python web/app server clear.
The main point of CherryPy or similar web frameworks is that, since the
framework itself is serving the HTTP requests (which are admittedly
lightweight), it can keep any data you want persistent because it is
always running, not respawned on each request.

(Caveat:  These are very broad strokes, and there are possible race
conditions, but no worse than with your Postgres solution.)

Imagine if you will:

import time

fwdata = {}
expiretime = 0

def loaddata():
    global fwdata, expiretime
    temp = {}
    # ... parse the source data into the temp dictionary ...
    expiretime = time.time() + 5 * 60   # keep the parsed data for 5 minutes
    fwdata = temp                       # swap in the fresh data in one step

loaddata()
while 1:
    # handle the incoming HTTP request
    if query:
        if time.time() > expiretime:
            loaddata()
        # query fwdata and build the output HTML

Does the underlying data change every 5 minutes?  If not, you could even
be trickier and provide a 'reload' URL that forces the app to reload the
data.  If you can track when the source data changes (maybe there are
sanctioned interfaces to use when editing the data), just hit the
appropriate reload URL and your app is always up to date without lots
of needless reparsing.
e.g.
fwdata = {}

def loaddata():
    global fwdata
    temp = {}
    # ... parse the source data into the temp dictionary ...
    fwdata = temp

loaddata()
while 1:
    # handle the incoming HTTP request
    if queryrequest:
        pass        # query fwdata and build the output HTML
    elif reloadrequest:
        loaddata()
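
For concreteness, with CherryPy the same idea might look roughly like
this (an untested sketch; the class and method names are made up, and
the exact CherryPy calls depend on your version, but the point is that
fwdata lives in the process for as long as the server runs):

import cherrypy

fwdata = {}

def loaddata():
    global fwdata
    temp = {}
    # ... parse the source data into the temp dictionary ...
    fwdata = temp

class FirewallQuery(object):
    @cherrypy.expose
    def query(self, **params):
        # look up the answer in the in-memory fwdata and build HTML
        return "<html>...results...</html>"

    @cherrypy.expose
    def reload(self):
        loaddata()
        return "data reloaded"

loaddata()
cherrypy.quickstart(FirewallQuery())   # serves /query and /reload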


Hope this helps or clarifies my point.
...
jay graves



