Running queries on large data structure

Christoph Haas email at christoph-haas.de
Thu Aug 3 13:15:17 EDT 2006


On Thursday 03 August 2006 17:40, jay graves wrote:
> Christoph Haas wrote:
> > The situation is that I have input data that take ~1 minute to parse
> > while the users need to run queries on that within seconds. I can
> > think of two ways:
>
> What is the raw data size?

The file containing the objects is 2.3 MB. The rules that I parse are 3.8 
MB. Both files contain different objects and the properties/settings of 
each object, one per line.

> Are there any efficiencies to be gained in the parsing code?

The currently working application is still Perl. I have started to rewrite 
the parser in Python. However, importing the objects alone takes ~5 
seconds, and the rules probably take at least as long. I already use 
regular expressions only rarely and try to make the parsing as efficient 
as possible (without implementing parts in C).
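For illustration, this is roughly the kind of line-based parsing I mean -- 
the "name / indented key = value" format below is invented, but the point 
is plain string splitting instead of regular expressions:

  def parse_objects(path):
      """Parse a file of objects with one property per line into a dict."""
      objects = {}
      current = None
      for line in open(path):
          line = line.rstrip()
          if not line or line.startswith("#"):
              continue
          if not line.startswith(" "):      # unindented line starts a new object
              current = line
              objects[current] = {}
          else:                             # indented "key = value" property line
              key, _, value = line.strip().partition("=")
              objects[current][key.strip()] = value.strip()
      return objects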

> > (1) Database
> >     (very quick, but the input data is deeply nested and it would be
> >      ugly to convert it into some relational shape for the database)
>
> Depending on your tolerance for this ugliness.  You could use a SQLite
> 'memory' database.  _Might_ be faster than PostgreSQL but you can't
> tell until you profile it.

The current application uses PostgreSQL, and parsing all the data and 
moving it into the database takes ~30 seconds at the moment. Since that is 
done every 5 minutes in the background, nobody has to wait for it.
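For reference, the in-memory SQLite variant you mention would look roughly 
like this with the sqlite3/pysqlite module -- the table layout is invented 
here, it only shows the ":memory:" part:

  import sqlite3

  conn = sqlite3.connect(":memory:")   # lives only as long as the process
  conn.execute("CREATE TABLE objects (name TEXT, key TEXT, value TEXT)")
  conn.executemany("INSERT INTO objects VALUES (?, ?, ?)",
                   [("host1", "ip", "10.0.0.1"), ("host1", "port", "22")])
  conn.commit()
  for row in conn.execute("SELECT key, value FROM objects WHERE name = ?",
                          ("host1",)):
      print(row)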

> > (2) cPickle
> >     (Read the data every now and then, parse it, write the nested
> > Python data structure into a pickled file. The let the other
> > application that does the queries unpickle the variable and use it
> > time and again.)
>
> How hard would it be to create this nested structure?

Not hard. Instead of doing "INSERT INTO" I would add values to a dictionary 
or list. That's even simpler.
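Roughly what I have in mind (the keys and the file name are made up):

  import cPickle as pickle   # plain "pickle" on Python 3

  # Build the nested structure instead of doing "INSERT INTO ..."
  data = {"objects": {"host1": {"ip": "10.0.0.1", "port": "22"}},
          "rules":   [{"src": "host1", "action": "accept"}]}

  # Writer side: runs every few minutes in the background
  pickle.dump(data, open("parsed.pickle", "wb"), pickle.HIGHEST_PROTOCOL)

  # Query side (the CGI): load the structure and work on it directly
  data = pickle.load(open("parsed.pickle", "rb"))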

> I've found 
> pickling really large data structures doesn't really save a huge amount
> of time when reloading them from disk but YMMV and you would have to
> profile it to know for sure.

Okay, that takes a bit of pickle's magic away. :)

> > So the question is: would you rather force the data into a relational
> > database and write object-relational wrappers around it? Or would you
> > pickle it and load it later and work on the data? The latter
> > application is currently a CGI. I'm open to whatever. :)
>
> Convert your CGI to a persistent Python webserver (I use CherryPy but
> you can pick whatever works for you.) and store the nested data
> structure globally.  Reload/Reparse as necessary.  It saves the
> pickle/unpickle step.

Up to now I have only used CGI, but that doesn't stop me from looking at 
other web frameworks. However, the "reparse as necessary" step would make 
a quick query take 10-30 seconds. My users usually query the database just 
once every now and then and expect little delay, so that kind of wait is 
not very user-friendly.
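Still, keeping the structure resident in a long-running process sounds 
appealing. If I understand your suggestion correctly, it would look roughly 
like this in CherryPy (a sketch only -- load_data() stands in for my real 
parser):

  import cherrypy

  def load_data():
      # stand-in for the real parser; returns the nested structure
      return {"objects": {"host1": {"ip": "10.0.0.1", "port": "22"}}}

  DATA = load_data()   # parsed once at startup, kept in memory for all requests

  class QueryApp(object):
      @cherrypy.expose
      def query(self, name=""):
          return repr(DATA["objects"].get(name, "not found"))

  cherrypy.quickstart(QueryApp())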

It makes me feel bad to (ab)use PostgreSQL for what is just throw-away 
data. The database would only be needed for persistence. It is not even 
needed for querying the data, because it turns out SQL is not powerful 
enough for the kind of queries I need. That's why I thought some other 
kind of persistent storage would come in handy. Up to now my mind just 
knew the equation persistence==database.

 Christoph


