searching and storing large quantities of xml!

Stefan Behnel stefan_ml at behnel.de
Sun Jan 17 02:35:20 EST 2010


dads, 16.01.2010 19:10:
> I work in 1st line support and Python is one of my hobbies. We get
> quite a few requests for XML from our website and it's a long,
> strung-out process. So I thought I'd try to create a system that
> deals with it, for fun.
> 
> I've been tidying up the archived XML and have been thinking about
> the best way to approach this, as it takes a long time to deal with
> big quantities of XML when you have 5/6 years' worth of 26000+
> 5-20k XML files per year. The archived stuff is zipped, but which is
> better: 26000 files in one big zip file, 26000 files in one big zip
> file but in folders for months and days, or zip files in zip files?

As Paul suggested, that doesn't sound like a lot of data at all.

Personally, I guess I'd keep separate folders of .gz files, but that also
depends on platform characteristics (e.g. I/O performance and file system).
If you already have them archived in .zip files, that's fine also. Your
files are fairly small, so combined .zip archival will likely yield better
space efficiency.
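Either way, there's no need to unpack the archives to get at a single file;
zipfile can read individual entries straight from the archive. A minimal
sketch (the archive and entry names are made up for the demo, and the demo
zip is built in memory so it runs standalone):

```python
import io
import zipfile

# stand-in for one of the yearly archives
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("2010/01/16/12345.xml", "<request><id>12345</id></request>")

# read one entry directly, without extracting anything to disk
with zipfile.ZipFile(buf) as archive:
    xml_bytes = archive.read("2010/01/16/12345.xml")
```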


> I created an app in wxPython to search the unzipped XML files by the
> modified date: I just open them up and use something like
> l.find('>%s<' % fiveDigitNumber) != -1. Is this quicker than parsing
> the XML?

Sounds like all you need is an index. Look at the various dbm database
modules that come with Python. They are basically persistent dictionaries.
To achieve the above, all you have to do is parse the XML files once (use
cElementTree's iterparse() for that, as Steve suggested) and store a path
(i.e. zip file name and archive entry name) for each 'modified date' that
you find in the file. When your app then needs to access data for a
specific modification date, you can look up the file path in the dbm and
parse it directly from the open zip file.
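A rough sketch of that index, assuming each XML file has a <modified>
element carrying the date (adjust the tag name to your real data). Note
that in Python 3 cElementTree has been folded into xml.etree.ElementTree:

```python
import dbm
import zipfile
from xml.etree.ElementTree import iterparse

def build_index(zip_path, index_path="xml_index.db"):
    """Map each file's 'modified' date to its location in the archive."""
    with dbm.open(index_path, "c") as index, \
            zipfile.ZipFile(zip_path) as archive:
        for entry in archive.namelist():
            with archive.open(entry) as f:
                for _event, elem in iterparse(f):
                    if elem.tag == "modified":
                        # key: modification date; value: "zipfile|entry"
                        index[elem.text] = "%s|%s" % (zip_path, entry)
                        break

def lookup(date, index_path="xml_index.db"):
    """Return the raw XML bytes for a given modification date."""
    with dbm.open(index_path) as index:
        zip_path, entry = index[date].decode().split("|")
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(entry)  # ready to parse
```

You pay the parsing cost once, at index-build time; afterwards each lookup
is a dictionary hit plus one decompression.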

That's a simple solution, at least. Depending on your expected load and
additional feature set, e.g. full text indexing, a more complex database
setup (potentially even copying all data into the DB) may still be worth
going for.
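If the dbm approach outgrows itself, the same index fits naturally into
SQLite, which ships with Python. Schema and column names below are
illustrative only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent index
conn.execute("""CREATE TABLE requests (
                    modified   TEXT,
                    zip_file   TEXT,
                    entry_name TEXT)""")
conn.execute("CREATE INDEX idx_modified ON requests (modified)")

# one row per archived XML file
conn.execute("INSERT INTO requests VALUES (?, ?, ?)",
             ("2010-01-16", "2010.zip", "12345.xml"))

rows = conn.execute("SELECT zip_file, entry_name FROM requests "
                    "WHERE modified = ?", ("2010-01-16",)).fetchall()
```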

Stefan



More information about the Python-list mailing list