fast method for accessing large, simple structured data

agc clemesha at gmail.com
Sat Feb 2 23:42:25 EST 2008


On Feb 2, 1:50 pm, John Machin <sjmac... at lexicon.net> wrote:
> agc wrote:
> > Hi,
>
> > I'm looking for a fast way of accessing some simple (structured) data.
>
> > The data is like this:
> > Approx. 6-10 GB of simple XML files, where the only elements
> > I really care about are the <title> and <article> ones.
>
> > So what I'm hoping to do is put this data in a format
> > that I can access as fast as possible for a given request
> > (HTTP request, Python web server) that specifies just the title,
> > and then return the article content.
>
> > Is there some good format that is optimized for searching on
> > just one attribute (title) and then returning the corresponding article?
>
> > I've thought about putting this data in a SQLite database because
> > from what I know SQLite has very fast reads (no network latency, etc.)
> > but not as fast writes, which is fine because I probably won't be doing
> > much writing (I won't ever care about the speed of any writes).
>
> > So is a database the way to go, or is there some other,
> > more specialized format that would be better?
>
> "Database" without any further qualification indicates exact matching,
> which doesn't seem to be very practical in the context of titles of
> articles. There is an enormous body of literature on inexact/fuzzy
> matching, and lots of deployed applications -- it's not a Python-related
> question, really.

Yes, you are right that in some sense this question is not truly
Python-related, but I am looking to solve this problem in a way that
plays as nicely as possible with Python:

I guess an important feature of what I'm looking for is
some kind of mapping from *exact* title to corresponding article;
i.e., if my data set weren't so large, I would just keep all my
data in an in-memory Python dictionary, which would be very fast.
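
For concreteness, here is a minimal sketch of that dictionary idea,
streaming the file with xml.etree.ElementTree.iterparse so the whole
tree is never built in memory (the filename and the exact
title/article pairing are assumptions based on your description):

    import xml.etree.ElementTree as ET

    articles = {}  # exact title -> article body

    # Stream the file; iterparse yields each element as its end tag
    # is seen.  'articles.xml' is a made-up filename.
    title = None
    for event, elem in ET.iterparse('articles.xml'):
        if elem.tag == 'title':
            title = elem.text
        elif elem.tag == 'article' and title is not None:
            articles[title] = elem.text
            title = None
            elem.clear()  # discard the parsed contents as we go

    def lookup(title):
        return articles.get(title)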

But I have about 2 million article titles mapping to approx. 6-10 GB
of article bodies, so I think this would be just too big for a
simple Python dictionary.
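
A quick back-of-envelope check (the per-entry sizes are rough
assumptions):

    2,000,000 titles  x ~50 bytes each        ~=  0.1 GB of keys
    2,000,000 entries x ~100 bytes overhead   ~=  0.2 GB of dict bookkeeping
    article bodies                             =  6-10 GB

The bodies dominate, so the whole mapping needs upwards of 6 GB of
physical RAM just to exist, before the web server gets any.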

Does anyone have any advice on the feasibility of using
just an in-memory dictionary?  The dataset just seems too big,
but maybe there is a related method?
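
Since you mentioned SQLite: a disk-backed table with title as the
primary key gives exact-match lookups without holding the bodies in
RAM, and SQLite builds the index for you.  A minimal sketch (the file
and table names are made up):

    import sqlite3

    conn = sqlite3.connect('articles.db')
    conn.execute('CREATE TABLE IF NOT EXISTS articles '
                 '(title TEXT PRIMARY KEY, body TEXT)')

    def store(title, body):
        # Write speed barely matters: load once, read many times.
        conn.execute('INSERT OR REPLACE INTO articles VALUES (?, ?)',
                     (title, body))

    # Remember conn.commit() once the bulk load is done.

    def lookup(title):
        # One indexed lookup per HTTP request.
        row = conn.execute('SELECT body FROM articles WHERE title = ?',
                           (title,)).fetchone()
        return row[0] if row else None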

Thanks,
Alex