fast method accessing large, simple structured data

M.-A. Lemburg mal at egenix.com
Sat Feb 2 18:20:43 EST 2008


On 2008-02-02 21:36, agc wrote:
> Hi,
> 
> I'm looking for a fast way of accessing some simple (structured) data.
> 
> The data is like this:
> Approx. 6 - 10 GB of simple XML files, where the only elements
> I really care about are the <title> and <article> ones.
> 
> So what I'm hoping to do is put this data in a format so
> that I can access it as fast as possible for a given request
> (http request, Python web server) that specifies just the title,
> and I return the article content.
> 
> Is there some good format that is optimized for search for
> just 1 attribute (title) and then returning the corresponding article?
> 
> I've thought about putting this data in a SQLite database because
> from what I know SQLite has very fast reads (no network latency, etc)
> but not as fast writes, which is fine because I probably won't be doing
> much writing (I won't ever care about the speed of any writes).
> 
> So is a database the way to go, or is there some other,
> more specialized format that would be better?

Depends on what you want to search and how, e.g. whether
a search for title substrings should give results, whether
stemming is needed, etc.

If all you want is a simple mapping of full title to article
string, an on-disk dictionary is probably the way to go,
e.g. mxBeeBase (part of the eGenix mx Base Distribution).
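
As a rough illustration of the on-disk dictionary idea, here is a minimal
sketch that uses Python's standard-library dbm module as a stand-in. It does
not use the mxBeeBase API, and the names build_store, lookup and store_path
are made up for this example.

```python
import dbm
import os
import tempfile


def build_store(path, articles):
    """Write (title, article) pairs into an on-disk key/value file."""
    with dbm.open(path, "c") as db:  # "c": create the file if needed
        for title, text in articles:
            # dbm stores bytes, so encode both key and value explicitly.
            db[title.encode("utf-8")] = text.encode("utf-8")


def lookup(path, title):
    """Return the article for an exact title, or None if it is absent."""
    key = title.encode("utf-8")
    with dbm.open(path, "r") as db:  # "r": read-only access
        return db[key].decode("utf-8") if key in db else None


# Hypothetical sample data; the real input would come from the XML files.
store_path = os.path.join(tempfile.mkdtemp(), "articles")
build_store(store_path, [
    ("Python", "Python is a programming language."),
    ("SQLite", "SQLite is an embedded database."),
])
article = lookup(store_path, "Python")
```

This only answers exact-title requests. A title that differs in case or
spelling is simply not found.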

For more complex searches, you're better off with a tool that
indexes the titles based on words, i.e. a full-text search
engine such as Lucene.
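
To show what "indexes the titles based on words" means, here is a toy
inverted index in plain Python. It is only a sketch of the idea, not Lucene's
API. It has no stemming, ranking or on-disk storage, and the names tokenize,
build_index and search are made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(titles):
    """Map each word to the set of titles that contain it."""
    index = defaultdict(set)
    for title in titles:
        for word in tokenize(title):
            index[word].add(title)
    return index


def search(index, query):
    """Return the titles that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample titles.
titles = ["Python programming language", "Monty Python", "Java programming"]
index = build_index(titles)
hits = search(index, "programming python")
```

A query then costs one dictionary lookup per query word plus a set
intersection, no matter how many titles are indexed.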

Databases can also handle this, but they often have problems when
it comes to more complex queries where their indexes no longer
help them to speed up the query and they have to resort to
doing a table scan - a sequential search of all rows.
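
The difference can be seen with SQLite's EXPLAIN QUERY PLAN, here via
Python's sqlite3 module. This is a small sketch: the table layout and the
helper name plan are assumptions for illustration, and the exact wording of
the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The PRIMARY KEY on a TEXT column makes SQLite create an index on title.
conn.execute("CREATE TABLE articles (title TEXT PRIMARY KEY, article TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [("Python", "Python is a programming language."),
     ("SQLite", "SQLite is an embedded database.")],
)


def plan(sql, params):
    """Return SQLite's query plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable detail text.
    return " ".join(row[-1] for row in rows)


# Exact match on the indexed column: the plan reports an index SEARCH.
exact_plan = plan("SELECT article FROM articles WHERE title = ?",
                  ("Python",))

# Substring match with a leading wildcard: the index cannot be used,
# so the plan reports a SCAN of the whole table.
substring_plan = plan("SELECT article FROM articles WHERE title LIKE ?",
                      ("%thon%",))
```

Printing exact_plan and substring_plan shows an index search for the first
query and a table scan for the second.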

Some databases provide special full-text extensions, but
those are of varying quality. It is better to use a specialized
tool such as Lucene for this.

For more background on the problems of full-text search, see e.g.

http://www.ibm.com/developerworks/opensource/library/l-pyind.html

-- 
Marc-Andre Lemburg
eGenix.com



