Fast file data retrieval?

Jorgen Grahn grahn+nntp at snipabacken.se
Tue Mar 13 16:44:37 EDT 2012


On Mon, 2012-03-12, MRAB wrote:
> On 12/03/2012 19:39, Virgil Stokes wrote:
>> I have a rather large ASCII file that is structured as follows
>>
>> header line
>> 9 nonblank lines with alphanumeric data
>> header line
>> 9 nonblank lines with alphanumeric data
>> ...
>> ...
>> ...
>> header line
>> 9 nonblank lines with alphanumeric data
>> EOF
>>
>> where a data set contains 10 lines (header + 9 nonblank) and there can
>> be several thousand data sets in a single file. In addition, each
>> header has a unique ID code.
>>
>> Is there a fast method for the retrieval of a data set from this large
>> file given its ID code?

[Responding here since the original is not available on my server.]

It depends on what you want to do. Access a few of the entries (what
you call data sets) from your program? Process all of them?  How fast
do you need it to be?

> Probably the best solution is to put it into a database. Have a look at
> the sqlite3 module.

Some people like to use databases for everything, others never use
them. I'm in the latter crowd, so to me this sounds like overkill, and
possibly impractical. What if he has to keep the text file around? A
database on disk would mean duplicating the data. A database in memory
would not offer any benefits over a hash (a plain dict).

> Alternatively, you could scan the file, recording the ID and the file
> offset in a dict so that, given an ID, you can seek directly to that
> file position.
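
A minimal sketch of that offset-index idea. The thread does not say
where the ID sits in the header, so this assumes it is the first
whitespace-separated token; the function names are made up for
illustration, and every data set is taken to be exactly 10 lines.

```python
LINES_PER_SET = 10  # header + 9 data lines, as described in the question


def build_offset_index(path):
    """Scan the file once and map each ID to the byte offset of its header."""
    index = {}
    # Binary mode keeps tell()/seek() offsets exact.
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            header = f.readline()
            if not header.strip():
                break  # EOF (or a trailing blank line) ends the scan
            # ASSUMPTION: the ID is the first token of the header line.
            set_id = header.split()[0].decode("ascii")
            index[set_id] = offset
            # Skip the 9 data lines that follow the header.
            for _ in range(LINES_PER_SET - 1):
                f.readline()
    return index


def fetch(path, index, set_id):
    """Seek straight to one data set and return its 10 lines."""
    with open(path, "rb") as f:
        f.seek(index[set_id])
        return [f.readline().decode("ascii").rstrip("\r\n")
                for _ in range(LINES_PER_SET)]
```

The index costs one full pass over the file; after that each lookup is
a seek plus ten readline() calls.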

Memory-mapping the file (the mmap module) is another option, but I
wonder if it would really improve things.
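
For comparison, a rough sketch of what an mmap-based lookup might look
like. It assumes the ID is the first whitespace-separated token of the
header and that no data line starts with the same token; neither is
stated in the thread, and find_data_set is a made-up name.

```python
import mmap

LINES_PER_SET = 10  # header + 9 data lines


def find_data_set(path, set_id):
    """Search a memory-mapped file for a header starting with set_id."""
    needle = set_id.encode("ascii")
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                end = pos + len(needle)
                # Accept only a whole token at the start of a line.
                at_line_start = pos == 0 or mm[pos - 1:pos] == b"\n"
                token_ends = end == len(mm) or mm[end:end + 1].isspace()
                if at_line_start and token_ends:
                    mm.seek(pos)
                    return [mm.readline().decode("ascii").rstrip("\r\n")
                            for _ in range(LINES_PER_SET)]
                pos = mm.find(needle, pos + 1)
    return None  # ID not present
```

Each lookup is still a linear search through the mapped bytes, so this
mostly saves code, not work.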

"Several thousand" entries is not much these days. If a line is 80
characters, 5000 entries of 10 lines each come to roughly 4MB of text.
The time to move this from disk to a Python list of 9-tuples of strings
would be dominated by disk I/O.

I think he should try to do it the dumb way first: read everything
into memory once.
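
Something along these lines, as a sketch only. It keys a dict by ID
instead of building a bare list, assumes the ID is the first
whitespace-separated token of each header (the thread does not say),
and expects every data set to be exactly 10 lines. load_data_sets is a
made-up name.

```python
LINES_PER_SET = 10  # header + 9 data lines


def load_data_sets(path):
    """Read the whole file and return {ID: 9-tuple of data lines}."""
    with open(path, encoding="ascii") as f:
        lines = [line.rstrip("\n") for line in f]

    data_sets = {}
    # Walk the file in blocks of 10 lines; an incomplete trailing block
    # is ignored.
    for start in range(0, len(lines) - LINES_PER_SET + 1, LINES_PER_SET):
        header = lines[start]
        # ASSUMPTION: the ID is the first token of the header line.
        set_id = header.split()[0]
        data_sets[set_id] = tuple(lines[start + 1:start + LINES_PER_SET])
    return data_sets
```

After the one-time load, data_sets[some_id] is an ordinary dict lookup.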

/Jorgen

-- 
  // Jorgen Grahn <grahn@  Oo  o.   .     .
\X/     snipabacken.se>   O  o   .


