What strategy for random accession of records in massive FASTA file?

Fri Jan 14 18:33:24 EST 2005

In article <1105569967.129284.85470 at c13g2000cwb.googlegroups.com>,
 "Chris Lasher" <chris.lasher at gmail.com> wrote:

> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to
> access records in a random order. The FASTA format is a standard format
> for storing molecular biological sequences. Each record contains a
> header line for describing the sequence that begins with a '>'
> (right-angle bracket) followed by lines that contain the actual
> sequence data. Three example FASTA records are below:
> 
> >CW127_A01
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACAT
> >CW127_A02
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG
> >CW127_A03
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACATTCCGCCTGGG
> ...
> 
> Since the file I'm working with contains tens of thousands of these
> records, I believe I need to find a way to hash this file such that I
> can retrieve the respective sequence more quickly than I could by
> parsing through the file request-by-request.

First, before embarking on any major project, take a look at 
http://www.biopython.org/ to at least familiarize yourself with what 
other people have done in the field.

The easiest thing, I think, would be to use the gdbm module.  You can 
write a simple parser for the FASTA file (or, I would imagine, find 
one already written in Biopython), and then store the data in a gdbm 
map, using the tag lines as the keys and the sequences as the values.
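
A minimal sketch of that approach, assuming Python 3, where the gdbm
module lives at dbm.gnu.  To stay runnable on builds without gdbm, it
goes through the generic dbm.open(), which falls back to whatever
backend is installed.  The function names (parse_fasta, build_index,
fetch) are made up for illustration, and the parser assumes plain ASCII
records with no handling of malformed input.

```python
import dbm


def parse_fasta(lines):
    """Yield (tag, sequence) pairs from an iterable of FASTA lines."""
    tag, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if tag is not None:
                yield tag, "".join(chunks)
            tag, chunks = line[1:], []
        elif tag is not None:
            chunks.append(line)
    if tag is not None:
        yield tag, "".join(chunks)


def build_index(fasta_path, db_path):
    """Store every record in a dbm file, keyed by its tag line."""
    # "n" always creates a new, empty database.
    with open(fasta_path) as fasta, dbm.open(db_path, "n") as db:
        for tag, seq in parse_fasta(fasta):
            # dbm stores bytes, so encode explicitly.
            db[tag.encode("ascii")] = seq.encode("ascii")


def fetch(db_path, tag):
    """Look up one sequence by tag without rescanning the FASTA file."""
    with dbm.open(db_path, "r") as db:
        return db[tag.encode("ascii")].decode("ascii")
```

You would run build_index("seqs.fasta", "seqs.db") once (both file
names are placeholders), and after that fetch("seqs.db", "CW127_A02")
returns that record's sequence with the line breaks removed.  A missing
tag raises KeyError, the same as a dictionary.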

Even for a Python neophyte, this should be a pretty simple project.  The 
most complex part might be getting the gdbm module built if your copy of 
Python doesn't already have it, but gdbm is so convenient that it's worth 
the effort.