What strategy for random accession of records in massive FASTA file?

Larry Bates lbates at syscononline.com
Wed Jan 12 18:35:39 EST 2005


You don't say how this will be used, but here goes:

1) Read the records and put into dictionary with key
of sequence (from header) and data being the sequence
data.  Use shelve to store the dictionary for subsequent
runs (if load time is excessive).

2) Take a look at Gadfly (gadfly.sourceforge.net).  It
provides you with Python SQL-like database and may be
better solution if data is basically static and you
do lots of processing.

All depends on how you use the data.

Regards,
Larry Bates
Syscon, Inc.

Chris Lasher wrote:
> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to
> access records in a random order. The FASTA format is a standard format
> for storing molecular biological sequences. Each record contains a
> header line for describing the sequence that begins with a '>'
> (right-angle bracket) followed by lines that contain the actual
> sequence data. Three example FASTA records are below:
> 
> 
>>CW127_A01
> 
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACAT
> 
>>CW127_A02
> 
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG
> 
>>CW127_A03
> 
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACATTCCGCCTGGG
> ...
> 
> Since the file I'm working with contains tens of thousands of these
> records, I believe I need to find a way to hash this file such that I
> can retrieve the respective sequence more quickly than I could by
> parsing through the file request-by-request. However, I'm very new to
> Python and am still very low on the learning curve for programming and
> algorithms in general; while I'm certain there are ubiquitous
> algorithms for this type of problem, I don't know what they are or
> where to look for them. So I turn to the gurus and accost you for help
> once again. :-) If you could help me figure out how to code a solution
> that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
> keep it in Python only, even though I know interaction with a
> relational database would provide the fastest method--the group I'm
> trying to write this for does not have access to a RDBMS.)
> Thanks very much in advance,
> Chris
> 



More information about the Python-list mailing list