What strategy for random accession of records in massive FASTA file?

Chris Lasher chris.lasher at gmail.com
Wed Jan 12 17:46:07 EST 2005


Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the sequence that begins with a '>'
(right-angle bracket) followed by lines that contain the actual
sequence data. Three example FASTA records are below:

>CW127_A01
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
GCATTAAACAT
>CW127_A02
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG
>CW127_A03
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
GCATTAAACATTCCGCCTGGG
...

Since the file I'm working with contains tens of thousands of these
records, I believe I need to find a way to hash this file such that I
can retrieve the respective sequence more quickly than I could by
parsing through the file request-by-request. However, I'm very new to
Python and am still very low on the learning curve for programming and
algorithms in general; while I'm certain there are ubiquitous
algorithms for this type of problem, I don't know what they are or
where to look for them. So I turn to the gurus and accost you for help
once again. :-) If you could help me figure out how to code a solution
that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
keep it in Python only, even though I know interaction with a
relational database would provide the fastest method--the group I'm
trying to write this for does not have access to a RDBMS.)
Thanks very much in advance,
Chris




More information about the Python-list mailing list