What strategy for random accession of records in massive FASTA file?
Roy Smith
roy at panix.com
Fri Jan 14 18:33:24 EST 2005
In article <1105569967.129284.85470 at c13g2000cwb.googlegroups.com>,
"Chris Lasher" <chris.lasher at gmail.com> wrote:
> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to
> access records in a random order. The FASTA format is a standard format
> for storing molecular biological sequences. Each record contains a
> header line for describing the sequence that begins with a '>'
> (right-angle bracket) followed by lines that contain the actual
> sequence data. Three example FASTA records are below:
>
> >CW127_A01
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACAT
> >CW127_A02
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG
> >CW127_A03
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACATTCCGCCTGGG
> ...
>
> Since the file I'm working with contains tens of thousands of these
> records, I believe I need to find a way to hash this file such that I
> can retrieve the respective sequence more quickly than I could by
> parsing through the file request-by-request.
First, before embarking on any major project, take a look at
http://www.biopython.org/ to at least familiarize yourself with what
other people have done in the field.
The easiest thing I think would be to use the gdbm module. You can
write a simple parser to parse the FASTA file (or, I would imagine, find
one already written on biopython), and then store the data in a gdbm
map, using the tag lines as the keys and the sequences as the values.
Even for a Python neophyte, this should be a pretty simple project. The
most complex part might be getting the gdbm module built if your copy of
Python doesn't already have it, but gdbm is so convenient, it's worth
the effort.
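A rough sketch of that approach follows. It assumes Python 3, where the
generic dbm package stands in for the gdbm module (dbm picks gdbm when it is
available and falls back to another backend otherwise). The function names
parse_fasta, build_index and fetch are made up for illustration, and the
parser is deliberately minimal, so treat it as a starting point rather than
a finished tool.

```python
import dbm


def parse_fasta(lines):
    """Yield (tag, sequence) pairs from an iterable of FASTA lines."""
    tag, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if tag is not None:
                yield tag, "".join(chunks)
            tag, chunks = line[1:], []
        elif tag is not None:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if tag is not None:
        yield tag, "".join(chunks)


def build_index(fasta_path, db_path):
    """Store every record in a dbm file, keyed by its tag line."""
    # Flag "n" always creates a new, empty database.
    with open(fasta_path) as fasta, dbm.open(db_path, "n") as db:
        for tag, seq in parse_fasta(fasta):
            # dbm stores bytes, so encode keys and values explicitly.
            db[tag.encode("ascii")] = seq.encode("ascii")


def fetch(db_path, tag):
    """Return the sequence stored under the given tag line."""
    with dbm.open(db_path, "r") as db:
        return db[tag.encode("ascii")].decode("ascii")
```

You would run build_index once over the FASTA file, then call fetch with a
tag such as "CW127_A02" whenever you need a record. Each lookup goes through
the dbm file rather than rescanning the FASTA file. Note that a duplicate tag
line silently overwrites the earlier record in this sketch.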