What strategy for random accession of records in massive FASTA file?

Wed Jan 12 23:13:11 EST 2005

On 12 Jan 2005 14:46:07 -0800, "Chris Lasher" <chris.lasher at gmail.com> wrote:

>Hello,
>I have a rather large (100+ MB) FASTA file from which I need to
>access records in a random order. The FASTA format is a standard format
>for storing molecular biological sequences. Each record contains a
>header line for describing the sequence that begins with a '>'
>(right-angle bracket) followed by lines that contain the actual
>sequence data. Three example FASTA records are below:
>
Others have probably solved your basic problem, or pointed
the way. I'm just curious.

Given that the information content is 2 bits per character
but each character takes up 8 bits of storage, there must be a
good reason for storing and/or transmitting sequences this way?
I.e., it is easy to think up a count-prefixed compressed format
packing 4:1 in subsequent data bytes (except for the last byte,
which may have fewer than four 2-bit codes).
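
To make the idea concrete, here is a minimal sketch of such a
count-prefixed format. The names pack and unpack, the 4-byte
big-endian count, and the A/C/G/T-to-code mapping are all
assumptions made up for illustration, not any existing standard.
It also assumes the sequence contains only A, C, G and T; real
FASTA data can contain other letters (e.g. N), which this sketch
rejects with a KeyError.

```python
import struct

# Assumed mapping: any fixed assignment of the four bases to 2-bit codes works.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Return a 4-byte big-endian base count followed by the bases, 4 per byte."""
    out = bytearray(struct.pack(">I", len(seq)))
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODES[ch]
        # Left-align a short final chunk so unpack always reads from the high bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data):
    """Inverse of pack: read the count, then decode that many 2-bit codes."""
    (count,) = struct.unpack(">I", data[:4])
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    # The count prefix tells us how many codes in the last byte are padding.
    return "".join(bases[:count])
```

For example, unpack(pack("GCATTAAACAT")) gives back the original
11 bases, stored in 3 data bytes plus the 4-byte count.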

I'm wondering how the data is actually used once records are
retrieved. (but I'm too lazy to explore the biopython.org link).

>>CW127_A01
>TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
>TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
>GCATTAAACAT
>>CW127_A02
>TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
>TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
>GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG
>>CW127_A03
>TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
>TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
>GCATTAAACATTCCGCCTGGG
>...

Regards,
Bengt Richter