What strategy for random accession of records in massive FASTA file?

Thu Jan 13 11:14:01 EST 2005

>Others have probably solved your basic problem, or pointed
>the way. I'm just curious.

>Given that the information content is 2 bits per character
>that is taking up 8 bits of storage, there must be a good reason
>for storing and/or transmitting them this way? I.e., it is easy
>to think up a count-prefixed compressed format packing 4:1 in
>subsequent data bytes (except for the last byte, which may hold
>fewer than four 2-bit codes).

My guess is that the storage inefficiency comes from the format being
human-readable, and from the fact that most in-silico molecular biology
is just a bunch of fancy string algorithms. That is my limited view of
these things, at least.
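
For what it's worth, the count-prefixed 4:1 packing suggested in the quote could look roughly like the sketch below. This is only an illustration under my own assumptions: the names pack and unpack, the 4-byte big-endian length prefix, and the A/C/G/T to 0..3 mapping are all made up here and are not any standard format. It also assumes the sequence contains only A, C, G and T; real FASTA records often carry N and other ambiguity codes, which do not fit in 2 bits and which this sketch rejects with a KeyError.

```python
import struct

# Hypothetical mapping of the four bases onto 2-bit codes.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an A/C/G/T string into a 4-byte count plus 4 bases per byte."""
    out = bytearray(struct.pack(">I", len(seq)))  # count prefix, big-endian
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODES[ch]  # KeyError on anything but ACGT
        # Left-align a final chunk that holds fewer than four bases.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data):
    """Inverse of pack(): read the count, then decode 2 bits at a time."""
    (count,) = struct.unpack(">I", data[:4])
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    # The count prefix tells us how many codes in the last byte are real.
    return "".join(bases[:count])


if __name__ == "__main__":
    packed = pack("ACGTAC")
    print(len(packed), unpack(packed))
```

The count prefix is what makes the partial last byte unambiguous: without it, padding bits would decode as extra A's.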

>I'm wondering how the data is actually used once records are
>retrieved.

This one I can answer. For my purposes, I'm just organizing the
sequences at hand, but there are all sorts of things one could actually
do with sequences: alignments, BLAST searches, gene annotations, etc.