What strategy for random accession of records in massive FASTA file?

Thu Jan 13 19:39:33 EST 2005

Chris Lasher wrote:

>>And besides, for long-term archiving purposes, I'd expect that zip et
>>al on a character-stream would provide significantly better
>>compression than a 4:1 packed format, and that zipping the packed
>>format wouldn't be all that much more efficient than zipping the
>>character stream.
> 
> This 105MB FASTA file is 8.3 MB gzip-ed.

And a 4:1 packed-format file would be ~26MB (105MB / 4), so gzip on 
the plain character stream already comes out about three times 
smaller.  It'd be interesting to see how that packed-format file 
would compress, but I don't care enough to write a script to convert 
the FASTA file into a packed-format file to experiment with... ;)
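
A minimal sketch of that experiment, for anyone who does care enough. 
It assumes the sequence data has already been pulled out of the FASTA 
file as a plain string containing only A, C, G and T (no header 
lines, no newlines, no N or other ambiguity codes). The 2-bit codes 
and the function names pack_4to1 and compare_sizes are made up for 
illustration, and zlib stands in for gzip:

```python
import zlib

# Hypothetical 2-bit encoding; any fixed mapping of the four bases works.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_4to1(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so every byte has the same layout.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def compare_sizes(seq):
    """Return byte counts for the raw and packed forms, before and after zlib."""
    raw = seq.encode("ascii")
    packed = pack_4to1(seq)
    return {
        "raw": len(raw),
        "packed": len(packed),
        "raw_zlib": len(zlib.compress(raw, 9)),
        "packed_zlib": len(zlib.compress(packed, 9)),
    }
```

Calling compare_sizes() on the real sequence data would show whether 
compressing the packed form buys anything over compressing the text. 
The numbers depend entirely on the input, so none are claimed here.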

Short version, then, is that yes, size concerns (such as they may be) 
are outweighed by speed and conceptual simplicity (i.e. avoiding a 
huge mess of bit-masking every time a single base needs to be 
examined, or a human-(semi-)readable display is needed).
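
To make the bit-masking concrete, here is a rough sketch of reading 
and writing one base in a 4:1 packed buffer. The layout (first base 
in the two highest bits of each byte) and the names base_at and 
set_base are assumptions for illustration only. With a plain string, 
the equivalent read is just seq[i]:

```python
BASES = "ACGT"  # hypothetical 2-bit codes: A=0, C=1, G=2, T=3

def base_at(packed, i):
    """Return base number i from a bytes/bytearray holding four bases per byte."""
    byte = packed[i // 4]            # which byte holds the base
    shift = 2 * (3 - i % 4)          # where its two bits sit in that byte
    return BASES[(byte >> shift) & 0b11]

def set_base(packed, i, base):
    """Overwrite base number i in a bytearray, leaving its neighbours intact."""
    shift = 2 * (3 - i % 4)
    cleared = packed[i // 4] & ~(0b11 << shift) & 0xFF  # zero the old two bits
    packed[i // 4] = cleared | (BASES.index(base) << shift)
```

Every access pays for a division, a shift and a mask, and every 
display of the sequence has to run base_at over the whole buffer.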

(Plus, if this format might be used for RNA sequences as well as DNA 
sequences, you've got at least a fifth base to represent (U, 
alongside A, C, G and T), which means you need at least three bits 
per base, which means only two bases per byte (or else base-encodings 
split across byte-boundaries)... That gets ugly real fast.)
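
The arithmetic behind that, as a small sketch. The function names are 
hypothetical, and the dense layout assumed in straddling_bases (base 
i occupying bits 3*i through 3*i + 2 of the stream) is just one 
possible way of splitting encodings across bytes:

```python
def bits_per_symbol(alphabet_size):
    """Smallest whole number of bits that can distinguish alphabet_size symbols."""
    return max(1, (alphabet_size - 1).bit_length())

def symbols_per_byte(alphabet_size):
    """How many whole symbols fit in one byte without crossing a byte boundary."""
    return 8 // bits_per_symbol(alphabet_size)

def straddling_bases(n_bases, bits=3):
    """Indices of bases whose bits span two bytes when packed back to back."""
    return [i for i in range(n_bases)
            if (bits * i) // 8 != (bits * i + bits - 1) // 8]
```

Four symbols need 2 bits each and fit four to a byte; five symbols 
need 3 bits each and fit only two to a byte, with two bits wasted. 
Packing the 3-bit codes back to back avoids the waste, but then some 
bases (indices 2 and 5 in every group of eight) straddle a byte 
boundary.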

Jeff Shannon
Technician/Programmer
Credit International