What strategy for random accession of records in massive FASTA file?

Fri Jan 14 17:48:43 EST 2005

Jeff Shannon wrote:

> Chris Lasher wrote:
> 
>>> And besides, for long-term archiving purposes, I'd expect that zip et
>>> al on a character-stream would provide significantly better
>>> compression than a 4:1 packed format, and that zipping the packed
>>> format wouldn't be all that much more efficient than zipping the
>>> character stream.
>>
>>
>> This 105 MB FASTA file is 8.3 MB when gzipped.
> 
> 
> And a 4:1 packed-format file would be ~26MB.  It'd be interesting to see 
> how that packed-format file would compress, but I don't care enough to 
> write a script to convert the FASTA file into a packed-format file to 
> experiment with... ;)
> 
If your compression algorithm is any good, then both files should 
compress to approximately the same size, since the compressed size should 
be determined by the information content rather than the representation.
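
For anyone who does care enough to try it, here is a minimal sketch of the experiment. The names (pack, unpack, compare) and the 2-bit codes A=0, C=1, G=2, T=3 are my own assumptions, not any standard packed format, and the input is a random sequence rather than a real FASTA file. Random data has almost no redundancy beyond the four-letter alphabet, so real genomic data may well behave differently.

```python
import random
import zlib

# Hypothetical 2-bit encoding, chosen only for illustration.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].ljust(4, "A")  # pad the final byte with 'A'
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        out.append(byte)
    return bytes(out)


def unpack(packed, length):
    """Inverse of pack(); length is the original number of bases."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def compare(seq):
    """Return raw and zlib-compressed sizes for both representations."""
    text = seq.encode("ascii")
    packed = pack(seq)
    return {
        "text": len(text),
        "packed": len(packed),
        "text_zlib": len(zlib.compress(text, 9)),
        "packed_zlib": len(zlib.compress(packed, 9)),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    seq = "".join(rng.choice(BASES) for _ in range(100_000))
    print(compare(seq))
```

Running it on a real sequence (headers and line breaks stripped first) would show how close the two compressed sizes actually come.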

> Short version, then, is that yes, size concerns (such as they may be) 
> are outweighed by speed and conceptual simplicity (i.e. avoiding a huge 
> mess of bit-masking every time a single base needs to be examined, or a 
> human-(semi-)readable display is needed).
> 
> (Plus, if this format might be used for RNA sequences as well as DNA 
> sequences, you've got at least a fifth base to represent, which means 
> you need at least three bits per base, which means only two bases per 
> byte (or else base-encodings split across byte-boundaries).... That gets 
> ugly real fast.)
> 
Right!
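
To make the bit-masking point concrete, here is a rough sketch of what reading one base looks like in each representation. The function base_at and the 2-bit codes (A=0, C=1, G=2, T=3, first base in the high bits) are assumptions for illustration only.

```python
BASES = "ACGT"  # hypothetical 2-bit codes: A=0, C=1, G=2, T=3


def base_at(packed, i):
    """Return base number i from a 4-bases-per-byte packed sequence."""
    byte = packed[i // 4]        # which byte holds the base
    shift = 6 - 2 * (i % 4)      # first base sits in the two high bits
    return BASES[(byte >> shift) & 0b11]


# "ACGTTGCA" packed by hand: 00 01 10 11 and 11 10 01 00
packed = bytes([0b00011011, 0b11100100])
text = "ACGTTGCA"

# The character stream needs only an index; the packed form needs
# a division, a shift and a mask for every single lookup.
assert all(base_at(packed, i) == text[i] for i in range(len(text)))

# With a five-symbol alphabet each base needs 3 bits, so only
# 8 // 3 == 2 whole bases fit in a byte without straddling bytes.
bases_per_byte = 8 // 3
```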

regards
  Steve
-- 
Steve Holden               http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
Holden Web LLC      +1 703 861 4237  +1 800 494 3119