What strategy for random accession of records in massive FASTA file?
Jeff Shannon
jeff at ccvcorp.com
Thu Jan 13 19:39:33 EST 2005
Chris Lasher wrote:
>>And besides, for long-term archiving purposes, I'd expect that zip et
>>al on a character-stream would provide significantly better
>>compression than a 4:1 packed format, and that zipping the packed
>>format wouldn't be all that much more efficient than zipping the
>>character stream.
>
> This 105MB FASTA file is 8.3 MB gzip-ed.
And a 4:1 packed-format file would be ~26 MB (105 MB / 4), roughly
three times the size of the gzipped text. It'd be interesting to
see how that packed-format file would compress, but I don't care
enough to write a script to convert the FASTA file into a
packed-format file to experiment with... ;)
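For anyone who does want to try it, a minimal sketch of such an experiment could look like the following. It assumes sequences containing only A, C, G and T (no header lines, no ambiguity codes), and the code assignment and the names pack_2bit and compare_sizes are made up for illustration. It uses zlib, the same deflate algorithm gzip uses.

```python
import zlib

# Hypothetical 2-bit codes; any fixed assignment would work.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_2bit(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so the first base
        # always sits in the two highest bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def compare_sizes(seq):
    """Return byte counts for the text and packed forms, raw and deflated."""
    text = seq.encode("ascii")
    packed = pack_2bit(seq)
    return {
        "text": len(text),
        "packed": len(packed),
        "text_zlib": len(zlib.compress(text, 9)),
        "packed_zlib": len(zlib.compress(packed, 9)),
    }
```

Calling compare_sizes() on the sequence data from a real file (with the header lines and newlines stripped out first) would show whether compressing the packed form gains anything over compressing the plain characters. The result depends on the data, so try it on the actual file rather than on a synthetic sequence.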
Short version, then, is that yes, size concerns (such as they may be)
are outweighed by speed and conceptual simplicity (i.e. avoiding a
huge mess of bit-masking every time a single base needs to be
examined, or a human-(semi-)readable display is needed).
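To make the bit-masking point concrete, here is a rough sketch of what reading one base out of a 4:1 packed buffer involves. The layout (first base in the two highest bits of each byte, codes A=0, C=1, G=2, T=3) and the function names are assumptions chosen for the example, not part of any existing format.

```python
BASES = "ACGT"  # hypothetical code order: A=0, C=1, G=2, T=3

def get_base(packed, i):
    """Return base number i from a 2-bit packed bytes object."""
    byte = packed[i // 4]          # which byte holds the base
    shift = 6 - 2 * (i % 4)        # which 2-bit slot inside that byte
    return BASES[(byte >> shift) & 0b11]

def unpack(packed, n_bases):
    """Rebuild the readable string; n_bases must be stored separately,
    since the last byte may contain padding."""
    return "".join(get_base(packed, i) for i in range(n_bases))
```

With a plain character stream, the equivalent of get_base(packed, i) is just seq[i], and the equivalent of unpack() is nothing at all.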
(Plus, if this format might be used for RNA sequences as well as DNA
sequences, you've got at least a fifth base to represent, which means
you need at least three bits per base, which means only two bases per
byte (or else base-encodings split across byte-boundaries).... That
gets ugly real fast.)
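The arithmetic behind that can be checked in a couple of lines. This is only a sketch; the function names are invented, and the five-symbol case stands for an alphabet such as A, C, G, T plus U.

```python
def bits_per_symbol(alphabet_size):
    """Smallest number of bits that can distinguish alphabet_size symbols."""
    if alphabet_size < 1:
        raise ValueError("alphabet must have at least one symbol")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1;
    # max() keeps a one-symbol alphabet at one bit.
    return max(1, (alphabet_size - 1).bit_length())

def symbols_per_byte(alphabet_size):
    """Whole symbols that fit in 8 bits without crossing a byte boundary."""
    return 8 // bits_per_symbol(alphabet_size)
```

For four symbols this gives 2 bits and 4 per byte; for five symbols it gives 3 bits and 2 per byte, with 2 bits of every byte left unused.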
Jeff Shannon
Technician/Programmer
Credit International