What strategy for random accession of records in massive FASTA file?
Steve Holden
steve at holdenweb.com
Fri Jan 14 17:46:31 EST 2005
Bengt Richter wrote:
> On 12 Jan 2005 14:46:07 -0800, "Chris Lasher" <chris.lasher at gmail.com> wrote:
>
[...]
> Others have probably solved your basic problem, or pointed
> the way. I'm just curious.
>
> Given that the information content is 2 bits per character
> that is taking up 8 bits of storage, there must be a good reason
> for storing and/or transmitting them this way? I.e., it is easy
> to think up a count-prefixed compressed format packing 4:1 in
> subsequent data bytes (except for the last byte, which may have
> less than 4 2-bit codes).
>
> I'm wondering how the data is actually used once records are
> retrieved. (but I'm too lazy to explore the biopython.org link).
>
Revealingly honest.
Of course, adopting an encoding that only used two bits per base would
make it impossible to use the re module to search the packed data for patterns,
for example. So the work of continuously translating between
representations might militate against more efficient representations.
Or, of course, it might not :-)
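
To make the trade-off concrete, here is a minimal sketch of the kind of
count-prefixed 4:1 packing Bengt describes. It is only an illustration, not
an existing format: the 4-byte big-endian count, the code table and the
names pack/unpack are all assumptions of mine, and it assumes records
contain nothing but the letters A, C, G and T (no ambiguity codes such as N,
no lower case).

```python
import re
import struct

# Hypothetical 2-bit code table; assumes only unambiguous bases A, C, G, T.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Return a 4-byte big-endian base count followed by 4 bases per byte."""
    out = bytearray(struct.pack(">I", len(seq)))
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so every byte decodes the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data):
    """Invert pack(): read the count, then expand each byte to 4 bases."""
    (count,) = struct.unpack(">I", data[:4])
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    # The count prefix tells us how many codes in the last byte are padding.
    return "".join(bases[:count])


seq = "GATTACAGAT"
packed = pack(seq)

# The packed form cannot be handed to re as-is, so decode first.
match = re.search("TTAC", unpack(packed))
```

Ten bases take 7 bytes here (4 for the count, 3 for the data) instead of 10,
but every search pays for an unpack() first, which is the translation cost
I mean.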
it's-only-storage-ly y'rs - steve
--
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/
Holden Web LLC +1 703 861 4237 +1 800 494 3119