What strategy for random accession of records in massive FASTA file?

Fri Jan 14 17:46:31 EST 2005

Bengt Richter wrote:

> On 12 Jan 2005 14:46:07 -0800, "Chris Lasher" <chris.lasher at gmail.com> wrote:
> 
[...]
> Others have probably solved your basic problem, or pointed
> the way. I'm just curious.
> 
> Given that the information content is 2 bits per character
> that is taking up 8 bits of storage, there must be a good reason
> for storing and/or transmitting them this way? I.e., it is easy
> to think up a count-prefixed compressed format packing 4:1 in
> subsequent data bytes (except for the last byte, which may have
> fewer than 4 2-bit codes).
> 
> I'm wondering how the data is actually used once records are
> retrieved. (but I'm too lazy to explore the biopython.org link).
> 
Revealingly honest.

Of course, adopting an encoding that used only two bits per base would
make it impossible to use the re module to search the packed sequences
directly for patterns, for example. So the work of continuously
translating between representations might militate against more
efficient representations.
Or, of course, it might not :-)
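
To make the trade-off concrete, here is a minimal sketch of the kind of
count-prefixed 4:1 packing described in the quoted text. It is not any
established format: the names pack and unpack, the 4-byte big-endian count
prefix, and the A/C/G/T-to-code mapping are all assumptions made for
illustration, and it assumes the sequence holds only upper-case A, C, G and
T (real FASTA records can also carry ambiguity codes such as N, which need
more than two bits).

```python
import re
import struct

# Hypothetical 2-bit code assignment; any fixed mapping would do.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an ACGT-only string as a 4-byte base count plus 4 bases per byte."""
    out = bytearray(struct.pack(">I", len(seq)))  # count prefix
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk; the count says how much is real.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data):
    """Reverse pack(): read the count, then expand each byte to 4 bases."""
    (count,) = struct.unpack(">I", data[:4])
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:count])  # drop padding from the last byte


packed = pack("GATTACA")
# The re module wants the one-character-per-base form, so we translate
# back before searching.
match = re.search("TTA", unpack(packed))
```

Seven bases fit in two data bytes after the count, but the pattern search
only works once unpack() has rebuilt the text, which is the translation
cost in question.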

it's-only-storage-ly y'rs  - steve
-- 
Steve Holden               http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
Holden Web LLC      +1 703 861 4237  +1 800 494 3119