What strategy for random accession of records in massive FASTA file?

John Lenton john at grulic.org.ar
Fri Jan 14 01:23:18 EST 2005


On Thu, Jan 13, 2005 at 12:19:49AM +0100, Fredrik Lundh wrote:
> Chris Lasher wrote:
> 
> > Since the file I'm working with contains tens of thousands of these
> > records, I believe I need to find a way to hash this file such that I
> > can retrieve the respective sequence more quickly than I could by
> > parsing through the file request-by-request. However, I'm very new to
> > Python and am still very low on the learning curve for programming and
> > algorithms in general; while I'm certain there are ubiquitous
> > algorithms for this type of problem, I don't know what they are or
> > where to look for them. So I turn to the gurus and accost you for help
> > once again. :-) If you could help me figure out how to code a solution
> > that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
> > keep it in Python only, even though I know interaction with a
> > relational database would provide the fastest method--the group I'm
> > trying to write this for does not have access to a RDBMS.)
> 
> keeping an index in memory might be reasonable.  the following class
> creates an index file by scanning the FASTA file, and uses the "marshal"
> module to save it to disk.  if the index file already exists, it's used as is.
> to regenerate the index, just remove the index file, and run the program
> again.

the problem caught my interest, and the way you used a non-mmaped file
in a place where using mmap was pretty much obvious (IMVHO) bothered
me enough to overcome my laziness. It didn't overcome it by *much*,
mind you, so the following probably only works on python 2.3, on
Linux, in Argentina, and with FASTA data that looks like the sample I
was able to download.

However, having said that, the sample I downloaded was one 46MiB file,
and reading it in on my notebook was fast enough that I ripped out the
saving/reloading of the index and just reindex it every time. Adding
back the persistant index is trivial. [ time passes ] In fact, I just
found a several-gigabyte fasta file, and I guess you'd want the index
for that one; I put the code back in. And now I should really go to
bed, because this is very interesting but won't pay the bills.

import os, mmap, marshal
from UserDict import DictMixin

class FASTA(DictMixin):
    def __init__(self, filename):
        self.file = f = open(filename)
        fno = f.fileno()
        stat = os.fstat(fno)
        self.map = mmap.mmap(fno, stat.st_size, access=mmap.ACCESS_COPY)
        self.checkindex()

    def __getitem__(self, key):
        p0, pf = self.index[key]
        m = self.map
        return m[p0:pf]

    def keys(self):
        return self.index.keys()
    def __contains__(self, key):
        return self.index.__contains__(key)

    def checkindex(self):
        indexfile = self.file.name + ".idx"
        if os.path.exists(indexfile):
            # and os.stat(indexfile).st_mtime > os.stat(self.file.name).st_mtime:
            self.index = marshal.load(open(indexfile, "rb"))
        else:
            print 'generating index...'
            self.genindex()
            marshal.dump(self.index, open(indexfile, "wb"))
            print 'done.'

    def genindex(self):
        index = {}
        m = self.map
        last = None
        while 1:
            pos = m.find('>')
            if last is not None:
                index[last] = (index[last], pos)
            if pos == -1:
                break
            m.seek(pos)
            line = m.readline()
            pos = m.tell()
            # this is the bit that probably only works with FASTA
            # files like I was able to find on the 'net.
            sep = line.index(' ')
            if sep == -1:
                name = line[1:].strip()
            else:
                name = line[1:sep].strip()
            index[name] = pos
            last = name
        self.index = index

db = FASTA("/home/john/tmp/uniprot_sprot.fasta")
print db["104K_THEPA"]


-- 
John Lenton (john at grulic.org.ar) -- Random fortune:
"Those who believe in astrology are living in houses with foundations of
Silly Putty."
-  Dennis Rawlins, astronomer
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 196 bytes
Desc: Digital signature
URL: <http://mail.python.org/pipermail/python-list/attachments/20050114/400881bb/attachment.sig>


More information about the Python-list mailing list