What strategy for random accession of records in massive FASTA file?

Fredrik Lundh fredrik at pythonware.com
Wed Jan 12 18:19:49 EST 2005


Chris Lasher wrote:

> Since the file I'm working with contains tens of thousands of these
> records, I believe I need to find a way to hash this file such that I
> can retrieve the respective sequence more quickly than I could by
> parsing through the file request-by-request. However, I'm very new to
> Python and am still very low on the learning curve for programming and
> algorithms in general; while I'm certain there are ubiquitous
> algorithms for this type of problem, I don't know what they are or
> where to look for them. So I turn to the gurus and accost you for help
> once again. :-) If you could help me figure out how to code a solution
> that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
> keep it in Python only, even though I know interaction with a
> relational database would provide the fastest method--the group I'm
> trying to write this for does not have access to a RDBMS.)

keeping an index in memory might be reasonable.  the following class
creates an index file by scanning the FASTA file, and uses the "marshal"
module to save it to disk.  if the index file already exists, it's used as is.
to regenerate the index, just remove the index file, and run the program
again.

import os, marshal

class FASTA:

  def __init__(self, file):
    self.file = open(file)
    self.checkindex()

  def __getitem__(self, key):
    try:
      pos = self.index[key]
    except KeyError:
      raise IndexError("no such item")
    else:
      f = self.file
      f.seek(pos)
      header = f.readline()
      # sanity check: the stored offset must point at a header line
      assert header.startswith(">")
      data = []
      while 1:
        line = f.readline()
        if not line or line[0] == ">":
          break
        data.append(line)
      return data

  def checkindex(self):
    indexfile = self.file.name + ".index"

    try:
      self.index = marshal.load(open(indexfile, "rb"))
    except IOError:
      print("building index...")

      index = {}

      # scan the file
      f = self.file
      f.seek(0)
      while 1:
        pos = f.tell()
        line = f.readline()
        if not line:
          break
        if line[0] == ">":
          # save offset to header line
          header = line[1:].strip()
          index[header] = pos

      # save index to disk
      f = open(indexfile, "wb")
      marshal.dump(index, f)
      f.close()

      self.index = index

db = FASTA("myfastafile.dat")
print(db["CW127_A02"])

['TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAG...
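
the lookup returns the raw lines of the record, newlines included.  if you
want the sequence as a single string instead, something like the following
might do.  this is only a sketch: the helper name "joinseq" and the sample
data are made up for illustration, and it assumes that each line holds
nothing but sequence characters and surrounding whitespace.

```python
def joinseq(lines):
    # strip the trailing newline (and any surrounding whitespace) from
    # each line, then glue the pieces together into one string
    return "".join(line.strip() for line in lines)

# made-up record, in the same shape as the list returned by db[key]
record = ["TGCAGTCGAACG\n", "AGAACGGTCC\n"]
sequence = joinseq(record)
```

with the class above, that would be used as joinseq(db["CW127_A02"]).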

tweak as necessary.
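
one possible tweak: since an existing index file is used as is, it goes
stale if the FASTA file changes.  a minimal sketch of a freshness check,
assuming that file modification times can be trusted on your system (the
helper name "index_is_stale" is hypothetical and not part of the class
above):

```python
import os

def index_is_stale(fastafile, indexfile):
    # hypothetical helper: true if the index file is older than the
    # data file, or if either file cannot be inspected
    try:
        return os.path.getmtime(indexfile) < os.path.getmtime(fastafile)
    except OSError:
        # a missing index file (or data file) ends up here
        return True
```

checkindex could call this before loading, and rebuild the index when it
returns true, instead of relying on someone removing the index file by hand.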

</F> 





