[Baypiggies] reading files quickly and efficiently

Wed Nov 17 21:40:21 CET 2010

I need to work on a file that is around 6.5 GB in size.  The file consists of
protein header lines, each followed by the corresponding protein sequence.
Here are a few sample lines from the file:

-----------
>gi|15674171|ref|NP_268346.1| 30S ribosomal protein S18 [Lactococcus lactis
subsp. lactis Il1403] gi|116513137|ref|YP_812044.1| 30S ribosomal protein
S18 [Lactococcus lactis subsp. cremoris SK11]
gi|125625229|ref|YP_001033712.1| 30S ribosomal protein S18 [Lactococcus
lactis subsp. cremoris MG1363] gi|281492845|ref|YP_003354825.1| 50S
ribosomal protein S18P [Lactococcus lactis subsp. lactis KF147]
gi|13878750|sp|Q9CDN0.1|RS18_LACLA RecName: Full=30S ribosomal protein S18
gi|122939895|sp|Q02VU1.1|RS18_LACLS RecName: Full=30S ribosomal protein S18
gi|166220956|sp|A2RNZ2.1|RS18_LACLM RecName: Full=30S ribosomal protein S18
gi|12725253|gb|AAK06287.1|AE006448_5 30S ribosomal protein S18 [Lactococcus
lactis subsp. lactis Il1403] gi|116108791|gb|ABJ73931.1| SSU ribosomal
protein S18P [Lactococcus lactis subsp. cremoris SK11]
gi|124494037|emb|CAL99037.1| 30S ribosomal protein S18 [Lactococcus lactis
subsp. cremoris MG1363] gi|281376497|gb|ADA65983.1| SSU ribosomal protein
S18P [Lactococcus lactis subsp. lactis KF147] gi|300072039|gb|ADJ61439.1|
30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris NZ9000]
MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQ
N
>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827
[Dictyostelium discoideum AX4] gi|1705556|sp|P54670.1|CAF1_DICDI RecName:
Full=Calfumirin-1; Short=CAF-1 gi|793761|dbj|BAA06266.1| calfumirin-1
[Dictyostelium discoideum] gi|60470106|gb|EAL68086.1| hypothetical protein
DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK
VQKLLNPDQ
>gi|66818355|ref|XP_642837.1| hypothetical protein DDB_G0276911
[Dictyostelium discoideum AX4] gi|60470987|gb|EAL68957.1| hypothetical
protein DDB_G0276911 [Dictyostelium discoideum AX4]
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYE
DFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKR

-----------
My problem is that I need to filter this file to extract the proteins I am
interested in, based on keywords matched against the header line. As a
preliminary step, I wrote the following code to count the total number of
lines in the file:

f = open ('nr')
count = 0
for i in f.readlines():
    line = f.next().strip()
    count = count + 1
f.close()
print count

When I run this program, I get the following error:

Traceback (most recent call last):
  File "C:\Users\K\Downloads\nr\nr.py", line 34, in <module>
    for i in f.readlines():
MemoryError
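
A likely explanation, offered as a sketch rather than a confirmed diagnosis:
f.readlines() builds a list of every line in the file before the loop starts,
so for a 6.5 GB file it can exhaust the memory available to the process.
Separately, calling f.next() inside the loop consumes additional lines, so the
count would be off even on a small file. Iterating over the file object itself
reads one line at a time. The function name count_lines below is hypothetical,
chosen only for illustration.

```python
def count_lines(path):
    """Count the lines in a file without holding the whole file in memory."""
    count = 0
    # Binary mode: nothing is decoded, so odd bytes in the headers cannot
    # raise an encoding error. Lines are still split on '\n'.
    with open(path, 'rb') as handle:
        # A file object is its own iterator and yields one line per step,
        # so only the current line (plus a read buffer) is in memory.
        for _ in handle:
            count += 1
    return count

# Example call, assuming the file is named 'nr' as in the post:
#     total = count_lines('nr')
```

Memory use should stay roughly constant regardless of file size, although the
run time still grows with the size of the file.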

A slightly modified version of the above program works fine for the first 10
or 100 or 1000 lines of the file nr:

----

Any suggestions on how I can work around this MemoryError problem?
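
For the filtering step described earlier, one possible approach is to stream
the file record by record and write out only the records whose header contains
a keyword. This is a minimal sketch under stated assumptions: each header is a
single line starting with '>' in the actual file (the sample above appears to
have been wrapped by the mail client), the text is ASCII-compatible, and the
match is a case-insensitive substring test. The names iter_records and
filter_records, and the output file name in the usage comment, are hypothetical.

```python
def iter_records(handle):
    """Yield (header, sequence) pairs from a FASTA-style file, one at a time."""
    header = None
    seq_parts = []
    for line in handle:
        line = line.rstrip('\r\n')
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, ''.join(seq_parts)
            header = line[1:]
            seq_parts = []
        elif header is not None and line:
            # Sequence lines are collected until the next header.
            seq_parts.append(line)
    # Emit the final record, which has no following header to close it.
    if header is not None:
        yield header, ''.join(seq_parts)


def filter_records(in_path, out_path, keywords):
    """Copy records whose header contains any keyword; return how many matched."""
    wanted = [k.lower() for k in keywords]
    kept = 0
    with open(in_path) as src, open(out_path, 'w') as dst:
        for header, seq in iter_records(src):
            lowered = header.lower()
            if any(k in lowered for k in wanted):
                # The sequence is written on one line, not re-wrapped.
                dst.write('>%s\n%s\n' % (header, seq))
                kept += 1
    return kept

# Example call, assuming the input file is named 'nr':
#     filter_records('nr', 'nr_filtered.txt', ['ribosomal'])
```

Only one record is held in memory at a time, so the size of the input file
should not matter as long as no single record is enormous.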