[Tutor] sorting a 2 gb file- i shrunk it and turned it around

Kent Johnson kent37 at tds.net
Wed Jan 26 13:01:20 CET 2005


My guess is that your file is small enough that Danny's two-pass approach will work. You might even 
be able to do it in one pass.

If you have enough RAM, here is a sketch of a one-pass solution:

# This will map each result to a list of queries that contain that result
results = {}
current_query = None   # in case a result line somehow appears before any query

# Iterate over the file building results ('results.txt' is just a
# placeholder for your actual file name)
for line in open('results.txt'):
    if isQueryLine(line):
        current_query = line
    elif isResultLine(line):
        key = getResultKey(line)
        results.setdefault(key, []).append(current_query) # see explanation below

# Now go through results looking for entries with more than one query
for key, queries in results.iteritems():
    if len(queries) > 1:
        print
        print key
        for query in queries:
            print query

You have to fill in the details of isQueryLine(), isResultLine() and getResultKey(); from your 
earlier posts I'm guessing you can figure them out.
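To get you started, here is a rough sketch of what those helpers might look
like, based on the line formats in your sample below. The patterns are my
guesses from the excerpts, so check them against your real file:

import re

# Query lines look like ENSE...|ENSG...|ENST... (guessed from the sample)
query_re = re.compile(r'ENSE\d+\.\d+\|ENSG\d+\.\d+\|ENST\d+\.\d+')

# For a result line, the key is everything up to (not including) 5'pad
result_re = re.compile(r"(hg17_\S+ range=\S+)")

def isQueryLine(line):
    return query_re.match(line) is not None

def isResultLine(line):
    return line.startswith('hg17_')

def getResultKey(line):
    # Keep just the "hg17_... range=..." part; drop everything after it
    return result_re.match(line).group(1)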

What is this:
   results.setdefault(key, []).append(current_query)

setdefault() is a handy method of a dict. It looks up the value corresponding to the given key. If 
there is no value, it *sets* the value to the given default, and returns that. So after
   results.setdefault(key, [])
results[key] will have a list in it, and we will have a reference to the list.

Then appending to that list adds the query to the list stored in the dict.
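For example, at the interactive prompt (with made-up keys and values):

 >>> results = {}
 >>> results.setdefault('chr14', []).append('query1')
 >>> results
{'chr14': ['query1']}
 >>> results.setdefault('chr14', []).append('query2')  # key exists, default ignored
 >>> results
{'chr14': ['query1', 'query2']}

The one-liner is just a shorter way to write this:

if key not in results:
    results[key] = []
results[key].append(current_query)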

Please let us know what solution you end up using, and how much memory it needs. I'm interested...

Kent

Scott Melnyk wrote:
> Thanks for the thoughts so far.  After posting I have been thinking
> about how to pare down the file (much of the info in the big file was
> not relevant to the question at hand).
> 
> After the first couple of responses I was even more motivated to
> shrink the file so as not to have to set up a db. This test will be run
> only now, and later once more to verify with another test set, so the
> db setup seemed like more work than it might be worth.
> 
> I was able to reduce my file down to about 160 MB in size by paring out
> every line not directly related to what I want, using some simple
> regular expressions and a couple of tests for inclusion.
> 
> The format, and what info is compared against what, is different from
> my original examples, as I believe this is clearer.
> 
> 
> My queries are named by lines such as:
> ENSE00001387275.1|ENSG00000187908.1|ENST00000339871.1
> ENSE is an exon, ENSG is the gene, ENST is a transcript.
> 
> They all have the above format; they differ only in the numbers
> following ENSE, ENSG or ENST.
> 
> Each query is for a different exon.  For background, each gene has many
> exons, and in this dataset there are different versions of which exons
> are in each gene.  These different collections are the transcripts,
> i.e. ENST00000339871.1
> 
> In short, a transcript is a version of a gene here:
> transcript 1 may be formed of exons a, b and c
> transcript 2 may contain exons a, b, d
> 
> 
> 
> The other lines (results) are of the format
> hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...    44  0.001
> hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...    44  0.001
> 
> "hg17_chainMm5_chr7_random range=chr10:124355404-124355687" is the
> important part here; everything from "5'pad" on is not important at
> this point.
> 
> 
> What I am trying to do now is make a list of any of the results that
> appear in more than one transcript.
> 
> ##########
> FILE SAMPLE:
> 
> This is the number 1  query tested.
> Results for scoring against Query=
> ENSE00001387275.1|ENSG00000187908.1|ENST00000339871.1
>  are: 
> 
> hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...    44  0.001
> hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...    44  0.001
> hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...    44  0.001
> hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...    44  0.001
> hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...    44  0.001
> hg17_chainMm5_chr7_random range=chr10:124355388-124355719 5'pad=...    44  0.001
> 
> ....
> 
> This is the number 3  query tested.
> Results for scoring against Query=
> ENSE00001365999.1|ENSG00000187908.1|ENST00000339871.1
>  are: 
> 
> hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...    60  2e-08
> hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...    60  2e-08
> hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...    60  2e-08
> hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...    60  2e-08
> 
> ##############
> 
> I would like to generate a file that lists any results (the
> hg17_etc  line) that occur in more than one transcript (from the query
> line ENSE00001365999.1|ENSG00000187908.1|ENST00000339871.1).
> 
> 
> So if
> hg17_chainMm5_chr7_random range=chr10:124355404-124355687 
> shows up again later in the file, I want to know and record where it is
> used more than once; otherwise I will ignore it.
> 
> I am thinking of another regular expression to capture the transcript
> id, followed by something that captures each of the results, writes to
> another file anytime a result appears more than once, and ties the
> transcript ids to them somehow.
> 
> Any suggestions?
> I agree that if I had more time and was going to be doing more of this,
> the DB is the way to go.
> As an aside, I have not looked into sqlite; I am hoping to avoid the db
> right now, as I'd have to get the sys admin to give me permission to
> install something again, etc.  Whereas I am hoping to get this together
> in a reasonably short script.
> 
> However, I will look at it later (it could be helpful for other things
> for me).
> 
> Thanks again to all,  
> Scott
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor
> 


