[Tutor] sorting a 2 gb file

Scott Melnyk melnyk at gmail.com
Tue Jan 25 09:51:07 CET 2005


Hello!

I am wondering about the best way to handle sorting some data from
some of my results.

I have a file in the form shown at the end (please forgive any
wraparounds due to the width of the screen here; in the real file each
line starting with ENS ends with the e-12 or what have you on the same
line).

What I would like is to generate an output file of any
ENSE000...e-4 (or similar) lines that appear in more than one
place and, for each of those, the queries they appear under.

So if the first line,
ENSE00001098330.2|ENSG00000013573.6|ENST00000350437.2 assembly=N...
etc., appears as a result in any other query, I would like to output it
together with the queries it appears as a result for (including the
score if possible).

The data set that the sample below is taken from is over 2.4 GB, so
speed and memory considerations come into play.  Are sets more effective
than lists for this?  To save space in the new file I really only need
the name of the result up to the | and the score at the end for each.
To simplify things, the score could be dropped, and I could check it
out as needed later.
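
To make the sets-versus-lists question concrete, here is a rough sketch
of one possible approach, not a tested solution for the real data.  It
assumes the hit lines have already been reduced to (query, exon ID,
score) tuples; the function name shared_hits and that tuple shape are
made up for illustration.  It uses a dict of dicts, since membership
tests on a dict or set are hash lookups, whereas "x in some_list" scans
the whole list.  Keying by query also collapses repeats of the same ID
inside one query, keeping only the highest score.

```python
from collections import defaultdict

def shared_hits(records):
    """Return {exon_id: {query: best_score}} for IDs seen in 2+ queries.

    records: any iterable of (query, exon_id, score) tuples
    (a hypothetical shape chosen for this sketch).
    """
    queries_by_exon = defaultdict(dict)  # exon_id -> {query: best score}
    for query, exon_id, score in records:
        seen = queries_by_exon[exon_id]
        # Hash lookup: cost does not grow with the number of queries stored.
        if query not in seen or score > seen[query]:
            seen[query] = score
    # Keep only the IDs that turned up under more than one query.
    return {exon: hits for exon, hits in queries_by_exon.items()
            if len(hits) > 1}
```

Memory use here grows with the number of distinct IDs and the queries
each one appears in, not with the size of the input file, so whether it
fits in RAM depends on how many distinct results the 2.4 GB contains.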

As always, all feedback is much appreciated.

Thanks,
Scott

FILE:

This is the number 1  query tested.
Results for scoring against Query= hg17_chainMm5_chr17
range=chr1:2040-3330 5'pad=0 3'pad=0
 are: 

ENSE00001098330.2|ENSG00000013573.6|ENST00000350437.2 assembly=N...    72  1e-12
ENSE00001160046.1|ENSG00000013573.6|ENST00000251758.3 assembly=N...    72  1e-12
ENSE00001404464.1|ENSG00000013573.6|ENST00000228264.4 assembly=N...    72  1e-12
ENSE00001160046.1|ENSG00000013573.6|ENST00000290818.3 assembly=N...    72  1e-12
ENSE00001343865.2|ENSG00000013573.6|ENST00000350437.2 assembly=N...    46  8e-05
ENSE00001160049.1|ENSG00000013573.6|ENST00000251758.3 assembly=N...    46  8e-05
ENSE00001343865.2|ENSG00000013573.6|ENST00000228264.4 assembly=N...    46  8e-05
ENSE00001160049.1|ENSG00000013573.6|ENST00000290818.3 assembly=N...    46  8e-05

This is the number 2  query tested.
Results for scoring against Query= hg17_chainMm5_chr1
range=chr1:82719-95929 5'pad=0 3'pad=0
 are: 

ENSE00001373792.1|ENSG00000175182.4|ENST00000310585.3 assembly=N...    80  6e-14
ENSE00001134144.2|ENSG00000160013.2|ENST00000307155.2 assembly=N...    78  2e-13
ENSE00001433065.1|ENSG00000185480.2|ENST00000358383.1 assembly=N...    78  2e-13
ENSE00001422761.1|ENSG00000183160.2|ENST00000360503.1 assembly=N...    74  4e-12
ENSE00001431410.1|ENSG00000139631.6|ENST00000308926.3 assembly=N...    74  4e-12
ENSE00001433065.1|ENSG00000185480.2|ENST00000358383.1 assembly=N...    72  1e-11
ENSE00001411753.1|ENSG00000126882.4|ENST00000358329.1 assembly=N...    72  1e-11
ENSE00001428167.1|ENSG00000110497.4|ENST00000314823.4 assembly=N...    72  1e-11
ENSE00001401130.1|ENSG00000160828.5|ENST00000359898.1 assembly=N...    72  1e-11
ENSE00001414900.1|ENSG00000176920.4|ENST00000356650.1 assembly=N...    72  1e-11
ENSE00001428167.1|ENSG00000110497.4|ENST00000314823.4 assembly=N...    72  1e-11
ENSE00001400942.1|ENSG00000138670.5|ENST00000356373.1 assembly=N...    72  1e-11
ENSE00001400116.1|ENSG00000120907.6|ENST00000356368.1 assembly=N...    70  6e-11
ENSE00001413546.1|ENSG00000184209.6|ENST00000344033.2 assembly=N...    70  6e-11
ENSE00001433572.1|ENSG00000124243.5|ENST00000355583.1 assembly=N...    70  6e-11
ENSE00001423154.1|ENSG00000125875.4|ENST00000354200.1 assembly=N...    70  6e-11
ENSE00001400109.1|ENSG00000183785.3|ENST00000339190.2 assembly=N...    70  6e-11
ENSE00001268950.4|ENSG00000084112.4|ENST00000303438.2 assembly=N...    68  2e-10
ENSE00001057279.1|ENSG00000161270.6|ENST00000292886.2 assembly=N...    68  2e-10
ENSE00001434317.1|ENSG00000171453.2|ENST00000304004.2 assembly=N...    68  2e-10
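
For reference, a minimal sketch of how a file laid out like the sample
above could be read one line at a time.  The function name
parse_results and the two regular expressions are assumptions based
only on the sample shown here: a header containing "This is the number
N query tested", a "Query=" token naming the query, and hit lines that
start with ENS and end with a score and an e-value.  The ID is taken as
everything before the first |.

```python
import re

# Patterns inferred from the sample layout; adjust if the real file differs.
QUERY_NUM = re.compile(r"This is the number\s+(\d+)\s+query tested")
QUERY_NAME = re.compile(r"Query=\s*(\S+)")

def parse_results(lines):
    """Yield (query_number, query_name, exon_id, score, evalue) per hit line."""
    number = name = None
    for line in lines:
        line = line.strip()
        m = QUERY_NUM.search(line)
        if m:
            # A new query block starts; forget the previous query name.
            number, name = int(m.group(1)), None
            continue
        m = QUERY_NAME.search(line)
        if m:
            name = m.group(1)
            continue
        if line.startswith("ENS"):
            fields = line.split()
            # Assumes the last two whitespace-separated fields are the
            # score and the e-value, as in the sample.
            exon_id = fields[0].split("|", 1)[0]
            yield number, name, exon_id, float(fields[-2]), float(fields[-1])
```

Because this is a generator, passing it an open file object (which
Python iterates line by line) means only the current line is held in
memory rather than the whole file.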

