comparison of files using set function

Sat May 17 21:08:42 EDT 2008

On Sun, 18 May 2008 00:47:55 +1000, Beema shafreen  
<beema.shafreen at gmail.com> wrote:

> I have files with two columns: column 1 holds an id and column 2 holds
> data (a sequence).
> My goal is to create a table in which the first column lists all the ids
> from the files, the second column holds the sequence from file 1 for
> each id, the third column holds the sequence from the next file for the
> same id, and so on.
> The original files look like this:
>
> 45    ytut
> 46    erete
> 37   dfasf
> 45  dassdsd
>
>
> and so on for all 10 files; each has the two columns mentioned
> above.
>
> The output should look like this:
>
> Id    file1      file2     file3     file4   file5
> 43    ytuh    ytuh     ytuh    ytuh    ytuh
> 46   erteee   rty       ryyy              ertyu
> 47   yutio    rrr                    eeerr
>
>
>
> The goal is to pick all the common ids in the files and put their
> respective information alongside them in the same row.
> The following conditions can also occur:
> 1) a common id is present in all the files, with the same information
> 2) a common id is present in all the files, but with different information
> 3) a common id may not be present in all the files
>
> But the goal is to find exactly the common ids in all the files and add
> their corresponding information from each file to the table, as shown above.
> My script:
> def file1_search(*files1):
>     for file1 in files1:
>         gi1_lis = []
>         fh = open(file1,'r')
>         for line in fh.readlines():
>             data1 = line.strip().split('\t')
>             gi1 = data1[0].strip()
>             seq1 = data1[1].strip()
>             gi1_lis.append(gi1)
>         return gi1_lis
> def file2_search(**files2):
>     for file2 in files2:
>         for file in files2[file2]:
>             gi2_lis = []
>             fh1 = open(file,'r')
>             for line1 in fh1.readlines():
>                 data2 = line1.strip().split('\t')
>                 gi2 = data2[0].strip()
>                 seq2 = data2[1].strip()
>                 gi2_lis.append(gi2)
>
>             return gi2_lis
> def set_compare(data1,data2,*files1,**files2):
>     A = set(data1)
>     B = set(data2)
>     I = A&B # common between these two sets
>
>     D = A-B #57 is the len of D
>     C = B-A #176 is  the len of c
> #    print len(C)
>  #   print len(D)
>     for file1 in files1:
>         for gi in D:
>             fh = open(file1,'r')
>             for line in fh.readlines():
>                 data1 = line.strip().split('\t')
>                 gi1 = data1[0].strip()
>                 seq1 = data1[1].strip()
>             if gi == gi1:
> #                print line.strip()
>                     pass
>
>     for file2 in files2:
>         for file in files2[file2]:
>             for gi in C:
>                 fh1 = open(file,'r')
>                 for line1 in fh1.readlines():
>                     data2 = line1.strip().split('\t')
>                     gi2 = data2[0].strip()
>                     seq2 = data2[1].strip()
>                 if gi == gi2:
>                    # print line1.strip()
>                     pass
> if __name__ == "__main__":
>     files1 = ["Fr20.txt",\
>               "Fr22.txt",\
>               "Fr24.txt",\
>               "Fr60.txt",\
>               "Fr62.txt"]
>     files2 = {"data":["Fr64.txt",\
>               "Fr66.txt",\
>               "Fr68.txt",\
>               "Fr70.txt",\
>               "Fr72.txt"]}
>     data1 = file1_search(*files1)
>
>     """113 is the total number of gi"""
>     data2 = file2_search(**files2)
>     #for j in data2:
>      #   print j
>     """232 is the total number of gi found"""
>     result = set_compare(data1,data2,*files1,**files2)
>
> It does not work. Could somebody please suggest how I can proceed?
> Thanks a lot
>

1. Test with a small number of short files with a clear idea of the  
expected result.
2. Use better variable names.  Names such as file1_search, file2_search,  
gi, gi2, A, B, C and D make it nearly impossible to understand your code.
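
As a starting point, here is a minimal sketch that applies both suggestions: it uses descriptive names and runs on two short in-memory "files" whose expected result is easy to check by hand. The names read_id_sequences, build_table and split_common_ids, and the sample data, are my own assumptions for illustration and are not taken from your script. I also assume the columns are separated by whitespace and that an id repeated within one file keeps its last sequence.

```python
def read_id_sequences(lines):
    """Map each id to its sequence, given an iterable of text lines."""
    sequences_by_id = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Assumption: a repeated id within one file keeps its last sequence.
        sequences_by_id[fields[0]] = fields[1]
    return sequences_by_id


def build_table(tables):
    """Return (rows, common_ids) for a list of per-file {id: sequence} dicts."""
    if not tables:
        return [], set()
    all_ids = set()
    for table in tables:
        all_ids.update(table)
    # Ids present in every file (iterating a dict yields its keys).
    common_ids = set(tables[0]).intersection(*tables[1:])
    # One row per id; an empty string marks a file that lacks the id.
    rows = [[record_id] + [table.get(record_id, "") for table in tables]
            for record_id in sorted(all_ids)]
    return rows, common_ids


def split_common_ids(tables, common_ids):
    """Split the common ids into (same_everywhere, differing) sets."""
    same_everywhere = set()
    differing = set()
    for record_id in common_ids:
        sequences = set(table[record_id] for table in tables)
        if len(sequences) == 1:
            same_everywhere.add(record_id)
        else:
            differing.add(record_id)
    return same_everywhere, differing


# Two tiny made-up "files", small enough to verify by hand.
file1_lines = ["45\tytut", "46\terete", "37\tdfasf"]
file2_lines = ["45\tytut", "46\trty", "47\trrr"]

tables = [read_id_sequences(file1_lines), read_id_sequences(file2_lines)]
rows, common_ids = build_table(tables)
same_everywhere, differing = split_common_ids(tables, common_ids)

for row in rows:
    print("\t".join(row))
```

To try it on the real data, pass an open file instead of a list, for example read_id_sequences(open("Fr20.txt")), once per file. Note that sorted() orders the ids as strings here; convert them to int first if numeric order matters.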

-- 
Kam-Hung Soh, Software Salariman <http://kamhungsoh.com/blog>