[Tutor] variation of Unique items question

Fri Feb 4 13:22:17 CET 2005

You need to reset your items_dict when you see an hg17 line.
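
To see why the reset matters, here is a small sketch in Python 3 syntax (the rest of this message is Python 2). The header names and ENST1/ENST2/ENST3 lines are made-up sample data, not taken from your file. With one dict for the whole file, an ID that appears once in each of two groups looks like a duplicate; with a fresh dict per header it does not.

```python
# Hypothetical sample: ENST1 appears once in each group, ENST3 twice in group B
lines = [
    'hg17_A', 'ENST1', 'ENST2',
    'hg17_B', 'ENST1', 'ENST3', 'ENST3',
]

# No reset: a single dict counts across the whole file
global_counts = {}
for line in lines:
    if not line.startswith('hg17'):
        global_counts[line] = global_counts.get(line, 0) + 1

# With reset: start a fresh items_dict every time a header line is seen
per_group = {}
items_dict = {}
for line in lines:
    if line.startswith('hg17'):
        items_dict = {}              # the reset
        per_group[line] = items_dict
    else:
        items_dict[line] = items_dict.get(line, 0) + 1

print(global_counts)  # ENST1 is counted twice here
print(per_group)      # ENST1 is counted once in each group
```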

Here is one way to do it. I used a class so the problem can be broken into small methods 
that share the current header and the counts.

class Grouper:
     ''' Process a sequence of strings of the form
         Header
         Data
         Data

         Header
         ...

         Look for repeated Data items under a single Header. When found, print
         the Header and the repeated item.

         Possible usage:
         out = open('outfile.txt', 'w')
         Grouper().process(open('infile.txt'), 'hg17', out)
         out.close()
     '''

     def reset(self, header='No header'):
         ''' Reset the current header and counts '''
         self.currHeader = header
         self.counts = {}

     def process(self, data, headerStart, out):
         ''' Find duplicates within groups of lines of data '''
         self.reset()

         for line in data:
             line = line.strip() # get rid of newlines from file input

             if line.startswith(headerStart):
                 # Found a new header line, show the current group and restart
                 self.showDups(out)
                 self.reset(line)

             elif line:
                 # Found a data line, count it
                 self.counts[line] = self.counts.get(line, 0) + 1

         # Show the last group
         self.showDups(out)

     def showDups(self, out):
         ''' Print the current header and any items seen more than once '''
         # Get list of items with count > 1
         items = [ (k, cnt) for k, cnt in self.counts.items() if cnt > 1 ]

         # Show the items, sorted by name
         if items:
             items.sort()
             print >> out, self.currHeader
             for k, cnt in items:
                 print >> out, '%s occurs %d times' % (k, cnt)
             print >> out

if __name__ == '__main__':
     import sys

     data = '''hg17_chainMm5_chr15 range=chr7:148238502-148239073
     ENST00000339563.1
     ENST00000342196.1
     ENST00000339563.1
     ENST00000344055.1

     hg17_chainMm5_chr13 range=chr5:42927967-42928726
     ENST00000279800.3
     ENST00000309556.3
     ENST00000279800.3

     hg17_chainMm5_chr6 range=chr1:155548627-155549517
     ENST00000321157.3
     ENST00000256324.4'''.split('\n')

     Grouper().process(data, 'hg17', sys.stdout)
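
The `print >> out` statements above are Python 2 syntax. For anyone running Python 3, the reporting step could look roughly like the sketch below. `show_dups` is a hypothetical stand-alone stand-in for the showDups method, taking the header and counts as arguments instead of reading them from self; the counts passed in are sample values.

```python
import io

def show_dups(header, counts, out):
    """Write the header and every item counted more than once to `out`."""
    items = sorted((k, cnt) for k, cnt in counts.items() if cnt > 1)
    if items:
        print(header, file=out)
        for k, cnt in items:
            print('%s occurs %d times' % (k, cnt), file=out)
        print(file=out)  # blank line between groups

# Any file-like object works as `out`; StringIO keeps the example self-contained
buf = io.StringIO()
show_dups('hg17_chainMm5_chr15 range=chr7:148238502-148239073',
          {'ENST00000339563.1': 2, 'ENST00000342196.1': 1},
          buf)
report = buf.getvalue()
print(report, end='')
```

A group with no repeated items writes nothing at all, so the header only appears when there is something to report.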

Kent

Scott Melnyk wrote:
> Hello once more.
> 
> I am stuck on how best to tie the finding Unique Items in Lists ideas to my file
> 
> I am stuck at level below:  What I have here taken from the unique
> items thread does not work as I need to separate each grouping to the
> hg chain it is in (see below for examples)
> 
> import sys
> WFILE=open(sys.argv[1], 'w') 
> def get_list_dup_dict(fname='Z:/datasets/fooyoo.txt', threshold=2):
>     a_list=open(fname, 'r')
>    #print "beginning get_list_dup"
>     items_dict, dup_dict = {}, {}
>     
>     for i in a_list:
>         items_dict[i] = items_dict.get(i, 0) + 1
> 
>     for k, v in items_dict.iteritems():
>         if v==threshold:
>             dup_dict[k] = v    
> 
>     return dup_dict
> 
> def print_list_dup_report(fname='Z:/datasets/fooyoo.txt', threshold=2):
>     #print "Beginning report generation"
>     dup_dict = get_list_dup_dict(fname='Z:/datasets/fooyoo.txt', threshold=2)
>     for k, v in sorted(dup_dict.iteritems()):
>         print WFILE,'%s occurred %s times' %(k, v)
> 
> if __name__ == '__main__':
>         print_list_dup_report()
> 
> 
> My issue is that my file is as follows:
> hg17_chainMm5_chr15 range=chr7:148238502-148239073
> ENST00000339563.1
> ENST00000342196.1
> ENST00000339563.1
> ENST00000344055.1
> 
> hg17_chainMm5_chr13 range=chr5:42927967-42928726
> ENST00000279800.3
> ENST00000309556.3
> 
> hg17_chainMm5_chr6 range=chr1:155548627-155549517
> ENST00000321157.3
> ENST00000256324.4
>   
> I need a print out that would give the line hg17.... and then any
> instances of the ENST that occur more than once only for that chain
> section.  Even better it only prints the hg17 line if it is followed
> by an instance of ENST that occurs more than once
> 
> I am hoping for something that gives me an out file roughly like:
> 
> hg17_chainMm5_chr15 range=chr7:148238502-148239073
> ENST00000339563.1 occurs 2 times
> 
> hg17_chainMm5_chr13 range=chr5:42927967-42928726
> ENST00000279800.3 occurs 2 times
>  
> 
> All help and ideas appreciated, I am trying to get this finished as
> soon as possible, the output file will be used to go back to my 2 gb
> file and pull out the rest of the data I need.
> 
> Thanks,
> Scott