[Tutor] variation of Unique items question
Kent Johnson
kent37 at tds.net
Fri Feb 4 13:22:17 CET 2005
You need to reset your items_dict when you see an hg17 line.
Here is one way to do it. I used a class to make it easier to break the problem into functions.
Putting the functions in a class makes it easy to share the header and counts.
class Grouper:
''' Process a sequence of strings of the form
Header
Data
Data
Header
...
Look for repeated Data items under a single Header. When found, print
the Header and the repeated item.
Possible usage:
out = open('outfile.txt', 'w')
Grouper().process(open('infile.txt'), 'hg17', out)
out.close()
'''
def reset(self, header='No header'):
''' Reset the current header and counts '''
self.currHeader = header
self.counts = {}
def process(self, data, headerStart, out):
''' Find duplicates within groups of lines of data '''
self.reset()
for line in data:
line = line.strip() # get rid of newlines from file input
if line.startswith(headerStart):
# Found a new header line, show the current group and restart
self.showDups(out)
self.reset(line)
elif line:
# Found a data line, count it
self.counts[line] = self.counts.get(line, 0) + 1
# Show the last group
self.showDups(out)
def showDups(self, out):
# Get list of items with count > 1
items = [ (k, cnt) for k, cnt in self.counts.items() if cnt > 1 ]
# Show the items
if items:
items.sort()
print >> out, self.currHeader
for k, cnt in sorted(items):
print >> out, '%s occurs %d times' % (k, cnt)
print >> out
if __name__ == '__main__':
import sys
data = '''hg17_chainMm5_chr15 range=chr7:148238502-148239073
ENST00000339563.1
ENST00000342196.1
ENST00000339563.1
ENST00000344055.1
hg17_chainMm5_chr13 range=chr5:42927967-42928726
ENST00000279800.3
ENST00000309556.3
ENST00000279800.3
hg17_chainMm5_chr6 range=chr1:155548627-155549517
ENST00000321157.3
ENST00000256324.4'''.split('\n')
Grouper().process(data, 'hg17', sys.stdout)
Kent
Scott Melnyk wrote:
> Hello once more.
>
> I am stuck on how best to tie the finding Unique Items in Lists ideas to my file
>
> I am stuck at level below: What I have here taken from the unique
> items thread does not work as I need to separate each grouping to the
> hg chain it is in (see below for examples)
>
> import sys
> WFILE=open(sys.argv[1], 'w')
> def get_list_dup_dict(fname='Z:/datasets/fooyoo.txt', threshold=2):
> a_list=open(fname, 'r')
> #print "beginning get_list_dup"
> items_dict, dup_dict = {}, {}
>
> for i in a_list:
> items_dict[i] = items_dict.get(i, 0) + 1
>
> for k, v in items_dict.iteritems():
> if v==threshold:
> dup_dict[k] = v
>
> return dup_dict
>
> def print_list_dup_report(fname='Z:/datasets/fooyoo.txt', threshold=2):
> #print "Beginning report generation"
> dup_dict = get_list_dup_dict(fname='Z:/datasets/fooyoo.txt', threshold=2)
> for k, v in sorted(dup_dict.iteritems()):
> print WFILE,'%s occurred %s times' %(k, v)
>
> if __name__ == '__main__':
> print_list_dup_report()
>
>
> My issue is that my file is as follows:
> hg17_chainMm5_chr15 range=chr7:148238502-148239073
> ENST00000339563.1
> ENST00000342196.1
> ENST00000339563.1
> ENST00000344055.1
>
> hg17_chainMm5_chr13 range=chr5:42927967-42928726
> ENST00000279800.3
> ENST00000309556.3
>
> hg17_chainMm5_chr6 range=chr1:155548627-155549517
> ENST00000321157.3
> ENST00000256324.4
>
> I need a print out that would give the line hg17.... and then any
> instances of the ENST that occur more than once only for that chain
> section. Even better it only prints the hg17 line if it is followed
> by an instance of ENST that occurs more than once
>
> I am hoping for something that gives me an out file roughly like:
>
> hg17_chainMm5_chr15 range=chr7:148238502-148239073
> ENST00000339563.1 occurs 2 times
>
> hg17_chainMm5_chr13 range=chr5:42927967-42928726
> ENST00000279800.3 occurs 2 times
>
>
> All help and ideas appreciated, I am trying to get this finished as
> soon as possible, the output file will be used to go back to my 2 gb
> file and pull out the rest of the data I need.
>
> Thanks,
> Scott
> _______________________________________________
> Tutor maillist - Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor
>
More information about the Tutor
mailing list