Something about performance

king6cong at gmail.com
Mon Jun 20 22:59:48 EDT 2011


Hi,
   I have two large files, each with more than 200,000,000 lines. Each line
consists of two fields: an id and a value.
The ids in both files are sorted.

for example:

file1
(uin_a y)
1 10000245
2  12333
3 324543
5 3464565
....


file2
(uin_b gift)
1 34545
3 6436466
4 35345646
5 463626
....

I want to merge them into a single file in which each line consists of an id
and the sum of the two values from file1 and file2.
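For the sample lines above, the merged output should therefore look like this
(judging from my code below, an id that appears in only one file just keeps its
single value):

1 10034790
2 12333
3 6761009
4 35345646
5 3928191
....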
The code is below:

uin_y = open('file1')
uin_gift = open('file2')

y_line = uin_y.next()
gift_line = uin_gift.next()

while 1:
    try:
        # both lines are re-split and re-converted on every pass,
        # even when only one of them advanced
        uin_a, y = [int(i) for i in y_line.split()]
        uin_b, gift = [int(i) for i in gift_line.split()]
        if uin_a == uin_b:
            score = y + gift
            print uin_a, score
            y_line = uin_y.next()
            gift_line = uin_gift.next()
        elif uin_a < uin_b:
            print uin_a, y
            y_line = uin_y.next()
        else:
            print uin_b, gift
            gift_line = uin_gift.next()
    except StopIteration:
        # when either file hits EOF, the loop stops and whatever is
        # left of the other file is dropped
        break


The problem is that this code runs for 40+ minutes on a server (16 cores,
32 GB of RAM).
The time complexity is O(n) and there aren't many operations per line,
so I think it should be faster, and I want to know which part costs so much.
I tried the cProfile module but didn't learn much from it.
I guess it may be the int() conversion that costs so much, but I'm not sure,
and I don't know how to avoid it.
Is there a way to avoid type conversion in Python, like scanf in C?
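
To check whether int() really dominates, a rough micro-benchmark of just the
per-line parsing might help, something like the sketch below (the sample line
and the repeat count are arbitrary placeholders):

# time only the split() + int() work done for a single line
from timeit import timeit

sample_line = '1234567 10000245'      # made-up line in the same two-field format
repeats = 10**6                       # arbitrary number of repetitions
t = timeit("[int(i) for i in line.split()]",
           setup="line = %r" % sample_line,
           number=repeats)
print "split() + int() per line:", t / repeats, "seconds"

Multiplying that per-line figure by the several hundred million parses the loop
does should show whether the conversions alone can explain the 40 minutes.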
Thanks for your help :)