something about performance

Ken Seehart ken at seehart.com
Tue Jun 21 01:50:07 EDT 2011


On 6/20/2011 10:31 PM, Ken Seehart wrote:
> On 6/20/2011 7:59 PM, king6cong at gmail.com wrote:
>> Hi,
>> I have two large files, each with more than 200,000,000 lines. Each
>> line consists of two fields: an id and a value.
>> The ids are sorted.
>>
>> for example:
>>
>> file1
>> (uin_a y)
>> 1 10000245
>> 2 12333
>> 3 324543
>> 5 3464565
>> ....
>>
>>
>> file2
>> (uin_b gift)
>> 1 34545
>> 3 6436466
>> 4 35345646
>> 5 463626
>> ....
>>
>> I want to merge them into one file, the lines of which consist of an
>> id and the sum of the two values from file1 and file2.
>> The code is as below:
>>
>> uin_y = open('file1')
>> uin_gift = open('file2')
>>
>> y_line = uin_y.next()
>> gift_line = uin_gift.next()
>>
>> while 1:
>>     try:
>>         uin_a, y = [int(i) for i in y_line.split()]
>>         uin_b, gift = [int(i) for i in gift_line.split()]
>>         if uin_a == uin_b:
>>             score = y + gift
>>             print uin_a, score
>>             y_line = uin_y.next()
>>             gift_line = uin_gift.next()
>>         if uin_a < uin_b:
>>             print uin_a, y
>>             y_line = uin_y.next()
>>         if uin_a > uin_b:
>>             print uin_b, gift
>>             gift_line = uin_gift.next()
>>     except StopIteration:
>>         break
>>
>>
>> The question is that this code runs for 40+ minutes on a server (16
>> cores, 32 GB of memory).
>> The time complexity is O(n) and there are not many operations per line,
>> so I think it should be faster. I want to ask which part costs so much.
>> I tried the cProfile module but didn't get much out of it.
>> I guess maybe it is the int() operation that costs so much, but I'm not
>> sure, and I don't know how to solve this.
>> Is there a way to avoid the type conversion in Python, such as scanf in C?
>> Thanks for your help :)
>
> Unfortunately Python does not have a scanf equivalent, AFAIK. Most use
> cases for scanf can be handled by regular expressions, but that would
> clearly be useless for you, and would just slow you down more, since a
> regex does not perform the int conversion for you.
>
> Your code appears to have a bug: I would expect that the last entry
> will be lost unless both files end with the same index value. Be sure
> to test your code on a few short test files.
>
> I recommend psyco to make the whole thing faster.
>
> Regards,
> Ken Seehart
>
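
First, here is a concrete version of the trailing-lines fix I mentioned
above. This is a minimal, untested Python 2 sketch: it uses readline(),
which returns '' at end of file, so the StopIteration juggling goes away
and whichever file has lines left over can be drained at the end.

import sys

uin_y = open('file1')
uin_gift = open('file2')
out = sys.stdout.write

y_line = uin_y.readline()
gift_line = uin_gift.readline()

while y_line and gift_line:
    uin_a, y = [int(i) for i in y_line.split()]
    uin_b, gift = [int(i) for i in gift_line.split()]
    if uin_a == uin_b:
        out('%d %d\n' % (uin_a, y + gift))
        y_line = uin_y.readline()
        gift_line = uin_gift.readline()
    elif uin_a < uin_b:
        out('%d %d\n' % (uin_a, y))
        y_line = uin_y.readline()
    else:
        out('%d %d\n' % (uin_b, gift))
        gift_line = uin_gift.readline()

# at most one of the files still has lines; copy them through verbatim
for line, f in ((y_line, uin_y), (gift_line, uin_gift)):
    while line:
        out(line)
        line = f.readline()

Writing through sys.stdout.write with a preformatted string also avoids
some of the overhead of the print statement. (If you try psyco, adding
"import psyco; psyco.full()" at the top is all it takes.)
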
Another thought (a bit of extra work, but you might find it worthwhile
if psyco doesn't yield a sufficient boost):

Write a couple of filter programs that convert to and from binary data
(pairs of 32- or 64-bit integers, depending on your requirements).
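
For example, the two filters might look something like this. This is a
rough sketch, not tested: the script names, the unsigned 64-bit fields,
and the little-endian '<QQ' record layout are just assumptions to make
it concrete.

# text_to_bin.py: read "id value" lines on stdin, write packed
# pairs of unsigned 64-bit integers on stdout
import struct
import sys

pack = struct.pack
out = sys.stdout.write
for line in sys.stdin:
    a, b = line.split()
    out(pack('<QQ', int(a), int(b)))

# bin_to_text.py: the inverse filter
import struct
import sys

read = sys.stdin.read
while True:
    rec = read(16)              # one record = two 64-bit integers
    if len(rec) < 16:
        break
    a, b = struct.unpack('<QQ', rec)
    sys.stdout.write('%d %d\n' % (a, b))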

Modify your program to use the subprocess module to open two instances
of the text-to-binary conversion filter on the two input files. Then pipe
your program's output into the binary-to-text filter.
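
The plumbing might look roughly like the following (again untested, and
text_to_bin.py / bin_to_text.py are the hypothetical filters sketched
above, with the same 16-byte record format):

import struct
import subprocess
import sys

def start_reader(path):
    # one text-to-binary converter per input file, so the parsing
    # work runs in its own process (and on its own core)
    return subprocess.Popen([sys.executable, 'text_to_bin.py'],
                            stdin=open(path), stdout=subprocess.PIPE)

def read_rec(pipe):
    rec = pipe.read(16)
    return struct.unpack('<QQ', rec) if len(rec) == 16 else None

p1 = start_reader('file1')
p2 = start_reader('file2')
writer = subprocess.Popen([sys.executable, 'bin_to_text.py'],
                          stdin=subprocess.PIPE)

a = read_rec(p1.stdout)
b = read_rec(p2.stdout)
while a and b:
    if a[0] == b[0]:
        writer.stdin.write(struct.pack('<QQ', a[0], a[1] + b[1]))
        a = read_rec(p1.stdout)
        b = read_rec(p2.stdout)
    elif a[0] < b[0]:
        writer.stdin.write(struct.pack('<QQ', *a))
        a = read_rec(p1.stdout)
    else:
        writer.stdin.write(struct.pack('<QQ', *b))
        b = read_rec(p2.stdout)
while a:    # drain whichever input still has records
    writer.stdin.write(struct.pack('<QQ', *a))
    a = read_rec(p1.stdout)
while b:
    writer.stdin.write(struct.pack('<QQ', *b))
    b = read_rec(p2.stdout)
writer.stdin.close()
writer.wait()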

This might turn out to be faster since each process would make use of a
core. Also it gives you other options, such as keeping your data in
binary form for processing, and converting to text only as needed.
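
As one illustration of that last point, fixed-width binary records make
the file directly addressable, so the k-th pair can be fetched with a
seek instead of a scan (hypothetical file name, same assumed record
layout as above):

import struct

def nth_record(path, k, rec_size=16):
    f = open(path, 'rb')
    try:
        f.seek(k * rec_size)          # jump straight to record k
        rec = f.read(rec_size)
    finally:
        f.close()
    return struct.unpack('<QQ', rec)

# nth_record('file1.bin', 12345) -> (id, value) of record 12345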

Ken Seehart
