something about performance

Ken Seehart ken at seehart.com
Tue Jun 21 06:51:44 EDT 2011


On 6/20/2011 11:56 PM, king6cong at gmail.com wrote:
> Thanks for your reply, Ken :)
> I found the two files end with the same id……so I am lazy ^-^
> I tried psyco, and unfortunately it costs nearly the same time as before.
> Is it true that we can only get str from files in Python?
>
Nope ^_*. There are many applications, such as image processing, that
involve working with binary data.

(^_* well, technically yes: read() does in fact return str, but the
str can contain arbitrary binary data)

But in order to do this, you need to use one of several modules that
allow Python to operate on flat data.

Two standard modules exist for this purpose: *array* and *struct*. In
addition there are others, such as *numpy* (for mathematical
applications) and *ctypes* (for interoperability between Python and C/C++).
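
(For instance, *array* can move whole blocks of integers in a single
call — a minimal sketch, with a hypothetical file name:)

>>> import array
>>> a = array.array('l', [1, 34545, 3, 6436466])  # signed longs
>>> fout = open('pairs.dat', 'wb')                # hypothetical file name
>>> a.tofile(fout)                                # one call writes raw binary
>>> fout.close()
>>> b = array.array('l')
>>> fin = open('pairs.dat', 'rb')
>>> b.fromfile(fin, 4)                            # read four integers back
>>> fin.close()
>>> list(b)
[1, 34545, 3, 6436466]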

For your application, the *struct* module is sufficient.

>>> import struct
>>> fout = open('junk.dat', 'wb') # open for writing binary
>>> fout.write(struct.pack('LL', 123, 234))
>>> fout.write(struct.pack('LL', 123, 234))
>>> fout.write(struct.pack('LL', 3, 4))
>>> fout.close()

>>> fin = open('junk.dat', 'rb') # open for reading binary
>>> print struct.unpack('LL', fin.read(8))
(123, 234)
>>> print struct.unpack('LL', fin.read(8))
(123, 234)
>>> print struct.unpack('LL', fin.read(8))
(3, 4)
>>> print struct.unpack('LL', fin.read(8)) # raises struct.error at EOF (0 bytes read)
Traceback (most recent call last):
File "<string>", line 1, in <fragment>
struct.error: unpack requires a string argument of length 8
>>>
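
(To read a whole file of pairs without tripping over that exception,
you can check the chunk length yourself — a minimal sketch, and just
one way to do it:)

>>> fin = open('junk.dat', 'rb')
>>> size = struct.calcsize('LL')   # native 'L' may be 4 or 8 bytes
>>> while True:
...     chunk = fin.read(size)
...     if len(chunk) < size:      # a short or empty read means EOF
...         break
...     print struct.unpack('LL', chunk)
...
(123, 234)
(123, 234)
(3, 4)
>>> fin.close()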


>
> On 6/21/2011 1:50 PM, Ken Seehart <ken at seehart.com> wrote:
>
>     On 6/20/2011 10:31 PM, Ken Seehart wrote:
>>     On 6/20/2011 7:59 PM, king6cong at gmail.com wrote:
>>>     Hi,
>>>     I have two large files, each with more than 200000000 lines, and
>>>     each line consists of two fields: an id and a value.
>>>     The ids are sorted.
>>>
>>>     for example:
>>>
>>>     file1
>>>     (uin_a y)
>>>     1 10000245
>>>     2 12333
>>>     3 324543
>>>     5 3464565
>>>     ....
>>>
>>>
>>>     file2
>>>     (uin_b gift)
>>>     1 34545
>>>     3 6436466
>>>     4 35345646
>>>     5 463626
>>>     ....
>>>
>>>     I want to merge them into one output file whose lines consist of
>>>     an id and the sum of the two values from file1 and file2.
>>>     The code is as below:
>>>
>>>     uin_y = open('file1')
>>>     uin_gift = open('file2')
>>>
>>>     y_line = uin_y.next()
>>>     gift_line = uin_gift.next()
>>>
>>>     while 1:
>>>         try:
>>>             uin_a, y = [int(i) for i in y_line.split()]
>>>             uin_b, gift = [int(i) for i in gift_line.split()]
>>>             if uin_a == uin_b:
>>>                 score = y + gift
>>>                 print uin_a, score
>>>                 y_line = uin_y.next()
>>>                 gift_line = uin_gift.next()
>>>             if uin_a < uin_b:
>>>                 print uin_a, y
>>>                 y_line = uin_y.next()
>>>             if uin_a > uin_b:
>>>                 print uin_b, gift
>>>                 gift_line = uin_gift.next()
>>>         except StopIteration:
>>>             break
>>>
>>>
>>>     The question is that this code runs for 40+ minutes on a server
>>>     (16 cores, 32 GB memory).
>>>     The time complexity is O(n) and there are not many operations, so
>>>     I think it should be faster, and I want to ask which part costs so
>>>     much.
>>>     I tried the cProfile module but didn't learn much from it.
>>>     I guess maybe it is the int() operation that costs so much, but
>>>     I'm not sure and don't know how to solve this.
>>>     Is there a way to avoid type conversion in Python, such as scanf
>>>     in C?
>>>     Thanks for your help :)
>>
>>     Unfortunately Python does not have a scanf equivalent, AFAIK. Most
>>     use cases for scanf can be handled by regular expressions, but
>>     that would clearly be useless for you, and would just slow you down
>>     more, since a regex does not perform the int conversion for you.
>>
>>     Your code appears to have a bug: I would expect that the last
>>     entry will be lost unless both files end with the same index
>>     value. Be sure to test your code on a few short test files.
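>>
>>     (A rough sketch of one way to flush the tail, e.g. if file2 ran
>>     out first while file1 still has lines; this is illustrative only
>>     and assumes you track which .next() call raised StopIteration:)
>>
>>     try:
>>         while 1:                   # drain the rest of file1
>>             uin_a, y = [int(i) for i in y_line.split()]
>>             print uin_a, y
>>             y_line = uin_y.next()
>>     except StopIteration:
>>         pass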
>>
>>     I recommend psyco to make the whole thing faster.
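>>
>>     (Typically, and assuming the usual psyco setup, that is just two
>>     lines at the top of the script:)
>>
>>     import psyco    # JIT specialization for CPython 2.x
>>     psyco.full()    # ask psyco to compile everything it can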
>>
>>     Regards,
>>     Ken Seehart
>>
>     Another thought (a bit of extra work, but you might find it
>     worthwhile if psyco doesn't yield a sufficient boost):
>
>     Write a couple of filter programs to convert to and from binary
>     data (pairs of 32- or 64-bit integers, depending on your requirements).
>
>     Modify your program to use the subprocess module to open two
>     instances of the text-to-binary conversion process on the two input
>     files. Then pipe the output of your program into the binary-to-text
>     filter.
>
>     This might turn out to be faster since each process would make use
>     of a core. Also it gives you other options, such as keeping your
>     data in binary form for processing, and converting to text only as
>     needed.
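>
>     (A minimal sketch of the text-to-binary filter, assuming unsigned
>     32-bit pairs written to stdout; the file name text_to_bin.py is
>     hypothetical:)
>
>     # text_to_bin.py: read "id value" lines, emit packed 32-bit pairs
>     import sys, struct
>
>     for line in sys.stdin:
>         a, b = line.split()
>         sys.stdout.write(struct.pack('=LL', int(a), int(b)))
>
>     The merge program could then start one instance per input file with
>     something like subprocess.Popen(['python', 'text_to_bin.py'],
>     stdin=open('file1'), stdout=subprocess.PIPE) and read packed pairs
>     from each pipe's stdout.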
>
>     Ken Seehart
>
>
>     --
>     http://mail.python.org/mailman/listinfo/python-list
>
>
