Creating Long Lists

Dan Stromberg drsalists at gmail.com
Mon Feb 21 23:29:47 EST 2011


On Mon, Feb 21, 2011 at 7:24 PM, Dan Stromberg <drsalists at gmail.com> wrote:

>
> On Mon, Feb 21, 2011 at 6:57 PM, Kelson Zawack <
> zawackkfb at gis.a-star.edu.sg> wrote:
>
>> I have a large (10gb) data file for which I want to parse each line into
>> an object and then append this object to a list for sorting and further
>> processing.  I have noticed, however, that as the length of the list increases,
>> the rate at which objects are added to it decreases dramatically.  My first
>> thought was that I was nearing the memory capacity of the machine and the
>> decrease in performance was due to the os swapping things in and out of
>> memory.  When I looked at the memory usage this was not the case.  My
>> process was the only job running and was consuming 40gb of the total
>> 130gb and no swapping processes were running.  To make sure there was not
>> some problem with the rest of my code, or the servers file system, I ran my
>> program again as it was but without the line that was appending items to the
>> list and it completed without problem indicating that the decrease in
>> performance is the result of some part of the process of appending to the
>> list.  Since other people have observed this problem as well (
>> http://tek-tips.com/viewthread.cfm?qid=1096178&page=13,
>> http://stackoverflow.com/questions/2473783/is-there-a-way-to-circumvent-python-list-append-becoming-progressively-slower-i)
>> I did not bother to further analyze or benchmark it.  Since the answers in
>> the above forums do not seem very definitive, I thought I would inquire
>> here about what the reason for this decrease in performance is, and whether
>> there is a way, or another data structure, that would avoid this problem.
>
>
> Do you have 130G of physical RAM, or 130G of virtual memory?  That makes a
> big difference.  (Yeah, I know, 130G of physical RAM is probably pretty rare
> today)
>
> Disabling garbage collection is a good idea, but if you don't have well
> over 10G of physical RAM, you'd probably be better off also using a (partially)
> disk-based sort.  To do otherwise would pretty much beg for swapping and a
> large slowdown.
>
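
Rough sketch of the garbage collection part, for what it's worth - parse_line()
below is just a stand-in for whatever turns one line into one of your objects:

import gc

def load_objects(path):
    # The cyclic collector rescans every tracked object as the list grows;
    # turning it off while building the list avoids those ever-longer passes.
    objects = []
    gc.disable()
    try:
        with open(path) as infile:
            for line in infile:
                objects.append(parse_line(line))  # parse_line() is hypothetical
    finally:
        gc.enable()
    return objects
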
> Merge sort works very well for very large datasets.
> http://en.wikipedia.org/wiki/Merge_sort  Just make your sublists be disk
> files, not in-memory lists - until you get down to a small enough sublist
> that you can sort it in memory, without thrashing.  Timsort (list_.sort())
> is excellent for in-memory sorting.
>
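
Something like this for the on-disk merge, assuming your records can be
compared as whole text lines and that a million or so lines fit comfortably in
memory (both are guesses - adjust for your data; the file names at the bottom
are placeholders too):

import heapq
import itertools
import tempfile

def external_sort(infile, outfile, lines_per_chunk=1000000):
    # Sort memory-sized chunks with Timsort, spill each sorted run to a
    # temp file, then stream one heapq.merge pass over all the runs.
    # Assumes every input line ends in '\n' and sorts lexicographically.
    runs = []
    while True:
        chunk = list(itertools.islice(infile, lines_per_chunk))
        if not chunk:
            break
        chunk.sort()
        run = tempfile.TemporaryFile('w+')
        run.writelines(chunk)
        run.seek(0)
        runs.append(run)
    outfile.writelines(heapq.merge(*runs))

external_sort(open('records.txt'), open('records_sorted.txt', 'w'))
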
> Actually, GNU sort is very good at sorting huge datasets - you could
> probably just open a subprocess to it, as long as you can make your data fit
> the line-oriented model GNU sort expects, and you have enough temporary disk
> space.
>
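
If your data does fit that model, the whole thing is only a few lines - the
file names and the scratch directory below are placeholders:

import os
import subprocess

# GNU sort already does an external merge sort on disk; -T points its
# temp files at somewhere with enough free space, and LC_ALL=C gives
# plain byte ordering instead of slower locale-aware collation.
env = dict(os.environ, LC_ALL='C')
subprocess.check_call(
    ['sort', '-T', '/scratch/tmp', '-o', 'records_sorted.txt', 'records.txt'],
    env=env)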

Depending on what you're doing after the sort, you might also look at
bsddb.btopen
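
Loading the data into a B-tree doubles as the sort, since the keys come back
in sorted order.  A rough sketch - the key layout, file names, and process()
step are just placeholders for your records and downstream code:

import bsddb

db = bsddb.btopen('records.db', 'c')     # on-disk B-tree with a dict-like API
with open('records.txt') as infile:
    for line in infile:
        key = line.split('\t', 1)[0]     # assumes a tab-delimited sort key
        db[key] = line                   # note: duplicate keys overwrite
for key in db.keys():                    # B-tree keys come back sorted
    process(db[key])                     # process() is a stand-in
db.close()

keys() does pull the full key list into memory at once; the first()/next()
cursor methods on the same object can walk the tree without doing that, but
this shows the idea.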