Creating Long Lists

Dan Stromberg drsalists at gmail.com
Mon Feb 21 22:24:12 EST 2011


On Mon, Feb 21, 2011 at 6:57 PM, Kelson Zawack
<zawackkfb at gis.a-star.edu.sg>wrote:

> I have a large (10gb) data file for which I want to parse each line into an
> object and then append this object to a list for sorting and further
> processing.  I have noticed however that as the length of the list increases
> the rate at which objects are added to it decreases dramatically.  My first
> thought was that  I was nearing the memory capacity of the machine and the
> decrease in performance was due to the os swapping things in and out of
> memory.  When I looked at the memory usage this was not the case.  My
> process was the only job running and was consuming 40gb of the total
> 130gb and no swapping processes were running.  To make sure there was not
> some problem with the rest of my code, or the server's file system, I ran my
> program again as it was but without the line that was appending items to the
> list and it completed without problem indicating that the decrease in
> performance is the result of some part of the process of appending to the
> list.  Since other people have observed this problem as well (
> http://tek-tips.com/viewthread.cfm?qid=1096178&page=13,
> http://stackoverflow.com/questions/2473783/is-there-a-way-to-circumvent-python-list-append-becoming-progressively-slower-i)
> I did not bother to further analyze or benchmark it.  Since the answers in
> the above forums do not seem very definitive, I thought I would inquire
> here about what the reason for this decrease in performance is, and if there
> is a way, or another data structure, that would avoid this problem.


Do you have 130G of physical RAM, or 130G of virtual memory?  That makes a
big difference.  (Yeah, I know, 130G of physical RAM is probably pretty rare
today)

Disabling garbage collection is a good idea, but if you don't have well over
10G of physical RAM, you'd probably be better off also using a (partially)
disk-based sort.  To do otherwise would pretty much beg for swapping and a
large slowdown.
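
Something like this is what I mean by pausing the collector during the bulk
appends (just a sketch - parse_line stands in for whatever you do to turn a
line into an object):

import gc

def load_objects(path):
    items = []
    gc.disable()  # pause cyclic GC while the list grows
    try:
        with open(path) as f:
            for line in f:
                items.append(parse_line(line))  # parse_line: your parser
    finally:
        gc.enable()  # turn collection back on afterwards
    return items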

Merge sort works very well for very large datasets.
http://en.wikipedia.org/wiki/Merge_sort  Just make your sublists disk
files, not in-memory lists - until you get down to sublists small enough
to sort in memory without thrashing.  Timsort (list_.sort()) is excellent
for in-memory sorting.
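
A rough sketch of the disk-based approach, assuming a line-oriented file
where every line ends with a newline and sorting the lines as strings is
what you want (heapq.merge does the k-way merge of the sorted runs lazily):

import heapq
import itertools
import tempfile

def external_sort(path, chunk_lines=1000000):
    runs = []
    with open(path) as f:
        while True:
            chunk = list(itertools.islice(f, chunk_lines))
            if not chunk:
                break
            chunk.sort()  # Timsort on an in-memory sublist
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    return heapq.merge(*runs)  # yields lines in sorted order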

Actually, GNU sort is very good at sorting huge datasets - you could
probably just open a subprocess to it, as long as you can make your data fit
the line-oriented model GNU sort expects, and you have enough temporary disk
space.
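
For example (the filenames and temp directory here are just placeholders;
-T tells GNU sort where to put its scratch files):

import subprocess

# hand the whole job to GNU sort
subprocess.check_call(
    ["sort", "-T", "/path/to/big/tmp", "-o", "sorted.txt", "unsorted.txt"])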