Sorting Large File (Code/Performance)

Asim asim.ihsan at gmail.com
Fri Jan 25 09:23:38 EST 2008


On Jan 24, 4:26 pm, Ira.Ko... at gmail.com wrote:
> Thanks to all who replied. It's very appreciated.
>
> Yes, I had to doublecheck line counts and the number of lines is ~16
> million (instead of stated 1.6B).
>
> Also:
>
> >What is a "Unicode text file"? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this:
>
> The file is UTF-8
>
> > Do the first two characters always belong to the ASCII subset?
>
> Yes, first two always belong to ASCII subset
>
> > What are you going to do with it after it's sorted?
>
> I need to isolate all lines that start with two characters (zz to be
> particular)
>
> > Here's a start:http://docs.python.org/lib/typesseq-mutable.html
> > Google "GnuWin32" and see if their sort does what you want.
>
> Will do, thanks for the tip.
>
> > If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath.
>
> I am limited with resources. Unfortunately.
>

Since the OP has stated that they are running Windows XP, and more
than one poster has suggested installing more RAM in the box, I
thought people should know that WinXP has certain limitations on the
amount of memory that may be used:

http://msdn2.microsoft.com/en-us/library/aa366778.aspx

Firstly, the maximum amount of physical memory that 32-bit Windows XP
can use is 4GB.  Secondly, without the "4 gigabyte tuning" and
"IMAGE_FILE_LARGE_ADDRESS_AWARE" settings, the maximum amount of
virtual address space (backed by physical memory plus the swap file)
that may be assigned to a single user process is 2GB.

Hence, even if you made a 100GB swap file with 4GB RAM installed, by
default only a maximum of 2GB would ever be assigned to a user
process.  With the two flags enabled, the maximum becomes 3GB.
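
Given that ceiling, it may be worth noting that isolating the "zz"
lines does not strictly require sorting or holding the file in memory.
What follows is only a minimal sketch, not a tested solution for the
OP's data: the function name and the file paths are made up for
illustration, and it assumes the file is UTF-8 with an ASCII two-character
prefix, as the OP stated. It reads the file one line at a time in binary
mode, so memory use stays roughly at the size of the longest line.

```python
def copy_lines_with_prefix(src_path, dst_path, prefix=b"zz"):
    """Copy each line of src_path that starts with prefix to dst_path.

    Hypothetical helper for illustration. Returns the number of lines
    copied. Files are opened in binary mode: the prefix is plain ASCII,
    and in UTF-8 the newline byte never occurs inside a multi-byte
    character, so splitting on raw bytes is safe.
    """
    matched = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Iterating over the file object yields one line at a time,
        # so the whole file is never loaded into memory.
        for line in src:
            if line.startswith(prefix):
                dst.write(line)
                matched += 1
    return matched


if __name__ == "__main__":
    # Placeholder file names; substitute the real paths.
    count = copy_lines_with_prefix("big_input.txt", "zz_lines.txt")
    print(count, "lines copied")
```

One caveat: if the file begins with a UTF-8 byte order mark, the very
first line will not match the prefix even if its text starts with "zz",
so that line would need to be checked separately.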

If the OP finds performance to be limited and thinks more RAM would
help, trying a later version of Windows would be a start, but better
would be to try Linux or Mac OS X.

Cheers,
Asim


> Cheers,
>
> Ira



