Sorting Large File (Code/Performance)

Asim asim.ihsan at gmail.com
Fri Jan 25 09:46:48 EST 2008


On Jan 25, 9:23 am, Asim <asim.ih... at gmail.com> wrote:
> On Jan 24, 4:26 pm, Ira.Ko... at gmail.com wrote:
>
>
>
> > Thanks to all who replied. It's very appreciated.
>
> > Yes, I had to double-check line counts and the number of lines is ~16
> > million (instead of the stated 1.6B).
>
> > Also:
>
> > >What is a "Unicode text file"? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this:
>
> > The file is UTF-8
>
> > > Do the first two characters always belong to the ASCII subset?
>
> > Yes, first two always belong to ASCII subset
>
> > > What are you going to do with it after it's sorted?
>
> > I need to isolate all lines that start with two characters (zz to be
> > particular)
>
> > > Here's a start:http://docs.python.org/lib/typesseq-mutable.html
> > > Google "GnuWin32" and see if their sort does what you want.
>
> > Will do, thanks for the tip.
>
> > > If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath.
>
> > I am limited with resources. Unfortunately.
>
> Since the OP has stated that they are running Windows XP, and more
> than one poster has suggested installing more RAM in the box, I
> thought people should know that WinXP has certain limitations on the
> amount of memory that may be used:
>
> http://msdn2.microsoft.com/en-us/library/aa366778.aspx
>
> Firstly, the maximum amount of physical memory that may be installed
> is 4GB.  Secondly, without the "4 gigabyte tuning" and
> "IMAGE_FILE_LARGE_ADDRESS_AWARE" flags, the maximum amount of
> virtual memory (physical memory + swapfile size) that may be assigned
> to a single user process is 2GB.
>
> Hence, even if you made a 100GB swap file with 4GB RAM installed, by
> default only a maximum of 2GB would ever be assigned to a user-
> process.  With the two flags enabled, the maximum becomes 3GB.
>
> If the OP finds performance to be limited and thinks more RAM would
> help, trying a later version of Windows would be a start, but better
> would be to try out Linux or Mac OSX.
>
> Cheers,
> Asim
>
> > Cheers,
>
> > Ira

Sorry, just to clarify my response.  Any 32-bit OS can assign at most
4GB of virtual memory to a single process.  The argument is that a
32-bit process uses 32-bit addresses, so it can address a maximum of
2^32 bytes (assuming the architecture is using byte-addressed
memory).
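
For anyone who wants the arithmetic spelled out, here is a minimal
sketch.  The helper name address_space_bytes is made up for
illustration.  Note that struct.calcsize("P") reports the pointer size
of the Python build that is running, which is not necessarily that of
the OS (a 32-bit Python can run on a 64-bit OS).

```python
import struct

GIB = 2 ** 30  # bytes in one gibibyte


def address_space_bytes(pointer_bits):
    """Bytes addressable with byte-addressed memory and this pointer width."""
    return 2 ** pointer_bits


# A 32-bit pointer can name 2^32 distinct bytes, i.e. 4 GiB.
print(address_space_bytes(32) // GIB, "GiB")

# Pointer width of the running interpreter: 4 bytes -> 32-bit, 8 bytes -> 64-bit.
pointer_bits = struct.calcsize("P") * 8
print("this interpreter uses %d-bit pointers" % pointer_bits)
```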

Another link that's easier to grok:

http://www.codinghorror.com/blog/archives/000811.html

However, a 32-bit OS may support more than 4GB of physical memory
(using "Physical Address Extension", or PAE) and split it more
intelligently between processes than Windows XP or Vista do:

http://www.ibm.com/developerworks/linux/library/l-memmod/

So allocating more than 4GB of virtual memory to your sort application
could be achieved by splitting your task into more than one process
on an appropriate OS.  AFAIK, such memory limitations depend on the
particular Linux distro you're using, and I'm not sure about Mac OSX
limitations.  This applies doubly for 64-bit architectures and OSes.
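
To make the "more than one process" idea concrete, here is a rough
sketch of a chunked (external) merge sort, which is my own
illustration and not something anyone in this thread has benchmarked.
The input is read in chunks that each fit in memory, every chunk is
sorted and written to a temporary file, and the sorted chunks are then
merged with heapq.merge.  The function names, the lines_per_chunk
default and the optional pool argument are all assumptions made for
the example.  Lines are compared by code point, which for UTF-8 text
with ASCII leading characters groups the "zz" lines together.

```python
import heapq
import os
import tempfile


def sort_chunk(lines):
    """Sort one chunk of lines and write it to a temp file; return its path."""
    lines.sort()
    fd, path = tempfile.mkstemp(suffix=".chunk")
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        out.writelines(lines)
    return path


def read_chunks(path, lines_per_chunk):
    """Yield lists of at most lines_per_chunk lines from the input file."""
    chunk = []
    with open(path, encoding="utf-8") as src:
        for line in src:
            if not line.endswith("\n"):
                line += "\n"  # last line may lack a newline; add one so merging is clean
            chunk.append(line)
            if len(chunk) >= lines_per_chunk:
                yield chunk
                chunk = []
    if chunk:
        yield chunk


def external_sort(in_path, out_path, lines_per_chunk=1000000, pool=None):
    """Sort in_path into out_path, holding roughly one chunk in memory at a time.

    If pool is given (e.g. a multiprocessing.Pool), each chunk is sorted in
    a worker process, so no single process has to hold the whole file.
    """
    mapper = pool.imap if pool is not None else map
    chunk_paths = list(mapper(sort_chunk, read_chunks(in_path, lines_per_chunk)))
    files = [open(p, encoding="utf-8") for p in chunk_paths]
    try:
        with open(out_path, "w", encoding="utf-8") as out:
            # heapq.merge reads the already-sorted chunk files lazily, line by line.
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```

Without a pool everything runs in one process, which is already enough
to stay under the 2GB limit if lines_per_chunk is small enough.  To
spread the chunk sorting over several processes, create a
multiprocessing.Pool and pass it as pool; on Windows that has to
happen under an "if __name__ == '__main__':" guard.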

Please correct me, with references, if my conclusions are wrong.

Cheers,
Asim


