Sorting in huge files

Adam DePrince adam at cognitcorp.com
Wed Dec 8 14:54:31 EST 2004


On Tue, 2004-12-07 at 16:47, Paul wrote:
> I really do need to sort. It is complicated and I haven't said why, but
> it will help in finding similar keys later on. Sorry I can't be more
> precise, this has to do with my research.

Precision is precisely what we require to give you an answer more
meaningful than "write a script to load it into your favorite database
and type 'select * from table order by column;'".

Now unless you have an NDA with an employer or are working on something
classified (in which case you have already given us too much
information and should start looking for another job and a lawyer), I would
venture a guess that you have more to gain than to lose from giving us more
information.  Decisions are hard sometimes ... is the help worth the
risk that somebody in this forum will look at your question, say "hey,
that is a neat idea," duplicate all of your research, and publish before
you do, shaming you into a life of asking "do you want fries with that?"
and pumping gas?

> 
> Your two other suggestions with itertools and operator are more useful,
> but I was mostly wondering about performance issue.

What performance issue?  Nowadays any decent laptop should be able to
handle this dataset (from disk) without too much trouble. 

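import sys

# placeholder: substitute your database module's DB-API connect() call
conn = connect_to_my_favorite_database()
c = conn.cursor()
f = open( "mydata" )
for line in f:
	# placeholder SQL: fill in your real table and column names
	c.execute( "insert into table( fields) values (%s,%s ... )", line.split() )
conn.commit()	# in the DB-API, commit() belongs to the connection, not the cursor
# status goes to stderr so it stays out of the piped results
sys.stderr.write( "I'm done loading, feel free to hit control+C if you get tired\n" )
c.execute( "select * from table order by field" )
row = c.fetchone()
while row is not None:	# fetchone() returns None once the rows run out
	print( row )
	row = c.fetchone()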

Then, from your shell:

python myloadscript.py | gzip -9 > results.txt.gz

Start it up Friday night and take the weekend off.  Just make sure you
plug your laptop into the wall before you go home.
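
If you want something you can run as-is, here is a minimal sketch of the same load-then-"order by" idea. It assumes SQLite through Python's sqlite3 module and input lines of the form "key value"; the table name records, the columns k and v, and the function load_and_sort are all made up for illustration, so adapt them to your real data.

```python
import sqlite3

def load_and_sort(lines, db_path=":memory:"):
    """Load whitespace-separated "key value" lines, then yield rows ordered by key."""
    conn = sqlite3.connect(db_path)
    try:
        # hypothetical two-column schema, not taken from the original post
        conn.execute("CREATE TABLE records (k TEXT, v TEXT)")
        conn.executemany(
            "INSERT INTO records (k, v) VALUES (?, ?)",
            (line.split() for line in lines if line.strip()),
        )
        conn.commit()
        # let the database engine do the sorting
        for row in conn.execute("SELECT k, v FROM records ORDER BY k"):
            yield row
    finally:
        conn.close()

sample = ["pear 1", "apple 2", "pear 3", "fig 4"]
sorted_rows = list(load_and_sort(sample))
```

For a file that does not fit in memory, pass the open file object as lines and a real filename as db_path, so that the table lives on disk rather than in RAM.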

> 
> Is this reasonable to do on 10^8 elements with repeats in the keys? I
> guess I should just try and see for myself.

Repeats in the keys don't matter.  
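
To see why, here is a small sketch using the itertools and operator modules mentioned above. The sample records are invented for illustration. Sorting places equal keys next to each other, and because Python's sort is stable, records sharing a key keep their original relative order, which is handy if the goal is to group similar keys afterwards.

```python
from itertools import groupby
from operator import itemgetter

# invented sample data: (key, value) pairs with repeated keys
records = [("b", 1), ("a", 2), ("b", 3), ("a", 4), ("c", 5)]

# sort on the key only; ties keep their input order because the sort is stable
by_key = sorted(records, key=itemgetter(0))

# after sorting, equal keys are adjacent, so groupby can collect them
groups = {
    key: [value for _, value in group]
    for key, group in groupby(by_key, key=itemgetter(0))
}
```

Here by_key comes out as [("a", 2), ("a", 4), ("b", 1), ("b", 3), ("c", 5)] and groups maps each key to its values in input order.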


Adam DePrince 




