[Tutor] sorting a 'large' datafile

Bob Gailer bgailer@alum.rpi.edu
Tue Jul 22 12:41:49 2003


At 03:59 PM 7/22/2003 +0200, Wilhelmsen Jan wrote:

>I need to sort records in a datafile by specific positions in the file.
>The records in the datafile look like this:
>
>3000010711004943      200211052002110520021120015980050378
>3000010711004943      200211052002110520021120015980050378
>3200010711004943      200211052002110520021120015980050378
>3100010711004943      200211052002110520021120015980050378
>3000010711004943      200211052002110520021120015980050378
>3000014211015894      200211052002110520021205030700161
>3200014211015894      000000000451606+24             BD30599
>3200014211015894      000000000158000+24             BD30599
>3200014211015894      000000000033600+24             BD30599
>3200014211015894      000000000025900+24             BD30599
>3100014211015894      000000000038400+24             BD30599
>
>
>I've created two variables for the fields I want to sort by:
>         rt = line[:2]
>         dok = line[6:16]
>First I want to sort by 'dok' then by 'rt'.
>Do I have to read the whole file into memory before I can begin sorting 
>the records?
>I tried to read in all the lines but everything crashes when I do.
>The file is only 250 KB, so this should be possible, or am I wrong?
>Can anyone give me some tips?

Consider loading the records into an SQLite database. It handles large 
files and can return the rows sorted very efficiently.
http://www.sqlite.org for the database.
http://pysqlite.sourceforge.net/ for the Python wrapper.
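
As a rough illustration of that approach, here is a minimal sketch. It 
assumes Python's built-in sqlite3 module (the standard-library descendant 
of pysqlite) rather than the wrapper linked above. The function name 
sort_records, the table name records, and its column names are made up for 
this example; only the slices line[6:16] and line[:2] come from the 
question.

```python
import sqlite3


def sort_records(lines):
    """Return the records ordered by dok (line[6:16]), then rt (line[:2])."""
    # ":memory:" keeps the table in RAM; pass a file path instead to let
    # SQLite keep a really large dataset on disk.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (dok TEXT, rt TEXT, line TEXT)")
    conn.executemany(
        "INSERT INTO records VALUES (?, ?, ?)",
        # Skip blank lines; slice out the two sort keys from each record.
        ((line[6:16], line[:2], line) for line in lines if line.strip()),
    )
    # rowid as the last key keeps records with equal keys in input order.
    rows = conn.execute(
        "SELECT line FROM records ORDER BY dok, rt, rowid"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

To use it on a file, strip the line endings first, for example 
sort_records(l.rstrip("\n") for l in open("datafile.txt")), where 
datafile.txt is a placeholder name.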

Bob Gailer
bgailer@alum.rpi.edu
303 442 2625

