[Tutor] alphabetizing a file by lines

Magnus Lyckå magnus at thinkware.se
Sun Jul 18 01:36:58 CEST 2004


At 17:28 2004-07-17 -0400, Brian van den Broek wrote:
>the advice at the top is good. The only thing I would add is that this 
>will sort your lines by their entire contents and will perform a case 
>sensitive sort.

Another problem is that is has to read the whole file into
memory at once.

>It seemed from your problem description that you wanted only the first 
>letter taken into account regardless of case so that the lines:
>
>abc
>ABC
>azz
>acd
>
>would not be rearranged.

This is really good if the file is so large that you don't want
to read it into memory at once.

>Both the case-insensitivity and the consideration of just the first letter 
>can easily be obtained by writing a custom compare function to pass in 
>with the sort method call.
>
>The case insensitivity part can be accomplished like so:
>
>def alphabetical_sort(first, second):
>     return cmp(first.lower(), second.lower())
>
>(You could easily extend this to look at just the leading n-characters of 
>the two strings being compared.)
>
>You'd use it like so:
>
>lines = open(file).readlines()
>lines.sort(alphabetical_sort)
>print lines

the problem with this is that it's slow for big lists,
since you have to call a Python function over and over
from inside the sort function.

The common approach is to do something like this.

sort_list = []
for line in open(file):
     sort_list.append((line[0].lower(), line))
sort_list.sort()
for first_char, line in sort_list:
     print line

In the coming Python 2.4, there will be more features added
to the sort method which will make this simpler.

Here we still have a problem with large files, but if we
only need to sort on the first character, that's easy to
overcome.

Just do something like this (untested) to store lines for
each starting letter in a separate file and cat the files
together in the end:

files = {}

def store(line):
     name = line[0].lower()
     if not name in files:
         f = open(name, 'w+')
         files[name] = f
     files[name].write(line+'\n')

for line in open(file):
     line = line.rstrip()
     if line:
         store(line)

file_names = files.keys()
file_names.sort()

big_file = open('big_file.txt', 'w')
for file_name in file_names:
         files[file_name].seek(0)
         chunk = files[file_name].read()
         big_file.write(chunk)
big_file.close()

This approach never reads need to use up more memory than
required for all lines starting with one letter. For a more
memory conservative approach, make an inner loop in the
last loop, and read one line at a time, or a few lines
using something like files[file_name].readlines(100).

If I recall correctly, Bentley's "Programming Pearls" has
a chapter on something like this.


--
Magnus Lycka (It's really Lyckå), magnus at thinkware.se
Thinkware AB, Sweden, www.thinkware.se
I code Python ~ The Agile Programming Language 



More information about the Tutor mailing list