number of different lines in a file

Fri May 19 12:54:21 EDT 2006

Bill Pursell wrote:
> Have you tried
> cat file | sort | uniq | wc -l ?

The standard input file descriptor of sort can be attached directly to
a file. You don't need a file catenating process in order to feed it:

  sort < file | uniq | wc -l

Sort has the uniq functionality built in:

  sort -u < file | wc -l

> sort might choke on the large file, and this isn't python, but it
> might work.

Solid implementations of sort can use external storage for large files
and perform an external merge sort (a polyphase merge, for example)
rather than doing the entire sort in memory.

I seem to recall that GNU sort does something like this, using
temporary files.
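
To make the idea concrete, here is a rough Python sketch of that kind
of external sort, used to count distinct lines. It is only an
illustration under my own assumptions, not how GNU sort is actually
implemented: the function name count_distinct_lines and the
max_lines_in_memory parameter are made up for this example, and a real
tool would budget memory in bytes and cap the number of runs merged at
once.

```python
import heapq
import tempfile


def count_distinct_lines(lines, max_lines_in_memory=100_000):
    """Count distinct lines via sorted temporary runs and a k-way merge."""
    runs = []   # open temporary files, each holding one sorted run
    chunk = []  # lines currently held in memory

    def flush():
        # Sort the in-memory chunk and spill it to a temporary file.
        chunk.sort()
        run = tempfile.TemporaryFile(mode="w+", encoding="utf-8", newline="\n")
        run.writelines(line + "\n" for line in chunk)
        run.seek(0)
        runs.append(run)
        chunk.clear()

    for line in lines:
        chunk.append(line.rstrip("\n"))
        if len(chunk) >= max_lines_in_memory:
            flush()
    if chunk:
        flush()

    # heapq.merge streams the sorted runs lazily, so only about one line
    # per run is held in memory during the merge.
    count = 0
    previous = None
    for line in heapq.merge(*runs):
        if line != previous:
            count += 1
            previous = line

    for run in runs:
        run.close()
    return count
```

Called as count_distinct_lines(open("file")), it plays the role of
sort -u < file | wc -l, with the caveat that every run stays open
during the merge, so a very large input with a small chunk size could
run into the per-process limit on open files.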

Naively written Python code is a lot more likely to choke on a large
data set.
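
By "naively written" I mean something along these lines (the function
name is made up for illustration). It is perfectly fine for small
files, but it keeps every distinct line in memory at once:

```python
def count_distinct_lines_naive(path):
    """Count distinct lines by loading them all into a set."""
    with open(path, encoding="utf-8") as f:
        # The set grows with the number of distinct lines, so memory use
        # is bounded only by the size of the input.
        return len(set(f))
```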

> You might try breaking the file into
> smaller pieces, maybe based on the first character, and then
> process them separately.

No, the way this is done is simply to read the file and insert the data
into an ordered data structure until memory fills up. After that, you
keep reading the file and inserting, but each time you insert, you
remove the smallest element that is not less than the last one written
and write it out to the segment file. You keep doing that until no such
element is left, that is, until everything in memory is smaller than
what has already been written to the file. When that happens, you start
a new file. On randomly ordered input, that typically does not happen
until you have read about twice as much data as fits in memory. So for
instance with half a gig of RAM, you can expect merge segments on the
order of a gig.
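
A small Python sketch of that run-generation step (usually called
replacement selection) might look like the following. The names
replacement_selection_runs and memory_slots are my own, and memory is
counted in items rather than bytes to keep things short. For
simplicity the sketch collects each run in a list; a real sort would
write each element to the segment file as it is removed.

```python
import heapq
from itertools import islice


def replacement_selection_runs(items, memory_slots):
    """Yield sorted runs (as lists) built by replacement selection."""
    source = iter(items)
    # Fill "memory" once. Entries are (run_number, value), so elements
    # held back for the next run sort after everything in the current run.
    heap = [(0, value) for value in islice(source, memory_slots)]
    heapq.heapify(heap)

    missing = object()  # sentinel marking an exhausted input
    current_run = 0
    run = []
    while heap:
        run_number, smallest = heapq.heappop(heap)
        if run_number != current_run:
            # Nothing left in memory can extend the current run.
            yield run
            run = []
            current_run = run_number
        run.append(smallest)

        # Replace the element just written with the next input element.
        value = next(source, missing)
        if value is not missing:
            if value >= smallest:
                heapq.heappush(heap, (current_run, value))
            else:
                # Too small for this run; hold it back for the next one.
                heapq.heappush(heap, (current_run + 1, value))
    if run:
        yield run
```

With three slots, the input 5 1 4 2 8 3 9 7 6 0 comes out as the runs
[1, 2, 4, 5, 8, 9], [3, 6, 7] and [0], so the first run is already
twice the size of memory. Input that is already sorted yields a single
run, while reverse-sorted input is the worst case and gives runs no
longer than memory.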