number of different lines in a file

Kaz Kylheku kkylheku at gmail.com
Fri May 19 13:02:05 EDT 2006


Bill Pursell wrote:
> Have you tried
> cat file | sort | uniq | wc -l ?

The standard input file descriptor of sort can be attached directly to
a file. You don't need a file catenating process in order to feed it:

  sort < file | uniq | wc -l

And sort also takes a filename argument:

  sort file | uniq | wc -l

And sort has the uniq functionality built in:

  sort -u file | wc -l

Really, all this piping between little utilities is inefficient
bullshit, isn't it!  All that IPC through the kernel, copying the data.

Why can't sort also count the damn lines?

There should be one huge utility which can do it all in a single
address space.

> sort might choke on the large file, and this isn't python, but it
> might work.

Solid implementations of sort can use external storage for large files,
performing a polyphase merge sort rather than doing the entire sort
in memory.

I seem to recall that GNU sort does something like this, using
temporary files.
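
For illustration, here is a minimal Python sketch of such a merge
phase, assuming the sorted segment files already exist on disk. The
function name and paths are made up, and heapq.merge gives a plain
k-way merge rather than a true polyphase tape merge:

  import heapq

  def merge_runs(segment_paths, out_path):
      # Lazily merge already-sorted segment files; heapq.merge
      # holds only one line per open file in memory at a time.
      files = [open(p) for p in segment_paths]
      try:
          with open(out_path, "w") as out:
              out.writelines(heapq.merge(*files))
      finally:
          for f in files:
              f.close()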

Naively written Python code is a lot more likely to choke on a large
data set.
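
For example, the obvious Python version of the same count holds every
distinct line in memory at once, which is exactly where it can fall
over (the file name is assumed):

  # Build a set of every unique line; memory use grows with the
  # number of distinct lines in the file.
  with open("file") as f:
      print(len(set(f)))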

> You might try breaking the file into
> smaller pieces, maybe based on the first character, and then
> process them separately.

No, the way this is done is simply to read the file and insert the data
into an ordered data structure until memory fills up. After that, you
keep reading and inserting, but each time you insert, you also remove
the smallest element and write it out to the segment file. You keep
doing that for as long as you can extract a smallest element which is
greater than everything already written to the file. When you no
longer can, you start a new file. On randomly ordered input that
doesn't happen, on average, until about twice the contents of memory
have passed through. So for instance with half a gig of RAM, you can
produce merge segments on the order of a gig.
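
In Python, the run-generation scheme just described (it is known as
replacement selection) might look roughly like this. The function
name, the capacity record limit, and the in-memory lists standing in
for segment files are all illustrative:

  import heapq

  def replacement_selection(records, capacity):
      # Yield sorted runs from an iterable, keeping at most
      # `capacity` records in memory at any one time.
      it = iter(records)
      heap = []
      for rec in it:                 # fill memory once
          heap.append(rec)
          if len(heap) >= capacity:
              break
      heapq.heapify(heap)
      while heap:
          run, pending = [], []      # pending: too small for this run
          while heap:
              smallest = heapq.heappop(heap)
              run.append(smallest)   # really: write to the segment file
              try:
                  rec = next(it)
              except StopIteration:
                  continue
              if rec >= smallest:
                  heapq.heappush(heap, rec)  # still fits in current run
              else:
                  pending.append(rec)        # defer to the next run
          yield run
          heap = pending
          heapq.heapify(heap)

With capacity=3 and the input 5 1 8 2 9 0 3, this yields the runs
[1, 2, 5, 8, 9] and [0, 3]: runs longer than memory, which is the
point.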



