number of different lines in a file

Kaz Kylheku kkylheku at gmail.com
Fri May 19 15:19:08 EDT 2006


Paddy wrote:
> If the log has a lot of repeated lines in its original state then
> running uniq twice, once up front to reduce what needs to be sorted,
> might be quicker?

Having the uniq and sort steps integrated into a single program allows
for the most optimization opportunities.

The sort utility, under -u, could squash duplicate lines on the input
side /and/ the output side.
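
For instance (just a sketch, assuming a sort that supports the standard
-u option; log_file is the file name from the quoted command), the whole
job collapses to a two-stage pipeline:

  sort -u log_file | wc -l

One process does both the sorting and the squashing, so the only data
crossing a pipe is the already-deduplicated output on its way to wc.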

>  uniq log_file | sort | uniq | wc -l

Now you have two more pipeline elements, two more tasks running, and
four more copies of the data being made as it travels through two extra
pipes in the kernel: each extra pipe normally costs one copy on the
write into the kernel buffer and another on the read out of it.

Or only two more copies, if you are lucky enough to have a "zero copy"
pipe implementation which allows data to go from the writer's buffer
directly to the reader's without intermediate kernel buffering.
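
If you want to check whether any of this matters for a given log,
bash's time keyword times an entire pipeline, so a rough comparison
(same hypothetical log_file) is just:

  time uniq log_file | sort | uniq | wc -l
  time sort -u log_file | wc -l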



