number of different lines in a file

Tim Chase python.list at tim.thechases.com
Fri May 19 12:04:51 EDT 2006


 > I actually had this problem a couple of weeks ago when I
 > discovered that my son's .Xsession file was 26 GB and had
 > filled the disk partition (!).  Apparently some games he was
 > playing were spewing out a lot of errors, and I wanted to find
 > out which ones were at fault.
 >
 > Basically, uniq died on this task (well, it probably was
 > working, but not completed after over 10 hours).  I was using
 > it something like this:
 >
 > cat Xsession.errors | uniq > Xsession.uniq

A couple things I noticed that may play into matters:

1) uniq is a dedicated tool for the task of uniquely identifying
*neighboring* lines in the file.  It doesn't get much faster than
that, *if* that's your input.  This leads to #4 below.

2) (incidentally) you have a superfluous use of cat.  I don't
know if that's bogging matters down, but you can just use

     uniq < Xsession.errors > Xsession.uniq

which would save you from having each line touched twice...once
by cat, and once by uniq.

3) as "uniq" doesn't report on its progress, if it's processing a
humongous 26 gig file, it may just sit there churning for a long
time before finishing.  It looks like it may have taken >10hr :)

4) "uniq" requires sorted input.  Unless you've sorted your
Xsession.errors before-hand, your output isn't likely to be as
helpful.  The python set/generator scheme may work well to keep
you from having to sort matters first--especially if you only
have a fairly scant handful of unique errors.
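The adjacent-only behavior of uniq is easy to demonstrate in
Python.  Just as a sketch (the sample lines here are made up),
itertools.groupby collapses only *runs* of identical neighboring
items, exactly like uniq, while a set catches duplicates no
matter where they sit:

```python
from itertools import groupby

def adjacent_unique(lines):
    # Like uniq: collapse only runs of identical *neighboring* lines.
    return [key for key, _group in groupby(lines)]

sample = ["err A", "err A", "err B", "err A"]
collapsed = adjacent_unique(sample)   # ['err A', 'err B', 'err A']
truly_unique = sorted(set(sample))    # ['err A', 'err B']
```

Note that "err A" survives twice in the uniq-style result because
the two runs aren't adjacent--which is why unsorted input gives
misleading output.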

5) I presume wherever you were writing Xsession.uniq had enough
space...you mentioned your son filling your HDD.  It may gasp,
wheeze and die if there wasn't enough space...or it might just
hang.  I'd hope it would be smart enough to gracefully report
"out of disk-space" errors in the process.

6)  unless I'm experiencing trouble, I just tend to keep my
.xsession-errors file as a soft-link to /dev/null, especially as
(when I use KDE rather than Fluxbox) KDE likes to spit out
mountains of KIO file errors.  It's easy enough to unlink it and
let it generate the file if needed.

7)  With a file this large, you most certainly want to use a
generator scheme rather than trying to load all the lines in
the file into memory at once :)  (Note to Bruno...yes, *this*
would be one of those places you mentioned to me earlier about
*not* using readlines() ;)

If you're using 2.3.x and don't have 2.4's nice
generator-expression syntax for

     len(set(line.strip() for line in file("xsession.errors")))

you should be able to bypass reading the whole file into memory
(and make use of sets) with

	from sets import Set as set
	s = set()
	for line in file("xsession.errors"):
		s.add(line.strip())
	print len(s)

In your case, you likely don't have to call strip() and can just
get away with adding each line to the set.
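And since the original goal was finding *which* errors were at
fault, a frequency tally is only a small step up from the set.
This is just a sketch (the tally helper and sample lines are made
up), but it's one pass over the file, with memory proportional to
the number of distinct lines rather than the 26 gigs:

```python
def tally(lines):
    # One dict entry per distinct line; memory scales with the
    # number of *unique* lines, not the size of the file.
    counts = {}
    for line in lines:
        key = line.strip()
        counts[key] = counts.get(key, 0) + 1
    return counts

counts = tally(["boom", "boom", "ok"])
# Distinct lines, most frequent first -- worst offenders on top.
worst_first = sorted(counts, key=counts.get, reverse=True)
```

In practice you'd pass file("xsession.errors") instead of the
sample list, and the top few entries of worst_first would point
straight at the chattiest game.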

Just a few ideas for the next time you have a multi-gig
Xsession.errors situation :)

-tkc







