How to count lines in a text file?

Wed Sep 22 09:37:37 EDT 2004

Christos TZOTZIOY Georgiou <tzot at sil-tec.gr> wrote:
   ...
> >memory at once.  If you must be able to deal with humungous files, too
> >big to fit in memory at once, try something like:
> >
> >numlines = 0
> >for line in open('text.txt'): numlines += 1
> 
> And a short story of premature optimisation follows...

Thanks for sharing!

> def count_lines(filename):
>     fp = open(filename)
>     count = 1 + max(enumerate(fp))[0]
>     fp.close()
>     return count

Cute, actually!

> containing Alex' code.  Guess what?  My code was slower... (and I should
> put a try: except ValueError: clause to cater for empty files)
> 
> Of course, on second thought, the reason must be that enumerate
> generates one tuple for every line in the file; in any case, I'll mark

I thought built-ins could recycle their tuples, sometimes, but you may
in fact be right (we should check with Raymond Hettinger, though).
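
On the empty-file caveat in the quoted text: a minimal sketch of the
same enumerate-based idea in present-day Python 3 (an assumption; the
thread itself predates it).  max() raises ValueError when handed an
empty sequence, which is what the guard catches.  Opening in binary
mode is my choice here, to sidestep decoding; it is not from the
original post.

```python
def count_lines(filename):
    """Count lines as 1 + the highest index produced by enumerate()."""
    with open(filename, "rb") as fp:
        try:
            # enumerate yields (index, line) pairs; indices are unique,
            # so max() effectively picks the pair with the last index.
            return 1 + max(enumerate(fp))[0]
        except ValueError:
            # max() of an empty sequence: the file has no lines at all.
            return 0
```

A final line without a trailing newline still counts as a line with
this approach.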

With 2.4, I measure 30 msec with your approach, and 24 with mine, to
count the 45425 lines of /usr/share/dict/words on my Linux box
(admittedly not a great example of 'humungous file'); and similarly on
kjv.txt, a King James Bible (31103 lines, but 10 times the size of the
words file), 41 msec with yours, 36 with mine.  They're pretty close.  At
least they beat len(file(...).readlines()), which takes 33 msec on words,
62 on kjv.txt...
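
For anyone wanting to repeat this kind of comparison, here is a rough
sketch of a timing harness in present-day Python 3, using the standard
timeit module.  The helper names (count_loop, count_enumerate,
count_readlines, best_msec, make_sample) and the generated sample file
are all hypothetical stand-ins, not what was used for the numbers
above, and results will differ by machine and Python version.

```python
import os
import tempfile
import timeit

def count_loop(path):
    # Plain for-loop over the file object, one increment per line.
    n = 0
    with open(path, "rb") as fp:
        for _ in fp:
            n += 1
    return n

def count_enumerate(path):
    # Highest enumerate() index plus one; default covers an empty file.
    with open(path, "rb") as fp:
        return 1 + max(enumerate(fp), default=(-1, b""))[0]

def count_readlines(path):
    # Reads the whole file into a list of lines, then takes its length.
    with open(path, "rb") as fp:
        return len(fp.readlines())

def best_msec(func, path, repeat=5, number=10):
    # Best-of-`repeat` average time per call, in milliseconds.
    timings = timeit.repeat(lambda: func(path), repeat=repeat, number=number)
    return 1000.0 * min(timings) / number

def make_sample(lines=50000):
    # Write a throwaway file with `lines` short lines; return its path.
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as fp:
        for i in range(lines):
            fp.write(b"word%d\n" % i)
    return path

if __name__ == "__main__":
    path = make_sample()
    try:
        for func in (count_loop, count_enumerate, count_readlines):
            print("%s: %d lines, %.2f msec"
                  % (func.__name__, func(path), best_msec(func, path)))
    finally:
        os.remove(path)
```

Taking the minimum over several repeats is the usual way to damp noise
from other processes; the first run also warms the OS file cache.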

If one is really in a hurry counting lines, a dedicated C extension
might help.  E.g.:

static PyObject *count(PyObject *self, PyObject *args)
{
    PyObject* seq;
    PyObject* item;
    int result;

    /* get one argument as an iterator */
    if(!PyArg_ParseTuple(args, "O", &seq))
        return 0;
    seq = PyObject_GetIter(seq);
    if(!seq)
        return 0;

    /* count items */
    result = 0;
    while((item=PyIter_Next(seq))) {
        result += 1;
        Py_DECREF(item);
    }    

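    /* clean up and return result; PyIter_Next also returns NULL on
       error, so propagate any pending exception */
    Py_DECREF(seq);
    if(PyErr_Occurred())
        return 0;
    return Py_BuildValue("i", result);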
}

Using this count-items-in-iterable thingy, words takes 10 msec, kjv
takes 26.

Happier news is that one does NOT have to learn C to gain this.
Consider the Pyrex file:

def count(seq):
    cdef int i
    it = iter(seq)
    i = 0
    for x in it:
        i = i + 1
    return i

pyrexc'ing this and building the Python extension from the resulting C
file gives just about the same performance as the pure-C coding: 10 msec
on words, 26 on kjv, within 1% of the pure-C numbers (there is a
systematic speedup of a bit less than 1% for the C-coded function).

And if one doesn't even want to use pyrex?  Why, that's what psyco is
for...:

import psyco
def count(seq):
    it = iter(seq)
    i = 0
    for x in it:
        i = i + 1
    return i
psyco.bind(count)

Again to the same level of precision, the SAME numbers, 10 and 26 msec
(actually, in this case the less-than-1% systematic bias is in favour of
psyco compared to pure-C coding...!-)

So: your instinct that C-coded loops are faster wasn't too badly off...
and you can get the same performance (just about) with Pyrex or (on an
Intel or compatible processor, only -- sigh) with psyco.
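
In the same spirit, a sketch for present-day Python 3 (an assumption,
and a different technique from the three above): the counting loop can
be pushed into C-coded standard-library pieces without any extension
or JIT.  The name ilen is hypothetical, and no timings are claimed for
it here.

```python
from collections import deque
from itertools import count

def ilen(iterable):
    """Count the items of any iterable, consuming it in the process."""
    counter = count()
    # zip() pulls from `iterable` first and stops as soon as it is
    # exhausted, so `counter` advances exactly once per item.
    # A deque with maxlen=0 discards everything it is fed.
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)
```

Used on a file, ilen(open('text.txt', 'rb')) would count its lines;
the argument order inside zip() matters, since swapping it would
advance the counter one extra time.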

Alex