Seek the one billionth line in a file containing 3 billion lines.

Ben Finney bignose+hates-spam at benfinney.id.au
Wed Aug 8 06:57:58 EDT 2007


Sullivan WxPyQtKinter <sullivanz.pku at gmail.com> writes:

> On Aug 8, 2:35 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> > Sullivan WxPyQtKinter <sullivanz.... at gmail.com> writes:
> > > This program:
> > > for i in range(1000000000):
> > >       f.readline()
> > > is absolutely very slow....
> >
> > There are two problems:
> >
> >  1) range(1000000000) builds a list of a billion elements in memory
[...]
> >
> >  2) f.readline() reads an entire line of input
[...]
> 
> Thank you for pointing out these two problems. I wrote this program
> just to show how inefficient it is to use a seemingly NATIVE way to
> seek in such a big file. No other intention.

The native way isn't to iterate over 'range(hugenum)'; it's to use an
iterator. Python file objects are already iterable, reading each line
only as it's needed and never building a companion list.

    logfile = open("foo.log", 'r')
    for line in logfile:
        do_stuff(line)

This at least avoids the 'range' issue.

To know when we've reached a particular line, use 'enumerate' to
number each item as it comes out of the iterator.

    logfile = open("foo.log", 'r')
    # enumerate counts from zero, so the billionth line has index 10**9 - 1.
    target_line_num = 10**9 - 1
    for (line_num, line) in enumerate(logfile):
        if line_num < target_line_num:
            continue
        else:
            do_stuff(line)
            break
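
The same skip-then-read can be written more compactly with
'itertools.islice', which discards the unwanted lines for you. A
sketch, using the same hypothetical 'do_stuff' placeholder as above:

    import itertools

    logfile = open("foo.log", 'r')
    target_line_num = 10**9 - 1
    # islice skips the first target_line_num lines, then yields the next one.
    for line in itertools.islice(logfile, target_line_num, target_line_num + 1):
        do_stuff(line)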

As for reading each line: that's unavoidable if you want a specific
line from a stream of variable-length lines.
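
If you'll be looking up lines in the same file repeatedly, you can pay
that sequential cost once and record where each line starts. A sketch,
with illustrative names only; note that an index of a billion offsets
is itself several gigabytes of Python objects:

    # One sequential pass to record each line's starting byte offset.
    # Binary mode keeps len(line) equal to the bytes actually read.
    offsets = []
    offset = 0
    logfile = open("foo.log", 'rb')
    for line in logfile:
        offsets.append(offset)
        offset += len(line)
    logfile.close()

    # Later look-ups can seek straight to any recorded position.
    logfile = open("foo.log", 'rb')
    logfile.seek(offsets[10**9 - 1])
    do_stuff(logfile.readline())
    logfile.close()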

-- 
 \      "I have never made but one prayer to God, a very short one: 'O |
  `\       Lord, make my enemies ridiculous!' And God granted it."  -- |
_o__)                                                         Voltaire |
Ben Finney


