Seek the one billionth line in a file containing 3 billion lines.

Chris Mellon arkanes at gmail.com
Wed Aug 8 09:58:50 EDT 2007


On 8/8/07, Ben Finney <bignose+hates-spam at benfinney.id.au> wrote:
> Sullivan WxPyQtKinter <sullivanz.pku at gmail.com> writes:
>
> > On Aug 8, 2:35 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> > > Sullivan WxPyQtKinter <sullivanz.... at gmail.com> writes:
> > > > This program:
> > > > for i in range(1000000000):
> > > >       f.readline()
> > > > is extremely slow....
> > >
> > > There are two problems:
> > >
> > >  1) range(1000000000) builds a list of a billion elements in memory
> [...]
> > >
> > >  2) f.readline() reads an entire line of input
> [...]
> >
> > Thank you for pointing out these two problems. I wrote this program
> > just to show how inefficient it is to use a seemingly NATIVE way to
> > seek in such a big file. No other intention....
>
> The native way isn't iterating over 'range(hugenum)', it's to use an
> iterator. Python file objects are iterable, only reading each line as
> needed and not creating a companion list.
>
>     logfile = open("foo.log", 'r')
>     for line in logfile:
>         do_stuff(line)
>
> This at least avoids the 'range' issue.
>
> To know when we've reached a particular line, use 'enumerate' to
> number each item as it comes out of the iterator.
>
>     logfile = open("foo.log", 'r')
>     target_line_num = 10**9 - 1  # enumerate numbers lines from 0
>     for (line_num, line) in enumerate(logfile):
>         if line_num < target_line_num:
>             continue
>         else:
>             do_stuff(line)
>             break
>
> As for reading each line: that's unavoidable if you want a specific
> line from a stream of variable-length lines.
>

The minimum size of a line is one byte (the newline), and maybe more,
depending on your data. You can seek() forward by the minimum number
of bytes that (1 billion - 1) lines will consume and save yourself
some wasted I/O.
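
For illustration, here is a minimal sketch of that idea (not code from
the thread). One caveat worth making explicit: a single seek() lands
you exactly on the target line only when every line has the same known
length; with merely a minimum length you land at or before the target
line, with no way to tell which line you are on. The sketch therefore
assumes fixed-width records, with RECORD_LEN as a hypothetical value:

    RECORD_LEN = 80   # assumption: every line is exactly 80 bytes,
                      # newline included
    TARGET = 10**9    # 1-based number of the line we want

    f = open("foo.log", "rb")
    f.seek((TARGET - 1) * RECORD_LEN)  # jump straight to the line's first byte
    line = f.readline()                # read only the billionth line
    f.close()

For truly variable-length lines, the cheaper win is to skip lines by
reading large chunks and counting newlines with chunk.count("\n"),
which keeps the counting in C instead of making a billion readline()
calls.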
