Seek the one billionth line in a file containing 3 billion lines.

Wed Aug 8 11:26:51 EDT 2007

On 8/8/07, Steve Holden <steve at holdenweb.com> wrote:
> Chris Mellon wrote:
> > On 8/8/07, Ben Finney <bignose+hates-spam at benfinney.id.au> wrote:
> >> Sullivan WxPyQtKinter <sullivanz.pku at gmail.com> writes:
> >>
> >>> On Aug 8, 2:35 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> >>>> Sullivan WxPyQtKinter <sullivanz.... at gmail.com> writes:
> >>>>> This program:
> >>>>> for i in range(1000000000):
> >>>>>       f.readline()
> >>>>> is absolutely very slow....
> >>>> There are two problems:
> >>>>
> >>>>  1) range(1000000000) builds a list of a billion elements in memory
> >> [...]
> >>>>  2) f.readline() reads an entire line of input
> >> [...]
> >>> Thank you for pointing out these two problems. I wrote this program
> >>> just to show how inefficient it is to use a seemingly NATIVE way
> >>> to seek in such a big file. No other intention........
> >> The native way isn't iterating over 'range(hugenum)', it's to use an
> >> iterator. Python file objects are iterable, only reading each line as
> >> needed and not creating a companion list.
> >>
> >>     logfile = open("foo.log", 'r')
> >>     for line in logfile:
> >>         do_stuff(line)
> >>
> >> This at least avoids the 'range' issue.
> >>
> >> To know when we've reached a particular line, use 'enumerate' to
> >> number each item as it comes out of the iterator.
> >>
> >>     logfile = open("foo.log", 'r')
> >>     target_line_num = 10**9
> >>     # Iterate over 'logfile'; count from 1 so line_num is the real line number.
> >>     for (line_num, line) in enumerate(logfile, start=1):
> >>         if line_num < target_line_num:
> >>             continue
> >>         else:
> >>             do_stuff(line)
> >>             break
> >>
> >> As for reading each line: that's unavoidable if you want a specific
> >> line from a stream of variable-length lines.
> >>
> >
> > The minimum length of a line is one byte (the newline), and maybe
> > more, depending on your data. You can seek() forward by the minimum
> > number of bytes that (1 billion - 1) lines will consume and save
> > yourself some wasted IO.
>
> Except that you will have to count the lines in those first
> billion characters in order to determine when to stop.
>

True. Perhaps you can tell from the data itself what line you want.
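
Failing that, the counting can at least be done on large binary blocks
rather than one readline() call per line. A rough sketch, assuming "\n"
line endings and a file opened in binary mode. The names find_line and
block_size are made up here for illustration, not part of any library,
and I have not timed this against the plain iterator approach:

```python
import io


def find_line(f, target_line_num, block_size=1 << 20):
    """Return line number target_line_num (counting from 1) as bytes.

    f must be a seekable file object opened in binary mode.
    Returns None if the file has fewer lines than requested.
    """
    f.seek(0)
    if target_line_num == 1:
        return f.readline() or None

    lines_seen = 0  # newlines passed in all earlier blocks
    offset = 0      # byte offset at which the current block starts
    while True:
        block = f.read(block_size)
        if not block:
            return None  # ran out of file before reaching the target
        newlines = block.count(b"\n")
        if lines_seen + newlines >= target_line_num - 1:
            # The newline ending line (target - 1) is inside this block.
            pos = -1
            for _ in range(target_line_num - 1 - lines_seen):
                pos = block.find(b"\n", pos + 1)
            # The target line starts right after that newline.
            f.seek(offset + pos + 1)
            return f.readline() or None
        lines_seen += newlines
        offset += len(block)


if __name__ == "__main__":
    sample = io.BytesIO(b"first\nsecond\nthird\n")
    print(find_line(sample, 2))
```

This still reads every byte before the target line, so it does not get
around Steve's point. It only swaps a billion readline() calls for far
fewer large read() calls.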