Seek the one billionth line in a file containing 3 billion lines.

Steve Holden steve at holdenweb.com
Wed Aug 8 11:22:54 EDT 2007


Chris Mellon wrote:
> On 8/8/07, Ben Finney <bignose+hates-spam at benfinney.id.au> wrote:
>> Sullivan WxPyQtKinter <sullivanz.pku at gmail.com> writes:
>>
>>> On Aug 8, 2:35 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
>>>> Sullivan WxPyQtKinter <sullivanz.... at gmail.com> writes:
>>>>> This program:
>>>>> for i in range(1000000000):
>>>>>       f.readline()
>>>>> is absolutely very slow....
>>>> There are two problems:
>>>>
>>>>  1) range(1000000000) builds a list of a billion elements in memory
>> [...]
>>>>  2) f.readline() reads an entire line of input
>> [...]
>>> Thank you for pointing out these two problems. I wrote this program
>>> just to show how inefficient it is to use a seemingly NATIVE way
>>> to seek in such a big file. No other intention........
>> The native way isn't iterating over 'range(hugenum)', it's to use an
>> iterator. Python file objects are iterable, only reading each line as
>> needed and not creating a companion list.
>>
>>     logfile = open("foo.log", 'r')
>>     for line in logfile:
>>         do_stuff(line)
>>
>> This at least avoids the 'range' issue.
>>
>> To know when we've reached a particular line, use 'enumerate' to
>> number each item as it comes out of the iterator.
>>
>>     logfile = open("foo.log", 'r')
>>     target_line_num = 10**9
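>>     # Iterate over 'logfile' (not the builtin 'file'); count from 1
>>     # so that line_num == 10**9 is the one billionth line.
>>     for (line_num, line) in enumerate(logfile, 1):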
>>         if line_num < target_line_num:
>>             continue
>>         else:
>>             do_stuff(line)
>>             break
>>
>> As for reading each line: that's unavoidable if you want a specific
>> line from a stream of variable-length lines.
>>
> 
> The lower bound on a line's length is one byte (the newline), and
> maybe more, depending on your data. You can seek() forward by the
> minimum number of bytes that (1 billion - 1) lines will consume and
> save yourself some wasted IO.

Except that you will have to count the number of lines in that first 
billion characters in order to determine when to stop.
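
To make that concrete: a byte offset alone does not say which line you
have landed on, so every byte skipped still has to be scanned for
newlines. Below is a minimal sketch of doing that counting in large
binary blocks instead of one readline() call per line. It assumes
Python 3 and a file opened in binary mode with "\n" line endings; the
names read_nth_line and chunk_size are made up for illustration and do
not come from this thread.

```python
import io


def read_nth_line(f, n, chunk_size=1 << 20):
    """Return the n-th line (1-based) of binary file f, or None if absent.

    Hypothetical helper: counts newlines block by block, so the whole
    prefix of the file is still read, just in fewer, larger reads.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    f.seek(0)
    to_skip = n - 1  # newlines that precede the wanted line
    while to_skip > 0:
        start = f.tell()
        chunk = f.read(chunk_size)
        if not chunk:
            return None  # ran out of file before reaching line n
        found = chunk.count(b"\n")
        if found < to_skip:
            to_skip -= found
            continue
        # The last newline to skip lies inside this chunk; locate it.
        idx = -1
        for _ in range(to_skip):
            idx = chunk.index(b"\n", idx + 1)
        f.seek(start + idx + 1)  # first byte of line n
        to_skip = 0
    line = f.readline()
    return line if line else None


# Small in-memory demonstration.
demo = io.BytesIO(b"first\nsecond\nthird\n")
print(read_nth_line(demo, 2))  # b'second\n'
```

This only lowers the per-line overhead; the amount of data read up to
the target line is the same as with plain iteration.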

regards
  Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC/Ltd           http://www.holdenweb.com
Skype: holdenweb      http://del.icio.us/steve.holden