Seek the one billionth line in a file containing 3 billion lines.

Jay Loden python at jayloden.com
Wed Aug 8 03:15:05 EDT 2007


Paul Rubin wrote:
> Sullivan WxPyQtKinter <sullivanz.pku at gmail.com> writes:
>> This program:
>> for i in range(1000000000):
>>       f.readline()
>> is absolutely very slow....
> 
> There are two problems: 
> 
>  1) range(1000000000) builds a list of a billion elements in memory,
>     which is many gigabytes and probably thrashing your machine.
>     You want to use xrange instead of range, which builds an iterator
>     (i.e. something that uses just a small amount of memory, and
>     generates the values on the fly instead of precomputing a list).
> 
>  2) f.readline() reads an entire line of input which (depending on
>     the nature of the log file) could also be of very large size.
>     If you're sure the log file contents are sensible (lines up to
>     several megabytes shouldn't cause a problem) then you can do it
>     that way, but otherwise you want to read fixed size units.
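
For what it's worth, here is a rough sketch of the "fixed size units" idea
from (2): read the file in binary blocks, count newlines, and report the byte
offset where the wanted line starts, so that you can f.seek() straight to it
afterwards. The function name offset_of_line and the 1 MB default block size
are my own choices for illustration, and I have written it for a current
Python rather than the one on the original poster's machine.

```python
def offset_of_line(path, target_line, block_size=1 << 20):
    """Return the byte offset at which 1-based line `target_line` starts.

    Returns None if the file contains fewer than target_line - 1 newlines.
    The offset may equal the file size if the file ends exactly there.
    """
    if target_line == 1:
        return 0
    newlines_needed = target_line - 1   # newlines to skip before the line
    offset = 0                          # bytes consumed in earlier blocks
    with open(path, 'rb') as f:
        while True:
            block = f.read(block_size)  # never holds more than one block
            if not block:
                return None             # hit end of file too early
            found = block.count(b'\n')
            if found < newlines_needed:
                newlines_needed -= found
                offset += len(block)
                continue
            # The newline we are looking for is inside this block.
            pos = -1
            for _ in range(newlines_needed):
                pos = block.index(b'\n', pos + 1)
            return offset + pos + 1
```

Memory use stays bounded by block_size no matter how long an individual line
is, which is the point of Paul's second remark. It still has to read every
byte before the target line, so it is not a true seek.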

If we just want to iterate through the file one line at a time, why not just:

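count = 0
handle = open('hugelogfile.txt')
for line in handle:    # a file object yields one line at a time, lazily
    count = count + 1
    if count == 1000000000:    # compare with an int, not the string '1000000000'
        # do something with line here
        break
handle.close()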
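
The same loop can be written more compactly with itertools.islice, which
skips ahead in any iterator without keeping the skipped items around. This is
only a sketch: nth_line is a name I made up, and it assumes a Python recent
enough to have the with statement and the next() builtin.

```python
from itertools import islice

def nth_line(path, n):
    """Return 1-based line n of the file at path, or None if it is shorter."""
    with open(path) as handle:
        # islice(handle, n - 1, n) yields at most one item: the n-th line.
        return next(islice(handle, n - 1, n), None)
```

Usage would be something like nth_line('hugelogfile.txt', 1000000000). It
still reads every preceding line, so it is tidier than the explicit counter
but not fundamentally faster.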


My first suggestion would be to split the file into smaller, more manageable
chunks, because any type of manipulation of a multi-billion-line log file is
going to be a nightmare. For example, you could try the UNIX 'split' utility to
break the file into individual files of, say, 100000 lines each. split is likely
to be faster than anything in Python, since it is written in C and has no
interpreter overhead.
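
With split itself that would be along the lines of
"split -l 100000 hugelogfile.txt". If split is not available, the same idea
can be sketched in Python, though I would expect it to be slower. The function
name split_by_lines, the 'chunk_' prefix and the numbering scheme below are
all invented for the example; they are not what split produces.

```python
from itertools import islice

def split_by_lines(path, lines_per_chunk=100000, prefix='chunk_'):
    """Write consecutive pieces of `path`, each with at most
    lines_per_chunk lines, and return the list of file names created."""
    names = []
    with open(path, 'rb') as src:
        while True:
            # Pull the next batch of lines; only one batch is in memory.
            chunk = list(islice(src, lines_per_chunk))
            if not chunk:
                break
            name = '%s%05d' % (prefix, len(names))
            with open(name, 'wb') as dst:
                dst.writelines(chunk)
            names.append(name)
    return names
```

At 100000 lines per piece, the one billionth line is the last line of the
10000th piece (chunk_09999 with the numbering above), so after splitting you
only need to open one small file to get at it.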

Is there a reason you specifically need to get to line 1 billion, or are you
just trying to trim the file down? Do you need a value that's on that particular
line, or is there some other reason? If you can describe the use case, the
list can probably help you solve the underlying problem rather than just
finding a way to seek to the one billionth line in a file.

-Jay


