file IO

Jeff Epler jepler at unpythonic.net
Mon Aug 2 22:34:47 EDT 2004


Are you using Windows?  If so, the answer is almost certainly
"something to do with carriage returns and binary vs. text mode".  The
missing trailing newline on the last line of your example can cause
additional trouble.  In my tests on Unix with stdio, mmap, and StringIO
I never got a 4-byte file, but Windows might give you the file "a\r\nb"
when viewed in binary mode and "a\nb" when viewed in text mode.
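
To make the text-versus-binary difference concrete, here is a minimal
sketch.  It assumes a current Python 3, where open() accepts a newline
argument (that did not exist in 2004), and the helper name
write_and_read is purely illustrative.  It writes the four bytes
a, CR, LF, b to a temporary file and reads them back both ways:

```python
import os
import tempfile

def write_and_read(raw):
    """Write raw bytes to a temp file; return (binary view, text view)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        # Binary mode returns the bytes exactly as stored on disk.
        with open(path, "rb") as f:
            as_binary = f.read()
        # newline=None (the default) turns "\r\n" into "\n" on every platform.
        with open(path, "r", newline=None, encoding="ascii") as f:
            as_text = f.read()
    finally:
        os.remove(path)
    return as_binary, as_text

binary_view, text_view = write_and_read(b"a\r\nb")
print(repr(binary_view), len(binary_view))  # b'a\r\nb' 4
print(repr(text_view), len(text_view))      # 'a\nb' 3
```

The same file is 4 bytes long in binary mode and 3 characters long in
text mode, which is the kind of mismatch that tends to surprise people.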

I doubt that the mmap module's readline knows whether the file was
opened in universal newline mode.  Universal newlines are a feature of
Python's own file objects, while mmap works from a bare file descriptor.
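
A quick way to check this is to map a file containing "\r\n" and look
at what readline hands back.  The following is only a sketch for a
current Python 3; the helper name mmap_lines is made up for the
example, and it uses access=ACCESS_READ rather than prot=PROT_READ so
that it is not limited to Unix:

```python
import mmap
import os
import tempfile

def mmap_lines(raw):
    """Return the lines mmap.readline() yields for a file holding raw bytes."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        lines = []
        with open(path, "rb") as f:
            # Length 0 maps the whole file; the file must not be empty.
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                while True:
                    line = m.readline()
                    if not line:  # b"" signals end of the mapping
                        break
                    lines.append(line)
            finally:
                m.close()
    finally:
        os.remove(path)
    return lines

print(mmap_lines(b"a\r\nb"))  # [b'a\r\n', b'b']
```

The carriage return survives, so any newline translation has to be
done by the caller.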

On Unix, I don't find that a "while" loop with mmap.readline is any
faster than a "for" loop over a file; in the runs below it is roughly
three times slower:

[45426 lines, 409305 bytes]
$ timeit -s "..." "readspeed.read_stdio('/usr/share/dict/words')"
10 loops, best of 3: 34.9 msec per loop
$ timeit -s "..." "readspeed.read_mmap('/usr/share/dict/words')"
10 loops, best of 3: 107 msec per loop

[363416 lines, 3274440 bytes]
$ time python -c "import readspeed; readspeed.read_stdio('biggerfile.txt')"
real 0.372s  user 0.331s  sys 0.031s
$ time python -c "import readspeed; readspeed.read_mmap('biggerfile.txt')"
real 1.080s  user 1.013s  sys 0.021s

[2907328 lines, 26195520 bytes]
$ time python -c "import readspeed; readspeed.read_stdio('biggerfile.txt')"
real 2.603s  user 2.308s  sys 0.157s
$ time python -c "import readspeed; readspeed.read_mmap('biggerfile.txt')"
real 8.514s  user 7.893s  sys 0.153s

I didn't have any "bigger-than-RAM text files" around to test.

Testing "biggerfile.txt" (presumably the 26195520-byte version, which
took real 2.603s in default mode) with mode "rU" gives real 3.110s, so
there is some penalty from using universal newlines.

------------------------------------------------------------------------
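# readspeed.py
# PROT_READ is only available on Unix builds of the mmap module.
from mmap import mmap, PROT_READ
import os

def consume(iterable):
    # Exhaust the iterable, discarding every item.
    for j in iterable: pass

def read_stdio(filename):
    # Read line by line using the file object's own iterator.
    f = open(filename) # open(filename, "rU") is slightly slower
    consume(f)
    f.close()

def read_mmap(filename):
    # Map the whole file read-only and read it line by line.
    f = open(filename)
    fd = f.fileno()
    m = mmap(fd, os.fstat(fd).st_size, prot=PROT_READ)
    while 1:
        if not m.readline(): break
    m.close()
    f.close()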
------------------------------------------------------------------------
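
For anyone who wants to repeat the comparison without
/usr/share/dict/words, here is a self-contained sketch for a current
Python 3.  It is not the setup used for the timings above: the
function names, the generated sample file, and the line count are all
assumptions chosen for illustration, and the numbers it prints will
differ from machine to machine.

```python
import mmap
import os
import tempfile
import timeit

def count_stdio(path):
    """Count lines by iterating over the file object."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)

def count_mmap(path):
    """Count lines by calling mmap.readline() until it returns b''."""
    n = 0
    with open(path, "rb") as f:
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            while m.readline():
                n += 1
        finally:
            m.close()
    return n

def make_sample(n_lines):
    """Create a throwaway file with n_lines short lines; return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(n_lines):
            f.write(b"word%d\n" % i)
    return path

if __name__ == "__main__":
    path = make_sample(100000)
    try:
        # Both readers should see the same number of lines.
        assert count_stdio(path) == count_mmap(path) == 100000
        for func in (count_stdio, count_mmap):
            best = min(timeit.repeat(lambda: func(path), number=10, repeat=3))
            print("%s: %.1f msec per loop" % (func.__name__, best / 10 * 1000))
    finally:
        os.remove(path)
```

Counting the lines, rather than just discarding them, gives a cheap
sanity check that both loops really read the whole file.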