Python IO performance?

Sun Jun 1 00:30:19 EDT 2003

>>>>> "Chad" == Chad Netzer <cnetzer at mail.arc.nasa.gov> writes:

> On Sat, 2003-05-31 at 00:26, Ganesan R wrote:
>> Python is over 8 times slower! Is the problem with the fileinput
>> module or is I/O just slower with python?

> Probably a few things.  One is that this case really favors Perl because
> it has operators for doing these things, and is heavily optimized for
> such cases.

I agree that this case favors perl. It's just that I am used to writing
quick hacks like this for text processing in perl. Since learning python
I've been resisting my impulse to code them in perl and have used python
instead. I always had a feeling that my python scripts ran much slower,
so I decided to do some timing tests to check my perception; hence
the post.

> Secondly, you are timing the program startup time plus the loop.  Python
> has to compile the program before executing it (don't know how Perl does
> this, probably the same), then import a module before it does the loop. 
> This adds a fixed overhead (for small input files, the startup time
> could dominate).  Note that python doesn't create a pre-compiled
> mycat.pyc file when you run a script directly on the command line like
> this (it only does it when importing a module, or when explicitly told).

I made sure that this is not a problem. Doubling the size of the file
approximately doubled the time taken. So the overhead is pretty minimal in
this case. 
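
The reasoning behind that check can be written down as a small sketch. It
assumes run time is roughly linear in input size, t(n) = a + b*n, where a is
the fixed startup cost; the function names and the sample timings used with
it are illustrative only and are not measurements from this thread.

```python
def fixed_overhead(t_n, t_2n):
    """Estimate the fixed startup cost `a`, assuming t(n) = a + b*n.

    t_n  : measured wall-clock time for an input of n lines
    t_2n : measured wall-clock time for the same input doubled
    """
    # t_n = a + b*n and t_2n = a + 2*b*n, so a = 2*t_n - t_2n.
    return 2 * t_n - t_2n


def scaling_cost(t_n, t_2n):
    """Estimate b*n, the part of t_n that grows with the input."""
    # Subtracting the two equations cancels the fixed cost a.
    return t_2n - t_n
```

If doubling the file exactly doubles the time, fixed_overhead() comes out
near zero, which is the "overhead is pretty minimal" case. Made-up timings
of 1.0 s and 1.5 s would instead point to about 0.5 s of startup cost.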

> Thirdly, the fileinput module itself is not the fastest method.  Here is
> my quick hack version, that goes quite a bit faster, and uses file
> iteration directly:

> ==== - mycat2.py
> import sys

> if len( sys.argv ) < 2:
>     sys.exit()

> f = file( sys.argv[1], "r" )
> for line in f:
>     print line,
> f.close()
> ====

I noticed this myself after my post. A similar version that I wrote took
< 0.3 secs compared to over 0.7 secs for the version using fileinput. Much
better but still about 3.5 times slower than the perl version.
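
For anyone who wants to repeat the comparison, here is a minimal sketch of
timing the two approaches side by side. It is written for a current Python 3
rather than the 2.2/2.3 interpreters discussed here, the function names are
made up for illustration, and the ratio it reports on a modern interpreter
may differ from the numbers quoted above.

```python
import fileinput
import io
import timeit


def cat_fileinput(path, out):
    # Copy the file line by line through the fileinput module.
    with fileinput.FileInput(files=[path]) as stream:
        for line in stream:
            out.write(line)


def cat_direct(path, out):
    # Copy the file by iterating over the file object directly.
    with open(path, "r") as f:
        for line in f:
            out.write(line)


def compare(path, repeat=3):
    # Return the best-of-`repeat` wall-clock seconds for each approach.
    # Output goes to an in-memory buffer so terminal speed is not measured.
    results = {}
    for func in (cat_fileinput, cat_direct):
        timer = timeit.Timer(lambda: func(path, io.StringIO()))
        results[func.__name__] = min(timer.repeat(repeat=repeat, number=1))
    return results
```

Calling compare() on a large text file returns a dict with one timing per
function. Taking the minimum of several runs reduces noise from disk caching
and other processes.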

> Python 2.3 beta1 has improved the file iteration even more.  Here are
> those timings:

I saw some posts mentioning an improvement of about 25-30% in performance
in general. It's good to know that file iteration is also being addressed.

> Now my version is about 2.5 times slower than perl.  It is probably not
> the case that Python will ever catch up to Perl completely for this
> benchmark (again, this benchmark happens to play to Perl's strengths in
> using language operators to efficiently handle file IO under the
> covers), or even other basic file IO benchmarks.  Perl has always
> performed better in that area, and is designed to be quick when doing
> IO.

Interestingly, when I tried strace on both the perl and python versions, the
actual system calls were virtually identical (4k reads and writes). So I
guess the issue is with the user space libraries, as Aahz suggests in his
post. I do hope fileinput performance is addressed. It's a natural
choice for writing unix filters, and a 2.5 times slowdown over a
direct-coded version is not acceptable.
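
Until then, a direct-coded filter can cover the common fileinput use case
(read the files named on the command line, or standard input if there are
none) with plain file iteration. The following is only a sketch under that
assumption, in current Python 3; iter_lines and run_filter are hypothetical
names, and features such as in-place editing or per-file line numbers are
left out.

```python
import sys


def iter_lines(paths, stdin=None):
    """Yield lines from each named file in turn, or from stdin if none."""
    if stdin is None:
        stdin = sys.stdin
    if not paths:
        yield from stdin
        return
    for path in paths:
        if path == "-":
            # "-" conventionally means standard input.
            yield from stdin
        else:
            with open(path, "r") as f:
                yield from f


def run_filter(paths, out=None, stdin=None):
    """Copy every input line to `out` (sys.stdout by default)."""
    out = sys.stdout if out is None else out
    write = out.write  # look the method up once, outside the loop
    for line in iter_lines(paths, stdin):
        write(line)
```

A script would call run_filter(sys.argv[1:]) and put any per-line
processing inside the loop in place of the plain write.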

> But as you can see, there have been big improvements made to Python's IO
> processing speed, and once the processing of the IO happens, depending
> on what is being done, these benchmarks may no longer apply.  I'd assume
> Perl is still faster for regular expression stuff, for example, but
> maybe not by much.  Others will know more about this than I (I last used
> Perl at version 4).

Actually, I first started with a script using regexps. After some tests I
figured out that I/O itself seemed to be the bottleneck :-(. I remember Perl
using an alternative I/O library called sfio; I don't know if that's the
standard in shipping binaries. Anyway, let me do some digging in the python
2.3 sources. Maybe there's more scope for improvement.

Ganesan