web crawler in python or C?

Magnus Lycka lycka at carmen.se
Mon Feb 20 13:33:16 EST 2006


abhinav wrote:
> I want to strike a balance between development speed and crawler speed.

"The best performance improvement is the transition from the
nonworking state to the working state."        - J. Osterhout

Try to get there as soon as possible. You can figure out what
that means. ;^)

When you do all your programming in Python, most of the code that
is relevant for speed *is* written in C already. If performance
is slow, measure! Use the profiler to see if you are spending a
lot of time in Python code. If that is your problem, take a close
look at your algorithms and perhaps your data structures and see
what you can improve with Python. In the long run, going from
e.g. O(n^2) to O(n log n) might mean much more than going from
Python to C. A poor algorithm in machine code still sucks when you
have to handle enough data. Changing your code to improve on
algorithms and structure is a lot easier in Python than in C.
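To make that concrete, here's a minimal sketch of both points: profile first, then fix the data structure. The URL-deduplication task is a hypothetical example (though a natural one for a crawler), and it assumes the cProfile module from the standard library; on older Pythons the plain profile module works the same way.

```python
import cProfile

# Hypothetical crawler task: keep only URLs we haven't seen before.

def dedupe_list(urls):
    # O(n^2): every membership test scans the whole list.
    seen = []
    for u in urls:
        if u not in seen:
            seen.append(u)
    return seen

def dedupe_set(urls):
    # O(n): set membership is a single hash lookup.
    seen = set()
    out = []
    for u in urls:
        if u not in seen:
            seen.add(u)
            out.append(u)
    return out

if __name__ == "__main__":
    urls = ["http://example.com/page%d" % (i % 500) for i in range(5000)]
    # The profiler shows where the time actually goes before you
    # decide anything needs rewriting in C.
    cProfile.run("dedupe_list(urls)")
    cProfile.run("dedupe_set(urls)")
```

Same result from both functions, but the second scales; that kind of change is usually a one-liner in Python and a headache in C.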

If you've done all these things, still have performance problems,
and have identified a bottleneck in your Python code, it might
be time to get that piece rewritten in C. The easiest and least
intrusive way to do that might be with Pyrex. You might also want
to try Psyco before you do this.

Even if you end up writing a whole program in C, it's not unlikely
that you will get to your goal faster if your first version is
written in Python.

Good luck!

P.S. Why someone would want to write yet another web crawler is
a puzzle to me. Surely there are plenty of good ideas that haven't
been properly implemented yet! It's probably very difficult to
beat Google on their home turf now, but I'd really like to see
a good tool to manage all that information I got from the net,
or through mail, or wrote myself. I don't think anyone has written
that yet--although I'm sure they are trying.
