web crawler in python or C?

Steven D'Aprano steve at REMOVETHIScyber.com.au
Thu Feb 16 12:11:32 EST 2006


On Wed, 15 Feb 2006 21:56:52 -0800, abhinav wrote:

> Hi guys.I have to implement a topical crawler as a part of my
> project.What language should i implement
> C or Python?Python though has fast development cycle but my concern is
> speed also.I want to strke a balance between development speed and
> crawler speed.Since Python is an interpreted language it is rather
> slow.

Python is no more interpreted than Java. Like Java, it is compiled to
byte-code. Unlike Java, it doesn't take three weeks to start the runtime
environment. (Okay, maybe it just *seems* like three weeks.)

The nice clean distinctions between "compiled" and "interpreted" languages
haven't existed in most serious programming languages for a decade or
more. In these days of tokenizers and byte-code compilers and processors
emulating other processors, the difference is more of degree than kind.

It is true that standard Python doesn't compile to platform dependent
machine code, but that is rarely an issue since the bottleneck for most
applications is I/O or human interaction, not language speed. And for
those cases where it is a problem, there are solutions, like Psycho.

After all, it is almost never true that your code must run as fast as
physically possible. That's called "over-engineering". It just needs to
run as fast as needed, that's all. And that's a much simpler problem to
solve cheaply.



> The crawler which will be working on huge set of pages should be
> as fast as possible.

Web crawler performance is almost certainly going to be I/O bound. Sounds
to me like you are guilty of trying to optimize your code before even
writing a single line of code. What you call "huge" may not be huge to
your computer. Have you tried? The great thing about Python is you can
write a prototype in maybe a tenth the time it would take you to do the
same thing in C. Instead of trying to guess what the performance
bottlenecks will be, you can write your code and profile it and find the
bottlenecks with accuracy.


> One possible implementation would be implementing
> partly in C and partly in Python so that i can have best of both
> worlds.

Sure you can do that, if you need to. 

> But i don't know to approach about it.Can anyone guide me on
> what part should i implement in C and what should be in Python?

Yes. Write it all in Python. Test it, debug it, get it working. 

Once it is working, and not before, rigorously profile it. You may find it
is fast enough.

If it is not fast enough, find the bottlenecks. Replace them with better
algorithms. We had an example on comp.lang.python just a day or two ago
where a function which was taking hours to complete was re-written with a
better algorithm which took only seconds. And still in Python.

If it is still too slow after using better algorithms, or if there are no
better algorithms, then and only then re-write those bottlenecks in C for
speed.



-- 
Steven.




More information about the Python-list mailing list