web crawler in python or C?

gene tani gene.tani at gmail.com
Thu Feb 16 08:27:01 EST 2006


Paul Rubin wrote:
> "abhinav" <abhinavduggal at gmail.com> writes:

> > maintaining huge data structures. What should the language be so as
> > not to compromise that much on speed? What is the performance of
> > Python-based crawlers vs. C-based crawlers? Should I use both
> > languages (partly C and partly Python)? How should I decide which
> > parts to implement in C and which in Python? Please guide me.
> > Thanks.
>
> I think if you don't know how to answer these questions for yourself,
> you're not ready to take on projects of that complexity.  My advice
> is start in Python since development will be much easier.  If and when
> you start hitting performance problems, you'll have to examine many
> combinations of tactics for dealing with them, and switching languages
> is just one such tactic.

There's another potential bottleneck: parsing HTML and extracting the
text you want, especially when you hit pages that don't conform to the
HTML 4 or XHTML specs.
http://sig.levillage.org/?p=599
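
To make the parsing point concrete, here is a minimal sketch of tolerant
text extraction. It assumes a current Python 3 and uses only the
standard-library html.parser module; the TextExtractor class, the
extract_text helper, and the sample page are made up for illustration,
and a real crawler would likely want a more forgiving dedicated parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical malformed page: unclosed <p> and <b>, unquoted attribute,
# missing </html>.
sample = (
    "<html><body><p>Hello <b>world<p class=x>second para"
    "<script>var a=1;</script></body>"
)
print(extract_text(sample))
```

The parser does not raise on the unclosed tags; it just reports what it
sees, which is usually what you want when the goal is text rather than a
faithful document tree.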

Paul's advice is very sound, given what little info you've provided.
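
Before deciding what (if anything) to move to C, it helps to measure
where the time actually goes. A rough sketch, assuming Python 3's
standard cProfile and pstats modules; fake_fetch, extract_links, and
crawl_step are hypothetical stand-ins, not parts of any real crawler, and
the fetch does no network I/O.

```python
import cProfile
import io
import pstats
import re

LINK_RE = re.compile(r'href="([^"]+)"')


def fake_fetch(n_links):
    # Stand-in for a network fetch: builds a page with n_links links.
    return "".join('<a href="/page%d">link</a>' % i for i in range(n_links))


def extract_links(html):
    # Naive regex link extraction, good enough for a profiling demo.
    return LINK_RE.findall(html)


def crawl_step(n_pages=50, links_per_page=200):
    # "Fetch" n_pages pages and collect the distinct links found.
    seen = set()
    for _ in range(n_pages):
        seen.update(extract_links(fake_fetch(links_per_page)))
    return seen


profiler = cProfile.Profile()
profiler.enable()
result = crawl_step()
profiler.disable()

# Print the five entries with the highest cumulative time.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Whatever shows up at the top of a report like this on your real workload
is the candidate for optimization; in a crawler it is often network wait
or parsing rather than the Python loop itself.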

http://trific.ath.cx/resources/python/optimization/
(And look at Psyco for speeding up pure Python, and at Pyrex,
Boost.Python, SWIG, and ctypes for bridging C and Python; you have a lot
of options. Also look at HarvestMan, mechanize, and other existing libs.)
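
As a taste of the lightest of those bridging options, here is a small
ctypes sketch that calls the C library's strlen from Python. It assumes a
POSIX system (Linux or macOS) where the C library can be loaded this way;
c_strlen is just an illustrative wrapper name.

```python
import ctypes
import ctypes.util

# Assumption: POSIX. find_library may return None on minimal systems;
# CDLL(None) then falls back to symbols already loaded in the process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Length of a byte string as seen by C (stops at the first NUL)."""
    return libc.strlen(data)


print(c_strlen(b"web crawler"))  # 11
```

The same pattern applies to your own shared library: keep the crawler
logic in Python and declare argtypes/restype for the few C functions that
profiling shows are worth writing.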



