Python optimization (was Python's "only one way to do it" philosophy isn't good?)

Paul Rubin
Wed Jun 13 22:11:26 EDT 2007


"Diez B. Roggisch" <deets at nospam.web.de> writes:
> And if only the html-parsing is slow, you might consider creating an
> extension for that. Using e.g. Pyrex.

I just tried using BeautifulSoup to pull some fields out of some HTML
files--about 2 million files, the output of a web crawler.  It parsed
very nicely, at about 5 files per second.  Of course, Python being
Python, I wanted to run the program a whole lot of times, modifying it
based on what I found from previous runs, and at 5/sec each run was
going to take about 4 days.  (OK, I probably could have spread it
across 5 or so computers and gotten it under 1 day, at the cost of
more effort to write the parallelizing code and to scare up the extra
machines.)  By simply treating the HTML as one big string and using
string.find to locate the fields I wanted, I got it up to about 800
files/second, which brought each run down to about half an hour.
Simplest of all would be if Python just ran about 100x faster than it
does--a speedup that is not outlandish to hope for.
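
For what it's worth, the string-scanning version is only a few lines.
Here is a minimal sketch of the idea; the marker strings and the
extract_field helper are made up for illustration, since the real
fields depend on what the crawler output looks like:

    def extract_field(html, start_marker, end_marker):
        # Scan for the literal start marker, then take everything up
        # to the matching end marker.  Returns None if either marker
        # is missing from the document.
        i = html.find(start_marker)
        if i == -1:
            return None
        i += len(start_marker)
        j = html.find(end_marker, i)
        if j == -1:
            return None
        return html[i:j]

    sample = '<html><body><h1 class="title">Hello</h1></body></html>'
    print(extract_field(sample, '<h1 class="title">', '</h1>'))  # Hello

Obviously this is brittle compared to a real parser--it breaks as soon
as the markup varies--but when the crawler output is uniform, skipping
the full parse is exactly where the speedup comes from.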


