Python optimization (was Python's "only one way to do it" philosophy isn't good?)

John Nagle nagle at animats.com
Wed Jun 13 23:21:36 EDT 2007


Paul Rubin wrote:
> "Diez B. Roggisch" <deets at nospam.web.de> writes:
> 
>>And if only the html-parsing is slow, you might consider creating an
>>extension for that. Using e.g. Pyrex.
> 
> 
> I just tried using BeautifulSoup to pull some fields out of some html
> files--about 2 million files, output of a web crawler.  It parsed very
> nicely at about 5 files per second.  

     That's about what I'm seeing.  And it's the bottleneck of
"sitetruth.com".

> By
> simply treating the html as a big string and using string.find to
> locate the fields I wanted, I got it up to about 800 files/second,
> which made each run about 1/2 hour. 
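
     For anyone following along, the string-scanning approach described
above might look roughly like this. It is only a sketch: the helper name
extract_between and the sample page are made up for illustration, and
it uses the str.find method, the modern spelling of string.find.

```python
def extract_between(html, start_marker, end_marker):
    """Return the text between the first start_marker and the next
    end_marker, or None if either marker is missing."""
    start = html.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)          # skip past the opening marker
    end = html.find(end_marker, start)  # search only after the opening marker
    if end == -1:
        return None
    return html[start:end]


# Hypothetical sample input, not taken from any real crawl.
page = "<html><head><title>Example page</title></head><body></body></html>"
title = extract_between(page, "<title>", "</title>")
```

     This is fast because it never builds any structure, but it only
works when the fields sit between predictable literal markers.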

     For our application, we have to look at the HTML in some detail,
so we really need it in a tree form.
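
     To illustrate what "tree form" buys you, here is a minimal sketch
using the standard library's html.parser. It is not how BeautifulSoup
works internally, and it does none of BeautifulSoup's repair of badly
broken markup. The names Node, TreeBuilder, find_all and VOID_TAGS are
invented for this example.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a deliberately partial list).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One element in the tree: tag name, attributes, parent, children, text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Build a simple element tree from the parser's start/end/data events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag name;
        # a stray closing tag with no matching opener is ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def find_all(node, tag):
    """Return every element named tag at or below node, in document order."""
    found = [node] if node.tag == tag else []
    for child in node.children:
        found.extend(find_all(child, tag))
    return found


builder = TreeBuilder()
builder.feed('<html><body><p>Hello <a href="/about">about us</a></p>'
             '<p>Bye</p></body></html>')
builder.close()
links = find_all(builder.root, "a")
```

     With the tree in hand you can ask structural questions, such as
which element encloses a given link (links[0].parent), that a flat
string search cannot answer reliably.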

> Simplest still would be if Python
> just ran about 100x faster than it does, a speedup which is not
> outlandish to hope for.

    Right.  Looking forward to ShedSkin getting good enough to run
BeautifulSoup.

    (Actually, the future of page parsing is probably to use some kind
of stripped-down browser that reads the page, builds the DOM,
runs the startup JavaScript, then lets you examine the DOM.  There
are too many pages now that just come through as blank if you don't
run the OnLoad JavaScript.)

				John Nagle


