Ajax Request + Write to Json Extremely Slow (Webpage Crawler)

Steven D'Aprano steve at pearwood.info
Sun Jan 3 10:42:00 EST 2016


On Sun, 3 Jan 2016 10:03 pm, jonafleuraime at gmail.com wrote:

> I'm editing a simple scraper that crawls a YouTube video's comment page.
> The crawler uses Ajax to page through comments on the page (infinite
> scroll) and then saves them to a JSON file. Even with a small number of
> comments (< 5), it still takes 3+ minutes for the comments to be added
> to the JSON file.
> 
> I've tried including requests-cache and using ujson instead of json to
> see if there are any benefits, but there's no noticeable difference.

Before making random changes to the code in the hope of speeding it up, try
running it under the profiler and see what it says.

https://pymotw.com/2/profile/index.html#module-profile

https://docs.python.org/2/library/profile.html
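
For example, you can run the whole script under cProfile from the command
line with `python -m cProfile -s cumulative yourscript.py`, or drive it
from code. A minimal sketch, where main() is a placeholder for whatever
function actually kicks off the crawl:

    import cProfile
    import pstats

    # Run the crawl under the profiler and save the raw timings.
    # `main` is a placeholder for the scraper's real entry point.
    cProfile.run('main()', 'crawl.stats')

    # Show the 20 most expensive calls, sorted by cumulative time.
    pstats.Stats('crawl.stats').sort_stats('cumulative').print_stats(20)

If most of the time shows up in the socket or ssl machinery, the job
really is I/O bound; if it shows up in your own functions, look there
first.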



> You can view the code here:
>
> http://stackoverflow.com/questions/34575586/how-to-speed-up-ajax-requests-python-youtube-scraper



I see that you already have an answer suggesting you try threads, since the
process is I/O bound. (The time taken is supposedly dominated by the time it
takes to download data from the internet.) That may be true.
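
If the fetches are independent of each other (say, downloading the replies
for each comment id in reply_cids), a thread pool lets them overlap. Here
is a rough, untested sketch; `fetch_replies` is a stand-in for whatever
does the actual downloading, the URL and parameters are hypothetical, and
on Python 2 you need the `futures` backport to get concurrent.futures:

    from concurrent.futures import ThreadPoolExecutor
    import requests

    def fetch_replies(cid):
        # Stand-in for the scraper's real request; the endpoint and
        # parameters here are hypothetical.
        url = "https://www.youtube.com/comment_ajax"
        return requests.get(url, params={"comment_id": cid}).text

    # Eight worker threads let up to eight downloads run at once. Only
    # the independent fetches can be overlapped like this; the paging
    # loop itself is sequential, because each request needs the
    # page_token from the previous one.
    with ThreadPoolExecutor(max_workers=8) as pool:
        pages = list(pool.map(fetch_replies, reply_cids))

But whether or not threads help, I also see something which *may* be a
warning sign: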


    while page_token:
        [...]
        page_token, html = response
        reply_cids += extract_reply_cids(html)


`reply_cids` is a list, and repeatedly calling += on a list *may* be slow.
If += is implemented the naive way, as addition and assignment, it probably
will be slow. This may entirely be a red herring, but if it were my code,
I'd try replacing that last line with:

        reply_cids.extend(extract_reply_cids(html))


and see if it makes any difference. If it doesn't, you can keep the new
version or revert to the version using +=, entirely up to you.
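
If you want to see how much the spelling can matter, here's a quick timeit
comparison. (For a built-in list, += actually delegates to list.__iadd__,
which extends in place, so it's really only the explicit
`result = result + chunk` spelling that pays the quadratic cost. But
measure rather than guess.)

    from timeit import timeit

    setup = "chunks = [list(range(10))] * 1000"

    # Naive addition-and-assignment: builds a brand-new list on every
    # iteration, copying everything accumulated so far -- quadratic.
    slow = timeit("result = []\n"
                  "for chunk in chunks:\n"
                  "    result = result + chunk",
                  setup=setup, number=100)

    # In-place extend: appends to the existing list -- linear.
    fast = timeit("result = []\n"
                  "for chunk in chunks:\n"
                  "    result.extend(chunk)",
                  setup=setup, number=100)

    print("naive + :", slow)
    print("extend  :", fast)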



-- 
Steven



