[Tutor] Creating a webcrawler

bruce badouglas at gmail.com
Sat Jan 9 11:33:09 EST 2016


Hi Isac.

I'm not going to get into the Pythonic stuff. People on the list are
way better at that than I am. I've done a fair amount of crawling, and
it's not too bad, depending on what you're trying to accomplish and the
site you're targeting.

So, no offense, but I'm going to treat you like a 6 year old (google
it - from a movie!)

You need to back up and analyze the site/pages/structure you're going
after. Use the tools - Firefox, LiveHTTPHeaders, network traffic
monitors, etc.
  - You want to be able to see the exchange between the client/browser
    and the server.
  - Often, this gives you the clues/insight for crafting the request
    from your client back to the server for the item/data you're after.
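
As a rough sketch of what "crafting the request" can look like in
Python 3 once you've seen what the browser sends: the snippet below
only builds the request with the standard library and prints it, it
doesn't send anything. The URL, the header values and the
build_request() name are placeholders I made up; swap in whatever you
actually observe in the header capture.

```python
from urllib.request import Request


def build_request(url, referer, user_agent="Mozilla/5.0 (X11; Linux x86_64)"):
    """Build (but do not send) a request that mimics what the browser sent.

    The header values here are assumptions; copy the real ones from the
    exchange you captured with your browser tools.
    """
    headers = {
        "User-Agent": user_agent,
        "Referer": referer,
        "Accept": "text/html,application/xhtml+xml",
    }
    return Request(url, headers=headers)


# Hypothetical target page, for illustration only.
req = build_request("http://example.com/videos/page1",
                    referer="http://example.com/")
print(req.get_method(), req.full_url)
for name, value in req.header_items():
    print(name, value, sep=": ")
```

To actually fetch the page you'd pass req to urllib.request.urlopen(),
but get the headers matching the browser's first.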

Once you've gotten that together, set up the basic process with
wget/curl etc. to get a feel for any weird issues: cert issues?
security issues? are cookies required? A good deal of this stuff can
be resolved/checked out at this level, without jumping into coding.
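
If it turns out cookies or certs matter, here's a minimal sketch of the
same checks from Python's standard library, for when you do move on
from wget/curl. It only sets up an opener that keeps cookies and
verifies certificates; the fetch itself is commented out because it
needs a network connection. make_opener() and the example.com URL are
my own placeholders, not anything from your site.

```python
import ssl
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, HTTPSHandler, build_opener


def make_opener():
    """Return (opener, jar): an opener that verifies certs and keeps cookies."""
    jar = CookieJar()
    # The default context verifies the server certificate and hostname,
    # so a cert problem shows up as an exception rather than silently.
    context = ssl.create_default_context()
    opener = build_opener(HTTPCookieProcessor(jar),
                          HTTPSHandler(context=context))
    return opener, jar


opener, jar = make_opener()

# Assumed usage (not run here, needs network access):
# with opener.open("https://example.com/", timeout=10) as resp:
#     print(resp.status)
#     print([cookie.name for cookie in jar])  # cookies the server set
```

If the jar fills up after the first request, the site is probably
relying on cookies, and you'll want to reuse the same opener for the
follow-up requests.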

Once you're comfortable at this point, you can crank out some simple
code to go after the site you're targeting.
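
For the "simple code" step, something along these lines might be a
starting point for spotting video download links in a page that has
already been fetched. It uses only the standard library's html.parser.
The extension list, the class/function names and the sample HTML are
all assumptions for illustration; your site will look different, and
this won't see links that JavaScript builds at runtime.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed set of extensions to treat as "video"; adjust for your site.
VIDEO_EXTENSIONS = (".mp4", ".webm", ".mkv", ".avi")


class VideoLinkParser(HTMLParser):
    """Collect href/src URLs whose path ends in a video extension."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "source", "video"):
            return
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page URL.
                url = urljoin(self.base_url, value)
                # Check only the path, so query strings don't get in the way.
                if urlparse(url).path.lower().endswith(VIDEO_EXTENSIONS):
                    self.links.append(url)


def find_video_links(html, base_url):
    parser = VideoLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Made-up sample page, standing in for whatever you downloaded.
sample = """
<html><body>
<a href="/about.html">About</a>
<a href="/media/clip.mp4?token=abc">Download</a>
<video><source src="http://cdn.example.com/v/intro.webm"></video>
</body></html>
"""
print(find_video_links(sample, "http://example.com/page"))
```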

In the event you really have a JavaScript/dynamic site that you can't
handle in any other manner, you're going to need to use a 'headless
browser' process.

There are a number of headless browser projects - I think most run on
the WebKit codebase (don't quote me). CasperJS/PhantomJS are two, and
there are Python implementations as well.

So, there you go. Hopefully this will get you on your way!



On Fri, Jan 8, 2016 at 9:01 PM, Whom Isac <wombingsac at gmail.com> wrote:
> Hi I want to create a web-crawler but dont have any lead to choose any
> module. I have came across the Jsoup but I am not familiar with how to use
> it in 3.5 as I tried looking at a similar web crawler codes from 3.4 dev
> version.
> I just want to build that crawler to crawl through a javascript enable site
> and automatically detect a download link (for video file)
> .
> And should I be using pickles to write the data in the text file/ save file.
> Thanks
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor

