[Tutor] Creating a webcrawler

Whom Isac wombingsac at gmail.com
Thu Jan 14 05:43:52 EST 2016


Thanks, guys, for your replies. I actually tried playing with my browser, but
it did not help much: getting a web crawler to select a video and fetch the
video link is very hard for me, as I am just a beginner-level programmer and
Python was the first language I learnt. I have also recently learnt
JavaScript, Ruby, HTML, Bootstrap and C#. I may try this same project again
in the future, when I have more knowledge.

On Sun, Jan 10, 2016 at 2:33 AM, bruce <badouglas at gmail.com> wrote:

> Hi Isac.
>
> I'm not going to get into the pythonic stuff.. People on the list are
> way better than I am.  I've been doing a chunk of crawling; it's not too
> bad, depending on what you're trying to accomplish and the site you're
> targeting.
>
> So, no offense, but I'm going to treat you like a 6 year old (google
> it - from a movie!)
>
> You need to back up and analyze the site/pages/structure you're going
> after. Use the tools: Firefox with Live HTTP Headers, a network traffic
> view, etc.
>   -you want to be able to see the exchange between the client/browser
> and the server.
>   -often, this gives you the clues/insight for crafting the request from
> your client back to the server for the item/data you're going for.
>
> Once you've gotten that together, set up the basic process with
> wget/curl etc. to get a feel for any weird issues: cert issues,
> security issues, whether cookies are required, etc. A good deal of this
> stuff can be resolved/checked out at this level, without jumping into
> coding.
>
> Once you're comfortable at this point, you can crank out some simple
> code to go after the site you're targeting.
>
> In the event you really have a JavaScript/dynamic site that you can't
> handle in any other manner, you're going to need to use a 'headless
> browser' process.
>
> There are a number of headless browser projects - I think most run on
> the WebKit codebase (don't quote me). There's CasperJS/PhantomJS, and
> there are pythonic implementations as well...
>
> So, there you go; hopefully this will get you on your way!
>
>
>
> On Fri, Jan 8, 2016 at 9:01 PM, Whom Isac <wombingsac at gmail.com> wrote:
> > Hi, I want to create a web crawler but don't have any lead on which
> > module to choose. I have come across Jsoup, but I am not familiar with
> > how to use it in 3.5, as I tried looking at similar web crawler code
> > from the 3.4 dev version.
> > I just want to build a crawler that crawls through a JavaScript-enabled
> > site and automatically detects a download link (for a video file).
> > And should I be using pickle to write the data to a text file/save
> > file?
> > Thanks
> > _______________________________________________
> > Tutor maillist  -  Tutor at python.org
> > To unsubscribe or change subscription options:
> > https://mail.python.org/mailman/listinfo/tutor
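
As a rough illustration of the "simple code" step described in the quoted
advice, here is a minimal sketch (not anyone's actual crawler from this
thread) that scans already-downloaded HTML for links that look like video
files. It uses only the Python 3 standard library. The names
VideoLinkFinder, find_video_links and VIDEO_EXTENSIONS are hypothetical,
and the extension list is an assumption about what counts as a video
link on the target site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed set of video file extensions; adjust for the site being crawled.
VIDEO_EXTENSIONS = (".mp4", ".webm", ".ogv", ".mkv")


class VideoLinkFinder(HTMLParser):
    """Collect href/src attribute values whose URL path ends in a video extension."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page URL.
                url = urljoin(self.base_url, value)
                # Check only the path, so query strings do not hide the extension.
                path = urlparse(url).path.lower()
                if path.endswith(VIDEO_EXTENSIONS) and url not in self.links:
                    self.links.append(url)


def find_video_links(html, base_url):
    """Return video-looking URLs found in html, in document order."""
    finder = VideoLinkFinder(base_url)
    finder.feed(html)
    return finder.links
```

This only sees links present in the HTML the server sends. Links that
JavaScript builds in the browser will not appear, which is the case where
the headless browser approach mentioned above comes in. For saving the
results, writing one URL per line to a plain text file is probably enough
for a list of strings, so pickle may not be needed.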

