[Tutor] Creating a webcrawler

Alan Gauld alan.gauld at btinternet.com
Sat Jan 9 09:02:45 EST 2016


On 09/01/16 02:01, Whom Isac wrote:
> Hi, I want to create a web-crawler but don't have any lead on choosing a
> module. I have come across Jsoup but I am not familiar with how to use
> it in 3.5, as I tried looking at similar web crawler code from the 3.4 dev
> version.

I don't know Jsoup and have no idea how it works with 3.5.
However, there are some modules in the standard library you can
use, including html.parser, urllib and so on.
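
For example, a rough sketch using only those standard library
modules (the URL is just a placeholder) might look like this:

from html.parser import HTMLParser
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collect the href of every <a> tag seen in the page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

html = urlopen('http://example.com/').read().decode('utf-8')
parser = LinkParser()
parser.feed(html)
print(parser.links)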

Beautiful Soup is good at parsing badly constructed HTML, and
ElementTree (xml.etree) is good for XML/XHTML.

Requests is also a good bet for working with HTTP requests.
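
Putting requests and Beautiful Soup together, a minimal sketch
might look like this (it assumes you have installed both packages
and that the download links end in .mp4, which may not match
your site):

import requests
from bs4 import BeautifulSoup

resp = requests.get('http://example.com/videos')  # placeholder URL
resp.raise_for_status()
soup = BeautifulSoup(resp.text, 'html.parser')

# Collect every link whose target looks like a video file.
video_links = [a['href'] for a in soup.find_all('a', href=True)
               if a['href'].lower().endswith('.mp4')]
print(video_links)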

> I just want to build that crawler to crawl through a JavaScript-enabled site
> and automatically detect a download link (for a video file).

Depending on exactly what the JavaScript does, it might not
be possible (at least not directly). Many modern sites simply
load up the document structure before calling a JavaScript
function to fetch all the data (including links and images)
from a server via JSON.

If that's what your site does, you'll need to find the call to
the server and emulate it from Python.
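
As a sketch of what emulating it might look like, once you have
found the URL that the JavaScript calls (the URL and the field
names below are invented, you would find the real ones in your
browser's network tool):

import requests

data = requests.get('http://example.com/api/videos').json()
for item in data.get('items', []):
    print(item.get('download_url'))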

> And should I be using pickle to write the data to a text file / save file?

You could. You could also use a database such as SQLite.
It really depends on what you plan on doing with it after
you save it.
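
For instance, a minimal sketch that stores the links you find
in an SQLite table (the table and column names are just examples):

import sqlite3

video_links = ['http://example.com/a.mp4']  # whatever your crawler found

conn = sqlite3.connect('crawler.db')
conn.execute('CREATE TABLE IF NOT EXISTS links (url TEXT UNIQUE)')
conn.executemany('INSERT OR IGNORE INTO links (url) VALUES (?)',
                 [(u,) for u in video_links])
conn.commit()
conn.close()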


-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



