[Tutor] Using an XML file for web crawling

Mats Wichmann mats at wichmann.us
Fri Mar 31 10:52:58 EDT 2017


On 03/31/2017 05:23 AM, Igor Alexandre wrote:
> Hi!
> I'm a newbie in the Python/Web crawling world. I've been trying to find an answer to the following question since last week, but I couldn't so far, so I decided to ask it myself here: I have a sitemap in XML and I want to use it to save the various pages of the site as text. Do you guys know how I can do it? I'm looking for some code on the web where I can just type the XML address and wait for the crawler to do its job, saving all the pages indicated in the sitemap as text files on my computer.
> Thank you!
> Best,
> Igor Alexandre

There's a surprisingly active community doing web crawling and
scraping. My impression is that the Scrapy project is a "big dog" in
this space, but I'm not involved in it, so I can't say for sure. A
couple of links for you to play with:

http://docs.python-guide.org/en/latest/scenarios/scrape/

The first part of this page might be enough for you: it covers scraping
with lxml + Requests. I happened to look over that page a few days ago,
but a web search would turn it up easily.
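
To give a feel for the overall shape, here is a rough sketch of the
sitemap-to-text-files idea. It is not taken from that page, and to keep
it dependency-free it uses only the standard library (xml.etree,
urllib, html.parser) instead of lxml + Requests. The function names,
the "pages" output directory and the file-naming scheme are all made up
for illustration, and it assumes a plain sitemap in the sitemaps.org
format (a <urlset> of <url>/<loc> entries, not a sitemap index) with
pages that decode as UTF-8.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from pathlib import Path

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml):
    """Return the page URLs found in the <loc> elements of a sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Very crude HTML-to-text conversion: one text fragment per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def url_to_filename(url):
    """Turn a URL into a filesystem-safe name (hypothetical scheme)."""
    return re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_") + ".txt"


def save_sitemap_pages(sitemap_url, out_dir="pages"):
    """Download every page listed in the sitemap and save it as text."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(sitemap_url) as resp:
        urls = extract_urls(resp.read())
    for url in urls:
        with urllib.request.urlopen(url) as resp:
            # Assumption: pages are UTF-8; bad bytes are replaced.
            html = resp.read().decode("utf-8", errors="replace")
        (out / url_to_filename(url)).write_text(html_to_text(html),
                                                encoding="utf-8")
    return urls
```

You would call it with something like
save_sitemap_pages("http://example.com/sitemap.xml"), where the address
is a placeholder for your own sitemap. A real crawler should also pause
between requests and handle download errors, which this sketch skips.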

https://scrapy.org/

There are plenty of other resources; someone is bound to have what
you're looking for.



