[Tutor] Using an XML file for web crawling
Mats Wichmann
mats at wichmann.us
Fri Mar 31 10:52:58 EDT 2017
On 03/31/2017 05:23 AM, Igor Alexandre wrote:
> Hi!
> I'm a newbie in the Python/Web crawling world. I've been trying to find an answer to the following question since last week, but I couldn't so far, so I decided to ask it myself here: I have a sitemap in XML and I want to use it to save as text the various pages of the site. Do you guys know how I can do it? I'm looking for some code on the web where I can just type the XML address and wait for the crawler to do its job, saving all the pages indicated in the sitemap as text files on my computer.
> Thank you!
> Best,
> Igor Alexandre
There's a surprisingly active community doing web crawling / scraping
work. My impression is that the Scrapy project is a "big dog" in this
space, but I'm not involved in it, so I can't say for sure. A couple of
links for you to play with:
http://docs.python-guide.org/en/latest/scenarios/scrape/
The first part of this might be enough for you: it uses lxml + Requests.
I happened to look over that page a few days ago, but I'm sure a web
search would turn it up easily.
https://scrapy.org/
There are plenty of other resources; someone is bound to have what
you're looking for.
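
To make the idea concrete, here is a rough sketch of the "read the
sitemap, fetch each page, save it as text" loop. It is not taken from
either link above: to keep it dependency-free it swaps in the standard
library (urllib, xml.etree, html.parser) where the guide uses lxml +
Requests. All function names are made up for illustration, and it
assumes a plain sitemap in the standard sitemaps.org namespace, not a
sitemap index that points at further sitemaps.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlparse

# Assumption: the sitemap uses the standard sitemaps.org namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_bytes):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_bytes)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Very crude HTML-to-text conversion: one line per text fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def filename_for(url):
    """Turn a URL into a filesystem-friendly .txt file name."""
    parsed = urlparse(url)
    name = re.sub(r"[^A-Za-z0-9]+", "_", parsed.netloc + parsed.path).strip("_")
    return (name or "index") + ".txt"


def crawl(sitemap_url, out_dir="pages"):
    """Fetch the sitemap, then save each listed page as a text file."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    with urllib.request.urlopen(sitemap_url) as resp:
        urls = sitemap_urls(resp.read())
    for url in urls:
        with urllib.request.urlopen(url) as resp:
            # Assumption: pages are UTF-8; undecodable bytes are replaced.
            html = resp.read().decode("utf-8", errors="replace")
        (out / filename_for(url)).write_text(html_to_text(html), encoding="utf-8")
```

You would call something like crawl("http://example.com/sitemap.xml")
(a placeholder address) and look in the "pages" directory afterwards.
A real crawler should also pause between requests, handle failed
downloads, and respect the site's robots.txt, which is the sort of
thing Scrapy is built to take care of.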