[Chennaipy] Chennaipy - Monday Module - 1 Aug 2022
selvi dct
selvi.dct at gmail.com
Mon Aug 1 14:05:32 EDT 2022
Date: 1 Aug 2022
Module: scrape
Installation: pip install scrape
About: Scrape is a rule-based web crawler and information extraction tool
capable of manipulating and merging new and existing documents. XML Path
Language (XPath) and regular expressions are used to define rules for
filtering content and web traversal. Output may be converted into text,
csv, pdf, and/or HTML formats.
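To make the idea of XPath and regular-expression rules concrete, here is a minimal sketch that uses only the Python standard library. It is not scrape's own implementation or API: the helper names filter_text and find_links and the small DOC sample are hypothetical, and xml.etree.ElementTree supports only a limited XPath subset and needs well-formed markup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document (not the demo.html from this post)
DOC = """<html><body>
<h1>Admission</h1>
<p>Applicants are judged on their ISP.</p>
<p>Students should sign up for DSL service.</p>
<a href="home.html">Back to home page</a>
</body></html>"""


def filter_text(doc, xpath, pattern):
    """Return the text of elements selected by xpath whose text matches pattern."""
    root = ET.fromstring(doc)
    regex = re.compile(pattern)
    return [
        el.text
        for el in root.findall(xpath)
        if el.text and regex.search(el.text)
    ]


def find_links(doc):
    """Return every href found on <a> elements, i.e. candidate pages to traverse."""
    root = ET.fromstring(doc)
    return [a.get("href") for a in root.iter("a") if a.get("href")]


# Content rule: paragraphs (XPath) that mention DSL (regex)
paragraphs = filter_text(DOC, ".//p", r"DSL")
# Traversal rule: links that a crawler could follow next
links = find_links(DOC)
print(paragraphs)
print(links)
```

The XPath expression picks the candidate elements and the regular expression narrows them down; a crawler would apply the same kind of rule to links to decide which pages to visit.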
Sample Source Code:
from scrape import scrape, utils


def call_scrape(cmd, filetype, num_files=None):
    """Run scrape with the given arguments and enable the requested output type."""
    # Accept a single query string as well as a list of arguments
    if not isinstance(cmd, list):
        cmd = [cmd]
    parser = scrape.get_parser()
    args = vars(parser.parse_args(cmd))
    args["overwrite"] = True  # Avoid overwrite prompt
    if args["crawl"] or args["crawl_all"]:
        args["no_images"] = True  # Avoid save image prompt when crawling
    args[filetype] = True
    if num_files is not None:
        # num_files is used as the name of another option to switch on
        args[num_files] = True
    return scrape.scrape(args)


call_scrape(["demo.html"], "text")
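The sample above relies on a common argparse pattern: parse a list of arguments, turn the resulting namespace into a dict with vars(), then override keys before handing the dict on. The sketch below shows that pattern in isolation. The parser built by get_demo_parser is a hypothetical stand-in with made-up options; it is not the parser returned by scrape.get_parser().

```python
import argparse


def get_demo_parser():
    """Build a small stand-in parser (hypothetical, not scrape's real options)."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("query", nargs="+")
    parser.add_argument("--text", action="store_true")
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("--no-images", action="store_true")
    return parser


def build_args(cmd, filetype):
    """Parse cmd, then force a few options on by editing the dict directly."""
    if not isinstance(cmd, list):
        cmd = [cmd]
    # vars() exposes the Namespace attributes as an ordinary, mutable dict
    args = vars(get_demo_parser().parse_args(cmd))
    args["overwrite"] = True  # same trick as in the sample: skip a prompt
    args[filetype] = True     # filetype is a key name such as "text"
    return args


demo_args = build_args("demo.html", "text")
print(demo_args)
```

Editing the dict after parsing lets a script set options that would otherwise need command-line flags or an interactive answer.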
Input: demo.html
<html><body>
ADMISSION TO ONLINE COLLEGE
<P>
Aplicants are considered for admission to Online College
on the basis of their ISP, quality of their home pages and
quantity of emails exchanged per day.
<P>
It is recommended that students prepare for enrollment in
Online College by signing up for DSL service and
buying a new computer.
<P>
<A HREF="home.html">Back to Online College home page</A> </body>
</HTML>
Execution:
$ python scrape_sample.py
Output: demo.txt
ADMISSION TO ONLINE COLLEGEAplicants are considered for admission to Online
College on the basis of their ISP, quality of their home pages and quantity
of emails exchanged per day.It is recommended that students prepare for
enrollment in Online College by signing up for DSL service and buying a new
computer.
Back to Online College home page
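For readers who want to see roughly what an HTML-to-text step involves, here is a minimal sketch built on the standard library's html.parser. It is an illustration only, not how scrape produces demo.txt, and its spacing will not match the output above exactly. TextExtractor, html_to_text and SAMPLE are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Collapse runs of whitespace and skip whitespace-only pieces
        text = " ".join(data.split())
        if text:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of html, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical input shaped like demo.html, including unclosed <P> tags
SAMPLE = (
    "<html><body>\nADMISSION\n<P>\nFirst paragraph\nwraps here.\n<P>\n"
    '<A HREF="home.html">Back home</A> </body>\n</HTML>'
)
text = html_to_text(SAMPLE)
print(text)
```

HTMLParser tolerates the unclosed <P> tags and mixed-case tag names found in older pages such as demo.html.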
Reference: https://pypi.org/project/scrape/