[Chennaipy] Chennaipy - Monday Module - 1 Aug 2022

selvi dct selvi.dct at gmail.com
Mon Aug 1 14:05:32 EDT 2022


Date: 1 Aug 2022


Module: scrape


Installation: pip install scrape


About: Scrape is a rule-based web crawler and information extraction tool
capable of manipulating and merging new and existing documents. XML Path
Language (XPath) and regular expressions are used to define rules for
filtering content and web traversal. Output may be converted into text,
csv, pdf, and/or HTML formats.
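
To give a feel for what a rule built from XPath and a regular expression can look like, here is a minimal sketch that uses only the Python standard library. It does not use scrape's own API. The sample document, the XPath expressions and the pattern are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document (not from the scrape project).
DOC = """<html><body>
<p>Admission to Online College</p>
<p>Sign up for DSL service</p>
<a href="home.html">Back to home page</a>
<a href="logo.png">Logo</a>
</body></html>"""

root = ET.fromstring(DOC)

# XPath rule: select the text of every <p> element anywhere in the tree.
paragraphs = [p.text for p in root.findall(".//p")]

# Regex rule: keep only links whose target ends in ".html",
# the kind of filter a crawler could use to decide which links to follow.
html_link = re.compile(r"\.html$")
links = [
    a.get("href")
    for a in root.findall(".//a")
    if html_link.search(a.get("href", ""))
]

print(paragraphs)
print(links)
```

ElementTree supports only a subset of XPath and needs well-formed markup, so this is a toy illustration of the idea rather than a description of how scrape evaluates its rules.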


Sample Source Code:

from scrape import scrape, utils


def call_scrape(cmd, filetype, num_files=None):
    if not isinstance(cmd, list):
        cmd = [cmd]
    parser = scrape.get_parser()
    args = vars(parser.parse_args(cmd))

    args["overwrite"] = True  # Avoid overwrite prompt
    if args["crawl"] or args["crawl_all"]:
        args["no_images"] = True  # Avoid save image prompt when crawling
    args[filetype] = True
    if num_files is not None:
        args[num_files] = True
    return scrape.scrape(args)


call_scrape(["demo.html"], "text")
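
The helper above follows a common argparse pattern: parse a command list, turn the resulting namespace into a dict with vars(), then override entries before passing the dict on. Here is a minimal sketch of that pattern with a stand-in parser. The function names get_demo_parser and build_args and the flags --crawl, --overwrite and --text are hypothetical and are not claimed to match scrape's actual options.

```python
import argparse


def get_demo_parser():
    """Build a stand-in parser; the flags are hypothetical, not scrape's."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("query", nargs="+", help="files or URLs to process")
    parser.add_argument("--crawl", action="store_true")
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("--text", action="store_true")
    return parser


def build_args(cmd, filetype):
    """Parse cmd and force a few options on, mirroring call_scrape."""
    if not isinstance(cmd, list):
        cmd = [cmd]
    # vars() turns the argparse Namespace into a plain, mutable dict.
    args = vars(get_demo_parser().parse_args(cmd))
    args["overwrite"] = True  # set in code instead of on the command line
    args[filetype] = True     # filetype is used as a dict key, e.g. "text"
    return args


args = build_args("demo.html", "text")
print(args)
```

Setting options in the dict rather than on the command line lets a script skip interactive prompts without having to build flag strings.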


Input: demo.html

<html><body>

ADMISSION TO ONLINE COLLEGE

<P>

Aplicants are considered for admission to Online College

on the basis of their ISP, quality of their home pages and

quantity of emails exchanged per day.

<P>

It is recommended that students prepare for enrollment in

Online College by signing up for DSL service and

buying a new computer.

<P>

<A HREF="home.html">Back to Online College home page</A> </body>

</HTML>


Execution:

$ python scrape_sample.py


Output: demo.txt

ADMISSION TO ONLINE COLLEGEAplicants are considered for admission to Online
College on the basis of their ISP, quality of their home pages and quantity
of emails exchanged per day.It is recommended that students prepare for
enrollment in Online College by signing up for DSL service and buying a new
computer.

Back to Online College home page
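
For readers curious how an HTML-to-text conversion can work in principle, here is a minimal sketch based on the standard library's html.parser. It is not scrape's implementation and does not reproduce its exact spacing. The class TextExtractor, the function html_to_text and the sample string are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Collapse runs of whitespace (including line breaks) to one space.
        text = " ".join(data.split())
        if text:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text content of html, one text run per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical sample, loosely modelled on demo.html.
sample = (
    "<html><body>Heading<P>First paragraph\nwraps here."
    '<P><A HREF="home.html">Back home</A></body></html>'
)
print(html_to_text(sample))
```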


Reference: https://pypi.org/project/scrape/