Short, perfect program to read sentences of webpage

Cameron Simpson cs at cskk.id.au
Wed Dec 8 16:12:17 EST 2021


Assorted remarks inline below:

On 08Dec2021 20:39, Julius Hamilton <juliushamilton100 at gmail.com> wrote:
>deepreader.py:
>
>import sys
>import requests
>import html2text
>import nltk
>
>url = sys.argv[1]

I might spell this:

    cmd, url = sys.argv

which enforces exactly one argument, because the unpacking raises 
ValueError for any other count. And since you don't care about the 
command name, maybe:

    _, url = sys.argv

because "_" is a conventional name for "a value we do not care about".
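
If you want a usage message rather than a bare traceback when the 
argument count is wrong, something like this would do. It is only a 
sketch: parse_args and the wording of the message are names I made up 
for illustration, not part of your script.

```python
def parse_args(argv):
    """Return the URL from argv, insisting on exactly one argument.

    parse_args is a hypothetical helper name used for illustration.
    """
    try:
        # Tuple unpacking raises ValueError unless argv has exactly 2 items.
        _, url = argv
    except ValueError:
        # SystemExit with a string prints it to stderr and exits nonzero.
        raise SystemExit("usage: deepreader.py URL") from None
    return url


# In the script itself this would be called as parse_args(sys.argv).
url = parse_args(["deepreader.py", "https://example.com/"])
```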

>sentences = nltk.sent_tokenize(html2text.html2text(requests.get(url).text))

Neat!

># Activate an elementary reader interface for the text
>for index, sentence in enumerate(sentences):

I would be inclined to count from 1, so "enumerate(sentences, 1)".

>  # Print the sentence
>  print("\n" + str(index) + "/" + str(len(sentences)) + ": " + sentence +
>"\n")

Personally, since print() adds a trailing newline, I would drop the 
final +"\n". If you want an additional blank line, I would put it in the 
input() call below:

>  # Wait for user key-press
>  x = input("\n> ")

You're not using "x". Just discard input()'s return value:

    input("\n> ")
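
Putting the loop remarks above together, a rough sketch might look 
like the following. The function name page_through and its ask 
parameter are my own inventions; ask defaults to the builtin input() 
and is only there so the loop can be exercised without a keyboard.

```python
def page_through(sentences, ask=input):
    """Show each sentence with a 1-based counter, pausing after each one.

    page_through and ask are hypothetical names used for illustration.
    """
    total = len(sentences)
    for index, sentence in enumerate(sentences, 1):
        # print() appends the trailing newline itself.
        print(f"\n{index}/{total}: {sentence}")
        # The leading "\n" in the prompt supplies the blank line;
        # whatever the user types is discarded.
        ask("\n> ")
```

Counting from 1 also means the last sentence shows as N/N rather 
than stopping at N-1/N.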

>A lot of refining is possible, and I’d really like to see how some more
>experienced people might handle it.
>
>1. The HTML extraction is not perfect. It doesn’t produce as clean text as
>I would like. Sometimes random links or tags get left in there.

Maybe try BeautifulSoup instead of html2text? The package on PyPI is 
"beautifulsoup4" and the module name is "bs4".
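
For what it's worth, here is a rough sketch of the bs4 route. It 
assumes the beautifulsoup4 package is installed. The helper name 
html_to_text is made up, and dropping script and style elements is 
only my guess at where some of the stray junk comes from; I have not 
checked it against your pages.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML document as one string.

    html_to_text is a hypothetical helper name used for illustration.
    """
    # "html.parser" is the parser from the standard library.
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are never meant to be read.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # Strip each text node and join the pieces with single spaces.
    return soup.get_text(separator=" ", strip=True)
```

In your script this would take the place of html2text.html2text(), 
fed with requests.get(url).text. Note that separator=" " puts a space 
between adjacent text nodes, so text split up by inline tags can gain 
a space it did not have.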

>And the
>sentences are sometimes randomly broken by newlines.

I would flatten the newlines. Either the simple:

    sentence = sentence.strip().replace("\n", " ")

or maybe better:

    sentence = " ".join(sentence.split())
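
To show why I lean towards the second form, here is a small 
comparison on a made-up sample string:

```python
# A made-up sample with leading/trailing space, newlines and space runs.
sentence = "  The quick\nbrown   fox\n\njumps.  "

# replace() only turns each newline into a space, so existing runs of
# spaces survive and a double newline becomes a double space.
replaced = sentence.strip().replace("\n", " ")

# split() with no argument splits on any run of whitespace and drops
# the empty pieces, so rejoining leaves exactly one space between words.
joined = " ".join(sentence.split())

print(repr(replaced))  # 'The quick brown   fox  jumps.'
print(repr(joined))    # 'The quick brown fox jumps.'
```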

Cheers,
Cameron Simpson <cs at cskk.id.au>

