[Tutor] A simple sentence reader with pre-established libraries

Julius Hamilton juliushamilton100 at gmail.com
Mon Nov 15 14:02:56 EST 2021


Hey,

I would like to build a simple sentence-by-sentence reader, relying on
existing software libraries as much as possible, to keep the code very
simple.

First, I get the HTML and extract the plain text. I can do this with wget or
the Python requests library, then in theory html2text. However, I have found
that html2text sometimes breaks sentences across different
lines, like how this sentence I am writing is broken. I’m not sure if this
is a bug or if there’s some option I don’t know about.

However, I have a Beautiful Soup method to do this as well:
https://stackoverflow.com/questions/69680184/how-do-i-retrieve-the-text-of-a-webpage-without-sentences-being-broken-by-newlin
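
To make the goal of this step concrete, here is a minimal sketch of the
extraction using only the standard library's html.parser as a stand-in (it
is not the Beautiful Soup method from the link above). The names
TextExtractor and html_to_text, and the short list of block-level tags, are
my own assumptions for illustration. The point is just that every run of
whitespace, including newlines, is collapsed to a single space so sentences
stay on one line.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    # Assumed (incomplete) set of tags that should separate words.
    BLOCK = {"p", "div", "br", "li", "tr", "td", "h1", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the page text as one line, with whitespace runs collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()
```

The HTML itself could come from wget or requests as described above; this
sketch only covers the step from an HTML string to unbroken text.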

Then, I just want to segment the text into sentences. I’ll probably use
spaCy, since it seems to be the most modern option, with either its
rule-based sentencizer or a trained model. But I have also used NLTK in
the past.
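
To show the input and output I have in mind for this step, here is a naive
regex splitter. It is only a placeholder, not spaCy's or NLTK's
segmentation, and the name split_sentences is my own. It splits after ".",
"!" or "?" when followed by whitespace and a capital letter or opening
quote, so it will wrongly split after abbreviations such as "Dr.", which is
exactly why I would rather use a real library.

```python
import re

# Boundary: sentence-ending punctuation, then whitespace, then something
# that looks like the start of a new sentence.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'“])")


def split_sentences(text):
    """Split text into a list of sentences using a naive regex rule."""
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]
```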

At that point, I just want to create the simplest command line application
which shows the list one element at a time, with simple navigation options
like backwards and forwards and quit, and maybe save progress.
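
If I did write this part myself, I imagine something like the following
sketch, using only the standard library. All names here (step, read,
load_progress, save_progress, and the progress.json file) are hypothetical
choices of mine, and the key bindings are just one possibility: Enter or
"n" for next, "b" for back, "q" to quit.

```python
import json


def step(index, command, total):
    """Return the new index after a command, or None to quit."""
    if command == "q":
        return None
    if command == "b":
        return max(index - 1, 0)
    if command in ("", "n"):
        return min(index + 1, total - 1)
    return index  # unknown command: stay on the same sentence


def load_progress(path):
    """Return the saved index, or 0 if nothing usable was saved."""
    try:
        with open(path) as f:
            return max(int(json.load(f)["index"]), 0)
    except (OSError, ValueError, KeyError, TypeError):
        return 0


def save_progress(path, index):
    with open(path, "w") as f:
        json.dump({"index": index}, f)


def read(sentences, progress_path="progress.json"):
    """Show sentences one at a time, saving the position after each move."""
    if not sentences:
        return
    index = min(load_progress(progress_path), len(sentences) - 1)
    while True:
        print(f"[{index + 1}/{len(sentences)}] {sentences[index]}")
        command = input("(n)ext, (b)ack, (q)uit > ").strip().lower()
        new_index = step(index, command, len(sentences))
        if new_index is None:
            break
        index = new_index
        save_progress(progress_path, index)
```

Calling read(["First sentence.", "Second sentence."]) would start the loop.
Keeping the navigation in step, separate from input and print, is only so
that it can be checked without a terminal.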

Here’s the essential question: instead of writing this myself, what are the
chances that there is yet again some great pre-existing tool out there that
I could make use of? Is there some software library where with one line of
code I could “read” a list, or a plaintext document, at the command line,
sentence by sentence, or element by element, with or without some
navigation functionalities?

Thanks very much,
Julius

