Short, perfect program to read sentences of webpage

Jon Ribbens jon+usenet at unequivocal.eu
Wed Dec 8 17:19:59 EST 2021


On 2021-12-08, Julius Hamilton <juliushamilton100 at gmail.com> wrote:
> 1. The HTML extraction is not perfect. It doesn’t produce as clean text as
> I would like. Sometimes random links or tags get left in there. And the
> sentences are sometimes randomly broken by newlines.

Oh. Leaving tags in suggests you are doing this very wrongly. Python
has plenty of open source libraries you can use that will parse the
HTML reliably into tags and text for you.
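
As a rough sketch of what that could look like, here is a text extractor
built on the standard library's html.parser module. The names
TextExtractor and extract_text are made up for illustration, and a
third-party parser would do the job just as well. It keeps text nodes,
drops every tag, and skips the contents of <script> and <style>:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Tags never reach this method, only the text between them.
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Return the concatenated text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

This only separates tags from text. It makes no attempt to decide where
line breaks or spaces belong.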

> 2. Neither is the segmentation perfect. I am currently researching
> developing an optimal segmenter with tools from Spacy.
>
> Brevity is greatly valued. I mean, anyone who can make the program more
> perfect, that’s hugely appreciated. But if someone can do it in very few
> lines of code, that’s also appreciated.

It isn't something that can be done in a few lines of code. There's the
spaces issue you mention, for example. Nor is it something that can
necessarily be done just by inspecting the HTML alone. To take a trivial
example:

  powergen<div>italia</div>          = powergen <nl> italia

but:

  powergen<span>italia</span>        = powergenitalia

but the second with the addition of:

  <style>span { display: block }</style>

is back to "powergen <nl> italia". So you need to parse and apply styles
(including external stylesheets) as well. Potentially you may also need
to execute JavaScript on the page, which means you also need a JavaScript
interpreter and a DOM implementation. Basically you need a complete
browser to do it on general web pages.
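
To make the limitation concrete, here is a minimal sketch of the
"inspect the HTML alone" approach, again using only the standard
library. The BLOCK_TAGS set, the BlockAwareExtractor class and the
visible_text function are all assumptions made for this example, not
part of any library. It treats a fixed list of tags as block-level and
knows nothing about CSS:

```python
from html.parser import HTMLParser

# Hypothetical, incomplete list of tags that usually start a new line.
BLOCK_TAGS = {"div", "p", "br", "li", "ul", "ol", "h1", "h2", "h3", "tr"}
SKIP_TAGS = {"script", "style"}


class BlockAwareExtractor(HTMLParser):
    """Extract text, inserting a newline around tags assumed to be blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html):
    """Guess the rendered text from tag names alone (stylesheets are ignored)."""
    parser = BlockAwareExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)


print(repr(visible_text("powergen<div>italia</div>")))
print(repr(visible_text("powergen<span>italia</span>")))
# The stylesheet is discarded, so the span is still treated as inline,
# which is not what a browser would render.
print(repr(visible_text(
    "<style>span { display: block }</style>powergen<span>italia</span>")))
```

The first two cases come out as expected, but the third still gives
"powergenitalia", because nothing here parses or applies the style rule.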

