Short, perfect program to read sentences of webpage

Peter J. Holzer hjp-python at hjp.at
Wed Dec 8 18:09:47 EST 2021


On 2021-12-09 09:42:07 +1100, Cameron Simpson wrote:
> On 08Dec2021 21:41, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
> >Julius Hamilton <juliushamilton100 at gmail.com> writes:
> >>This is a really simple program which extracts the text from webpages and
> >>displays them one sentence at a time.
> >
> >  Our teacher said NLTK will not come up until next year, so
> >  I tried to do with regexps. It still has bugs, for example
> >  it can not tell the dot at the end of an abbreviation from
> >  the dot at the end of a sentence!
> 
> This is almost a classic demo of why regexps are a poor tool as a first 
> choice. You can do much with them, but they are cryptic and bug prone.

I don't think that's the problem here. The problem is that natural
languages just aren't regular languages. In fact I'm not sure that they
fit anywhere within the Chomsky hierarchy (but if they aren't type-0,
that would be a strong argument against the possibility of human-level
AI).

In English, if a sentence ends with an abbreviation you write only a
single dot. So if you look at these two fragments:

    For matching strings, numbers, etc. Python provides regular
    expressions.

    Let's say you want to match strings, numbers, etc. Python provides
    regular expressions for these tasks.

In the second case the dot ends a sentence; in the first it doesn't. But to
distinguish those cases you need to at least parse the sentences at the
syntax level (which regular expressions can't do), maybe even understand
them semantically.
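
To make that concrete, here is a minimal sketch. The splitting rule (a
dot, then whitespace, then an uppercase letter) is my own assumption of
what a naive regexp approach might look like, not the code from earlier
in the thread, and split_sentences is just a name made up for this
example.

```python
import re

# Assumed naive rule: a sentence ends at a dot that is followed by
# whitespace and then an uppercase letter.
SENTENCE_END = re.compile(r"(?<=\.)\s+(?=[A-Z])")

def split_sentences(text):
    """Split text wherever the naive rule sees a sentence boundary."""
    return SENTENCE_END.split(text)

first = "For matching strings, numbers, etc. Python provides regular expressions."
second = ("Let's say you want to match strings, numbers, etc. Python provides "
          "regular expressions for these tasks.")

# Both fragments contain the same local context "etc. Python", so the
# regexp has to make the same decision in both of them.
for fragment in (first, second):
    print(split_sentences(fragment))
```

Both fragments get cut after "etc.", which is right for the second and
wrong for the first. Adding "etc." to a list of known abbreviations
would only flip the error to the other fragment, since the characters
around the dot are identical in both.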

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"