Short, perfect program to read sentences of webpage

Cameron Simpson cs at cskk.id.au
Wed Dec 8 17:42:07 EST 2021


On 08Dec2021 21:41, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
>Julius Hamilton <juliushamilton100 at gmail.com> writes:
>>This is a really simple program which extracts the text from webpages and
>>displays them one sentence at a time.
>
>  Our teacher said NLTK will not come up until next year, so
>  I tried to do with regexps. It still has bugs, for example
>  it can not tell the dot at the end of an abbreviation from
>  the dot at the end of a sentence!

This is almost a classic demo of why regexps are a poor tool as a first 
choice. You can do much with them, but they are cryptic and bug prone.

I am not seeking to mock you, but trying to make apparent why regexps 
are to be avoided a lot of the time. They have their place.

You've read the whole re module docs I hope:

    https://docs.python.org/3/library/re.html#module-re

>import re
>import urllib.request
>uri = r'''http://example.com/article''' # replace this with your URI!
>request = urllib.request.Request( uri )
>resource = urllib.request.urlopen( request )
>cs = resource.headers.get_content_charset()
>content = resource.read().decode( cs, errors="ignore" )
>content = re.sub( r'''[\r\n\t\s]+''', r''' ''', content )

You're not multiline, so I would recommend a plain raw string:

    content = re.sub( r'[\r\n\t\s]+', r' ', content )

No need for \r, \n or \t in the class, \s covers them all, so a plain 
r'\s+' would do. From the docs:

  \s
    For Unicode (str) patterns:

      Matches Unicode whitespace characters (which includes [ 
      \t\n\r\f\v], and also many other characters, for example the 
      non-breaking spaces mandated by typography rules in many 
      languages). If the ASCII flag is used, only [ \t\n\r\f\v] is 
      matched.
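
A quick way to check that for yourself, as a minimal sketch. The sample 
string is made up for illustration:

```python
import re

# Made-up sample containing CR, LF, TAB and a non-breaking space (U+00A0).
sample = "one\r\n\ttwo\u00a0 three"

# \s on its own already covers \r, \n, \t and other Unicode whitespace.
collapsed = re.sub(r'\s+', ' ', sample)
print(collapsed)
```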

>upper = r"[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ]" # "[\\p{Lu}]"
>lower = r"[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]" # "[\\p{Ll}]"

This is very fragile - you have an arbitrary set of additional uppercase 
characters, almost certainly incomplete, and visually hard to inspect 
for completeness.

Instead, consider the \b (word boundary) and \w (word character) 
markers, which will let you break strings up, and then maybe test the 
results with str.isupper().
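
Something along these lines, as a rough sketch rather than a finished 
solution. The sample text and the variable names are mine, made up for 
illustration:

```python
import re

# Made-up sample text, for illustration only.
text = "Über die Brücke ging Émile. NASA was not involved."

# \w+ matches runs of word characters, accented letters included.
words = re.findall(r'\w+', text)

# Let the str methods decide about case instead of a hand-written class.
capitalised = [w for w in words if w[0].isupper()]
print(capitalised)
```

No list of accented characters to maintain, and it keeps working for 
letters you did not think of.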

>digit = r"[0-9]" #"[\\p{Nd}]"

There's a \d character class for this. In str patterns it also covers 
decimal digits from other scripts, not just 0-9.
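
A small sketch of what \d does and does not match in a str pattern. The 
is_digits() helper is hypothetical, just for this demonstration:

```python
import re

def is_digits(s, ascii_only=False):
    """Return True if s consists entirely of characters matched by \\d."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r'\d+', s, flags) is not None

print(is_digits('2021'))                         # plain ASCII digits
print(is_digits('\u0662\u0660\u0662\u0661'))     # Arabic-Indic decimal digits
print(is_digits('\u00b2'))                       # superscript two: not a decimal digit
print(is_digits('\u0662\u0660', ascii_only=True))  # re.ASCII restricts \d to 0-9
```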

>firstwordstart = upper;
>firstwordnext = "(?:[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ-])";

Again, an inline arbitrary list of characters. This is fragile.

>wordcharacter = "[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝa-zµàáâãäåæçèéêëìíîïð\
>ñòóôõöøùúûüýþÿ0-9-]"

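Again inline. Why not construct it? Each piece already carries its own 
brackets, so strip those before joining, and keep the hyphen:

    wordcharacter = "[" + upper[1:-1] + lower[1:-1] + digit[1:-1] + "-]"

but I recommend \w instead (it already includes the digits), or for 
this, with the hyphen: [\w-]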
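
As a rough sketch of both options. The short upper/lower/digit classes 
here are ASCII stand-ins for the longer ones quoted above, and union() 
is a hypothetical helper of mine, not part of the re module. Note that 
bracketed classes cannot simply be added together: that gives several 
classes in a row rather than one merged class.

```python
import re

# ASCII stand-ins for the longer classes in the quoted program.
upper = r"[A-Z]"
lower = r"[a-z]"
digit = r"[0-9]"

def union(*classes, extra=""):
    """Merge bracketed classes into one class (hypothetical helper)."""
    return "[" + "".join(c[1:-1] for c in classes) + extra + "]"

wordcharacter = union(upper, lower, digit, extra="-")
print(wordcharacter)

sample = "well-known fact42"
built = re.findall(wordcharacter + "+", sample)
simple = re.findall(r"[\w-]+", sample)
print(built, simple)
```

The \w version is shorter, and it also copes with letters outside the 
hand-written lists. It does accept the underscore as well, which may or 
may not matter for your text.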

>addition = "(?:(?:[']" + wordcharacter + "+)*[']?)?"

As a matter of good practice with regexp strings, use raw quotes:

    addition = r"(?:(?:[']" + wordcharacter + r"+)*[']?)?"

even when there are no backslashes.
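
The habit pays off the moment a backslash does turn up. A minimal 
sketch of the trap, with a made-up sample string:

```python
import re

# In a normal string '\b' is a single backspace character.
plain = '\bcat\b'
# In a raw string it stays backslash + 'b', the regexp word boundary.
raw = r'\bcat\b'

print(len(plain), len(raw))

plain_match = re.search(plain, 'a cat sat')  # looks for literal backspaces
raw_match = re.search(raw, 'a cat sat')      # looks for the word "cat"
print(plain_match, raw_match)
```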

Seriously, doing this with regexps is difficult. A useful exercise for 
learning regexps, but in the general case not the first tool to reach 
for.

Cheers,
Cameron Simpson <cs at cskk.id.au>

