[Tutor] a little help with the re module

Cameron Simpson cs at cskk.id.au
Sun Jul 19 01:55:58 EDT 2020


On 18Jul2020 13:51, nathan tech <nathan-tech at hotmail.com> wrote:
>As always, the exact answer I was looking for, thanks Alan!
>>>What I want to do is for that to match all of the following:
>>>
>>>the cat sat on the mat
>>>The dog sat on the shoe
>>While \w will work for a single word it gets more complex
>>with multiple words.
>>
>>>The dog and the cat sat on the hoverboard.
>>>The big angry mouse sat on the mat and ate it.
>>This gets more complex because you want to match multiple words between
>>The and sat. The simplest way here is probably the .* combination
>>(zero or more repetitions of any character).
>>
>>The .* sat on the .*
>>
>>Or + if you want at least 1 character.
>>
>>The .+ sat on the .+

Just some things I find handy in this kind of situation:

    The (\w.*\w) sat on the (\w.*\w).

where each (\w.*\w) group matches a string which starts and ends with a 
word character. Good for phrases.
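
As a rough sketch of the difference (the sample sentence is one from the 
original question; the variable names are just for illustration), compare 
the looser .+ form quoted above, with brackets added, against the 
word-bounded groups:

```python
import re

sentence = "The dog sat on the shoe."

# Looser form: the second group swallows the trailing full stop.
loose = re.match(r'The (.+) sat on the (.+)', sentence)
loose_groups = loose.groups()

# Word-bounded groups: each capture starts and ends on a word character.
tight = re.match(r'The (\w.*\w) sat on the (\w.*\w).', sentence)
tight_groups = tight.groups()

print(loose_groups)  # ('dog', 'shoe.')
print(tight_groups)  # ('dog', 'shoe')
```

One thing to keep in mind: \w.*\w needs at least two characters, so a 
one-letter phrase would not match this form.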

Whitespace might not always be spaces. \s+ is a general thing for 
whitespace. However, it makes your patterns hard to read (and therefore, 
hard to debug):

    The\s+(\w.*\w)\s+sat\s+on\s+the\s+(\w.*\w).
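
A small sketch of why that matters, assuming a made-up input where the 
gaps are a tab, a double space and a newline (the string and names here 
are hypothetical, not from the original question):

```python
import re

# Hypothetical input: tab, double space and newline between the words.
messy = "The\tdog  sat on\nthe shoe."

# Literal spaces in the pattern only match single space characters.
plain = re.match(r'The (\w.*\w) sat on the (\w.*\w).', messy)

# \s+ accepts any run of whitespace between the words.
flexible = re.match(r'The\s+(\w.*\w)\s+sat\s+on\s+the\s+(\w.*\w).', messy)

print(plain)              # None
print(flexible.groups())  # ('dog', 'shoe')
```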

You can simplify various things by normalising the input.

For whitespace this is pretty easy:

    text = re.sub(r'\s+', ' ', text)

and now every run of whitespace is a single space, letting you use your 
simpler regexps. That includes being able to assume that every word gap 
is exactly one space, not several.
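
For instance, a minimal sketch with a made-up messy input. The .strip() 
call is my own addition here, to drop the single leading and trailing 
space that re.sub leaves behind when the text starts or ends with 
whitespace:

```python
import re

# Hypothetical input with leading spaces, a tab, a double space and newlines.
messy = "  The\tdog  sat on\nthe shoe.\n"

# Collapse every run of whitespace to one space, then trim the ends.
text = re.sub(r'\s+', ' ', messy).strip()

# The simple pattern with literal spaces now applies.
m = re.match(r'The (\w.*\w) sat on the (\w.*\w).', text)

print(repr(text))   # 'The dog sat on the shoe.'
print(m.groups())   # ('dog', 'shoe')
```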

And finally, when you're pulling bits out of longer regexps, do not 
forget the "named groups":

    The (?P<subject>\w.*\w) sat on the (?P<object>\w.*\w).

thus:

    import re

    ptn = re.compile(r'The (?P<subject>\w.*\w) sat on the (?P<object>\w.*\w).')
    m = ptn.match(text)
    if m:  # match() returns None when the text does not match
        the_cat = m.group('subject')
        the_mat = m.group('object')

For your example, pointing at groups 1 and 2 is easy, but regexps tend to 
get out of hand, and naming the bracketed bits of interest is extremely 
useful.
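
A related convenience, sketched here with a hypothetical helper called 
parse(): the match object's groupdict() method returns all the named 
groups as a dict, and the named groups can still be reached by number as 
well.

```python
import re

ptn = re.compile(r'The (?P<subject>\w.*\w) sat on the (?P<object>\w.*\w).')

def parse(text):
    """Return a dict of the named groups, or None if text does not match."""
    m = ptn.match(text)
    return m.groupdict() if m else None

result = parse("The dog and the cat sat on the hoverboard.")
print(result)  # {'subject': 'dog and the cat', 'object': 'hoverboard'}

# Named groups keep their positional numbers too.
m = ptn.match("The dog sat on the shoe.")
same = m.group(1) == m.group('subject')

# A sentence that does not fit the pattern gives None.
no_match = parse("A bird flew past.")
```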

Cheers,
Cameron Simpson <cs at cskk.id.au>

