[Tutor] a little help with the re module
Cameron Simpson
cs at cskk.id.au
Sun Jul 19 01:55:58 EDT 2020
On 18Jul2020 13:51, nathan tech <nathan-tech at hotmail.com> wrote:
>As always, the exact answer I was looking for, thanks Alan!
>>>What I want to do is for that to match all of the following:
>>>
>>>the cat sat on the mat
>>>The dog sat on the shoe
>>While \w will work for a single word it gets more complex
>>with multiple words.
>>
>>>The dog and the cat sat on the hoverboard.
>>>The big angry mouse sat on the mat and ate it.
>>This gets more complex because you want to match multiple words between
>>The and sat. The simplest way here is probably the .* combination
>>(zero or more repetitions of any character).
>>
>>The .* sat on the .*
>>
>>Or + if you want at least 1 character.
>>
>>The .+ sat on the .+
Just some things I find handy in this kind of situation:
The (\w.*\w) sat on the (\w.*\w).
matches strings which start and end with word characters. Good for
phrases.
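A minimal sketch of what those groups capture, run against the sample sentences from earlier in the thread. One assumption on my part: the trailing full stop is left out of the pattern here, because an unescaped . in a regexp matches any character rather than a literal full stop. The names sentences and results are just for illustration.

```python
import re

# Each group must begin and end with a word character, so trailing
# punctuation and surrounding spaces are left out of the capture.
# Assumption: no trailing "." in the pattern (an unescaped "." matches
# any character, not just a full stop).
ptn = re.compile(r'The (\w.*\w) sat on the (\w.*\w)')

sentences = [
    "The dog sat on the shoe",
    "The dog and the cat sat on the hoverboard.",
    "The big angry mouse sat on the mat and ate it.",
]

results = []
for s in sentences:
    m = ptn.match(s)
    # m is None when the sentence does not match at all
    results.append(m.groups() if m else None)
    print(results[-1])
```

Note that \w.*\w needs at least two characters, so a one letter phrase would not match.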
Whitespace might not always be spaces. \s+ matches any run of whitespace
(spaces, tabs, newlines). However, it makes your patterns hard to read
(and therefore hard to debug):
The\s+(\w.*\w)\s+sat\s+on\s+the\s+(\w.*\w).
You can simplify various things by normalising the input.
For whitespace this is pretty easy:
text = re.sub(r'\s+', ' ', text)
and now every run of whitespace is a single space, letting you use your
simpler regexps, including the assumption that every word gap is exactly
one space, not several.
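A short sketch of normalising first and matching second. The sample text is made up for illustration, and the .strip() call is my own addition to drop leading and trailing whitespace; it is not required by the approach above.

```python
import re

# Made-up input with doubled spaces, a tab and a newline between words.
text = "The  big\tangry mouse\nsat   on the mat"

# Collapse every run of whitespace to a single space.
# Assumption: .strip() is added to trim the ends as well.
normalised = re.sub(r'\s+', ' ', text).strip()

# The simple single-space pattern now works on the normalised text.
m = re.match(r'The (\w.*\w) sat on the (\w.*\w)', normalised)
groups = m.groups() if m else None
print(normalised)
print(groups)
```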
And finally, when you're pulling bits out of longer regexps, do not
forget the "named groups":
The (?P<subject>\w.*\w) sat on the (?P<object>\w.*\w).
thus:
import re

ptn = re.compile(r'The (?P<subject>\w.*\w) sat on the (?P<object>\w.*\w).')
m = ptn.match(text)  # m is None if text does not match
the_cat = m.group('subject')
the_mat = m.group('object')
For your example, pointing at groups 1 and 2 is easy, but regexps tend to
get out of hand, and naming the bracketed bits of interest is extremely
useful.
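As a sketch of how the named groups might be wrapped up for reuse: the helper name parse is hypothetical, re.IGNORECASE is my assumption so that "the cat sat on the mat" matches as well as "The dog sat on the shoe", and the trailing . is dropped from the pattern so the last group is not cut short when the sentence has no final full stop.

```python
import re

# Assumptions: case-insensitive matching, and no trailing "." in the pattern.
PTN = re.compile(
    r'The (?P<subject>\w.*\w) sat on the (?P<object>\w.*\w)',
    re.IGNORECASE,
)

def parse(text):
    """Return a dict of the named groups, or None if text does not match."""
    m = PTN.match(text)
    return m.groupdict() if m else None

print(parse("the cat sat on the mat"))
print(parse("The dog and the cat sat on the hoverboard."))
print(parse("A cat slept"))
```

groupdict() hands back all the named groups at once, keyed by name, which saves a run of separate m.group(...) calls, and the None check avoids an AttributeError when the text does not match.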
Cheers,
Cameron Simpson <cs at cskk.id.au>