[Tutor] newbie search and replace

Magnus Lycka magnus@thinkware.se
Mon Jan 27 20:50:01 2003


At 02:07 2003-01-28 +0100, Michael Janssen wrote:
>to avoid nasty regular expressions, one can split the file into "words"
>(defined as character[s] between spaces - now you needn't check for
>word-boundaries via regexp) and check for every word if an
>abbreviation-fullstring-dictionary has the word as a key:

But this is not the same as a regular expression word boundary.
A word in RE is a sequence of alphanumeric characters and
underscores. So, there are other word boundaries than whitespace,
such as the position between a letter and a period or comma.
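
To make the difference concrete, here is a minimal sketch comparing the
two approaches. The names ABBREVIATIONS, expand_by_split and
expand_by_regex are made up for this illustration and do not come from
the earlier posts.

```python
import re

# Hypothetical abbreviation table, used only for this illustration.
ABBREVIATIONS = {"st": "Street"}

def expand_by_split(text):
    # Whitespace approach: a "word" is whatever sits between spaces,
    # so "St." and "St," are different words from "St".
    words = text.split()
    return " ".join(ABBREVIATIONS.get(w.lower(), w) for w in words)

def expand_by_regex(text):
    # Regex approach: \b matches between a word character and a
    # non-word character, so punctuation also ends a word.
    return re.sub(r"\bst\b", "Street", text, flags=re.IGNORECASE)

print(expand_by_split("Hill St, Umea"))   # the abbreviation is missed
print(expand_by_regex("Hill St, Umea"))   # the abbreviation is expanded
```

The split version only expands "St" when it is surrounded by spaces,
while the regex version also catches it before a period or a comma.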

This means that the "St" in both "Hill St." and "Hill St" will be
found by re.compile(r'\bst\b', re.IGNORECASE). If we want to eat
up a possible trailing period, change it to
re.compile(r'\bst\b\.?', re.IGNORECASE)

 >>> import re
 >>> stPat = re.compile(r'\bst\b', re.IGNORECASE)
 >>> stPat.sub('Street', "First Liston St.")
'First Liston Street.'
 >>> stPat = re.compile(r'\bst\b\.?', re.IGNORECASE)
 >>> stPat.sub('Street', "First Liston St")
'First Liston Street'
 >>> stPat.sub('Street', "First Liston St.")
'First Liston Street'
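
The dictionary idea from the quoted message can be combined with the
word-boundary pattern. What follows is only a sketch under assumptions:
the table entries and the names ABBREVIATIONS, ABBREV_PAT and expand
are hypothetical, chosen for the example.

```python
import re

# Hypothetical abbreviation table; the entries are illustrative only.
ABBREVIATIONS = {"st": "Street", "ave": "Avenue", "rd": "Road"}

# Try longer keys first so a longer abbreviation is preferred
# over a shorter one that is a prefix of it.
_keys = sorted(ABBREVIATIONS, key=len, reverse=True)

# One alternation of all keys, bounded by \b, with an optional period.
ABBREV_PAT = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in _keys) + r")\b\.?",
    re.IGNORECASE,
)

def expand(text):
    # Group 1 holds the abbreviation without the optional period;
    # lower-case it to look up the full string in the table.
    return ABBREV_PAT.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)

print(expand("5 Park Ave, near Hill Rd"))
```

Note that a purely mechanical rule like this will also turn "St. John"
into "Street John", and it eats a period that ends a sentence, so real
texts will probably need more context than a single pattern provides.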

Once a problem reaches a certain level of complexity, and I
think tidying natural language text is a problem at that
level, using tools that are too simple will just lead to very
complicated, or failing, solutions. (Or both.)

The power of RE means that it takes some time to learn, but
in many situations, it's worth it.


-- 
Magnus Lycka, Thinkware AB
Alvans vag 99, SE-907 50 UMEA, SWEDEN
phone: int+46 70 582 80 65, fax: int+46 70 612 80 65
http://www.thinkware.se/  mailto:magnus@thinkware.se