Suggestion for a new regular expression extension

Nicolas Lehuen nicolas.lehuen at thecrmcompany.com
Thu Nov 20 12:38:09 EST 2003


Hi Skip,

Well, that's what I am doing now, since I cannot hold my breath until my
suggestion gets implemented :). But in my case, it forces me to duplicate
each alternative of the big regexp in my normalisation function, which
makes the whole piece of code quite tedious to maintain. It would feel
much more natural just to say to the RE engine "if you match
B(?:D|LD|VD|OUL(?:EVARD)?) within this big ugly regexp, just return me BD,
please".
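
For illustration, here is a rough sketch of how that behaviour could be
approximated with the stock re module, by giving each alternative a named
group and reading back match.lastgroup. The STREET_TYPES table, its two
entries and the normalise_street_type helper are assumptions made up for
this example; they are not taken from the real regexp.

```python
import re

# Hypothetical table: canonical form -> pattern for its accepted spellings.
# Only two street types are shown; the names and patterns are illustrative.
STREET_TYPES = {
    "BD": r"B(?:D|LD|VD|OUL(?:EVARD)?)",
    "AV": r"AV(?:E|ENUE)?",
}

# One named group per canonical form, so the name of the group that matched
# is the string to "return". Inner groups stay non-capturing on purpose,
# otherwise lastgroup would no longer point at the outer named group alone.
STREET_TYPE_RE = re.compile(
    r"\b(?:"
    + "|".join(f"(?P<{name}>{pattern})" for name, pattern in STREET_TYPES.items())
    + r")\b"
)


def normalise_street_type(text):
    """Replace every known street type spelling by its canonical form."""
    return STREET_TYPE_RE.sub(lambda match: match.lastgroup, text)


print(normalise_street_type("14 BOULEVARD HAUSSMANN"))
```

Each alternative is written only once, but this is weaker than the proposed
extension: a group name has to be a valid identifier, so it cannot carry an
arbitrary replacement string without one more lookup table.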

Anyway, I think I'm going to try using sre.Scanner; we'll see if it's stable
enough for that. I'll build 3 scanners that I'll call in sequence (each one
working on the part of the string that the previous one did not scan, which
is handily returned as the second item of the 'scan' method's result):

- one for the number (or numbers) within the street : "14", or numbers like
"14-16" or "14/16" or whatever separator the person entering the address
could imagine.

- one for the number extension : "B" or "BIS", "T" or "TER" or "TRE"
(misspelled, but that's the way some people write it...)

- one for the street/place type: most of the tricky regexps are there, and
most of the rewriting will be performed by actions defined in the Scanner's
lexicon

- and the rest of the string is the street/place name.

This way the address will be processed in one pass without code duplication.
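
A minimal sketch of that pipeline, assuming the Scanner class as it is
exposed today as re.Scanner (it is undocumented, so its behaviour may
change). The three lexicons are deliberately tiny and hypothetical, and
parse_address is a name invented for this example. A lexicon action can be
a callable taking (scanner, token), a constant that is emitted as-is, or
None to drop the token.

```python
import re

# Scanner 1: street numbers, with "-" or "/" accepted as separators.
number_scanner = re.Scanner([
    (r"\d+", lambda scanner, token: token),  # keep the digits as they are
    (r"\s*[-/]\s*", None),                   # drop separators
    (r"\s+", None),                          # drop whitespace
])

# Scanner 2: number extensions; a constant action does the rewriting.
extension_scanner = re.Scanner([
    (r"B(?:IS)?\b", "BIS"),
    (r"T(?:ER|RE)?\b", "TER"),               # "TRE" is a common misspelling
    (r"\s+", None),
])

# Scanner 3: street/place types (only two hypothetical entries shown).
type_scanner = re.Scanner([
    (r"B(?:D|LD|VD|OUL(?:EVARD)?)\b", "BD"),
    (r"AV(?:E|ENUE)?\b", "AV"),
    (r"\s+", None),
])


def parse_address(address):
    """Run the three scanners in sequence; each gets the previous remainder."""
    numbers, rest = number_scanner.scan(address.upper().strip())
    extensions, rest = extension_scanner.scan(rest)
    types, rest = type_scanner.scan(rest)
    # Whatever is left over is taken to be the street/place name.
    return numbers, extensions, types, rest


print(parse_address("14-16 bis boulevard Haussmann"))
```

Note that scan() keeps consuming tokens until nothing in the lexicon
matches, so a street name that itself starts with a known extension or
type would be swallowed too; a real version would need to guard against
that.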

But still, this (?PR<...>...) notation would be handy. I had a look at the
sre source code, in hope that I would be able to implement it myself, but
it's a bit too much for me to handle right now ;).

Regards,

Nicolas

"Skip Montanaro" <skip at pobox.com> wrote in message
news:mailman.931.1069347639.702.python-list at python.org...
>     Nicolas> re_adresse = re.compile(r'''
>     ... [big, ugly re snipped] ...
>     Nicolas> ''',re.X)
>
>     Nicolas> Note for example the many abbreviations (correct or not) of
>     Nicolas> "boulevard" : BD, BLD, BVD, BOUL, BOULEVARD. For normalisation
>     Nicolas> purposes, I need to transform all those forms into the only
>     Nicolas> correct abbreviation, BD.
>
>     Nicolas> What would be really, really neat, would be a regular
>     Nicolas> expression extension notation that would make the RE engine
>     Nicolas> return an arbitrary string when a substring is matched.
>
> Why not just use named groups, then pass the match's groupdict() result
> through a normalization function?  Here's a trivial example which
> "normalizes" some matches by replacing them with the matched strings'
> lengths.
>
>     >>> import re
>     >>> pat = re.compile('(?P<a>a+)(?P<b>b+)')
>     >>> mat = pat.match("aaaaaaaabbb")
>     >>> def norm(d):
>     ...   d['a'] = len(d['a'])
>     ...   d['b'] = len(d['b'])
>     ...
>     >>> d = mat.groupdict()
>     >>> d
>     {'a': 'aaaaaaaa', 'b': 'bbb'}
>     >>> norm(d)
>     >>> d
>     {'a': 8, 'b': 3}
>
> Skip
>






