How to get the "longest possible" match with Python's RE module?
Frederic Rentsch
anthra.norell at vtxmail.ch
Tue Sep 12 12:27:41 EDT 2006
Licheng Fang wrote:
> Basically, the problem is this:
>
>
> >>> p = re.compile("do|dolittle")
> >>> p.match("dolittle").group()
> 'do'
>
> Python's NFA regexp engine tries only the first option, and happily
> rests on that. There's another example:
>
>
> >>> p = re.compile("one(self)?(selfsufficient)?")
> >>> p.match("oneselfsufficient").group()
> 'oneself'
>
> The Python regular expression engine doesn't exhaust all the
> possibilities, but in my application I hope to get the longest possible
> match, starting from a given point.
>
> Is there a way to do this in Python?
>
>
Licheng,
If you need regexes, why not just reverse-sort your expressions? This
seems a lot easier and faster than writing another regex compiler.
Reverse-sorting places each longer target ahead of any shorter target
that is a prefix of it, so the alternation tries the longer one first.
>>> import re
>>> targets = ['be', 'bee', 'been', 'being']
>>> targets.sort ()
>>> targets.reverse ()
>>> regex = '|'.join (targets)
>>> text = "Having been a bee in a former life, I don't mind being what I am and wouldn't want to be a bee ever again."
>>> re.findall (regex, text)
['been', 'bee', 'being', 'be', 'bee']
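A minimal sketch of the same idea in a slightly more defensive form, in case it helps. Sorting by length instead of alphabetically also puts every target ahead of its own prefixes, and re.escape guards against targets that contain regex metacharacters. For patterns that are not plain literals (like the "one(self)?(selfsufficient)?" case), one option is to split the expression into alternatives by hand, match each one separately and keep the longest result. The helper names longest_first_pattern and longest_match are made up for this illustration; they are not part of the re module.

```python
import re

def longest_first_pattern(words):
    """Compile an alternation of literal words, longest word first."""
    # Longest-first ordering means a word always precedes its own prefixes.
    ordered = sorted(words, key=len, reverse=True)
    return re.compile('|'.join(re.escape(w) for w in ordered))

def longest_match(patterns, text, pos=0):
    """Match each pattern separately at pos and return the longest match.

    Returns None if no pattern matches at that position.
    """
    best = None
    for pattern in patterns:
        m = re.compile(pattern).match(text, pos)
        if m is not None and (best is None or m.end() > best.end()):
            best = m
    return best

p = longest_first_pattern(['do', 'dolittle'])
print(p.match('dolittle').group())

# The second example, split by hand into two alternatives (an assumption
# about how the original expression could be decomposed).
m = longest_match(['one(self)?', 'one(selfsufficient)?'], 'oneselfsufficient')
print(m.group())
```

The second helper costs one match attempt per pattern, which is usually acceptable for a handful of alternatives.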
You might also take a look at a stream editor I recently came out with:
http://cheeseshop.python.org/pypi/SE/2.2%20beta
It has been well received, especially by newbies, I believe because it
is so simple to use and allows very compact coding.
>>> import SE
>>> Bee_Editor = SE.SE ('be=BE bee=BEE been=BEEN being=BEING')
>>> Bee_Editor ('Having been a bee in a former life, I don\'t mind being what I am and wouldn\'t want to be a bee ever again.')
"Having BEEN a BEE in a former life, I don't mind BEING what I am and wouldn't want to BE a BEE ever again."
Because SE works by precedence on length, the targets can be defined in any order and modular theme sets can be spliced freely to form supersets.
>>> SE.SE ('<EAT> be==, bee==, been==, being==,')(above_string)
'been,bee,being,be,bee,'
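For comparison, substitution with precedence on length can be imitated with the standard re module alone. This is only a rough sketch of the idea, not how SE is implemented, and make_replacer is a hypothetical helper name chosen for the example.

```python
import re

def make_replacer(table):
    """Return a function that replaces each key of table with its value.

    Longer keys take precedence over shorter ones, so the keys can be
    given in any order.
    """
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))

    def replace(text):
        # Look up the replacement for whichever key matched.
        return pattern.sub(lambda m: table[m.group()], text)

    return replace

bee_editor = make_replacer({'be': 'BE', 'bee': 'BEE',
                            'been': 'BEEN', 'being': 'BEING'})
text = ("Having been a bee in a former life, I don't mind being what I am "
        "and wouldn't want to be a bee ever again.")
print(bee_editor(text))
```

Because the keys are sorted inside the helper, two tables can be merged into one dictionary before calling make_replacer without worrying about their order.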
You can do extraction filters, deletion filters, and substitutions in any combination. It does multiple passes, and it can take files as input instead of strings and write its output to files.
>>> Key_Word_Translator = SE.SE ('''
"*INT=int substitute"
"*DECIMAL=decimal substitute"
"*FACTION=faction substitute"
"*NUMERALS=numerals substitute"
# ... etc.
''')
I don't know if that could serve.
Regards
Frederic