How to get the "longest possible" match with Python's RE module?

Frederic Rentsch anthra.norell at vtxmail.ch
Tue Sep 12 12:27:41 EDT 2006


Licheng Fang wrote:
> Basically, the problem is this:
>
> >>> p = re.compile("do|dolittle")
> >>> p.match("dolittle").group()
> 'do'
>
> Python's NFA regexp engine tries only the first option, and happily
> rests on that. There's another example:
>
> >>> p = re.compile("one(self)?(selfsufficient)?")
> >>> p.match("oneselfsufficient").group()
> 'oneself'
>
> The Python regular expression engine doesn't exhaust all the
> possibilities, but in my application I hope to get the longest possible
> match, starting from a given point.
>
> Is there a way to do this in Python?
>
>   
Licheng,

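   If you need regexes, why not just reverse-sort your expressions? This 
seems a lot easier and faster than writing another regex compiler. 
Reverse-sorting places each longer expression ahead of any shorter one 
that is a prefix of it, so the alternation tries the longer one first. 
(This helps when the alternatives are plain strings, as below.)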

 >>> import re
 >>> targets = ['be', 'bee', 'been', 'being']
 >>> targets.sort(reverse=True)
 >>> regex = '|'.join(targets)
 >>> text = ("Having been a bee in a former life, I don't mind being "
 ...         "what I am and wouldn't want to be a bee ever again.")
 >>> re.findall(regex, text)
 ['been', 'bee', 'being', 'be', 'bee']
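
A slightly more defensive variant of the same idea, as a rough sketch: sort 
by length explicitly instead of relying on lexicographic order, and escape 
each target so that regex metacharacters are matched literally. The helper 
name longest_first_pattern is made up for this illustration; it is not part 
of the re module. It only covers alternations of literal strings, not 
patterns with optional groups like the second example in the question.

```python
import re

def longest_first_pattern(targets):
    """Compile an alternation that tries longer literal targets first.

    Hypothetical helper for illustration; assumes every target is a
    plain string, not a regular expression.
    """
    # Longest first (ties broken alphabetically), so a target can never
    # lose to one of its own prefixes.
    ordered = sorted(set(targets), key=lambda t: (-len(t), t))
    # re.escape makes each target match literally.
    return re.compile('|'.join(re.escape(t) for t in ordered))

p = longest_first_pattern(['do', 'dolittle'])
print(p.match('dolittle').group())
```

With the targets from the original question this yields 'dolittle' rather 
than 'do', whatever order the targets were listed in.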

You might also take a look at a stream editor I recently came out with: 
http://cheeseshop.python.org/pypi/SE/2.2%20beta

It has been well received, especially by newbies, I believe because it 
is so simple to use and allows very compact coding.

 >>> import SE
 >>> Bee_Editor = SE.SE ('be=BE bee=BEE  been=BEEN being=BEING')
 >>> Bee_Editor ('Having been a bee in a former life, I don\'t mind being what I am and wouldn\'t want to be a bee ever again.')
"Having BEEN a BEE in a former life, I don't mind BEING what I am and wouldn't want to BE a BEE ever again."

Because SE works by precedence on length, the targets can be defined in any order and modular theme sets can be spliced freely to form supersets.
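
For comparison, length precedence of this kind can be approximated with the 
standard re module alone. The following is only a sketch of the idea, not 
how SE itself is implemented; make_replacer and bee_editor are names 
invented for the example.

```python
import re

def make_replacer(mapping):
    """Return a function replacing each key of `mapping` by its value.

    Hypothetical helper for illustration; keys are treated as literal
    strings and longer keys take precedence over shorter ones.
    """
    # Longer keys first, so 'been' wins over 'bee' and 'be'.
    keys = sorted(mapping, key=lambda k: (-len(k), k))
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    # Look up the replacement for whichever key actually matched.
    return lambda text: pattern.sub(lambda m: mapping[m.group()], text)

bee_editor = make_replacer(
    {'be': 'BE', 'bee': 'BEE', 'been': 'BEEN', 'being': 'BEING'})
print(bee_editor("I don't mind being what I am."))
```

Because the keys are sorted inside the helper, the mapping can be written 
in any order, and two mappings can be merged before building the replacer.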


>>> SE.SE ('<EAT> be==, bee==,  been==, being==,')(above_string)
'been,bee,being,be,bee,'

You can do extraction filters, deletion filters, and substitutions in any combination. It does multiple passes and can take files as input instead of strings, and can write its output to files.

>>> Key_Word_Translator = SE.SE ('''
   "*INT=int substitute"
   "*DECIMAL=decimal substitute"
   "*FACTION=faction substitute"
   "*NUMERALS=numerals substitute"
   # ... etc.
''')

I don't know if that could serve.

Regards

Frederic




